Many a time I’ve stared at Explored Flickr photos and tried to grok their artistic nuances. My lack of artistic sensibility at times keeps me from understanding the photography techniques the photographer used or the properties they intended to capture. But the Flickr community is brimming with experts who often chime in with what they like or see in the comments. My #nlproc hack (for the upcoming Yahoo! Winter Hackday) aims to solve this by summarizing this expert knowledge (wisdom of the crowd) for a photograph.
What You Comment Is What You See (WYCIWYS) is a Flickr hack that harnesses the comments on photos to determine the attributes/properties of a photo that people are talking about. It also gives a sentiment score (+ve) for each attribute to help a user gauge what other users find most interesting about a photo. Following are some outputs from WYCIWYS:

Posted in Hacking, Natural Language Processing, Yahoo! | 1 Comment »
Profanity is often prevalent in user-generated content (like comments). Websites that do not want to display such profane content currently employ masking to get rid of it. Masking replaces the profanity in the content with characters like ####. The masked content, though, still conveys the existence of profanity to the user. Humans have built up a great language model for inferring missing words. Try it yourself – it should be easy for you to guess a bunch of profane words for the following sentence:
What the ####!
My hack (Bleep) for the Yahoo! Spring ’11 Hackday is yet another natural language hack that tries to remove the profanity from a comment without altering the semantics of the content. In brief, removing the profanity word from the content makes the parse tree less probable. The algorithm tries to alter this improbable parse tree to find the best local parse tree.
Following are some corrections suggested by Bleep:





Posted in Abuse, Hacking, Natural Language Processing, Research, Yahoo! | No Comments »
Apart from the content, there are various features from metadata (like IP etc.) which can help tell a spammer and a regular user apart. Following are the results of some data analysis (done on roughly 8000+ comments) that point to another feature which proves to be a good discriminator. Hopefully this will aid others fighting spam/abuse (if they are not already using a similar feature).

The discriminator referred to above is typing speed. The graph above plots the content length of a comment posted by a user against the (approximate) time they took to write it. If a user posts more than one comment in a window of 5-10 minutes, we can consider those comments as consecutive posts. Any comment falling outside this window is ignored. In the above plot, the time is inferred as [time_of_current_post - time_of_previous_post] for consecutive posts. Thus typing speed is estimated as (content_length)/(time_delta_between_two_posts). This is a non-intrusive way of measuring typing speed from the logs, though for a more accurate number one could always instrument the page with JavaScript (see example here). The dataset was manually labeled as spam (depicted in red) and ham (depicted in blue).
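As a rough illustration of the estimate described above, here is a minimal sketch. The record layout `(user_id, post_time, comment_text)` and the 10-minute cutoff are assumptions made for the example, not the format of the actual logs.

```python
from datetime import datetime, timedelta

# Assumed upper end of the 5-10 minute window mentioned in the post.
WINDOW = timedelta(minutes=10)


def estimate_typing_speeds(posts, window=WINDOW):
    """Return (user, content_length, seconds, chars_per_sec) for consecutive posts.

    `posts` is an iterable of hypothetical (user_id, post_time, comment_text) records.
    """
    results = []
    last_seen = {}  # user_id -> time of that user's previous post
    for user, when, text in sorted(posts, key=lambda p: p[1]):
        previous = last_seen.get(user)
        last_seen[user] = when
        if previous is None:
            continue  # first post by this user: no time delta available
        gap = when - previous
        if gap > window or gap.total_seconds() <= 0:
            continue  # outside the window (or identical timestamps): ignored
        seconds = gap.total_seconds()
        results.append((user, len(text), seconds, len(text) / seconds))
    return results


sample_posts = [
    ("a", datetime(2011, 1, 1, 12, 0, 0), "first comment"),
    ("a", datetime(2011, 1, 1, 12, 0, 40), "x" * 100),
    ("a", datetime(2011, 1, 1, 13, 0, 0), "an hour later, so ignored"),
]
speeds = estimate_typing_speeds(sample_posts)
```

Only the second post yields an estimate here: the first has no predecessor and the third falls outside the window.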
Wikipedia lists the average words per minute (wpm) for a regular internet user at around 30 wpm. With a conversion factor of 5 to characters per minute (cpm), this amounts to ~2.5 characters per second. The green line in the plot depicts a projection of the content length a user could have typed in the given time with an average typing speed (of ~2.5 chars per sec). We observe that this line clearly separates out most spam from ham. The ham posts that fall above this line are usually trolls (as observed).
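The arithmetic behind the green line can be sketched as follows. The function names are made up for illustration, and the 30 wpm and 5 characters-per-word figures are the ones quoted above.

```python
WORDS_PER_MINUTE = 30  # average typing speed quoted above
CHARS_PER_WORD = 5     # conversion factor quoted above
CHARS_PER_SECOND = WORDS_PER_MINUTE * CHARS_PER_WORD / 60  # 150 cpm = 2.5 cps


def expected_length(seconds, cps=CHARS_PER_SECOND):
    """Content length an average typist could produce in `seconds`."""
    return cps * seconds


def above_line(content_length, seconds, cps=CHARS_PER_SECOND):
    """True when a comment is longer than an average typist could have typed."""
    return content_length > expected_length(seconds, cps)
```

For example, a 600-character comment posted 30 seconds after the previous one lies above the line (an average typist would manage about 75 characters in that time), while a 60-character comment posted after 60 seconds lies below it.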
This turns out to be a nice feature to tell spammers (bots and non-bots), trolls, and regular users apart. Bots often fail the Turing test and don’t try hard enough to be more human-like. Non-bot spammers, on the other hand, have to take the pains to type their spam comment repeatedly and usually end up pasting it. Try out the example here.
So spammers, fix yourselves, ’cause we have the speed gun to pull you over.
Posted in Abuse, Research | 2 Comments »
Though I’ve only recently started tackling it (spam), what I hear from veterans is that spam is a hard problem. It is so not because it’s difficult to model (unlike some sub-domains in NLP) but because essentially it is a battle of human-vs-human. The opponent is a constantly evolving machine. They learn and they learn fast. This keeps those fighting spam on their toes, as you need to react to each new technique spammers learn to get past the filters. Most of the work involved is thus done on a reactive basis. Basically you keep iterating the following cycle: deploy -> observe -> learn -> model -> deploy
Now let’s consider a sample spam text: “Find sexy girls and guys at xyz.com”. The simplest classifier (let’s assume a Bayesian text classifier) will start to crumble once the spammer changes the text to “fin d sex y girl s an d guy s a t xyz.co m”. So you will label and retrain your classifier to catch this new trick.
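To see why a token-based classifier struggles here, consider this small sketch. It assumes the classifier's features are lowercased, whitespace-separated tokens, which is a simplification chosen for the example.

```python
def tokens(text):
    """Lowercased whitespace tokens, a simple stand-in for classifier features."""
    return set(text.lower().split())


original = "Find sexy girls and guys at xyz.com"
obfuscated = "fin d sex y girl s an d guy s a t xyz.co m"

# Tokens the classifier learned from the original that survive the obfuscation.
shared = tokens(original) & tokens(obfuscated)
```

Here `shared` comes out empty: none of the tokens the classifier saw in the original text appear in the obfuscated one, so everything it learned about those words no longer applies.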
To get out of this vicious reactive cycle, you need to test your model proactively against the possible techniques a spammer could come up with to get away. This is where YODA (acronym for Overly Determined Abuser) comes in: a genetic-programming-based model of a spammer I built (yes, we have 20% time as well) to break our spam detection models. Like any other genetic algorithm framework, it needs implementations of fitness functions and genome functions. The idea is to model the characteristics of a spammer (variables that a spammer can manipulate) as genome functions. The genome functions represent the smallest changes that can be made to the text, for instance changing the case of characters, modifying sentence delimiters, or modifying word delimiters. The genome functions need not be just text modification functions but could also represent other attributes of a spammer (like IP etc.). The fitness functions represent the criteria the spammer is trying to optimize, i.e. getting past the filters with minimal distortion of the spam text. This could be the edit distance combined with the score returned by the model/filter.
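A minimal sketch of what such genome and fitness functions might look like is below. This is not YODA's code: the function names, the blacklist-based `toy_filter_score` standing in for a real spam filter, and the `distortion_weight` value are all assumptions made for illustration.

```python
import random


def swap_case(text, rng):
    """Genome function: flip the case of one randomly chosen character."""
    if not text:
        return text
    i = rng.randrange(len(text))
    return text[:i] + text[i].swapcase() + text[i + 1:]


def change_word_delimiter(text, rng):
    """Genome function: replace one space with another delimiter."""
    positions = [i for i, ch in enumerate(text) if ch == " "]
    if not positions:
        return text
    i = rng.choice(positions)
    return text[:i] + rng.choice("._-") + text[i + 1:]


def edit_distance(a, b):
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def toy_filter_score(text):
    """Hypothetical filter: fraction of blacklisted tokens present (0 to 1)."""
    blacklist = {"sexy", "girls", "guys", "xyz.com"}
    return len(blacklist & set(text.lower().split())) / len(blacklist)


def fitness(candidate, original, distortion_weight=0.02):
    """Higher is better: a low filter score with few edits to the original."""
    distortion = edit_distance(candidate, original)
    return -toy_filter_score(candidate) - distortion_weight * distortion
```

With this toy filter, `fitness("Find sex y girl s and guy s at xyz.co m", original)` beats `fitness(original, original)`: four inserted spaces cost little in distortion but take the filter score from 1 to 0.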
Once the fitness function and many such genome functions have been defined, you can set these spam bots free to undergo selection, crossover and mutation. In the end (when you decide to stop the evolution), you will end up with bots that are far more complex than just the basic genome functions defined. The transformations to the original text might be beyond what you could have thought of testing against.
Following are some results of this model on the same spam text, using the above-mentioned basic genome functions:
- F.ind.s.exy g.irls.a.nd.g.uys.a.t.x.yz.com
- f iñd s exy gi rls ã ñd g úys a t xy z.çom
- FI ND sE Xy Gir Ls anD gu yS AT XyZ. COM
- Find_sexy girls and_guys at_xyz.com
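For readers who want to experiment, the selection, crossover and mutation loop described above can be sketched in a few dozen lines. This is a toy under stated assumptions, not YODA itself: the three genes, the case-sensitive blacklist filter, the use of genome length as a proxy for distortion, and all numeric parameters are invented for the example.

```python
import random

BLACKLIST = {"sexy", "girls", "guys", "xyz.com"}  # toy filter vocabulary (assumed)
ORIGINAL = "Find sexy girls and guys at xyz.com"
MAX_GENES = 20  # cap on genome length, to keep the toy run small


def split_word(text, rng):
    """Insert a space at a random position."""
    i = rng.randrange(1, len(text))
    return text[:i] + " " + text[i:]


def swap_case(text, rng):
    """Flip the case of one random character."""
    i = rng.randrange(len(text))
    return text[:i] + text[i].swapcase() + text[i + 1:]


def dot_delimiter(text, rng):
    """Replace one random space with a dot."""
    spaces = [i for i, ch in enumerate(text) if ch == " "]
    if not spaces:
        return text
    i = rng.choice(spaces)
    return text[:i] + "." + text[i + 1:]


GENES = [split_word, swap_case, dot_delimiter]


def express(genome, text, seed=0):
    """Apply each gene in order; the fixed seed makes a genome deterministic."""
    rng = random.Random(seed)
    for gene in genome:
        text = gene(text, rng)
    return text


def fitness(genome, text):
    """Fewer blacklist hits is better; each gene costs a little distortion."""
    hits = len(BLACKLIST & set(express(genome, text).split()))
    return -hits - 0.05 * len(genome)


def evolve(text, generations=40, pop_size=30, seed=1):
    rng = random.Random(seed)
    population = [[rng.choice(GENES)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda g: fitness(g, text), reverse=True)
        survivors = population[: pop_size // 2]  # selection (keeps the best)
        children = []
        while len(survivors) + len(children) < pop_size:
            mum, dad = rng.sample(survivors, 2)
            child = mum[: rng.randint(0, len(mum))] + dad[rng.randint(0, len(dad)):]  # crossover
            if rng.random() < 0.5 or not child:
                child = child + [rng.choice(GENES)]  # mutation
            children.append(child[:MAX_GENES])
        population = survivors + children
    best = max(population, key=lambda g: fitness(g, text))
    return express(best, text), fitness(best, text)


if __name__ == "__main__":
    print(evolve(ORIGINAL))
```

Because the survivors are carried over unchanged, the best fitness never decreases from one generation to the next. Swapping in a real filter score and an edit-distance penalty for `fitness` would bring the toy closer to the setup described in the post.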
Posted in Research, Yahoo! | No Comments »