a speed gun for spam

| February 24th, 2011

Apart from the content itself, various metadata features (such as IP address) can help tell a spammer and a regular user apart. Below are the results of some data analysis (done on a little over 8,000 comments) that point to another feature which proves to be a good discriminator. Hopefully this will help others fighting spam/abuse (if they are not already using a similar feature).

The discriminator referred to above is typing speed. The graph above plots the content length of a comment against the (approximate) time the user took to write it. If a user posts more than one comment within a window of 5-10 minutes, we treat those comments as consecutive posts; any comment falling outside this window is ignored. The time is inferred as [time_of_current_post - time_of_previous_post] for consecutive posts, so typing speed is estimated as (content_length)/(time_delta_between_two_posts). This is a non-intrusive way of measuring typing speed from the logs, though for a more accurate number one could always instrument the page with JavaScript (see example here). The dataset was manually labeled as spam (depicted in red) and ham (depicted in blue).
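The log-based estimate described above can be sketched roughly as follows. This is only an illustration, not the code used for the analysis: the `(user, timestamp, text)` tuple layout, the function name `typing_speeds`, and the choice of 10 minutes as the window are all assumptions made for the example.

```python
from collections import defaultdict

# Assumption: use the upper end of the 5-10 minute window, in seconds.
WINDOW_SECONDS = 10 * 60


def typing_speeds(comments, window=WINDOW_SECONDS):
    """Estimate typing speed for consecutive posts by the same user.

    `comments` is an iterable of (user, unix_timestamp, text) tuples
    (a hypothetical log format). Returns a list of
    (user, content_length, time_delta, chars_per_second) tuples.
    """
    by_user = defaultdict(list)
    for user, ts, text in comments:
        by_user[user].append((ts, text))

    rows = []
    for user, posts in by_user.items():
        posts.sort(key=lambda p: p[0])  # order each user's posts by time
        for (prev_ts, _), (ts, text) in zip(posts, posts[1:]):
            delta = ts - prev_ts
            # Ignore posts outside the window; skip zero deltas to
            # avoid dividing by zero.
            if 0 < delta <= window:
                rows.append((user, len(text), delta, len(text) / delta))
    return rows
```

The first post of every user is dropped, since there is no previous post to measure against.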

Wikipedia lists the average typing speed for a regular internet user at around 30 words per minute (wpm). With a conversion factor of 5 characters per word, this comes to 150 characters per minute (cpm), or ~2.5 characters per second. The green line in the plot depicts a projection of the content length a user could have typed in the given time at this average typing speed (~2.5 chars per sec). We observe that this line clearly separates most spam from ham. The ham posts that fall above this line are usually trolls (as observed).
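As a small worked version of the arithmetic above, the snippet below computes the ~2.5 chars per sec rate and checks whether a post lies above the projected line. The names `expected_length` and `is_above_line` are made up for this sketch, and treating "above the line" as a hard cut-off is a simplification of what the plot shows.

```python
AVG_WPM = 30         # average typing speed quoted above
CHARS_PER_WORD = 5   # conversion factor quoted above
AVG_CPS = AVG_WPM * CHARS_PER_WORD / 60.0  # 150 cpm = 2.5 chars per second


def expected_length(time_delta_seconds, cps=AVG_CPS):
    """Content length an average typist could produce in the given time."""
    return cps * time_delta_seconds


def is_above_line(content_length, time_delta_seconds, cps=AVG_CPS):
    """True if the post is longer than an average typist could have typed."""
    return content_length > expected_length(time_delta_seconds, cps)
```

For instance, a 600-character comment posted 60 seconds after the previous one lies above the line (an average typist would manage about 150 characters in that time), while a 100-character comment does not.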

This turns out to be a nice feature to tell spammers (bots and non-bots), trolls, and regular users apart. Bots often fail the Turing test and don’t try hard enough to be more human-like. Non-bot spammers, on the other hand, would have to take the pains to type their spam comment repeatedly, and usually end up pasting it. Try out the example here.

So spammers, fix yourselves, ’cause we have the speed gun to pull you over.

I came across this interesting pattern while trying to visualize some of the Twitter streaming data. The following charts plot the ‘following’ counts vs the ‘followers’ counts (for ~200K user accounts). The data represents one hour’s worth of data obtained via the streaming API. User accounts falling around the line y ~= 0 generally tend to be celebrities (musicians, sportsmen etc.), companies, and news and info bots (like the WSJ, CNN etc.). The general population usually falls around the line y = x (the ‘I follow you, you follow me’ kind).

But that’s not what’s interesting here (we all knew that). Looking at the zoomed-in plots (figure 2 and figure 5), we see a distinct square with corners at (0,0) and (2000,2000). This is also observed in another day’s data (figure 5), so it’s not just an anomaly. The plateau formed at y=2000 is a bit perplexing; I can’t seem to get my head around it. Figure 3 looks at the user accounts with ~2000 ‘following’: a large number of these users turn out to be spam bots. I suspect most spam accounts (bots) are concentrated around this region. It’s as if the spam bots follow around 2000 users at most, so as not to alert the spam controls by mass-following users.
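A rough sketch of how the accounts on the ~2000 ‘following’ plateau could be pulled out for inspection. The `(screen_name, following, followers)` tuple layout, the function name, and the tolerance of 25 are all assumptions for illustration; nothing here comes from the Twitter API itself.

```python
def near_following_plateau(accounts, center=2000, tolerance=25):
    """Return accounts whose 'following' count is within `tolerance` of `center`.

    `accounts` is an iterable of (screen_name, following, followers)
    tuples (a hypothetical layout for the collected data).
    """
    return [acct for acct in accounts if abs(acct[1] - center) <= tolerance]
```

The resulting list is what one would then label by hand as spam or not, as was done for figure 3.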

Any hypothesis that comes to your mind?

Figure 1: plot for day 1

Figure 2: plot for day 1 (zoomed)

Figure 3: plot for day 1 with y ~ 2000

Figure 4: plot for day 2

Figure 5: plot for day 2 (zoomed)

Web Content Extraction Dataset

| August 22nd, 2009

For a recent project, we (sudheer_624 and I) had to develop algorithms to extract the true content from any given web page. By true content I mean the text excluding ads, navigational links/text, etc., and even excluding comments (if any). Thus, given a blog post, we are interested in extracting just the content of the post and not the comments or other surrounding text. We did not come across any dataset for this task that would let us evaluate our algorithms, so we recently generated our own and would like to share it with anyone tackling a similar problem.

The dataset contains the HTML source and text content (true content) for ~4000 webpages. One metric for measuring your algorithm against this dataset could be the edit distance between its output and the true content. If you do use this dataset, it would be great if you could share your results as benchmarks to compare against. I’ll be updating this post with the accuracy of our algorithm soon enough.
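For anyone wanting a starting point, here is a minimal sketch of an edit-distance score. It uses plain Levenshtein distance over whitespace-separated tokens and normalizes by the longer sequence. Both of those choices, and the function names, are assumptions for this example rather than a prescribed evaluation protocol.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter sequence in the inner loop
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_score(extracted, truth):
    """Score in [0, 1]; 1.0 means the extracted text matches token for token."""
    a, b = extracted.split(), truth.split()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest
```

Working on tokens rather than characters keeps the quadratic cost manageable on full pages. A character-level comparison works with the same `edit_distance` function if strings are passed in directly.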

Download the dataset here (gzipped)