<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>semanticvoid &#187; Search</title>
	<atom:link href="http://semanticvoid.com/blog/index.php/category/search/feed/" rel="self" type="application/rss+xml" />
	<link>http://semanticvoid.com/blog</link>
	<description>extracting the semantics from the void</description>
	<lastBuildDate>Thu, 22 Sep 2011 21:05:48 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.4</generator>
		<item>
		<title>`Fact`orize Your Search</title>
		<link>http://semanticvoid.com/blog/2009/08/14/factorize-your-search/</link>
		<comments>http://semanticvoid.com/blog/2009/08/14/factorize-your-search/#comments</comments>
		<pubDate>Fri, 14 Aug 2009 07:37:08 +0000</pubDate>
		<dc:creator>Anand Kishore</dc:creator>
				<category><![CDATA[Hacking]]></category>
		<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Yahoo!]]></category>

		<guid isPermaLink="false">http://semanticvoid.com/blog/?p=308</guid>
		<description><![CDATA[Dygest and a hackday later, @sudheer_624 and I (@semanticvoid) are back with &#8216;dfacto&#8217;, codename for our latest search hack for Yahoo! Hackday Summer 2009. I think that search is undergoing a paradigm shift &#8211; it&#8217;s no longer about who presents the best ten blue links but about presenting the answers upfront. Dfacto (pronounced [...]]]></description>
			<content:encoded><![CDATA[<p><strong><a href="http://semanticvoid.com/blog/2009/03/19/dygest-your-search/">Dygest</a></strong> and a hackday later, <a href="http://twitter.com/sudheer_624">@sudheer_624</a> and I (<a href="http://twitter.com/semanticvoid">@semanticvoid</a>) are back with <strong>&#8216;dfacto&#8217;</strong>, codename for our latest search hack for Yahoo! Hackday Summer 2009.</p>
<p>I think that search is undergoing a paradigm shift &#8211; it&#8217;s no longer about who presents the best ten blue links but about presenting the answers upfront. <strong>Dfacto</strong> (pronounced &#8216;<em>de facto</em>&#8217;, Latin for &#8216;<em>by [the] fact</em>&#8217;) is aimed at addressing this shift. A large percentage (nearly 68%) of queries are informational queries &#8211; ones where the searcher knows what she&#8217;d like to do or find but does not know how this can be achieved. <strong>Dfacto</strong> is aimed primarily at this class of queries: it presents the searcher with a set of facts associated with the query/topic. It uses natural language algorithms to get the facts that are most &#8220;semantically&#8221; related to the query. In lay terms, it literally tries to understand your query and the results. I&#8217;ll save the algorithmic details for another post. The few examples below show how it works:</p>
<p><em>Disclaimer: This is a work in progress, so you might notice a few &#8216;facts&#8217; that are irrelevant to the query.</em></p>
<p>Let&#8217;s say the searcher is (losing hair and) looking for causes of hair loss. Normally he/she would need to click through a bunch of links to get an overview of the causes. This hack, on the other hand, makes life a bit easier by presenting the causes upfront (click to enlarge):</p>
<p><center><a href="http://farm3.static.flickr.com/2525/3819295965_c7f9c3a651_o.png">click to enlarge<br /><img src="http://farm3.static.flickr.com/2525/3819295965_d8d3055f49.jpg" alt="'hair loss cause'" /></a><br /></center></p>
<p>Along with the facts, we also list the source from which each one was extracted. The searcher can also select a bunch of facts he/she thinks are relevant and refine the search. This in turn yields a new set of &#8216;web results&#8217; along with new, refined and related &#8216;facts&#8217;.</p>
<p>Another example (one which I particularly like) is a query about &#8216;table manners&#8217;. This lists precisely a set of etiquette rules to follow at the table (click to enlarge).</p>
<p><center><a href="http://farm3.static.flickr.com/2587/3820121342_ac99f01072_o.png"> click to enlarge<br /> <img src="http://farm3.static.flickr.com/2587/3820121342_543ae9bb92.jpg" alt="'table manners'" /></a></center></p>
<p><strong>Dfacto</strong> also serves well as a product research tool. A query for &#8216;iphone 3gs&#8217; yields (click to enlarge):</p>
<p><center><a href="http://farm3.static.flickr.com/2595/3820128618_cfbc2db7d6_o.png"> click to enlarge<br /> <img src="http://farm3.static.flickr.com/2595/3820128618_5fb29f2762.jpg" alt="'iphone 3gs'" /></a></center></p>
<p>On another note, if you have a date in the coming weeks you might be interested in reading the list below (:</p>
<p><center><a href="http://farm3.static.flickr.com/2669/3819328509_59c127b413_o.png"> click to enlarge<br /> <img src="http://farm3.static.flickr.com/2669/3819328509_ba08fe9e02.jpg" alt="'first date tips'" /></a></center></p>
<p>Happy hacking!</p>
]]></content:encoded>
			<wfw:commentRss>http://semanticvoid.com/blog/2009/08/14/factorize-your-search/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Dygest Your Search</title>
		<link>http://semanticvoid.com/blog/2009/03/19/dygest-your-search/</link>
		<comments>http://semanticvoid.com/blog/2009/03/19/dygest-your-search/#comments</comments>
		<pubDate>Fri, 20 Mar 2009 06:56:36 +0000</pubDate>
		<dc:creator>Anand Kishore</dc:creator>
				<category><![CDATA[Hacking]]></category>
		<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[Project]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Web]]></category>
		<category><![CDATA[Yahoo!]]></category>
		<category><![CDATA[summarization]]></category>

		<guid isPermaLink="false">http://semanticvoid.com/blog/?p=256</guid>
		<description><![CDATA[Update: This hack won the coveted &#8216;Search&#8217; category award. For the last couple of days, @sudheer_624 and I have been busy working on this hack for a Yahoo! Hackday. Although still a prototype, the hack has turned out to be interesting, so we thought of putting it out for others to play around with. Dygest [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Update:</strong> This hack won the coveted &#8216;Search&#8217; category award.</p>
<p>For the last couple of days, <a href="http://twitter.com/sudheer_624">@sudheer_624</a> and I have been busy working on this hack for a Yahoo! Hackday. Although still a prototype, the hack has turned out to be interesting, so we thought of putting it out for others to play around with.</p>
<p><strong>Dygest</strong> (pronounced as &#8216;digest&#8217; &#8211; thanks to <a href="http://twitter.com/bluesmoon">@bluesmoon</a>) is aimed at changing the conventional way of displaying search context, replacing the snippet with a more informative, machine-generated document summary. There are two kinds of relevance for evaluating search results:</p>
<ul>
<li>Vertical relevance: determined by the ranking algorithms.</li>
<li>Horizontal relevance: the contextual information made available to the user about the result &#8211; Searchmonkey is a good initiative on this front.</li>
</ul>
<p>
The current way of displaying this context is via a snippet of text under every result. This snippet shows the neighborhood of the occurrences of the query terms. Usually this information is not rich enough for a searcher to make the right judgement about the result. The searcher then has to switch back and forth between the documents and the search results whenever a page turns out not to be relevant, which can be frustrating.</p>
<p>
<strong>Dygest</strong> aims to solve this by either replacing or enhancing the current search snippet with a summary of the result page. At its core lies a summarization engine which figures out what the *real* content of the page is (distinguishing it from the other junk like surrounding text, navigational text, comments etc) and then performs text summarization on this content. The summary of the page is then displayed to the user via the appropriate interface. How cool is that?</p>
<p>
The user no longer needs to click on irrelevant links. He/she can grasp the theme and important facts of a page from right within the results page. The other advantage is that it gives the user a good overview of the query topic &#8211; instead of spending time reading many long documents, he/she can read a few summaries from the top results. This is particularly well suited to mobile devices, where it&#8217;s frustrating to switch back and forth between pages and the search results. It also suits news articles, where we just need the important facts about the story.</p>
<p>
Well, here is an example to convince you. A search for &#8216;Carol Bartz&#8217; yields the following result, which at first glance is not at all informative.</p>
<p><center> <img alt="" border="2" src="http://farm4.static.flickr.com/3456/3369960208_48edc07644_o.png" title="search snippet for Carol Bartz" /> </center></p>
<p>
Enhancing the existing view with an abstract of the page helps gauge the content and theme of the document. This would now look like:</p>
<p><center> <img alt="" src="http://farm4.static.flickr.com/3637/3369975750_f0b313ae61_o.png" title="summarized view" /> </center></p>
<p><strong>Dygest</strong> outputs the following summaries for the query &#8216;<a href="http://datacracy.info/cgi-bin/dygest/search.py?q=iran+site%3Anews.yahoo.com">Iran</a>&#8216; restricted to Yahoo! News:</p>
<p><center><img alt="" src="http://farm4.static.flickr.com/3658/3370011200_a757dc42d8_o.png" title="Query for Iran" /></center></p>
<p>And the following for &#8216;<a href="http://datacracy.info/cgi-bin/dygest/search.py?q=obama+stimulus+plan">Obama stimulus plan</a>&#8217;:</p>
<p><center><img alt="" src="http://farm4.static.flickr.com/3578/3370098322_1a73cd285b_o.png" title="obama stimulus plan"  /></center></p>
<p>Currently, <strong>Dygest</strong> has two interfaces &#8211; (1) a search interface powered by Yahoo! BOSS and (2) a SearchMonkey plugin. It&#8217;s just a prototype, so be kind and don&#8217;t be too judgmental.</p>
<p>Start dygest<em>ing</em> <a href="http://datacracy.info/dygest/">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://semanticvoid.com/blog/2009/03/19/dygest-your-search/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>The Monkey Just Got Delicious &#8211; II</title>
		<link>http://semanticvoid.com/blog/2008/08/11/the-monkey-just-got-delicious-ii/</link>
		<comments>http://semanticvoid.com/blog/2008/08/11/the-monkey-just-got-delicious-ii/#comments</comments>
		<pubDate>Tue, 12 Aug 2008 06:11:35 +0000</pubDate>
		<dc:creator>Anand Kishore</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[Yahoo!]]></category>

		<guid isPermaLink="false">http://semanticvoid.com/blog/2008/08/11/the-monkey-just-got-delicious-ii/</guid>
		<description><![CDATA[[UPDATE] Try the search monkey app here This is a follow-up post of The Monkey Just Got Delicious &#8211; I. The app is not yet public for the reasons mentioned in part I. As I had mentioned, my goal was to generate a tag cloud for the search results. Well, search monkey does not allow [...]]]></description>
			<content:encoded><![CDATA[<p><b>[UPDATE]</b> Try the search monkey app <a href="http://gallery.search.yahoo.com/application?smid=YLs.s">here</a></p>
<p>This is a follow-up to <a href="http://semanticvoid.com/blog/2008/08/11/the-monkey-just-got-delicious/">The Monkey Just Got Delicious &#8211; I</a>. The app is not yet public for the reasons mentioned in <a href="http://semanticvoid.com/blog/2008/08/11/the-monkey-just-got-delicious/">part I</a>. As I had mentioned, my goal was to generate a tag cloud for the search results. However, search monkey does not allow you to spit out arbitrary HTML, which makes it difficult to render a tag cloud. After much thought I settled for a color-coded tag cloud (as in the screenshot below). You will notice the color of the tags gradually fading: a darker shade means that the tag is more popular.</p>
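<p>For illustration only, here is a minimal sketch of how tag popularity could be mapped to a shade of gray. The function name tag_shades, the linear scaling and the choice of #cccccc as the lightest shade are assumptions made for this sketch, not the code behind the app.</p>

```python
def tag_shades(tag_counts, darkest=0x00, lightest=0xCC):
    """Map each tag to a gray hex color; more popular tags get darker shades."""
    top = max(tag_counts.values())
    low = min(tag_counts.values())
    span = (top - low) or 1  # avoid dividing by zero when all counts match
    shades = {}
    for tag, count in tag_counts.items():
        # Scale linearly: the least popular tag is lightest, the most popular darkest.
        level = round(lightest - (count - low) / span * (lightest - darkest))
        shades[tag] = "#{0:02x}{0:02x}{0:02x}".format(level)
    return shades


# Hypothetical bookmark counts for three tags.
print(tag_shades({"search": 120, "yahoo": 60, "hack": 0}))
```

<p>Since only a color value is produced per tag, a scheme like this fits a platform that restricts the markup an app may emit.</p>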
<p>Got feedback, will listen.</p>
<p><center><br />
<table>
<tr>
<td><img src="http://farm4.static.flickr.com/3105/2756185874_3907a8df26_o.png" alt="New Deliciousify" /></td>
<td> <img src="http://farm4.static.flickr.com/3043/2756301394_021e904c67_o.png"/></td>
</tr>
</table>
</center></p>
]]></content:encoded>
			<wfw:commentRss>http://semanticvoid.com/blog/2008/08/11/the-monkey-just-got-delicious-ii/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>The Monkey Just Got Delicious</title>
		<link>http://semanticvoid.com/blog/2008/08/11/the-monkey-just-got-delicious/</link>
		<comments>http://semanticvoid.com/blog/2008/08/11/the-monkey-just-got-delicious/#comments</comments>
		<pubDate>Mon, 11 Aug 2008 09:04:56 +0000</pubDate>
		<dc:creator>Anand Kishore</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[Yahoo!]]></category>

		<guid isPermaLink="false">http://semanticvoid.com/blog/2008/08/11/the-monkey-just-got-delicious/</guid>
		<description><![CDATA[[UPDATE] Try the search monkey app here [UPDATE] New tag cloud UI for deliciousify can be viewed here [UPDATE] The search monkey app is currently disabled for public use as it was hitting the delicious rate limit. Hence it will remain as a prototype for now. BTW the delicious team is working on their own [...]]]></description>
			<content:encoded><![CDATA[<p><strong>[UPDATE]</strong> Try the search monkey app <a href="http://gallery.search.yahoo.com/application?smid=YLs.s">here</a></p>
<p><strong>[UPDATE]</strong> New tag cloud UI for deliciousify can be viewed <a href="http://semanticvoid.com/blog/2008/08/11/the-monkey-just-got-delicious-ii/">here</a></p>
<p><strong>[UPDATE]</strong> The search monkey app is currently disabled for public use as it was hitting the delicious rate limit. Hence it will remain a prototype for now. BTW the delicious team is working on their own search monkey app and I bet it&#8217;s going to be much cooler.</p>
<p><img src="http://developer.search.yahoo.com/images/searchmonkeyLogo147x150.gif" alt="" />I&#8217;m a big fan and an avid user of <a href="http://developer.search.yahoo.com">Yahoo! Search Monkey</a>. So this weekend I decided to write myself the search monkey application that I have always wished for. We can all agree that nothing beats human-created metadata, and what better metadata about search results can there be than the vast and rich knowledge stored in bookmarking services? My search monkey application enriches the organic search results from Yahoo! with metadata from del.icio.us.</p>
<p><center>[LINK DISABLED] <a href="">Try Deliciousify Search Monkey App here</a></center></p>
<p>Sometimes the search summary does not provide useful insight into the contents of a search result (as seen below). The only way users can ascertain relevance is by clicking on the result and figuring it out themselves. Wouldn&#8217;t it be better if the contents of the result could be summarized in just a few words &#8211; keywords that broadly highlight what the document talks about? Deliciousify (as seen below) aims to solve this problem by listing the top del.icio.us tags for a search result, along with its popularity (the number of people who have bookmarked it). In the future, I plan to display a tag cloud for the results. Give it a try and send any comments/feedback my way.</p>
<p><center>[LINK DISABLED] Make your search results more delicious &#8211; <a title="Add the deliciousify Enhanced Result to your Search preferences" href=""> click here </a></center></p>
<p><center><img src="http://farm4.static.flickr.com/3024/2753013032_2393cce7b0_o.png" alt="" /></center></p>
]]></content:encoded>
			<wfw:commentRss>http://semanticvoid.com/blog/2008/08/11/the-monkey-just-got-delicious/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Performance Juice</title>
		<link>http://semanticvoid.com/blog/2008/07/02/performance-juice/</link>
		<comments>http://semanticvoid.com/blog/2008/07/02/performance-juice/#comments</comments>
		<pubDate>Thu, 03 Jul 2008 02:16:01 +0000</pubDate>
		<dc:creator>Anand Kishore</dc:creator>
				<category><![CDATA[Performance]]></category>
		<category><![CDATA[Search]]></category>

		<guid isPermaLink="false">http://semanticvoid.com/blog/2008/07/02/performance-juice/</guid>
		<description><![CDATA[In the company of *exceptional* people at the Exceptional Performance Group, I too have caught the performance bug. I asked myself today &#8211; &#8220;Wouldn&#8217;t it be great to have all the performance blogs consolidated in one place?&#8221; There are a number of performance experts (both frontend and backend) outside of Yahoo!. I do have [...]]]></description>
			<content:encoded><![CDATA[<p>In the company of *exceptional* people at the <a href="http://developer.yahoo.com/performance/">Exceptional Performance Group</a>, I too have caught the performance bug. I asked myself today &#8211; &#8220;Wouldn&#8217;t it be great to have all the performance blogs consolidated in one place?&#8221; There are a number of performance experts (both frontend and backend) outside of Yahoo!. I do have a couple of such blogs in my feed reader and I&#8217;m sure you too have some to share.</p>
<p>I&#8217;m trying to consolidate all such performance-centric blogs at <a href="http://performance.semanticvoid.com">Performance Juice</a>. While I&#8217;m busy scouring the web for great blogs I may have missed, I would love to include any links you might have in Performance Juice. You can contribute by listing blogs in the spreadsheet below. You can edit the spreadsheet <a href="http://spreadsheets.google.com/ccc?key=p6fdU2ybZP59XWq6qfz3wKg">here</a>. Let&#8217;s make Performance Juice a community-driven vertical search for anything performance.</p>
<p>Be sure to quench your performance thirst at <a href="http://performance.semanticvoid.com">Performance Juice</a> or just subscribe to this <a href="http://pipes.yahoo.com/pipes/pipe.run?_id=JLOquzdH3RGK8k1pj9zu1g&#038;_render=rss">feed</a> and stay up to date on the performance karma.</p>
<p>There are many performance-related posts we come across every day which are not necessarily from performance-centric blogs. If you come across any such great posts, just tag them as &#8220;performancejuice&#8221; on del.icio.us and I&#8217;ll make sure they are included.</p>
<p>Any volunteers to manage the search engine? Head <a href="http://www.google.com/coop/manage/cse/volunteer?cx=004770276603218297414%3A_x666hcwuom&#038;continue=http%3A%2F%2Fwww.google.com%2Fcoop%2Fcse%3Fcx%3D004770276603218297414%3A_x666hcwuom&#038;sig=__XNZFUQYya5eAvTycFiBvO32Yi4w=">here</a>.</p>
<p>[UPDATE] The custom search is now also powered by performance bookmarks on del.icio.us.</p>
<p><iframe width='450' height='300' frameborder='0' src='http://spreadsheets.google.com/pub?key=p6fdU2ybZP59XWq6qfz3wKg&#038;output=html&#038;gid=0&#038;single=true&#038;widget=true'></iframe></p>
<p>You can edit the spreadsheet <a href="http://spreadsheets.google.com/ccc?key=p6fdU2ybZP59XWq6qfz3wKg">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://semanticvoid.com/blog/2008/07/02/performance-juice/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Utilizing Google &#8216;Unavailable After&#8217; Tag For Search Engine Ranking</title>
		<link>http://semanticvoid.com/blog/2007/07/24/utilizing-google-unavailable-after-tag-for-search-engine-ranking/</link>
		<comments>http://semanticvoid.com/blog/2007/07/24/utilizing-google-unavailable-after-tag-for-search-engine-ranking/#comments</comments>
		<pubDate>Mon, 23 Jul 2007 20:34:41 +0000</pubDate>
		<dc:creator>Anand Kishore</dc:creator>
				<category><![CDATA[Algorithm]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Search]]></category>

		<guid isPermaLink="false">http://semanticvoid.com/blog/2007/07/24/utilizing-google-unavailable-after-tag-for-search-engine-ranking/</guid>
		<description><![CDATA[There was speculation about Google planning to introduce an &#8216;unavailable after&#8217; meta tag. It would probably look something like this: &lt;META name='unavailable_after' content='Wed, 01 Aug 2007 00:00:01 GMT'&gt; By specifying this tag, webmasters can tell Google not to index a particular page after the specified date OR consider the given page as stale. [...]]]></description>
			<content:encoded><![CDATA[<p>There was <a target="_blank" href="http://searchengineland.com/070712-093059.php">speculation</a> about Google planning to introduce an &#8216;unavailable after&#8217; meta tag. It would probably look something like this:</p>
<p><strong>&lt;META name='unavailable_after' content='Wed, 01 Aug 2007 00:00:01 GMT'&gt;</strong></p>
<p>By specifying this tag, webmasters can tell Google not to index a particular page after the specified date OR consider the given page as stale. This would be appropriate for promotional pages where the promotions expire after a given time. This would help unclog the search engine indexes of irrelevant data.</p>
<p>This valuable piece of information, provided by the &#8216;unavailable_after&#8217; tag, would not only be used to clean up Google&#8217;s index but could also make its way into Google&#8217;s ranking algorithms. There are two perspectives on how a search engine could use this data for ranking:</p>
<ul>
<li>When a page specifies its expiry/unavailability date, it implicitly tells the search engine the period for which it would be most relevant. Hence, as the unavailability date of a page approaches, it should start becoming less relevant to a user&#8217;s query. For example: the user query for &#8216;<a target="_blank" title="Google search for 'fedora release notes'" href="http://www.google.com/search?hl=en&#038;q=fedora+release+notes">fedora release notes</a>&#8217; currently has the top 3 results pointing to the notes of FC7, whereas the other results have a random mix of release notes for FC3, FC2, FC4 and FC5 (with FC3 and FC2 pages being ranked higher than FC5 and FC4 respectively). Let&#8217;s say that FC8 is going to be released this November [<a target="_blank" href="http://fedoraproject.org/wiki/Releases/8/Schedule">schedule</a>]. Assume that the Release Notes page for FC7 has the unavailable_after tag set for sometime around December. Thus, as December approaches, FC7 pages would start losing their relevancy for such queries and gradually move lower in the search rankings, making FC8 the most relevant result. This would resolve the current inaccurate ordering of results obtained for the query &#8216;<a target="_blank" title="Google search for 'fedora release notes'" href="http://www.google.com/search?hl=en&#038;q=fedora+release+notes">fedora release notes</a>&#8217; on Google. This could be achieved in a manner similar to that proposed in the following paper: <a target="_blank" href="http://semanticvoid.com/papers/time_damping_of_textual_relevance.pdf">Time Damping Of Textual Relevance.</a></li>
<li>This perspective (the inverse of the one above) would be very specific to promotional/shopping-related pages. Most shopping promotions/offers are valid for a given period. Hence, such pages should become more relevant to a user&#8217;s query as they approach their expiry date. For example: consider a user query for &#8216;<a target="_blank" title="Google search for '20% discount shoes'" href="http://www.google.com/search?hl=en&#038;q=20%25+discount+shoes">20% discount shoes</a>&#8217;. Let&#8217;s assume that the results have pages from Zappos as well as Shoebuy, both offering a 20% discount on shoes. The Shoebuy sale is going to last for about two more weeks from today (as specified by the unavailable_after tag) whereas the Zappos sale will end in another two days. Since both stores are offering the same percentage discount, it would be more appropriate to rank the Zappos page higher as its offer ends sooner. From the point of view of a user (shopper), offers ending soon are more interesting, as he/she can always check out the longer-lasting offers at some later time.</li>
</ul>
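<p>The two perspectives above can be sketched as a multiplier applied to a page&#8217;s base relevance score. This is only an illustration under stated assumptions: the names decay_weight and urgency_weight, the 30-day window and the linear shape are hypothetical choices made here, and are not taken from Google or from the Time Damping paper.</p>

```python
from datetime import datetime, timedelta

# Hypothetical window over which the expiry date starts to matter.
WINDOW = timedelta(days=30)


def remaining_fraction(now, unavailable_after, window=WINDOW):
    """Fraction of the window still left before expiry, clamped to [0, 1]."""
    remaining = (unavailable_after - now) / window
    return max(0.0, min(1.0, remaining))


def decay_weight(now, unavailable_after):
    """Perspective 1: a page loses relevance as its expiry approaches."""
    return remaining_fraction(now, unavailable_after)


def urgency_weight(now, unavailable_after):
    """Perspective 2: a live offer gains relevance as its expiry approaches."""
    if now >= unavailable_after:
        return 0.0  # expired offers should not be boosted
    return 2.0 - remaining_fraction(now, unavailable_after)


now = datetime(2007, 7, 24)
zappos_ends = now + timedelta(days=2)    # sale ending in two days
shoebuy_ends = now + timedelta(days=14)  # sale ending in two weeks

# Same base relevance for both pages; only the expiry date differs.
base = 1.0
zappos_score = base * urgency_weight(now, zappos_ends)
shoebuy_score = base * urgency_weight(now, shoebuy_ends)
print(zappos_score > shoebuy_score)  # True: the offer ending sooner ranks higher
```

<p>A real ranking function would combine such a weight with many other signals; the point here is only that the same expiry date can push a score in either direction depending on the kind of page.</p>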
<p>The perspectives above represent only a few of the conceivable uses of the unavailable_after tag. There could be numerous other perspectives on how this data could be utilized to improve search rankings.</p>
<p>I would love to hear your take on the unavailable_after tag, particularly if you can provide another perspective on utilizing this data.</p>
]]></content:encoded>
			<wfw:commentRss>http://semanticvoid.com/blog/2007/07/24/utilizing-google-unavailable-after-tag-for-search-engine-ranking/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Tag Search: Inferring Relevance From User Authority</title>
		<link>http://semanticvoid.com/blog/2007/03/04/tag-search-inferring-relevance-from-user-authority/</link>
		<comments>http://semanticvoid.com/blog/2007/03/04/tag-search-inferring-relevance-from-user-authority/#comments</comments>
		<pubDate>Sat, 03 Mar 2007 20:27:48 +0000</pubDate>
		<dc:creator>Anand Kishore</dc:creator>
				<category><![CDATA[Algorithm]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Social Networks]]></category>
		<category><![CDATA[Tagging]]></category>

		<guid isPermaLink="false">http://semanticvoid.com/blog/2007/03/04/tag-search-inferring-relevance-from-user-authority/</guid>
		<description><![CDATA[Search has always been an integral part of any tagging system. Such systems need to make sense out of the abundant user generated metadata such that the documents/items can be ranked in some order. However, very little has been said or written openly about such ranking algorithms for tagging systems. Conventional Methods Most systems, that [...]]]></description>
			<content:encoded><![CDATA[<p>Search has always been an integral part of any tagging system. Such systems need to make sense out of the abundant user generated metadata such that the documents/items can be ranked in some order. However, very little has been said or written openly about such ranking algorithms for tagging systems.</p>
<p><span style="font-weight: bold">Conventional Methods</span></p>
<p>Most systems that allow tag search base their rankings on simple factors such as the &#8216;number of unique users&#8217;, or on ratios like &#8216;number of unique users for tag t / number of unique users for all tags&#8217;. These conventional algorithms do work, but not so well for large datasets, where they can be exploited. They also often fail to represent true relevance. They remind me of the pre-<a target="_blank" href="http://en.wikipedia.org/wiki/Page_Rank">PageRank</a> era of information retrieval systems.</p>
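<p>As a rough sketch of such a conventional ratio, assuming it is computed per document and that taggings are stored as (user, document, tag) triples (both are assumptions made here, as is the name conventional_score):</p>

```python
def conventional_score(taggings, doc, tag):
    """Unique users who applied `tag` to `doc`, over unique users who tagged `doc` at all."""
    users_for_tag = {u for (u, d, t) in taggings if d == doc and t == tag}
    users_for_doc = {u for (u, d, t) in taggings if d == doc}
    if not users_for_doc:
        return 0.0
    return len(users_for_tag) / len(users_for_doc)


# Hypothetical taggings: three users tag d1 with t1, one of them also adds t2.
taggings = [
    ("u1", "d1", "t1"),
    ("u2", "d1", "t1"),
    ("u3", "d1", "t1"),
    ("u1", "d1", "t2"),
]
print(conventional_score(taggings, "d1", "t1"))  # 1.0
```

<p>Every user counts equally in this score, which is why a handful of throwaway accounts can inflate it.</p>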
<p><span style="font-weight: bold">So, which relevance algorithm do I use?</span></p>
<p>Well, you can always use the conventional methods, but then you can always try the algorithm I devised. This algorithm seems to capture the true essence of relevance in tagging systems. I call it the <span style="font-weight: bold">WisdomRank </span>as it is truly based on the &#8216;wisdom&#8217; of the crowds, the fundamental part of any social system. Read along to understand it in detail (or download the <a target="_blank" title="wisdom rank" href="http://docs.semanticvoid.com/wisdomRank.pdf">pdf</a>).</p>
<hr />
<p align="center" class="MsoNormal" style="text-align: center"><span style="font-size: 16pt; font-family: Tahoma">Inferring relevance for tag search</span></p>
<p align="center" class="MsoNormal" style="text-align: center"><span style="font-size: 16pt; font-family: Tahoma"> from user authority – Abstract</span></p>
<p class="MsoNormal"><span style="font-family: Tahoma">Tagging is an act of imparting human knowledge/wisdom to objects. Thus a tag, a one word interpretation/categorization of the object by the user, fundamentally represents the basic unit of human wisdom for any object. This wisdom is difficult to quantify as it is relative for every user. One approach to quantify this would be to use the wisdom of the other users to define this for us. This can be done by assuming that every tag corresponds to a topic for which every user has some authority. Also, every tag added to an object corresponds to a vote, similar to the Digg model, asserting that the object belongs to that topic (tag).</span></p>
<p class="MsoNormal"><span style="font-family: Tahoma">Let us consider a user Ui who has tagged object Oj with the tag Tk. Whenever other users in the system tag Oj with Tk, they are implicitly affirming Ui’s wisdom for tag Tk.</span></p>
<p class="MsoNormal"><span style="font-family: Tahoma">Thus, we define the function <strong>affirmation</strong> for the <strong>tuple(u, d, t)</strong> as the number of other users who have also tagged document d with tag t:</span></p>
<p align="center" class="MsoNormal" style="text-align: center"><strong><span style="font-family: Tahoma">affirmation(u, d, t) = ∑<sub>i=All users except ‘u’</sub> tagged(u<sub>i</sub>, d, t)</span></strong></p>
<p><span style="font-family: Tahoma">where,</span></p>
<p class="MsoNormal"><span style="font-family: Tahoma">u – the user<br />
d – the document/object<br />
t – the tag<br />
tagged(u<sub>i</sub>, d, t) – 1 if the user u<sub>i</sub> has tagged d with t, 0 otherwise</span></p>
<p class="MsoNormal"><span style="font-family: Tahoma">Hence, we can proceed to define the <strong>wisdom</strong> of the user for a topic (tag) t as the sum of all such affirmations by other users:</span></p>
<p align="center" class="MsoNormal" style="text-align: center"><strong><span style="font-family: Tahoma">wisdom(u, t) = ∑<sub>d = all documents tagged with tag t by u</sub> affirmation(u, d, t)</span></strong></p>
<p class="MsoNormal"><span style="font-family: Tahoma">Likewise, we can now define the <strong>authority</strong> of a user for the topic <strong>t</strong> as the ratio of the user’s wisdom to the collective wisdom of all users for <strong>t</strong>. Hence,</span></p>
<p align="center" class="MsoNormal" style="text-align: center; text-indent: 0.5in"><strong><span style="font-family: Tahoma">authority(u, t) = wisdom(u, t) / ∑<sub>i = all users</sub> wisdom(u<sub>i</sub>, t)</span></strong></p>
<p class="MsoNormal" style="text-indent: 0.5in"><span style="font-family: Tahoma">For example: Let us determine the authority of user u1 for tag t1, given the following taggings:</span></p>
<table>
<tr>
<th>Object d1</th>
<th>Object d2</th>
<th>Object d3</th>
</tr>
<tr>
<td>t1 by u1</td>
<td>t1 by u1</td>
<td>t1 by u2</td>
</tr>
<tr>
<td>t1 by u2</td>
<td>t3 by u1</td>
<td>t1 by u3</td>
</tr>
<tr>
<td>t1 by u3</td>
<td>t3 by u1</td>
<td></td>
</tr>
<tr>
<td>t2 by u1</td>
<td></td>
<td></td>
</tr>
</table>
<p class="MsoNormal" style="text-indent: 0.5in"><span style="font-family: Tahoma">affirmation(u1, d1, t1) = 2 (u2 and u3 also tagged d1 with t1)<br />
affirmation(u1, d2, t1) = 0 (no other user tagged d2 with t1)<br />
Hence, wisdom(u1, t1) = 2 + 0 = 2</span></p>
<p>Likewise for other users,</p>
<p class="MsoNormal" style="text-indent: 0.5in"><span style="font-family: Tahoma">wisdom(u2, t1) = 3<br />
wisdom(u3, t1) = 3</span>
</p>
<p class="MsoNormal"><span style="font-family: Tahoma"> Hence the authority of user u1 for t1 is as follows:</span></p>
<p class="MsoNormal"><span style="font-family: Tahoma">    authority(u1, t1) = 2 / (2 + 3 + 3) = 2 / 8 = 0.25</span></p>
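<p class="MsoNormal"><span style="font-family: Tahoma">The calculation above can be sketched in Python. This is a minimal sketch, not a reference implementation: it assumes that affirmation(u, d, t) counts the other users who tagged d with t, which is what the numbers in the example imply. The names TAGGINGS, affirmation, wisdom and authority are illustrative.</span></p>

```python
# Tag assignments from the example, as (user, document, tag) triples.
# Stored as a set, so the repeated "t3 by u1" on d2 counts once.
TAGGINGS = {
    ("u1", "d1", "t1"), ("u2", "d1", "t1"), ("u3", "d1", "t1"), ("u1", "d1", "t2"),
    ("u1", "d2", "t1"), ("u1", "d2", "t3"),
    ("u2", "d3", "t1"), ("u3", "d3", "t1"),
}


def affirmation(user, doc, tag, taggings=TAGGINGS):
    """Assumed definition: number of *other* users who also tagged doc with tag."""
    return sum(1 for (u, d, t) in taggings if d == doc and t == tag and u != user)


def wisdom(user, tag, taggings=TAGGINGS):
    """Sum of affirmations over every document the user tagged with this tag."""
    docs = {d for (u, d, t) in taggings if u == user and t == tag}
    return sum(affirmation(user, d, tag, taggings) for d in docs)


def authority(user, tag, taggings=TAGGINGS):
    """The user's wisdom as a fraction of the collective wisdom for the tag."""
    users = {u for (u, _, _) in taggings}
    total = sum(wisdom(u, tag, taggings) for u in users)
    return wisdom(user, tag, taggings) / total if total else 0.0


print(wisdom("u1", "t1"), wisdom("u2", "t1"), wisdom("u3", "t1"))  # 2 3 3
print(authority("u1", "t1"))  # 0.25
```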
<p class="MsoNormal"><span style="font-family: Tahoma">Whenever a user tags an object, they do so with the authority they possess for that tag. Conventional methods usually rank objects by the number of instances of a tag; in this method, the relevance of a tag for an object is instead the sum of the authorities of the users who applied it. Thus,</span></p>
<p align="center" class="MsoNormal" style="text-align: center"><strong><span style="font-family: Tahoma">relevance_metric(d, t) = ∑<sub>all users u who have tagged document d with t</sub> authority(u, t)</span></strong></p>
<p class="MsoNormal"><span style="font-family: Tahoma">This relevance score, calculated for every tag on an object, provides a measure for ranking the objects. In conventional methods, more instances of a tag on an object ensured a higher relevance for that tag; here it is the authority of the users who applied the tag that counts.</span></p>
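<p class="MsoNormal"><span style="font-family: Tahoma">As a rough illustration, the relevance metric can be applied to the earlier example for tag t1. The authorities used here are the ones that follow from the wisdom values above (2/8, 3/8 and 3/8); the dictionary and function names are assumptions made for this sketch.</span></p>

```python
# Authorities for tag t1 from the worked example (2/8, 3/8 and 3/8).
AUTHORITY_T1 = {"u1": 0.25, "u2": 0.375, "u3": 0.375}

# Users who applied t1 to each object in that example.
TAGGERS_T1 = {
    "d1": {"u1", "u2", "u3"},
    "d2": {"u1"},
    "d3": {"u2", "u3"},
}


def relevance_metric(doc, taggers=TAGGERS_T1, authority=AUTHORITY_T1):
    """Sum the authorities of every user who tagged doc with the tag."""
    return sum(authority.get(user, 0.0) for user in taggers.get(doc, ()))


# Rank the objects for a search on t1, most relevant first.
ranking = sorted(TAGGERS_T1, key=relevance_metric, reverse=True)
print(ranking)  # ['d1', 'd3', 'd2']
```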
<p class="MsoNormal"><span style="font-family: Tahoma">Let us consider the following example:</span></p>
<p class="MsoNormal"><span style="font-family: Tahoma">Object d1: t1 by u1, t2 by u5<br />
Object d2: t1 by u2, t1 by u3, t1 by u4</span>
</p>
<p class="MsoNormal"><span style="font-family: Tahoma">Let us assume that u1 has a very high authority for tag t1. In the above scenario, a search for tag t1 would rank d1 higher than d2 if</span></p>
<p class="MsoNormal"><span style="font-family: Tahoma">authority(u1, t1) <strong>&gt;</strong> authority(u2, t1) + authority(u3, t1) + authority(u4, t1)</span></p>
<p class="MsoNormal"><span style="font-family: Tahoma">That is, d1 wins only when u1’s authority is greater than those of u2, u3 and u4 combined.</span></p>
<p class="MsoNormal"><span style="font-family: Tahoma">On the other hand, d2 would be ranked higher than d1 if the combined authorities of u2, u3 and u4 exceed that of u1. When several users agree on a tag, their combined suggestion can outweigh that of a single user, however authoritative.</span></p>
<p><strong><span style="font-family: Tahoma">Future Enhancements</span></strong></p>
<p class="MsoNormal"><span style="font-family: Tahoma">While calculating the user assertions, this algorithm currently treats all asserting users as equal, even though they may have varying authorities for the corresponding tag. As a future enhancement, I plan to incorporate the authorities of the users into the affirmation calculations as well.</span></p>
]]></content:encoded>
			<wfw:commentRss>http://semanticvoid.com/blog/2007/03/04/tag-search-inferring-relevance-from-user-authority/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Similarity Measure: Cosine Similarity or Euclidean Distance or Both</title>
		<link>http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/</link>
		<comments>http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/#comments</comments>
		<pubDate>Thu, 22 Feb 2007 20:23:43 +0000</pubDate>
		<dc:creator>Anand Kishore</dc:creator>
				<category><![CDATA[Search]]></category>

		<guid isPermaLink="false">http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/</guid>
		<description><![CDATA[This weekend I sat down to experiment with my project data to see if I could generate &#8216;related&#8217; documents. At first, cosine similarity seemed very promising. The results seemed awfully similar and I was overjoyed to have completed such a cool feature in about an hour&#8217;s time. But then it struck me, the usual feeling [...]]]></description>
			<content:encoded><![CDATA[<p>This weekend I sat down to experiment with my project data to see if I could generate &#8216;<em>related</em>&#8217; documents. At first, cosine similarity seemed very promising. The results seemed awfully similar and I was overjoyed to have completed such a cool feature in about an hour&#8217;s time. But then it struck me, the usual feeling you get that something is wrong when everything is working out smoothly. I realized that cosine similarity alone was not sufficient for finding similar documents.</p>
<p><strong>So, what is cosine similarity?</strong></p>
<p>Those not acquainted with Term Vector Theory and Cosine similarity can read <a href="http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html">this article</a>.</p>
<p><strong>Why does cosine similarity fail to capture the whole picture?</strong></p>
<p align="center"><img src="http://semanticvoid.com/images/cosine_similarity.png" /></p>
<p align="left">Let us consider two documents A and B represented by the vectors in the above figure. The cosine normalizes both vectors to unit length, so it measures only the angle between them. It is a sound measure of how closely the two documents are aligned, but it ignores magnitude, and magnitude is an important factor when considering similarity.</p>
<p align="left">For example, take a document in which &#8216;machine&#8217; occurs 3 times and &#8216;learning&#8217; 4 times, and another in which &#8216;machine&#8217; occurs 300 times and &#8216;learning&#8217; 400 times. The cosine between them is 1, meaning that the two documents point in exactly the same direction. If magnitude (euclidean distance) were taken into account, the results would be quite different.</p>
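<p align="left">The example can be checked with a few lines of Python. This is only a sketch using raw term counts for (&#8216;machine&#8217;, &#8216;learning&#8217;) as the vectors; the function and variable names are illustrative.</p>

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between the two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Term counts for ('machine', 'learning') in the two documents.
doc_a = (3, 4)
doc_b = (300, 400)

print(cosine_similarity(doc_a, doc_b))   # 1.0
print(euclidean_distance(doc_a, doc_b))  # 495.0
```

<p align="left">The cosine reports the two documents as identical in direction, while the euclidean distance shows how far apart they are in magnitude.</p>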
<p align="left"><strong>How do I get the accurate measure of similarity?</strong></p>
<p align="left">We have at our disposal two factors: one the cosine which gives us a measure of how similar two documents are, and the second the (euclidean) distance which gives us the magnitude of difference between the two documents. There could be a number of ways you could combine the two to determine the similarity measure.</p>
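<p align="left">One possible way to combine the two is sketched below. It is purely hypothetical and not necessarily the combination I used: the cosine is damped by the distance, so that two documents pointing in the same direction score higher when they are also close in magnitude. The name combined_similarity and the 1 / (1 + distance) damping are assumptions made for illustration.</p>

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between the two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def combined_similarity(a, b):
    """Hypothetical combination: cosine damped by 1 / (1 + distance)."""
    return cosine_similarity(a, b) / (1.0 + euclidean_distance(a, b))


query = (3, 4)
near = (6, 8)       # same direction, similar magnitude
far = (300, 400)    # same direction, very different magnitude

# Both candidates have a cosine of 1 with the query; only the distance differs.
print(combined_similarity(query, near))
print(combined_similarity(query, far))
```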
<p align="left"><strong>Conclusion</strong></p>
<p align="left">The magnitude and the cosine each provide us with a different aspect of similarity between two entities. It is up to us to either use them individually or in unison (as above) depending upon our application needs.</p>
<p align="left"><strong>[Update]</strong></p>
<p align="left">As pointed out by Dr. E. Garcia (in the comments), similarities can be expressed by cosines, dot products, Jaccard’s Coefficients and in many other ways.</p>
]]></content:encoded>
			<wfw:commentRss>http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/feed/</wfw:commentRss>
		<slash:comments>32</slash:comments>
		</item>
		<item>
		<title>Whats The Buzz Of The Shoposphere?</title>
		<link>http://semanticvoid.com/blog/2006/12/11/whats-the-buzz-in-the-shoposphere/</link>
		<comments>http://semanticvoid.com/blog/2006/12/11/whats-the-buzz-in-the-shoposphere/#comments</comments>
		<pubDate>Mon, 11 Dec 2006 15:16:11 +0000</pubDate>
		<dc:creator>Anand Kishore</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[Social Networks]]></category>
		<category><![CDATA[Web 2.0]]></category>
		<category><![CDATA[shopping]]></category>

		<guid isPermaLink="false">http://semanticvoid.com/blog/2006/12/11/whats-the-buzz-in-the-shoposphere/</guid>
		<description><![CDATA[That certainly was a tough question to answer, but not anymore. Whatsbuzzing, which released a few days back (reminds me of the sleepless nights :-)), helps you do just that. The brainchild of Anand Jagannathan, Whatsbuzzing is aimed at solving the online shopping woes of users. As described in Anand&#8217;s blog: Whatsbuzzing is a destination [...]]]></description>
			<content:encoded><![CDATA[<p>That certainly <strong>was</strong> a tough question to answer, but not anymore. <a target="_blank" href="http://www.whatsbuzzing.com">Whatsbuzzing</a>, which was released a few days back (reminds me of the sleepless nights :-)), helps you do just that. The brainchild of <a target="_blank" href="http://kriyari.com/company_management.html">Anand Jagannathan</a>, Whatsbuzzing is aimed at solving the online shopping woes of users. As described in <a target="_blank" href="http://whatsbuzzing.wordpress.com">Anand&#8217;s blog</a>:</p>
<blockquote><p><a title="Whatsbuzzing" target="_blank" href="http://www.whatsbuzzing.com/">Whatsbuzzing</a> is a destination site for online shopping. The site offers a one-stop service where consumers can browse across hundreds of storefronts, view the latest trends and find the hottest deals. In contrast to comparison shopping or product information sites, <a title="Whatsbuzzing" target="_blank" href="http://www.whatsbuzzing.com/">Whatsbuzzing</a> provides visitors with the experience of a shopping mall. Visitors can browse storefronts by content, category or store name. A visitor can also tag storefronts so other consumers can find storefronts that are interesting. The storefronts are fully interactive and are constantly being updated with fresh content and timely offers.</p></blockquote>
<p>As stated above, what makes it different from the plethora of shopping services is the unique content. Instead of just showcasing product details and prices, it also helps you keep track of the latest deals/discounts/offers &#8211; capturing the buzz in its true essence.</p>
<p>Another factor that makes it stand apart is its foray into being a <strong>browsing engine</strong> as compared to the omnipotent search engines. Although search is an integral part of Whatsbuzzing, it is just another feature to help users find products quickly.</p>
<p>It is surely the panacea to all my shopping woes. With the Christmas season setting in, why don&#8217;t you give it a try and come back with some feedback?</p>
]]></content:encoded>
			<wfw:commentRss>http://semanticvoid.com/blog/2006/12/11/whats-the-buzz-in-the-shoposphere/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Using Lucene As A Database</title>
		<link>http://semanticvoid.com/blog/2006/06/06/using-lucene-as-a-database/</link>
		<comments>http://semanticvoid.com/blog/2006/06/06/using-lucene-as-a-database/#comments</comments>
		<pubDate>Tue, 06 Jun 2006 04:32:56 +0000</pubDate>
		<dc:creator>Anand Kishore</dc:creator>
				<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Search]]></category>

		<guid isPermaLink="false">http://semanticvoid.com/blog/2006/06/06/using-lucene-as-a-database/</guid>
		<description><![CDATA[Many a time we index fields in a document which contribute only to classify/distinguish documents and not to its relevance. An analogy would be documents in a library. Here &#8216;category&#8217; could be the field which classifies the domain within which the document belongs. Therefore a typical text search query would go as: content:"neural network" AND (category:biology [...]]]></description>
			<content:encoded><![CDATA[<p>Many a time we index fields in a document which serve only to classify/distinguish documents and not to determine their relevance. An analogy would be documents in a library. Here &#8216;<em>category</em>&#8217; could be the field which classifies the domain within which the document belongs. Therefore a typical text search query would go as:</p>
<p>content:"neural network" AND (category:biology OR category:AI)</p>
<p>Everything seems to work fine. Well, not yet. In the above query we are trying to retrieve documents containing the words &#8216;<em>neural network</em>&#8217;. But if you look closely (try getting an explanation of the score in Lucene), although the category sub-query seems to be used only for limiting the range of documents to particular domains, it contributes to the relevance as well.</p>
<p>So you must be wondering &#8220;How do I get documents from Biology or AI with ranking based on their relevance to &#8216;neural network&#8217;?&#8221;. Here is how. You don&#8217;t need to hack around the Lucene source code. All you have to do is give a <em><strong>nullifying boost</strong></em> (that&#8217;s a cool oxymoron (-;) to the respective sub-query. By <em><strong>nullifying boost</strong></em> I mean a boost value so small that it in effect nullifies the score of the sub-query (something like 0.00001). Therefore the revised query would look like:</p>
<p>content:"neural network" AND (category:biology OR category:AI)^0.00001</p>
<p>Thus, although the category sub-query is a <strong>must match</strong> for a document in order to be part of the result set, it contributes next to nothing to the score of the document. I like to term such queries <em><strong>non-relevant booleans</strong></em>. <em>Non-relevant</em> as it does not contribute to relevance and <em>boolean</em> as in the condition (AND or OR) as per which it must match.</p>
<p>This lets us harness the querying capabilities of a database from within a search engine.</p>
<p>[UPDATE] A <em><strong>nullifying boost</strong></em> of zero would be the ideal case wherein you don&#8217;t want the subquery to contribute to the score at all. A non-zero value for the same would give you more control over the subquery&#8217;s contribution to the score.</p>
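<p>To see why a tiny boost changes the ranking, here is a toy model in Python. It is an illustration only and does not reproduce Lucene&#8217;s actual scoring formula: it simply assumes that a document&#8217;s score is its content score plus the boost times its category score. The document names and score values are made up.</p>

```python
# Toy additive scoring model (an assumption, not Lucene's real formula):
# score = content_score + boost * category_score
DOCS = {
    # name: (content_score, category_score) -- made-up numbers
    "doc_a": (0.9, 0.1),
    "doc_b": (0.6, 0.8),
}


def score(doc, category_boost):
    """Score of a document that matches both sub-queries."""
    content_score, category_score = DOCS[doc]
    return content_score + category_boost * category_score


def rank(category_boost):
    """Document names ordered by score, best first."""
    return sorted(DOCS, key=lambda d: score(d, category_boost), reverse=True)


print(rank(1.0))      # ['doc_b', 'doc_a']  category match dominates
print(rank(0.00001))  # ['doc_a', 'doc_b']  content relevance decides
```

<p>With the default boost the category match pushes doc_b to the top; with the nullifying boost the ranking follows the content score alone, while both documents still have to match the category clause.</p>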
]]></content:encoded>
			<wfw:commentRss>http://semanticvoid.com/blog/2006/06/06/using-lucene-as-a-database/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>


