<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>semanticvoid &#187; Search</title>
	<atom:link href="http://semanticvoid.com/blog/index.php/tag/search/feed/" rel="self" type="application/rss+xml" />
	<link>http://semanticvoid.com/blog</link>
	<description>extracting the semantics from the void</description>
	<lastBuildDate>Thu, 22 Sep 2011 21:05:48 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.4</generator>
		<item>
		<title>stop words</title>
		<link>http://semanticvoid.com/blog/2010/08/24/stop-words/</link>
		<comments>http://semanticvoid.com/blog/2010/08/24/stop-words/#comments</comments>
		<pubDate>Wed, 25 Aug 2010 04:00:16 +0000</pubDate>
		<dc:creator>Anand Kishore</dc:creator>
				<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[text]]></category>

		<guid isPermaLink="false">http://semanticvoid.com/blog/?p=447</guid>
		<description><![CDATA[In a recent implementation for a near duplicate detection task I relied on stop words as key features in extracting signatures from text. The results turned out to be good but that&#8217;s not what I&#8217;m focusing on here. This was quite contrary to the mindset in the IR/NLP domain we have been accustomed to, where [...]]]></description>
			<content:encoded><![CDATA[<p>In a recent implementation of a near-duplicate detection task, I relied on stop words as key features for extracting signatures from text. The results turned out to be good, but that&#8217;s not what I&#8217;m focusing on here. This approach runs contrary to the mindset we have grown accustomed to in the IR/NLP domain, where these words are considered meaningless and are removed before building any model or index. Yet these words encode a plethora of information: tense, plurality, (un)certainty, subjectivity and more. They bind the semantics of a sentence together and give it context. Still (at least in the IR sense) we give them a negative connotation (<em>STOP/NN -0.140192 sentiment</em>). I would go a step further and say that we should stop calling them *stop* words and instead accept the inability of some IR systems to make correct use of them. How about *glue* words for a change? Or maybe not.</p>
<p>PS: In case you are looking for a list of stop words for different languages, here is a good one &#8211; <a href="http://members.unine.ch/jacques.savoy/clef/">http://members.unine.ch/jacques.savoy/clef/</a></p>
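<p>To make the idea concrete, here is a minimal sketch of one possible way to build signatures anchored at stop words. It is not the implementation referred to above, whose details are not given here. The stop-word list, the function names <code>stopword_signatures</code> and <code>jaccard</code>, and the <code>chain_length</code> parameter are all assumptions made for illustration.</p>

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "of", "in", "on", "to", "and", "that"}


def stopword_signatures(text, chain_length=2):
    """Return the set of signatures anchored at stop words.

    Each signature is a stop word followed by the next `chain_length`
    non-stop-word tokens that come after it in the text.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    signatures = set()
    for i, token in enumerate(tokens):
        if token not in STOP_WORDS:
            continue
        followers = [t for t in tokens[i + 1:] if t not in STOP_WORDS][:chain_length]
        # Skip anchors too close to the end of the text to form a full chain.
        if len(followers) == chain_length:
            signatures.add((token,) + tuple(followers))
    return signatures


def jaccard(a, b):
    """Jaccard similarity of two signature sets (0.0 when both are empty)."""
    union = a.union(b)
    return len(a.intersection(b)) / len(union) if union else 0.0
```

<p>Two documents could then be compared with <code>jaccard(stopword_signatures(doc_a), stopword_signatures(doc_b))</code>. The threshold above which a pair counts as a near duplicate would have to be tuned on real data.</p>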
]]></content:encoded>
			<wfw:commentRss>http://semanticvoid.com/blog/2010/08/24/stop-words/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Dygest Your Search</title>
		<link>http://semanticvoid.com/blog/2009/03/19/dygest-your-search/</link>
		<comments>http://semanticvoid.com/blog/2009/03/19/dygest-your-search/#comments</comments>
		<pubDate>Fri, 20 Mar 2009 06:56:36 +0000</pubDate>
		<dc:creator>Anand Kishore</dc:creator>
				<category><![CDATA[Hacking]]></category>
		<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[Project]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[Web]]></category>
		<category><![CDATA[Yahoo!]]></category>
		<category><![CDATA[summarization]]></category>

		<guid isPermaLink="false">http://semanticvoid.com/blog/?p=256</guid>
		<description><![CDATA[Update: This hack won the coveted &#8216;Search&#8217; category award. For the last couple of days, @sudheer_624 and I have been busy working on this hack for a Yahoo! Hackday. Although still a prototype, the hack has turned out to be interesting, so we thought of putting it out for others to play around with. Dygest [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Update:</strong> This hack won the coveted &#8216;Search&#8217; category award.</p>
<p>For the last couple of days, <a href="http://twitter.com/sudheer_624">@sudheer_624</a> and I have been busy working on this hack for a Yahoo! Hackday. Although still a prototype, the hack has turned out to be interesting, so we thought of putting it out for others to play around with.</p>
<p><strong>Dygest</strong> (pronounced &#8216;digest&#8217; &#8211; thanks to <a href="http://twitter.com/bluesmoon">@bluesmoon</a>) aims to change the conventional way of displaying search context, replacing the snippet with a more informative, machine-generated document summary. There are two kinds of relevance to consider when evaluating search results:</p>
<ul>
<li>Vertical relevance: determined by the ranking algorithms.</li>
<li>Horizontal relevance: the contextual information made available to the user about the result &#8211; SearchMonkey is a good initiative on this front.</li>
</ul>
<p>
The current way of displaying this context is via a snippet of text under every result. This snippet shows the neighborhood of the occurrences of the query terms. Usually this information is not rich enough for a searcher to make the right judgement about the result, so the searcher ends up switching back and forth between the documents and the search results whenever a page turns out not to be relevant. This can be frustrating at times.</p>
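<p>As a rough illustration of what such a conventional snippet looks like, the sketch below returns the words surrounding the first occurrence of any query term. It is a simplified assumption of how a snippet might be produced, not a description of any particular search engine. The function name <code>query_snippet</code> and the <code>window</code> parameter are hypothetical.</p>

```python
import re


def query_snippet(text, query, window=5):
    """Return the words around the first occurrence of any query term."""
    words = text.split()
    terms = {t.lower() for t in query.split()}
    for i, word in enumerate(words):
        # Compare on a normalised form so that "Bartz," matches "bartz".
        if re.sub(r"\W+", "", word).lower() in terms:
            start = max(0, i - window)
            end = min(len(words), i + window + 1)
            # Mark truncation on either side with an ellipsis.
            prefix = "... " if start > 0 else ""
            suffix = " ..." if end != len(words) else ""
            return prefix + " ".join(words[start:end]) + suffix
    return ""
```

<p>Because the output is only a fixed window around a match, it says little about what the page as a whole is about, which is the limitation discussed above.</p>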
<p>
<strong>Dygest</strong> aims to solve this by either replacing or enhancing the current search snippet with a summary of the result page. At its core lies a summarization engine that first figures out what the *real* content of the page is (distinguishing it from junk such as surrounding text, navigational text, comments, etc.) and then performs text summarization on this content. The summary of the page is then displayed to the user via the appropriate interface. How cool is that?</p>
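<p>The summarization step can be sketched with a simple frequency-based extractive summarizer. This is only an illustration of the general technique and makes no claim about how the Dygest engine actually works. It assumes the main content has already been extracted as plain text, and the stop-word list, the scoring rule and the <code>summarize</code> function are all hypothetical.</p>

```python
import re
from collections import Counter

# Small illustrative list of words to ignore when scoring (an assumption).
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "of", "in", "on", "to", "and", "that"}


def content_words(text):
    """Lower-cased words of `text`, minus the words in STOP_WORDS."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def summarize(text, max_sentences=2):
    """Pick the highest-scoring sentences and return them in document order."""
    sentences = [s.strip() for s in re.findall(r"[^.!?]+[.!?]*", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        # Average document-wide frequency of the sentence's content words.
        tokens = content_words(sentence)
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)
```

<p>Sentences are scored by how frequent their words are in the whole document, and the selected sentences are emitted in their original order so that the summary still reads naturally.</p>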
<p>
The user no longer needs to click on irrelevant links and can perceive the theme and important facts of a page from right within the results page. The other advantage is that it gives the user a good overview of the query topic &#8211; instead of spending time reading many long documents, the user can read a few summaries from the top results. This is particularly well suited to mobile devices, where it&#8217;s frustrating to switch back and forth between pages and the search results. It is also a good fit for news articles, where we just need the important facts about the story.</p>
<p>
Well, here is an example to convince you. A search for &#8216;Carol Bartz&#8217; yields the following result, which at first glance is not at all informative.</p>
<p><center> <img alt="" border="2" src="http://farm4.static.flickr.com/3456/3369960208_48edc07644_o.png" title="search snippet for Carol Bartz" /> </center></p>
<p>
Enhancing the existing view with an abstract of the page helps gauge the content and theme of the document. This would now look like:</p>
<p><center> <img alt="" src="http://farm4.static.flickr.com/3637/3369975750_f0b313ae61_o.png" title="summarized view" /> </center></p>
<p><strong>Dygest</strong> outputs the following summaries for the query &#8216;<a href="http://datacracy.info/cgi-bin/dygest/search.py?q=iran+site%3Anews.yahoo.com">Iran</a>&#8217; restricted to Yahoo! News:</p>
<p><center><img alt="" src="http://farm4.static.flickr.com/3658/3370011200_a757dc42d8_o.png" title="Query for Iran" /></center></p>
<p>And the following for &#8216;<a href="http://datacracy.info/cgi-bin/dygest/search.py?q=obama+stimulus+plan">Obama stimulus plan</a>&#8217;:</p>
<p><center><img alt="" src="http://farm4.static.flickr.com/3578/3370098322_1a73cd285b_o.png" title="obama stimulus plan"  /></center></p>
<p>Currently, <strong>Dygest</strong> has two interfaces &#8211; (1) a search interface powered by Yahoo! BOSS and (2) a SearchMonkey plugin. It&#8217;s just a prototype, so be kind and don&#8217;t be too judgmental.</p>
<p>Start dygest<em>ing</em> <a href="http://datacracy.info/dygest/">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://semanticvoid.com/blog/2009/03/19/dygest-your-search/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>


