<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Similarity Measure: Cosine Similarity or Euclidean Distance or Both</title>
	<atom:link href="http://semanticvoid.com/blog/index.php/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/feed/" rel="self" type="application/rss+xml" />
	<link>http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/</link>
	<description>extracting the semantics from the void</description>
	<lastBuildDate>Sat, 28 Jan 2012 05:39:00 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.4</generator>
	<item>
		<title>By: Hamsalekha Sr</title>
		<link>http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/#comment-183</link>
		<dc:creator>Hamsalekha Sr</dc:creator>
		<pubDate>Sat, 28 Jan 2012 05:39:00 +0000</pubDate>
		<guid isPermaLink="false">http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/#comment-183</guid>
		<description>Can I get the Perl code for finding the cosine similarity of two documents on a Windows machine?</description>
		<content:encoded><![CDATA[<p>Can I get the Perl code for finding the cosine similarity of two documents on a Windows machine?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Rashmileos</title>
		<link>http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/#comment-156</link>
		<dc:creator>Rashmileos</dc:creator>
		<pubDate>Tue, 01 Nov 2011 01:07:00 +0000</pubDate>
		<guid isPermaLink="false">http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/#comment-156</guid>
		<description>If I have to find the cosine similarity between a query and a document, should I consider all words in the document, or just the words that appear in the query?
Thanks. </description>
		<content:encoded><![CDATA[<p>If I have to find the cosine similarity between a query and a document, should I consider all words in the document, or just the words that appear in the query?<br />
Thanks.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Antoine Imbert</title>
		<link>http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/#comment-76</link>
		<dc:creator>Antoine Imbert</dc:creator>
		<pubDate>Wed, 27 Apr 2011 03:34:00 +0000</pubDate>
		<guid isPermaLink="false">http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/#comment-76</guid>
		<description>If the norm of the vector representing the first document is A LOT smaller than the norm of the vector representing the second document, then your documents are of VERY different sizes. The whole magic of Cosine Similarity is that it abstracts away the size of the documents. You are interested in the similarity of the topics of the contents, not in the similarity of their sizes.</description>
		<content:encoded><![CDATA[<p>If the norm of the vector representing the first document is A LOT smaller than the norm of the vector representing the second document, then your documents are of VERY different sizes. The whole magic of Cosine Similarity is that it abstracts away the size of the documents. You are interested in the similarity of the topics of the contents, not in the similarity of their sizes.</p>
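<p>A minimal sketch of this size-invariance, assuming documents are represented as plain term-count vectors. The vectors and the helper names <code>cosine_similarity</code> and <code>euclidean_distance</code> are illustrative assumptions, not code from the original post:</p>

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length term-count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between the two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical term counts: the long document repeats the short one ten times,
# so the word proportions are identical and only the size differs.
short_doc = [2, 1, 0, 3]
long_doc = [20, 10, 0, 30]

cos_value = cosine_similarity(short_doc, long_doc)
dist_value = euclidean_distance(short_doc, long_doc)

print("cosine similarity:", cos_value)    # close to 1: same direction
print("euclidean distance:", dist_value)  # large: very different norms
```

<p>The cosine stays at its maximum because both vectors point in the same direction, while the Euclidean distance grows with the difference in document size.</p>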
]]></content:encoded>
	</item>
	<item>
		<title>By: Yaroslav Bulatov</title>
		<link>http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/#comment-75</link>
		<dc:creator>Yaroslav Bulatov</dc:creator>
		<pubDate>Mon, 14 Mar 2011 05:39:00 +0000</pubDate>
		<guid isPermaLink="false">http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/#comment-75</guid>
		<description>An interesting variation on cosine similarity is the &quot;Fisher metric on multinomial manifold&quot;. The idea is to treat documents as multinomial probability distributions, use KL divergence to define the distance for a pair of infinitesimally close distributions, and take the shortest path integral to define the distance for an arbitrary pair of distributions. Surprisingly, this has a simple closed form. It looks like cosine similarity, except that you take square roots of relative frequencies; see formula 17.9 in http://yaroslavvb.com/upload/save/lebanon-axiomatic.pdf</description>
		<content:encoded><![CDATA[<p>An interesting variation on cosine similarity is the &#8220;Fisher metric on multinomial manifold&#8221;. The idea is to treat documents as multinomial probability distributions, use KL divergence to define the distance for a pair of infinitesimally close distributions, and take the shortest path integral to define the distance for an arbitrary pair of distributions. Surprisingly, this has a simple closed form. It looks like cosine similarity, except that you take square roots of relative frequencies; see formula 17.9 in <a href="http://yaroslavvb.com/upload/save/lebanon-axiomatic.pdf" rel="nofollow">http://yaroslavvb.com/upload/save/lebanon-axiomatic.pdf</a></p>
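<p>A rough sketch of the square-root idea, not a transcription of formula 17.9. It assumes the closed form is the arc cosine of the dot product of the square-rooted relative frequencies, with a constant factor of 2 that is also an assumption here. The function names are hypothetical:</p>

```python
import math

def relative_frequencies(counts):
    """Turn raw term counts into a probability vector that sums to 1."""
    total = sum(counts)
    return [c / total for c in counts]

def sqrt_cosine(counts_a, counts_b):
    """Dot product of the square-rooted relative frequencies.

    The square-rooted vectors already have length 1, so this dot product
    is the cosine of the angle between them.
    """
    p = relative_frequencies(counts_a)
    q = relative_frequencies(counts_b)
    return sum(math.sqrt(x * y) for x, y in zip(p, q))

def fisher_distance(counts_a, counts_b):
    """Assumed geodesic distance: 2 * arccos of the value above."""
    s = min(1.0, sqrt_cosine(counts_a, counts_b))  # guard against rounding above 1
    return 2.0 * math.acos(s)

# Hypothetical term counts for three small documents.
doc_a = [1, 2, 3]
doc_b = [2, 4, 6]   # same proportions as doc_a
doc_c = [5, 1, 0]

print(sqrt_cosine(doc_a, doc_b), fisher_distance(doc_a, doc_b))
print(sqrt_cosine(doc_a, doc_c), fisher_distance(doc_a, doc_c))
```

<p>Documents with the same word proportions get a distance near zero, and documents with no words in common get the largest possible distance.</p>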
]]></content:encoded>
	</item>
	<item>
		<title>By: diya</title>
		<link>http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/#comment-74</link>
		<dc:creator>diya</dc:creator>
		<pubDate>Thu, 28 Oct 2010 04:21:00 +0000</pubDate>
		<guid isPermaLink="false">http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/#comment-74</guid>
		<description>Hi, it&#039;s me, Diya. I have a question: is the sample correlation coefficient equal to the cosine similarity? If it is, then how? Do you have any idea about this, or any solution? I need your advice. Please let me know.</description>
		<content:encoded><![CDATA[<p>Hi, it&#8217;s me, Diya. I have a question: is the sample correlation coefficient equal to the cosine similarity? If it is, then how? Do you have any idea about this, or any solution? I need your advice. Please let me know.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Justin Washtell</title>
		<link>http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/#comment-73</link>
		<dc:creator>Justin Washtell</dc:creator>
		<pubDate>Tue, 17 Aug 2010 23:53:34 +0000</pubDate>
		<guid isPermaLink="false">http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/#comment-73</guid>
		<description>Hi Anand, et al,

Interesting dialogue. There are some other very interesting points worth bearing in mind when working with these measures.

Cosine Similarity and Euclidean Distance capture a lot of the same information. However, whereas Euclidean Distance measures an actual distance between the two points of interest, Cosine can be thought of as measuring their apparent distance as viewed from the origin. Think of stars in the sky if you like analogies. The stars in Taurus are all relatively close together from our point of view. But in reality some of them are probably many times closer to us than others - the distance information has been effectively discarded (this is your normalization factor).

If you&#039;re sailing a ship and navigating by the stars, then this is an appropriate thing to do. If you&#039;re navigating an interstellar spacecraft, it is probably not. And so it is with your application. Think of the meaning of the vectors with respect to your application and you will choose the most appropriate measure. But combining Cosine with Euclidean will probably not get you very far, as you are simply re-using a lot of information you already have.

Another way to think of Cosine Similarity is as measuring the *relative* proportions of the various features or dimensions - when all the dimensions between two vectors are in proportion (correlated), you get maximum similarity. Euclidean distance and its relatives (like Manhattan distance) are more concerned with absolutes. In practice though, the measures will often give similar results, especially on very high dimensional data (by normalization, Cosine Similarity effectively reduces the dimensionality of your data by 1. The higher the dimensionality of your data therefore - all other things being equal - the less significant this difference becomes. All things *may not* be equal though and it may be that the distance from the origin is a dimension of special significance - so again, check the logic of your app).

Another fact about Cosine Similarity that has been pointed out is that it is not really a measure of distance. This is an issue of semantics really. You can address this complaint to a large extent by squaring the value. Your value still ranges between 0 and 1, but it now has the property that the complement of the value (1 minus the value) is equal to the square of the *Sine* of the angle between the vectors - which is an equivalent measure of *dissimilarity*. You need to square the values for this relationship to hold. Doing this gives you something akin to a proportion of similarity or dissimilarity (when it is 0.5 you might - but probably shouldn&#039;t - say that your points are neither particularly similar nor dissimilar). It is equivalent to taking the R² value when analysing correlations (which is usually the done thing when trying to do anything cleverer than just rank the data).

While we&#039;re on the subject of interpreting values, if your vector components comprise probabilities which sum to 1 (e.g. normalized frequency counts, which represent the probability of finding any particular word at a random location in the document), then you can normalize the vector by simply converting all of the probabilities to their square roots. This will naturally give you a vector whose *length* is 1. Note that this is not the same as scaling the vector linearly as is often done (and as Cosine Similarity does), although that also results in a vector of length 1! I&#039;ll leave you to work out whether that is a good thing or not. :-)</description>
		<content:encoded><![CDATA[<p>Hi Anand, et al,</p>
<p>Interesting dialogue. There are some other very interesting points worth bearing in mind when working with these measures.</p>
<p>Cosine Similarity and Euclidean Distance capture a lot of the same information. However, whereas Euclidean Distance measures an actual distance between the two points of interest, Cosine can be thought of as measuring their apparent distance as viewed from the origin. Think of stars in the sky if you like analogies. The stars in Taurus are all relatively close together from our point of view. But in reality some of them are probably many times closer to us than others &#8211; the distance information has been effectively discarded (this is your normalization factor).</p>
<p>If you&#8217;re sailing a ship and navigating by the stars, then this is an appropriate thing to do. If you&#8217;re navigating an interstellar spacecraft, it is probably not. And so it is with your application. Think of the meaning of the vectors with respect to your application and you will choose the most appropriate measure. But combining Cosine with Euclidean will probably not get you very far, as you are simply re-using a lot of information you already have.</p>
<p>Another way to think of Cosine Similarity is as measuring the *relative* proportions of the various features or dimensions &#8211; when all the dimensions between two vectors are in proportion (correlated), you get maximum similarity. Euclidean distance and its relatives (like Manhattan distance) are more concerned with absolutes. In practice though, the measures will often give similar results, especially on very high dimensional data (by normalization, Cosine Similarity effectively reduces the dimensionality of your data by 1. The higher the dimensionality of your data therefore &#8211; all other things being equal &#8211; the less significant this difference becomes. All things *may not* be equal though and it may be that the distance from the origin is a dimension of special significance &#8211; so again, check the logic of your app).</p>
<p>Another fact about Cosine Similarity that has been pointed out is that it is not really a measure of distance. This is an issue of semantics really. You can address this complaint to a large extent by squaring the value. Your value still ranges between 0 and 1, but it now has the property that the complement of the value (1 minus the value) is equal to the square of the *Sine* of the angle between the vectors &#8211; which is an equivalent measure of *dissimilarity*. You need to square the values for this relationship to hold. Doing this gives you something akin to a proportion of similarity or dissimilarity (when it is 0.5 you might &#8211; but probably shouldn&#8217;t &#8211; say that your points are neither particularly similar nor dissimilar). It is equivalent to taking the R<sup>2</sup> value when analysing correlations (which is usually the done thing when trying to do anything cleverer than just rank the data).</p>
<p>While we&#8217;re on the subject of interpreting values, if your vector components comprise probabilities which sum to 1 (e.g. normalized frequency counts, which represent the probability of finding any particular word at a random location in the document), then you can normalize the vector by simply converting all of the probabilities to their square roots. This will naturally give you a vector whose *length* is 1. Note that this is not the same as scaling the vector linearly as is often done (and as Cosine Similarity does), although that also results in a vector of length 1! I&#8217;ll leave you to work out whether that is a good thing or not. :-)</p>
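<p>The two numerical claims above can be checked with a short sketch. The example vectors and variable names are assumptions chosen for illustration only:</p>

```python
import math

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

# 1) Squared cosine and squared sine of the same angle are complements.
a = [3.0, 4.0]   # hypothetical feature vectors
b = [4.0, 3.0]
cos_value = cosine_similarity(a, b)
angle = math.acos(cos_value)
squared_similarity = cos_value ** 2
squared_sine = math.sin(angle) ** 2
print(squared_similarity, 1.0 - squared_similarity, squared_sine)

# 2) Two different ways to get a unit-length vector from probabilities.
probs = [0.5, 0.3, 0.2]                      # sums to 1
sqrt_vec = [math.sqrt(p) for p in probs]     # square-root normalization
linear_vec = [p / norm(probs) for p in probs]  # linear scaling, as cosine does
print(norm(sqrt_vec), norm(linear_vec))      # both close to 1
print(sqrt_vec, linear_vec)                  # but the components differ
```

<p>Both normalized vectors have length 1, yet they are different vectors, which is the distinction the last paragraph draws.</p>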
]]></content:encoded>
	</item>
	<item>
		<title>By: ahmed</title>
		<link>http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/#comment-72</link>
		<dc:creator>ahmed</dc:creator>
		<pubDate>Wed, 04 Aug 2010 14:01:02 +0000</pubDate>
		<guid isPermaLink="false">http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/#comment-72</guid>
		<description>I am interested in regression models and I have two groups of data (not equal in sample size). I wish to measure the similarity between the two groups of data. How can I do that? I need your advice. Please let me know if you have any ideas.

 
Regards,
Ahmed</description>
		<content:encoded><![CDATA[<p>I am interested in regression models and I have two groups of data (not equal in sample size). I wish to measure the similarity between the two groups of data. How can I do that? I need your advice. Please let me know if you have any ideas.</p>
<p>Regards,<br />
Ahmed</p>
]]></content:encoded>
	</item>
</channel>
</rss>


