This weekend I sat down to experiment with my project data to see if I could generate ‘related’ documents. At first, cosine similarity seemed very promising: the results looked awfully similar, and I was overjoyed to have completed such a cool feature in about an hour’s time. But then it struck me, the usual feeling that something is wrong when everything is working out too smoothly. I realized that cosine similarity alone was not sufficient for finding similar documents.

So, what is cosine similarity?

Those not acquainted with Term Vector Theory and Cosine similarity can read this article.

Why does cosine similarity fail to capture the whole picture?

Let us consider two documents A and B, each represented as a term vector. The cosine treats both vectors as unit vectors by normalizing them, giving you a measure of the angle between them. It does provide an accurate measure of similarity, but with no regard for magnitude. Yet magnitude is an important factor when considering similarity.

For example, the cosine between a document in which ‘machine’ occurs 3 times and ‘learning’ 4 times and another document in which ‘machine’ occurs 300 times and ‘learning’ 400 times will hint that the two documents point in almost the same direction. If magnitude (euclidean distance) were taken into account, the results would be quite different.
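
To make this concrete, here is a minimal sketch (plain Python, nothing project-specific) that computes both quantities for the example above:

import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

doc_a = (3, 4)      # 'machine' x3, 'learning' x4
doc_b = (300, 400)  # 'machine' x300, 'learning' x400

print(cosine_similarity(doc_a, doc_b))   # 1.0   -- identical direction
print(euclidean_distance(doc_a, doc_b))  # 495.0 -- very different magnitude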

How do I get an accurate measure of similarity?

We have two factors at our disposal: the cosine, which gives us a measure of how similar in direction two documents are, and the (euclidean) distance, which gives us the magnitude of the difference between the two documents. There are a number of ways you could combine the two to build a similarity measure.

Conclusion

The magnitude and the cosine each capture a different aspect of the similarity between two entities. It is up to us to use them individually or in unison (as above), depending upon our application’s needs.

[Update]

As pointed out by Dr. E. Garcia (in the comments), similarities can be expressed by cosines, dot products, Jaccard’s Coefficients and in many other ways.

  • http://www.miislita.com Dr. E. Garcia

    Correction.

    Somehow you have mistaken the definition of cosine similarity.

    Cosine-based similarity values already incorporate the magnitudes (lengths) of the vectors. Let q be a query and d a document; then, by definition,

    sim(q,d) = cosine theta = Q*D/(|Q|*|D|)

    where Q*D = dot product between q and d
    | Q | = magnitude of query vector
    | D | = magnitude of document vector

    If you prenormalize all vectors (docs and q) before doing the calculations then the dot product between unit vectors equals the cosine angle since vector lengths will be unit lengths.

    Regarding similarity: there are many ways to define similarity in IR. The most common is using cosines, but this is not the only way. Similarity can be defined using Jaccard’s Coefficients, dot products, etc.; even weight products are often taken as similarity estimates. A typical example of the latter is an LSI scheme wherein the entries of a term-doc matrix are weights defined as occurrences (term frequencies). When using other scoring schemes with LSI, we no longer talk about occurrences but about weights, so the similarity is taken from weight products.

    Regards

    Dr. E. Garcia

  • Anand Kishore

    Hi Dr. E. Garcia,

    >> “If you prenormalize all vectors (docs and q) before doing the
    >> calculations then the dot product between unit vectors equals the
    >> cosine angle since vector lengths will be unit lengths.”

    This is exactly what I tried to explain in my post. Let me explain this with an example:

    Let us consider two documents represented by the vectors A and B.
    |A| = 5 |B| = 4
    Thus, dot-product of A and B is as follows:
    A.B = |A| |B| cos(theta)
    = 5 * 4 * cos(theta)
    = 20 * cos(theta)

    Hence, in order to find the cosine we need to divide the above by |A| |B|. This reduces to:
    A.B / |A||B| = 20 * cos(theta) / |A||B|
    = 20 * cos(theta) / 20
    = cos(theta)

    Now, what I wanted to say was that cosine between two vectors is equivalent to the cosine between two such unit vectors. Consider two unit vectors Ua and Ub with the same theta as above. Then,
    Ua.Ub = |Ua| |Ub| cos(theta)
    = 1 * cos(theta)
    = cos(theta)

    Hence, in order to find the cosine we need to divide the above by |Ua| |Ub|. This reduces to:
    Ua.Ub / |Ua||Ub| = 1 * cos(theta) / |Ua||Ub|
    = 1 * cos(theta) / 1
    = cos(theta)

    This proves that cosine doesn’t account for the magnitude of the vectors. This can be accounted for by taking the euclidean distance between the vectors.

    Hopefully this is sufficient to support my claim that the cosine alone does not suffice as a good similarity measure.

  • http://www.miislita.com Dr. E. Garcia

    No. This is not the correct way of computing cosine similarities.

    Your equation

    Sim = cosine(Θ) / norm(M)

    is simply incorrect. It should be

    sim = cosine(Θ) = dot product/norm product. When we compute cosine angles by definition we divide a dot product by the length of the vectors involved.

    In addition, in the example above you need to know a pair of coordinates for each of the documents A and B. Let’s say these are A(x1, y1) and B(x2, y2).

    Then the dot product is

    A*B = x1*x2 + y1*y2

    the norm of each vector (their length in this case) is

    |A|= (x1*x1 + y1*y1)^1/2
    |B| = (x2*x2 + y2*y2)^1/2

    the length product (norm product) is

    | A |*| B |

    The cosine similarity, sim, IS the cosine angle:

    sim = cosine(theta) = A*B/(|A|*|B|)

    So, to compute the cosine angle we must divide by the lengths, by definition. This cosine angle IS the similarity score. Therefore, your equation

    Sim = cosine(Θ) / norm(M)

    is incorrect.

  • Anand Kishore

    What I have tried above is to suggest a function that incorporates both the cosine (numerator) and the euclidean distance (denominator). It is not an attempt to derive another formula for the cosine altogether. The cosine formula remains unchanged; but after determining the cosine and the euclidean distance individually, by their respective formulas, we can combine the two to get a value that accounts for both the similarity (represented by the cosine) and the magnitude of the difference (euclidean).
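
    For concreteness, a minimal sketch of the kind of combined score being described here, with the cosine in the numerator and a distance-based term in the denominator; the exact weighting is only an illustrative assumption (see Dr. Garcia’s objections below):

    import math

    def combined_score(a, b):
        # Illustrative only: the cosine, scaled down as the euclidean gap grows
        dot = sum(x * y for x, y in zip(a, b))
        cos = dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))
        dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
        return cos / (1.0 + dist)  # the 1.0 keeps identical documents from dividing by zero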

  • http://www.miislita.com Dr. E. Garcia

    The equation for computing cosine angles already incorporates the euclidean distance of the vectors in the denominator.

    In linear algebra this is actually the magnitude of the vectors, also called their norm. There is no need for normalizing the cosine angle since this is already normalized.

    In IR, similarity is often expressed as a cosine angle since this precisely accounts for the dot products and euclidean distances. There is a reason for this: the cosine angles of data points are equivalent to the product moment correlation coefficient found in statistics books.

    Similarity is not a metric; a distance is. One can convert similarities into distances and vice versa, though this is not that obvious because of the triangle inequality that must be met by a distance metric. Expressing similarity as cosines, only to next divide the cosine angle once again by the distance (magnitude) of the vectors, goes against these cardinal relationships.

    Similarities can be expressed by cosines, dot products, Jaccard’s Coefficients and in many other ways. But dividing a cosine similarity once again by a distance metric (euclidean in this case) results in a quantity with no significance for IR purposes whatsoever since then it will not be possible to make the conversion.

    If you want to incorporate proximity between any two points (the relative neighborhood), this is already done in IR via scalar clustering theory.

    Here you just need to express entries of a mxm matrix as Jaccard’s Coefficients and store these in a new matrix. Then from this matrix compute cosine similarities. The new cosine similarity scores then will incorporate the induced effect of neighboring points.
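
    A rough sketch of that two-step procedure, assuming a binary term-document index (the toy data and term names below are made up):

    import math

    def jaccard(docs_i, docs_j):
        # Jaccard's Coefficient between two terms, each given as a set of document ids
        union = len(docs_i | docs_j)
        return len(docs_i & docs_j) / union if union else 0.0

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    # term -> set of documents containing it
    terms = {"machine": {0, 1, 2}, "learning": {0, 1, 3}, "perl": {3}}
    names = list(terms)

    # Step 1: m x m matrix of Jaccard's Coefficients between terms
    J = [[jaccard(terms[a], terms[b]) for b in names] for a in names]

    # Step 2: cosine similarities between rows of that matrix,
    # which now carry the induced effect of neighboring terms
    print(cosine(J[0], J[1]))  # 'machine' vs 'learning'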

    Dr. E. Garcia

  • Anand Kishore

    I get your point. Thanks. I’ll modify my post accordingly.

  • sandhya

    Hello,

    Got a clear understanding of the calculation of document similarity. Looks like a lot of people have implemented this cosine stuff. Even I’m going to implement the same for finding document similarity. But I wonder how it should be implemented in an efficient way. If I use a term-document matrix with just presence/absence as the matrix values, the matrix size will be Dn*Tm. I feel processing such a huge matrix would be greatly inefficient. Any pointers in this direction?

    Thanks
    Sandhya

  • Anand Kishore

    Since you intend to store presence/absence values, it would be more memory efficient to store values only for the terms that are actually present. If
    |Tm| = the maximum number of words in any document,
    then, even if a document contains only, say, 20 words, you would still be storing absence values for |Tm|-20 words.

    I computed similarity for documents indexed in Lucene [http://lucene.apache.org/]. Lucene stores the term vector for every document in the index itself.

    If you do need to store a Dn*Tm matrix (with absence/presence values), assuming that a presence is represented by 1 and an absence by 0, you could take the vectors of the two documents under consideration and ‘AND’ them. That would simplify the dot product calculation.
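
    A minimal sketch of the sparse alternative (plain Python sets; the term lists are made up), where a set intersection plays the role of the AND / dot product:

    import math

    def binary_cosine(doc_a_terms, doc_b_terms):
        # Cosine similarity over presence/absence vectors stored sparsely as sets:
        # dot product = size of the intersection, norm = sqrt(number of terms)
        a, b = set(doc_a_terms), set(doc_b_terms)
        if not a or not b:
            return 0.0
        return len(a & b) / math.sqrt(len(a) * len(b))

    print(binary_cosine(["machine", "learning", "rocks"],
                        ["machine", "learning", "rules"]))  # ~0.67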

  • big_lemma

    Dr. E. Garcia

    You just need to test your math using real numbers to know that you are indeed incorrect.

    A = {1, 1, 1}
    B = {2, 2, 2}

    Using your definition of similarity [which by the way is the right definition], you get:
    Dot Product:
    A*B = x1*x2 + y1*y2 + z1*z2 = 1*2+1*2+1*2 = 6

    the norm of each vector (their length in this case) is

    |A| = (x1*x1 + y1*y1 + z1*z1)^1/2 = (1+1+1)^1/2 = 1.732050807569
    |B| = (x2*x2 + y2*y2 + z2*z2)^1/2 = (4+4+4)^1/2 = 3.464101615138

    |A|*|B| = 5.999999

    sim = cosine(theta) = A*B/(|A|*|B|) = 6/5.99999 which is basically 1!!!

    Therefore, I cannot but conclude that you have missed the point of the original post.

  • big_lemma

    site owner,

    I question your wisdom in worshiping Google.

    Google is the best of what we have today. It is so far from great that it isn’t even funny.

    It is like worshiping Yahoo in 1995, or Lycos in 1998.

  • http://www.miislita.com egarcia

    The example you give will, of course, give a value of 1 on the cosine similarity scale. That is not surprising.

    Can you tell me what I am missing? I stand by what I have stated. The original post was about defining similarity.

    As mentioned, there are many ways of defining similarity. One way is using just dot products, another using cosine angles. We can even express it as Jaccard’s Coefficient.

    This is what we teach at graduate school. I recommend you to get a good book on data mining.

    The original poster presented an incorrect definition for similarity and later kindly recognized this.

  • nipun

    I am new to this field of mining and was confused about one thing after reading all the posts. If the cosine measure is a distance measure and the euclidean measure is a magnitude measure, then can any mathematical relationship exist between the two?

  • Robbie Haertel

    If vectors X and Y are L2 normalized, it is possible to prove that using the cosine distance (1 – cosine similarity) is interchangeable with the euclidean distance.
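
    A quick numerical check of that relationship (the example vectors are arbitrary): for L2-normalized x and y, the squared euclidean distance equals 2 * (1 - cosine similarity).

    import math

    def l2_normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    x = l2_normalize([3, 4])
    y = l2_normalize([5, 12])

    cos = sum(a * b for a, b in zip(x, y))
    sq_euclid = sum((a - b) ** 2 for a, b in zip(x, y))

    # For unit vectors: ||x - y||^2 = 2 - 2*cos(theta) = 2 * (1 - cosine similarity)
    print(sq_euclid, 2 * (1 - cos))  # both ~0.0615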

  • nthio

    Hi Anand and Dr. E.Garcia,
    May I quote the original post:
    “For example, the cosine between a document which has ‘machine’ occurring 3 times and ‘learning’ 4 times and another document which has ‘machine’ occurring 300 times and ‘learning’ 400 times will hint that the two documents are pointing in almost the same direction. If magnitude (euclidean distance) was taken into account, the results would be quite different.”

    I’m kind of getting what you (Anand) intended to say in the early post, which is that cosine similarity in effect normalizes away the magnitude of the vectors in its calculation. For example, take the coordinates A(3,4), B(5,12) and B’(500,1200); the cosine similarity between A and B and between A and B’ is the same; however, the euclidean distances are of course different.
    In the context of query and document searching, a larger document tends to scale up its term vector (in the ideal case, though not always in practice, it scales up every vector dimension equally, so although the magnitude, i.e. the size of the document, increases, the vector direction remains the same). In this case, cosine similarity may have a desirable normalizing effect compared to using the euclidean distance, which will tend to favor documents with a scale (as well as a direction) similar to the query’s. In such an application, we may expect the magnitude of the query to be much smaller than that of the documents searched; thus using the euclidean distance will tend to favor smaller documents.
    Perhaps if the application is different, such as comparing the strict similarity between two documents, euclidean distance may have an advantage if we intend to take the magnitudes of both documents into account.

  • Jack

    I’m not an expert but I’m not sure what point Dr. E. Garcia is trying to make here…cosine similarity does use magnitudes in the calculation but the calculation in fact normalizes the magnitudes, and the output of that calculation is an angle, and nothing else.

  • Anand Kishore

    Jack: Well she was making the point that the two cannot be combined together to give you a number that takes both similarity and magnitude into account.

  • http://www.miislita.com e.garcia

    To all above:

    You cannot blindly exchange distance and similarity without knowing what the model is and what the subject of that model is.
    http://www.miislita.com/searchito/binary-similarity-calculator.html

  • James

    The first document has only 3/300, or 1 percent, the number of mentions of the word “machine” and likewise only 4/400, or 1 percent, the number of mentions of the word “learning” compared to the second document. Such small frequencies are within the margin of error of most statistical measurements, and therefore I recommend any supposed “similarity” be expressed as a probability. To see my observation in a different light, imagine a triangle with only 3 points or vertices expressed. Now imagine a triangle with 300 points expressed filling out the 3 sides. Now introduce the same random point into both sets. In the first set, the “triangle” will most likely become a square or rhombus (quite a different classification), while the random point in the second set doesn’t greatly disturb our picture; it remains a triangle with some “noise”. Consequently, it would make sense to limit so-called “comparisons” or “similarity” to sets with the same order of magnitude of points, to preserve statistical accuracy.

  • Hooman

    Dear All
    As you know, the cosine similarity measure loses word order, even though it is widely exploited for measuring the similarity of texts. What other methods exist that integrate some syntactic information, like word order, into measuring similarity?
    Thanks very much

  • egarcia

    Anand: I’m a he, not a she.

    Hooman: With a positional inverted index, word order and positional information can be stored in vectors and all kinds of similarity scores computed.

    BTW, at http://irthoughts.wordpress.com/2008/10/29/similarity-pearson-and-spearman-coefficients/ we show the connection between Pearson and Spearman’s Correlation Coefficients with cosine angles and dot products. None of these are distances. The links listed in that post might help to clear up the difference between similarity and distance.

    Another thing. A few IR works have, unfortunately, used the expression ‘Similarity Distance’. Avoid it. This expression is an oxymoron, as distance is dissimilarity.

    I hope this helps.

  • http://none Richard

    Hi All,
    Thanks for providing this great tutorial. I am new and have just started learning about data mining. I am sorry if my question sounds childish. I know how to calculate tf/idf, and thanks to egarcia, I now also know how to calculate cosine similarity. But I have a question.
    Let’s say I have two documents, each with 2 words, and 1 word is shared between them. So the tf/idf vectors would be something like doc A (0,0.5) and doc B (0,0.5). Should I calculate cosine similarity on the tf/idf values? If I do, it returns 1, while I can see the two documents are only 50% the same, not 100%.
    Am I missing something? Should I calculate cosine similarity only on IDF values or TF values?
    Thanks

  • ahmed

    I am interested in regression models and I have two groups of data (not equal in sample size). I wish to measure the similarity between the two groups of data. How can I do that? I need your advice; please let me know if you have any ideas.

    Regards,
    Ahmed

  • http://www.comp.leeds.ac.uk/washtell Justin Washtell

    Hi Anand, et al,

    Interesting dialogue. There are some other very interesting points worth bearing in mind when working with these measures.

    Cosine Similarity and Euclidean Distance capture a lot of the same information. However whereas Euclidean Distance measures an actual distance between the two points of interest, Cosine can be thought of as measuring their apparent distance as viewed from the origin. Think of stars in the sky if you like analogies. The stars in Taurus are all relatively close together from our point of view. But in reality some of them are probably many times closer to us than others – the distance information has been effectively discarded (this is your normalization factor).

    If you’re sailing a ship and navigating by the stars, then this is an appropriate thing to do. If you’re navigating an interstellar spacecraft, it probably is not. And so it is with your application. Think of the meaning of the vectors with respect to your application and you will choose the most appropriate measure. But combining Cosine with Euclidean will probably not get you very far, as you are simply re-using a lot of information you already have.

    Another way to think of Cosine Similarity is as measuring the *relative* proportions of the various features or dimensions – when all the dimensions between two vectors are in proportion (correlated), you get maximum similarity. Euclidean distance and its relatives (like Manhattan distance) are more concerned with absolutes. In practice though, the measures will often give similar results, especially on very high dimensional data (by normalization, Cosine Similarity effectively reduces the dimensionality of your data by 1. The higher the dimensionality of your data therefore – all other things being equal – the less significant this difference becomes. All things *may not* be equal though and it may be that the distance from the origin is a dimension of special significance – so again, check the logic of your app).

    Another fact about Cosine Similarity that has been pointed out is that it is not really a measure of distance. This is an issue of semantics really. You can address this complaint to a large extent by squaring the value. Your value still ranges between 0 and 1, but it now has the property that the complement of the value (1 minus the value) is equal to the square of the *Sine* of the angle between the vectors – which is an equivalent measure of *dissimilarity*. You need to square the values for this relationship to hold. Doing this gives you something akin to a proportion of similarity or dissimilarity (when it is 0.5 you might – but probably shouldn’t – say that your points are neither particularly similar nor dissimilar). It is equivalent to taking the R2 value when analysing correlations (which is usually the done thing when trying to do anything cleverer than just rank the data).

    While we’re on the subject of interpreting values, if your vector components comprise probabilities which sum to 1 (e.g normalized frequency counts, which represent the probability of finding any particular word at a random location in the document), then you can normalize the vector by simply converting all of the probabilities to their square roots. This will naturally give you a vector whose *length* is 1. Note that this is not the same as scaling the vector linearly as is often done (and as Cosine Similarity does), although that also results in a vector of length 1! I’ll leave you to work out whether that is a good thing or not. :-)
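
    A minimal sketch of that square-root trick (the example text is made up): convert word frequencies to probabilities, take their square roots, and the resulting vector has length 1.

    import math
    from collections import Counter

    def sqrt_normalize(text):
        # word frequencies -> probabilities -> square roots
        counts = Counter(text.split())
        total = sum(counts.values())
        return {w: math.sqrt(c / total) for w, c in counts.items()}

    v = sqrt_normalize("machine learning machine learning learning")
    print(sum(x * x for x in v.values()))  # 1.0 -- squared length, so the length is 1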

  • diya

    Hi, it’s me, Diya. I have a question: is the sample correlation coefficient equal to the cosine of the vectors? If it is, then how? Do you have any idea about this, or a solution? I need your advice; please let me know.

  • http://yaroslavvb.blogspot.com/ Yaroslav Bulatov

    An interesting variation on cosine similarity is the “Fisher metric on multinomial manifold”. The idea is to treat documents as multinomial probability distributions, use KL divergence to define distance for pair of infinitesimally close distributions and take shortest path integral to define distance for arbitrary pair of distributions. Surprisingly, this has a simple closed form. It looks like cosine similarity, except you take square roots of relative frequencies, see formula 17.9 in http://yaroslavvb.com/upload/save/lebanon-axiomatic.pdf
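
    A rough sketch, assuming the closed form is 2 * arccos of the sum of square roots of the pairwise products of relative frequencies (check formula 17.9 in the linked paper for the exact statement):

    import math

    def fisher_similarity(p, q):
        # cosine-like score on square roots of relative frequencies
        return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))

    def fisher_distance(p, q):
        # geodesic distance on the multinomial manifold (assumed closed form)
        return 2.0 * math.acos(min(1.0, fisher_similarity(p, q)))

    p = [3 / 7, 4 / 7]          # relative frequencies of 'machine', 'learning' in doc A
    q = [300 / 700, 400 / 700]  # same relative frequencies in doc B
    print(fisher_distance(p, q))  # ~0.0 -- identical distributions, distance zero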

  • http://twitter.com/ant0ine Antoine Imbert

    If the norm of the vector representing the first document is A LOT smaller than the norm of the vector representing the second document, then your documents have VERY different sizes. All the magic of Cosine Similarity is that it abstracts away the size of the documents. You are interested in the similarity of the topics of the contents, not in the similarity of the sizes of the contents.

  • Rashmileos

    If I have to find the cosine similarity between a query and a document, should I consider all the words in the document, or just the words which appear in the query?
    Thanks.

  • Hamsalekha Sr

    Can I get Perl code for finding the cosine similarity of two documents on a Windows machine?