When dealing with a large amount of text, it is essential to have tools that can help computers recognize and evaluate the similarity between documents. One of the most effective methods in this field is cosine similarity.

Cosine similarity is a technique that measures the proximity between two documents by representing their words as vectors in a vector space. This representation expresses the meaning of human language in a numeric format that machines can easily process and compare.

The idea behind this approach is that documents can be represented as vectors, where each dimension corresponds to a unique feature of the text, such as the frequency of a word or its contextual relevance. Calculating the cosine similarity between two vectors then becomes a way to measure how similar the documents are in content and context, disregarding their length and focusing the analysis on the direction of the vectors representing them.

Although more complex methods exist for analyzing text similarity, such as neural networks or advanced clustering algorithms, cosine similarity offers an ideal balance between simplicity and effectiveness for analyzing moderate-sized documents. It is particularly valuable in applications such as recommendation systems, automatic text classification, and semantic search, where quickly understanding the relationship between different documents is crucial.

Below, with a simple example, we will see how it is possible to determine the similarity between various documents, starting from the definition of cosine similarity.


Formula

The cosine similarity between two vectors is calculated using the following formula:

\[ \text{cosine similarity} \ (V_x, V_y) = \frac{\sum_{i=1}^{n} V_{x_i} \cdot V_{y_i}}{\sqrt{\sum_{i=1}^{n} (V_{x_i})^2} \times \sqrt{\sum_{i=1}^{n} (V_{y_i})^2}} \]

or in compact form:

\[\text{cosine similarity} \ (V_x, V_y) = \frac{V_x \cdot V_y}{||V_x|| \ ||V_y||}\]

where:

  • \( V_x \cdot V_y \) is the dot product of the vectors \( V_x \) and \( V_y \).
  • \( ||V_x|| \) and \( ||V_y|| \) are the norms (lengths) of the vectors \( V_x \) and \( V_y \).

  • For vectors with non-negative components, such as the term-frequency vectors used here, the value of cosine similarity ranges from 0 to 1.
  • A value close to 1 means that the angle between the two vectors is minimal; therefore, the vectors are very similar. Conversely, a value close to 0 means that the angle between the two vectors approaches \( \frac{\pi}{2} \), and therefore, the two vectors have a low degree of similarity.
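To make the two forms of the formula concrete, here is a minimal Python sketch (the example vectors are arbitrary illustrative values, not taken from the sentences analyzed below) showing that the expanded sum and the compact vector form produce the same result:

import numpy as np
# Two small illustrative vectors (hypothetical values)
Vx = np.array([1.0, 2.0, 0.0])
Vy = np.array([2.0, 1.0, 1.0])
# Expanded form: explicit sums over the vector components
numerator = np.sum(Vx * Vy)
denominator = np.sqrt(np.sum(Vx**2)) * np.sqrt(np.sum(Vy**2))
print(numerator / denominator)
# Compact form: dot product divided by the product of the norms
print(np.dot(Vx, Vy) / (np.linalg.norm(Vx) * np.linalg.norm(Vy)))

Both print statements output the same value, roughly 0.7303, confirming that the two forms are equivalent.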

Example

Let’s consider a scenario in which we aim to evaluate the similarity between various documents. For clarity, we will explore a straightforward example involving three brief sentences from which we seek to determine their respective degrees of similarity:

  • \(x\) = I am fond of reading thriller novels.
  • \(y\) = I prefer reading thriller novels.
  • \(z\) = Yesterday, I arrived late.

Upon a closer look, it’s clear that sentences \(x\) and \(y\) have similarities, while sentence \(z\) is unrelated.

The initial step in our analysis involves transforming the sentences into vectors, extracting all words, and computing their frequencies within the sentences. Subsequently, we will refine the data by eliminating words that contribute little to no meaningful information, such as the preposition of, the pronoun I, and the verb to be. This process is crucial, particularly in large corpora, to ensure that the dataset is qualitatively significant and focuses on the most impactful elements for our analysis. Here is the result:

           arrived  fond  late  novels  prefer  reading  thriller  yesterday
\(V_x\)       0      1     0      1       0        1        1          0
\(V_y\)       0      0     0      1       1        1        1          0
\(V_z\)       1      0     1      0       0        0        0          1

The vector representation of the three sentences is as follows:

  • \(V_x = [0, 1, 0, 1, 0, 1, 1, 0]\)
  • \(V_y = [0, 0, 0, 1, 1, 1, 1, 0]\)
  • \(V_z = [1, 0, 1, 0, 0, 0, 0, 1]\)
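As a sketch of how this vectorization could be automated, the following code uses scikit-learn's CountVectorizer with an explicit stop-word list that mirrors the manual removal described above; running it reproduces the vocabulary and the three term-frequency vectors of the table:

from sklearn.feature_extraction.text import CountVectorizer
sentences = [
    "I am fond of reading thriller novels.",  # x
    "I prefer reading thriller novels.",      # y
    "Yesterday, I arrived late."              # z
]
# Discard the same low-information words removed by hand above
vectorizer = CountVectorizer(stop_words=["i", "am", "of"])
# Each row of the resulting matrix is the term-frequency vector of one sentence
matrix = vectorizer.fit_transform(sentences)
print(vectorizer.get_feature_names_out())
print(matrix.toarray())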

Let us now compute the cosine similarity between vector \(V_x\) and vector \(V_y\), which appear significantly alike upon initial inspection.

Let’s use the previously seen cosine similarity formula and obtain:

\[\text{cosine similarity} \ (V_x, V_y) = \frac{V_x \cdot V_y}{||V_x|| \ ||V_y||}\]

The dot product \(V_x \cdot V_y\) between the vectors \(V_x\) and \(V_y\) is given by: \[ V_x \cdot V_y = (0 \times 0) + (1 \times 0) + (0 \times 0) + (1 \times 1)\] \[ + (0 \times 1) + (1 \times 1) + (1 \times 1) + (0 \times 0) = 3 \]


Let’s calculate the denominator of the formula, \( ||V_x|| \ ||V_y|| \), given by the product of the lengths of the two vectors. We obtain:
\[||V_x|| = \sqrt{0^2 + 1^2 + 0^2 + 1^2 + 0^2 + 1^2 + 1^2 + 0^2} = \sqrt{4} = 2\] \[||V_y|| = \sqrt{0^2 + 0^2 + 0^2 + 1^2 + 1^2 + 1^2 + 1^2 + 0^2} = \sqrt{4} = 2\]

Therefore, we obtain a cosine similarity value of: \[\text{cosine similarity} \ (V_x, V_y) = \frac{3}{2 \times 2} = \frac{3}{4} = 0.75\]

As cosine similarity ranges between 0 and 1, where 1 indicates maximum similarity, a value of 0.75 suggests significant similarity between the two vectors, indicating that they have similar content.

To find the angle \( \theta \) between the two vectors \(V_x\) and \(V_y\), we apply the arccosine function to the cosine similarity value:

\[ \theta = \arccos(0.75) \approx 41.4^\circ \]

In general, as the angle magnitude approaches zero, the cosine similarity value increases, indicating greater similarity between the vectors.
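This conversion can be checked in a couple of lines of Python using NumPy's arccos, converting the result from radians to degrees:

import numpy as np
# Angle corresponding to a cosine similarity of 0.75, in degrees
theta = np.degrees(np.arccos(0.75))
print(f"theta = {theta:.1f} degrees")  # about 41.4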


Below is an example of Python code for calculating the cosine similarity of vectors \(V_x\) and \(V_y\) that you can test on an online IDE.

import numpy as np
# Define the vectors
Vx = np.array([0, 1, 0, 1, 0, 1, 1, 0])
Vy = np.array([0, 0, 0, 1, 1, 1, 1, 0])
# Function to calculate cosine similarity
def cosine_similarity(vector1, vector2):
    # Calculate the dot product
    dot_product = np.dot(vector1, vector2)
    # Calculate the norms of each vector
    norm1 = np.linalg.norm(vector1)
    norm2 = np.linalg.norm(vector2)
    # Calculate the cosine similarity
    cosine_sim = dot_product / (norm1 * norm2)
    return cosine_sim
# Calculate and print the cosine similarity
similarity = cosine_similarity(Vx, Vy)
print(f"Cosine similarity (Vx,Vy): {similarity}")

Since some words were removed from the original example sentences to make the evaluation more precise, the code above works directly with the sentences already reduced to their vectors and returns the cosine similarity value of 0.75. For a complete example that starts from the raw sentences, you can consider the following code. In this case, the cosine similarity between \(V_x\) and \(V_y\) will be approximately 0.48, because TF-IDF weights each word by its rarity across the corpus rather than by raw frequency.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
sentences = [
    "I am fond of reading thriller novels.",  # x
    "I prefer reading thriller novels.",      # y
    "Yesterday, I arrived late."              # z
]
# Initialize a TF-IDF Vectorizer
vectorizer = TfidfVectorizer()
# Fit and transform the sentences to a TF-IDF matrix
tfidf_matrix = vectorizer.fit_transform(sentences)
# Compute the cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
print("Cosine Similarity Matrix:\n", cosine_sim)

Let us proceed to calculate the cosine similarity between \(V_x\) and \(V_z\), which are known to differ in their content type. In this case the dot product \(V_x \cdot V_z\) between the vectors \(V_x\) and \(V_z\) is given by: \[ V_x \cdot V_z = (0 \times 1) + (1 \times 0) + (0 \times 1) + (1 \times 0)\] \[ + (0 \times 0) + (1 \times 0) + (1 \times 0) + (0 \times 1) = 0 \]

As the numerator assumes a value of zero, it follows that the cosine similarity measure equals zero. This result implies that the two vectors under consideration are orthogonal, indicating that they share no similarity.
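The same result can be confirmed in code; here is a short self-contained sketch along the lines of the function defined earlier:

import numpy as np
# Term-frequency vectors for sentences x and z from the table above
Vx = np.array([0, 1, 0, 1, 0, 1, 1, 0])
Vz = np.array([1, 0, 1, 0, 0, 0, 0, 1])
# The dot product is 0 because the sentences share no words,
# so the cosine similarity is 0 as well
similarity = np.dot(Vx, Vz) / (np.linalg.norm(Vx) * np.linalg.norm(Vz))
print(f"Cosine similarity (Vx,Vz): {similarity}")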


Conclusions

Cosine similarity is a reliable indicator for deriving similarities between two or more documents of limited length. However, it is advisable to consider more sophisticated machine learning and deep learning approaches for longer documents. As document length increases, so does the complexity of the text, rendering cosine similarity less effective at capturing nuanced semantic relationships. Consequently, more advanced methods are better suited to extracting the relevant information from extensive texts and deducing the similarities between them.