
C.18 Cosine Similarity

PriyaDCosta edited this page Mar 7, 2023 · 3 revisions

1. Feature Name

C.18 Cosine Similarity

Brief Layman Description (ChatGPT) - Cosine similarity is a measure of similarity between two vectors in a multi-dimensional space. It is commonly used in natural language processing and information retrieval to compare two text documents represented as vectors of word frequencies or other text features.

Cosine similarity measures the cosine of the angle between two vectors. It is calculated by taking the dot product of the two vectors and dividing it by the product of their magnitudes. The resulting value is a number between -1 and 1, where 1 indicates that the two vectors point in the same direction, 0 indicates that they are orthogonal (dissimilar), and -1 indicates that they point in opposite directions. For non-negative vectors such as word counts, the value always falls between 0 and 1.
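The formula above can be sketched directly with NumPy (the vectors here are arbitrary examples chosen to show the three boundary cases):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between vectors a and b:
    dot product divided by the product of the magnitudes."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_sim(np.array([1.0, 0.0]), np.array([2.0, 0.0])))   # same direction -> 1.0
print(cosine_sim(np.array([1.0, 0.0]), np.array([0.0, 1.0])))   # orthogonal -> 0.0
print(cosine_sim(np.array([1.0, 0.0]), np.array([-1.0, 0.0])))  # opposite -> -1.0
```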

In the context of comparing text documents, cosine similarity is used to calculate the similarity between two vectors of word frequencies or other text features, where each dimension of the vector represents the frequency or importance of a particular word or feature in the text. The cosine similarity between two documents is high if they contain similar words or features, and low if they contain different words or features. Cosine similarity is a popular metric for document similarity in information retrieval, document clustering, and recommendation systems.
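As a small illustration of comparing documents as count vectors (the two example sentences are made up; the scikit-learn calls are the same CountVectorizer and cosine_similarity used later in this page):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat", "the cat sat on the rug"]

# Turn each document into a vector of word counts,
# then compute the pairwise cosine similarity matrix.
vectors = CountVectorizer().fit_transform(docs)
sim = cosine_similarity(vectors)

print(round(sim[0, 1], 3))  # 0.875 -- high overlap, only one word differs
```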

2. Literature Source (Serial Number, link)

C.18

3. Description of how the feature is computed (In Layman’s terms)

  1. The get_ngrams_df function is called to create the n-grams (for a user-specified n) and store them in a new column in df called 'ngrams'. Note that each entry in this column is a list object.

  2. The 'ngrams' column is used to create a new column called 'text', which contains the concatenated strings of n-grams for each document. This is done so that they can be used as input into CountVectorizer

  3. A CountVectorizer object is created with the specified n-gram range and fitted to the 'text' column.

  4. The 'text' column is transformed into vectors using the CountVectorizer object.

  5. The cosine_similarity function is called to calculate the cosine similarity matrix between all pairs of documents based on their n-gram vectors. The diagonal of the matrix contains all 1s, as each entry compares a vector to itself. The matrix is symmetric: the upper and lower triangles (excluding the diagonal) mirror each other.

  6. Hence, only the lower triangle of the cosine similarity matrix is extracted (excluding the diagonal), using np.tril_indices to get the indices and matrix[lower_indices] to select the values.

  7. The average of the lower triangle of the cosine similarity matrix is calculated using np.mean.

  8. The function returns the average cosine similarity between all pairs of documents in df based on their n-grams.
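The eight steps above can be sketched roughly as follows. This is a reconstruction, not the actual implementation: the column names, the n-gram helper standing in for get_ngrams_df, and the function signature are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def average_ngram_cosine_similarity(df: pd.DataFrame,
                                    on_column: str = "text_raw",
                                    n: int = 2) -> float:
    # Steps 1-2: build n-grams per document and join them back into
    # strings so CountVectorizer can consume them. The helper below is a
    # hypothetical stand-in for the wiki's get_ngrams_df function.
    def to_ngrams(text: str) -> list:
        tokens = text.split()
        return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    df = df.copy()
    df["ngrams"] = df[on_column].apply(to_ngrams)      # lists of n-grams
    df["text"] = df["ngrams"].apply(" ".join)          # concatenated strings

    # Steps 3-4: fit a CountVectorizer and transform the 'text' column
    # into count vectors.
    vectors = CountVectorizer().fit_transform(df["text"])

    # Step 5: pairwise cosine similarity matrix (diagonal is all 1s).
    matrix = cosine_similarity(vectors)

    # Steps 6-8: extract the lower triangle below the diagonal and
    # return its mean.
    lower_indices = np.tril_indices(matrix.shape[0], k=-1)
    return float(np.mean(matrix[lower_indices]))
```

For two identical documents the function returns 1.0; for documents sharing no n-grams it returns 0.0.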

4. Algorithms used (KNN, Logistic Regression etc.)

None

5. ML Inputs/Features

None

6. Statistical concepts used

None

7. Pages of the literature to be referred to for details

PDF Pages 2 (Last Paragraph),9,19

8. Any tweaks/changes/adaptions made from the original source

None

9. Testing

Within the cosine_similarity.py file
