Can LSA be used for document similarity?

I have to find the similarity between two documents. The two documents are simple text documents, and I have to report a score. I was using cosine similarity initially, but I was told that LSA is a better means. However, when I read a few tutorials, I noticed that they always used more than two documents. So is it effective to use LSA when I need to find the similarity between two documents alone?

More precisely, I have to find the similarity between two documents from a large corpus of files. My problem statement is as follows: I have a reference document. I need to compare this reference document with the documents in my local repository and find the most relevant document. Is it advisable to use LSA, and what should the term-document matrix contain (only the reference document and the document being compared, or the entire set of files)?

asked Jan 23, 2012 at 3:49

$\begingroup$ Similarity between two documents is meaningless. It's only interesting to ask if two documents are more similar to each other than to other documents. If the size of your corpus is greater than 2, then LSA may well produce more useful similarity measures between any pair of documents (or it may not). $\endgroup$

Commented Jan 23, 2012 at 4:31

$\begingroup$ Are there 2 documents in total? Or are you trying to compute the similarity between any 2 documents in a large corpus? $\endgroup$

Commented Jan 23, 2012 at 22:45

$\begingroup$ @Nick, how do we do the latter? If I have, let's say, 1 million documents against 100 million? $\endgroup$

Commented Jul 27, 2017 at 3:20

1 Answer 1

$\begingroup$

In general, LSA is meaningful for computing document similarity. However, you need a large collection of documents (more than 100000), because LSA is based on finding associations between words (e.g., it will find that dog and cat are similar words, and therefore that a document about dogs is similar to a document about cats). If your collection is small, no meaningful associations between words can be derived. LSA is just a change of representation; to compute similarity you will still use the cosine, only on the LSA representation instead. Originally each document is a sparse vector of dimension e.g. 100000, but after LSA it is a dense vector of dimension e.g. 200.
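To make this concrete, here is a minimal sketch (assuming scikit-learn is available; the toy documents and the choice of 2 LSA components are illustrative, not part of the original answer). It builds the term-document matrix from the whole repository plus the reference document, projects everything into a low-dimensional dense space with truncated SVD, and then scores each repository document against the reference with cosine similarity:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Toy repository; in practice this should be a large collection.
repo_docs = [
    "dogs are loyal pets that love to play fetch",
    "cats are independent pets that sleep a lot",
    "stock markets fell sharply after the announcement",
]
reference = "my dog loves playing fetch in the park"

# Sparse tf-idf representation of the ENTIRE collection,
# reference included, so the SVD sees all the documents.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(repo_docs + [reference])

# LSA step: reduce to a dense low-dimensional space.
# (2 components here only because the corpus is tiny; ~200 is
# more typical for a real collection.)
svd = TruncatedSVD(n_components=2, random_state=0)
Z = svd.fit_transform(X)

# Cosine similarity between the reference (last row)
# and every repository document.
sims = cosine_similarity(Z[-1:], Z[:-1])[0]
best = int(np.argmax(sims))
print(best, sims)
```

Note that the SVD is fitted on the whole collection, which answers the question above: the term-document matrix must contain the entire set of files, not just the two documents being compared.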

As you said, you can already do cosine similarity on the sparse data (just transformed word counts). Hopefully you have already applied stop-word removal, stemming, and tf-idf normalization. It's useful to know what these transformations achieve, because LSA is just another transformation on top of those standard transformations. I'll briefly go over the usefulness of those transformations before I describe what LSA does.
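For reference, the sparse baseline mentioned above can be sketched as follows (assuming scikit-learn; the two example sentences are made up). Stop words are dropped and tf-idf weighting is applied, then the cosine is computed directly on the sparse vectors, with no LSA step:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

doc_a = "the cat sat on the mat"
doc_b = "a cat was sitting on a mat"

# Stop-word removal and tf-idf weighting in one step.
# (Stemming would additionally merge "sat"/"sitting"; it is
# omitted here to keep the sketch dependency-free.)
vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform([doc_a, doc_b])  # sparse 2 x |vocab| matrix

score = cosine_similarity(X[0], X[1])[0, 0]
print(score)
```

Because the vectors are sparse, two documents only get a nonzero score when they share surface word forms; this is exactly the limitation that the word associations found by LSA are meant to soften.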