Can LSA be used for document similarity?

I have to find the similarity between two documents. The two documents are simple text documents, and I have to report a score. I was using cosine similarity initially, but I was told that LSA is a better means. However, when I read a few tutorials, I noticed that they always used more than two documents. So is it effective to use LSA when I need to find the similarity between two documents alone?

More precisely, I have to find the similarity between two documents from a large corpus of files. My problem statement is as follows: I have a reference document. I need to compare this reference document with the documents in my local repository and find the most relevant document. Is it advisable to use LSA, and what should the term-document matrix contain (only the reference document and the document being compared, or the entire set of files)?

asked Jan 23, 2012 at 3:49

$\begingroup$ Similarity between two documents is meaningless. It's only interesting to ask if two documents are more similar to each other than to other documents. If the size of your corpus is greater than 2, then LSA may well produce more useful similarity measures between any pair of documents (or it may not). $\endgroup$

Commented Jan 23, 2012 at 4:31

$\begingroup$ Are there 2 documents in total? Or are you trying to compute the similarity between any 2 documents in a large corpus? $\endgroup$

Commented Jan 23, 2012 at 22:45

$\begingroup$ @Nick, how do we do the latter? If I have, let's say, 1 million documents against 100 million? $\endgroup$

Commented Jul 27, 2017 at 3:20

1 Answer 1

$\begingroup$

In general, LSA is meaningful for computing document similarity. However, you need a large collection of documents (more than 100000), because LSA is based on finding associations between words (e.g., it will find that dog and cat are similar words, and therefore that a document about dogs is similar to a document about cats). If your collection is small, no meaningful associations between words can be derived. LSA is just a change of representation; to compute similarity you will still use the cosine, only on the LSA representation instead. Originally each document is a sparse vector of dimension e.g. 100000, but after LSA it is a dense vector of dimension e.g. 200.
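To make this concrete, here is a minimal sketch (assuming scikit-learn is available; the toy documents and the choice of 2 LSA components are illustrative, not part of the original answer). It builds the term-document matrix from the whole repository plus the reference document, projects everything into a low-dimensional dense space with truncated SVD, and then scores each repository document against the reference with cosine similarity:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Toy repository; in practice this should be a large collection.
repo_docs = [
    "dogs are loyal pets that love to play fetch",
    "cats are independent pets that sleep a lot",
    "stock markets fell sharply after the announcement",
]
reference = "my dog loves playing fetch in the park"

# Sparse tf-idf representation of the ENTIRE collection,
# reference included, so the SVD sees all the documents.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(repo_docs + [reference])

# LSA step: reduce to a dense low-dimensional space.
# (2 components here only because the corpus is tiny; ~200 is
# more typical for a real collection.)
svd = TruncatedSVD(n_components=2, random_state=0)
Z = svd.fit_transform(X)

# Cosine similarity between the reference (last row)
# and every repository document.
sims = cosine_similarity(Z[-1:], Z[:-1])[0]
best = int(np.argmax(sims))
print(best, sims)
```

Note that the SVD is fitted on the whole collection, which answers the question above: the term-document matrix must contain the entire set of files, not just the two documents being compared.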

As you said, you can already do cosine similarity on the sparse data (just transformed word counts). Hopefully you have already applied stop-word removal, stemming, and tf-idf normalization. It's useful to know what these transformations achieve, because LSA is just another transformation on top of those standard transformations. I'll briefly go over the usefulness of those transformations before I describe what LSA does.
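For reference, the sparse baseline mentioned above can be sketched as follows (assuming scikit-learn; the two example sentences are made up). Stop words are dropped and tf-idf weighting is applied, then the cosine is computed directly on the sparse vectors, with no LSA step:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

doc_a = "the cat sat on the mat"
doc_b = "a cat was sitting on a mat"

# Stop-word removal and tf-idf weighting in one step.
# (Stemming would additionally merge "sat"/"sitting"; it is
# omitted here to keep the sketch dependency-free.)
vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform([doc_a, doc_b])  # sparse 2 x |vocab| matrix

score = cosine_similarity(X[0], X[1])[0, 0]
print(score)
```

Because the vectors are sparse, two documents only get a nonzero score when they share surface word forms; this is exactly the limitation that the word associations found by LSA are meant to soften.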