2

I have some huge datasets (between 10 and 20 of them) and I need to find the relationships among them. The datasets are so large that the computation might not fit on a single machine. Fields in these datasets are text, not numbers. Adding to the complexity, some of the fields may contain misspelled words, like 'huose' for 'house', for which I am using a fuzzy matching algorithm.

To solve this I am thinking of using cosine similarity, but I am not sure about its performance on such huge datasets. My question is: is this algorithm good enough for this kind of problem (performance- and accuracy-wise)? If not, is there some other algorithm that I should look into?

Edit: More Information

The datasets I will be using might be a mix of text files and database tables. Values in a column are generally 10-50 characters long; they are not huge documents. The relationship I am looking for is how similar one column of a dataset is to another, and I want to derive a score based on the similarity between columns. E.g.:

Col1     Col2     Col3
A        B        X
C        S        B
E        C        A
T        V        C
X        E

So in the above example one can say that Col1 and Col3 have a strong relationship with each other, while Col1 and Col2 have a weak relationship.
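To make the score I am after concrete, here is a rough sketch of the kind of overlap measure I have in mind (plain set overlap on the toy columns above; the real data is far larger and may contain misspellings):

col1 = {"A", "C", "E", "T", "X"}
col2 = {"B", "S", "C", "V", "E"}
col3 = {"X", "B", "A", "C"}

def overlap_score(a, b):
    # Fraction of shared values between two columns.
    return len(a & b) / len(a | b)

print(overlap_score(col1, col3))  # 0.5  -> strong relationship
print(overlap_score(col1, col2))  # 0.25 -> weak relationship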

justAbit
  • You mean you have 10-20 large text files? What kind of text is it? And what kind of relationship are you looking for between these datasets? An example might be helpful here... – Matthias Berth Apr 30 '15 at 09:16

2 Answers

6

No, using cosine similarity is not a good choice, because:

  1. It does not take into account the order of words (assuming a bag-of-words model).
  2. It requires computing the pairwise distance for every pair of objects, which is computationally infeasible for huge collections.

What you are probably looking for is something more like Near-Duplicate Detection in Information Retrieval. I already explained it once in a different thread (not an exact dupe, though), but here is how to do it:

One of the known solutions is to use Jaccard similarity to measure how close two documents are.

Jaccard similarity works like this: take the set of words from each document, and call these sets s1 and s2; the Jaccard similarity is then |s1 ∩ s2| / |s1 ∪ s2|.
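In Python, a minimal sketch of that formula might look like this (assuming simple whitespace tokenization):

def jaccard_similarity(doc1, doc2):
    # Split each document into a set of words, then compute
    # |s1 ∩ s2| / |s1 ∪ s2|.
    s1, s2 = set(doc1.split()), set(doc2.split())
    return len(s1 & s2) / len(s1 | s2)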

Usually when facing near duplicates, however, the order of words has some importance. To deal with this, when generating the sets s1 and s2 you actually generate sets of k-shingles instead of sets of single words.
For example:

Text 1:"I'm writing a crawler to"
Text 2:"I'm writing a some text crawler to get"

With k=2, the sets will be:

s1 = { I'm write, write a, a crawler, crawler to }
s2 = { I'm write, write a, a some, some text, text crawler, crawler to, to get }
s1 ∪ s2 = { I'm write, write a, a crawler, crawler to, a some, some text, text crawler, to get }
s1 ∩ s2 = { I'm write, write a, crawler to }

In the above, the Jaccard similarity will be 3/8. If you use single words with the same approach (k=1 shingles), you will get your desired 5/8, but this is a worse solution in my (and most IR experts') opinion.
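Here is a minimal sketch of the shingling step in Python (note: it skips the light stemming used in the sets above, so the shingles read "I'm writing" rather than "I'm write", but the score works out to the same 3/8):

def shingles(text, k=2):
    # Set of k-word shingles (consecutive k-grams of words).
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

s1 = shingles("I'm writing a crawler to", k=2)
s2 = shingles("I'm writing a some text crawler to get", k=2)
print(len(s1 & s2) / len(s1 | s2))  # 0.375 == 3/8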

This procedure scales nicely to deal very efficiently with huge collections, without checking all pairs or creating huge numbers of sets. More details can be found in these lecture notes (I gave this lecture ~2 years ago, based on the author's notes).
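For reference, the standard way this is scaled is with MinHash signatures plus locality-sensitive hashing: each shingle set is reduced to a short signature, the fraction of matching signature positions approximates the Jaccard similarity, and LSH banding then prunes the candidate pairs so you never compare everything against everything. A minimal MinHash sketch, using salted MD5 as a stand-in for a family of hash functions:

import hashlib

def minhash_signature(shingle_set, num_hashes=100):
    # For each "hash function" (a salted MD5 here), keep the minimum
    # hash value over all shingles in the set.
    signature = []
    for salt in range(num_hashes):
        signature.append(min(
            int(hashlib.md5(f"{salt}:{sh}".encode()).hexdigest(), 16)
            for sh in shingle_set
        ))
    return signature

def estimated_jaccard(sig1, sig2):
    # The fraction of positions on which two signatures agree
    # approximates the Jaccard similarity of the underlying sets.
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)

sig1 = minhash_signature(s1)  # s1, s2: the shingle sets from the sketch above
sig2 = minhash_signature(s2)
print(estimated_jaccard(sig1, sig2))  # close to 0.375, without comparing the full sets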

amit
  • Sorry Amit, I forgot to add earlier that values in the columns will mostly be single-word strings or combinations of 2-3 words, not documents, so word order may not play a major role. Your solution looks interesting, though. Please let me know if the solution you suggested is still good for word matching. – justAbit Apr 30 '15 at 10:19
0

This sounds like a problem that is often called Schema Matching.

Cosine distance does sound like it would be a very good approach for what you describe. Treat each column as a long document and compute the cosine distance between the columns.
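A minimal sketch of that idea (assuming scikit-learn is available; the toy columns from the question are joined into one "document" each, which is just one way to form the documents):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Each column becomes one "document" made of its values.
columns = {
    "Col1": ["A", "C", "E", "T", "X"],
    "Col2": ["B", "S", "C", "V", "E"],
    "Col3": ["X", "B", "A", "C"],
}
docs = [" ".join(values) for values in columns.values()]

# Bag-of-words TF-IDF per column, then pairwise cosine similarity.
# token_pattern is relaxed so single-character values are kept.
tfidf = TfidfVectorizer(token_pattern=r"\S+").fit_transform(docs)
print(cosine_similarity(tfidf))  # rows/columns in the order Col1, Col2, Col3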

fgregg