I have some huge datasets (between 10 and 20 of them) and I need to find relationships among them. The datasets are so large that the computation might not fit on a single machine. The fields in these datasets are text, not numbers. Adding to the complexity, some of the fields may contain misspelled words, like 'huose' for 'house', for which I am using a fuzzy-matching algorithm.
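To show what I mean by fuzzy matching, here is a simplified sketch of the idea using Python's standard-library `difflib` (the 0.8 threshold is just a value I picked for illustration, not something fixed in my pipeline):

```python
from difflib import SequenceMatcher

def is_fuzzy_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two strings as equal if their similarity ratio meets a threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

print(is_fuzzy_match("huose", "house"))  # True: the misspelling still matches
print(is_fuzzy_match("abc", "xyz"))     # False: nothing in common
```

So before comparing columns, two values that fuzzy-match like this would be treated as the same token.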
To solve this I am thinking about using cosine similarity, but I am not sure how it will perform on datasets this large. My question is: is this algorithm good enough for this kind of problem, both performance- and accuracy-wise? If not, is there some other algorithm I should look into?
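For clarity, this is roughly what I mean by cosine similarity between two columns: treat each column as a bag of values, build term-frequency vectors, and take the normalized dot product (a toy sketch in pure Python; the sample columns are made up):

```python
from collections import Counter
from math import sqrt

def cosine_similarity(col_a, col_b):
    """Cosine similarity between two columns, each treated as a bag of values."""
    va, vb = Counter(col_a), Counter(col_b)
    dot = sum(va[t] * vb[t] for t in set(va) & set(vb))
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

print(cosine_similarity(["A", "C", "E"], ["X", "B", "A", "C"]))
```

A score of 1.0 would mean the columns have identical value distributions; 0.0 means no values in common.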
Edit: More Information
The datasets I will be using might be a mix of text files and database tables. Values in a column are generally 10-50 characters long; they are not huge documents. The relationship I am looking for is how similar one column of a dataset is to another, and I want to derive a score based on that similarity. For example:
```
Col1 Col2 Col3
A    B    X
C    S    B
E    C    A
T    V    C
X    E
```
So in the example above, one can say that `Col1` and `Col3` have a strong relationship with each other, while `Col1` and `Col2` have a weak relationship.
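To make that concrete, here is one candidate scoring function I have considered as an alternative to cosine similarity: a Jaccard-style overlap score on the distinct values of each column, run on the example above (column contents taken from the table; this is just a sketch, not my actual implementation):

```python
def jaccard(col_a, col_b):
    """Overlap score between two columns: |intersection| / |union| of distinct values."""
    a, b = set(col_a), set(col_b)
    return len(a & b) / len(a | b) if a | b else 0.0

col1 = ["A", "C", "E", "T", "X"]
col2 = ["B", "S", "C", "V", "E"]
col3 = ["X", "B", "A", "C"]

print(jaccard(col1, col3))  # 0.5  -> strong relationship
print(jaccard(col1, col2))  # 0.25 -> weak relationship
```

The higher score for `Col1`/`Col3` matches the intuition above, but I don't know whether this kind of set-overlap scoring scales or is robust enough once fuzzy matching and multi-machine computation come into play.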