
I have a text file which contains some text like below.

file.txt

leptop
pencil
group
leptop
book
gruop
buk
grop
laftop
pensil
laptop
pancil
laptop bag
bok

From that file I need to find the related queries and store them into lists like below.

  [leptop,leptop,laftop,laptop]
  [pencil,pensil,pancil]
  [group,gruop,grop]
  [book,buk,bok]
  [laptop bag]  

I found something like below. It's working pretty well, but I want some modification to it.

from difflib import get_close_matches

lis = ["leptop","pencil","group","leptop","book","gruop","bowk","grop","laftop","pensil","laptop","pancil","laptop bag","bok"]
print get_close_matches("laptop",lis, n = len(lis))
print get_close_matches("pencil", lis, n = len(lis))
print get_close_matches("group", lis, n = len(lis))
print get_close_matches("book", lis, n = len(lis))

Output:

['laptop', 'leptop', 'leptop', 'laftop', 'laptop bag'] # I don't want "laptop bag" in the output here.
['pencil', 'pensil', 'pancil']
['group', 'grop', 'gruop']
['book', 'bok', 'bowk']
s_m
  • You can use the [Levenshtein distance](http://en.wikipedia.org/wiki/Levenshtein_distance) to determine how similar strings are. Google "Levenshtein distance Python" and you'll get a ton of hits on Stack Overflow. – Cory Kramer Jan 29 '15 at 12:45
  • @Cyber I checked for that, but I got data for spell check only. – s_m Jan 29 '15 at 12:47
  • Well you'll need to add some more effort to this question because, as it stands, it is too broad. There are many ways to go about this, but without you showing some attempt at implementing some code and asking a specific question about where you are stuck, this is a very broad question. – Cory Kramer Jan 29 '15 at 12:48
  • You could try a combination of Levenshtein distance with [union find](http://en.wikipedia.org/wiki/Disjoint-set_data_structure): Construct a graph from those words, with connections between words with "low" distance (you have to find out yourself what a good low value is), then do union find on that graph. – tobias_k Jan 29 '15 at 12:50
  • I have not started yet, so how can I be stuck? If you give some suggestions, that will be very helpful to me. – s_m Jan 29 '15 at 12:50
  • [This wonderful article](http://norvig.com/spell-correct.html) gives you some ideas on how to implement a spelling corrector. – georg Jan 29 '15 at 13:08
  • If the other questions didn't help you, exactly what are you trying to do? What is "the related query", and what will you do with the lists after you make them? – Karl Knechtel Jan 29 '15 at 13:25
  • @Cyber can you look at my code again and help me out now? – s_m Jan 30 '15 at 04:45
  • @KarlKnechtel can you look at my code and help me out? – s_m Jan 30 '15 at 04:46

2 Answers


I don't think that regular expressions are the right way.

You can, however, do this using a combination of Union Find and Minimum Edit Distance.

For each pair of words, determine the min_edit_dist, and if the distance is smaller than some threshold, union those words together. Picking the right value for that threshold may depend on the selection of words; for your words, 3 or 4 seem to work rather well.

import collections, itertools

# the words from the input file (not defined in the original snippet)
words = ["leptop", "pencil", "group", "leptop", "book", "gruop", "buk",
         "grop", "laftop", "pensil", "laptop", "pancil", "laptop bag", "bok"]

# initialize 'leaders' dictionary, used in union and find
leaders = {word: None for word in words}

# union similar words together
for u, v in itertools.combinations(words, 2):
    if find(u) != find(v) and min_edit_dist(u, v) < 3:
        union(u, v)

# determine groups of similar words by their leaders
groups = collections.defaultdict(set)
for x in leaders:
    groups[find(x)].add(x)
print groups.values()

Output, for my implementations of union, find, and min_edit_dist:

[set(['laptop bag']), 
 set(['gruop', 'grop', 'group']), 
 set(['buk', 'book', 'bok']), 
 set(['laftop', 'laptop', 'leptop']), 
 set(['pencil', 'pancil', 'pensil'])]

For the union and find functions, refer to this answer. The implementation of the min_edit_dist function is left as an exercise to the reader, but a rough sketch follows below.
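As a sketch only (the linked answer's implementations may differ in detail): find and union operate on the leaders dictionary from the snippet above, so define them before the union loop runs, and min_edit_dist is the textbook dynamic-programming Levenshtein distance.

def find(x):
    # follow leader pointers until we reach a group's root word
    while leaders[x] is not None:
        x = leaders[x]
    return x

def union(x, y):
    # point one group's root at the other's, merging the two groups
    lx, ly = find(x), find(y)
    if lx != ly:
        leaders[lx] = ly

def min_edit_dist(a, b):
    # classic dynamic-programming Levenshtein distance
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                               # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j                               # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]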

A possible problem with this approach is that it may end up merging all the groups, if there are sufficiently close variations between them.


Regarding your own approach using difflib.get_close_matches:

You can use the cutoff parameter to fine-tune how "close" the matches should be. However, I did not find a value that works for all your examples, much less for all else that might be out there. 0.8 works for laptop, but is too strict for book. Also note that using this approach, you need to know which are the "root words", which might be a problem in practice.
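To illustrate (a quick sketch using the list from your question; the expected results in the comments follow from difflib.SequenceMatcher's similarity ratios):

from difflib import get_close_matches

lis = ["leptop", "pencil", "group", "leptop", "book", "gruop", "bowk",
       "grop", "laftop", "pensil", "laptop", "pancil", "laptop bag", "bok"]

# cutoff=0.8 excludes "laptop bag" (ratio 0.75) for "laptop" ...
print get_close_matches("laptop", lis, n=len(lis), cutoff=0.8)
# ['laptop', 'leptop', 'leptop', 'laftop']

# ... but the same cutoff also drops "bowk" (ratio 0.75) for "book"
print get_close_matches("book", lis, n=len(lis), cutoff=0.8)
# ['book', 'bok']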

My approach, on the other hand, does not need to know which words are the "leaders" of the groups a priori, but finds them itself. For similar techniques, you might also want to have a look at cluster analysis algorithms.

tobias_k

First you need to define what "related" means.

From your examples it seems that words a and b are related if a small change in characters turns a into b. See the other comments about the Levenshtein distance, which is exactly what you want to use: the Levenshtein (or minimum edit) distance takes two words a and b and gives you a number dist(a, b) that is 0 if a = b and becomes higher the more changes you need to make to a to obtain b.

With this "tool" you can start to build an algorithm that solves your problem, for example by defining a distance threshold that counts as "related", then taking each word line by line and checking whether it is within that distance of any word already grouped, as in the sketch below.
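A minimal sketch of that naive method (group_related, dist, and max_dist are illustrative names; dist is a plain iterative Levenshtein distance):

def dist(a, b):
    # iterative Levenshtein distance, computed one DP row at a time
    row = range(len(b) + 1)
    for i, ca in enumerate(a, 1):
        new = [i]
        for j, cb in enumerate(b, 1):
            new.append(min(row[j] + 1,                # deletion
                           new[j - 1] + 1,            # insertion
                           row[j - 1] + (ca != cb)))  # substitution
        row = new
    return row[-1]

def group_related(words, max_dist=2):
    # naive grouping: a word joins the first group that already
    # contains some word within max_dist edits of it
    groups = []
    for w in words:
        for g in groups:
            if any(dist(w, other) <= max_dist for other in g):
                g.append(w)
                break
        else:
            groups.append([w])
    return groups

With the word list from the question, group_related happens to reproduce the five desired groups at max_dist=2. But because a new word is compared against every word already in a group, groups grow transitively, which is exactly the issue described next.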

However, you really need to consider what you want to achieve. Using a naive method could put all the words in the input into one "class" if they are all transitively related.

Example: foo/fool are related, fool/pool are related, pool/bool are related. Now foo might not be "related" to bool according to your initial idea, but it is related via a chain of related words.

If you are OK with foo/fool/pool/bool all ending up in one class, then you can use a naive method; otherwise you need something more clever.

peschü