It seems modules fuzzywuzzy
and RapidFuzz
don't have function for this. You could try to use process.extract()
or process.extractOne()
but it would need to split text in smaller parts (ie. words) and check every part separatelly. For longer words like International Business Machine
it would need to split in part with 3 words - so it would need even more work.
I think you need rather module fuzzysearch
import fuzzysearch
words = ['google', 'microsoft', 'facebook', 'International Business Machine']
text = 'Google and Microsoft are big companies like International Business Machine'
print(' text:', text)
print('---')
for word in sorted(words, key=len, reverse=True):
print(' word:', word)
results = fuzzysearch.find_near_matches(word, text, max_l_dist=1)
print('found:', results)
for item in reversed(results):
text = text[:item.start] + 'COMPANY' + text[item.end:]
print(' text:', text)
print('---')
Result:
text: Google and Microsoft are big companies like facebook International Business Machine
---
word: International Business Machine
found: [Match(start=53, end=83, dist=0, matched='International Business Machine')]
text: Google and Microsoft are big companies like facebook COMPANY
---
word: microsoft
found: [Match(start=11, end=20, dist=1, matched='Microsoft')]
text: Google and COMPANY are big companies like facebook COMPANY
---
word: facebook
found: [Match(start=42, end=50, dist=0, matched='facebook')]
text: Google and COMPANY are big companies like COMPANY COMPANY
---
word: google
found: [Match(start=0, end=6, dist=1, matched='Google')]
text: COMPANY and COMPANY are big companies like COMPANY COMPANY
If it finds many results for one word then it is better to start replacing at last position to keep other words in the same place. And this is why I use reversed()
.
I would start also with the longest word/name so later it still can search shorter words like Business
. And this is why I use sorted(..., key=len, reverse=True)
But I'm not sure if it works exactly as you want. Maybe it will have problem when words are more incorrect.
EDIT:
I tried to use fuzzywuzzy
for this and created this version but only for names with single word. For International Business Machine
it would need some other idea.
It split full text into words and compare words. Later replace word wich have ration > 80
words = ['google', 'microsoft', 'facebook', 'International Business Machine']
text = 'Google and Microsoft are big companies like International Business Machine'
# ---
import fuzzywuzzy.fuzz as fuzz
#import fuzzywuzzy.process
new_words = []
for part in text.split():
matches = []
for word in words:
result = fuzz.token_sort_ratio(part, word)
matches.append([result, part, word])
#print([result, part, word])
matches = sorted(matches, reverse=True)
if matches and matches[0][0] > 80:
new_words.append('COMPANY')
else:
new_words.append(matches[0][1])
print(" ".join(new_words))
Result:
[100, 'Google', 'google']
[27, 'Google', 'microsoft']
[29, 'Google', 'facebook']
[17, 'Google', 'International Business Machine']
[0, 'and', 'google']
[0, 'and', 'microsoft']
[18, 'and', 'facebook']
[12, 'and', 'International Business Machine']
[27, 'Microsoft', 'google']
[100, 'Microsoft', 'microsoft']
[35, 'Microsoft', 'facebook']
[15, 'Microsoft', 'International Business Machine']
[22, 'are', 'google']
[17, 'are', 'microsoft']
[36, 'are', 'facebook']
[12, 'are', 'International Business Machine']
[22, 'big', 'google']
[17, 'big', 'microsoft']
[18, 'big', 'facebook']
[12, 'big', 'International Business Machine']
[27, 'companies', 'google']
[33, 'companies', 'microsoft']
[24, 'companies', 'facebook']
[26, 'companies', 'International Business Machine']
[40, 'like', 'google']
[15, 'like', 'microsoft']
[17, 'like', 'facebook']
[18, 'like', 'International Business Machine']
[21, 'International', 'google']
[27, 'International', 'microsoft']
[19, 'International', 'facebook']
[60, 'International', 'International Business Machine']
[14, 'Business', 'google']
[24, 'Business', 'microsoft']
[12, 'Business', 'facebook']
[42, 'Business', 'International Business Machine']
[15, 'Machine', 'google']
[25, 'Machine', 'microsoft']
[40, 'Machine', 'facebook']
[38, 'Machine', 'International Business Machine']
COMPANY and COMPANY are big companies like International Business Machine
EDIT:
Second version which check also names with many words
all_names = ['google', 'microsoft', 'facebook', 'International Business Machine']
text = 'Google and Microsoft are big companies like International Business Machine'
# ---
import fuzzywuzzy.fuzz as fuzz
for name in all_names:
length = len(name.split(' ')) # how many words has name
print('name length:', length, '|', name)
words = text.split() # split text into words
# compare name with all words in text
matches = []
for index in range(0, len(words)-length+1):
# join words if name has more then 1 word
part = " ".join(words[index:index+length])
#print('part:', part)
result = fuzz.token_sort_ratio(part, name)
matches.append([result, name, part, [index, index+length]])
print([result, name, part, [index, index+length]])
# reverse to start at last position
matches = list(reversed(matches))
max_match = max(x[0] for x in matches)
print('max match:', max_match)
# replace
if max_match > 80:
for match in matches:
if match[0] == max_match:
idx = match[3]
words = words[:idx[0]] + ['COMPANY'] + words[idx[1]:]
text = " ".join(words)
print('text:', text)
print('---')
Result:
ame length: 1 | google
[100, 'google', 'Google', [0, 1]]
[0, 'google', 'and', [1, 2]]
[27, 'google', 'Microsoft', [2, 3]]
[22, 'google', 'are', [3, 4]]
[22, 'google', 'big', [4, 5]]
[27, 'google', 'companies', [5, 6]]
[40, 'google', 'like', [6, 7]]
[21, 'google', 'International', [7, 8]]
[14, 'google', 'Business', [8, 9]]
[15, 'google', 'Machine', [9, 10]]
max match: 100
text: COMPANY and Microsoft are big companies like International Business Machine
---
name length: 1 | microsoft
[25, 'microsoft', 'COMPANY', [0, 1]]
[0, 'microsoft', 'and', [1, 2]]
[100, 'microsoft', 'Microsoft', [2, 3]]
[17, 'microsoft', 'are', [3, 4]]
[17, 'microsoft', 'big', [4, 5]]
[33, 'microsoft', 'companies', [5, 6]]
[15, 'microsoft', 'like', [6, 7]]
[27, 'microsoft', 'International', [7, 8]]
[24, 'microsoft', 'Business', [8, 9]]
[25, 'microsoft', 'Machine', [9, 10]]
max match: 100
text: COMPANY and COMPANY are big companies like International Business Machine
---
name length: 1 | facebook
[27, 'facebook', 'COMPANY', [0, 1]]
[18, 'facebook', 'and', [1, 2]]
[27, 'facebook', 'COMPANY', [2, 3]]
[36, 'facebook', 'are', [3, 4]]
[18, 'facebook', 'big', [4, 5]]
[24, 'facebook', 'companies', [5, 6]]
[17, 'facebook', 'like', [6, 7]]
[19, 'facebook', 'International', [7, 8]]
[12, 'facebook', 'Business', [8, 9]]
[40, 'facebook', 'Machine', [9, 10]]
max match: 40
text: COMPANY and COMPANY are big companies like International Business Machine
---
name length: 3 | International Business Machine
[33, 'International Business Machine', 'COMPANY and COMPANY', [0, 3]]
[31, 'International Business Machine', 'and COMPANY are', [1, 4]]
[31, 'International Business Machine', 'COMPANY are big', [2, 5]]
[34, 'International Business Machine', 'are big companies', [3, 6]]
[38, 'International Business Machine', 'big companies like', [4, 7]]
[69, 'International Business Machine', 'companies like International', [5, 8]]
[88, 'International Business Machine', 'like International Business', [6, 9]]
[100, 'International Business Machine', 'International Business Machine', [7, 10]]
max match: 100
text: COMPANY and COMPANY are big companies like COMPANY
EDIT:
Version with fuzzywuzzy.process
This time I don't have positions and I simply use standard text.replace(item[0], 'COMPANY')
.
I think in most situations it will work correctly and it doesn't need better method.
This time I check it on text with mistakes:
'Gogle and Mikro-Soft are big companies like Fasebok and Internat. Businnes Machin'
all_names = ['google', 'microsoft', 'facebook', 'International Business Machine']
text = 'Google and Microsoft are big companies like Facebook and International Business Machine'
# text with mistakes
text = 'Gogle and Mikro-Soft are big companies like Fasebok and Internat. Businnes Machin'
# ---
import fuzzywuzzy.process
#import fuzzywuzzy.fuzz
for name in sorted(all_names, key=len, reverse=True):
lenght = len(name.split())
words = text.split()
words = [" ".join(words[i:i+lenght]) for i in range(0, len(words)-lenght+1)]
#print(words)
#result = fuzzywuzzy.process.extractBests(name, words, scorer=fuzzywuzzy.fuzz.token_sort_ratio, score_cutoff=80)
result = fuzzywuzzy.process.extractBests(name, words, score_cutoff=80)
print(name, result)
for item in result:
text = text.replace(item[0], 'COMPANY')
print(text)