Efficient way to find an approximate string match and replacing with predefined string

Question

I need to build a NER system (Named Entity Recognition). For simplicity, I am doing it by using approximate string matching as input can contain typos and other minor modifications. I have come across some great libraries like: fuzzywuzzy or even faster RapidFuzz. But unfortunately I didn't find a way to return the position where the match occurs. As, for my purpose I not only need to find the match, but also I need to know where the match happened. As for NER, I need to replace those matches with some predefined string.

For example, If any one of the line is found in input string I want to replace them with the string COMPANY_NAME:

google
microsoft
facebook
International Business Machine

Like, input: S/he works at Google will be transformed to S/he works at COMPANY_NAME. You can safely assume that, all the input and the pattern to match are already preprocessed and most importantly they are in lower-case now. So, there is no problem with case-sensitivity.

Currently, I have approached with a sliding window technique. And a sliding window is passed over the input string from left to right and this window has exactly the size of the pattern we want to match. For example, when I want to match with International Business Machine, I run a sliding window of size 3 from left to right and try to find the best match by observing each 3 consecutive tokens at the same time with a stride of 1. I do believe, it is not the best way to do it, also it cannot find the best match.

So, what is the efficient way to find the best possible match along with the quantification on the found match (how much they are similar) and the position of the match(es), such that we can replace them with a given fixed string (if the calculated similarity is not less than a threshold)? Obviously, a single input may contain multiple portions to be replaced, each of them will be replaced separately, like: Google and Microsoft are big companies will become COMPANY_NAME and COMPANY_NAME are big companies etc.

I think it wasn't created for showing position. You may only split text in smaller parts and check your word with every element separtelly and get the best matchin element. It seems it has function `process.extractOne(list, word)` to check word with all elements on list. For single words it can be simpler because you can split full text into list of words. But for `International Business Machine` you would have to split full text into list of words and later create list iwth strings with 3 words - and later you can use list position to calculate position in full text. — furas, Nov 24 '21 at 05:32

furas · Accepted Answer · 2021-11-24T07:57:57.153

It seems modules fuzzywuzzy and RapidFuzz don't have function for this. You could try to use process.extract() or process.extractOne() but it would need to split text in smaller parts (ie. words) and check every part separatelly. For longer words like International Business Machine it would need to split in part with 3 words - so it would need even more work.

I think you need rather module fuzzysearch

import fuzzysearch

words = ['google', 'microsoft', 'facebook', 'International Business Machine']

text = 'Google and Microsoft are big companies like International Business Machine'

print(' text:', text)
print('---')
    
for word in sorted(words, key=len, reverse=True):
    print(' word:', word)
    
    results = fuzzysearch.find_near_matches(word, text, max_l_dist=1)
    print('found:', results)
    
    for item in reversed(results):
        text = text[:item.start] + 'COMPANY' + text[item.end:]
    print(' text:', text)
    
    print('---')

Result:

 text: Google and Microsoft are big companies like facebook International Business Machine
---
 word: International Business Machine
found: [Match(start=53, end=83, dist=0, matched='International Business Machine')]
 text: Google and Microsoft are big companies like facebook COMPANY
---
 word: microsoft
found: [Match(start=11, end=20, dist=1, matched='Microsoft')]
 text: Google and COMPANY are big companies like facebook COMPANY
---
 word: facebook
found: [Match(start=42, end=50, dist=0, matched='facebook')]
 text: Google and COMPANY are big companies like COMPANY COMPANY
---
 word: google
found: [Match(start=0, end=6, dist=1, matched='Google')]
 text: COMPANY and COMPANY are big companies like COMPANY COMPANY

If it finds many results for one word then it is better to start replacing at last position to keep other words in the same place. And this is why I use reversed().

I would start also with the longest word/name so later it still can search shorter words like Business. And this is why I use sorted(..., key=len, reverse=True)

But I'm not sure if it works exactly as you want. Maybe it will have problem when words are more incorrect.

EDIT:

I tried to use fuzzywuzzy for this and created this version but only for names with single word. For International Business Machine it would need some other idea.

It split full text into words and compare words. Later replace word wich have ration > 80

words = ['google', 'microsoft', 'facebook', 'International Business Machine']

text = 'Google and Microsoft are big companies like International Business Machine'

# ---

import fuzzywuzzy.fuzz as fuzz
#import fuzzywuzzy.process

new_words = []

for part in text.split():

    matches = []

    for word in words:
        result = fuzz.token_sort_ratio(part, word)
        matches.append([result, part, word])
        #print([result, part, word])

    matches = sorted(matches, reverse=True)

    if matches and matches[0][0] > 80:
        new_words.append('COMPANY')
    else:
        new_words.append(matches[0][1])
        
print(" ".join(new_words))

Result:

[100, 'Google', 'google']
[27, 'Google', 'microsoft']
[29, 'Google', 'facebook']
[17, 'Google', 'International Business Machine']
[0, 'and', 'google']
[0, 'and', 'microsoft']
[18, 'and', 'facebook']
[12, 'and', 'International Business Machine']
[27, 'Microsoft', 'google']
[100, 'Microsoft', 'microsoft']
[35, 'Microsoft', 'facebook']
[15, 'Microsoft', 'International Business Machine']
[22, 'are', 'google']
[17, 'are', 'microsoft']
[36, 'are', 'facebook']
[12, 'are', 'International Business Machine']
[22, 'big', 'google']
[17, 'big', 'microsoft']
[18, 'big', 'facebook']
[12, 'big', 'International Business Machine']
[27, 'companies', 'google']
[33, 'companies', 'microsoft']
[24, 'companies', 'facebook']
[26, 'companies', 'International Business Machine']
[40, 'like', 'google']
[15, 'like', 'microsoft']
[17, 'like', 'facebook']
[18, 'like', 'International Business Machine']
[21, 'International', 'google']
[27, 'International', 'microsoft']
[19, 'International', 'facebook']
[60, 'International', 'International Business Machine']
[14, 'Business', 'google']
[24, 'Business', 'microsoft']
[12, 'Business', 'facebook']
[42, 'Business', 'International Business Machine']
[15, 'Machine', 'google']
[25, 'Machine', 'microsoft']
[40, 'Machine', 'facebook']
[38, 'Machine', 'International Business Machine']
COMPANY and COMPANY are big companies like International Business Machine

EDIT:

Second version which check also names with many words

all_names = ['google', 'microsoft', 'facebook', 'International Business Machine']

text = 'Google and Microsoft are big companies like International Business Machine'

# ---

import fuzzywuzzy.fuzz as fuzz


for name in all_names:

    length = len(name.split(' ')) # how many words has name 
    print('name length:', length, '|', name)

    words = text.split()  # split text into words

    # compare name with all words in text
    
    matches = []
    
    for index in range(0, len(words)-length+1):
        # join words if name has more then 1 word
        part = " ".join(words[index:index+length])
        #print('part:', part)
        
        result = fuzz.token_sort_ratio(part, name)
        matches.append([result, name, part, [index, index+length]])

        print([result, name, part, [index, index+length]])
        
    # reverse to start at last position
    matches = list(reversed(matches))

    max_match = max(x[0] for x in matches)
    print('max match:', max_match)

    # replace
    if max_match > 80:
        for match in matches:
            if  match[0] == max_match:
                idx = match[3]  
                words = words[:idx[0]] + ['COMPANY'] + words[idx[1]:]

    text = " ".join(words)
    print('text:', text)
    print('---')

Result:

ame length: 1 | google
[100, 'google', 'Google', [0, 1]]
[0, 'google', 'and', [1, 2]]
[27, 'google', 'Microsoft', [2, 3]]
[22, 'google', 'are', [3, 4]]
[22, 'google', 'big', [4, 5]]
[27, 'google', 'companies', [5, 6]]
[40, 'google', 'like', [6, 7]]
[21, 'google', 'International', [7, 8]]
[14, 'google', 'Business', [8, 9]]
[15, 'google', 'Machine', [9, 10]]
max match: 100
text: COMPANY and Microsoft are big companies like International Business Machine
---
name length: 1 | microsoft
[25, 'microsoft', 'COMPANY', [0, 1]]
[0, 'microsoft', 'and', [1, 2]]
[100, 'microsoft', 'Microsoft', [2, 3]]
[17, 'microsoft', 'are', [3, 4]]
[17, 'microsoft', 'big', [4, 5]]
[33, 'microsoft', 'companies', [5, 6]]
[15, 'microsoft', 'like', [6, 7]]
[27, 'microsoft', 'International', [7, 8]]
[24, 'microsoft', 'Business', [8, 9]]
[25, 'microsoft', 'Machine', [9, 10]]
max match: 100
text: COMPANY and COMPANY are big companies like International Business Machine
---
name length: 1 | facebook
[27, 'facebook', 'COMPANY', [0, 1]]
[18, 'facebook', 'and', [1, 2]]
[27, 'facebook', 'COMPANY', [2, 3]]
[36, 'facebook', 'are', [3, 4]]
[18, 'facebook', 'big', [4, 5]]
[24, 'facebook', 'companies', [5, 6]]
[17, 'facebook', 'like', [6, 7]]
[19, 'facebook', 'International', [7, 8]]
[12, 'facebook', 'Business', [8, 9]]
[40, 'facebook', 'Machine', [9, 10]]
max match: 40
text: COMPANY and COMPANY are big companies like International Business Machine
---
name length: 3 | International Business Machine
[33, 'International Business Machine', 'COMPANY and COMPANY', [0, 3]]
[31, 'International Business Machine', 'and COMPANY are', [1, 4]]
[31, 'International Business Machine', 'COMPANY are big', [2, 5]]
[34, 'International Business Machine', 'are big companies', [3, 6]]
[38, 'International Business Machine', 'big companies like', [4, 7]]
[69, 'International Business Machine', 'companies like International', [5, 8]]
[88, 'International Business Machine', 'like International Business', [6, 9]]
[100, 'International Business Machine', 'International Business Machine', [7, 10]]
max match: 100
text: COMPANY and COMPANY are big companies like COMPANY

EDIT:

Version with fuzzywuzzy.process

This time I don't have positions and I simply use standard text.replace(item[0], 'COMPANY').

I think in most situations it will work correctly and it doesn't need better method.

This time I check it on text with mistakes:

'Gogle and Mikro-Soft are big companies like Fasebok and Internat. Businnes Machin'


all_names = ['google', 'microsoft', 'facebook', 'International Business Machine']

text = 'Google and Microsoft are big companies like Facebook and International Business Machine'

# text with mistakes
text = 'Gogle and Mikro-Soft are big companies like Fasebok and Internat. Businnes Machin'

# ---

import fuzzywuzzy.process
#import fuzzywuzzy.fuzz

for name in sorted(all_names, key=len, reverse=True):
    lenght = len(name.split())

    words = text.split()
    words = [" ".join(words[i:i+lenght]) for i in range(0, len(words)-lenght+1)]
    #print(words)

    #result = fuzzywuzzy.process.extractBests(name, words, scorer=fuzzywuzzy.fuzz.token_sort_ratio, score_cutoff=80)
    result = fuzzywuzzy.process.extractBests(name, words, score_cutoff=80)
    print(name, result)

    for item in result:
        text = text.replace(item[0], 'COMPANY')

print(text)

I added version with `fuzzywuzzy.process.extractBests` - it is short and works also with `International Business Machine` but I used normal `text.replace()` which in most situations should resolve problem. — furas, Nov 24 '21 at 07:57
Thanks @furas, in the last edit probably we may also consider `from difflib import get_close_matches`, as `difflib` is a built in `python` module. Changes are: `result = get_close_matches(name, words)` and `for item in result: text = text.replace(item, 'COMPANY')`. — hafiz031, Nov 24 '21 at 10:54
`get_close_matches(name, words, cutoff = CUTOFF_VALUE)` ref: https://docs.python.org/3/library/difflib.html#difflib.get_close_matches — hafiz031, Nov 25 '21 at 00:17
@hafiz031 you could put this code (full version) with description as answer and mark as accepted. — furas, Nov 25 '21 at 00:26

Efficient way to find an approximate string match and replacing with predefined string

1 Answers1

Linked