Fuzzy string-matching that can "skip"? e.g. "i am (.*)." has 0 distance to "I am here."

Question

I'm writing a Python chatbot. Whatever the technique (Levenshtein, LCS, regex, etc.), I want a pattern like My name is [ A ]. to be smart enough to match strings like:

My name is Tslmy.              #Distance should = 0, and groupdict()['a'] outputs "Tslmy"
My name is Tesla Tahomana.     #Distance should = 0(!), and groupdict()['a'] outputs "Tesla Tahomana"
my  naem ist tslmy .           #With a little typo, the distance = 5, and groupdict()['a'] outputs "tslmy "

Please allow me to use groupdict()['a'] to refer to whatever the [ A ] placeholder (in regex terms, a named group (?P<identifier>match)) has captured.

Put one way, I'm looking for a "Levenshtein" that allows skips (omissions/blanks/neglects) and also returns what has been skipped.
Put another way, I'm looking for a fuzzy (a.k.a. approximate) regex that is less strict about the pattern but still provides the good old groupdict(), as well as a "fuzziness" value (an "edit distance", which I need later to determine which pattern best matches the string).
The second is the preferred solution, since groupdict() gives me everything I need if the patterns are well managed.
However, the TRE library and the regex library, which are the closest solutions I have found, don't seem to provide a "fuzziness" value. If this can be solved, so much the better!

Is that possible? Thanks for paying attention.
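
To make the "Levenshtein with skips" idea concrete, here is a minimal sketch in pure Python. It is not what TRE or the regex module do internally; it is an ordinary edit-distance table in which one wildcard character absorbs text at zero cost. The function name `fuzzy_slot_match` and the use of `*` as the slot marker are assumptions made for this illustration, and the sketch assumes a single slot per template and text whose length does not change under `lower()`.

```python
def fuzzy_slot_match(template, text, slot="*"):
    """Return (distance, captured) for a template with one `slot` wildcard.

    Hypothetical helper: plain Levenshtein costs (insert, delete, substitute
    = 1 each), except that the slot swallows any run of text for free.
    Matching is case-insensitive; the capture keeps the original casing.
    """
    p, t = template.lower(), text.lower()
    rows, cols = len(p) + 1, len(t) + 1
    # cost[i][j] = cheapest alignment of p[:i] with t[:j]
    cost = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        cost[0][j] = j  # text before any template char counts as extra
    for i in range(1, rows):
        is_slot = p[i - 1] == slot
        cost[i][0] = cost[i - 1][0] + (0 if is_slot else 1)
        for j in range(1, cols):
            if is_slot:
                # either the slot is still empty, or it absorbs t[j-1] for free
                cost[i][j] = min(cost[i - 1][j], cost[i][j - 1])
            else:
                cost[i][j] = min(
                    cost[i - 1][j - 1] + (p[i - 1] != t[j - 1]),  # match / substitute
                    cost[i - 1][j] + 1,  # template char missing from the text
                    cost[i][j - 1] + 1,  # extra char in the text
                )

    # Walk back through the table to recover what the slot absorbed.
    i, j = len(p), len(t)
    captured = []
    while i > 0:
        if p[i - 1] == slot:
            if j > 0 and cost[i][j] == cost[i][j - 1]:
                captured.append(text[j - 1])
                j -= 1
            else:
                i -= 1
        elif j > 0 and cost[i][j] == cost[i - 1][j - 1] + (p[i - 1] != t[j - 1]):
            i, j = i - 1, j - 1
        elif cost[i][j] == cost[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return cost[len(p)][len(t)], "".join(reversed(captured))


for sentence in ("My name is Tslmy.", "My name is Tesla Tahomana.", "my  naem ist tslmy ."):
    print(fuzzy_slot_match("my name is *.", sentence))
```

The first two sentences come back with distance 0 and the full name as the capture. For the misspelled one the distance depends on the cost model (plain Levenshtein here, so a transposition counts as two edits), and when several alignments tie for the lowest cost, the capture boundary can shift by a character or two.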

Update:

I decided to use the powerful regex module in the end, but I am still unable to get the "fuzziness value".

Since the question on this page is theoretically solved, appending further to it would be bad form. So I have posted another question about this new issue, and hope you can solve it!

You may consider using Damerau–Levenshtein rather than plain Levenshtein as it will consider “naem” to have a distance of 1 from “name” rather than 2, not that it helps you with your problem. – icktoofay, Jun 10 '13 at 04:40
Yes, I am using DL in production. To simplify the question, I only referred to it as a normal Levenshtein. What's more, I don't really care _what_ the fuzzy algorithm is. Thank you for the reminder, as it will delight those who have not yet enjoyed DL~! @icktoofay – tslmy, Jun 10 '13 at 09:23
BTW, you probably want your final `.` to be escaped; `[.]` or `\.` -- otherwise, it matches any character, rather than only itself. – Charles Duffy, Nov 16 '15 at 22:22
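
icktoofay's point about "naem" vs. "name" can be checked with a short sketch. This is the restricted Damerau–Levenshtein variant (often called optimal string alignment), written from scratch for illustration; the name `osa_distance` and its `transpositions` switch are assumptions of this example, not part of any library mentioned in the thread.

```python
def osa_distance(a, b, transpositions=True):
    """Edit distance with optional adjacent-transposition edits.

    With transpositions=False this is plain Levenshtein; with True it is
    the optimal-string-alignment form of Damerau-Levenshtein.
    """
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i  # delete everything in a[:i]
    for j in range(len(b) + 1):
        d[0][j] = j  # insert everything in b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            substitution = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,                 # deletion
                d[i][j - 1] + 1,                 # insertion
                d[i - 1][j - 1] + substitution,  # match or substitution
            )
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                # swapping two adjacent characters counts as a single edit
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


print(osa_distance("naem", "name"))                        # 1
print(osa_distance("naem", "name", transpositions=False))  # 2
```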

score 1 · Accepted Answer · answered Jun 10 '13 at 04:54

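You could use a regex for the basic match (one or two words for the name, with the final dot escaped so that it only matches a literal period):

r"My name is (\w+(?:\s+\w+)?)\."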

And then use the TRE library to allow for variations.
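
For the exact-match stage only, a minimal sketch with Python's standard `re` module might look like the following. The group name `a` follows the `groupdict()['a']` usage in the question; the helper name `extract_name` is made up for this example. The TRE stage is not shown, since its Python binding is not described in this thread.

```python
import re

# Named group "a" captures one or more whitespace-separated words.
NAME_PATTERN = re.compile(r"My name is (?P<a>\w+(?:\s+\w+)*)\.", re.IGNORECASE)


def extract_name(sentence):
    """Return the captured name, or None when the sentence does not match exactly."""
    match = NAME_PATTERN.fullmatch(sentence.strip())
    return match.groupdict()["a"] if match else None


print(extract_name("My name is Tslmy."))           # Tslmy
print(extract_name("My name is Tesla Tahomana."))  # Tesla Tahomana
print(extract_name("my  naem ist tslmy ."))        # None
```

The last call returns None because a plain regex has no tolerance for typos, which is the gap the fuzzy-matching step is meant to fill.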

joel.d

Errrrr... Allow me to ask how I can use this library in my Python code? The `python\setup.py.in` seems to need `win32\Release\tre.dll`, which is nonexistent. – tslmy Jun 10 '13 at 08:59
Update: After some Googling, I found [this](https://pypi.python.org/pypi/regex/) to be an easier alternative... maybe? – tslmy Jun 10 '13 at 09:18
Well, I am choosing this answer for now. I decided to use the powerful [regex](https://pypi.python.org/pypi/regex/) module in the end, but I am still unable to get the "fuzziness value". Since the question on _this_ page is theoretically solved, appending further to it would be bad form. So I have posted [another question about this new issue](http://stackoverflow.com/questions/17023862/python-regex-module-fuzziness-value), and hope you can solve it! – tslmy Jun 10 '13 at 12:34

score 0 · Answer 2 · HamZa · 2013-06-10T09:21:59.677

DAT REGEX O_O

(?i)(?:(?:my|ym).?|.?(?:my|ym))\s+(?:.?(?:..me|n..e|na..)|(?:..me|n..e|na..).?)\s+(?:(?:is|si).?|.?(?:is|si))\s+(\w[\w\s]*)\s*

Let's split it up:

(?i) : set the i modifier to match case-insensitively
(?:(?:my|ym).?|.?(?:my|ym)) : this will match my, ym, My, Ym, may, amy etc...
\s+ : match white space one or more times
(?:.?(?:..me|n..e|na..)|(?:..me|n..e|na..).?) : match name, naao, tame, lame, n99e, names, Naats etc...
\s+ : match white space one or more times
(?:(?:is|si).?|.?(?:is|si)) : match is, si, ist, sit, siR etc...
\s+ : match white space one or more times
(\w[\w\s]*) : match a word character followed by any number of word characters or spaces, and capture it (the capture must start with a word character \w)
\s* : match white space zero or more times
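
A quick way to try this pattern from Python is sketched below, using the standard `re` module and the capture group as written in the breakdown, `(\w[\w\s]*)\s*`. The names `TYPO_TOLERANT` and `captured_name` are made up for this example. Note that the pattern only says whether a sentence matches and what was captured; it does not give an edit distance.

```python
import re

# The answer's pattern, split across lines for readability only.
TYPO_TOLERANT = re.compile(
    r"(?i)(?:(?:my|ym).?|.?(?:my|ym))\s+"                # "my" with one stray or swapped char
    r"(?:.?(?:..me|n..e|na..)|(?:..me|n..e|na..).?)\s+"  # "name" with up to two wrong chars
    r"(?:(?:is|si).?|.?(?:is|si))\s+"                    # "is" with one stray or swapped char
    r"(\w[\w\s]*)\s*"                                    # the captured name
)


def captured_name(sentence):
    """Return group 1 of the typo-tolerant pattern, or None if it does not match."""
    match = TYPO_TOLERANT.match(sentence)
    return match.group(1) if match else None


print(repr(captured_name("My name is Tslmy.")))           # 'Tslmy'
print(repr(captured_name("My name is Tesla Tahomana.")))  # 'Tesla Tahomana'
print(repr(captured_name("my  naem ist tslmy .")))        # 'tslmy '
```

The trailing space in the last capture is there because `[\w\s]*` is greedy and takes the space before the final period.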

edited Jun 10 '13 at 09:21

answered Jun 10 '13 at 09:11

HamZa

Man, you are killing me!~ :) – tslmy Jun 10 '13 at 09:19
Hehe~ You can improve it by using `[a-z]` instead of `\w` depending on your needs, since `\w` will also match `_` and digits – HamZa Jun 10 '13 at 09:24
Sorry, there are tons of such patterns to match, so, being extremely lazy, I can't afford to manage them this way. BTW, it must have taken you some time to compose this pattern, right? – tslmy Jun 10 '13 at 09:36
@tslmy Hmmm, I didn't really measure the time, but I would think about 10 min. I think it depends on how fluent you are in regex; I'm writing regex here on SO almost every day xD. BTW you gave me a nice idea to write something that automates the process of writing this regex. Something in PHP to generate a regex like I did [here](http://stackoverflow.com/a/17010983) under "Breaking the laws of regex". – HamZa Jun 10 '13 at 09:40
Thank you for your efforts! I just finished high school 3 days ago and am very much a freshman in every sense. Regex, to tell you the truth, takes me days to study -- harder than Chemistry~! It seems you are doing great work in automating regex generation. Thank you again! – tslmy Jun 10 '13 at 09:45
@tslmy It's easier than you think, I learned it here on SO btw. Reading some tutorials on the web, honing my skills by answering questions, and of course learning from other answers. – HamZa Jun 10 '13 at 09:50
You can also use tools like [this guy](http://gskinner.com/RegExr/) that make writing reliable regexes much less painful – joel.d Jun 10 '13 at 23:07
