
I have N strings. There are also K regular expressions, unknown to me. Each string either matches one of the regular expressions or is garbage. There are a total of L garbage strings in the set. Both K and L are unknown.
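
For example (entirely hypothetical data), with N = 7 the hidden structure might be K = 2 regexes plus L = 1 garbage string:

```python
# Hypothetical input: N = 7 strings, hidden structure unknown to us.
strings = [
    "ab123", "ab7", "ab42",        # match hidden regex 1: /ab\d+/
    "user-alice", "user-bob",      # match hidden regex 2: /user-[a-z]+/
    "user-carol",
    "$$!?",                        # garbage, so L = 1
]
# Desired output: the K = 2 regexes, with K and L not known in advance.
```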

I'd like to deduce the regular expressions. Obviously, this problem has an infinite number of solutions. I need to find a "reasonably good" solution, which

1) minimizes K

2) minimizes L

3) maximizes the "specificity" of the regular expressions. I don't know the right term for this quality. For example, the string "ab123" can be described as /ab\d+/ or /\w+.+/, but the first regex is more "specific". (One crude way to score this is sketched after this list.)

All 3 requirements need to be taken together as one compound criterion, with reasonable weights.
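
To make requirement 3 concrete, here is one crude, entirely hypothetical way to score "specificity": rank the atoms a pattern is built from by how many characters they can match, so a literal beats \d, which beats \w, which beats '.':

```python
# A crude, hypothetical specificity score: atoms that can match fewer
# distinct characters are more specific.
ATOM_SCORES = {
    r"\d": 3,   # 10 possible characters
    r"\w": 2,   # ~63 possible characters
    ".": 1,     # matches almost anything
}

def specificity(pattern: str) -> int:
    score, i = 0, 0
    while i < len(pattern):
        two = pattern[i:i + 2]
        if two in ATOM_SCORES:          # escaped class like \d or \w
            score += ATOM_SCORES[two]
            i += 2
        elif pattern[i] in ".*+?":      # wildcard or quantifier: little/no bonus
            score += ATOM_SCORES.get(pattern[i], 0)
            i += 1
        else:                           # literal character: most specific
            score += 4
            i += 1
    return score

assert specificity(r"ab\d+") > specificity(r"\w+.+")   # 11 > 3
```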

A solution for one particular case: if L = 0 and K = 1 (just one regex and no garbage), then we can just find the LCS (longest common subsequence) of the strings and derive a corresponding regex from it. However, when we have "noise" (L > 0), this approach doesn't work.
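
For illustration, a rough sketch of that LCS idea (my own toy version: common characters become literals, and every gap becomes a lazy wildcard, though a smarter version would infer classes like \d+ from the gap contents):

```python
from functools import reduce
import re

def lcs(a: str, b: str) -> str:
    """Classic DP longest common subsequence of two strings."""
    dp = [[""] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a):
        for j, cb in enumerate(b):
            if ca == cb:
                dp[i + 1][j + 1] = dp[i][j] + ca
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j], key=len)
    return dp[len(a)][len(b)]

def regex_from_strings(strings):
    """Common subsequence becomes literals; gaps become lazy wildcards."""
    common = reduce(lcs, strings)
    return ".*?".join(re.escape(c) for c in common)

pattern = regex_from_strings(["ab123", "ab7", "ab42"])   # -> "a.*?b"
assert all(re.search(pattern, s) for s in ["ab123", "ab7", "ab42"])
```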

Any ideas (or pointers to existing work) are greatly appreciated.

Igor Krivokon
  • What is the information given? Just the N strings? Are the regexes already decided, but just hidden from you? You could easily generate a regex that matches a specific set of strings by joining them with "|". – Markus Jarderot May 21 '09 at 22:52
  • :) that would be cheating. I guess I need another criterion to prevent this kind of solution... Restrict the regex length, I guess. – Igor Krivokon May 21 '09 at 23:00
  • Your condition #3 can better be described as minimizing the number of matching strings that aren't in the given set of N strings. Given that you've got 3 things to minimize (although you could just as easily require L=0), you need to weight which factors are more important. – user57368 May 21 '09 at 23:31

3 Answers


What you are trying to do is language learning or language inference with a twist: instead of generalising over a set of given examples (and possibly counter-examples), you wish to infer a language with a small yet specific grammar.

I'm not sure how much research is being done on that. However, if you are also interested in finding the minimal (= general) regular expression that accepts all N strings, search for papers on MDL (Minimum Description Length) and FSMs (Finite State Machines).
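
To give a flavour of the MDL idea, here is a toy scoring function (my own rendering, not taken from any particular paper): the total description length is the bits needed for the regex itself plus the bits needed for each string given the regex, and the candidate with the smallest total wins:

```python
import re

def mdl_score(pattern: str, strings, bits_per_char: float = 8.0) -> float:
    """Toy MDL score: cost of describing the regex, plus the cost of
    describing each string given the regex. Non-matching ('garbage')
    strings must be spelled out in full."""
    model_cost = bits_per_char * len(pattern)
    # Hypothetical simplification: literal characters are 'paid for' by
    # the pattern, so only the remaining free choices cost bits.
    literals = len(re.sub(r"\\.|[.*+?\[\]()|]", "", pattern))
    data_cost = 0.0
    for s in strings:
        if re.fullmatch(pattern, s):
            data_cost += bits_per_char * max(0, len(s) - literals)
        else:
            data_cost += bits_per_char * len(s)
    return model_cost + data_cost

strings = ["ab123", "ab7", "ab42"]
# The specific pattern wins over the overly general one.
assert mdl_score(r"ab\d+", strings) < mdl_score(r".*", strings)
```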

Two interesting queries at Google Scholar:

Stephan202

The key words in academia are "grammatical inference". Unfortunately, there aren't any efficient, general algorithms to do the sort of thing you're proposing. What's your real problem?

Edit: it sounds like you might be interested in Data Description Languages. PADS (http://www.padsproj.org/) is a typical example.

Dave
    > What's your real problem? As a hobby project, I'm implementing a "magic editor" for big files, mostly data (plus occasional comments or "irregularities"). Quite often I need to change formatting, or remove a column of values, or something like this. Usually I do this kind of stuff with a quick perl one-liner. However, I wanted to create a more "visual" solution for people not familiar with regexes. They would just edit one line, and the other (similar) lines in the file change automatically. – Igor Krivokon May 21 '09 at 23:16

Nothing clever here; perhaps I don't fully understand the problem?

Why not just always reduce L to 0? Check each string against each regex; if a string doesn't match any of the regexes, it's garbage. If it does match, remember which regex matched it, group the strings accordingly, and run the LCS approach on each group as an L = 0, K = 1 subproblem to deduce each regex's definition.
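
A rough sketch of that procedure, assuming a pool of candidate regexes is already available to test against (producing the candidates is of course the hard part):

```python
import re
from collections import defaultdict

def partition(strings, candidate_regexes):
    """Assign each string to the first candidate regex it matches;
    whatever matches nothing is garbage, so each group becomes a clean
    L = 0, K = 1 subproblem for the LCS approach from the question."""
    groups = defaultdict(list)
    garbage = []
    for s in strings:
        for pattern in candidate_regexes:
            if re.fullmatch(pattern, s):
                groups[pattern].append(s)
                break
        else:                        # no candidate matched: garbage
            garbage.append(s)
    return groups, garbage
```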

benson