
I have N strings. There are also K regular expressions, unknown to me. Each string either matches one of the regular expressions or is garbage. There are a total of L garbage strings in the set. Both K and L are unknown.
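
For example (entirely hypothetical data), with N = 7 the hidden structure might be K = 2 regexes plus L = 1 garbage string:

```python
# Hypothetical input: N = 7 strings, hidden structure unknown to us.
strings = [
    "ab123", "ab7", "ab42",        # match hidden regex 1: /ab\d+/
    "user-alice", "user-bob",      # match hidden regex 2: /user-[a-z]+/
    "user-carol",
    "$$!?",                        # garbage, so L = 1
]
# Desired output: the K = 2 regexes, with K and L not known in advance.
```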

I'd like to deduce the regular expressions. Obviously, this problem has an infinite number of solutions. I need to find a "reasonably good" solution, which

1) minimizes K

2) minimizes L

3) maximizes the "specificity" of the regular expressions. I don't know the right term for this quality. For example, the string "ab123" can be described as /ab\d+/ or /\w+.+/, but the first regex is more "specific". (One crude way to score this is sketched after this list.)

All 3 requirements need to be taken together as one compound criterion, with reasonable weights.
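
To make requirement 3 concrete, here is one crude, entirely hypothetical way to score "specificity": rank the atoms a pattern is built from by how many characters they can match, so a literal beats \d, which beats \w, which beats '.':

```python
# A crude, hypothetical specificity score: atoms that can match fewer
# distinct characters are more specific.
ATOM_SCORES = {
    r"\d": 3,   # 10 possible characters
    r"\w": 2,   # ~63 possible characters
    ".": 1,     # matches almost anything
}

def specificity(pattern: str) -> int:
    score, i = 0, 0
    while i < len(pattern):
        two = pattern[i:i + 2]
        if two in ATOM_SCORES:          # escaped class like \d or \w
            score += ATOM_SCORES[two]
            i += 2
        elif pattern[i] in ".*+?":      # wildcard or quantifier: little/no bonus
            score += ATOM_SCORES.get(pattern[i], 0)
            i += 1
        else:                           # literal character: most specific
            score += 4
            i += 1
    return score

assert specificity(r"ab\d+") > specificity(r"\w+.+")   # 11 > 3
```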

A solution for one particular case: if L = 0 and K = 1 (just one regex and no garbage), then we can just find the LCS (longest common subsequence) of the strings and derive a corresponding regex from it. However, when we have "noise" (L > 0), this approach doesn't work.
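
For illustration, a rough sketch of that LCS idea (my own toy version: common characters become literals, and every gap becomes a lazy wildcard, though a smarter version would infer classes like \d+ from the gap contents):

```python
from functools import reduce
import re

def lcs(a: str, b: str) -> str:
    """Classic DP longest common subsequence of two strings."""
    dp = [[""] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a):
        for j, cb in enumerate(b):
            if ca == cb:
                dp[i + 1][j + 1] = dp[i][j] + ca
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j], key=len)
    return dp[len(a)][len(b)]

def regex_from_strings(strings):
    """Common subsequence becomes literals; gaps become lazy wildcards."""
    common = reduce(lcs, strings)
    return ".*?".join(re.escape(c) for c in common)

pattern = regex_from_strings(["ab123", "ab7", "ab42"])   # -> "a.*?b"
assert all(re.search(pattern, s) for s in ["ab123", "ab7", "ab42"])
```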

Any ideas (or pointers to existing work) are greatly appreciated.

Igor Krivokon
  • What is the information given? Just the N strings? Are the regexes already decided, but just hidden from you? You could easily generate a regex that matches a specific set of strings by joining them with "|". – Markus Jarderot May 21 '09 at 22:52
  • :) that would be cheating. I guess I need another criterion to prevent this kind of solution... Restrict the regex length, I guess. – Igor Krivokon May 21 '09 at 23:00
  • Your condition #3 can better be described as minimizing the number of matching strings that aren't in the given set of N strings. Given that you've got 3 things to minimize (although you could just as easily require L=0), you need to weight which factors are more important. – user57368 May 21 '09 at 23:31

3 Answers


What you are trying to do is language learning or language inference with a twist: instead of generalising over a set of given examples (and possibly counter-examples), you wish to infer a language with a small yet specific grammar.

I'm not sure how much research is being done on that. However, if you are also interested in finding the minimal (= general) regular expression that accepts all N strings, search for papers on MDL (Minimum Description Length) and FSMs (Finite State Machines).
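
To give a flavour of the MDL idea, here is a toy scoring function (my own rendering, not taken from any particular paper): the total description length is the bits needed for the regex itself plus the bits needed for each string given the regex, and the candidate with the smallest total wins:

```python
import re

def mdl_score(pattern: str, strings, bits_per_char: float = 8.0) -> float:
    """Toy MDL score: cost of describing the regex, plus the cost of
    describing each string given the regex. Non-matching ('garbage')
    strings must be spelled out in full."""
    model_cost = bits_per_char * len(pattern)
    # Hypothetical simplification: literal characters are 'paid for' by
    # the pattern, so only the remaining free choices cost bits.
    literals = len(re.sub(r"\\.|[.*+?\[\]()|]", "", pattern))
    data_cost = 0.0
    for s in strings:
        if re.fullmatch(pattern, s):
            data_cost += bits_per_char * max(0, len(s) - literals)
        else:
            data_cost += bits_per_char * len(s)
    return model_cost + data_cost

strings = ["ab123", "ab7", "ab42"]
# The specific pattern wins over the overly general one.
assert mdl_score(r"ab\d+", strings) < mdl_score(r".*", strings)
```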

Two interesting queries at Google Scholar:

Stephan202

The key words in academia are "grammatical inference". Unfortunately, there aren't any efficient, general algorithms to do the sort of thing you're proposing. What's your real problem?

Edit: it sounds like you might be interested in Data Description Languages. PADS (http://www.padsproj.org/) is a typical example.

Dave
    > What's your real problem? As a hobby project, I'm implementing a "magic editor" for big files, mostly data (plus occasional comments or "irregularities"). Quite often I need to change formatting, or remove a column of values, or something like this. Usually I do this kind of stuff with a quick perl one-liner. However, I wanted to create a more "visual" solution for people not familiar with regexes. They would just edit one line, and the other (similar) lines in the file change automatically. – Igor Krivokon May 21 '09 at 23:16

Nothing clever here; perhaps I don't fully understand the problem?

Why not just always reduce L to 0? Check each string against each regex; if a string doesn't match any of the regexes, it's garbage. If it does match, remember which regex matched it, group the strings accordingly, and run the LCS approach on each group as an L = 0, K = 1 subproblem to deduce each regex's definition.
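
A rough sketch of that procedure, assuming a pool of candidate regexes is already available to test against (producing the candidates is of course the hard part):

```python
import re
from collections import defaultdict

def partition(strings, candidate_regexes):
    """Assign each string to the first candidate regex it matches;
    whatever matches nothing is garbage, so each group becomes a clean
    L = 0, K = 1 subproblem for the LCS approach from the question."""
    groups = defaultdict(list)
    garbage = []
    for s in strings:
        for pattern in candidate_regexes:
            if re.fullmatch(pattern, s):
                groups[pattern].append(s)
                break
        else:                        # no candidate matched: garbage
            garbage.append(s)
    return groups, garbage
```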

benson