4

Recently I've discovered the python module pyparsing, a wonderful tool for parsing data by writing the grammar, not the parser. I'm new to the idea of context-free grammars, so please correct any false assumptions in this question.

Pyparsing can implement a BNF (Backus–Naur Form) context-free grammar. This grammar can be recursive, but can it have a forward look-ahead? I've been wondering about the answer to this since I stumbled across this question. Let me give you a concrete example. Consider the string:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

Let the grammar look something like:

<number> :: __<digit>__
<block>  :: <number>(x) (<number> <number> <number> .... x times)

i.e. read the first number token, save it as x and then consume the next x numbers and group them together. The parsed line should look like:

 [[1, 2], [3, 4, 5, 6], [7, 8, 9, 10, 11, 12, 13, 14], [15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]]

I've written a simple python MWE not using pyparsing so it's clear what I'm trying to do:

A = range(1,31)

B, sub_b = [], []
consume  = 0

for a in A:
    if consume: 
        sub_b.append(a)
        consume -= 1
    else:
        if sub_b: B.append(sub_b)
        sub_b = [a,]
        consume = a
B.append(sub_b)
print B

Two (related) questions: Can this be done with a BNF context-free grammar? If yes/no how can I do it with pyparsing?

Community
  • 1
  • 1
Hooked
  • 84,485
  • 43
  • 192
  • 261

3 Answers3

3

There is no such thing as a parametrized size in a context-free grammar-- or in a regular grammar, for that matter. Numerical parameters like your consume are not part of the CFG model, and it can be proved that getting the effect in a different way is impossible. You can write a production for any fixed length, though, so you can write productions for a block of length 1, length 2, length 3, etc.:

<block3> :: <number> <number> <number>

or, similarly, matching the length as a prefix or even as a post-fix:

<block3a> :: 3 <number> <number> <number>
<block3b> :: <number> <number> <number> 3

etc. So to do what you want, you just need a grammar containing rules of the kind, for all N you might need.

A given CFG will only include a finite number of these productions. It is mathematically impossible to write a CFG (in BNF or any other form) that can handle unlimited parametrized sizes, whether as a prefix or as a postfix. In practice, you might be able to update your CFG on the fly with new productions as needed. E.g., read a number N and create a rule blockN for your grammar if it's not already present. But there's no single CFG that can capture unbounded parametrized sizes.

Edit, since you also asked about context-sensitive grammars: That still won't do it. The problem is the use of integer arithmetic, not the class of the grammar. Any formal language on the Chomsky hierarchy is defined in terms of a finite number of symbols (tokens), and since are infinitely many integers, they cannot be assigned distinct meanings (note that your parsing procedure relies on integer arithmetic).

If you were to pre-process the length into a sequence of as many stars (* * * 4 7 10), the CFG parser is trivial:

<group> :: * <group> <number>
<group> :: * <number>

This is just the so-called a^n b^n language. You could also have symbols that mean "ten", etc. But without pre-processing, the only solution (and what your procedure or a Turing machine does in practice) is to interpret numeral notation in your grammar. E.g., parse "21" as ten ten one. I doubt this can be done in a CFG (the problem is handling arbitrarily long numerals without separate rules for millions, billions etc.), but I'm not really sure. Either way it is only interesting as an academic exercise, since using real integers is so much easier. I'm sure people have studied the properties of formal languages with integers, but I can't say anything about that.

alexis
  • 48,685
  • 16
  • 101
  • 161
  • Thank you alexis, you got to the heart of the question. The parameterized workaround for fixed length `n` is a great idea for real world problems. One question: if a fixed CFG or a regular grammar cannot parse an arbitrary-length string, is there a generalization (of grammar) that can? Clearly a Turing machine can do it - my example posted above is in a Turing complete language - but is there some intermediary? – Hooked Apr 05 '12 at 15:53
  • After context-free, the next step in the Chomsky hierarchy are context-sensitive languages. More recently, linguists have identified something called "weakly context-sensitive" languages, which are somewhere between context-free and context-sensitive. For each class of languages there is one or more formal automaton that "recognizes" them. Google "Chomsky hierarchy" and "weakly context-sensitive" for the details. – alexis Apr 05 '12 at 18:17
  • But note that "recognize" and "parse" are not quite the same thing: There are CFGs for every context-free language, but some can be parsed more efficiently than others (i.e., with minimal look-ahead). Massive ambiguity may be a problem for parsing (you have to try a lot of options), but in principle a string either fits the grammar, or it doesn't. – alexis Apr 05 '12 at 18:21
3

Pyparsing includes the helper countedArray which does exactly what you ask. It takes a single argument expr, and will parse an integer followed by 'n' instances of expr. For your input string, you could write:

from pyparsing import *
from pprint import pprint

# make a long string of integers, starting at 0
source = ' '.join(map(str, range(50)))

# define an integer as a 'word' of numeric digits
integer = Word(nums)

# add a parse action to convert the parsed strings to integers, at parse time
integer.setParseAction(lambda t:int(t[0]))

# parse the source string into counted strings of integers, and print out
# with pprint
lists = OneOrMore(countedArray(integer)).parseString(source)
pprint(lists.asList())

prints out:

[[],
 [2],
 [4, 5, 6],
 [8, 9, 10, 11, 12, 13, 14],
 [16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]]
  • The parse action attached to integer converts the numeric strings back to integers (no quotes around the numbers in the lists).

  • Note the opening empty list, which was created by the leading '0' of the source string.

  • Even though the source string contains more numbers past 30, there aren't enough for another counted array.

countedArray achieves this trick by creating an expression that matches a leading integer, followed by a captive undefined Forward expression. A parse action attached to the leading integer injects a expr*n expression into the captive Forward, which then parses the following 'n' exprs. You could as easily write countedArray(Word(alphas)) and parse "4 the quick brown fox" to get ['the', 'quick', 'brown', 'fox']. As @Aaron points out in his answer, there is no need to preserve the leading counter, as you can easily get it by getting the len of the returned list.

pyparsing also supports more traditional lookaheads using the FollowedBy and NotAny constructs (NotAny can be abbreviated by using the '~' operator). In this string, "Four score and seven years ago..." you can pick out the strings that are followed by words that start with vowels using Word(alphas) + FollowedBy(Word('aeiou',alphas)), which would match 'score' and 'years'; or words that are not followed by a period using Word(alphas) + ~Literal('.'), which would match every word but 'ago'. In this case, trying to search thru an input string for matches, you would use searchString or scanString instead of parseString.

PaulMcG
  • 62,419
  • 16
  • 94
  • 130
  • Thanks Paul! I really appreciate the pyarsing answer, I know it will come in handy later. I'm going to keep the answer by alexis since it directly answered the CFG question. From his answer it seems that a CFG (or even a CSG) can not parse an arbitrary `countedArray`. Since `pyparsing` _can_ handle this string, I guess its safe to place it at the top of the Chomsky hierarchy? – Hooked Apr 06 '12 at 14:21
1

Not really.

The whole point of having a grammar is that it allows to define data in a human understandable way. Your demo string is readable but why does 1 mean something else than the 3?

The correct input in your case would be:

[[2], [4, 5, 6], [8, 9, 10, 11, 12, 13, 14], [16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]]

and the grammar would look like so:

<model> :: <term>
<list>  :: <term> | <term> <opt-whitespace> <list>
<term>  :: '[' <list> ']'

You can then recover the missing count element by looking at the length of the list.

Aaron Digulla
  • 321,842
  • 108
  • 597
  • 820
  • @Arron, respectfully I think you missed the intention of the question. In this case, I don't care if the data is parsed in a "human understandable" way, but if a _context-free grammar can turn input shown into the output shown_. The gotcha I think is that the input string does not have a fixed length. Additionally I'm not quite sure how your BNF will take my input string and turn it into your given output. How does it know how many `/` to consume before the next `` (which is the intent of my question)? – Hooked Apr 05 '12 at 16:13
  • My point is that "context free grammar" doesn't make sense in your use case because your input isn't context free: The value of some elements determine how the parsing should happen. In CFGs, each value means the same thing to the parser; keywords are used to tell the parser which rule to apply. – Aaron Digulla Apr 06 '12 at 11:15