Parsing xdot draw attributes with pyparsing

Question

I'm new to pyparsing and am trying to work out how to parse the draw (and similar) attributes in xdot files. Several items begin with an integer that gives the number of elements that follow, somewhat like netstrings. I've looked at some of the sample code that deals with netstring-like constructs, but I can't get it to work for my case.

Here are some samples:

Polygon with 3 points (the 3 after the P indicates the number of points following):
P 3 811 190 815 180 806 185 should parse to 'P', [[811, 190], [815, 180], [806, 185]]

Polygon with 2 points:
P 2 811 190 815 180 806 185 should parse to 'P', [[811, 190], [815, 180]] (with unparsed text at the end)

Pen fill colour (the 4 after the C indicates the number of characters after the '-' to consume):
C 4 -blue should parse to 'C', 'blue'

Updated Info:
I think I was misleading by putting the examples on their own lines, without more context. Here is a real example:

S 5 -solid S 15 -setlinewidth(1) c 5 -black C 5 -black P 3 690 181 680 179 687 187

See http://www.graphviz.org/doc/info/output.html#d:xdot for the actual spec.

Note that there could be significant spaces in the text fields - setlinewidth(1) above could be "abcd efgh hijk " and as long as it was exactly 15 characters, it should be linked with the 'S' tag. There should be exactly 7 numbers (the initial counter + 3 pairs) after the 'P' tag, and anything else should raise a parse error, since there could be more tags following (on the same line), but numbers by themselves are not valid.

Hopefully that makes things a little clearer.

After some more thought, I came up with an answer (given below). I would still love to hear other views and whether there is a better way. Still, I'm very happy with PyParsing - even my result below (which is still a little bit 'manual') is far easier to write (and read) than doing it 'by hand'. – Rasjid Wilcox, Mar 28 '12 at 10:40
So `P 2 811 190 815 180 806 185` raises a parse error, not, as you said before, a result "with unparsed text at the end"? – Hooked, Mar 28 '12 at 15:51
@Hooked: Sorry about that - I was trying to keep things simple, and when I was just testing things out myself, it made sense just to get the result I was looking for and not worry about parse errors. But `S 5 -solid P 1 690 181 680 179 C 4 -blue` should really give a parse error at the 680 (column 24 I think). – Rasjid Wilcox, Mar 28 '12 at 22:19

Rasjid Wilcox · Answer 1 · 2012-03-28T10:47:44.850

Well, this is what I came up with in the end, using scanString.

from pyparsing import Combine, Group, Literal, Optional, Word, ZeroOrMore, nums

int_ = Word(nums).setParseAction(lambda t: int(t[0]))
float_ = Combine(Word(nums) + Optional('.' + ZeroOrMore(Word(nums, exact=1)))).setParseAction(lambda t: float(t[0]))
point = Group(int_ * 2).setParseAction(lambda t: tuple(t[0]))
ellipse = ((Literal('E') ^ 'e') + point + int_ + int_).setResultsName('ellipse')
# Only the tag and the count are matched here; the counted items are consumed by hand below.
n_points_start = (Word('PpLBb', exact=1) + int_).setResultsName('n_points')
text_start = ((('T' + point + int_ * 3) ^ ('F' + float_ + int_) ^ (Word('CcS') + int_)) + '-').setResultsName('text')
xdot_attr_parser = ellipse ^ n_points_start ^ text_start

def parse_xdot_extended_attributes(data):
    results = []
    while True:
        try:
            tokens, start, end = next(xdot_attr_parser.scanString(data, maxMatches=1))
            data = data[end:]
            name = tokens.getName()
            if name == 'n_points':
                number_to_get = int(tokens[-1])
                points, start, end = next((point * number_to_get).scanString(data, maxMatches=1))
                result = tokens[:1]
                result.append(points[:])
                results.append(result)
                data = data[end:]
            elif name == 'text':
                number_to_get = int(tokens[-2])
                text, data = data[:number_to_get], data[number_to_get:]
                result = tokens[:-2]
                result.append(text)
                results.append(result)
            else:
                results.append(tokens)
        except StopIteration:
            break
    return results

Hooked · Answer 2 · 2012-03-28T15:54:11.303

In response to OP's edit, the answer below is not complete anymore.

I'm going to try and get to the core of your question here and ignore the finer details. Hopefully it will put you on the right track to the rest of your grammar. Essentially you are asking, given the two lines:

P 3 811 190 815 180 806 185
P 2 811 190 815 180 806 185

how can you parse the data such that in the second line only two points are read? Personally, I would read all of the data and post-parse. You can make the job immeasurably easier for yourself if you name the results. For example:

from pyparsing import *

EOL = LineEnd().suppress()

number = Word(nums).setParseAction(lambda x: int(x[0]))
point_pair = Group(number + number)

poly_flag  = Group(Literal("P") + number("length"))("flag")
poly_type  = poly_flag + Group(OneOrMore(point_pair))("data")

xdot_line = Group(poly_type) + EOL
grammar   = OneOrMore(xdot_line)

Note that the results are named data, flag and length; these names will come in handy later. Let's parse and process the string:

S = "P 3 811 190 815 180 806 185\nP 2 811 190 815 180 806 185\n"
P = grammar.parseString(S)

for line in P:
    L = line["flag"]["length"]
    # Discard any points beyond the declared count.
    while len(line["data"]) > L:
        line["data"].pop()
    print(line)

This gives a useful, structured result:

[['P', 3], [[811, 190], [815, 180], [806, 185]]]
[['P', 2], [[811, 190], [815, 180]]]

Extending the grammar

From here, you can build the pieces of the grammar independently, one at a time. Each time you add a new type (say, a pen_fill_type rule for the pen fill colour), add it to xdot_line as an alternative:

xdot_line = Group(poly_type | pen_fill_type) + EOL
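
The answer does not define pen_fill_type. As a minimal sketch, assuming the colour text contains no whitespace (something the updated question says cannot be relied on in general), a hypothetical rule for `C 4 -blue` could reuse the same flag, length and data names and check the declared length after parsing. The names pen_flag and pen_fill_type and the ValueError check are illustrative, not part of the original answer:

```python
from pyparsing import (Group, LineEnd, Literal, OneOrMore, Suppress, Word,
                       nums, printables)

EOL = LineEnd().suppress()
number = Word(nums).setParseAction(lambda x: int(x[0]))
point_pair = Group(number + number)

poly_flag = Group(Literal("P") + number("length"))("flag")
poly_type = poly_flag + Group(OneOrMore(point_pair))("data")

# Hypothetical pen-fill rule: tag, declared length, '-', then one run of
# non-whitespace characters (so text containing spaces is NOT handled).
pen_flag = Group(Literal("C") + number("length"))("flag")
pen_fill_type = pen_flag + Suppress("-") + Word(printables)("data")

xdot_line = Group(poly_type | pen_fill_type) + EOL
grammar = OneOrMore(xdot_line)

parsed = grammar.parseString("P 2 811 190 815 180 806 185\nC 4 -blue\n")
for line in parsed:
    n = line["flag"]["length"]
    if line["flag"][0] == "P":
        # Same post-parse trimming as above: drop points beyond the count.
        while len(line["data"]) > n:
            line["data"].pop()
    elif len(line["data"]) != n:
        # For text, the declared length can only be verified, not enforced.
        raise ValueError("declared length does not match the text")
    print(line)
```

Because the length is only checked after the fact, this approach cannot split `S 15 -abcd efgh hijk ` correctly; that case needs the count to drive the parse itself.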

edited Mar 28 '12 at 15:54

answered Mar 28 '12 at 14:14

Hooked


+1 for using results names. I personally prefer dotted attribute notation over dict notation, allowing you to write ``line.flag.length`` and ``line.data``. – PaulMcG Mar 29 '12 at 12:29
@PaulMcGuire I think they both have their uses, in this case the dotted notation may be cleaner, but I often pass the result name from a function call, making the dict notation useful. – Hooked Mar 29 '12 at 13:48
@PaulMcGuire being the resident expert on all things `pyparsing`, thank you for all the help you've given on this site! I would love to know if there is a way to consume, as the OP seems to want, the next `n` characters (whitespace included), where `n` is read from a prior token. – Hooked Mar 29 '12 at 13:51
@PaulMcGuire: Yes, that is the real issue. Is there a way to consume the next 'n' characters (or, more generally, n tokens), where n is read from a prior token, without resorting to the scanString method I used? – Rasjid Wilcox Mar 30 '12 at 03:12
The `countedArray(expr)` helper reads a leading integer 'n', followed by 'n' `expr` expressions, by using a captive Forward expression for the variable repetition part. I just tried a crazy experiment and it works - try `CharsNotIn("",exact=n)` for the variable repetition instead of `n*expr`. That is, extract the code for `countedArray` and write your own derivative, maybe call it `countedChars`. – PaulMcG Apr 01 '12 at 06:23
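
A minimal sketch of the idea in the last comment, not PaulMcG's actual code. The hypothetical helper `counted` mimics what `countedArray` is described as doing (a count whose parse action rebinds a Forward), and is used once with `CharsNotIn("", exact=n)` for text and once with n points. The names `counted`, `counted_chars`, `counted_points` and `xdot_ops`, and the restriction to the C, c, S, P, p, L, B and b tags, are assumptions made for illustration:

```python
from pyparsing import (CharsNotIn, Empty, Forward, Group, OneOrMore,
                       ParseException, StringEnd, Suppress, Word, nums, oneOf)

integer = Word(nums).setParseAction(lambda t: int(t[0]))
point = Group(integer + integer)


def counted(make_body, prefix=None):
    """Hypothetical helper: match an integer n, an optional literal prefix,
    then whatever expression make_body(n) returns."""
    body = Forward()

    def bind_body(tokens):
        # Rebind the Forward now that n is known.
        body << make_body(tokens[0])
        return []  # drop the count itself from the results

    count = Word(nums).setParseAction(lambda t: int(t[0]))
    count.addParseAction(bind_body, callDuringTry=True)
    if prefix is None:
        return count + body
    return count + Suppress(prefix) + body


# Exactly n characters of any kind; CharsNotIn does not skip leading
# whitespace, so spaces inside the text are kept.
counted_chars = counted(lambda n: CharsNotIn("", exact=n) if n else Empty(), "-")
# Exactly n points, gathered into one list.
counted_points = counted(lambda n: Group(point * n) if n else Group(Empty()))

text_op = Group(oneOf("C c S") + counted_chars)
points_op = Group(oneOf("P p L B b") + counted_points)
xdot_ops = OneOrMore(text_op | points_op) + StringEnd()

sample = "S 5 -solid S 15 -setlinewidth(1) c 5 -black C 5 -black P 3 690 181 680 179 687 187"
print(xdot_ops.parseString(sample).asList())

try:
    # Only one point is declared, so the extra numbers must be rejected.
    xdot_ops.parseString("S 5 -solid P 1 690 181 680 179 C 4 -blue")
except ParseException as err:
    print("parse error at column", err.col)
```

Because the count drives the grammar directly, stray numbers after a completed polygon fail at `StringEnd()` instead of being skipped, which is the behaviour the question asks for. The T, F and ellipse operations are left out of this sketch.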
