I'm given an XML file that looks like this:

...
<a foobar="1">
    //Begin match here
    <a foobar="1">
        <a foobar="1">
            <a foobar="1"/>
            <a foobar="2"/>
        </a>
        <a foobar="2">
            <a foobar="3"/>
            <a foobar="4"/>
        </a>
    </a>
    //End match here
    //Begin match here
    <a foobar="2">
        <a foobar="2">
            <a foobar="5"/>
            <a foobar="6"/>
        </a>
    </a>
    //End match here
</a>
<a foobar="3">
    //Begin match here
    <a foobar="3">
        ...
    </a>
    //End match here
</a>
...

*The comments were added by me; they don't actually exist in the file.

**In my example the values are sequential; that's not the case in the file I'm dealing with right now.

***Each indentation level is strictly indented by four spaces per level. Matching the whitespace is not important as I only need to be able to separate the data, but if it's easier to match the whitespace as well then that's fine too

Essentially, I'm trying to match every tag on the first indentation level (along with the full contents of its tree). It's tricky because all of the tags follow the same naming structure, <a foobar="#">, so nested tags look identical to the ones I want.

Ideally, I want to generate a list of the multiline strings using re.findall, but I can't come up with a multiline expression that would work for this.

I've tried this expression:

re.findall("\n( {4}<a foobar=\"[0-9]+\">.+ {4}</a>)\n", filecontents, re.DOTALL)

But that simply matches one multiline string from the beginning of what should be the first match to the end of what should be the last match.

I've been struggling with this for far longer than I'd like to admit, so any help with creating an expression to match these would be greatly appreciated. Apologies if I wasn't able to explain this very well; if you need more info, please let me know!

Jon Warren
  • try XPath for xml: http://www.freeformatter.com/xpath-tester.html – deathangel908 Jan 03 '17 at 19:57
  • Do you actually want to match sections of text in the XML file, or do you just want to get certain XML elements? It's likely easier to use something like XPath that is aware of the XML structure, rather than trying to match on the raw text. – BrenBarn Jan 03 '17 at 20:05
  • Possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – inetphantom Jan 03 '17 at 20:22
  • Try a regex like this: [`(?ms)^ {4}`](https://regex101.com/r/4E11b3/1) – bobble bubble Jan 03 '17 at 20:55

1 Answer

As I noted in the comments, XPath is a better fit for this than regex, because it works on the XML structure rather than on the raw text.

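import libxml2

# Parse the file and create an XPath context bound to the document
doc = libxml2.parseFile("your_file.xml")
ctxt = doc.xpathNewContext()
# "//a" selects every <a> element at any depth;
# narrow the path to the nesting level you actually need
res = ctxt.xpathEval("//a")
print(res)
# Release the native resources explicitly
doc.freeDoc()
ctxt.xpathFreeContext()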
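
If libxml2 isn't available, roughly the same idea can be sketched with the standard library's `xml.etree.ElementTree` (swapped in here for libxml2; it only supports a subset of XPath). This is a minimal sketch under assumptions, not a definitive solution: the `<root>` wrapper, the `SAMPLE` string and the `second_level_blocks` helper are all hypothetical, and the path `./a/a` assumes the outer `<a>` elements sit directly under a single root element.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample modelled on the question. The <root> wrapper is an
# assumption: a well-formed XML document needs exactly one root element.
SAMPLE = """<root>
<a foobar="1">
    <a foobar="1">
        <a foobar="1">
            <a foobar="1"/>
            <a foobar="2"/>
        </a>
    </a>
    <a foobar="2">
        <a foobar="2">
            <a foobar="5"/>
        </a>
    </a>
</a>
<a foobar="3">
    <a foobar="3"/>
</a>
</root>"""


def second_level_blocks(xml_text):
    """Return each <a> nested directly inside a top-level <a>, as a string."""
    root = ET.fromstring(xml_text)
    blocks = []
    # "./a/a" selects <a> children of the <a> children of the root
    for elem in root.findall("./a/a"):
        # Drop trailing whitespace that would otherwise be serialized too
        elem.tail = None
        blocks.append(ET.tostring(elem, encoding="unicode"))
    return blocks


blocks = second_level_blocks(SAMPLE)
for block in blocks:
    print(block)
    print("---")
```

Each entry in `blocks` is one multiline string holding a matched element and its whole subtree, which is the list the question was trying to build with `re.findall`. The serialized text may differ slightly from the file (for example in how self-closing tags are written), so treat it as data rather than a byte-for-byte copy.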
deathangel908
  • Wow, looks like a little further research on my end showed that one should NOT use regex to parse through XML. I guess that explains why I couldn't find any good working examples of it. I ended up using lxml and came up with a solution through that, so thank you! – Jon Warren Jan 04 '17 at 20:11