I'm given an XML file that looks like this:
...
<a foobar="1">
    //Begin match here
    <a foobar="1">
        <a foobar="1">
            <a foobar="1"/>
            <a foobar="2"/>
        </a>
        <a foobar="2">
            <a foobar="3"/>
            <a foobar="4"/>
        </a>
    </a>
    //End match here
    //Begin match here
    <a foobar="2">
        <a foobar="2">
            <a foobar="5"/>
            <a foobar="6"/>
        </a>
    </a>
    //End match here
</a>
<a foobar="3">
    //Begin match here
    <a foobar="3">
        ...
    </a>
    //End match here
</a>
...
*Comments were added by me; they don't actually exist in the file.
**In my example the values are sequential; that's not the case in the file I'm actually dealing with.
***Each level is indented by exactly four spaces. Matching the whitespace isn't important, since I only need to separate the data, but if it's easier to match the whitespace as well, that's fine too.
Essentially, I'm trying to match every tag at the first indentation level, together with the entire contents of its tree. It's tricky because all of the tags, at every depth, follow the same naming structure: <a foobar="#">
Ideally, I want to generate a list of the multiline strings using re.findall, but I can't come up with a multiline expression that would work for this.
I've tried this expression:
re.findall("\n( {4}<a foobar=\"[0-9]+\">.+ {4}</a>)\n", filecontents, re.DOTALL)
But that simply returns one multiline string, running from the beginning of what should be the first match to the end of what should be the last match.
I've been struggling with this for far longer than I'd like to admit, so any help with creating an expression to match these blocks would be greatly appreciated. Apologies if I haven't explained this very well; if you need more info, please let me know!
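For reference, here is a minimal sketch of one direction that might work. It is not a definitive solution. It assumes the file really does use exactly four spaces per level, uses "\n" line endings, and that the first-level tags are never self-closing. The names SAMPLE, LEVEL_ONE and first_level_blocks are made up for illustration, and the sample is a shortened stand-in for the real file. The idea is to make the ".+" lazy and to anchor both the opening and the closing tag to a line that starts with exactly four spaces, so deeper "</a>" lines cannot end a match.

```python
import re

# Hypothetical, shortened sample with the same shape as the file described
# above (four spaces per indentation level).
SAMPLE = """\
<a foobar="1">
    <a foobar="1">
        <a foobar="1">
            <a foobar="1"/>
            <a foobar="2"/>
        </a>
    </a>
    <a foobar="2">
        <a foobar="2">
            <a foobar="5"/>
        </a>
    </a>
</a>
<a foobar="3">
    <a foobar="3">
        <a foobar="7"/>
    </a>
</a>
"""

# re.MULTILINE makes ^ and $ match at line boundaries; re.DOTALL lets "."
# cross newlines. The lazy ".*?" stops at the first closing tag that is
# indented by exactly four spaces, instead of running to the last one.
LEVEL_ONE = re.compile(
    r'^ {4}<a foobar="[0-9]+">$.*?^ {4}</a>$',
    re.DOTALL | re.MULTILINE,
)


def first_level_blocks(text):
    """Return each first-level block (opening tag through closing tag)."""
    return LEVEL_ONE.findall(text)


blocks = first_level_blocks(SAMPLE)
for block in blocks:
    print(block)
    print("-----")
```

Because the pattern leans entirely on the indentation, it would stop working if the file were re-indented. Parsing the file with xml.etree.ElementTree and iterating over the children of each top-level element is probably the more robust route if the file is well-formed XML.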