While working on an answer to this question, I came up with this regex:
(?:(?!\2)(?:,foo=([^,]*),(?=())|.))*\2bar=2
(Note: this regex requires the PyPI regex
module)
(Short explanation: The regex relies on the fact that capture groups in lookaheads can't change their value after they've matched once, so after the first foo=
is found, the (?=())
matches and from that point onwards (?!\2)
will always fail.)
This regex works correctly with the 2 examples given in the question:
>>> pattern = r'(?:(?!\2)(?:,foo=([^,]*),(?=())|.))*\2bar=2'
>>> regex.match(pattern, 'baz=0,foo=1,bar=2,foo=3,bar=4').group(1)
'1'
>>> regex.match(pattern, 'baz=0,foo=1,foo=1,bar=2')
>>>
But something strange happens if there's an occurrence of foo=
after a bar=2
:
>>> # this doesn't match, as expected:
>>> regex.match(pattern, 'notfoo=1,bar=2')
>>> # but how the heck does it match this ?!
>>> regex.match(pattern, 'notfoo=1,bar=2,foo=3,')
<regex.Match object; span=(0, 14), match='notfoo=1,bar=2'>
As you can see, the string 'notfoo=1,bar=2,foo=3,'
produced a match of notfoo=1,bar=2
. The foo=3,
isn't even included in the match, but if it's removed, the regex no longer matches! How is this possible? Is this a bug in the regex
module?