
I have a comma-separated string of key=value pairs like this:

foo=1,foo=1,bar=2

In this string I want to capture the value of the first foo, but only if it's immediately followed by bar=2.

Examples:

  • In this string, the value 1 should be captured:

     baz=0,foo=1,bar=2,foo=3,bar=4
    
  • In this string, nothing should be captured:

     baz=0,foo=1,foo=1,bar=2
    

My current solution uses a tempered greedy token, but that forces me to duplicate the foo=[^,]*, part of the regex:

^(?:(?!foo=[^,]*,).)*foo=([^,]*),bar=2(?:,|$)

Is there any way to do this without having to duplicate such a big part of the regex?
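
For reference, here is a quick check of the pattern above against the two example strings. It is only a sketch using Python's standard re module; the question does not name an engine, so Python is an assumption, and the helper name capture_first_foo is made up for illustration.

```python
import re

# The tempered greedy token pattern from the question, unchanged.
PATTERN = re.compile(r'^(?:(?!foo=[^,]*,).)*foo=([^,]*),bar=2(?:,|$)')

def capture_first_foo(s):
    """Return the first foo's value if bar=2 follows it directly, else None.

    Hypothetical helper, named only for this example.
    """
    m = PATTERN.match(s)
    return m.group(1) if m else None

print(capture_first_foo('baz=0,foo=1,bar=2,foo=3,bar=4'))  # 1
print(capture_first_foo('baz=0,foo=1,foo=1,bar=2'))        # None
```

The duplication is visible in the pattern itself: foo=[^,]* appears once inside the negative lookahead and once in the part that actually captures.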

Aran-Fey

2 Answers


It's pretty easy with backtracking control verbs (available in PCRE and Perl, but not in every regex engine):

(?<![^,])foo=([^,]*)(*COMMIT),bar=2(?![^,])

We match a position not preceded by a non-comma character (i.e. the start of the string or immediately after ,), followed by foo=, followed by 0 or more non-comma characters (which we capture). This is the foo=... part.

We then commit to the first match found and require a ,bar=2 match, not followed by a non-comma character (i.e. a , or the end of the string).
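
In an engine without backtracking control verbs, the same "find the first foo, then commit" logic can be imitated procedurally. The sketch below is not the (*COMMIT) regex itself but a two-step stand-in, written with Python's standard re module as an assumption; the function name first_foo_before_bar is hypothetical.

```python
import re

# Step 1: the foo=... part, anchored to the start of the string or a comma.
FIRST_FOO = re.compile(r'(?<![^,])foo=([^,]*)')
# Step 2: what must follow immediately after the first foo=... pair.
BAR_NEXT = re.compile(r',bar=2(?![^,])')

def first_foo_before_bar(s):
    """Hypothetical helper: value of the first foo if ,bar=2 follows, else None."""
    first = FIRST_FOO.search(s)   # leftmost foo=... only
    if first is None:
        return None
    # "Commit": test only this occurrence and never move on to a later foo.
    if BAR_NEXT.match(s, first.end()):
        return first.group(1)
    return None

print(first_foo_before_bar('baz=0,foo=1,bar=2,foo=3,bar=4'))  # 1
print(first_foo_before_bar('baz=0,foo=1,foo=1,bar=2'))        # None
```

Splitting the work into two patterns trades the single-regex form for portability; the foo=... part is still written only once.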

melpomene

Disclaimer: This only works in select regex engines.

Some regex engines have a "feature" that we can abuse: Capture groups in lookaheads are possessive; once they have matched they can never change their value again.

Making use of this "feature", the regex can be written like this:

.*?(?!\1)thing_you_want_the_first_occurrence_of(?=())rest_of_the_regex

In this specific case, that looks like this (the indices of the capture groups are shifted by 1 since foo=([^,]*) contains a capture group):

.*?(?!\2)(?<![^,])foo=([^,]*),(?=())bar=2(?![^,])

So, how does it work?

After the first occurrence of foo= is found, the empty group in (?=()) matches. Because it's inside a lookahead, it can never change its value anymore; not even backtracking can affect it. Up to that point, \2 refers to a group that hasn't matched yet, so the backreference fails and (?!\2) succeeds. From that point onwards, \2 matches the empty string at every position, so (?!\2) will never match again. The fact that the first occurrence of foo= has been found is now "locked in" and cannot be undone. If the regex backtracks and tries to make the .*? match more text, the (?!\2) prevents this.

Demo using Python's third-party regex module from PyPI:

>>> import regex  # third-party package: pip install regex
>>> pattern = r'.*?(?!\2)(?<![^,])foo=([^,]*),(?=())bar=2(?![^,])'
>>> regex.match(pattern, 'baz=0,foo=1,bar=2,foo=3,bar=4').group(1)
'1'
>>> regex.match(pattern, 'baz=0,foo=1,foo=1,bar=2')
>>>
Aran-Fey