3

If this

(°[0-5])

matches °4

and this

((°[0-5][0-9]))

matches °44

Why does this

((°[0-5])|(°[0-5][0-9]))

match °4 but not °44?

BattlFrog
  • Hrmm... I'm not set up to test C#, but the regex itself looks legit (and `/((°[0-5])|(°[0-5][0-9]))/.test('°44')` passes the sniff-test in JS-land...). – jmar777 Jul 21 '15 at 20:10
  • Although honestly I'd write that as `°[0-5][0-9]?` instead (unless you actually need the captures). – jmar777 Jul 21 '15 at 20:11
  • I'd use `(°[0-5]{1,2})` to match for ° followed by one _or_ two characters between 0 and 5. – christophano Jul 21 '15 at 20:14
  • @christophano That's not equivalent to the original regex. E.g., no match for `°48`. – jmar777 Jul 21 '15 at 20:15
  • @jmar777 absolutely, my bad. I misread the question. That being the case, the regex you posted is absolutely right. – christophano Jul 21 '15 at 20:19

3 Answers

3

Because with alternation (logical OR) the regex engine tries the alternatives from left to right and accepts the first one that succeeds. Here the first alternative, °[0-5], matches °4 in °44, so the engine returns °4 and never tries the second alternative, °[0-5][0-9]:

((°[0-5])|(°[0-5][0-9]))

A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way. This can be used inside groups (see below) as well. As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy. To match a literal '|', use \|, or enclose it inside a character class, as in [|].
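A quick way to see the left-to-right behaviour is to run both orderings of the alternation against the same input. This is only a minimal sketch in Python's `re` module (the question is about C#, so treat it as an illustration of the shared rule, not of the .NET API); the variable names are made up for the example.

```python
import re

text = "°44"

# Shorter alternative first: the one-digit branch succeeds, so it is accepted.
short_first = re.compile(r"((°[0-5])|(°[0-5][0-9]))")

# Longer alternative first: the two-digit branch is tried before the one-digit one.
long_first = re.compile(r"((°[0-5][0-9])|(°[0-5]))")

short_match = short_first.search(text).group(0)
long_match = long_first.search(text).group(0)

print(short_match)  # °4
print(long_match)   # °44
```

Swapping the order of the alternatives is enough to get the longer match, because the two-digit branch now gets the first chance to succeed.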

Mazdak
  • The source of the quote is from Python [7.2. re — Regular expression operations](https://docs.python.org/2/library/re.html). – Wiktor Stribiżew Jul 21 '15 at 20:50
  • @stribizhev Yeah, and it doesn't make any difference, since both languages use the same rule here, which comes from the traditional NFA. – Mazdak Jul 21 '15 at 20:55
1

You are putting the shorter alternative first in the regex alternation, so it wins. Better to use this regex, which matches both strings:

°[0-5][0-9]?

RegEx Demo
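As a rough check of the optional-digit pattern, here is a small Python sketch (Python is used only for illustration; the sample inputs and the `results` name are assumptions for the example, not from the question):

```python
import re

# One degree sign, one digit from 0-5, then an optional second digit.
pattern = re.compile(r"°[0-5][0-9]?")

samples = ["°4", "°44", "°48", "°64"]

# Map each sample to the text matched, or None when there is no match.
results = {}
for s in samples:
    m = pattern.search(s)
    results[s] = m.group(0) if m else None

print(results)
```

Because `?` is greedy, the second digit is consumed whenever it is present, so °44 matches in full while °4 still matches on its own.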

anubhava
1

Because the alternation operator | tries the alternatives in the order specified and selects the first successful match. The other alternatives will never be tried unless something later in the regular expression causes backtracking. For instance, this regular expression

(a|ab|abc)

when fed this input:

abcdefghi

will only ever match a. However, if the regular expression is changed to

(a|ab|abc)d

It will first match a. Since the next character is not d, it backtracks and tries the next alternative, matching ab. Since the next character is still not d, it backtracks again and matches abc, and since the next character is now d, the match succeeds.
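The two cases can be reproduced with a short Python sketch (again only an illustration; Python's `re` is a backtracking engine, so it follows the same steps described above):

```python
import re

text = "abcdefghi"

# Nothing follows the group, so the first alternative that succeeds is kept.
first = re.search(r"(a|ab|abc)", text)

# The trailing d forces backtracking through the alternatives until abc fits.
second = re.search(r"(a|ab|abc)d", text)

print(first.group(1))   # a
print(second.group(1))  # abc
print(second.group(0))  # abcd
```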

Why would you not reduce your regular expression from

((°[0-5])|(°[0-5][0-9]))

to this?

°[0-5][0-9]?

It's simpler and easier to understand.

Nicholas Carey