I need to find all matches in a string for either a predefined token (AB-, BCC- or CDD-) or the generic pattern [A-Z]{2,4}-. Predefined tokens have the highest priority. That is:

regex.findAllIn("XBCC-").toList must always return List(BCC-), not List(XBCC-)

but:

regex.findAllIn("XTEST-").toList must return List(TEST-)

I tried something like this:

val regex = "((AB|BCC|CDD)|[A-Z]{2,4})-".r

But it doesn't work properly: for "XBCC-" it returns List(XBCC-) instead of List(BCC-).
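A minimal sketch reproducing the problem, assuming a wrapper object named `NaiveAlternation` and a helper `findTokens` (both names are hypothetical, introduced only for illustration). The engine tries every alternative at the leftmost position first, so at index 0 the generic branch `[A-Z]{2,4}` succeeds on "XBCC" before the predefined branch ever gets a chance at index 1.

```scala
object NaiveAlternation {
  // The pattern from the question, unchanged.
  val regex = "((AB|BCC|CDD)|[A-Z]{2,4})-".r

  // Hypothetical helper: collect every match in the input.
  def findTokens(text: String): List[String] =
    regex.findAllIn(text).toList
}

// NaiveAlternation.findTokens("XBCC-")  // List(XBCC-), not the desired List(BCC-)
// NaiveAlternation.findTokens("XTEST-") // List(TEST-), as desired
```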

Jason Aller
user2621486
  • You should do that in 2 steps. First, check with the predefined values, then use the generic one. – Wiktor Stribiżew May 18 '16 at 09:51
  • Is it impossible to do in one regular expression? – user2621486 May 18 '16 at 10:01
  • No, it is not possible with Scala regex. See [some explanations here](http://stackoverflow.com/questions/35606426/order-of-regular-expression-operator/35606463#35606463), and here is a [related answer](http://stackoverflow.com/questions/35944441/lazy-quantifier-not-working-as-i-would-expect/35944635#35944635) showing how the regex engine works. The whole problem is that your expression is not anchored on the left, so the engine can start matching with any alternative branch at each location. – Wiktor Stribiżew May 18 '16 at 10:08
  • Well, on second thought, you might try to restrict the `[A-Z]{2,4}`: [`(?:(?:AB|BCC|CDD)|(?![A-Z]*(?:AB|BCC|CDD)-)[A-Z]{2,4})-`](https://regex101.com/r/vN2nW6/2) – Wiktor Stribiżew May 18 '16 at 10:15
  • Why do you need to do it with a single regex? You're definitely making things hard for yourself. – The Archetypal Paul May 18 '16 at 10:28
  • Thank you, it seems to be working! – user2621486 May 18 '16 at 10:43
  • Maybe you are right that a single regex makes the code harder to understand, so I'll use two regexes to find all the tokens in my text. But you can post your answer to my question. I think it is correct. – user2621486 May 18 '16 at 10:52
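The two-step approach suggested in the comments could be sketched roughly as follows. This is one possible reading, not the commenters' actual code: it assumes tokens always sit at the end of a run of uppercase letters followed by a dash, and the names `TwoStepTokens`, `candidate`, `predefined`, `generic` and `findTokens` are all hypothetical.

```scala
object TwoStepTokens {
  // Step 0 (assumption): isolate each run of uppercase letters ending in a dash.
  private val candidate = "[A-Z]{2,}-".r
  // Step 1: a predefined token at the end of the run takes priority.
  private val predefined = "(?:AB|BCC|CDD)-$".r
  // Step 2: otherwise fall back to the last 2 to 4 letters before the dash.
  private val generic = "[A-Z]{2,4}-$".r

  def findTokens(text: String): List[String] =
    candidate.findAllIn(text).toList.flatMap { run =>
      predefined.findFirstIn(run).orElse(generic.findFirstIn(run))
    }
}

// TwoStepTokens.findTokens("XBCC-")  // List(BCC-)
// TwoStepTokens.findTokens("XTEST-") // List(TEST-)
```

Keeping the priority rule in ordinary code (`orElse`) rather than inside one pattern makes the intent explicit, at the cost of scanning each candidate run twice.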

1 Answer

Don't believe the naysayers. This can quite easily be done with regex:

(?!\w+(?:AB|BCC|CDD)-)[A-Z]{2,4}-

See the demo.

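The negative lookahead rejects any start position from which one or more word characters are followed by AB-, BCC- or CDD-. A match therefore can never begin before a predefined token in the same run of characters and swallow it.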


Explanation:

(?!                assert that there is no...
   \w+             ...sequence of one or more word characters...
   (?:AB|BCC|CDD)  ...followed by AB, BCC, or CDD...
   -               ...and a dash
)
[A-Z]{2,4}-        then simply match 2 to 4 uppercase letters followed by a dash
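
As a quick check, the pattern above can be dropped into the `findAllIn` call from the question. This is a minimal sketch; the object name `LookaheadTokens` and the helper `findTokens` are hypothetical and exist only for illustration.

```scala
object LookaheadTokens {
  // Pattern from the answer: refuse to start a match at any position where
  // a predefined token plus dash follows one or more word characters.
  val regex = """(?!\w+(?:AB|BCC|CDD)-)[A-Z]{2,4}-""".r

  // Hypothetical helper: collect every match in the input.
  def findTokens(text: String): List[String] =
    regex.findAllIn(text).toList
}

// LookaheadTokens.findTokens("XBCC-")          // List(BCC-)
// LookaheadTokens.findTokens("XTEST-")         // List(TEST-)
// LookaheadTokens.findTokens("AB- XCDD- FOO-") // List(AB-, CDD-, FOO-)
```

The triple-quoted string avoids having to double the backslash in `\w`.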
Aran-Fey