
I am trying to capture a repeated pattern in my string. Each subpattern starts at the beginning of a word or after ":" and ends with ":" or at the end of a word. I tried findall and search with a repeated group, ((subpattern)__(subpattern))+, but could not work out what is wrong:

cc = "GT__abc23_1231:TF__XYZ451"

import regex
ma = regex.match("(\b|\:)([a-zA-Z]*)__(.*)(:|\b)", cc)

Expected output:

GT, abc23_1231, TF, XYZ451

I saw a bunch of questions like this, but it did not help.

Dima Lituiev
  • In your question code, do you mean `import re`? – Aaron Christiansen May 20 '16 at 19:43
  • Your requirements are unclear: the regex you tried contains `__`, matches some letters, and then uses `.*`, which matches any characters, 0 or more times. Could you please clarify? BTW, `_` is a word character, so there is no `\b` between `T` and `_`. – Wiktor Stribiżew May 20 '16 at 19:44
  • I want to first split on ":", then split on double underscore. I import `regex` as it is recommended in the cited questions. – Dima Lituiev May 20 '16 at 20:05

2 Answers


It seems you can use

(?:[^_:]|(?<!_)_(?!_))+

See the regex demo

Pattern details:

  • (?:[^_:]|(?<!_)_(?!_))+ - 1 or more sequences of:
    • [^_:] - any character but _ and :
    • (?<!_)_(?!_) - a single _ not enclosed with other _s

Python demo with re based solution:

import re
p = re.compile(r'(?:[^_:]|(?<!_)_(?!_))+')
s = "GT__abc23_1231:TF__XYZ451"
print(p.findall(s))
# => ['GT', 'abc23_1231', 'TF', 'XYZ451']

If the first character of a value is never : or _, you may use an unrolled regex like:

r'[^_:]+(?:_(?!_)[^_:]*)*'

Note that it won't match a leading single _ (for a value such as _abc it returns only abc), so prefer the first pattern if values may start with _.
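
To make the difference between the two patterns concrete, here is a small sketch that runs both on the sample string and on a made-up input whose first value starts with a single _. The names `general` and `unrolled` and the string `tricky` are illustrative only and are not part of the original answer.

```python
import re

# The two patterns from the answer above.
general = re.compile(r'(?:[^_:]|(?<!_)_(?!_))+')
unrolled = re.compile(r'[^_:]+(?:_(?!_)[^_:]*)*')

s = "GT__abc23_1231:TF__XYZ451"
# On the sample string both patterns return the same four values.
print(general.findall(s))   # ['GT', 'abc23_1231', 'TF', 'XYZ451']
print(unrolled.findall(s))  # ['GT', 'abc23_1231', 'TF', 'XYZ451']

# Hypothetical input: the first value starts with a single underscore.
tricky = "_lead__abc:TF__x"
print(general.findall(tricky))   # ['_lead', 'abc', 'TF', 'x']
print(unrolled.findall(tricky))  # ['lead', 'abc', 'TF', 'x']
```

The unrolled pattern silently drops the leading _, which is harmless only if such values cannot occur in the data.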

Wiktor Stribiżew

Use the smallest common denominator in "starts and ends with a : or a word-boundary", that is, the word-boundary (your substrings are composed of word characters):

>>> import re
>>> cc = "GT__abc23_1231:TF__XYZ451"
>>> re.findall(r'\b([A-Za-z]+)__(\w+)', cc)
[('GT', 'abc23_1231'), ('TF', 'XYZ451')]

Testing whether there is a : on either side is unnecessary.
(Note: there is no need to add a \b after \w+; since the quantifier is greedy, the word-boundary is implicit.)
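
The note about the implicit word-boundary can be checked directly. The following is a minimal sketch, not part of the original answer; the variable names are illustrative. It also flattens the tuples into the flat list shown as the expected output in the question.

```python
import re

cc = "GT__abc23_1231:TF__XYZ451"

# With two capture groups, findall returns a list of 2-tuples.
without_b = re.findall(r'\b([A-Za-z]+)__(\w+)', cc)
with_b = re.findall(r'\b([A-Za-z]+)__(\w+)\b', cc)

# Greedy \w+ always stops at a word-boundary, so the trailing \b changes nothing.
print(without_b == with_b)  # True

# Flatten the pairs into a single list.
flat = [part for pair in without_b for part in pair]
print(flat)  # ['GT', 'abc23_1231', 'TF', 'XYZ451']
```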


[EDIT]

According to your comment: "I want to first split on ":", then split on double underscore.", perhaps you don't need regex at all:

>>> [x.split('__') for x in cc.split(':')]
[['GT', 'abc23_1231'], ['TF', 'XYZ451']]
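
If a flat list like the expected output in the question is wanted, the nested result can be flattened, or the string can be split on both delimiters in one step. A minimal sketch, assuming the input contains no empty fields; the variable names are illustrative and not from the original answer:

```python
import re
from itertools import chain

cc = "GT__abc23_1231:TF__XYZ451"

# Split on ':' first, then on the double underscore, then flatten.
pairs = [x.split('__') for x in cc.split(':')]
flat = list(chain.from_iterable(pairs))
print(flat)  # ['GT', 'abc23_1231', 'TF', 'XYZ451']

# Alternative: split on either delimiter at once; single underscores are kept.
also_flat = re.split(r'__|:', cc)
print(also_flat)  # ['GT', 'abc23_1231', 'TF', 'XYZ451']
```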
Casimir et Hippolyte