repeated pattern in regex

Question

I am trying to catch a repeated pattern in my string. The subpattern starts at the beginning of a word or at ":" and ends at ":" or at the end of a word. I tried findall and search combined with repeated matching, ((subpattern)__(subpattern))+, but could not figure out what is wrong:

cc = "GT__abc23_1231:TF__XYZ451"

import regex
ma = regex.match("(\b|\:)([a-zA-Z]*)__(.*)(:|\b)", cc)

Expected output:

GT, abc23_1231, TF, XYZ451

I saw a bunch of questions like this, but they did not help.

Your requirements are unclear: the regex you tried contains `__`, matches some letters, and then `.*` matches anything, 0+ occurrences. Could you please clarify? BTW, `_` is a word character, so there is no `\b` between `T` and `_`. — Wiktor Stribiżew, May 20 '16 at 19:44
I want to first split on ":", then split on double underscore. I import `regex` as it is recommended in the cited questions. — Dima Lituiev, May 20 '16 at 20:05
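
The points raised in the comments can be checked directly. The sketch below uses the standard `re` module rather than `regex`, and adds a raw-string prefix that the original attempt did not have. Without that prefix, Python reads `"\b"` as a backspace character instead of a word-boundary assertion.

```python
import re

cc = "GT__abc23_1231:TF__XYZ451"

# In a normal (non-raw) string literal, "\b" is the backspace character.
is_backspace = "\b" == "\x08"

# "_" is a word character, so there is no word boundary between "T" and "_".
boundary_after_gt = re.search(r"GT\b", cc)

# With a raw string the attempted pattern does match, but the greedy ".*"
# swallows everything up to the end of the string.
m = re.match(r"(\b|:)([a-zA-Z]*)__(.*)(:|\b)", cc)

print(is_backspace)        # True
print(boundary_after_gt)   # None
print(m.groups())          # ('', 'GT', 'abc23_1231:TF__XYZ451', '')
```

So even with the escaping fixed, a single `match` call with `.*` cannot return the four separate pieces.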

Wiktor Stribiżew · Answer 1 · 2016-05-20T19:54:47.550

It seems you can use

(?:[^_:]|(?<!_)_(?!_))+

See the regex demo

Pattern details:

(?:[^_:]|(?<!_)_(?!_))+ - 1 or more sequences of:
- [^_:] - any character but _ and :
- (?<!_)_(?!_) - a single _ that is neither preceded nor followed by another _

Python demo with re based solution:

import re
p = re.compile(r'(?:[^_:]|(?<!_)_(?!_))+')
s = "GT__abc23_1231:TF__XYZ451"
print(p.findall(s))
# => ['GT', 'abc23_1231', 'TF', 'XYZ451']

If the first character of each value is never : or _, you may use an unrolled regex like:

r'[^_:]+(?:_(?!_)[^_:]*)*'

It won't match values that start with a single _, though (so the first, non-unrolled regex is safer).
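
A quick comparison of the two patterns may make the difference concrete. The string `edge` is a made-up input whose first value begins with a single underscore; it is not from the question.

```python
import re

general = re.compile(r'(?:[^_:]|(?<!_)_(?!_))+')
unrolled = re.compile(r'[^_:]+(?:_(?!_)[^_:]*)*')

s = "GT__abc23_1231:TF__XYZ451"
edge = "_GT__abc:TF__xyz"  # hypothetical input with a leading single underscore

# On the question's input both patterns agree.
print(general.findall(s))    # ['GT', 'abc23_1231', 'TF', 'XYZ451']
print(unrolled.findall(s))   # ['GT', 'abc23_1231', 'TF', 'XYZ451']

# On the edge case the unrolled pattern silently drops the leading underscore.
print(general.findall(edge))   # ['_GT', 'abc', 'TF', 'xyz']
print(unrolled.findall(edge))  # ['GT', 'abc', 'TF', 'xyz']
```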

Casimir et Hippolyte · Answer 2 · 2016-05-20T21:36:12.800

Use the smallest common denominator in "starts and ends with a : or a word-boundary", that is, the word-boundary (your substrings are composed of word characters):

>>> import re
>>> cc = "GT__abc23_1231:TF__XYZ451"
>>> re.findall(r'\b([A-Za-z]+)__(\w+)', cc)
[('GT', 'abc23_1231'), ('TF', 'XYZ451')]

Testing whether there is a : on either side is unnecessary.
(Note: there is no need to add a \b after \w+. Since the quantifier is greedy, the word-boundary is implicit.)

[EDIT]

According to your comment: "I want to first split on ":", then split on double underscore.", perhaps you don't need regex at all:

>>> [x.split('__') for x in cc.split(':')]
[['GT', 'abc23_1231'], ['TF', 'XYZ451']]
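
If a flat list like the expected output in the question is wanted, or a key/value lookup, the split-based approach extends naturally. This is only a sketch: `split_pairs`, `flat` and `lookup` are names made up here, and the dict step assumes every item contains the `__` separator.

```python
cc = "GT__abc23_1231:TF__XYZ451"

def split_pairs(text, outer=":", inner="__"):
    """Split on `outer`, then split each piece once on `inner`."""
    return [tuple(item.split(inner, 1)) for item in text.split(outer)]

pairs = split_pairs(cc)

# Flatten the pairs to get the four values in order.
flat = [part for pair in pairs for part in pair]

# Build a lookup; this raises ValueError if an item has no "__" in it.
lookup = dict(pairs)

print(flat)    # ['GT', 'abc23_1231', 'TF', 'XYZ451']
print(lookup)  # {'GT': 'abc23_1231', 'TF': 'XYZ451'}
```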
