Find blocks of lines starting with a certain character

Question

Text:

Abcd
Aefg
bhij
Aklm
bnop
Aqrs

(Note, there is no newline after the last line)

Python code:

print(re.findall('(^A.*?$)+',Text,re.MULTILINE))

This returns

['Abcd','Aefg','Aklm','Aqrs']

However, I would like adjacent lines to be returned as one set:

['Abcd\nAefg','Aklm','Aqrs']

How should I solve this with Python?

Jan · Accepted Answer · 2020-08-01T09:03:56.163

3

You may use

((?:^A.*[\n\r]?)+)

See a demo on regex101.com. This is:

(
    (?:^A.*[\n\r]?)+ # original pattern 
                     # with newline characters, optionally
                     # repeat this as often as possible
)

In Python:

import re

data = """
Abcd
Aefg
bhij
Aklm
bnop
Aqrs"""

matches = [match.group(1).strip() 
           for match in re.finditer(r'((?:^A.*[\n\r]?)+)', data, re.M)]
print(matches)

Which yields

['Abcd\nAefg', 'Aklm', 'Aqrs']

Be aware that, because of the nested quantifiers, this pattern may lead to catastrophic backtracking, for example once it is extended with further context.

edited Aug 01 '20 at 09:03

answered Jul 31 '20 at 11:06

Jan


1

Perfect, thanks for the help! So my mistake was using `$` where I should have used `\n`, and I should have enclosed the whole pattern in a capturing group. – Peter Jul 31 '20 at 21:26

score 1 · Answer 2 · answered Jul 31 '20 at 11:05

1

You may use

re.findall(r'^A.*(?:\nA.*)*', text, re.M)

See the regex demo

Details

^ - start of a line (because of re.M)
A - an A letter
.* - the rest of the line
(?:\nA.*)* - zero or more repetitions of
- \nA - a newline and A
- .* - the rest of the line.
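
A minimal runnable sketch of the pattern above, applied to the sample text from the question (the variable names `text` and `blocks` are only for illustration):

```python
import re

# Sample text from the question; there is no newline after the last line.
text = "Abcd\nAefg\nbhij\nAklm\nbnop\nAqrs"

# ^A.*        a line starting with "A" (re.M lets ^ match at every line start)
# (?:\nA.*)*  any directly following lines that also start with "A"
blocks = re.findall(r'^A.*(?:\nA.*)*', text, re.M)
print(blocks)  # ['Abcd\nAefg', 'Aklm', 'Aqrs']
```

Since the pattern contains no capturing group, `re.findall` returns the full matches, so no extra group or `.strip()` call is needed.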

answered Jul 31 '20 at 11:05

Wiktor Stribiżew


That is a very smart alternative solution, I really like it! A very minor issue with it is having to repeat the line identifier (`A.*`), but for the rest a good idea. – Peter Jul 31 '20 at 21:29
2

@Peter It is following the best practices. If you ever have to add right-hand context to Jan's regex, you will get catastrophic backtracking (like [here](https://stackoverflow.com/questions/45463148/fixing-catastrophic-backtracking-in-regular-expression)) sooner or later. There is absolutely no problem with repeating `A`. It makes the pattern match in such a way that the subsequent patterns cannot match the same text, which makes it fail safe. Also, you may compare the amount of steps with [my regex](https://regex101.com/r/kESAea/1) and [the alternative one](https://regex101.com/r/SpmGPW/1/). – Wiktor Stribiżew Aug 01 '20 at 08:20
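
Python's `re` module does not report step counts, so the comparison from the linked demos cannot be reproduced here directly. As a small sanity check, though, the following sketch confirms that both patterns return the same blocks for the sample text (the names `nested` and `unrolled` are illustrative):

```python
import re

text = "Abcd\nAefg\nbhij\nAklm\nbnop\nAqrs"

# Pattern from the accepted answer: the newline is consumed inside each
# repetition, so the trailing newline has to be stripped from every match.
nested = [m.group(1).strip()
          for m in re.finditer(r'((?:^A.*[\n\r]?)+)', text, re.M)]

# Pattern from the second answer: the newline is only consumed when another
# "A" line follows, so the matches need no post-processing.
unrolled = re.findall(r'^A.*(?:\nA.*)*', text, re.M)

print(nested == unrolled)  # True
```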
1

@Peter If you ever need to repeat the same pattern in a Python regex, just use variables: `A = r'my\s+amazing\s+pattern\s+part'` and then `re.findall(rf'^{A}.*(?:\n{A}.*)*', text, re.M)` – Wiktor Stribiżew Aug 05 '20 at 08:13
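
The variable approach from this comment could be wrapped up roughly as follows. The helper name `find_blocks` and the use of `re.escape` (which treats the prefix as literal text rather than as a regex) are assumptions made for this sketch, not part of the original comment:

```python
import re

def find_blocks(text, prefix):
    """Return runs of adjacent lines that start with the literal `prefix`."""
    p = re.escape(prefix)  # assumption: the prefix is literal text, not a regex
    # The prefix is written once and interpolated into both places.
    pattern = rf'^{p}.*(?:\n{p}.*)*'
    return re.findall(pattern, text, re.M)

text = "Abcd\nAefg\nbhij\nAklm\nbnop\nAqrs"
print(find_blocks(text, "A"))  # ['Abcd\nAefg', 'Aklm', 'Aqrs']
print(find_blocks(text, "b"))  # ['bhij', 'bnop']
```

If the line identifier is itself meant to be a regex, as in the comment, drop the `re.escape` call and pass the sub-pattern in unchanged.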
