Removing repeated sequences of separate words in a row using regular expressions in Python

Question

I wanted to get rid of repeated sequences of words in a text string, so I came up with this regex solution in Python:

import re

e = re.compile(r'(\w.+)(?=\1)', flags=re.DOTALL)

so when:

text = 'Cats get wet in get wet in the rain'

The result seemed to be OK:

re.sub(e,'', text)

'Cats get wet in the rain'

but if

text = 'Rococo was replaced by the Neoclassic style.'
re.sub(e,'', text)

I get

'Roco was replaced by the Neoclassic style.'

which is not OK, because "Rococo" has been changed to "Roco".

So I changed the regex a bit in order to only match distinct words (separated by spaces) repeated in a row:

e = re.compile(r'(\w.+ )(?=\1)',flags=re.DOTALL)

so I get both

text = 'Cats get wet in get wet in the rain'

re.sub(e,'', text)

'Cats get wet in the rain'

and

text = 'Rococo was replaced by the Neoclassic style.'

re.sub(e,'', text)

'Rococo was replaced by the Neoclassic style.'

which is fine, and this seemed to be the regex I wanted. However, it also shows some strange behavior. For example:

text = 'Escobar bar established'

re.sub(e,'', text)

'Escobar established'

The word "bar" is missing from the result, which is not intended at all.
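
A quick way to see what goes wrong in both attempts is to list what the capture group actually grabs. The sketch below reuses the two patterns from above unchanged; the names `loose`, `spaced`, and `captured` are only illustrative. Neither pattern requires the match to start at a word boundary, so the group can begin in the middle of a word.

```python
import re

# The two patterns from the question, unchanged.
loose = re.compile(r'(\w.+)(?=\1)', flags=re.DOTALL)
spaced = re.compile(r'(\w.+ )(?=\1)', flags=re.DOTALL)

def captured(pattern, text):
    """Return every substring that re.sub(pattern, '', text) would delete."""
    return [m.group(1) for m in pattern.finditer(text)]

# "oc" inside "Rococo" is immediately followed by another "oc".
print(captured(loose, 'Rococo was replaced by the Neoclassic style.'))  # ['oc']

# "bar " at the end of "Escobar " is immediately followed by "bar ".
print(captured(spaced, 'Escobar bar established'))  # ['bar ']
```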

Now the question is:

What is the right way to do this in Python, that is, to remove repeated sequences of whole words that occur in a row while leaving the rest of the sentence intact?

Thank you so much for your help.

An alternative - [`\b(\w+(?:\s+\w+)*)(?=\s+\b\1\b)`](https://regex101.com/r/AJkImA/2) — Wiktor Stribiżew, Nov 02 '17 at 16:43
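
The pattern in the comment above can be tried out with a small sketch like the following. It is a slight variation, not the commenter's exact pattern: the separating whitespace is consumed rather than left in the lookahead, on the assumption that a leftover double space is unwanted. The names `REPEATED_PHRASE` and `drop_repeated_phrases` are made up for illustration.

```python
import re

# \b forces the phrase to start at a word boundary, so "oc" inside "Rococo"
# or "bar" inside "Escobar" can no longer start a match.
# \w+(?:\s+\w+)* captures one or more whole words.
# \s+ consumes the gap, and (?=\1\b) requires the same words to follow,
# ending at a word boundary.
REPEATED_PHRASE = re.compile(r'\b(\w+(?:\s+\w+)*)\s+(?=\1\b)')

def drop_repeated_phrases(text):
    """Remove a run of whole words that is immediately repeated."""
    return REPEATED_PHRASE.sub('', text)

print(drop_repeated_phrases('Cats get wet in get wet in the rain'))
# Cats get wet in the rain
print(drop_repeated_phrases('Rococo was replaced by the Neoclassic style.'))
# Rococo was replaced by the Neoclassic style.
print(drop_repeated_phrases('Escobar bar established'))
# Escobar bar established
```

As written, the match is case-sensitive, and it also collapses repetitions that are legitimate English, such as "had had".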
