I wanted to remove repeated sequences of words from a text string, so I came up with this regex solution in Python:
import re
e = re.compile(r'(\w.+)(?=\1)', flags=re.DOTALL)
With this input:
text = 'Cats get wet in get wet in the rain'
The result seemed to be OK:
re.sub(e,'', text)
'Cats get wet in the rain'
But with:
text = 'Rococo was replaced by the Neoclassic style.'
re.sub(e,'', text)
I get
'Roco was replaced by the Neoclassic style.'
which is not OK, because "Rococo" has been changed to "Roco".
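To see what is actually being removed here, the matches can be listed with finditer. This is only a small diagnostic sketch using the same pattern; the names `pattern` and `matches` are mine, for illustration:

```python
import re

# Same pattern as above: a word character plus at least one more character,
# immediately followed by an identical copy of itself.
pattern = re.compile(r'(\w.+)(?=\1)', flags=re.DOTALL)

text = 'Rococo was replaced by the Neoclassic style.'

# Collect each matched group together with its position in the string.
matches = [(m.group(1), m.span()) for m in pattern.finditer(text)]
print(matches)  # [('oc', (1, 3))]
```

So the pattern finds "oc" followed by another "oc" inside a single word, because nothing in it requires the repeated part to be a whole word.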
So I changed the regex a bit in order to only match distinct words (separated by spaces) repeated in a row:
e = re.compile(r'(\w.+ )(?=\1)', flags=re.DOTALL)
so I get both
text = 'Cats get wet in get wet in the rain'
re.sub(e,'', text)
'Cats get wet in the rain'
and
text = 'Rococo was replaced by the Neoclassic style.'
re.sub(e,'', text)
'Rococo was replaced by the Neoclassic style.'
Both look fine, so this seemed to be the regex I wanted. However, I also get some strange behavior. For example:
text = 'Escobar bar established'
re.sub(e,'', text)
'Escobar established'
The word "bar" is missing from the result, which is not intended at all.
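Listing the matches for this input shows where the deletion happens. Again this is just a diagnostic sketch with the second pattern; `pattern` and `matches` are illustrative names:

```python
import re

# Second pattern: the repeated part must end with a space,
# but it may still START in the middle of a word.
pattern = re.compile(r'(\w.+ )(?=\1)', flags=re.DOTALL)

text = 'Escobar bar established'

# Collect each matched group together with its position in the string.
matches = [(m.group(1), m.span()) for m in pattern.finditer(text)]
print(matches)  # [('bar ', (4, 8))]
```

The match starts at index 4, so what gets removed is the "bar " at the end of "Escobar ", which happens to be followed by the standalone "bar ". The trailing space in the group only constrains where the repeated part ends, not where it begins.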
Now the question is:
What is the right way to do this in Python, that is, to remove a sequence of whole words that is repeated immediately in a row, while leaving the rest of the sentence intact?
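One direction that might work is to anchor both copies of the phrase at word boundaries with `\b` and to consume the repeats instead of using a lookahead. The following is only a sketch under that assumption, checked against the three inputs from this post and nothing else; `drop_repeated_phrases` is a hypothetical helper name:

```python
import re

def drop_repeated_phrases(text):
    """Collapse a phrase that is immediately repeated into a single copy.

    Hypothetical helper, for illustration only.
    """
    # \b(.+)      a phrase that starts at a word boundary
    # (\s+\1\b)+  one or more copies of it, separated by whitespace,
    #             each followed by a word boundary
    # The whole run is replaced by the first copy of the phrase.
    return re.sub(r'\b(.+)(\s+\1\b)+', r'\1', text)

print(drop_repeated_phrases('Cats get wet in get wet in the rain'))
print(drop_repeated_phrases('Rococo was replaced by the Neoclassic style.'))
print(drop_repeated_phrases('Escobar bar established'))
```

On these inputs the first sentence loses its repeated "get wet in" and the other two come back unchanged. Since `re.DOTALL` is not set here, `.` does not match newlines, so a repeat that spans a line break would not be caught, and phrases that end in punctuation may behave differently because of the trailing `\b`.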
Thank you so much for your help.