Fast way of finding the longest substring using only regex

Question

Is it possible to find the longest substring that is at the same time a prefix and a suffix (overlapping is allowed) using only one regular expression? For example:

aab is the longest one in aabaaabaaaaab:

aabaaabaaaaab

aabaaabaaaaab

Another example is babab being the longest in babababab:

babababab

babababab

I was trying to accomplish that using the regular expression suggested by @nhahtdh in Lookahead regex failing to find the same overlapping matches but it fails to find the first example.

I wrote a code that is working properly but it is way too slow because so many uses of text.find. The reason why I have to resort to all that extra logic is because the r'(?=((\w+).*\2.*\2))' regular expression is not tuned to find only suffix and prefixes. I have tried to use ^ and $ but I was not able to make it work properly.

Here is the current correct (but slow) implementation:

from re import findall
text = "xyzxyzxyzxyzxyz"
matches = findall(r'(?=((\w+).*\2.*\2))', text)
s = ""
result = ""
for (s1, s2) in matches:
  if text.startswith(s1) and text.endswith(s1) and text.startswith(s2) and text.endswith(s2):
    if len(s1) == len(text):
      s = s2
    elif len(s2) == len(text):
      s = s1
    else:
      s = s1 if len(s1) > len(s2) else s2
    mid = text.find(s, 1)
    if not (mid == -1 or mid >= len(text) - len(s)):
      result = s if len(s) > len(result) else result
print(result) # xyzxyzxyz

You can't do that with just regex. Extract all matches, then find the longest. — Wiktor Stribiżew, Feb 19 '20 at 11:30
Do you need to use regex? Just slicing strings would be much easier and faster. — Alex Hall, Feb 19 '20 at 11:31
Does the resulting substring also have to be present a third separate time in the middle (i.e. not touching either end)? — Alex Hall, Feb 19 '20 at 11:36
`str.find` is extremely slow in this case. I implemented the Ukkonen algorithm to find all the matches and benchmarked it against the re module and I found the cpython implementation in the `re` module way faster, this is why I am trying so hard to implement that using regex — Kali User, Feb 19 '20 at 11:38
@AlexHall correct, it has to be present a third separate time in the middle (i.e. not touching either end or the start) — Kali User, Feb 19 '20 at 11:39

score 0 · Answer 1 · answered Feb 19 '20 at 11:38

0

string = "babababab"
substring = ""
for length in range(1, len(string)):
    prefix = string[:length]
    suffix = string[-length:]
    if prefix == suffix and string.find(prefix, 1) != len(string) - length:
        substring = prefix

print(substring)

The extra check string.find(prefix, 1) != len(string) - length ensures that the string also appears strictly in the middle.

answered Feb 19 '20 at 11:38

Alex Hall

34,833
5
57
89

Thank you for the answer Alex, but unfortunately `str.find` is slower than what I need. When I need to look for multiple times in the same string, `re.findall` and `re.search` in CPython are the fastest alternatives. For example, there is no way to solve this problem in Python using `str.find`, but it works using the `re` module: [10679 - I Love Strings!!](https://onlinejudge.org/index.php?option=onlinejudge&page=show_problem&problem=1620) – Kali User Feb 19 '20 at 11:42
@KaliUser If you are trying to solve those online judge problems, you would need to use Suffix Trie or Boyer Moore string matching algorithm. – nhahtdh Feb 20 '20 at 04:25
Thank you @nhahtdh. I implemented KMP and Suffix Tree (Ukkonen) both in Python and Java. Only the Java versions get accepted. But, using `re.search` in Python also gets accepted, this why I suspect the Python regex implementation is super fast – Kali User Feb 20 '20 at 09:18
@KaliUser I'd like to see what code you submitted to that "I Love Strings" problem, I couldn't get past the time limit. – Alex Hall Feb 20 '20 at 09:23
@AlexHall this is my implementation: https://repl.it/repls/FamousExcitedEnvironments – Kali User Feb 20 '20 at 09:43
@AlexHall in case you are also interested in why regexes are so fast in Python, I posted a question about that https://stackoverflow.com/questions/60349511/data-structure-used-for-regexes-in-the-python-standard-library-re-module – Kali User Feb 22 '20 at 07:43

Fast way of finding the longest substring using only regex

1 Answers1

Linked