3

I need to create a regular expression to validate strings. The strings can have only few characters and each character can be repeated only a few number of times.

The regular expression should check below conditions.

  1. The string can have only a, b, c, d, e as characters.
  2. The character 'a' can appear a maximum of 2 times.
  3. The character 'b' can appear maximum of 3 times.
  4. The character 'c' can appear maximum of 3 times.
  5. The character 'd' can be appear maximum of 1 time.
  6. The character 'e' can be appear maximum of 1 time.

I know that this can be achieved by string functions. But I am trying to do it with regular expressions.

Any help on this is highly appreciated.

Ajith
  • 1,447
  • 2
  • 17
  • 31
  • 2
    I don't think regex is suited in performing such a check. Since you also tagged python, why don't you do it programmatically in python instead? – blhsing Jul 01 '18 at 15:31
  • I have more than 100,000 strings to process here. Using regex will be faster than performing string manipulations. – Ajith Jul 01 '18 at 15:33
  • Joel Berkeley the one that was pasted below is not the duplicate. of this. There it is performing a string operation to count number of characters. I need regular expression for this. – Ajith Jul 01 '18 at 15:35
  • @Ajith Regex is usually **slower** than other means of string parsing in python. – zvone Jul 01 '18 at 15:38
  • I am new to Python. With my previous experience with other languages, I found regex is faster than string manipulations. – Ajith Jul 01 '18 at 15:39
  • I need to perform this check on 100,000 strings in my data frame. Performnce is a concern here. – Ajith Jul 01 '18 at 15:41
  • 1
    Possible duplicate of [Count the number occurrences of a character in a string](https://stackoverflow.com/questions/1155617/count-the-number-occurrences-of-a-character-in-a-string) – Mureinik Jul 01 '18 at 15:47
  • 3
    @Ajith Using regular expressions for this task will be significantly slower then iterating through the string once. Especially because the regex needs to be more complex to solve this as it's not a task intended for regular expressions. – Spencer Wieczorek Jul 01 '18 at 15:47
  • Are you using pandas? Do you want to create a boolean Series based on a str Series? – Alex Hall Jul 01 '18 at 18:09
  • @AlexHall, Hal, l am using pandas.I have the strings loaded in a dataframe. I am trying to remove the entries from the data frame which are not satisfying this conditions. – Ajith Jul 01 '18 at 20:30

4 Answers4

4

Likely, performance wise, the best way to do this is with Python native string operations.

I would write like so:

lim=(('a',2),('b',3),('c',3),('d',1),('e',1))
results={}
for s in [list_of_many_strings]:
    results[s]=bool(not(set(s)-set('abcde'))) and (not any(s.count(c)>x for c,x in lim))

This relies on str.count(sub[, start[, end]]) to count the occurrence of a sub string in a string and any function to test if any condition is true.


Since you are interested in performance, you can time how long processing 100,000 strings might take with timeit:

import re

def f1(li):
    results={}
    lim=(('a',2),('b',3),('c',3),('d',1),('e',1))
    for s in li:
        results[s]=bool(not(set(s)-set('abcde'))) and (not any(s.count(c)>x for c,x in lim))

    return results    

def f2(li):
    pat=re.compile(r'^a{0,2}b{0,3}c{0,3}d{0,1}e{0,1}$')
    results={}
    for s in li:
        results[s]=True if pat.search(''.join(sorted(s))) else False

    return results

def f3(li):
    pat=re.compile(r'^(?!.*[^a-e])(?!(?:.*a){3})(?!(?:.*b){4})(?!(?:.*c){4})(?!(?:.*d){2})(?!(?:.*e){2}).+')
    results={}
    for s in li:
        results[s]=True if pat.search(s) else False    

    return results    

if __name__=='__main__':
    import timeit    
    import random 
    s='abcdeabcdebc'
    li=[''.join(random.sample(s,8)) for _ in range(100000)]
    print(f1(li)==f2(li)==f3(li))

    for f in (f1,f2,f3):
        print("   {:^10s}{:.4f} secs".format(f.__name__, timeit.timeit("f(li)", setup="from __main__ import f, li", number=10)))

On my computer, takes:

True
       f1    0.8519 secs
       f2    1.1235 secs
       f3    1.3070 secs
dawg
  • 98,345
  • 23
  • 131
  • 206
  • Your solution does not check whether the source string contains any other char than a-e. If it does, the check should fail. – Valdi_Bo Jul 01 '18 at 17:07
  • Amazingly, I find that the timings are reversed, e.g. https://repl.it/repls/PerfumedBabyishBots. I've also added an optimised version of your solution. – Alex Hall Jul 01 '18 at 18:29
  • @AlexHall: huh! I ran it on PyPy on a Mac to get those timings. I tried your exact source code and tried all of Py3, Py2 and PyPy. Some variance but not the same as the real.it version. Your optimized solution is a bit faster indeed. Thanks! – dawg Jul 01 '18 at 18:48
3

A regex checking multiple conditions can be constructed the following way:

  • ^ - Start of the source string.
  • A series of positive lookups, for any "required" condition.
  • A series of negative lookups, for any "forbidden" condition.
  • .+ - If all previous lookups succeeded, match the whole (usually non-empty) source string.

If either positive or negative lookup refers to a char located anywhere in the source string, it should start with .*, stating that before what we actually check, there can occur any number of any (other) chars, possibly none.

Your case contains actually only "forbidden" conditions, stating that it is not allowed:

  • (?!.*[^a-e]) - Any char other than a-e.
  • (?!(?:.*a){3}) - a occurs 3 times (or more).
  • (?!(?:.*b){4}) - b occurs 4 times (or more).
  • (?!(?:.*c){4}) - c occurs 4 times (or more).
  • (?!(?:.*d){2}) - d occurs 2 times (or more).
  • (?!(?:.*e){2}) - e occurs 2 times (or more).

So the whole regex should be:

^(?!.*[^a-e])(?!(?:.*a){3})(?!(?:.*b){4})(?!(?:.*c){4})(?!(?:.*d){2})(?!(?:.*e){2}).+
Valdi_Bo
  • 30,023
  • 4
  • 23
  • 41
  • @AlexHall: Actually, this can have great performance. If the regex engine is recent, it will short-circuit on any element quickly. Lookaheads are *usually* faster than the equivelent non-lookahead construct as well. This is the highest performing solution **IF** you have many string with characters other than `a-e` because it fails on the first clause. – dawg Jul 01 '18 at 17:28
2

if your strings are already sorted, like all a characters are already in front of all b characters and so on, the simple regular expression like this will do:

r'^a{0,2}b{0,3}c{0,3}d{0,1}e{0,1}$'

If characters in your strings are unsorted, well, sort them first =)

And if the meaning of your "can appear a maximum of 2 times" is 1 or 2 times (not 0, 1 or 2, as I expected), replaces all 0 with 1 in the reg.expression.

lenik
  • 23,228
  • 4
  • 34
  • 43
1

Since you are using pandas, it's probably best to use vectorised operations. These are supposed to be faster, although I am not willing to check. Here is one possible approach, a pandas or numpy expert may have a better method:

import random

import pandas as pd

s = 'abcdeabcdebc'
df = pd.DataFrame({'s': [''.join(random.sample(s, 8)) for _ in range(100000)]})
count = df.s.str.count
df['valid'] = ((count('[^a-e]') == 0) &
               (count('a') <= 2) &
               (count('b') <= 3) &
               (count('c') <= 3) &
               (count('d') <= 1) &
               (count('e') <= 1))

print(df.head())

Example output:

          s  valid
0  cacbdbec   True
1  cabdceab   True
2  bbdcdabe  False
3  ecaadbce  False
4  ebabcdad  False
Alex Hall
  • 34,833
  • 5
  • 57
  • 89