I have a list of words stored in a list:

[
    'investment',
    'property',
    'something',
    'else',
    'vest'
]

I also have a list of strings, like so:

[
    'investmentproperty',
    'investmentsomethingproperty',
    'investmentsomethingelseproperty',
    'abcinvestmentproperty',
    'investmentabcproperty'
]

Given this list of words and the list of strings, I need to determine which strings contain only words from the list of words, and have a maximum number of these words.

In the above example, if the maximum number of words was 3, then only the first two items from the string list would match (even though the word 'vest' is in 'investment').

This example simplifies the word list and string list - in reality there are many thousands of words and hundreds of thousands of strings. So this needs to be performant. All the strings contain no spaces.

I've tried constructing a regex like so:

^(?:(word1)|(word2)|(word3)){1,3}$

but this is very slow for the number of words in the word list (currently 10,000).

Thanks

Michael Bates
  • In your first list, are any of the words prefixes for other words? e.g. could the first list contain `'tree'` and `'treehouse'`? – mgilson May 23 '17 at 06:01
  • @mgilson good point, I updated the question. `'vest'` is in the word `'investment'` so the larger word should match. – Michael Bates May 23 '17 at 06:09

3 Answers


I think this is what you are expecting:

code:

strings = [
    'investmentproperty',
    'investmentsomethingproperty',
    'investmentsomethingelseproperty',
    'abcinvestmentproperty',
    'investmentabcproperty'
]
words = [
    'investment',
    'property',
    'something',
    'else'
]
# keep only words that are not substrings of another word in the list
new_words = [x for x in words if not any(x in i and x != i for i in words)]
res = []
for string in strings:
    len_string = len(string)
    in_words = []
    for w in new_words:
        if w in string:
            in_words.append(w)
    # the matched words must account for every character of the string
    if len(''.join(in_words)) == len_string:
        res.append(string)
print(res)

output:

['investmentproperty', 'investmentsomethingproperty', 'investmentsomethingelseproperty']
Arun
  • Thanks @Arun. I actually just updated the question as you posted your answer. Your answer works, but the word list may contain subsets of larger words (such as `'vest'` or `'in'`). Also, in your answer, how do I limit the number of words to match (`'investmentsomethingelseproperty'` shouldn't match if the max number of words from the word list is `3`.) – Michael Bates May 23 '17 at 06:17
  • @MichaelBates I updated my answer according to your updates,for the count you need to add a counter in the loop then check the counter. – Arun May 23 '17 at 06:37
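The counter mentioned in the comment above is not shown in the answer; a sketch of that variant (my code, with a hypothetical `MAX_WORDS` limit, and keeping the answer's assumption that no kept word is a substring of another) might look like:

```python
strings = [
    'investmentproperty',
    'investmentsomethingproperty',
    'investmentsomethingelseproperty',
]
words = ['investment', 'property', 'something', 'else']
MAX_WORDS = 3  # hypothetical limit on the number of matched words

res = []
for string in strings:
    in_words = [w for w in words if w in string]
    # keep the string only if the matched words cover every character
    # and there are no more than MAX_WORDS of them
    if len(''.join(in_words)) == len(string) and len(in_words) <= MAX_WORDS:
        res.append(string)

# res → ['investmentproperty', 'investmentsomethingproperty']
```

'investmentsomethingelseproperty' is dropped because four words match it, which exceeds the limit of three.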

If my two cents matter: I would turn the list of key words you are looking for into a dictionary, so you don't have to keep iterating over two lists.

Oh, I forgot: look at the Aho–Corasick algorithm, it might help you a lot.
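Aho–Corasick builds a trie of the key words with failure links, so all occurrences of all words are found in a single pass over the string. As a rough illustration, here is a plain trie scan (a simplification of the idea: no failure links, so it restarts at every position):

```python
words = ['investment', 'property', 'something', 'else', 'vest']

# build a trie; '$' marks the end of a stored word
trie = {}
for w in words:
    node = trie
    for ch in w:
        node = node.setdefault(ch, {})
    node['$'] = w

def find_matches(s):
    """Yield (start_index, word) for every dictionary word occurring in s."""
    for start in range(len(s)):
        node = trie
        for ch in s[start:]:
            if ch not in node:
                break
            node = node[ch]
            if '$' in node:
                yield start, node['$']

matches = list(find_matches('investmentproperty'))
# finds 'investment' at 0, 'vest' at 2, and 'property' at 10
```

Note that this finds overlapping matches (like 'vest' inside 'investment'), which is exactly why a simple substring check is not enough for the original question.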

If you're not interested in that, see below:

1- If you want to keep two lists, look here:

matchers = ['investment', 'property', 'something', 'else', 'vest']
# my_list is your list of strings to check
matching = [s for s in my_list if any(xs in s for xs in matchers)]

2- Or maybe (in Python 3, `reduce` lives in `functools` and the `filter` result must be wrapped in `list()` before taking `len`):

from functools import reduce
reduce((lambda x, y: x + len(list(filter((lambda z, x=y: z == x), list2)))), list1, 0)

3- This also sounds very interesting and looks like a good alternative to regexes.

For the other requirement of limiting the number of matches, maybe you can add a loop that breaks once the limit is reached. In the dictionary case, make the words you are trying to find the keys and set all their values to one; every time a word is found, add up the values until your limit has been reached.
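Another way to enforce the word limit, and to handle words that are substrings of other words (like 'vest' in 'investment'), is a small dynamic-programming pass. This is my own sketch, not code from the question: for each prefix of the string, record the fewest dictionary words that exactly cover it.

```python
words = {'investment', 'property', 'something', 'else', 'vest'}
max_len = max(map(len, words))

def min_words(s):
    """Fewest dictionary words that exactly cover s, or None if impossible."""
    INF = float('inf')
    best = [INF] * (len(s) + 1)  # best[i] = fewest words covering s[:i]
    best[0] = 0
    for i in range(1, len(s) + 1):
        # only look back as far as the longest dictionary word
        for j in range(max(0, i - max_len), i):
            if best[j] != INF and s[j:i] in words:
                best[i] = min(best[i], best[j] + 1)
    return None if best[len(s)] == INF else best[len(s)]

strings = [
    'investmentproperty',
    'investmentsomethingproperty',
    'investmentsomethingelseproperty',
    'abcinvestmentproperty',
]
# keep strings coverable by at most 3 dictionary words
ok = [s for s in strings if (m := min_words(s)) is not None and m <= 3]
# ok → ['investmentproperty', 'investmentsomethingproperty']
```

This runs in roughly O(len(s) × max word length) per string with set lookups, so it stays fast even with thousands of words, and it naturally rejects 'abcinvestmentproperty' (no exact cover) and 'investmentsomethingelseproperty' (needs 4 words).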

BlooB

How much time do you expect? I have tested the following code:

import re

_list = ['investmentproperty'] * 100000
_dict = [
    'investment',
    'property',
    'something',
    'else'
] * 1000
regex = re.compile("^(?:" + "|".join(_dict) + "){1,3}$")

for i in _list:
    result = regex.match(i)
#cost 5.06s

for i in _list:
    result = re.match("^(?:" + "|".join(_dict) + "){1,3}$", i)
#cost 11.04s

I think with a 100000-length list and a 4000-length dictionary, that is not bad performance, right?
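A scaled-down version of the benchmark above, using `time.perf_counter` (my harness, reduced to 10,000 strings so it runs quickly; absolute timings will vary by machine):

```python
import re
import time

# scaled down from the answer's 100000 strings to keep the run short
_list = ['investmentproperty'] * 10000
_dict = ['investment', 'property', 'something', 'else'] * 1000

# single non-capturing group, compiled once
regex = re.compile("^(?:" + "|".join(_dict) + "){1,3}$")

start = time.perf_counter()
for i in _list:
    result = regex.match(i)
print(f"compiled regex: {time.perf_counter() - start:.2f}s")
```

The key point is the single non-capturing group `(?:...)`: one capturing group per word, as in the question, forces the engine to track thousands of groups on every attempt.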

Sraw
  • As it turns out, you have greatly improved the efficiency of my regex - I was using a capturing group for every word, which slowed down the process hugely. This is great, thanks – Michael Bates May 23 '17 at 09:57
  • @MichaelBates So could you accept my answer, thanks. – Sraw May 23 '17 at 09:59