
I am trying to match the following using regex in python (re module):

"...milk..."              => matched ['milk']

"...almondmilk..." = no match
"...almond milk..." = no match
"...almond word(s) milk..." => matched ['milk']
"...almondword(s)milk..." => matched ['milk']


"...soymilk..." = no match
"...soy milk..." = no match
"...soy word(s) milk..." => matched ['milk']
"...soyword(s)milk..." => matched ['milk']

My other requirement is to find all matches within a given string, so I am using re.findall().

I used the answer to this question (and reviewed a number of other SO pages) to construct my regex:

regx = '^(?!.*(soy|almond))(?=$|.*(milk)).*'

but when I test it with a simple example, I get incorrect behavior:

>>> food = "is combined with creamy soy and milk. a fruity and refreshing sip of spring, "
>>> re.findall(regx, food)
[]
>>> food = "is combined with creamy milk. a fruity and refreshing sip of spring, "
>>> re.findall(regx, food)
[('', 'milk')]

Both of these are supposed to return just ['milk']. Also, if I have multiple instances of milk, I only get one result instead of two:

>>> food = "is combined with creamy milk. a fruity and refreshing sip of milk, "
>>> re.findall(regx, food)
[('', 'milk')]

What am I doing wrong in my regex, and how should I adjust it to solve this problem?

Tayyar R
  • Maybe `(?<!soy)(?<!soy )(?<!almond)(?<!almond )milk` will work for you. – Wiktor Stribiżew May 25 '21 at 19:28
  • I'm not sure you've thought this through. What about "...for the almond industry. Many ranchers find milk to be a refreshing beverage."? Should that match? If not, why not? – Tim Roberts May 25 '21 at 19:33
  • @TimRoberts was this a question for me or for Wiktor? For my very specific use case this would need to be matched as it falls under "...almond word(s) milk...". Where word(s) is any number of words. – Tayyar R May 25 '21 at 21:22

2 Answers


This regex works for me.

(?:soy|almond)\s?[\w\(\)]+\s?(milk)

or, to not accept parentheses in the words:

(?:soy|almond)\s?\w+\s?(milk)

And in Python, that should look like:

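import re

# Sample input for illustration; replace with your own text
your_text = "is combined with creamy soy and milk."
matches = re.findall(r'(?:soy|almond)\s?[\w\(\)]+\s?(milk)', your_text)
print(matches)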
LuisAFK

You can exclude soymilk, soy milk, almondmilk and almond milk by matching them first, and capture just milk in a capture group. Because the pattern has a single group, re.findall returns the value of that group for each match.

\b(?:soy|almond)\s?milk\b|\b(milk)\b

The pattern matches:

  • \b A word boundary to prevent a partial match
  • (?:soy|almond) Match either soy or almond
  • \s?milk\b Match an optional whitespace char and milk followed by a word boundary
  • | Or
  • \b(milk)\b Capture milk in group 1 surrounded by word boundaries

You could also use [^\S\r\n] instead of \s to match a whitespace char other than a newline, because \s can also match a newline.
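To make the difference concrete, here is a minimal sketch comparing the two variants on a string where soy and milk sit on separate lines. The sample text and the variable names are my own, chosen only for illustration, and the empty strings that re.findall returns for the excluded forms are filtered out.

```python
import re

# Variant from above: \s? also accepts a newline between soy/almond and milk
with_s = r"\b(?:soy|almond)\s?milk\b|\b(milk)\b"
# Variant with [^\S\r\n]?: only a non-newline whitespace char is accepted
no_newline = r"\b(?:soy|almond)[^\S\r\n]?milk\b|\b(milk)\b"

# Illustrative input: "soy" ends one line, "milk" starts the next
text = "creamy soy\nmilk is sold here"

# Keep only non-empty captures of group 1
hits_with_s = [m for m in re.findall(with_s, text) if m]
hits_no_newline = [m for m in re.findall(no_newline, text) if m]

print(hits_with_s)      # [] because "soy\nmilk" is treated as soy milk
print(hits_no_newline)  # ['milk'] because the newline breaks the exclusion
```

Which one you want depends on whether a line break between the two words should still count as "soy milk".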


For example

import re

regx = r"\b(?:soy|almond)\s?milk\b|\b(milk)\b"

food = "is combined with creamy soy and milk. a fruity and refreshing sip of spring, "
print(re.findall(regx, food))

food = "is combined with creamy milk. a fruity and refreshing sip of spring, "
print(re.findall(regx, food))

Output

['milk']
['milk']
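Note that with this technique re.findall yields an empty string for every excluded form it matches (for example almondmilk), since group 1 does not take part in that alternative. A small sketch of how you might drop those and still get every real occurrence; the helper name find_plain_milk and the sample strings are hypothetical:

```python
import re

regx = r"\b(?:soy|almond)\s?milk\b|\b(milk)\b"

def find_plain_milk(text):
    """Return only the captured 'milk' hits, dropping empty group-1 results."""
    return [m for m in re.findall(regx, text) if m]

# An excluded form alone gives no results after filtering
print(find_plain_milk("is combined with creamy almondmilk. a sip of spring, "))  # []

# Two plain occurrences are both returned
print(find_plain_milk("is combined with creamy milk. a refreshing sip of milk, "))  # ['milk', 'milk']

# An excluded form followed by a plain one keeps only the plain one
print(find_plain_milk("almond milk and milk"))  # ['milk']
```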

Another option is the PyPI regex module, which supports the variable-width lookbehind used here:

(?<!\b(?:soy|almond)\s*(?:milk)?)\bmilk\b

The pattern matches:

  • (?<! Negative lookbehind, assert what directly to the left is not
  • \b(?:soy|almond) A word boundary, match either soy or almond
  • \s*(?:milk)? Match optional whitespace chars and then optionally milk
  • ) Close lookbehind
  • \bmilk\b Match milk between word boundaries
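If installing a third-party module is not an option, a rough approximation is possible with the built-in re module, which only accepts fixed-width lookbehinds. The sketch below chains one lookbehind per excluded prefix, along the lines of the pattern suggested in the comments under the question. It is not equivalent to the pattern above: it assumes at most a single space between soy/almond and milk, and it drops the word boundaries so that forms like almondwordmilk still match. The helper name milk_hits is hypothetical.

```python
import re

# One fixed-width negative lookbehind per excluded prefix.
# Assumption: soy/almond is separated from milk by zero or one space.
fixed_width = r"(?<!soy)(?<!soy )(?<!almond)(?<!almond )milk"

def milk_hits(text):
    """Return every 'milk' not directly preceded by soy/almond (optionally plus one space)."""
    return re.findall(fixed_width, text)

print(milk_hits("creamy almondmilk."))                  # []
print(milk_hits("creamy soy milk."))                    # []
print(milk_hits("creamy soy and milk. a sip of milk"))  # ['milk', 'milk']
print(milk_hits("almondwordmilk"))                      # ['milk']
```

Since the pattern has no capture group, re.findall returns the full matches directly and there are no empty strings to filter out.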


The fourth bird
  • This matches an empty string for almondmilk instead of no match ```>>> food = "is combined with creamy almondmilk. a fruity and refreshing sip of spring, " >>> re.findall(regx, food) ['']``` – Tayyar R May 25 '21 at 21:25
  • @TayyarR You can remove the empty matches from the final list like `print([m for m in re.findall(regx, food) if m])` See https://ideone.com/YAqX7b – The fourth bird May 25 '21 at 21:30
  • Thanks! Are the empty strings going to show up every time there is an unwanted match (almond for ex) or are there other situations in which this will return an empty string? This is a bit of a noob question: Is there a way to build the regex so that only _real_ matches are returned in the list? – Tayyar R May 25 '21 at 21:35
  • @TayyarR That is part of the technique ruling out what you don't want, and capture what you do want. You can use the list comprehension to remove the empty strings for example. – The fourth bird May 25 '21 at 21:38
  • @TayyarR If you can make use of the [PyPi regex module](https://pypi.org/project/regex/), you can use https://regex101.com/r/nrnotP/1 – The fourth bird May 25 '21 at 21:49
  • the pypi solution seems the best so far! I'm going to do more extensive testing and accept your answer if everything checks out. Any chance you could update your answer up above to include an explanation as to what you changed to get to the final solution? It would be great to learn from your example. Thanks so much for your help!!! – Tayyar R May 25 '21 at 22:53
  • @TayyarR I have added an update with an example link. – The fourth bird May 25 '21 at 23:23