Find substring in string but only if whole words?

Question

What is an elegant way to look for a string within another string in Python, but only if the substring is within whole words, not part of a word?

Perhaps an example will demonstrate what I mean:

string1 = "ADDLESHAW GODDARD"
string2 = "ADDLESHAW GODDARD LLP"
assert string_found(string1, string2)  # this is True
string1 = "ADVANCE"
string2 = "ADVANCED BUSINESS EQUIPMENT LTD"
assert not string_found(string1, string2)  # this should be False

How can I best write a function called string_found that will do what I need? I thought perhaps I could fudge it with something like this:

def string_found(string1, string2):
   if string2.find(string1 + " ") != -1:
      return True
   return False

But that doesn't feel very elegant, and also wouldn't match string1 if it was at the end of string2. Maybe I need a regex? (argh regex fear)

score 50 · Answer 1 · edited Jul 28 '22 at 15:53

You can use regular expressions and the word boundary special character \b (emphasis mine):

Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character. Note that \b is defined as the boundary between \w and \W, so the precise set of characters deemed to be alphanumeric depends on the values of the UNICODE and LOCALE flags. Inside a character range, \b represents the backspace character, for compatibility with Python’s string literals.

import re

def string_found(string1, string2):
    # re.escape() stops characters in string1 from being read as regex syntax.
    if re.search(r"\b" + re.escape(string1) + r"\b", string2):
        return True
    return False
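
As a quick check, the function above can be run against the two examples from the question. This is only a minimal sketch: the third call, with a trailing comma, is an added illustration and not part of the original question.

```python
import re

def string_found(string1, string2):
    # Same pattern as above: string1 must sit between two word boundaries.
    return re.search(r"\b" + re.escape(string1) + r"\b", string2) is not None

print(string_found("ADDLESHAW GODDARD", "ADDLESHAW GODDARD LLP"))    # True
print(string_found("ADVANCE", "ADVANCED BUSINESS EQUIPMENT LTD"))    # False
print(string_found("GODDARD", "ADDLESHAW GODDARD, LLP"))             # True
```

The last call matches because a comma is a non-word character, so \b also matches between "D" and ",".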

If only whitespace counts as a word boundary for you, you could also get away with prepending and appending a space to each string:

def string_found(string1, string2):
    string1 = " " + string1.strip() + " "
    string2 = " " + string2.strip() + " "
    # find() returns an index (0 is falsy, -1 is truthy), so compare with -1.
    return string2.find(string1) != -1
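
To illustrate how the space-padding idea behaves, here is a small sketch. The name string_found_padded is hypothetical, and the last call (a word followed by punctuation) is an added example showing where this variant differs from the \b version.

```python
def string_found_padded(string1, string2):
    # Hypothetical helper name. Pad both strings with one space on each side
    # so that a match at the start or end of string2 is still space-delimited.
    needle = " " + string1.strip() + " "
    haystack = " " + string2.strip() + " "
    return needle in haystack

print(string_found_padded("ADDLESHAW GODDARD", "ADDLESHAW GODDARD LLP"))  # True
print(string_found_padded("LLP", "ADDLESHAW GODDARD LLP"))                # True
print(string_found_padded("ADVANCE", "ADVANCED BUSINESS EQUIPMENT LTD"))  # False
print(string_found_padded("name", "What is your name?"))                  # False
```

The final call returns False because "name" is followed by "?" rather than a space.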

edited Jul 28 '22 at 15:53 by wjandrea · answered Nov 11 '10 at 13:50 by Felix Kling

Up-voted for the theoretical suggestion. Your script, OTOH, will not work. `'\b'` is the escape sequence for the backspace (`'\x08'`) character. I would suggest `r'\b%s\b' % (re.escape(string1))` as the first parameter to `re.search()` instead. In fact, that whole function could be reduced to `return re.search(r'\b%s\b' % (re.escape(string1)), string2) is not None` – Walter Nov 11 '10 at 13:59
@Walter: Not sure about `\b`. It is said: *Inside a **character range**, `\b` represents the backspace character, ...* It works for me at least. But yes, string substitution is nice too :) – Felix Kling Nov 11 '10 at 14:06
when \b is inside a character range [a-z0-9\b]...? \b should work, and did in the very brief test I did – Cubed Eye Nov 11 '10 at 14:07
@Walter: Your `r'\b%s\b' % (re.escape(string1))` has identical results to Felix's `r"\b" + re.escape(string1) + r"\b"`; side note: the extra parens in yours aren't useful, as they don't represent a tuple of length one. Though `if ...: return True; else: return False` is also a big pet peeve of mine. – Nov 13 '10 at 10:11
In my use case, string_found() returns False most of the time. To make the False cases much faster, test `string1 in string2` before running the more expensive re.search(): def string_found(string1, string2): if string1 in string2 and re.search(r"\b" + re.escape(string1) + r"\b", string2): ... – Peter Jun 23 '15 at 14:28
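
Written out in full, the pre-check suggested in the last comment might look like the sketch below. Whether it is actually faster depends on the data and has not been measured here.

```python
import re

def string_found(string1, string2):
    # Cheap substring test first: if string1 does not occur at all,
    # the regex search is skipped entirely.
    if string1 not in string2:
        return False
    return re.search(r"\b" + re.escape(string1) + r"\b", string2) is not None
```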

score 18 · Answer 2 · edited Jul 28 '22 at 16:09

The simplest and most pythonic way, I believe, is to break the strings down into individual words and scan for a match:

string = "My Name Is Josh"
substring = "Name"

for word in string.split():
    if substring == word:
        print("Match Found")

For a bonus, here's a oneliner:

any(substring == word for word in string.split())
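
Note that comparing against single words cannot match a multi-word substring such as "ADDLESHAW GODDARD" from the question. One possible extension, sketched here with a hypothetical helper named phrase_found, is to compare the phrase's words against every consecutive run of words in the text:

```python
def phrase_found(phrase, text):
    # Hypothetical helper: split both strings on whitespace and look for
    # the phrase's words as a consecutive run inside the text's words.
    needle = phrase.split()
    words = text.split()
    n = len(needle)
    if n == 0:
        return False
    return any(words[i:i + n] == needle for i in range(len(words) - n + 1))

print(phrase_found("ADDLESHAW GODDARD", "ADDLESHAW GODDARD LLP"))    # True
print(phrase_found("ADVANCE", "ADVANCED BUSINESS EQUIPMENT LTD"))    # False
```

Like the loop above, this treats only whitespace as a separator, so punctuation attached to a word still prevents a match.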

edited Jul 28 '22 at 16:09 by wjandrea · answered Jan 09 '19 at 20:23

I like this one as it most closely matches the `grep -w` in unix – vr00n Nov 18 '19 at 16:30
Love this python approach. Works and was exactly what I was looking for! – Createdd Feb 26 '21 at 18:46
The true one-line is `if word in string.split()` – Kshitij Agrawal Dec 09 '21 at 10:30
Punctuation messes this up, for example: `string = "What is your name?"; substring = "name"; substring in string.split()` -> `False`. Using regex word bounds is more thorough. – wjandrea Jul 28 '22 at 16:04
@vr00n Actually, [the regex word bound answer](/a/4155064/4518341) is closer. For example, look at punctuation, like I mentioned above: `grep -qw "name" <<< "What is your name?"` -> true. (At least for GNU grep. I'm not sure about other implementations. `-w` isn't specified in POSIX.) – wjandrea Jul 28 '22 at 16:15

score 9 · Answer 3 · edited Nov 13 '10 at 11:11

Here's a way to do it without a regex (as requested) assuming that you want any whitespace to serve as a word separator.

import string

def find_substring(needle, haystack):
    index = haystack.find(needle)
    if index == -1:
        return False
    if index != 0 and haystack[index-1] not in string.whitespace:
        return False
    L = index + len(needle)
    if L < len(haystack) and haystack[L] not in string.whitespace:
        return False
    return True

And here's some demo code (codepad is a great idea: Thanks to Felix Kling for reminding me)

edited Nov 13 '10 at 11:11 · answered Nov 11 '10 at 13:45 by aaronasterling

Just make sure to "save" the codepad pastes, so they don't expire. (I include a link back in a codepad comment, just for my own notes later, too.) – Nov 13 '10 at 07:27
For those who want to ensure that punctuation as well as whitespace is considered a valid whole-word delimiter, modify the above code as follows: `not in (string.whitespace + string.punctuation)`. Also note that this function is more than twice as efficient as the regex alternative proposed, so if you are using it a lot, this function is the way to go. – Jason Leidigh Apr 17 '17 at 18:52
Fantastic solution. For 5000k rows I've got `1e-05` while with regex `0.0018`. 180 x faster. – Peter.k Feb 28 '19 at 17:30
The code is not quite correct. If there are *two* or more occurrences of the substring, the first *not* being a whole word but the second being a whole word, the code will only consider the first one and return false. One must look at all matches, and return false if none of them qualify. – TCSGrad Aug 04 '19 at 19:44
Added my answer: https://stackoverflow.com/a/41391098/212942 that builds off your code. – TCSGrad Aug 04 '19 at 21:42

score 2 · Answer 4 · edited Jul 28 '22 at 16:05

I'm building off aaronasterling's answer.

The problem with the above code is that it will return false when there are multiple occurrences of needle in haystack, with the second occurrence satisfying the search criteria but not the first.

Here's my version:

import string

def find_substring(needle, haystack):
  search_start = 0
  while search_start < len(haystack):
    index = haystack.find(needle, search_start)
    if index == -1:
      return False
    end = index + len(needle)
    is_prefix_whitespace = (index == 0 or haystack[index-1] in string.whitespace)
    is_suffix_whitespace = (end == len(haystack) or haystack[end] in string.whitespace)
    if is_prefix_whitespace and is_suffix_whitespace:
      return True
    # Advance by one character so that overlapping occurrences are not skipped.
    search_start = index + 1
  return False
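
If punctuation should also count as a delimiter, as one comment on the earlier answer suggests, the same loop can take the delimiter set as an argument. This is a sketch only: the name find_whole and the delimiters parameter are assumptions made for illustration.

```python
import string

def find_whole(needle, haystack, delimiters=string.whitespace + string.punctuation):
    # Hypothetical variant: `delimiters` is an assumed parameter listing the
    # characters allowed directly before and after a whole-word match.
    if not needle:
        return False
    start = 0
    while True:
        index = haystack.find(needle, start)
        if index == -1:
            return False
        end = index + len(needle)
        before_ok = index == 0 or haystack[index - 1] in delimiters
        after_ok = end == len(haystack) or haystack[end] in delimiters
        if before_ok and after_ok:
            return True
        # Try the next occurrence, including overlapping ones.
        start = index + 1

print(find_whole("name", "What is your name?"))          # True
print(find_whole("ADVANCE", "ADVANCED then ADVANCE"))    # True
print(find_whole("ADVANCE", "ADVANCED BUSINESS"))        # False
```

One difference from the regex approach: string.punctuation includes the underscore, so here "foo" counts as a whole word inside "foo_bar", whereas \b treats the underscore as a word character.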

score 0 · Answer 5 · answered Dec 30 '16 at 05:29

One approach using the re, or regex, module that should accomplish this task is:

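import re

string1 = "pizza pony"
string2 = "who knows what a pizza pony is?"

# re.escape() guards against regex metacharacters in string1; a closing \b
# (rather than \W) also matches when string1 is at the very end of string2.
search_result = re.search(r'\b' + re.escape(string1) + r'\b', string2)

# re.search() returns None when there is no match.
if search_result:
    print(search_result.group())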

answered Dec 30 '16 at 05:29 by Chris Larson

A side note to this answer: regular expressions are much slower than find(), so with large text one should consider using str.find() instead. – Celdor Jul 18 '18 at 16:00

score 0 · Answer 6 · answered Apr 14 '20 at 01:13

Excuse me REGEX fellows, but the simpler answer is:

text = "this is the esquisidiest piece never ever writen"
word = "is"
" {0} ".format(text).lower().count(" {0} ".format(word).lower())

The trick here is to pad both the 'text' and the 'word' with a space on each side. That way only whole-word occurrences are counted, and a match at the very beginning or end of the 'text' is not missed.
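
One caveat worth illustrating: str.count() counts non-overlapping matches, and two adjacent occurrences share the space between them. In the sketch below the sample text is modified to contain "is is" (an assumption made for illustration), and a regex count is shown for comparison.

```python
import re

text = "this is is the esquisidiest piece never ever writen"
word = "is"

# Space-padding count: the two adjacent occurrences share one space,
# so only the first of them is counted.
padded_count = " {0} ".format(text).lower().count(" {0} ".format(word).lower())

# Word-boundary count for comparison.
regex_count = len(re.findall(r"\b" + re.escape(word.lower()) + r"\b", text.lower()))

print(padded_count)  # 1
print(regex_count)   # 2
```

So the padded count is reliable as a found/not-found test (non-zero means found), but it can undercount repeated adjacent words.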

answered Apr 14 '20 at 01:13 by Danilo Castro

What happens if, for example, the word one is looking for has a non-alphabetic character next to it on either side? For example: text = "this is the esquisidiest piece never ever writen." word = "writen". Notice the dot at the end. – hecvd Aug 31 '20 at 21:55

score 0 · Answer 7 · answered Mar 31 '21 at 02:07

Building on @Chris Larson's suggestion, I tested it and updated it as follows:

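import re

string1 = "massage"
string2 = "muscle massage gun"
# A closing \b (instead of \W) also matches at the end of string2, and
# testing the result directly avoids catching AttributeError on None.
if re.search(r'\b' + re.escape(string1) + r'\b', string2):
    print("Found word")
else:
    print("Not found")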

score -1 · Answer 8 · answered Aug 04 '19 at 21:51

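def string_found(string1, string2):
    # Not found at all: string1 is not a substring of string2.
    if string1 not in string2:
        return False
    start = string2.index(string1)  # only the first occurrence is checked
    end = start + len(string1)
    # The match must begin at the start of string2 or follow a space...
    starts_word = start == 0 or string2[start - 1] == " "
    # ...and must end at the end of string2 or be followed by a space.
    ends_word = end == len(string2) or string2[end] == " "
    return starts_word and ends_word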

answered Aug 04 '19 at 21:51 by SOLOSNAKE231

It does the thing they wanted to do? Idk what else you want – SOLOSNAKE231 Aug 05 '19 at 10:05
We try to give detail in our answers so they can be understood by the OP as well as anyone who lands on this page with a similar question and potentially a different level of understanding. Welcome to Stack, though, you might find this helpful --> https://stackoverflow.com/help/how-to-answer – Claire Aug 05 '19 at 10:11
