I am trying to match exact words with a regex, but it is not working as I expect. Here is a small example with the code and data I am using. I want to return True if the string contains the word c or the word java.

I am using the regex \\bc\\b|\\bjava\\b, but it also matches c#, which is not what I want. It should match only the exact word. How can I achieve this?

import re

def match(x):
    # True if the pattern matches, False otherwise
    return re.match('\\bc\\b|\\bjava\\b', x) is not None

print(df)

0                                  c++ c
1            c# silverlight data-binding
2    c# silverlight data-binding columns
3                               jsp jstl
4                              java jdbc
Name: tags, dtype: object

df.tags.apply(match)

0     True
1     True
2     True
3    False
4     True
Name: tags, dtype: bool

Expected Output:

0     True
1    False
2    False
3    False
4     True
Name: tags, dtype: bool
user_12
  • The question was marked as a duplicate, but the context seems different. @user_12 In case the other question doesn't help: the problem is that `\b` "matches empty string at word boundary (between \w and \W)", and since `#` is not `\w`, `\bc\b` matches the `c` in `c#`. – kkawabat Aug 29 '19 at 00:30
  • @kkawabat Fair enough, reopened the question. You can post an answer if you like. – Selcuk Aug 29 '19 at 00:31
  • `\b` considers alphanumeric characters to be word characters. Since `#` is not alphanumeric, it creates a word boundary, which is why `c#` matches `\bc\b`. – Tom Karzes Aug 29 '19 at 00:31
  • @TomKarzes So I should use something like `\sc\s|\sjava\s` right? I've tried that but it's returning everything as `False`. If this is not what you meant can you post it as an answer below? – user_12 Aug 29 '19 at 00:35
  • Yes, except for one thing: `\s` requires a white space character, so it won't work at the start or the end of the string. So you would need to make those matches optional at the start or end of the string. – Tom Karzes Aug 29 '19 at 01:39
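
The behaviour described in these comments can be checked directly. The following is a minimal sketch rather than code from the thread: the sample strings come from the question, and the `anchored` pattern is only one assumed way of writing the "optional at the start or end" idea from the last comment.

```python
import re

# \b sits between a word character (\w) and a non-word character (\W).
# '#' is \W, so there is a word boundary right after the 'c' in 'c#'.
boundary_hit = re.search(r'\bc\b', 'c# silverlight') is not None
print(boundary_hit)  # True

# Requiring whitespace on both sides fails at the start or end of the string.
whitespace_hit = re.search(r'\sc\s', 'c++ c') is not None
print(whitespace_hit)  # False

# Assumed fix: accept the start or end of the string in place of whitespace.
anchored = re.compile(r'(?:^|\s)(?:c|java)(?:\s|$)')
print(anchored.search('c++ c') is not None)           # True
print(anchored.search('c# silverlight') is not None)  # False
```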

2 Answers

You can use a negative lookbehind and a negative lookahead pattern to ensure that each matching keyword is neither preceded nor followed by a non-space character:

(?<!\S)(?:c|java)(?!\S)

Demo: https://regex101.com/r/GOF8Uo/3
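
A minimal sketch of how this pattern might be applied to the sample tags from the question. It assumes `re.search` rather than `re.match`, because `re.match` only tries the pattern at the start of the string and would miss the trailing `c` in `c++ c`. The names `KEYWORD_RE` and `tags` are illustrative.

```python
import re

# Compiled once; the lookarounds require whitespace or a string edge on both sides.
KEYWORD_RE = re.compile(r'(?<!\S)(?:c|java)(?!\S)')

def match(x):
    # search() scans the whole string, not just its start
    return KEYWORD_RE.search(x) is not None

tags = [
    'c++ c',
    'c# silverlight data-binding',
    'c# silverlight data-binding columns',
    'jsp jstl',
    'java jdbc',
]
results = [match(t) for t in tags]
print(results)  # [True, False, False, False, True]
```

With the DataFrame from the question, `df.tags.apply(match)` should give the same booleans.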

Alternatively, simply split the given string into a list of words and test if any word is in the set of keywords you're looking for:

def match(x):
    return any(w in {'c', 'java'} for w in x.split())
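
For a large keyword list, the same idea can be sketched with a set that is built once up front. This is an assumption about how the approach might be organised, not a measured result; `KEYWORDS` is a hypothetical name.

```python
# Hypothetical keyword set; in practice it would be built once from the full list.
KEYWORDS = frozenset({'c', 'java'})

def match(x):
    # True if at least one whitespace-separated token is a keyword
    return not KEYWORDS.isdisjoint(x.split())

tags = [
    'c++ c',
    'c# silverlight data-binding',
    'c# silverlight data-binding columns',
    'jsp jstl',
    'java jdbc',
]
results = [match(t) for t in tags]
print(results)  # [True, False, False, False, True]
```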
blhsing
  • Thank you. Which method is usually faster, split or regex? I have around a million data points and 40k values in the list to check for. – user_12 Aug 29 '19 at 00:56
  • You're welcome. Regex is usually much slower than implementations with proper algorithms. See demo: https://repl.it/repls/BigPunctualLint – blhsing Aug 29 '19 at 01:03
  • If you want to speed it up, compile the regular expression (once), then use the compiled version. It's a good habit to *always* compile regular expressions, with `re.compile`. I think Python does some caching, but it's faster and more reliable to make it explicit (plus it makes it easy to reuse them elsewhere). – Tom Karzes Aug 29 '19 at 05:58
  • @TomKarzes Good point. I've updated my demo accordingly then. – blhsing Aug 29 '19 at 06:02
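
The compile-once advice in the comments above might look like the following sketch. The `keywords` list and the `build_pattern` helper are hypothetical; `re.escape` is used so that keywords containing regex metacharacters (such as `c++`) are matched literally.

```python
import re

def build_pattern(keywords):
    # Escape each keyword so characters like '+' are treated literally,
    # then wrap the alternation in the whitespace lookarounds.
    alternation = '|'.join(re.escape(k) for k in keywords)
    return re.compile(r'(?<!\S)(?:' + alternation + r')(?!\S)')

keywords = ['c', 'java', 'c++']      # illustrative list
pattern = build_pattern(keywords)    # compile once, reuse for every row

print(pattern.search('c# silverlight') is not None)  # False
print(pattern.search('jsp c++') is not None)         # True
```

With tens of thousands of keywords, the set-based split approach is likely to scale better than one very large alternation, but that is worth measuring on the real data.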

Have you tried using one of the regex test sites? They analyse your regex pattern and explain exactly what it actually matches. There are many of them.

I am not familiar with the Python match function, but it appears that it parses your input pattern into

\bc\b|\bjava\b

which matches either 'c' or 'java' at a word boundary. Consequently it matches a 'c' in rows 0, 1 and 2, returns no match for row 3 and matches 'java' in row 4, which accounts for your results.
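
For context on the Python function in question: `re.match` only attempts a match at the beginning of the string, while `re.search` scans the whole string. A small illustration with the pattern above (the string `'jdbc java'` is made up for the example):

```python
import re

pattern = re.compile(r'\bc\b|\bjava\b')

# match() is anchored at index 0
print(pattern.match('java jdbc') is not None)       # True
print(pattern.match('jdbc java') is not None)       # False
# search() looks at every position
print(pattern.search('jdbc java') is not None)      # True
# '#' is not a word character, so 'c' in 'c#' still ends at a word boundary
print(pattern.match('c# silverlight') is not None)  # True
```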

pjaj