Take a string and remove all words that are matched in an array

Question

I want to write a function that takes a string text and an array words, compares the string against the array, and removes every word in the string that occurs in the words array.

Example:

remove = ["first",
         "second",
         "third",
         "fourth",
         "fifth",
         "sixt",
         "seventh",
         "eigth",
         "ninth",
         "tenth",
         "monday",
         "tuesday",
         "wednesday",
         "thursday",
         "friday",
         "saturday",
         "sunday",
         "$"]

def preprocess(text, words):
    text = text.lower()
    if text in words... #Not sure what to do here...

myString = "I got second in todays competition"
preprocess(myString, remove)
#Should return: "I got in todays competition"

score 1 · Answer 1 · answered Dec 17 '19 at 13:40

Here is an example that fills in your preprocess function:

remove = ["first",
         "second",
         "third",
         "fourth",
         "fifth",
         "sixt",
         "seventh",
         "eigth",
         "ninth",
         "tenth",
         "monday",
         "tuesday",
         "wednesday",
         "thursday",
         "friday",
         "saturday",
         "sunday",
         "$"]

def preprocess(text, words):
    good = []
    text = text.lower()
    for a_word in text.split():
        if a_word not in words:
            good.append(a_word)

    s = " "
    return s.join(good)

myString = "I got second in todays competition"
test = preprocess(myString, remove)
print(test)
#Prints: "i got in todays competition" (lowercase, because of text.lower())

Here, we create an empty list, good, to hold the words that are not in the words list. We then lowercase the text, split it on whitespace, and loop through the resulting words, appending each one that is not in words. Once we have the list of "good words" (i.e., those not in the list to be removed), we join them with a single space. See the documentation for str.join for more info.

Returns:

i got in todays competition
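
Because the loop above lowercases the whole text, the result loses its original capitalization. If that matters, a minimal sketch along these lines might help. It assumes you want a case-insensitive match but the original words in the output; preprocess_keep_case is a hypothetical name, and the list is shortened for the example.

```python
def preprocess_keep_case(text, words):
    # Hypothetical variant: build a set once for fast membership tests.
    blocked = {w.lower() for w in words}
    # Compare the lowercased token, but keep the original token in the output.
    kept = [token for token in text.split() if token.lower() not in blocked]
    return " ".join(kept)

remove = ["first", "second", "third", "monday", "$"]  # shortened list for the example
print(preprocess_keep_case("I got Second in todays competition", remove))
# prints: I got in todays competition
```

Like the original, this only removes tokens that match exactly after splitting on whitespace, so "second," with a trailing comma would be kept.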

score 1 · Accepted Answer · answered Dec 17 '19 at 13:45


You could tokenize your string and then filter the resulting array:

tokens = text.split(' ')
tokens_filtered = filter(lambda word: word not in words, tokens)
text_filtered = ' '.join(tokens_filtered)
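
To try the three lines above end to end, they can be wrapped in a function. This is only a sketch: filter_words is a hypothetical name, and the word list is shortened. Note that, unlike the other answers, nothing here lowercases the text, so the comparison is case-sensitive.

```python
def filter_words(text, words):
    # Hypothetical wrapper around the three lines above.
    tokens = text.split(' ')
    # filter() returns a lazy iterator in Python 3; join() consumes it directly.
    tokens_filtered = filter(lambda word: word not in words, tokens)
    return ' '.join(tokens_filtered)

print(filter_words("I got second in todays competition", ["second", "monday"]))
# prints: I got in todays competition
print(filter_words("I got Second in todays competition", ["second", "monday"]))
# prints: I got Second in todays competition (the match is case-sensitive)
```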

answered Dec 17 '19 at 13:45

mfilippo


score 0 · Answer 3 · answered Dec 17 '19 at 13:37


Using regex, for example:

import re

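def preprocess(text, words):
    # Escape each word (so "$" is literal) and group the alternation so the
    # boundary checks apply to every alternative, not just the first and last.
    alternatives = "|".join(re.escape(w) for w in words)
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", flags=re.I)
    return pattern.sub("", text)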

myString = "I got second in todays competition"
print(preprocess(myString, remove))

answered Dec 17 '19 at 13:37

Rakesh


Is there a way to remove the double space that is left now? – Jesper.Lindberg Dec 17 '19 at 13:38
Already answered here: https://stackoverflow.com/questions/2077897/substitute-multiple-whitespace-with-single-whitespace-in-python --> `re.sub(r'\s+', ' ', pattern.sub("", text))` – Rakesh Dec 17 '19 at 13:41
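
Putting the two comments together, the removal and the whitespace cleanup could be combined roughly as follows. This is a sketch under a few assumptions: remove_and_tidy is a hypothetical name, the words are escaped with re.escape so that "$" is treated literally, and lookarounds are used in place of \b so that a standalone "$" can match.

```python
import re

def remove_and_tidy(text, words):
    # Hypothetical helper; escaping keeps characters such as "$" literal.
    alternatives = "|".join(re.escape(w) for w in words)
    # Match only when the word is not glued to other word characters.
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", flags=re.I)
    removed = pattern.sub("", text)
    # Collapse any run of whitespace to one space and trim the ends.
    return re.sub(r"\s+", " ", removed).strip()

print(remove_and_tidy("I got second in todays competition", ["first", "second"]))
# prints: I got in todays competition
```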

score 0 · Answer 4 · answered Dec 17 '19 at 13:39


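You can replace each word from the list with an empty string:

    text = text.lower()
    for word in words:
        # str.replace works on substrings, so it also hits the word inside longer words
        text = text.replace(word, '')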
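
One caveat worth knowing with str.replace: it works on substrings rather than whole words, and it leaves the surrounding spaces in place. A small sketch of that behavior (naive_remove is a hypothetical name, and the sample sentence is made up for illustration):

```python
def naive_remove(text, words):
    # Hypothetical helper: plain substring replacement, one word at a time.
    for word in words:
        text = text.replace(word, "")
    return text

print(naive_remove("I finished firstly, then rested", ["first"]))
# prints: I finished ly, then rested
```

If whole-word matching is what you need, splitting the text into words or using a regex with boundaries is probably a safer fit.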

answered Dec 17 '19 at 13:39

Mayowa Ayodele

