1

I want to write a function that takes a string text and a list words, compares the string against the list, and removes from the string every word that occurs in words.

Example:

remove = ["first",
         "second",
         "third",
         "fourth",
         "fifth",
         "sixth",
         "seventh",
         "eighth",
         "ninth",
         "tenth",
         "monday",
         "tuesday",
         "wednesday",
         "thursday",
         "friday",
         "saturday",
         "sunday",
         "$"]

def preprocess(text, words):
    text = text.lower()
    if text in words... #Not sure what to do here...

myString = "I got second in todays competition"
preprocess(myString, remove)
#Should return: "I got in todays competition"
Jesper.Lindberg

4 Answers

1

Here is a working version of your preprocess function:

remove = ["first",
         "second",
         "third",
         "fourth",
         "fifth",
         "sixth",
         "seventh",
         "eighth",
         "ninth",
         "tenth",
         "monday",
         "tuesday",
         "wednesday",
         "thursday",
         "friday",
         "saturday",
         "sunday",
         "$"]

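def preprocess(text, words):
    good = []  # words that are not in the removal list
    text = text.lower()  # lowercase so the comparison ignores case
    for a_word in text.split():
        if a_word not in words:
            good.append(a_word)

    # Join the kept words back together with single spaces
    return " ".join(good)

myString = "I got second in todays competition"
test = preprocess(myString, remove)
print(test)
# Prints: "i got in todays competition" (lowercase, because of text.lower())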

Here, we create an empty list good to hold the words that are not in the words list. We then split the lowercased text into words and loop over them, appending each word that is not in the removal list. Once we have the list of all the "good words", we join them with a space using str.join. Because the text is lowercased first, the result is lowercase too.

Returns:

i got in todays competition
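
If the original capitalization should survive, as in the expected output in the question, one option is to lowercase each word only for the comparison. This is a minimal sketch that assumes words are separated by whitespace; the name preprocess_keep_case and the conversion to a set are my own additions, not part of the answer above:

```python
def preprocess_keep_case(text, words):
    # Hypothetical variant: a set makes each membership test O(1) on average
    unwanted = {w.lower() for w in words}
    # Compare in lowercase, but keep each word exactly as it was written
    kept = [w for w in text.split() if w.lower() not in unwanted]
    return " ".join(kept)


remove = ["second", "monday", "$"]  # shortened list, for illustration only
print(preprocess_keep_case("I got second in todays competition", remove))
# I got in todays competition
```

Note that a word with attached punctuation, such as "second,", is a different token and would not be removed by this approach.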

artemis
1

You could split your string into tokens and then filter the resulting list:

tokens = text.split(' ')
tokens_filtered = filter(lambda word: word not in words, tokens)
text_filtered = ' '.join(tokens_filtered)
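
Wrapped in a function, the same idea might look like the sketch below. Two details are my assumptions rather than part of the snippet above: the name preprocess_filter, and the lowercase comparison, which relies on every entry in words being lowercase. It also uses split() without an argument, because split(' ') yields empty strings when the text contains consecutive spaces.

```python
def preprocess_filter(text, words):
    # split() with no argument splits on any run of whitespace
    tokens = text.split()
    # filter() returns a lazy iterator in Python 3; str.join consumes it directly
    # Assumption: the entries in `words` are already lowercase
    tokens_filtered = filter(lambda word: word.lower() not in words, tokens)
    return " ".join(tokens_filtered)


print("a  b".split(" "))  # ['a', '', 'b']
print("a  b".split())     # ['a', 'b']
```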
mfilippo
0

You can use a regular expression. For example:

import re

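def preprocess(text, words):
    # Group the alternatives so \b applies to every word, and escape
    # special characters such as "$" so they are matched literally.
    # Note: \b needs an adjacent word character, so a lone "$" is not removed.
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b", flags=re.I)
    return pattern.sub("", text)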

myString = "I got second in todays competition"
print(preprocess(myString, remove))
Rakesh
  • Is there a way to remove the double space that is left behind? – Jesper.Lindberg Dec 17 '19 at 13:38
  • Already answered here: https://stackoverflow.com/questions/2077897/substitute-multiple-whitespace-with-single-whitespace-in-python --> `re.sub(r'\s+', ' ', pattern.sub("", text))` – Rakesh Dec 17 '19 at 13:41
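
Putting the regular expression and the whitespace cleanup from the comments together could look like the following. It is a sketch under assumptions: the function name preprocess_regex is hypothetical, and the final strip() is my addition to handle a removed word at the start or end of the text.

```python
import re


def preprocess_regex(text, words):
    # re.escape keeps characters such as "$" literal;
    # (?:...) makes \b apply to every alternative, not just the first and last
    alternatives = "|".join(re.escape(w) for w in words)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", flags=re.IGNORECASE)
    without_words = pattern.sub("", text)
    # Collapse the whitespace runs left behind and trim both ends
    return re.sub(r"\s+", " ", without_words).strip()


print(preprocess_regex("I got second in todays competition", ["first", "second"]))
# I got in todays competition
```

Because of the word boundaries, a longer word that merely contains an entry (for example "firsthand") is left untouched.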
0

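You can replace each unwanted word in the text with an empty string:

    text = text.lower()
    # str.replace works on substrings, so "first" inside "firsthand" is removed too,
    # and each removed word can leave a double space behind
    for word in words:
      text = text.replace(word, '')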
Mayowa Ayodele