I am trying to do some text processing using NLTK and Pandas.
I have a DataFrame with a column 'text'. I want to add a column 'text_tokenized' that stores the tokens of each text as a nested list (one list of word tokens per sentence).
My code for tokenizing text is:
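from nltk.tokenize import sent_tokenize, word_tokenize

def sent_word_tokenize(text):
    # Python 2: decode the byte string, replacing undecodable bytes
    text = unicode(text, errors='replace')
    # Split into sentences, then tokenize each sentence into words
    sents = sent_tokenize(text)
    tokens = map(word_tokenize, sents)
    return tokens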
Currently, I am trying to apply this function as follows:
df['text_tokenized'] = df.apply(lambda row: sent_word_tokenize(row.text), axis=1)
This gives me the following error:
ValueError: Shape of passed values is (100, 3), indices imply (100, 21)
I am not sure what is wrong here or how to fix it.
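One direction that might be worth trying is applying the function to the 'text' column itself (Series.apply) instead of to each row of the DataFrame. My understanding, which I have not confirmed, is that DataFrame.apply with axis=1 may try to expand a returned list into several columns, which could explain the shape mismatch. The sketch below is written for Python 3 and uses a hypothetical regex-based stand-in for sent_tokenize and word_tokenize so that it runs without NLTK data; the sample texts are made up for illustration.

```python
import re

import pandas as pd


def sent_word_tokenize(text):
    # Hypothetical stand-in for NLTK's sent_tokenize / word_tokenize so the
    # sketch runs without NLTK data; the real tokenizers would go here.
    sents = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    return [re.findall(r"\w+|[^\w\s]", s) for s in sents]


# Made-up sample data, only for illustration
df = pd.DataFrame({'text': ['Hello world. How are you?', 'One sentence only.']})

# Series.apply calls the function once per cell and stores each returned
# nested list as a single object in the new column.
df['text_tokenized'] = df['text'].apply(sent_word_tokenize)
```

With this approach each cell of 'text_tokenized' would hold one nested list, and the DataFrame keeps the same number of rows.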