I am trying to do some text processing using NLTK and Pandas.

I have a DataFrame with a column 'text'. I want to add a column 'text_tokenized' whose values are stored as nested lists (one list of tokens per sentence).

My code for tokenizing text is:

from nltk.tokenize import sent_tokenize, word_tokenize

def sent_word_tokenize(text):
    # Python 2: decode the raw string, replacing undecodable bytes
    text = unicode(text, errors='replace')
    # split into sentences, then tokenize each sentence into words
    sents = sent_tokenize(text)
    tokens = map(word_tokenize, sents)

    return tokens
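
For reference, a quick check of the nested-list output on a made-up sample string (this assumes NLTK's punkt data is available, e.g. via `nltk.download('punkt')`):

sent_word_tokenize("Hello world. How are you?")
# [['Hello', 'world', '.'], ['How', 'are', 'you', '?']]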

Currently, I am trying to apply this function as follows:

df['text_tokenized'] = df.apply(lambda row: sent_word_tokenize(row.text), axis=1)

This gives me the error:

ValueError: Shape of passed values is (100, 3), indices imply (100, 21)

I am not sure what is wrong here or how to fix it.

ymoiseev
  • Hard to say for sure, but looks like axis=1 is a _row_ operation when you have a _column_ of text? – benten Aug 02 '16 at 00:51
  • http://stackoverflow.com/a/19667189/1168680 – RAVI Aug 02 '16 at 02:40
  • @user2241910, I do not think it is related to axis. You can still retrieve data from row by doing row.text. `df_small['text_tokenized'] = df_small.apply(lambda row: row.text, axis=1)` works well – ymoiseev Aug 02 '16 at 03:14
  • @RAVI, I tried wrapping return statement in both tuple and list, but still have similar error: `ValueError: Shape of passed values is (100, 6), indices imply (100, 20)` – ymoiseev Aug 02 '16 at 03:15

1 Answer

Solved my own question by applying the function to the column directly:

Instead of:

df['text_tokenized'] = df.apply(lambda row: sent_word_tokenize(row.text), axis=1)

I used:

df['text_tokenized'] = df.text.apply(lambda text: sent_word_tokenize(text))

However, I am not sure why it works, and I would really appreciate it if somebody could explain it to me.

ymoiseev
  • When you specify `axis=1`, `DataFrame.apply` passes each **row** to your function. Because the function returns a list, pandas then tries to assemble those lists into a DataFrame aligned with the original frame, and the mismatched lengths raise the shape `ValueError`. `df.text.apply(...)` is a `Series.apply`, which simply stores whatever the function returns as a single cell value, so the nested lists are kept intact. – Nickil Maveli Aug 02 '16 at 10:54
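
To make the difference concrete, here is a minimal sketch with a made-up two-row DataFrame (the tokenizer is rewritten for Python 3, without the `unicode` call):

import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize

def sent_word_tokenize(text):
    # split into sentences, then tokenize each sentence into words
    return [word_tokenize(s) for s in sent_tokenize(text)]

df = pd.DataFrame({'text': ['Hello world. How are you?', 'One more row.']})

# Series.apply stores whatever the function returns as a single cell value,
# so each row ends up holding its own nested list of tokens
df['text_tokenized'] = df.text.apply(sent_word_tokenize)

# DataFrame.apply with axis=1 also calls the function once per row, but pandas
# then tries to line the returned lists up as columns of a new DataFrame; when
# their lengths do not match the expected shape, it raises the ValueError above
# df.apply(lambda row: sent_word_tokenize(row.text), axis=1)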