I'm developing a neural network with nntool in MATLAB, and my inputs are 11250 text files of different lengths (from 10 to 500 words, or roughly 10 to 200 words if I eliminate redundant words). I haven't found a good method to represent these input texts as numerical data for my training algorithm. I thought about building a vocabulary of words, but I found that the vocabulary contains 16000 different words, which is huge. Some words are shared between text files.

Eadhun Di
    What is the overall goal of your neural network?... what is the expected output? If this is, say, a spam classifier, then a binary vector that is the size of your vocabulary where 0/1 indicates the presence of a particular word is what is usually done. – rayryeng May 03 '16 at 20:28
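
As an illustration of that suggestion, here is a minimal MATLAB sketch of such a presence vector; the vocabulary and document tokens below are made-up toy data, not taken from the question:

    % Minimal sketch with toy data: build a 0/1 presence vector for one
    % document against a fixed vocabulary.
    vocab    = {'offer', 'free', 'meeting', 'report'};  % hypothetical vocabulary
    docWords = {'free', 'offer', 'offer', 'now'};       % tokens from one text file
    x = double(ismember(vocab, docWords));              % 1 where the vocab word occurs
    % x is [1 1 0 0]; stacking one such row per file gives an input matrix for nntool.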

1 Answer

For a quick solution, look into "bag of words" or "tf-idf". If you don't know what these are, start here: https://en.wikipedia.org/wiki/Vector_space_model or https://en.wikipedia.org/wiki/Document_classification .
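To make the idea concrete, here is a minimal bag-of-words and tf-idf sketch in MATLAB; the corpus and vocabulary are made-up toy data:

    % Toy corpus (hypothetical): each document is a cell array of tokens.
    vocab = {'neural', 'network', 'text', 'words'};
    docs  = {{'neural','network','network'}, {'text','words','words','text'}};

    % Bag of words: counts(i,j) = number of times vocab{j} occurs in document i.
    counts = zeros(numel(docs), numel(vocab));
    for i = 1:numel(docs)
        for j = 1:numel(vocab)
            counts(i, j) = sum(strcmp(docs{i}, vocab{j}));
        end
    end

    % tf-idf reweighting: down-weight terms that appear in many documents.
    df    = sum(counts > 0, 1);              % document frequency of each term
    idf   = log(numel(docs) ./ max(df, 1));  % max() avoids division by zero
    tfidf = bsxfun(@times, counts, idf);     % each row is now a tf-idf feature vector

Each row of tfidf (or counts) can then serve as one input vector; with a 16000-word vocabulary the vectors will be long but mostly zero, which is why removing redundant words, as mentioned in the question, helps.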

Have you read any book about NLP? This one may be valuable, at least at the very beginning: http://www.nltk.org/book/ .

404pio