I'm developing a neural network with nntool in MATLAB, and my inputs are 11,250 text files of varying length (from 10 to 500 words, or roughly 10 to 200 words after eliminating redundant words). I haven't found a good method to represent these input texts as numeric data for my training algorithm. I thought about building a vocabulary of words, but I found that the vocabulary contains 16,000 distinct words, which is huge. Some words are shared between text files.
What is the overall goal of your neural network? What is the expected output? If this is, say, a spam classifier, then a binary vector the size of your vocabulary, where 0/1 indicates the presence of a particular word, is what is usually done. – rayryeng May 03 '16 at 20:28
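For illustration, a minimal MATLAB sketch of that binary presence vector (the vocabulary and document below are toy placeholders; in practice vocab would hold the 16,000 words):

    % Binary presence vector: 1 where a vocabulary word occurs in the
    % document, 0 otherwise. Toy data stands in for the real vocabulary.
    vocab    = {'apple', 'banana', 'cherry'};   % stands in for the 16,000-word vocabulary
    docWords = {'banana', 'apple', 'banana'};   % words of one text file

    x = double(ismember(vocab, docWords));      % x = [1 1 0]
    % Stack one such vector per file to form the nntool input matrix.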
1 Answer
For a quick solution, look into "bag of words" or "tf-idf". If you don't know what these are, start here: https://en.wikipedia.org/wiki/Vector_space_model or https://en.wikipedia.org/wiki/Document_classification .
Have you read any book about NLP? This one may be valuable as a starting point: http://www.nltk.org/book/ .
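As a rough sketch of the tf-idf weighting (toy counts, plain MATLAB using bsxfun so no toolbox is assumed; one row per document, one column per vocabulary term):

    % tf-idf from a raw term-count matrix (numDocs x vocabSize).
    counts = [2 1 0; 0 3 1; 1 0 0];                  % toy counts: 3 docs x 3 terms

    tf  = bsxfun(@rdivide, counts, sum(counts, 2));  % term frequency per document
    df  = sum(counts > 0, 1);                        % docs containing each term
    idf = log(size(counts, 1) ./ max(df, 1));        % inverse document frequency
    tfidf = bsxfun(@times, tf, idf);                 % weighted feature matrix

With 16,000 terms the vectors are long but sparse, and tf-idf downweights words that appear in most of the files, which helps with the vocabulary-size concern in the question.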

404pio