I'm developing a neural network with nntool in MATLAB, and my inputs are 11250 text files of different lengths (from 10 to 500 words, or roughly 10 to 200 words if I eliminate redundant words). I haven't found a good method to represent these input texts as numerical data for my training algorithm. I thought about building a vocabulary of words, but I found that the vocabulary contains 16000 different words, which is huge. Some words are shared between text files.

Eadhun Di
    What is the overall goal of your neural network?... what is the expected output? If this is, say, a spam classifier, then a binary vector that is the size of your vocabulary where 0/1 indicates the presence of a particular word is what is usually done. – rayryeng May 03 '16 at 20:28
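
As an illustration of that suggestion, here is a minimal MATLAB sketch of such a presence vector; the vocabulary and document tokens below are made-up toy data, not taken from the question:

    % Minimal sketch with toy data: build a 0/1 presence vector for one
    % document against a fixed vocabulary.
    vocab    = {'offer', 'free', 'meeting', 'report'};  % hypothetical vocabulary
    docWords = {'free', 'offer', 'offer', 'now'};       % tokens from one text file
    x = double(ismember(vocab, docWords));              % 1 where the vocab word occurs
    % x is [1 1 0 0]; stacking one such row per file gives an input matrix for nntool.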

1 Answer

For a quick solution, look into "bag of words" or "tf-idf". If you don't know what these are, start here: https://en.wikipedia.org/wiki/Vector_space_model or https://en.wikipedia.org/wiki/Document_classification .
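To make the idea concrete, here is a minimal bag-of-words and tf-idf sketch in MATLAB; the corpus and vocabulary are made-up toy data:

    % Toy corpus (hypothetical): each document is a cell array of tokens.
    vocab = {'neural', 'network', 'text', 'words'};
    docs  = {{'neural','network','network'}, {'text','words','words','text'}};

    % Bag of words: counts(i,j) = number of times vocab{j} occurs in document i.
    counts = zeros(numel(docs), numel(vocab));
    for i = 1:numel(docs)
        for j = 1:numel(vocab)
            counts(i, j) = sum(strcmp(docs{i}, vocab{j}));
        end
    end

    % tf-idf reweighting: down-weight terms that appear in many documents.
    df    = sum(counts > 0, 1);              % document frequency of each term
    idf   = log(numel(docs) ./ max(df, 1));  % max() avoids division by zero
    tfidf = bsxfun(@times, counts, idf);     % each row is now a tf-idf feature vector

Each row of tfidf (or counts) can then serve as one input vector; with a 16000-word vocabulary the vectors will be long but mostly zero, which is why removing redundant words, as mentioned in the question, helps.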

Have you read any book about NLP? This one may be valuable, at least at the very beginning: http://www.nltk.org/book/ .

404pio