I want to use the 20 newsgroups dataset to test an algorithm and analyze the significant words for each group.

I found the dataset on the website provided by the University of Toronto, but I can't find the corresponding vocabulary file for it. Could anyone point me in the right direction?

Miao Yu

1 Answer

You could try here for the 20 newsgroups dataset. It also includes a vocabulary file, but that file may not be consistent with the data you already have, so it might help to use all the files from there.

Hope this helps!

Matthew Spencer
  • I know this official website, but the pre-processing of that data may have lost some information, and the results are also unsatisfactory. After excluding the stopwords from 20news-bydate-matlab.tgz, the remaining vocabulary still contains words like 'sgi', 'cec', 'att', ... And I have no idea how to replace the words with their stems. – Miao Yu Dec 03 '14 at 06:36
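
For the stopword and stemming steps mentioned in the comment, one option is to build the vocabulary yourself from the raw text. The following is only a minimal sketch: the stopword list, the `crude_stem` helper, the `min_doc_freq` threshold, and the sample documents are all made up for illustration and are not part of the dataset or its official pre-processing. A document-frequency threshold is one common way to drop rare tokens; whether it removes tokens such as 'sgi' depends on how often they actually occur in the corpus.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are", "for", "on", "with"}

# Suffixes stripped by the crude stemmer below, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip one common English suffix; a rough stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_vocabulary(documents, min_doc_freq=2):
    """Return sorted stems that occur in at least `min_doc_freq` documents."""
    doc_freq = Counter()
    for text in documents:
        # Lowercase and keep alphabetic tokens only.
        tokens = re.findall(r"[a-z]+", text.lower())
        # Use a set so each stem is counted once per document.
        stems = {crude_stem(t) for t in tokens if t not in STOPWORDS}
        doc_freq.update(stems)
    return sorted(s for s, n in doc_freq.items() if n >= min_doc_freq)


# Hypothetical stand-ins for newsgroup posts.
docs = [
    "The graphics cards are rendering images",
    "Rendered images on a graphics card",
    "Hockey players and hockey games",
]
vocab = build_vocabulary(docs)
print(vocab)
```

The suffix stripping here is deliberately naive (it turns "images" into "imag", for example). In practice you would probably swap `crude_stem` for an established stemmer, such as the Porter stemmer available in NLTK, and use a full stopword list.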