importing categorical data from CSV into scikit-learn

Question

I would like to import data from a CSV file to use in scikit-learn. It has a mix of numerical and categorical data, e.g.

someValue,color,someOtherValue
1.2,red,55.6
1.9,blue,20.5
3.2,red,16.5

I need to convert this representation into a purely numerical one where categorical data points get converted into multiple binary columns, e.g.

someValue,colorIsRed,colorIsBlue,someOtherValue
1.2,1,0,55.6
1.9,0,1,20.5
3.2,1,0,16.5

Is there any utility that does this for me, or an easy way to iterate through the data and get this representation?

A simple solution is to do this step in R: http://stackoverflow.com/questions/5048638/automatically-expanding-an-r-factor-into-a-collection-of-1-0-indicator-variables — John Horton, May 25 '13 at 21:18

score 4 · Accepted Answer · answered Aug 01 '12 at 23:40

As far as I know, scikit-learn doesn't offer general-purpose CSV-loading functions, but it does prefer NumPy arrays as input. NumPy's loadtxt function, together with its converters parameter, can load your CSV and convert each column to a suitable type. It does not binarize your second column, though.
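
As a rough illustration of the loadtxt approach, the sketch below loads the question's sample data with a converter for the color column and then builds the indicator columns by hand. The names `CSV_TEXT`, `COLOR_CODES` and `color_to_code` are made up for this example, and the set of colors is assumed to be known in advance.

```python
import io

import numpy as np

# Sample data from the question, inlined so the sketch is self-contained.
CSV_TEXT = """someValue,color,someOtherValue
1.2,red,55.6
1.9,blue,20.5
3.2,red,16.5
"""

# Hypothetical mapping from category to a numeric code (assumed known up front).
COLOR_CODES = {"red": 0.0, "blue": 1.0}


def color_to_code(field):
    # Depending on the NumPy version, converters receive bytes or str.
    if isinstance(field, bytes):
        field = field.decode("utf-8")
    return COLOR_CODES[field.strip()]


# loadtxt turns the color column into a single numeric code, not indicators.
data = np.loadtxt(
    io.StringIO(CSV_TEXT),
    delimiter=",",
    skiprows=1,  # skip the header row
    converters={1: color_to_code},
)

# Binarize by hand: one 0/1 column per category.
codes = data[:, 1]
indicators = np.column_stack([codes == 0.0, codes == 1.0]).astype(float)

# Column order: someValue, colorIsRed, colorIsBlue, someOtherValue
X = np.column_stack([data[:, 0], indicators, data[:, 2]])
print(X)
```

This only scales to a handful of known categories; for anything larger, a vectorizer that learns the categories from the data is usually less error-prone.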

Sicco


Then what is the proper way to represent multiclass categorical data in scikit-learn? As far as I know, binarizing categorical variables is the way to do it. – genekogan Aug 02 '12 at 05:37

Yes, you have to binarize the data so that the resulting array is homogeneous with a float data type. You can have a look at the implementation of [DictVectorizer](http://scikit-learn.org/dev/modules/feature_extraction.html#loading-features-from-dicts) for an example of how to do this. The code is [here](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/dict_vectorizer.py). – ogrisel Aug 02 '12 at 07:09
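
Rather than reading its source, DictVectorizer can also be used directly. The following is a minimal sketch, assuming the question's sample data and that the numeric columns are converted to float before vectorizing; `CSV_TEXT` and `rows` are names invented for the example.

```python
import csv
import io

from sklearn.feature_extraction import DictVectorizer

# Sample data from the question, inlined so the sketch is self-contained.
CSV_TEXT = """someValue,color,someOtherValue
1.2,red,55.6
1.9,blue,20.5
3.2,red,16.5
"""

rows = []
for row in csv.DictReader(io.StringIO(CSV_TEXT)):
    rows.append(
        {
            # Numeric fields must be real numbers, or they would be
            # treated as categories too.
            "someValue": float(row["someValue"]),
            # String values are expanded into one indicator per category.
            "color": row["color"],
            "someOtherValue": float(row["someOtherValue"]),
        }
    )

vec = DictVectorizer(sparse=False)  # dense array for easy inspection
X = vec.fit_transform(rows)

print(vec.feature_names_)
print(X)
```

With the default settings the feature names are sorted, so the columns come out as `color=blue`, `color=red`, `someOtherValue`, `someValue` rather than in the CSV's original order.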

score 2 · Answer 2 · answered Apr 02 '13 at 01:46

In this answer, I'm assuming that you're trying to convert your CSV into a file that LibSVM, LIBLINEAR, or scikit-learn can load.

You can use csv2libsvm, which is provided as part of the Ruby gem vector_embed:

$ gem install vector_embed
Successfully installed vector_embed-0.1.0
1 gem installed

You need Ruby 1.9+...

$ ruby -v
ruby 1.9.3p374 (2013-01-15 revision 38858) [x86_64-darwin12.2.0]

If you don't have Ruby 1.9, it's easy to install with rvm, which does not require (or recommend using) root:

$ curl -#L https://get.rvm.io | bash -s stable
$ rvm install 1.9.3

Once you have successfully run gem install vector_embed, make sure your first column is called "label":

$ cat example.csv 
label,color,someOtherValue
1.2,red,55.6
1.9,blue,20.5
3.2,red,16.5

$ csv2libsvm example.csv > example.libsvm

$ cat example.libsvm
1.2 1139043:55.6 1997960:1
1.9 1089740:1 1139043:20.5
3.2 1139043:16.5 1997960:1

Note that it handles both categorical and continuous data, and that it uses MurmurHash version 3 to generate the feature names ("colorIsBlue" corresponds to 1089740, "colorIsRed" is 1997960... though the Ruby code is really hashing something like "color\0red").
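
To give a feel for how hashed feature indices like these come about, here is a small Python sketch of the general idea using scikit-learn's `murmurhash3_32` helper. The seed, the bucket count `N_BUCKETS` and the function `feature_index` are assumptions made for illustration; this is not the gem's actual code and is not expected to reproduce the indices shown above.

```python
from sklearn.utils import murmurhash3_32

# Hypothetical size of the index space; the gem's real value is not stated here.
N_BUCKETS = 2 ** 21


def feature_index(name, value=None):
    """Map a feature (and, for categoricals, its value) to a 1-based index."""
    # Categorical features hash "name\0value"; continuous ones hash the name only.
    key = name if value is None else name + "\0" + str(value)
    return murmurhash3_32(key, seed=0, positive=True) % N_BUCKETS + 1


# Each category of a categorical column gets its own hashed index ...
print(feature_index("color", "red"))
print(feature_index("color", "blue"))
# ... while a continuous column gets a single index and keeps its value.
print(feature_index("someOtherValue"))
```

Because the index depends only on the hashed string, no dictionary of feature names has to be stored, at the cost of occasional collisions between unrelated features.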

If you're using an SVM, be sure to scale your data, as the LIBSVM authors recommend in "A Practical Guide to Support Vector Classification".
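
One possible way to do that scaling inside scikit-learn is sketched below, assuming a [0, 1] target range and using only the someOtherValue column from the example; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# The continuous column from the example, as a 2-D array (n_samples, n_features).
some_other_value = np.array([[55.6], [20.5], [16.5]])

# Rescale linearly so the smallest training value maps to 0 and the largest to 1.
scaler = MinMaxScaler(feature_range=(0.0, 1.0))
scaled = scaler.fit_transform(some_other_value)
print(scaled)

# Reuse the fitted scaler on new data so both sets share the same scaling.
new_values = np.array([[30.0]])
print(scaler.transform(new_values))
```

Fitting the scaler on the training data only, and then applying it unchanged to test data, keeps the two sets on the same scale.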

Finally, let's say you're using scikit-learn's svmlight/libsvm loader:

>>> from sklearn.datasets import load_svmlight_file
>>> X_train, y_train = load_svmlight_file("/path/to/example.libsvm")
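
For a quick check without a file on disk, the same loader also accepts a binary file-like object. The sketch below feeds it the three lines of example.libsvm shown earlier; `LIBSVM_TEXT` is just a name chosen for the example.

```python
import io

from sklearn.datasets import load_svmlight_file

# Contents of example.libsvm from above, as bytes (the loader expects binary mode).
LIBSVM_TEXT = b"""1.2 1139043:55.6 1997960:1
1.9 1089740:1 1139043:20.5
3.2 1139043:16.5 1997960:1
"""

X_train, y_train = load_svmlight_file(io.BytesIO(LIBSVM_TEXT))

# X_train is a SciPy sparse matrix: very wide because of the hashed indices,
# but only six entries are actually stored.
print(X_train.shape, X_train.nnz)
print(y_train)
```

The labels come from the first column of each line, and the feature matrix stays sparse, which most scikit-learn linear models and SVMs accept directly.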
