In this answer, I'm assuming that you're trying to convert your CSV into a file that LibSVM, LIBLINEAR, or scikit-learn can load.
You can use csv2libsvm, which is provided as part of the Ruby gem vector_embed:
$ gem install vector_embed
Successfully installed vector_embed-0.1.0
1 gem installed
You need Ruby 1.9 or later:
$ ruby -v
ruby 1.9.3p374 (2013-01-15 revision 38858) [x86_64-darwin12.2.0]
If you don't have Ruby 1.9, it's easy to install with rvm, which does not require (or recommend using) root:
$ curl -#L https://get.rvm.io | bash -s stable
$ rvm install 1.9.3
Once you have successfully run gem install vector_embed, make sure your first column is called "label":
$ cat example.csv
label,color,someOtherValue
1.2,red,55.6
1.9,blue,20.5
3.2,red,16.5
$ csv2libsvm example.csv > example.libsvm
$ cat example.libsvm
1.2 1139043:55.6 1997960:1
1.9 1089740:1 1139043:20.5
3.2 1139043:16.5 1997960:1
Note that it handles both categorical and continuous data, and that it uses MurmurHash version 3 to turn feature names into integer indices. Here "colorIsBlue" corresponds to 1089740 and "colorIsRed" to 1997960, while the continuous column someOtherValue maps to 1139043 (the Ruby code is really hashing something like "color\0red" rather than those readable names).
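To make the output format concrete, here is a minimal Python sketch of the same idea: numeric columns pass through as-is, categorical columns become one binary feature per (column, value) pair, and each feature name is hashed to an integer index. This is not the gem's implementation. The names feature_index, is_number, csv_to_libsvm and NUM_BUCKETS are hypothetical, and zlib.crc32 stands in for MurmurHash 3, so the indices it prints will not match the ones above.

```python
import csv
import io
import zlib

NUM_BUCKETS = 2 ** 21  # assumed size of the hashed feature space


def feature_index(name):
    """Map a feature name to a stable index in [1, NUM_BUCKETS]."""
    # crc32 is a stand-in for MurmurHash 3; the gem's indices will differ.
    return zlib.crc32(name.encode("utf-8")) % NUM_BUCKETS + 1


def is_number(text):
    """Return True if text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def csv_to_libsvm(csv_text):
    """Convert CSV text whose first column is "label" into libsvm lines."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        label = row.pop("label")
        features = {}
        for column, value in row.items():
            if is_number(value):
                # Continuous column: one feature per column, value kept as-is.
                features[feature_index(column)] = value
            else:
                # Categorical column: one binary feature per (column, value).
                features[feature_index(column + "\0" + value)] = "1"
        # LIBSVM expects indices in ascending order within a line.
        pairs = " ".join(f"{i}:{v}" for i, v in sorted(features.items()))
        lines.append(f"{label} {pairs}")
    return lines


example = """label,color,someOtherValue
1.2,red,55.6
1.9,blue,20.5
3.2,red,16.5
"""

for line in csv_to_libsvm(example):
    print(line)
```

As with any hashing scheme, two different feature names can occasionally land on the same index; a larger bucket count makes that less likely.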
If you're using an SVM, be sure to scale your data as the LIBSVM authors recommend in "A Practical Guide to Support Vector Classification".
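As a rough illustration of what scaling means for sparse data like this, the sketch below divides every feature by its largest absolute value so that all values land in [-1, 1]. The guide describes linear scaling to ranges such as [-1, +1] or [0, 1]; max-abs scaling is chosen here only because it keeps zeros at zero. The helper names fit_max_abs and scale_max_abs and the row representation (a dict of index to value) are assumptions for illustration. The point to carry over from the guide is that the scaling factors are computed on the training data and then reused for the test data.

```python
def fit_max_abs(rows):
    """Return {feature index: largest absolute value seen in rows}."""
    factors = {}
    for features in rows:
        for index, value in features.items():
            factors[index] = max(factors.get(index, 0.0), abs(value))
    return factors


def scale_max_abs(rows, factors):
    """Divide each value by its feature's factor.

    Features with no factor (unseen in training) or a zero factor
    are passed through unchanged.
    """
    scaled = []
    for features in rows:
        scaled.append({
            index: value / factors[index] if factors.get(index) else value
            for index, value in features.items()
        })
    return scaled


# Rows from example.libsvm, written as {index: value} dicts.
train = [
    {1139043: 55.6, 1997960: 1.0},
    {1089740: 1.0, 1139043: 20.5},
    {1139043: 16.5, 1997960: 1.0},
]

factors = fit_max_abs(train)                  # computed on training data only
train_scaled = scale_max_abs(train, factors)  # reuse the same factors for test data
print(train_scaled[0][1139043])               # 1.0, the largest value in that column
```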
Finally, let's say you're using scikit-learn's svmlight/libsvm loader:
>>> from sklearn.datasets import load_svmlight_file
>>> X_train, y_train = load_svmlight_file("/path/to/example.libsvm")
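If you want to sanity-check the generated file before handing it to a loader, the format is simple enough to parse by hand. The sketch below uses a hypothetical helper, parse_libsvm_line (not part of scikit-learn), that splits one line into its label and a dict of index to value. It does not handle optional extras such as qid fields or trailing comments.

```python
def parse_libsvm_line(line):
    """Split 'label idx:val idx:val ...' into (label, {idx: val})."""
    label, *pairs = line.split()
    features = {}
    for pair in pairs:
        index, value = pair.split(":")
        features[int(index)] = float(value)
    return float(label), features


label, features = parse_libsvm_line("1.9 1089740:1 1139043:20.5")
print(label, features)
```

Because the hashed indices run into the millions, the loaded X_train will have a correspondingly large number of columns. That is normally fine, since load_svmlight_file returns a SciPy sparse matrix rather than a dense array.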