Train `sklearn` ML model with scipy sparse matrix and numpy array

Question

To explain my use case a bit more: A is a sparse matrix with tf-idf values and B is an array with some additional features of my data.

I have already split the data into training and test sets, so A and B in my example cover only the training set. I (want to) do the same for the test set after this code.

I want to concatenate these matrices/arrays because then I want to pass them to a sklearn ML model to train it and I do not think that I can pass them separately.

So I tried to do this:

C = np.concatenate((A, B.T), axis=1)

where A is a <class 'scipy.sparse.csr.csr_matrix'> and B is a <class 'numpy.ndarray'>.

However, when I try to do this then I get the following error:

ValueError: zero-dimensional arrays cannot be concatenated

Also, I do not think that using `np.concatenate` on a numpy array and a sparse matrix is a very good idea in my case because:

1. it is basically impossible to convert my sparse matrix A to a dense array because it is too big
2. I will lose (or not actually??) information if I convert my fully dense array B to a sparse array

What is the best way to pass to an sklearn ML model a sparse and a fully dense array concatenated by rows?

Do you want the result to be an array? You can convert your sparse matrix to an array using `A.A` — user3483203, Aug 05 '19 at 16:27
@user3483203, hm I see your point. My sparse matrix is too big to convert it to an array. Ideally I would like to have as an output a "hybrid" array where one part is a sparse matrix and the other one an array. — Outcast, Aug 05 '19 at 16:29
That isn't really possible with `numpy`. What is your final use case? — user3483203, Aug 05 '19 at 16:30
Possible duplicate of [`np.concatenate` a numpy array with a sparse matrix](https://stackoverflow.com/questions/49420274/np-concatenate-a-numpy-array-with-a-sparse-matrix) — Giampietro Seu, Aug 05 '19 at 16:31
@user3483203, basically the sparse matrix are tf-idf data and the array is some additional features. I want to have one final concatenated array so that I can pass it to a ML model (e.g. random forest) and train it etc. I think that you cannot pass multiple arrays (of different kinds) as training sets in `sklearn`? — Outcast, Aug 05 '19 at 16:31
OK, so I assume you are chunking your sparse array at some point. Rather than concatenate the entire thing, you could concatenate only the chunks of the array and the sparse matrix at a time. — user3483203, Aug 05 '19 at 16:33
@user3483203, hm I am not sure what you mean. Where/When exactly will I concatenate them at a time? I am not really chunking sparse array. I just split to training and test set at the very beginning of my application. — Outcast, Aug 05 '19 at 16:36
`scipy.sparse` has its own `vstack` and `hstack`. These convert all inputs to `coo` sparse format, and make a new `coo` matrix from the combined data/row/col attributes. `np.array(a_sparse_matrix)` does not create a valid `ndarray`. `a_sparse_matrix.toarray()` is the proper converter. — hpaulj, Aug 05 '19 at 16:46
@hpaulj, I am not sure how your comment answers my points (1) and (2) above towards the end of my post. Have you read them? What is your specific answer to each one of them? — Outcast, Aug 05 '19 at 16:47
@hpaulj, cool, this is already answered by your post here: https://stackoverflow.com/a/49420566/9024698. — Outcast, Aug 05 '19 at 16:52
@hpaulj, so then two questions: 1) We definitely cannot then have a "hybrid" matrix which has a sparse matrix and a numpy array concatenated in it?, 2) If you see my points above, I say that (1) is impossible because my sparse matrix is too big. What about (2)? Do I lose information if I convert a fully dense array to a sparse matrix or not? — Outcast, Aug 05 '19 at 16:55

score 6 · Answer 1 · answered Aug 05 '19 at 18:00

You can use `hstack` from `scipy.sparse`. It converts both inputs to scipy `coo_matrix` format, merges them, and returns a `coo_matrix` by default.
No information is lost when converting a dense array to a sparse one; a sparse matrix is just a compact storage format. Also, unless you specify a value for the `dtype` argument of `hstack`, everything is upcast to a common type, so there is no possibility of data loss there either.
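
As a small illustration of the two claims above, here is a sketch with made-up data (the names `B_demo` and `A_int` are hypothetical, not from the question). It round-trips a dense array through a sparse format and stacks an integer sparse block next to a float dense block:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Hypothetical dense feature block: 3 samples, 2 extra features.
B_demo = np.array([[1.5, 0.0],
                   [0.0, 2.0],
                   [3.0, 4.0]])

# Dense -> sparse -> dense round trip: every value should survive.
B_sparse = csr_matrix(B_demo)
round_trip_ok = np.array_equal(B_sparse.toarray(), B_demo)

# Hypothetical integer sparse block with the same number of rows.
A_int = csr_matrix(np.array([[1, 0],
                             [0, 2],
                             [0, 0]], dtype=np.int32))

# No dtype argument given, so hstack picks a common (upcast) dtype.
stacked = hstack((A_int, B_demo))

print(round_trip_ok, stacked.format, stacked.dtype, stacked.shape)
```

The sparse version of a fully dense array simply stores every entry explicitly, so it may use more memory than the dense original, but the values themselves are unchanged.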

Further, if you plan to use logistic regression from sklearn, the sparse matrix must be in CSR format for the `fit` method to work.

The following code should work for your use-case

from scipy.sparse import hstack

X = hstack((A, B), format='csr')
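
For completeness, a rough end-to-end sketch of how this could look for both the training and the test set. All data here is randomly generated as a stand-in for the real tf-idf matrix and extra features, and the names `A_train`, `B_train`, `A_test`, `B_test` and `y_train` are assumptions for illustration only:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for a tf-idf matrix: mostly zeros, stored as CSR.
A_train = csr_matrix(rng.random((20, 50)) * (rng.random((20, 50)) < 0.1))
A_test = csr_matrix(rng.random((5, 50)) * (rng.random((5, 50)) < 0.1))

# Stand-in for the dense extra features: one row per sample.
B_train = rng.random((20, 3))
B_test = rng.random((5, 3))

# Made-up binary labels for the training samples.
y_train = np.array([0, 1] * 10)

# Stack column-wise; the same call is applied to train and test.
X_train = hstack((A_train, B_train), format='csr')
X_test = hstack((A_test, B_test), format='csr')

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = model.predict(X_test)
print(X_train.shape, X_test.shape, pred.shape)
```

Note that `hstack` needs both blocks to have the same number of rows, so the dense array should have shape `(n_samples, n_extra_features)`. If it is 1-D, something like `B.reshape(-1, 1)` would likely be needed first.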
