Input Text Data Formatting for CNN in Flux, in Julia

Question

I am implementing Yoon Kim's CNN (https://arxiv.org/abs/1408.5882) for text classification in Julia, using Flux as the deep learning framework, with individual sentences as input datapoints. The model zoo (https://github.com/FluxML/model-zoo) has proven useful to an extent, but it does not have an NLP example with CNNs. I'd like to check if my input data format is the correct one.

Flux has no explicit 1D Conv layer, so I'm using the Conv layer defined in conv.jl (https://github.com/FluxML/Flux.jl/blob/master/src/layers/conv.jl). Here is the part of its docstring that explains the input data format:

Data should be stored in WHCN order (width, height, # channels, # batches).
In other words, a 100×100 RGB image would be a `100×100×3×1` array,
and a batch of 50 would be a `100×100×3×50` array.
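
As a quick sanity check of the WHCN convention quoted above, the following sketch pushes a batch of random "images" through a Conv layer and prints the resulting shape. The 3×3 kernel and the 16 output channels are arbitrary values chosen for illustration; they do not come from the docstring.

```julia
using Flux

# A batch of 50 random 100×100 RGB "images" in WHCN order.
x = rand(Float32, 100, 100, 3, 50)

# Kernel size (3, 3) and 16 output channels are arbitrary illustrative choices.
layer = Conv((3, 3), 3 => 16, relu)

# With no padding and stride 1, width and height each shrink by 2,
# the channel dimension becomes 16, and the batch dimension is unchanged.
y = layer(x)
println(size(y))
```

The last dimension is always the batch, and the second-to-last is what the layer treats as input channels.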

My format is as follows:

1. width: since the text in a sentence is 1D, the width is always 1
2. height: the maximum number of tokens allowed in a sentence
3. number of channels: the embedding size
4. number of batches: the number of sentences in each batch
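
Read literally, the four items above describe an array of the following shape. This is only a sketch of the intended layout; the concrete sizes are made-up values for illustration.

```julia
# Hypothetical sizes, chosen only for illustration.
MAX_LEN = 20      # height: maximum number of tokens per sentence
emb_dims = 50     # channels: embedding size
batch_size = 8    # sentences per batch

# width × height × channels × batch
X_layout = zeros(Float32, 1, MAX_LEN, emb_dims, batch_size)

# The slice for one sentence has one row per token position
# and one column per embedding dimension.
sentence_slice = X_layout[1, :, :, 3]

println(size(X_layout))
println(size(sentence_slice))
```

Under this layout, a single sentence occupies a MAX_LEN × emb_dims slice, indexed by its position in the batch through the last dimension.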

Following the MNIST example in the model zoo, I have:

function make_minibatch(X, Y, idxs)
    X_batch = zeros(1, num_sentences, emb_dims, MAX_LEN)

    function get_sentence_matrix(sentence)
        embeddings = Vector{Array{Float64, 1}}()
        for word in sentence
            embedding = get_embedding(word)
            push!(embeddings, embedding)
        end
        embeddings = hcat(embeddings...)
        return embeddings
    end

    for i in 1:length(idxs)
        X_batch[1, i, :, :] = get_sentence_matrix(X[idxs[i]])
    end
    Y_batch = [Flux.onehot(label+1, 1:2) for label in Y[idxs]]
    return (X_batch, Y_batch)
end

Here, X is an array of arrays of words, and get_embedding returns the embedding of a word as an array.

X_batch is then an Array{Float64,4}. Is this the correct approach?
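
For comparison, a minimal sketch that fills the array in exactly the width, height, channels, batch order listed earlier might look like the following. The names fill_batch and the two-argument get_embedding are hypothetical stand-ins (the real get_embedding would look up trained vectors), and zero-padding short sentences is an assumption, not something stated in the question.

```julia
# Hypothetical stand-in for get_embedding: a dummy vector of length emb_dims
# whose entries all equal the word length, so results are easy to inspect.
get_embedding(word, emb_dims) = fill(Float32(length(word)), emb_dims)

function fill_batch(X, idxs, max_len, emb_dims)
    # width × height × channels × batch
    X_batch = zeros(Float32, 1, max_len, emb_dims, length(idxs))
    for (n, idx) in enumerate(idxs)
        sentence = X[idx]
        # Truncate long sentences; positions past the end of short ones stay zero.
        for (t, word) in enumerate(Iterators.take(sentence, max_len))
            # Token position t goes along the height axis,
            # the embedding vector goes along the channel axis.
            X_batch[1, t, :, n] = get_embedding(word, emb_dims)
        end
    end
    return X_batch
end

X = [["the", "cat", "sat"], ["hello", "world"], ["a"]]
batch = fill_batch(X, [1, 3], 4, 5)
println(size(batch))
```

Note that hcat over per-word embedding vectors produces an emb_dims × tokens matrix, so a sentence matrix built that way would need to be transposed before it fits a height × channels slice.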
