I am working on a multi-class classification problem and, after experimenting with multiple neural network architectures, I settled on a stacked LSTM structure as it yields the best accuracy for my use case. Unfortunately, the network takes a long time (almost 48 hours) to reach a good accuracy (~1000 epochs), even when I use GPU acceleration. The resulting accuracy and loss curves are:

At this point, given the good performance but the very slow training, I suspect a bug in my code. I tested it using the golden tests mentioned here, which consist of running the network with only 2 points in either the testing set or the training set, and with the dropouts removed. Unfortunately, these runs result in a testing accuracy that is better than the training accuracy, which should not be the case as far as I know. I suspect that I am shaping my data in the wrong way. Any hints, suggestions and advice are appreciated.
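
Roughly, the sanity check I ran looks like this (a simplified sketch: it reuses the model and the x_train/y_train arrays from the code below, with the dropout arguments removed; epochs=200 is just an example value):

# "golden test" sketch: fit on only 2 training samples (dropout removed)
# and expect the network to memorize them, i.e. ~100% training accuracy
x_tiny, y_tiny = x_train[:2], y_train[:2]
history = model.fit(x_tiny, y_tiny, epochs=200, batch_size=2,
                    validation_data=(x_test, y_test))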

My code is the following:

# -*- coding: utf-8 -*-
import keras
import numpy as np
from time import time
from utils import dmanip, vis
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.utils import to_categorical
from keras.callbacks import TensorBoard
from sklearn.preprocessing import LabelEncoder
from tensorflow.python.client import device_lib
from sklearn.model_selection import train_test_split

###############################################################################
####################### Extract the data from .csv file #######################
###############################################################################
# get data
data, column_names = dmanip.get_data(file_path='../data_one_outcome.csv')

# split data
X = data.iloc[:, :-1]
y = data.iloc[:, -1:].astype('category')

###############################################################################
########################## init global config vars ############################
###############################################################################
# check if GPU is used
print(device_lib.list_local_devices())

# init
n_epochs = 1500
n_comps = X.shape[1]

###############################################################################
################################## Keras RNN ##################################
###############################################################################
# encode the classification labels
le = LabelEncoder()
yy = to_categorical(le.fit_transform(y))

# split the dataset
x_train, x_test, y_train, y_test = train_test_split(X, yy, test_size=0.35,
                                                    random_state=True,
                                                    shuffle=True)

# expand dimensions
x_train = np.expand_dims(x_train, axis=2)
x_test = np.expand_dims(x_test, axis=2)

# define model
model = Sequential()
model.add(LSTM(units=n_comps, return_sequences=True,
               input_shape=(x_train.shape[1], 1),
               dropout=0.2, recurrent_dropout=0.2))
model.add(LSTM(64, return_sequences=True, dropout=0.2, recurrent_dropout=0.2))
model.add(LSTM(32, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(4, activation='softmax'))

# print model architecture summary
print(model.summary())

# compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Create a TensorBoard instance with the path to the logs directory
tensorboard = TensorBoard(log_dir='./logs/rnn/{}'.format(time()))

# fit the model
history = model.fit(x_train, y_train, epochs=n_epochs, batch_size=100,
                    validation_data=(x_test, y_test), callbacks=[tensorboard])

# plot results
vis.plot_nn_stats(history=history, stat_type="accuracy", fname="RNN-accuracy")
vis.plot_nn_stats(history=history, stat_type="loss", fname="RNN-loss")

My data is a large 2D matrix of shape (38607, 150): 38607 samples and 149 features plus one target column, where the target contains 4 classes.

       feat1   feat2   ...  feat148  feat149  target
1      2.250   0.926   ...  16.0     0.0      class1
2      2.791   1.235   ...  1.0      0.0      class2
...      ...     ...   ...    ...     ...        ...
38406  2.873   1.262   ...  281.0    0.0      class3
38407  3.222   1.470   ...  467.0    1.0      class4

1 Answer

Regarding the slowness of training: consider using tf.data instead of DataFrames and NumPy arrays, because achieving peak performance requires an efficient input pipeline that delivers data for the next step before the current step has finished. The tf.data API helps you build flexible and efficient input pipelines.

For more information regarding tf.data, please refer to the TensorFlow documentation (Documentation 1, Documentation 2).

This TensorFlow tutorial guides you through converting your DataFrame to the tf.data format.
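
For example, a minimal sketch of wrapping your existing x_train / y_train NumPy arrays in a tf.data pipeline with batching and prefetching (batch size 100 to match your model.fit call; the shuffle buffer of 1024 is just an example value) could look like this:

import tensorflow as tf

# build batched, prefetched input pipelines from the existing NumPy arrays
train_ds = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
            .shuffle(buffer_size=1024)
            .batch(100)
            .prefetch(tf.data.experimental.AUTOTUNE))
val_ds = (tf.data.Dataset.from_tensor_slices((x_test, y_test))
          .batch(100)
          .prefetch(tf.data.experimental.AUTOTUNE))

# the datasets are passed to fit() directly, without batch_size
history = model.fit(train_ds, epochs=n_epochs, validation_data=val_ds,
                    callbacks=[tensorboard])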

One more feature that may be of use to you is the TensorFlow Profiler. With it you can not only visualize the time and memory consumed in each phase of your project, but it also gives you suggestions/recommendations on how to reduce that time/memory consumption and hence optimize your training.

For more information on the TensorFlow Profiler, refer to this documentation, this tutorial and this TensorFlow DevSummit YouTube video.
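
If you are on TensorFlow 2.x, a simple way to get a profile is through the TensorBoard callback you already create, via its profile_batch argument (a minimal sketch):

from tensorflow.keras.callbacks import TensorBoard

# trace the 2nd batch of the first epoch; the trace shows up under the
# "Profile" tab in TensorBoard
tensorboard = TensorBoard(log_dir='./logs/rnn/{}'.format(time()),
                          profile_batch=2)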

Regarding testing accuracy being higher than training accuracy: this is not a big problem and happens sometimes.

Probable Reason 1: Dropout ==> Why do you use dropout and recurrent_dropout in your model? Was the model overfitting? If the model does not overfit without them, consider removing them: with dropout=0.2 and recurrent_dropout=0.2, 20% of the input units and 20% of the recurrent connections are dropped during training, whereas during testing nothing is dropped and the full network is used, so the testing accuracy can end up higher than the training accuracy.
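
For instance, a dropout-free variant of your stack (same layers as in your question, only with the dropout arguments removed) would be:

# same stacked LSTM, without dropout / recurrent_dropout
model = Sequential()
model.add(LSTM(units=n_comps, return_sequences=True,
               input_shape=(x_train.shape[1], 1)))
model.add(LSTM(64, return_sequences=True))
model.add(LSTM(32))
model.add(Dense(4, activation='softmax'))

As a side note, if you are using tf.keras on TensorFlow 2.x, dropping recurrent_dropout also lets the LSTM layers use the faster fused cuDNN kernel, which may help with the training time as well.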

Probable Reason 2: 35% testing data is a bit more than usual. You could reduce it to either 20% or 25%.

Probable Reason 3: Your training data might contain several hard cases to learn, while your testing data may contain easier cases to predict. To mitigate this, you can split the data once again with a different random seed, for example as sketched below.
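
For example (addressing reasons 2 and 3 together), a 20% test split with a different, explicit random seed; 42 is just an arbitrary example value:

# smaller test fraction and a different, explicit random seed
x_train, x_test, y_train, y_test = train_test_split(X, yy, test_size=0.2,
                                                    random_state=42,
                                                    shuffle=True)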

For more information, please refer to this ResearchGate link and this Stack Overflow link.

Hope this helps. Happy Learning!