I am training a NN in PyTorch on the MNIST data. The model starts well, improves, and reaches good accuracy on both the training and the testing data; it then stabilizes for a while, after which both the testing and the training accuracy collapse, as shown in the following graph of the base results.
For MNIST I use 60,000 training images and 10,000 testing images, a training batch size of 100 and a learning rate of 0.01. The neural network consists of two fully connected hidden layers with 100 nodes each, the nodes having ReLU activation functions. F.cross_entropy is used for the loss and SGD for the gradient calculations.
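The data loading is essentially the standard torchvision MNIST pipeline. A minimal sketch of that part of my setup (paths and exact transform details are incidental; the variable names match the training loop further down):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import torch.optim as optim
    import torchvision
    import torchvision.transforms as transforms

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    train_size, test_size, b_size, learning_rate = 60000, 10000, 100, 0.01

    train_set = torchvision.datasets.MNIST(root='./data', train=True, download=True,
                                           transform=transforms.ToTensor())
    test_set = torchvision.datasets.MNIST(root='./data', train=False, download=True,
                                          transform=transforms.ToTensor())
    batch_train_loader = torch.utils.data.DataLoader(train_set, batch_size=b_size, shuffle=True)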
This is not the over-fitting problem, as it is both the training and the testing accuracy that collapse. I suspected it had something to do with a too-large learning rate. In the base case I used 0.01, but when I lower it to 0.001 the whole pattern repeats, just later, as shown in the following graph (note the change of the x-axis scale: the pattern happens roughly 10 times later, which is intuitive). Similar results were obtained using even lower learning rates.
I have tried unit testing, checking the individual parts and making the model smaller. Here are the results when I use only 6 data points in the training set, with batch size 2 (the tiny training set is built along the lines of the sketch below). A perfect fit on the training data (here, as expected, clearly different from the test accuracy) is unsurprisingly reached, but it still collapses, from 100% down to 1/6, i.e. no better than a random pick. Can anybody tell me what needs to happen for the network to spin out of a perfect fit on the training set?
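For reference, something like the following is how such a tiny training set can be built (torch.utils.data.Subset and the particular indices are just an illustration, not necessarily my exact code):

    from torch.utils.data import Subset, DataLoader

    # hypothetical way to cut the training set down to 6 points;
    # the exact indices are irrelevant for the experiment
    tiny_train_set = Subset(train_set, indices=list(range(6)))
    batch_train_loader = DataLoader(tiny_train_set, batch_size=2, shuffle=True)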
Here is the structure of the network (the relevant libraries are imported beforehand), although I hope that the symptoms described above will be enough for you to recognize the problem without it:
class Network(nn.Module):
    def __init__(self):
        # call to the superclass Module from nn
        super(Network, self).__init__()
        # fc stands for 'fully connected'
        self.fc1 = nn.Linear(in_features=28*28, out_features=100)
        self.fc2 = nn.Linear(in_features=100, out_features=100)
        self.out = nn.Linear(in_features=100, out_features=10)

    def forward(self, t):
        # (1) input layer (redundant)
        t = t
        # (2) hidden linear layer
        # As t consists of 28*28-pixel pictures, I need to flatten them:
        t = t.reshape(-1, 28*28)
        # Now feed the reshaped input to the linear layer
        t = self.fc1(t)
        # apply ReLU as the activation function
        t = F.relu(t)
        # (3) hidden linear layer
        # As above, but no reshaping is needed now
        t = self.fc2(t)
        t = F.relu(t)
        # (4) output layer
        t = self.out(t)
        t = F.softmax(t, dim=1)
        return t
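As a quick sanity check of the shapes, something like this throwaway snippet can be run (the dummy input is fabricated, and it assumes the device variable from above):

    network = Network().to(device)
    dummy = torch.randn(2, 28, 28, device=device)  # two fake 'images'
    out = network(dummy)
    print(out.shape)       # torch.Size([2, 10])
    print(out.sum(dim=1))  # each row sums to 1, since forward ends in softmax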
The main execution of the code:
for b in range(epochs):
    print('***** EPOCH NO. ', b+1)
    # get a batch iterator
    batch_iterator = iter(batch_train_loader)
    # loop over a single epoch, based on the training set size and the batch size
    for a in range(round(train_size/b_size)):
        print(a+1)
        # get one batch for the iteration
        batch = next(batch_iterator)
        # decompose the batch
        images, labels = batch[0].to(device), batch[1].to(device)
        # to get a prediction, pass the samples to the network, as with individual layers:
        preds = network(images)
        # with the predictions, use F to get the cross-entropy loss
        loss = F.cross_entropy(preds, labels)
        # count the number of correct predictions in this batch
        num_correct = get_num_correct(preds, labels)
        # calculate the gradients needed for the weight update
        loss.backward()
        # with the gradients known, update the weights according to stochastic gradient descent
        optimizer = optim.SGD(network.parameters(), lr=learning_rate)
        # with the new weights, step in the direction of a better estimate
        optimizer.step()
        # check whether a whole-dataset evaluation should be performed
        # (full training/test checks are taken only at intervals that are
        # evenly spaced on a log scale, pre-calculated in X_log)
        if counter in X_log:
            # get the results on the whole training data and record them
            full_train_preds = network(full_train_images)
            full_train_loss = F.cross_entropy(full_train_preds, full_train_labels)
            # record the training loss
            a_train_loss.append(full_train_loss.item())
            # take the proportion of correct estimates, to make train and test comparable
            full_train_num_correct = get_num_correct(full_train_preds, full_train_labels)/train_size
            # record the training accuracy
            a_train_num_correct.append(full_train_num_correct)
            print('Correct predictions of the dataset:', full_train_num_correct)
            # repeat for the test predictions
            full_test_preds = network(full_test_images)
            full_test_loss = F.cross_entropy(full_test_preds, full_test_labels)
            a_test_loss.append(full_test_loss.item())
            full_test_num_correct = get_num_correct(full_test_preds, full_test_labels)/test_size
            a_test_num_correct.append(full_test_num_correct)
        # update the counter
        counter = counter + 1
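The helpers referenced above are defined earlier in the script. Minimal versions consistent with how they are used would look like this (my actual definitions may differ slightly):

    import numpy as np

    def get_num_correct(preds, labels):
        # number of predictions whose argmax matches the label
        return preds.argmax(dim=1).eq(labels).sum().item()

    # iteration checkpoints evenly spaced on a log scale, pre-calculated before training
    total_iters = epochs * round(train_size / b_size)  # 'epochs' is set elsewhere in the script
    X_log = set(np.unique(np.logspace(0, np.log10(total_iters), num=100, dtype=int)))
    counter = 0

    # full train/test tensors used for the periodic whole-dataset checks,
    # normalized to [0, 1] to match what ToTensor produces in the loader
    full_train_images = train_set.data.float().div(255).to(device)
    full_train_labels = train_set.targets.to(device)
    full_test_images = test_set.data.float().div(255).to(device)
    full_test_labels = test_set.targets.to(device)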
I have googled and searched here for answers to this question, but people either ask about over-fitting, or their NNs do not increase accuracy on the training set at all (i.e. they simply don't work); nobody seems to ask about finding a good training fit and then completely losing it, on the training set as well. I hope I did not post something obvious. I am relatively new to NNs, but I did my best to research the topic before posting here. Thank you for your help and understanding!