
I am training an NN in PyTorch on MNIST data. The model starts well, improves, and reaches good accuracy on both the training and test data; it then stabilizes for a while, after which both the test and training accuracy collapse, as shown in the following base-results graph.

As for MNIST, I use the 60,000 training images and 10,000 test images, a training batch size of 100, and a learning rate of 0.01. The neural network consists of two fully connected hidden layers of 100 nodes each, with ReLU activation functions. F.cross_entropy is used for the loss and SGD as the optimizer.
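For context, my setup and data loading look roughly like this (a minimal sketch; the transform and variable names here are illustrative, assuming torchvision):

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms

# convert images to tensors in [0, 1]
transform = transforms.ToTensor()

train_set = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transform)
test_set = torchvision.datasets.MNIST(root='./data', train=False, download=True, transform=transform)

b_size = 100          # training batch size
learning_rate = 0.01  # base learning rate
batch_train_loader = torch.utils.data.DataLoader(train_set, batch_size=b_size, shuffle=True)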

This is not an overfitting problem, since it is both the training and test accuracy that collapse. I suspected it had something to do with the learning rate being too large. In the base case I used 0.01, but when I lower it to 0.001 the whole pattern repeats, just later, as shown in the following graph (please note the x-axis scale change; the pattern happens roughly 10 times later, which is intuitive). Similar results were obtained using even lower learning rates.
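Each learning rate was simply a fresh run of the same experiment, along these lines (train_one_run is a hypothetical wrapper around the training loop shown further down, not actual code from my script):

for lr in [0.01, 0.001, 0.0001]:
    # hypothetical wrapper around the training loop below; only the learning rate changes
    train_one_run(learning_rate=lr)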

I have tried unit testing, checking the individual parts, and making the model smaller. Here are the results when I use only 6 data points in the training set, with batch size 2. A perfect fit on the training data (here clearly different from the test accuracy, as expected) is unsurprisingly reached, but it still collapses from 100% to 1/6, i.e. no better than a random pick. Can anybody tell me what needs to happen for the network to spin out from a perfect fit on the training set?
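For the small experiment I simply restricted the training set; a minimal sketch of how that can be done (assuming the train_set from the sketch above, using torch.utils.data.Subset):

from torch.utils.data import Subset, DataLoader

# keep only the first 6 training examples
tiny_train_set = Subset(train_set, range(6))
tiny_loader = DataLoader(tiny_train_set, batch_size=2, shuffle=True)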

Here is the structure of the network (the relevant imports are added beforehand), although I hope the symptoms described above will be sufficient for you to recognize the problem without it:

class Network(nn.Module):
    def __init__(self):
        # call to the super class Module from nn
        super(Network, self).__init__()

        # fc stands for 'fully connected'
        self.fc1 = nn.Linear(in_features=28*28, out_features=100)
        self.fc2 = nn.Linear(in_features=100, out_features=100)
        self.out = nn.Linear(in_features=100, out_features=10)

    def forward(self, t):
        # (1) input layer (redundant)
        t = t

        # (2) hidden linear layer
        # As my t consists of 28*28-pixel pictures, I need to flatten them:
        t = t.reshape(-1, 28*28)
        # Now having this reshaped input, pass it to the linear layer
        t = self.fc1(t)
        # Apply ReLU as the activation function
        t = F.relu(t)

        # (3) hidden linear layer
        # As above, but reshaping is not needed now
        t = self.fc2(t)
        t = F.relu(t)

        # (4) output layer
        t = self.out(t)
        t = F.softmax(t, dim=1)

        return t
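For completeness, the network and device used in the loop below are created along these lines (a sketch):

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
network = Network().to(device)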

The main training loop:

for b in range(epochs):
    print('***** EPOCH NO. ', b+1)
    # getting a batch iterator
    batch_iterator = iter(batch_train_loader)
    # for loop for a single epoch, based on the length of the training set and the batch size
    for a in range(round(train_size/b_size)):
        print(a+1)
        # get one batch for the iteration
        batch = next(batch_iterator)
        # decomposing a batch
        images, labels = batch[0].to(device), batch[1].to(device)
        # to get a prediction, as with individual layers, we call the network with the samples as input:
        preds = network(images)
        # with the predictions, we use F to get the loss as cross_entropy
        loss = F.cross_entropy(preds, labels)
        # count the number of correct predictions
        num_correct = get_num_correct(preds, labels)
        # calculate the gradients needed for the update of the weights
        loss.backward()
        # with the known gradients, update the weights according to stochastic gradient descent
        optimizer = optim.SGD(network.parameters(), lr=learning_rate)
        # with the known weights, step in the direction of correct estimation
        optimizer.step()
        # check whether the full-data evaluation should be performed (full training/test checks are taken only at evenly spaced intervals on the log scale, pre-calculated in X_log)
        if counter in X_log:
            # get the results on the whole training data and record them
            full_train_preds = network(full_train_images)
            full_train_loss = F.cross_entropy(full_train_preds, full_train_labels)
            # record train loss
            a_train_loss.append(full_train_loss.item())
            # get the proportion of correct estimates, to make train and test comparable
            full_train_num_correct = get_num_correct(full_train_preds, full_train_labels)/train_size
            # record train accuracy
            a_train_num_correct.append(full_train_num_correct)
            print('Correct predictions of the dataset:', full_train_num_correct)
            # repeat for the whole test data
            full_test_preds = network(full_test_images)
            full_test_loss = F.cross_entropy(full_test_preds, full_test_labels)
            a_test_loss.append(full_test_loss.item())
            full_test_num_correct = get_num_correct(full_test_preds, full_test_labels)/test_size
            a_test_num_correct.append(full_test_num_correct)
        # update counter
        counter = counter + 1
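For completeness, the get_num_correct helper used above just counts the predictions whose argmax matches the label, essentially:

def get_num_correct(preds, labels):
    # number of samples where the predicted class equals the label
    return preds.argmax(dim=1).eq(labels).sum().item()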

I have googled and searched here for answers to this question, but people either ask about overfitting, or their NNs do not increase accuracy on the training set at all (i.e. they simply don't work); nobody asks about finding a good training fit and then completely losing it, also on the training set. I hope I did not post something obvious; I am relatively new to NNs, but I did my best to research the topic before posting it here. Thank you for your help and understanding!

MagicM

2 Answers


The cause is a bug in the code. You need to call optimizer.zero_grad() at the start of each iteration of the inner training loop, and create the optimizer once, before the outer training loop, i.e.

optimizer = optim.SGD(...)
for b in range(epochs):

Without zero_grad(), the gradients of every previous batch accumulate in the parameters' .grad fields, so the update steps keep growing until the weights blow up, which matches the collapse you see. The question "Why do we need to call zero_grad() in PyTorch?" explains this in more detail.
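Applied to your loop, the fix looks roughly like this (a sketch showing only the relevant lines):

optimizer = optim.SGD(network.parameters(), lr=learning_rate)  # created once, before the epochs
for b in range(epochs):
    batch_iterator = iter(batch_train_loader)
    for a in range(round(train_size/b_size)):
        batch = next(batch_iterator)
        images, labels = batch[0].to(device), batch[1].to(device)
        preds = network(images)
        loss = F.cross_entropy(preds, labels)
        optimizer.zero_grad()  # clear gradients accumulated by the previous iteration
        loss.backward()
        optimizer.step()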

gregorias

My take on this is that you are using far too many epochs and are overtraining the model (not overfitting). After a certain point of constantly refreshing the weights and biases, they are no longer able to distinguish signal from noise.

I would recommend checking out https://machinelearningmastery.com/early-stopping-to-avoid-overtraining-neural-network-models/ to see if it aligns with what you are seeing, since it was the first thing I thought of.
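As a rough illustration (the patience value and the two helper functions here are hypothetical, not from your code), early stopping in a PyTorch loop can be as simple as:

best_loss, patience, bad_epochs = float('inf'), 5, 0
for epoch in range(epochs):
    train_one_epoch()                  # hypothetical: your inner training loop
    val_loss = evaluate_validation()   # hypothetical: loss on held-out data
    if val_loss < best_loss:
        best_loss, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # stop before the model degrades further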

Also, maybe give this post a look: https://stats.stackexchange.com/questions/198629/difference-between-overtraining-and-overfitting (not saying this is a duplicate).

And this publication: "Overtraining in Back-Propagation Neural Networks: A CRT Color Calibration Example", https://onlinelibrary.wiley.com/doi/pdf/10.1002/col.10027

jmdatasci
  • I am using 10 epochs, which I would not classify as "way too many", but I might be mistaken: is going 10 times over the whole training set too much? I am trying to reproduce the results of Baity-Jesi 2019 (https://arxiv.org/abs/1803.06969), where they go up to 10^6 steps, implying thousands of epochs, and they get very stable solutions (at least on the training set, if not the test set), so I doubt that the number of epochs is an issue – MagicM Jun 14 '19 at 16:38
  • I have read the links you provided, thank you for sharing them. The first one, if I understood it correctly, talks about overfitting (or overtraining; the author seems to give them very similar meanings) of the training data, but still in terms of how it is bad for new predictions: "This overfitting of the training dataset will result in an increase in generalization error, making the model less useful at making predictions on new data." – MagicM Jun 14 '19 at 16:39
  • So I do not see information relevant to my problem there: he does not write about the error increasing on the training data set, only on the test (or validation) data set, which is the classic overfitting problem. Early stopping, as discussed there, is a method of solving that overfitting problem, but not mine. – MagicM Jun 14 '19 at 16:39
  • The second post shows clearly that there is a semantic problem with words like overtraining and overfitting, with people using them very liberally, but as the best answer puts it, "As far as I can tell, there is no difference between an overtrained and an overfitted model". Indeed, both of these words describe the same problem: you overtrain your model on the training data, which results in overfitting the model, that is, having high error on test (new/validation) data; nothing to do with high error on the training data. – MagicM Jun 14 '19 at 16:40
  • The third link, the Alman and Ningfang 2001 paper, again talks about overfitting, although they call it overtraining. This is clear from their Figure 1, where the testing error increases at some point while the training error keeps falling, whereas in my problem the training error increases along with the testing error. Thank you for attempting to answer my question, but as explained above, all three links deal with overfitting (in other words, overtraining), which is not the problem here, and 10 epochs do not seem excessive, as Baity-Jesi 2019 got stable results with many more than this. – MagicM Jun 14 '19 at 16:40