
I've been trying to implement a basic back-propagation neural network in Python, and I have finished the code for initializing and training the weight set. However, on every set I train, the mean-squared error always converges to a strange value: it keeps decreasing with further iterations, but never truly gets close to zero.
Any help would be much appreciated.

import csv
import numpy as np

class NeuralNetwork:
    layers = 0
    shape = None
    weights = []

    layerIn = []
    layerOut = []

    def __init__(self, shape):
        self.shape = shape
        self.layers = len(shape) - 1

        for i in range(0, self.layers):
            n = shape[i]
            m = shape[i+1]
            self.weights.append(np.random.normal(scale=0.2, size=(m, n+1)))

    def sgm(self, x):
        return 1 / (1 + np.exp(-x))

    def dersgm(self, x):
        y = self.sgm(x)
        return y * (y - 1)

    def run(self, input):
        self.layerIn = []
        self.layerOut = []

        for i in range(self.layers):
            if i == 0:
                layer = self.weights[0].dot(np.vstack((input.transpose(), np.ones([1, input.shape[0]]))))
            else:
                layer = self.weights[i].dot(np.vstack((self.layerOut[-1], np.ones([1, input.shape[0]]))))
            self.layerIn.append(layer)
            self.layerOut.append(self.sgm(layer))

        return self.layerOut[-1].T

    def backpropogate(self, input, y, learning_rate):
        deltas = []
        y_hat = self.run(input)

        # Calculate deltas
        for i in reversed(range(self.layers)):

            # for last layer
            if i == self.layers - 1:
                error = y_hat - y
                msq_error = sum(.5 * ((error) ** 2))
                # returns delta, k rows for k inputs, m columns for m nodes
                deltas.append(error * self.dersgm(y_hat))
            else:
                error = deltas[-1].dot(self.weights[i+1][:, :-1])
                deltas.append(self.dersgm(self.layerOut[i]).T * error)

        # Calculate weight-deltas
        wdelta = []
        ordered_deltas = list(reversed(deltas))  # reverse order because created backwards

        # returns weight deltas, k rows for k nodes, m columns for m next layer nodes
        for i in range(self.layers):
            if i == 0:
                # add bias
                input_with_bias = np.vstack((input.T, np.ones(input.shape[0])))
                # sum over n rows of deltas for n training examples to get one delta for all examples
                # for all nodes
                wdelta.append(ordered_deltas[i].T.dot(input_with_bias.T))
            else:
                with_bias = np.vstack((self.layerOut[i-1], np.ones(input.shape[0])))
                wdelta.append(ordered_deltas[i].T.dot(with_bias.T))

        # update weights
        def update_weights(self, weight_deltas, learning_rate):
            for i in range(self.layers):
                self.weights[i] = self.weights[i] + \
                                  (learning_rate * weight_deltas[i])

        update_weights(self, wdelta, learning_rate)

        return msq_error

        # end backpropogate

    def train(self, input, target, lr, run_iter):
        for i in range(run_iter):
            if i % 100000 == 0:
                print self.backpropogate(input, target, lr)
Zach Bai
  • What shape should the input and target have? – matousc May 01 '15 at 12:27
  • the input is a 4x2 matrix and the target is a 4x1 matrix (column vector) – Zach Bai May 01 '15 at 12:33
  • Gradient descent is twitchy about scaling on the inputs and step size being inappropriate... have you checked these things? Do you have tests for the above code that assert computed gradients are correct etc? – Ben Allison May 01 '15 at 15:35
  • I will point out, just in case you're not aware, that there are several easy to use open source Python toolkits to implement ANNs. So unless you're doing this as a learning exercise, there's no need to reinvent the wheel. – cfh May 01 '15 at 20:46
  • Why do you think the error should ever be zero? That's actually a bad sign; it suggests a problem called overfitting, which means your ANN won't work well on unseen samples (bad generalization). You usually don't want it to be zero, and it will also be very hard to get it to absolute zero because the function you are minimizing is not convex. – IVlad May 01 '15 at 21:43

1 Answer


The error function in the following scenario cannot be 0, because getting the error to 0 would require the curve to pass through every data point exactly.

(figure: data fitting)
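
As a minimal illustration (not from the original answer, just a plain least-squares example using numpy): fit a straight line to three points that are not collinear. No choice of slope and intercept can drive the mean-squared error to zero, because the line cannot pass through all three points at once.

import numpy as np

# Three points that do not lie on any single straight line
x = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 1.0, 0.0])

# Least-squares fit of y = a*x + b
A = np.vstack([x, np.ones_like(x)]).T
(a, b), _, _, _ = np.linalg.lstsq(A, y, rcond=None)

y_hat = a * x + b
mse = np.mean((y_hat - y) ** 2)
print(mse)  # about 0.22: even the best possible line still has nonzero error

The same reasoning applies to a small network: if its capacity is too low to pass through every training point, the minimum reachable error stays strictly positive.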

Adding more neurons would certainly reduce the error, since the function can then take a more complex and precise shape. But fitting your data too well leads to a problem called overfitting, as shown in the following image: from left to right, the curve underfits the dataset, fits it roughly correctly, and then badly overfits it on the right.

(figure: underfitting vs. overfitting)

The scenario on the right would bring the error to 0, but this is not desirable, and you want to avoid it. How?

The simplest way to determine whether the number of neurons in a network is ideal (giving a good fit) is trial and error. Split your data into training data (80%, used to train the network) and test data (20%, reserved only to evaluate the network once trained). While training on the training data alone, you can plot the performance on the test dataset.
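
A rough sketch of that split, reusing the NeuralNetwork class from the question (the toy data, network shape, learning rate, and iteration count below are only illustrative assumptions):

import numpy as np

# Toy dataset: 100 examples with 2 features, binary targets
data = np.random.rand(100, 2)
targets = (data.sum(axis=1) > 1.0).astype(float).reshape(-1, 1)

# 80% for training, 20% held out for testing
split = int(0.8 * len(data))
train_x, test_x = data[:split], data[split:]
train_y, test_y = targets[:split], targets[split:]

net = NeuralNetwork((2, 5, 1))
for epoch in range(1000):
    # the question's backpropogate returns the summed squared error on the training data
    train_cost = net.backpropogate(train_x, train_y, 0.1)
    test_error = np.mean((net.run(test_x) - test_y) ** 2)
    # Plot or log train_cost vs. test_error here: once the network starts
    # overfitting, test_error stops improving even though train_cost keeps falling

If the test error levels off well above zero while the training error keeps dropping, that is the overfitting shown on the right of the image above.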

You can also have a third dataset used for validation; see: what is the difference between train, validation and test sets in neural networks?

Guillaume Chevalier