Wednesday, February 14, 2018

Neural Network weight optimization algorithms

In this post, I write about different ways of updating neural network weights. 

I’m writing this post out of the notes I took for the class “Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization”.

Specifically, I discuss optimizing neural network weights with gradient descent variants that go beyond the traditional update rule.

Applying the proper optimization technique has a direct effect on the performance of the neural network.

Prerequisites for understanding this content:
  1. Neural Networks
  2. Gradient Descent
  3. Back Propagation
  4. Cost functions

To explain the following, I assume the neural network weights θ are a 1D/2D vector.
The different ways to update the neural network weights are described as follows:

Traditional Gradient Descent:


  1. It is prone to large oscillations during its descent to the global/local optimum of the cost function, which hurts convergence speed (the plain update is sketched below).
  2. To reduce these oscillations, we use (exponentially weighted) moving-average-like approaches to dampen their effect.
  3. This gives a smoother path to the global/local optimum while optimizing the cost function.
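
For reference, here is a minimal NumPy sketch of the plain gradient descent update that the later methods modify. The gradients dW and db are assumed to come from backpropagation on the current mini-batch; the shapes and learning rate are just placeholders.

    import numpy as np

    def gradient_descent_step(W, b, dW, db, lr=0.01):
        # Plain gradient descent: move each parameter straight down its gradient.
        W = W - lr * dW
        b = b - lr * db
        return W, b

    # Example usage with placeholder shapes for one layer.
    W, b = np.zeros((3, 4)), np.zeros(4)
    dW, db = np.ones((3, 4)), np.ones(4)
    W, b = gradient_descent_step(W, b, dW, db)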



Exponentially Weighted Moving Averages:

Assume we have a 1D quantity V that changes over time (like a daily temperature reading). It is prone to large local variations.

To smooth out those quick and abrupt changes, we maintain a running average:

V_t = β * V_{t-1} + (1 - β) * x_t

where x_t is the raw observation at time t. V_t behaves approximately like an average over the last 1 / (1 - β) observations.

For example,

If β = 0.9, this is approximately an average over the last 10 observations.

If β = 0.98, this is approximately an average over the last 50 observations.

  1. Larger values of β give smoother curves (as opposed to the zig-zag/abrupt movement observed in pure gradient descent).
  2. Because the average starts at 0, the early values are biased low; bias correction (dividing V_t by 1 - β^t) compensates for this. For gradient-descent-style algorithms the effect is usually small, so it is optional for Momentum and RMSProp, but it is part of the Adam (Adaptive Moment Estimation) algorithm. A sketch of the moving average, with and without bias correction, follows this list.
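
Here is a minimal NumPy sketch of an exponentially weighted moving average, including the optional bias correction mentioned above. The synthetic temperature series is only for illustration.

    import numpy as np

    def ewma(series, beta=0.9, bias_correction=False):
        # v_t = beta * v_{t-1} + (1 - beta) * x_t, starting from v_0 = 0.
        v, smoothed = 0.0, []
        for t, x in enumerate(series, start=1):
            v = beta * v + (1 - beta) * x
            # Bias correction compensates for the average starting at zero.
            smoothed.append(v / (1 - beta ** t) if bias_correction else v)
        return np.array(smoothed)

    noisy_temps = 20 + 5 * np.random.randn(365)   # synthetic daily readings
    smooth_10 = ewma(noisy_temps, beta=0.9)       # ~10-observation average
    smooth_50 = ewma(noisy_temps, beta=0.98)      # ~50-observation average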


Momentum

Now that we understand the basics of exponentially weighted moving averages, we can apply them in the weight-update step of neural network optimization.

We update the weights by performing the following steps:

V_dW = β * V_dW + (1 - β) * dW
V_db = β * V_db + (1 - β) * db

Here dW and db are the gradients of the cost with respect to W and b, computed by backpropagation on the current mini-batch.

Weight update step:

W = W - α * V_dW   instead of   W = W - α * dW
b = b - α * V_db   instead of   b = b - α * db

Usually β = 0.9.

  1. This weight-update procedure gives smoother convergence to the global/local minimum (a sketch of one momentum step follows this list).
  2. The V_dW, V_db terms are exponentially weighted moving averages of the gradients, as in the equations above.
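
A minimal sketch of one momentum step in NumPy, under the assumption that dW and db come from backpropagation for the current mini-batch and that v_dW, v_db are initialized to zeros with the same shapes as W and b:

    import numpy as np

    def momentum_step(W, b, dW, db, v_dW, v_db, lr=0.01, beta=0.9):
        # Exponentially weighted moving averages of the gradients.
        v_dW = beta * v_dW + (1 - beta) * dW
        v_db = beta * v_db + (1 - beta) * db
        # Step along the smoothed gradients instead of the raw ones.
        W = W - lr * v_dW
        b = b - lr * v_db
        return W, b, v_dW, v_db

The running averages are carried over from one mini-batch to the next, which is what dampens the oscillations.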


RMSProp

RMSProp stands for Root Mean Squared Propagation. Interestingly, it was proposed by Geoffrey Hinton in his Coursera course back in 2011-2012 rather than in a published paper.

While applying gradient descent, we update the weights by performing the following:


S_dW = β * S_dW + (1 - β) * (dW)^2

S_db = β * S_db + (1 - β) * (db)^2

Weight update step:
W = W - α * dW / (sqrt(S_dW) + ε)   instead of   W = W - α * dW
b = b - α * db / (sqrt(S_db) + ε)   instead of   b = b - α * db

Here ε is a small constant (e.g. 1e-8) that prevents division by zero.

The S_dW, S_db terms are exponentially weighted moving averages of the squared gradients: dividing by their square roots shrinks the step along dimensions with consistently large gradients and enlarges it along dimensions with small gradients, which reduces oscillation.
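
A minimal sketch of one RMSProp step in NumPy, under the same assumptions as the momentum sketch (s_dW and s_db start at zeros; dW and db come from backpropagation):

    import numpy as np

    def rmsprop_step(W, b, dW, db, s_dW, s_db, lr=0.001, beta=0.9, eps=1e-8):
        # Exponentially weighted moving averages of the element-wise squared gradients.
        s_dW = beta * s_dW + (1 - beta) * dW ** 2
        s_db = beta * s_db + (1 - beta) * db ** 2
        # Scale each gradient component by the root of its running magnitude.
        W = W - lr * dW / (np.sqrt(s_dW) + eps)
        b = b - lr * db / (np.sqrt(s_db) + eps)
        return W, b, s_dW, s_db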
                                               

Adam: ADAptive Moment estimation


This is the most commonly used/popular optimization algorithm in the computer vision community. 

The update rule combines Momentum and RMSProp, with bias correction applied to both moving averages.

While applying gradient descent, we update the weights by performing the following:

From Momentum we have:

V_dW = β1 * V_dW + (1 - β1) * dW
V_db = β1 * V_db + (1 - β1) * db

You have to perform bias correction:

V_dW_corrected = V_dW / (1 - β1^t)
V_db_corrected = V_db / (1 - β1^t)

From RMSProp we have

S_dW = β2 * S_dW + (1 - β2) * (dW)^2
S_db = β2 * S_db + (1 - β2) * (db)^2

You have to perform bias correction:

S_dW_corrected = S_dW / (1 - β2^t)
S_db_corrected = S_db / (1 - β2^t)

Finally, we perform weight updates:

W = W - α * V_dW_corrected / (sqrt(S_dW_corrected) + ε)   instead of   W = W - α * dW
b = b - α * V_db_corrected / (sqrt(S_db_corrected) + ε)   instead of   b = b - α * db
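
Putting the two moving averages and the bias correction together, here is a minimal NumPy sketch of one Adam step. The defaults β1 = 0.9, β2 = 0.999, ε = 1e-8 follow the original Adam paper; v_dW, v_db, s_dW, s_db start at zeros and t is the 1-based iteration count.

    import numpy as np

    def adam_step(W, b, dW, db, v_dW, v_db, s_dW, s_db, t,
                  lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        # Momentum-style moving averages of the gradients.
        v_dW = beta1 * v_dW + (1 - beta1) * dW
        v_db = beta1 * v_db + (1 - beta1) * db
        # RMSProp-style moving averages of the squared gradients.
        s_dW = beta2 * s_dW + (1 - beta2) * dW ** 2
        s_db = beta2 * s_db + (1 - beta2) * db ** 2
        # Bias correction for the zero initialization of the averages.
        v_dW_c = v_dW / (1 - beta1 ** t)
        v_db_c = v_db / (1 - beta1 ** t)
        s_dW_c = s_dW / (1 - beta2 ** t)
        s_db_c = s_db / (1 - beta2 ** t)
        # Combined update.
        W = W - lr * v_dW_c / (np.sqrt(s_dW_c) + eps)
        b = b - lr * v_db_c / (np.sqrt(s_db_c) + eps)
        return W, b, v_dW, v_db, s_dW, s_db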

All of this reminds me of Kalman filters, where the observed signal value is not exactly correct: there is process noise and observation noise, and we incorporate both to obtain an updated/corrected estimate of the signal from the sensor.

Sunday, March 13, 2016

Simple Autoencoder on MNIST dataset

So, I had fun with Theano and trained an Autoencoder on the MNIST dataset.

An Autoencoder is a simple neural network (with one hidden layer) that reproduces the input passed to it. By controlling the number of hidden neurons, we can learn interesting features from the input and also compress the data (which sounds like PCA). Autoencoders can be used for unsupervised feature learning, and data transformed by an autoencoder can be used for supervised classification of datasets.

More about Autoencoders is available here. Other variants of Autoencoders (Sparse, Contractive, etc.) exist, with different constraints on the hidden-layer representation.

I trained a vanilla Autoencoder for 100 epochs with a mini-batch size of 16 and a learning rate of 0.01.
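
The original code used Theano, which I'm not reproducing here; below is a rough NumPy sketch of the same idea (one sigmoid hidden layer, mean-squared-error reconstruction loss, plain SGD with the hyperparameters above). The random data at the bottom is a stand-in for the MNIST matrix.

    import numpy as np

    def train_autoencoder(X, n_hidden=10, epochs=100, batch_size=16, lr=0.01):
        # X: (n_samples, 784) inputs scaled to [0, 1].
        rng = np.random.default_rng(0)
        n_in = X.shape[1]
        W1, b1 = rng.normal(0, 0.01, (n_in, n_hidden)), np.zeros(n_hidden)
        W2, b2 = rng.normal(0, 0.01, (n_hidden, n_in)), np.zeros(n_in)
        sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
        for _ in range(epochs):
            for i in range(0, len(X), batch_size):
                x = X[i:i + batch_size]
                h = sigmoid(x @ W1 + b1)        # encode
                y = sigmoid(h @ W2 + b2)        # decode (reconstruction)
                # Backpropagate the mean-squared reconstruction error.
                dy = (y - x) * y * (1 - y)
                dh = (dy @ W2.T) * h * (1 - h)
                W2 -= lr * h.T @ dy / len(x); b2 -= lr * dy.mean(axis=0)
                W1 -= lr * x.T @ dh / len(x); b1 -= lr * dh.mean(axis=0)
        return W1, b1, W2, b2

    X_fake = np.random.rand(256, 784)           # stand-in for MNIST images
    W1, b1, W2, b2 = train_autoencoder(X_fake, n_hidden=10)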

Here are the figures for digit 7 with hidden sizes 10 and 20 (the original data was the MNIST training set, with 784-dimensional feature vectors). Each of the digit images (whose value was 8) was passed to the autoencoder, and each of the hidden units was visualized (after computing its mean activation).








Wednesday, March 9, 2016

Simple Convolutional Neural Net based object recognition

I made a simple object recognition module over the last weekend: a Theano-based convolutional neural network paired with a simple OpenCV-based image segmentation program.

The output from the image segmentation program is passed to the Theano based Convolutional Neural Network.
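
The post doesn't include the code, so the following is only a rough sketch of how such a pipeline could be wired up: OpenCV proposes candidate regions via thresholding and contours, and each cropped region is handed to a classifier. The classify callable is a hypothetical stand-in for the trained Theano network.

    import cv2
    import numpy as np

    def propose_regions(image, min_area=100):
        # Crude segmentation: Otsu threshold + external contours -> bounding boxes.
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        # [-2] picks the contour list in both OpenCV 3.x and 4.x.
        contours = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                    cv2.CHAIN_APPROX_SIMPLE)[-2]
        return [cv2.boundingRect(c) for c in contours
                if cv2.contourArea(c) >= min_area]

    def recognize(image, classify, patch_size=(32, 32)):
        # classify(patch) -> (label, score); stand-in for the convolutional net.
        detections = []
        for (x, y, w, h) in propose_regions(image):
            patch = cv2.resize(image[y:y + h, x:x + w], patch_size)
            label, score = classify(patch)
            detections.append(((x, y, w, h), label, score))
        return detections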

I used 10,000 images for training and 16,000 images for testing.

The black circles in the video indicate the regions where the classifier is looking, and the red circles indicate true positives found by the algorithm.