Thursday, January 16, 2020

Focusing on code readability/quality in Python

I've been writing code in Python for almost over a decade now (7 years in Academia, 3+ years in Industry). I've also read a ton of Python code written by others. Python programmers who come from Java may be influenced by the Java-style of programming (variable names, comments style, etc.). The same stands for C/C++ programmers, PHP, SWIFT, etc. who code in Python.

The style of programming in Python greatly influences the readability of code one writes. Since Python lays more emphasis on readability, it is important to properly vet your code (especially when you want the code to be contributed to a large project).

For passionate Python programmers, I don't have to remind them of PEP20 (Zen of Python) and PEP8 standards. These provide important guidelines on how to write code in python with a focus on simplicity and readability.

To properly incorporate these standards into your writing style, one could use linting (for static code checking). I use PyCodeStyle frequently. Pylint is another program that one could use for linting which rates your code (out of 10). One can configure these linting tools to follow standard python coding style practices into your code.

One could configure them to run on your code within the IDE/editor (like VSCode/PyCharm).

Wednesday, February 14, 2018

Neural Network weight optimization algorithms

In this post, I write about different ways of updating neural network weights. 

I’m writing this post out of the notes I took for the class “Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization”.

Specifically, I discuss about optimizing the neural network weights using non-traditional gradient descent algorithms.

Applying proper optimization techniques have direct effect on the performance of the neural network.

Required pre-requisites to understand this content:
  1. Neural Networks
  2. Gradient Descent
  3. Back Propagation
  4. Cost functions

To explain the following, I’m assuming neural network weights θ to be 1D/2D vector.
The different ways to update the neural network weights are described as follows:

Traditional gradient Descent:

  1. It is prone to larger oscillations during its descent to global/local optimum while optimizing the cost function. This affects performance.
  2. To reduce the oscillations, we use (exponentially weighted) moving average like approaches to dampen out the effects of these oscillations.
  3. This ensures smoother transition to the global/local optimum while optimizing the cost function.

Exponentially weighted Moving averages:

Assuming we have a 1D vector V, that changes over time (like temperature). It is prone to large local variations. 

To correct for that (quick and abrupt) change, we perform the following operations

Vcorrected  = (β* Vt-1) + (1 - β) * Voriginal

In other words, Vcorrected ≈ Approximately average over (1.0 / (1.0 - β)) observations

For example,

If β = 0.9 ≈ approximately last 10 observations average.

If β = 0.98 ≈ approximately last 50 observations average.

  1. Larger value of β gives smoother curves (as opposed to zig-zag/abrupt movement as observed in pure gradient descent).
  2. Bias correction in exponential weighted moving averages applied to gradient descent like algorithms don't affect them significantly. They can be implemented if needed. This is required in ADAptive momentum algorithm.


Since, we understand basics of exponentially weighted moving averages, we can apply that in weight update step in neural network optimization.

We update the weights by performing the following steps:

Vdw = (β * Vdw) + ((1- β)* Vdw)
Vdb = (β * Vdb) + ((1- β)* Vdb)

Weight update step:

W = W - (α * Vdw) instead of W = W - (α * dw)
b = b - (α * Vdb) instead of b = b - (α * dw)
Usually Beta = 0.9

  1. This weight update procedure gives us smoother convergence to global/local minimum.
  2. The Vdw, Vdb terms are derived from the exponentially weighted moving average equations.


Root Mean Squared Propagation. Interestingly this was proposed in the Coursera course by Geoffrey Hinton back in 2011-2012.

While applying gradient descent, we update the weights by performing the following:

Sdw = (β * Sdw) + ((1- β)* (dw)2 )

Sdb = (β * Sdb) + ((1- β)* (db)2 )

Weight update step:
W = W - (α * (dw/ (sqrt(Sdw) +ε))) instead of W = W - (α * dw)
b = b - (α * (db/ sqrt(Sdb)+ε))) instead of W = W - (α * db).

The Sdw, Sdb terms are derived from the exponentially weighted moving average equations.

Adam: ADAptive Momentum

This is the most commonly used/popular optimization algorithm in the computer vision community. 

We have slightly changed update algorithm compared to Momentum.

While applying gradient descent, we update the weights by performing the following:

From Momentum we have:

Vdw = (β1 * Vdw) + ((1.0 - β1) * Vdw)
Vdb = (β1 * Vdb) + ((1.0 - β1) * Vdb)

You have to perform bias correction:

Vdw_corrected = Vdw /(1.0 - ( β1)t)
Vdb_corrected = Vdb /(1.0 - ( β1)t)

From RMSProp we have

Sdw = (β2 * Sdw) + ((1.0 - β2)* (dw)2 )
Sdb = (β2 * Sdb) + ((1.0 - β2)* (db)2 )

You have to perform bias correction:

Sdw_corrected = Sdw /(1.0 - ( β2)t)
Sdb_corrected = Sdb /(1.0 - ( β2)t)

Finally, we perform weight updates:

W = W - (α * (Vdw/ (sqrt(Sdw)+ε))) instead of W = W - (α * dw).
b = b - (α * (Vdb/ (sqrt(Sdb)+ε))) instead of b = b - (α * db).

All of this reminds me of Kalman filters, where the observed signal value is not exactly correct, we observe some process noise and observation noise and we try to incorporate that to have a updated/corrected signal value from the sensor.

Sunday, March 13, 2016

Simple Autoencoder on MNIST dataset

So, I had fun with Theano and trained an Autoencoder on a MNIST dataset.

Autoencoder is a simple Neural network (with one hidden layer) which reproduces the input passed to it. By controlling the number of hidden neurons, we can learn interesting features from the input and data can be compressed as well (sounds like PCA). Autoencoders can be used for unsupervised feature learning. Data trasnformed using autoencoders can be used for supervised classification of datasets.

More about Autoencoders is available here. More variants of Autoencoders exist (Sparse, Contractive, etc.) are available with different constraints on the hidden layer representation.

I trained an vanilla Autoencoder for 100 epochs with 16 mini batch size and learning rate of 0.01

Here are the figures for digit 7 with hidden size 10 and 20 (original data was MNIST training dataset with  784 unit length feature vector). Each of the digits (whose value was 8) were passed to the autoencoder. Each of the hidden units were visualized (after computing the mean).