Saturday, August 1, 2015

Notes from NVIDIA's Intro to Deep Learning (Q&A session)

These are a few notes I took from NVIDIA's intro deep learning class. It's for folks trying to learn deep learning and apply deep learning techniques to their research or products (companies). To follow along, you will need basic knowledge of pattern recognition and machine learning in general. I would recommend going through the pre-requisites and the basics of deep learning first.

How to determine optimal network structure?
Answer: Determining the number of layers in a deep network is still a research topic. It's more of an art than a science. Choose a network architecture similar to other networks that were trained on similar data. For example, choose a LeNet-style architecture for digit or character recognition. Look for existing publications and examples.

If your network isn't over-fitting, it isn't large enough. Keep increasing the number of layers and parameters until you get really good training accuracy, then go ahead and try testing.

How to understand multiple layers of a deep network?
Answer: One of the criticisms of deep learning is that it's seen as a black-box technology (not many people understand what's going on underneath). If we are talking about convolutional neural nets, one way to visualize what they are doing is deconvolution. One can also visualize the filters of vanilla neural nets. NVIDIA's DIGITS software can help you visualize the inner layers of a deep network. You can also do this manually in Caffe, Theano, etc. (other deep learning frameworks). The idea is that lower layers learn edges, higher layers learn combinations of edges, and so on.

Look for the deconvolution paper (which tries to find the input that maximally activates a neuron) to understand what the network is learning.
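As a rough illustration of filter visualization, here is a minimal pure-Python sketch (the `normalize_filter` helper and the Sobel-like weights are made up for illustration) that rescales a learned filter's weights into the 0-255 range so it can be rendered as a grayscale image:

```python
# Minimal sketch: rescale a learned 2D filter's weights linearly into
# [0, 255] integers so the filter can be displayed as a grayscale image.

def normalize_filter(weights):
    """Rescale a 2D weight grid linearly into [0, 255] integers."""
    flat = [w for row in weights for w in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0          # avoid divide-by-zero for flat filters
    return [[int(round((w - lo) / span * 255)) for w in row]
            for row in weights]

edge_filter = [[-1.0, 0.0, 1.0],
               [-2.0, 0.0, 2.0],
               [-1.0, 0.0, 1.0]]     # Sobel-like edge filter, for illustration

print(normalize_filter(edge_filter))
```

Lower-layer filters visualized this way tend to look like oriented edges and color blobs; higher layers are harder to interpret directly, which is where deconvolution helps.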

What are usual training, testing duration for a network? What about in deployment phase?
Answer: It varies. It depends on the size of the dataset, the RAM and processing power of the PC, the presence of a GPU, memory, etc.

For a large dataset:
Training takes hours/days/weeks (we need to do feed-forward and back-propagation for every image sample and update weights throughout).
Testing is feed-forward (usually fast).
Deployment is also feed-forward and usually takes very little time. Theano and Caffe can be set to use the CPU or GPU for feed-forward (not just during the training phase).
I've used this personally on a convolutional neural network (through Caffe): one sample image (256 x 256) took 0.16 seconds to feed forward on the CPU and 0.016 seconds on the GPU. I used an NVIDIA Quadro 2100M.
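Those latency numbers can be reproduced in spirit with a simple timer around a forward pass. In this sketch a single dense layer in pure Python stands in for the real network (the sizes and weights are made up):

```python
# Rough per-sample feed-forward latency measurement. One fully
# connected layer here stands in for a whole network.
import time

def dense_forward(x, weights):
    """One fully connected layer: y_j = sum_i x_i * W[j][i]."""
    return [sum(xi * wj for xi, wj in zip(x, row)) for row in weights]

x = [0.5] * 256                      # one flattened input sample
weights = [[0.01] * 256 for _ in range(128)]

start = time.perf_counter()
y = dense_forward(x, weights)
elapsed = time.perf_counter() - start
print("feed-forward took %.6f seconds" % elapsed)
```

For a real deployment benchmark you would time the framework's own forward call on the actual model, averaged over many samples.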

Are there any limitations on dimensions of an input in a deep network?
Answer: For vanilla neural networks and convolutional nets, the input size is usually fixed.
This is done to support training in batches: network operations can be vectorized and training runs quickly. To fit inputs to the fixed size, we crop them, resize them, etc.
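A center crop, one of the fixes mentioned above, can be sketched like this (pure Python over a 2D list of pixel values; the helper name is made up):

```python
# Sketch of the fixed-input-size workaround: center-crop an image
# (here a plain 2D list of pixel values) to the size the network expects.

def center_crop(image, out_h, out_w):
    h, w = len(image), len(image[0])
    top = (h - out_h) // 2
    left = (w - out_w) // 2
    return [row[left:left + out_w] for row in image[top:top + out_h]]

image = [[r * 10 + c for c in range(6)] for r in range(6)]   # fake 6x6 image
cropped = center_crop(image, 4, 4)
print(len(cropped), len(cropped[0]))   # 4 4
```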

What is Fine Tuning?
Answer: Let's say you have some data and you train a network. Later you come across another class of data to be added to the classification problem. You can initialize a new network with the previous weights (in the lower layers), add new structure to the upper layers, and fine-tune them (perform feed-forward and back-propagation from the classification layer back to the input). It is an effective and fast way to include new classes of data. The basic idea is that the lower layers of the network learn edges, while the higher layers learn combinations of these edges. So the higher layers change, whereas the lower layers usually keep similar weights.
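The idea can be sketched as follows, with a dict of lists standing in for a real framework's parameter store (all names and numbers here are illustrative):

```python
# Fine-tuning sketch: copy the lower-layer weights of a trained net
# into a new net and re-initialize only the top (classifier) layer.
import random

old_net = {"conv1": [0.1, 0.2], "conv2": [0.3, 0.4], "fc": [0.5, 0.6]}

def fine_tune_init(old_net, top_layer, n_new_outputs, seed=0):
    rng = random.Random(seed)
    new_net = {name: list(w) for name, w in old_net.items()
               if name != top_layer}                  # reuse lower layers
    new_net[top_layer] = [rng.uniform(-0.01, 0.01)    # fresh small random
                          for _ in range(n_new_outputs)]  # weights on top
    return new_net

new_net = fine_tune_init(old_net, "fc", 4)
print(new_net["conv1"], len(new_net["fc"]))
```

In practice a framework like Caffe does the weight copying for you when you resume from a trained model with a renamed top layer.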

What is transfer learning?
Answer: Try to reuse the network architecture and weights on new data (from a similar source), then fine-tune the network. For example, if we train a network to detect images of animals, we can use the same weights to initialize a similar network architecture for a classifier that detects humans.

What are some of the techniques followed to pre-process the data?
Answer: Subtract the mean across the training images. Standardize the values (e.g., scale them into the range 0-1). PCA, LDA, and whitening can also be applied. Subtracting the mean of the data and normalizing the range of values go a long way toward improving the performance of the classification task (generally speaking).
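The mean-subtraction and range-normalization steps can be sketched in pure Python (the `preprocess` helper is illustrative, not any framework's API):

```python
# The two preprocessing steps mentioned above: subtract the dataset
# mean, then rescale the centered values into [0, 1].

def preprocess(samples):
    flat = [v for s in samples for v in s]
    mean = sum(flat) / len(flat)
    centered = [[v - mean for v in s] for s in samples]
    lo = min(v for s in centered for v in s)
    hi = max(v for s in centered for v in s)
    span = (hi - lo) or 1.0          # guard against constant data
    return [[(v - lo) / span for v in s] for s in centered]

data = [[0.0, 50.0], [100.0, 150.0]]  # two fake 2-pixel "images"
scaled = preprocess(data)
print(scaled)
```

Note that the mean and range should be computed on the training set only, then reused unchanged at test/deployment time.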

Will weights fluctuate or converge smoothly when more data is added to the training set?
Answer: Not always. It usually converges smoothly, but there are cases where it may fluctuate; it depends on the data you are adding. Sometimes it may stop changing (vanishing gradient) because the gradient propagated back is very, very small. You will have to tune the learning rate and how you initialize the weights. Never initialize the weights to zero or to a symmetric constant (all zeros, all ones, or any single number). Initialize them to small, non-zero random numbers less than 1 in magnitude.
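A small-random initialization along these lines might look like this (the helper name and scale are illustrative):

```python
# Initialization sketch matching the advice above: small, non-zero,
# random weights rather than all zeros (all-zero weights would make
# every neuron in a layer compute the same thing forever).
import random

def init_weights(n_in, n_out, scale=0.01, seed=42):
    rng = random.Random(seed)
    return [[rng.uniform(-scale, scale) for _ in range(n_in)]
            for _ in range(n_out)]

w = init_weights(256, 128)           # one 256-in, 128-out layer
print(len(w), len(w[0]))
```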

How to use multiple GPUs for training neural nets?
It is still a research topic. There are several ways:
1) Data parallelism: use the same network architecture, split the training data between GPUs, train a separate copy of the net on each, and exchange parameters across GPUs so that the final network parameters become a combination of the nets trained on the different GPUs.
2) Model parallelism: split the same network across GPUs and train different layers on different GPUs.
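The first option can be sketched by averaging the parameter sets from the per-GPU copies (a deliberately simple combination rule; real systems exchange parameters more cleverly and more often):

```python
# Data-parallel sketch: each "GPU" trains its own copy of the same
# network on a different data shard, then the parameter sets are
# averaged into one combined model.

def average_params(param_sets):
    n = len(param_sets)
    return [sum(ws) / n for ws in zip(*param_sets)]

gpu0_params = [0.2, 0.4, 0.6]   # weights after training on shard 0
gpu1_params = [0.4, 0.6, 0.8]   # weights after training on shard 1
merged = average_params([gpu0_params, gpu1_params])
print(merged)                   # ≈ [0.3, 0.5, 0.7]
```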

How many epochs/iterations you train the network for? What are batches?
Answer: Epoch and iteration are often used interchangeably here. An epoch is one full pass through the training set, in which each training sample is fed to the network once. A typically used number is 1000.
Batch: the number of images used in one pass.
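The bookkeeping between these terms is simple: one epoch consists of ceil(samples / batch size) iterations. A sketch:

```python
# One epoch = a full pass over the training set, made of
# ceil(n_samples / batch_size) iterations (mini-batch updates).
import math

def iterations_per_epoch(n_samples, batch_size):
    return math.ceil(n_samples / batch_size)

print(iterations_per_epoch(50000, 256))   # 196 iterations per epoch
```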

How to determine batch size?
Answer: It depends on the size of the individual training examples and the number of parameters in the network. Experiment with different batch sizes based on the size of your RAM and GPU memory. In Caffe, if the batch size is too large and your computer doesn't have enough memory, the process will fail before training starts. Try small batch sizes: 8, 16, 64, 100, 256, ...
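A back-of-the-envelope memory estimate can guide the choice. This sketch counts only float32 activations for illustrative layer sizes (real usage also includes weights, gradients, and framework overhead):

```python
# Rough activation-memory estimate for choosing a batch size:
# batch * H * W * C * 4 bytes (float32), per stored layer.

def batch_activation_mb(batch, h, w, c, n_layers=1, bytes_per_val=4):
    return batch * h * w * c * n_layers * bytes_per_val / (1024 ** 2)

for batch in (8, 64, 256):
    print(batch, round(batch_activation_mb(batch, 256, 256, 3), 1), "MB")
```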

How to avoid overfitting?
Answer: We usually monitor the performance of the model on the training set and a validation set. If, after some time, accuracy keeps increasing on the training data but not on the validation set, we are overfitting the network. We then need to stop the training process and save the weights. Other techniques are also available (e.g., dropout). Getting more data also helps.
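The stop-when-validation-stalls rule can be sketched as a simple early-stopping check (the `patience` parameter and accuracy numbers are made up):

```python
# Early-stopping sketch: stop when validation accuracy has not improved
# for `patience` epochs, and keep the weights from the best epoch.

def early_stop_epoch(val_accuracies, patience=2):
    """Return the index of the best epoch, or None to keep training."""
    best = max(range(len(val_accuracies)), key=val_accuracies.__getitem__)
    if len(val_accuracies) - 1 - best >= patience:
        return best          # no improvement for `patience` epochs: stop
    return None

history = [0.60, 0.72, 0.80, 0.79, 0.78]   # validation accuracy per epoch
print(early_stop_epoch(history))            # 2: roll back to epoch 2
```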

What to use for time series data, Convolutional Nets or Recurrent Neural Nets?
Answer: Usually for time series data we use a recurrent neural net (it can handle temporal data). If the time series data is purely one-dimensional, a convolutional net isn't usually chosen. A convolutional net can be applied when there is a spatial component. If the time series data has both temporal and spatial components, we can treat the data as an image (like audio data represented in different bands with different colors) and use a convolutional net instead of a recurrent net.
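One simple way to give a 1-D series a second, image-like dimension is to slice it into fixed-length frames (a crude stand-in for a spectrogram; the helper name is illustrative):

```python
# Sketch of the "treat the series as an image" idea: slice a 1-D signal
# into fixed-length frames, giving a 2-D grid (frames x samples) that a
# convolutional net could take as input, much like a spectrogram.

def frame_signal(signal, frame_len):
    n_frames = len(signal) // frame_len
    return [signal[i * frame_len:(i + 1) * frame_len]
            for i in range(n_frames)]

signal = list(range(12))           # fake 12-sample signal
frames = frame_signal(signal, 4)
print(frames)                      # 3 rows of 4 samples each
```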

All modern deep learning frameworks make use of GPUs. You don't necessarily have to learn GPU programming unless you are planning to customize those frameworks (write your own activation function, a customized layer, vectorize a layer through the GPU, etc.). Usually you can make use of the GPU with the flick of a switch (turning on a use-GPU flag). In Theano, you change the config file settings; in Caffe, you can set GPU or CPU through the solver file or the API (PyCaffe).
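As an example of that flick-of-a-switch setup, Theano (circa 2015) read its settings from a `.theanorc` config file; a minimal fragment looked like this (exact device names vary by Theano version and hardware):

```ini
[global]
device = gpu
floatX = float32
```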

From shallow to deep networks (from Yann LeCun's ICML 2013 presentation).
