Given a deep learning model, what are the considerations for setting the mini-batch size?

The batch size is a hyperparameter, and in practice people try several values to see what works best in terms of speed and accuracy. Suppose you have M training instances and a mini-batch size of B: one pass over the dataset (an epoch) takes M/B mini-batch iterations, so a larger batch size means fewer, better-parallelized iterations per epoch and therefore a faster epoch, as long as the data and network weights fit in GPU memory. On the other hand, if you count passes over the entire dataset, it is widely believed that smaller batch sizes need fewer epochs to converge, because they perform more weight updates per epoch.
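As a minimal sketch of the arithmetic above (the dataset size and batch sizes here are hypothetical, chosen only for illustration):

```python
import math

def updates_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of gradient updates in one pass (epoch) over the data."""
    return math.ceil(num_examples / batch_size)

M = 50_000  # hypothetical number of training instances
for B in (1, 32, 256, 50_000):
    print(f"batch size {B:>6}: {updates_per_epoch(M, B):>6} updates per epoch")
```

With B = 1 (SGD) an epoch costs 50,000 sequential updates, while B = M (full batch) costs a single, highly parallelizable one, which is why larger batches make each epoch faster on a GPU.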


In terms of accuracy, one way to reason about it is to compare stochastic gradient descent, or SGD (mini-batch size = 1), with full-batch gradient descent (mini-batch size = M) and interpolate to mini-batch sizes in between. In full-batch gradient descent, each iteration makes one update in the exact direction of the gradient over the whole training set. In SGD, we make many updates per epoch in noisier directions, where each update builds on the previous one. Full-batch gradient descent may therefore reach a lower objective value on the training set. However, it is widely believed that the noise in SGD acts as a regularizer, steering the optimizer toward flat minima rather than sharp ones, which tends to generalize better and helps avoid overfitting.

