A compilation of deep learning interview questions, with answers, that are frequently asked in machine learning interviews. We hope these questions will help you crack your data science interview …

- When are deep learning algorithms more appropriate compared to traditional machine learning algorithms?
Deep learning algorithms are capable of learning arbitrarily complex non-linear functions by using a network that is deep enough and wide enough, with an appropriate non-linear activation function.
Traditional ML algorithms ...
- What are the different ways of preventing overfitting in a deep neural network? Explain the intuition behind each.
L2 norm regularization: Pushes the weights closer to zero, which helps prevent overfitting.
L1 norm regularization: Pushes the weights closer to zero and also induces sparsity in the weights. Less common ...
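As a minimal sketch of the two penalties named above, the snippet below adds an L2 or L1 term to a data loss. The function names, the `lam` strength, and the example weights are hypothetical and chosen only for illustration.

```python
def l2_penalty(weights, lam):
    """L2 penalty: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)


def l1_penalty(weights, lam):
    """L1 penalty: lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def regularized_loss(data_loss, weights, lam, kind="l2"):
    """Data loss plus the chosen penalty (kind is 'l2' or 'l1')."""
    if kind == "l2":
        return data_loss + l2_penalty(weights, lam)
    if kind == "l1":
        return data_loss + l1_penalty(weights, lam)
    raise ValueError("kind must be 'l2' or 'l1'")


# Hypothetical weights and data loss, for illustration only.
weights = [0.5, -2.0, 0.0, 1.5]
print(regularized_loss(1.0, weights, lam=0.01, kind="l2"))
print(regularized_loss(1.0, weights, lam=0.01, kind="l1"))
```

The L2 term grows with the square of each weight, so large weights are penalized most. The L1 term penalizes every non-zero weight in proportion to its magnitude, which is what tends to drive some weights exactly to zero.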
- What is negative sampling when training the skip-gram model?
Recap: The skip-gram model is a popular word2vec algorithm for training word embeddings. It tries to represent each word in a large text as a lower dimensional vector in ...
- Given a deep learning model, what are the considerations for setting the mini-batch size?
The batch size is a hyperparameter. Usually people try various values to see what works best in terms of speed and accuracy. Suppose you have M training instances and ...
- Why do you typically see overflow and underflow when implementing ML algorithms?
A common pre-processing step is to normalize/rescale inputs so that they are not too high or low.
However, even on normalized inputs, overflows and underflows can occur:
Underflow: Joint probability distribution often ...
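The following sketch shows both failure modes with standard-library Python floats. It assumes a hypothetical product of 100 small probabilities for the underflow case and a softmax over large logits for the overflow case. Working in log space and subtracting the maximum logit are common remedies, shown here as one option rather than as the only fix.

```python
import math

# Underflow: multiplying many small probabilities collapses to 0.0 in float64.
probs = [1e-5] * 100          # hypothetical per-event probabilities
product = 1.0
for p in probs:
    product *= p              # true value is 1e-500, below the float64 range

# Working in log space keeps the same information as a finite number.
log_prob = sum(math.log(p) for p in probs)


def stable_softmax(logits):
    """Softmax that subtracts the max logit so exp() never sees a large argument."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1000.0, 1001.0, 1002.0]   # hypothetical large logits
try:
    math.exp(logits[0])       # overflow: math.exp raises for arguments this large
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

print(product, log_prob, naive_overflowed, stable_softmax(logits))
```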
- How is long term dependency maintained while building a language model?
Language models can be built using the following popular methods:
Using n-gram language model
n-gram language models make an assumption about the value of n. The larger the value of n, the longer the ...
- What are the optimization algorithms typically used in a neural network?
Gradient descent is the most commonly used training algorithm. Momentum is a common way to augment gradient descent such that the gradient in each step is accumulated over past steps ...
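To make the idea of accumulating gradients over past steps concrete, here is a minimal sketch of gradient descent with momentum on a one-dimensional quadratic. The function name `sgd_momentum`, the learning rate, the momentum coefficient `beta`, and the toy objective are all assumptions made for illustration.

```python
def sgd_momentum(grad_fn, x0, lr=0.1, beta=0.9, steps=300):
    """Gradient descent where each update uses a decayed sum of past gradients."""
    x, velocity = x0, 0.0
    for _ in range(steps):
        velocity = beta * velocity + grad_fn(x)   # accumulate over past steps
        x = x - lr * velocity                     # step along the accumulated direction
    return x


# Toy objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = sgd_momentum(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(x_min)
```

Setting `beta=0.0` recovers plain gradient descent, which makes it easy to compare the two on the same objective.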
- I have designed a 2-layer deep neural network for a classifier with 2 units in the hidden layer. I use linear activation functions with a sigmoid at the final layer. Using a data visualization tool, I see that the decision boundary is in the shape of a sine curve. I have trained with 200 data points with known class labels and see that the training error is too high. What do I do?
a. Increase number of units in the hidden layer
b. Increase number of hidden layers
c. Increase data set size
d. Change activation function to tanh
e. Try all of the above
The answer is d. When I use a ...
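One way to see why linear activations are the limiting factor here: a stack of linear layers is itself a single linear map, so the input to the final sigmoid is a linear function of the features. The sketch below checks this numerically with hypothetical weights. Biases are left out for brevity, and `matmul` is a small helper written for this example.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Hypothetical weights: 2 inputs -> 2 hidden units -> 1 output logit.
W1 = [[1.0, -2.0], [0.5, 3.0]]
W2 = [[2.0, -1.0]]

# With linear activations the two layers collapse into one 1x2 matrix.
W_combined = matmul(W2, W1)

x = [[0.3], [-0.7]]                      # one input as a column vector
two_layer = matmul(W2, matmul(W1, x))    # layer by layer
one_layer = matmul(W_combined, x)        # single equivalent linear layer

print(two_layer[0][0], one_layer[0][0])
```

Because the logit is linear in the input, adding more linear units or more linear layers cannot bend the boundary. A non-linear hidden activation such as tanh removes that restriction.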
- Suppose you build word vectors (embeddings) with each word vector having as many dimensions as the vocabulary size (V) and feature values given by the PPMI (positive pointwise mutual information) between corresponding words. What are the problems with this approach and how can you resolve them?
Problems
As the vocabulary size (V) is large, these vectors will be large in size.
They will be sparse as a word may not have co-occurred with all possible words.
Resolution
Dimensionality Reduction using ...
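To make the size and sparsity problems listed above concrete, the sketch below builds PPMI vectors from a tiny, made-up co-occurrence count matrix. The counts and the helper name `ppmi_matrix` are hypothetical. Each word gets one dimension per vocabulary word, and most entries come out as zero.

```python
import math


def ppmi_matrix(counts):
    """PPMI(w, c) = max(0, log2(P(w, c) / (P(w) * P(c)))) from raw co-occurrence counts."""
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    ppmi = []
    for i, row in enumerate(counts):
        out = []
        for j, c in enumerate(row):
            if c == 0:
                out.append(0.0)   # words that never co-occur contribute a zero
            else:
                pmi = math.log2(c * total / (row_sums[i] * col_sums[j]))
                out.append(max(0.0, pmi))
        ppmi.append(out)
    return ppmi


# Made-up symmetric co-occurrence counts for a 4-word vocabulary.
counts = [
    [0, 4, 0, 0],
    [4, 0, 1, 0],
    [0, 1, 0, 3],
    [0, 0, 3, 0],
]
vectors = ppmi_matrix(counts)
zeros = sum(v == 0.0 for row in vectors for v in row)
print(len(vectors[0]), zeros)
```

Even in this toy case each vector has V entries and most of them are zero. With a realistic vocabulary both effects become much more pronounced.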
- What are the commonly used activation functions? When are they used?
Ans. The commonly used activation functions are
Linear: g(x) = x. This is the simplest activation function. However, it cannot model complex decision boundaries. A deep network with linear ...
- I have used a 4-layer fully connected network to learn a complex classifier boundary. I have used tanh activations throughout except the last layer, where I used a sigmoid activation for binary classification. I train for 10K iterations with 100K examples (my data points are 3 dimensional and I initialized my weights to 0 to begin with). I see that my network is unable to fit the training data and is leading to a high training error. What is the first thing I should try?
1. Increase the number of training iterations
2. Make a more complex network – increase hidden layer size
3. Initialize weights to a random small value instead of zeros
4. Change tanh activations to relu
Ans: (3) ...
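A small numerical check of what zero initialization does in a tanh network: with all weights at zero, every hidden activation is tanh(0) = 0, so the weight gradients vanish. The sketch below is a simplified stand-in, not the 4-layer network from the question. It assumes one hidden layer of 4 units, no biases, a log loss, and a single hypothetical 3-dimensional example.

```python
import math
import random


def weight_grads(W1, W2, x, y):
    """Log-loss gradients for x -> tanh(W1 x) -> sigmoid(W2 . h), without biases."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    p = 1.0 / (1.0 + math.exp(-sum(w * hi for w, hi in zip(W2, h))))
    d_out = p - y                                   # dLoss/dlogit for sigmoid + log loss
    grad_W2 = [d_out * hi for hi in h]
    d_h = [d_out * w * (1.0 - hi * hi) for w, hi in zip(W2, h)]  # back through tanh
    grad_W1 = [[dh * xi for xi in x] for dh in d_h]
    return grad_W1, grad_W2


x, y = [0.5, -1.0, 2.0], 1.0                        # hypothetical training example

# Zero initialization: every weight gradient is zero, so the weights never move.
zero_W1 = [[0.0] * 3 for _ in range(4)]
zero_W2 = [0.0] * 4
g1, g2 = weight_grads(zero_W1, zero_W2, x, y)

# Small random initialization: gradients are non-zero and differ across units.
rng = random.Random(0)
rand_W1 = [[rng.uniform(-0.1, 0.1) for _ in range(3)] for _ in range(4)]
rand_W2 = [rng.uniform(-0.1, 0.1) for _ in range(4)]
r1, r2 = weight_grads(rand_W1, rand_W2, x, y)

print(g1, g2)
print(r1, r2)
```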
- Can you give an example of a classifier with high bias and high variance?
High bias means the data is being underfit. The decision boundary is not usually complex enough. High variance happens due to overfitting, the decision boundary is more complex than ...
- Given the following two sentences, how do you determine if Teddy is a person or not? “Teddy bears are on sale!” and “Teddy Roosevelt was a great President!”
This is an example of a Named Entity Recognition (NER) problem. One can build a sequence model such as an LSTM to perform this task. However, as shown in the two sentences above, ...