Deep Learning Interview Questions

A compilation of deep learning interview questions, with answers, that are commonly asked in machine learning interviews. We hope these questions will help you crack your data science interview …
Positional Encoding in the Transformer Model


https://youtu.be/5wpzAk4THcI
Transformer models are very popular. With the quadratic attention layer, how does the sequential nature of the data get captured? Through positional encoding. This video briefly explains the concept of positional encoding ...
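As a rough illustration of the idea, here is a minimal sketch of the sinusoidal positional encoding used in the original Transformer architecture. The video may cover other variants; the function name and the even `d_model` restriction are assumptions made for this sketch.

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position encodings."""
    if d_model % 2 != 0:
        raise ValueError("this sketch assumes an even d_model")
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(0, d_model, 2):
            # Each (sin, cos) pair of dimensions shares one frequency.
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))  # even dimension
            row.append(math.cos(angle))  # odd dimension
        table.append(row)
    return table

# One encoding row per position; rows are added to the token embeddings.
pe = sinusoidal_positional_encoding(seq_len=4, d_model=8)
```

Each position gets a distinct vector, so the model can tell positions apart even though attention itself treats its inputs as an unordered set.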
GPT Model


https://www.youtube.com/watch?v=PbiJyXZMB9o
This video explains the GPT model, where it is used, and a small code snippet showing how to use it in Python with a toy example.

I have designed a 2-layer deep neural network for a classifier, with 2 units in the hidden layer. I use linear activation functions, with a sigmoid at the final layer. Using a data visualization tool, I see that the decision boundary is in the shape of a sine curve. I have tried to train with 200 data points with known class labels and see that the training error is too high. What do I do?
a. Increase number of units in the hidden layer
b. Increase number of hidden layers
c. Increase data set size
d. Change activation function to tanh
e. Try all of the above

The answer is d. When I use a ...
Suppose you build word vectors (embeddings) where each word vector has as many dimensions as the vocabulary size (V), and the feature values are the PPMI (positive pointwise mutual information) between the corresponding words. What are the problems with this approach and how can you resolve them?
Problems

As the vocabulary size (V) is large, these vectors will be large in size.
They will be sparse as a word may not have co-occurred with all possible words.

Resolution

Dimensionality Reduction using ...
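To make the two problems above concrete, here is a minimal sketch that builds PPMI vectors from a toy co-occurrence count matrix. The counts and the function name `ppmi_matrix` are made up for illustration; a real vocabulary would give a V x V matrix with a far larger V.

```python
import numpy as np

def ppmi_matrix(counts):
    """Positive PMI from a V x V word co-occurrence count matrix."""
    counts = np.asarray(counts, dtype=float)
    p_joint = counts / counts.sum()
    p_row = p_joint.sum(axis=1, keepdims=True)   # P(word)
    p_col = p_joint.sum(axis=0, keepdims=True)   # P(context)
    expected = p_row * p_col
    pmi = np.full_like(p_joint, -np.inf)         # log(0) for unseen pairs
    seen = p_joint > 0
    pmi[seen] = np.log2(p_joint[seen] / expected[seen])
    return np.maximum(pmi, 0.0)                  # clip negatives to zero

# Hypothetical symmetric co-occurrence counts for a 4-word vocabulary.
toy_counts = [
    [0, 2, 0, 0],
    [2, 0, 1, 0],
    [0, 1, 0, 3],
    [0, 0, 3, 0],
]
ppmi = ppmi_matrix(toy_counts)
zero_fraction = float(np.mean(ppmi == 0.0))
```

Each row of `ppmi` is one word vector of length V, and `zero_fraction` reports how much of the matrix is empty because the word pairs never co-occurred (or had non-positive PMI).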
How is long term dependency maintained while building a language model?


Language models can be built using the following popular methods –

Using n-gram language model

n-gram language models make an assumption about the value of n. The larger the value of n, the longer the ...
What are the commonly used activation functions? When are they used?
Ans. The commonly used activation functions are:

Linear : g(x) = x. This is the simplest activation function. However it cannot model complex decision boundaries. A deep network with linear ...
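A quick numerical check of why linear activations limit what a network can model: stacking linear layers yields another linear map. This is a sketch with arbitrary random weights; the layer sizes and variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 3)), rng.normal(size=2)   # hidden layer
W2, b2 = rng.normal(size=(1, 2)), rng.normal(size=1)   # output layer

def two_linear_layers(x):
    hidden = W1 @ x + b1          # linear activation g(z) = z
    return W2 @ hidden + b2

# The same function written as a single linear layer.
W_eq = W2 @ W1
b_eq = W2 @ b1 + b2

x = rng.normal(size=3)
collapsed = W_eq @ x + b_eq
same = bool(np.allclose(two_linear_layers(x), collapsed))
```

Because the two-layer network and the single layer agree on every input, adding more linear layers does not add any expressive power.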
Given the following two sentences, how do you determine if Teddy is a person or not? “Teddy bears are on sale!” and “Teddy Roosevelt was a great President!”
This is an example of a Named Entity Recognition (NER) problem. One can build a sequence model such as an LSTM to perform this task. However, as shown in both the sentences above, ...
Can you give an example of a classifier with high bias and high variance?
High bias means the data is being underfit: the decision boundary is usually not complex enough. High variance happens due to overfitting: the decision boundary is more complex than ...
When are deep learning algorithms more appropriate compared to traditional machine learning algorithms?
Deep learning algorithms are capable of learning arbitrarily complex non-linear functions by using a deep enough and a wide enough network with the appropriate non-linear activation function. 
Traditional ML algorithms ...
What are the different ways of preventing overfitting in a deep neural network? Explain the intuition behind each.
L2 norm regularization: Pushing the weights closer to zero helps prevent overfitting.
L1 norm regularization: Pushes the weights closer to zero and also induces sparsity in the weights. Less common ...
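The difference between the two penalties can be sketched by applying only the regularization part of one update step to a few weights. This assumes an L2 penalty of (lam/2)*w^2 and handles L1 with a soft-thresholding step, which is one common way to apply it; the function names and numbers are hypothetical.

```python
def l2_shrink(weights, lr, lam):
    # Gradient of (lam / 2) * w**2 is lam * w, so each weight shrinks
    # in proportion to its size and never lands exactly on zero.
    return [w * (1.0 - lr * lam) for w in weights]

def l1_shrink(weights, lr, lam):
    # Soft-thresholding: every weight moves toward zero by the same
    # amount, and weights smaller than the threshold become exactly zero.
    threshold = lr * lam
    shrunk = []
    for w in weights:
        magnitude = max(abs(w) - threshold, 0.0)
        if magnitude == 0.0:
            shrunk.append(0.0)
        else:
            shrunk.append(magnitude if w > 0 else -magnitude)
    return shrunk

weights = [0.8, -0.05, 0.02, -1.5]
after_l2 = l2_shrink(weights, lr=0.1, lam=1.0)
after_l1 = l1_shrink(weights, lr=0.1, lam=1.0)
```

In this toy case the two small weights survive the L2 step as smaller non-zero values, while the L1 step sets them to exactly zero, which is the sparsity effect mentioned above.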
BERT Model


https://youtu.be/ZPmQzexoi-Q
This video explains the BERT model, its architecture, how it is trained and used. It also talks about when we would want to use the BERT model in comparison with ...
The BERT Score – Evaluating Text Generation


https://www.youtube.com/watch?v=4Hv_3Jd2O24




This video talks about the evaluation metric BERTScore, why it is needed over existing metrics such as the BLEU score, and how it is computed and evaluated. Traditional ...
What is negative sampling when training the skip-gram model?
Recap: The skip-gram model is a popular algorithm for training word embeddings such as word2vec. It tries to represent each word in a large text as a lower dimensional vector in ...
Given a deep learning model, what are the considerations to set mini-batch size?
The batch size is a hyperparameter. Usually people try various values to see what works best in terms of speed and accuracy. Suppose you have M training instances and ...
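One concrete consequence of the batch size is the number of parameter updates per pass over the data. The sketch below assumes M training instances split into consecutive batches, with a smaller final batch when M is not a multiple of the batch size; both helper names are hypothetical.

```python
import math

def updates_per_epoch(num_examples, batch_size):
    """Number of parameter updates in one pass over the data."""
    return math.ceil(num_examples / batch_size)

def minibatch_slices(num_examples, batch_size):
    """Yield (start, end) index pairs covering the data once."""
    for start in range(0, num_examples, batch_size):
        yield start, min(start + batch_size, num_examples)

M = 1000
full_batch_updates = updates_per_epoch(M, batch_size=M)   # batch gradient descent
stochastic_updates = updates_per_epoch(M, batch_size=1)   # one example per update
mini_batch_updates = updates_per_epoch(M, batch_size=32)
```

Smaller batches give more (noisier) updates per epoch, while larger batches give fewer updates that each cost more memory and compute.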
I have used a 4-layer fully connected network to learn a complex classifier boundary. I have used tanh activations throughout, except in the last layer, where I used a sigmoid activation for binary classification. I train for 10K iterations with 100K examples (my data points are 3-dimensional and I initialized my weights to 0). I see that my network is unable to fit the training data, leading to a high training error. What is the first thing I try?

1. Increase the number of training iterations
2. Make a more complex network – increase hidden layer size
3. Initialize weights to a random small value instead of zeros
4. Change tanh activations to relu

 
 
Ans : (3) ...
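The effect of zero initialization can be checked numerically. Below is a minimal sketch of backpropagation for a 4-layer tanh network with a sigmoid output and cross-entropy loss; the hidden width of 5, the single training example, and the function name are assumptions made for illustration.

```python
import numpy as np

def weight_gradients(weights, biases, x, y):
    """Backprop through a tanh MLP with a sigmoid output (cross-entropy loss)."""
    # Forward pass, keeping every layer's activation.
    activations = [x]
    for i, (W, b) in enumerate(zip(weights, biases)):
        z = W @ activations[-1] + b
        is_last = i == len(weights) - 1
        activations.append(1.0 / (1.0 + np.exp(-z)) if is_last else np.tanh(z))
    # Backward pass.
    delta = activations[-1] - y                  # dLoss/dz at the sigmoid output
    grads = []
    for i in reversed(range(len(weights))):
        grads.append(np.outer(delta, activations[i]))
        if i > 0:
            # tanh'(z) = 1 - tanh(z)**2
            delta = (weights[i].T @ delta) * (1.0 - activations[i] ** 2)
    return grads[::-1]

sizes = [3, 5, 5, 5, 1]                          # 3-dimensional input, 4 weight layers
x, y = np.array([0.5, -1.0, 2.0]), np.array([1.0])
biases = [np.zeros(o) for o in sizes[1:]]

zero_w = [np.zeros((o, i)) for i, o in zip(sizes[:-1], sizes[1:])]
zero_grads = weight_gradients(zero_w, biases, x, y)

rng = np.random.default_rng(0)
rand_w = [rng.normal(scale=0.1, size=(o, i)) for i, o in zip(sizes[:-1], sizes[1:])]
rand_grads = weight_gradients(rand_w, biases, x, y)
```

With all-zero weights every hidden activation is tanh(0) = 0 and no error signal flows back through the zero weight matrices, so every weight gradient in `zero_grads` is zero and more iterations cannot help. Small random weights break this, and `rand_grads` contains non-zero gradients in every layer.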
Batch vs Mini-Batch vs Stochastic Gradient Descent


https://www.youtube.com/watch?v=1xMs6A3DLYw
Most deep learning architectures use a variant of the gradient descent optimization algorithm to find the best set of parameters for the network, given the loss function and the ...
What are the optimization algorithms typically used in a neural network?
Gradient descent is the most commonly used training algorithm. Momentum is a common way to augment gradient descent such that the gradient in each step is accumulated over past steps ...
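The accumulation of past gradients can be sketched on a one-dimensional problem. This uses one common formulation of momentum (velocity = beta * velocity + gradient); other texts scale the terms differently, and the function name, learning rate, and beta value are assumptions for illustration.

```python
def gradient_descent_with_momentum(grad_fn, w, lr=0.1, beta=0.9, steps=200):
    """Minimize a 1-D function given its gradient, using momentum."""
    velocity = 0.0
    for _ in range(steps):
        # Running accumulation of gradients: older gradients decay by beta.
        velocity = beta * velocity + grad_fn(w)
        w = w - lr * velocity
    return w

# Toy objective f(w) = (w - 3)**2, whose gradient is 2 * (w - 3).
w_final = gradient_descent_with_momentum(lambda w: 2.0 * (w - 3.0), w=0.0)
```

Setting `beta=0.0` recovers plain gradient descent, which makes it easy to compare the two on the same objective.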
Skip or Residual Connections in Deep Networks


https://www.youtube.com/watch?v=HW7Kv8HGdvM
The transformer model uses skip connections to promote accelerated learning through a deep architecture. This video explains Skip or Residual connections to enable building deep neural networks bypassing challenges such ...
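In code, a skip connection is just the block's input added back to its output. The sketch below uses a hypothetical two-layer ReLU block with made-up sizes and omits biases and normalization for brevity.

```python
import numpy as np

def residual_block(x, W1, W2):
    """Compute x + F(x), where F is a small two-layer ReLU network."""
    hidden = np.maximum(W1 @ x, 0.0)   # ReLU
    return x + W2 @ hidden             # skip connection adds the input back

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1 = rng.normal(size=(8, 4))

# If the learned branch contributes nothing, the block is the identity,
# so the input (and its gradient) passes through unchanged.
W2_zero = np.zeros((4, 8))
identity_out = residual_block(x, W1, W2_zero)

W2 = rng.normal(scale=0.1, size=(4, 8))
out = residual_block(x, W1, W2)
```

Because each block only has to learn a correction on top of the identity, stacking many of them does not cut off the path from the loss back to the early layers.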
Scaled Dot Product Attention


https://www.youtube.com/watch?v=RZN5Pwb4Ywg
This video explains the motivation behind scaled dot product attention used in the transformer architecture and how it is computed.
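For reference, here is a minimal NumPy sketch of scaled dot product attention for a single head, without masking or batching. The shapes and variable names are assumptions for illustration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Return (output, attention weights) for queries Q, keys K, values V."""
    d_k = Q.shape[-1]
    # Dividing by sqrt(d_k) keeps the scores from growing with dimension.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the keys, shifted by the row max for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))   # 2 queries of dimension 4
K = rng.normal(size=(3, 4))   # 3 keys of dimension 4
V = rng.normal(size=(3, 5))   # 3 values of dimension 5
output, attn = scaled_dot_product_attention(Q, K, V)
```

Each row of `attn` sums to 1, and each row of `output` is the corresponding weighted average of the value vectors.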

Why do you typically see overflow and underflow when implementing ML algorithms?
A common pre-processing step is to normalize/rescale inputs so that they are not too high or low.

However, even on normalized inputs, overflows and underflows can occur:

Underflow: Joint probability distribution often ...
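Both effects are easy to reproduce with ordinary Python floats. The sketch below multiplies many small probabilities (the numbers are made up), then shows the usual workaround of working in log space; `log_sum_exp` is a hypothetical helper name.

```python
import math

# Underflow: the product of many small probabilities rounds to zero.
probs = [1e-5] * 100
product = 1.0
for p in probs:
    product *= p                      # true value 1e-500 is not representable

# Summing logs keeps the same information in a safe range.
log_joint = sum(math.log(p) for p in probs)

# Overflow: exponentiating a large score directly fails.
try:
    math.exp(1000.0)
    overflowed = False
except OverflowError:
    overflowed = True

def log_sum_exp(values):
    """log(sum(exp(v))) computed by factoring out the largest value."""
    largest = max(values)
    return largest + math.log(sum(math.exp(v - largest) for v in values))

stable = log_sum_exp([1000.0, 1000.0])
```

Here `product` collapses to 0.0 while `log_joint` stays finite, and `log_sum_exp` returns a finite result for inputs whose direct exponentials overflow.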
Normalization in Deep Neural Networks


https://youtu.be/lLCSNRzx4F8




Batch norm and Layer norm are common normalization techniques. This brief video talks about the need for normalization and the types of norms in deep neural networks.
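The main difference between the two is the axis over which statistics are computed. This sketch omits the learnable scale and shift parameters and the running statistics that batch norm uses at inference time; the input shape is an assumption for illustration.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize each feature using statistics across the batch (axis 0)."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def layer_norm(x, eps=1e-5):
    """Normalize each example using statistics across its features (axis 1)."""
    mean = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 3))   # 4 examples, 3 features

bn = batch_norm(x)   # every column has roughly zero mean
ln = layer_norm(x)   # every row has roughly zero mean
```

Layer norm does not depend on the other examples in the batch, which is one reason it suits variable-length sequence models.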