L2 norm regularization: Makes the weights closer to zero to prevent overfitting.
L1 norm regularization: Makes the weights closer to zero and also induces sparsity in the weights. A less common form of regularization.
Dropout regularization: Drops some of the hidden units at random during training so the network does not overfit by…
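All three techniques are available in common frameworks. Below is a minimal sketch, assuming PyTorch (the layer sizes and regularization coefficients are illustrative, not from the original post): weight decay gives L2, a hand-rolled penalty term gives L1, and a dropout layer handles the third.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A toy model with dropout between the hidden and output layers
model = nn.Sequential(
    nn.Linear(10, 32),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # dropout: zero out 50% of hidden activations at random
    nn.Linear(32, 1),
)

# L2 regularization via weight decay: pulls all weights toward zero
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# L1 regularization added to the loss by hand: also pushes weights toward
# zero, but tends to drive many of them exactly to zero (sparsity)
def l1_penalty(module, lam=1e-4):
    return lam * sum(p.abs().sum() for p in module.parameters())

x, y = torch.randn(64, 10), torch.randn(64, 1)
loss = nn.functional.mse_loss(model(x), y) + l1_penalty(model)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```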
Category: Deep Learning
I have designed a two-layer deep neural network for a classifier, with 2 units in the hidden layer. I use linear activation functions, with a sigmoid at the final layer. Using a data visualization tool, I see that the decision boundary is in the shape of a sine curve. I have tried training with 200 data points with known class labels and see that the training error is too high. What do I do?
a) Increase the number of units in the hidden layer
b) Increase the number of hidden layers
c) Increase the data set size
d) Change the activation function to tanh
e) Try all of the above

The answer is d. When I use a linear activation function, the deep neural network realizes a linear combination of linear functions, which leads to modeling only…
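A small sketch of why d is the fix, again assuming PyTorch (the 8-unit hidden layer is wider than the 2 units in the question, purely so the effect is easy to see): with identity activations the whole network collapses to a linear boundary and the training loss stays high on sine-shaped data, while tanh lets the boundary bend.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data: label a point 1 when it lies above a sine curve
X = torch.rand(200, 2) * torch.tensor([6.28, 4.0]) - torch.tensor([0.0, 2.0])
y = (X[:, 1] > torch.sin(X[:, 0])).float().unsqueeze(1)

def final_loss(activation):
    net = nn.Sequential(nn.Linear(2, 8), activation, nn.Linear(8, 1), nn.Sigmoid())
    opt = torch.optim.Adam(net.parameters(), lr=0.05)
    for _ in range(2000):
        opt.zero_grad()
        loss = nn.functional.binary_cross_entropy(net(X), y)
        loss.backward()
        opt.step()
    return loss.item()

# Identity hidden units: a linear combination of linear functions is still
# linear, so the boundary cannot follow the sine curve
print("linear:", final_loss(nn.Identity()))

# tanh hidden units: the boundary can bend, so training error drops
print("tanh:  ", final_loss(nn.Tanh()))
```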
Can you give an example of a classifier with high bias and high variance?
High bias means the data is being underfit: the decision boundary is usually not complex enough. High variance happens due to overfitting: the decision boundary is more complex than it should be. High bias and high variance together happen when you fit a complex decision boundary that is also not fitting the training set…
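One way to see all three regimes is to refit the same model on many fresh training sets and measure bias and variance of its predictions directly. The numpy sketch below uses polynomial regression rather than a classifier for simplicity, and the degrees, sample sizes, and noise level are assumptions for illustration: a degree-1 fit to a sine is high bias; a high-degree fit is high variance; a degree-1 fit on a handful of noisy points is both at once.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)

x_test = np.linspace(0.05, 0.95, 50)

def predictions(degree, n_train, trials=300):
    """Refit a polynomial of the given degree on many fresh noisy samples."""
    preds = []
    for _ in range(trials):
        x = rng.uniform(0, 1, n_train)
        y = true_f(x) + rng.normal(0, 0.3, n_train)
        preds.append(np.polyval(np.polyfit(x, y, degree), x_test))
    return np.array(preds)

for label, degree, n in [("high bias, low variance ", 1, 100),
                         ("low bias, high variance ", 9, 30),
                         ("high bias, high variance", 1, 5)]:
    p = predictions(degree, n)
    bias2 = np.mean((p.mean(axis=0) - true_f(x_test)) ** 2)
    var = np.mean(p.var(axis=0))
    print(f"{label}: degree={degree}, n={n}, bias^2={bias2:.3f}, variance={var:.3f}")
```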