# Machine Learning Interview Questions

A compilation of questions, with answers, that are commonly asked in Machine Learning interviews. We hope these questions will help you crack your data science interview …

``````
What is Bayesian Logistic Regression?
In this video, we try to understand the motivation behind Bayesian Logistic Regression and how it can be implemented.
Recap of Logistic Regression
Logistic Regression is one of the most ...

What would you care more about – precision or recall for a spam filtering problem?
A false positive means an email was not spam but we labeled it spam; a false negative means it was spam but we did not label it spam.
Precision = (TP / TP ...

What is Stacking? Ensembling Multiple Dissimilar Models
Many of us have heard of bagging and boosting, two commonly used ensemble learning techniques. This video describes ways to combine multiple dissimilar ML models through voting, averaging and stacking to ...

What is the Maximum Likelihood Estimate (MLE)?
Probabilistic models help us capture the inherent uncertainty in real-life situations. Examples of probabilistic models are Logistic Regression, the Naive Bayes Classifier and so on. Typically we fit (find parameters) ...

What is the complexity of the Viterbi algorithm?
The Viterbi algorithm is a dynamic programming approach to find the most probable sequence of hidden states given the observed data, as modeled by an HMM.
Without dynamic programming, it becomes an ...

Dartboard Paradox: Probability Density Function vs Probability
What is the Dartboard Paradox?
Assume you are throwing a dart at a dartboard such that it hits somewhere on the dartboard. The dartboard paradox: The probability of hitting any specific ...

Fairness in ML: How to deal with bias in ML pipelines?

https://youtu.be/bxAQcOzHj7k

In this 30-minute video, we talk about bias and fairness in ML workflows:

Why we need to handle fairness in ML models
How biases typically creep into the ML pipeline
How ...

What are popular ways of dimensionality reduction in NLP tasks? Do you think this is even important?
A common representation is bag of words, which is very high dimensional given a large vocabulary size. Commonly used ways of dimensionality reduction in NLP:

TF-IDF: Term frequency, inverse document ...

How can you increase the recall of a search query (on a search engine or e-commerce site) result without changing the algorithm?

Since we are not allowed to change the algorithm, we can only play with modifying or augmenting the search query. (Note, we either change the algorithm/model or the data, here ...

What are knowledge graphs? When would you need a knowledge graph over, say, a database to store information?
A knowledge graph organizes real world knowledge as entities and relationships between entities. Creating a knowledge graph often involves scraping / ingesting unstructured data and creating structure out of it ...

Recursive Feature Elimination for Feature Selection
This video explains the technique of Recursive Feature Elimination for feature selection when we have data with lots of features.

Why do we need Feature Elimination?
Often we end up with large ...

How many parameters are there for an HMM model?
Let us calculate the number of parameters for a bi-gram HMM given as

Let  be the total number of states  and  be the vocabulary size ...

How do you design a system that reads a natural language question and retrieves the closest FAQ answer?
There are multiple approaches for FAQ based question answering

Keyword based search (Information retrieval approach): Tag each question with keywords. Extract keywords from the query and retrieve all relevant questions and answers. Easy ...

How do you handle missing data in an ML algorithm?
Missing data is caused either by issues in data collection or, sometimes, the data model could allow for missing data (for instance, the field ‘maximum credit limit on any ...

How to measure the performance of the language model?
While building a language model, we try to estimate the probability of a sentence or a document.
Given sequences (sentences or documents) like

Language model (bigram language model) will be ...

NDCG Evaluation Metric for Recommender Systems

https://youtu.be/J-7HbXW9JpM
This video talks about the NDCG metric, which evaluates recommender systems by taking into account both the degree of relevance and the ranking of items.

What is Bayesian Modeling?
This video explains Bayesian Modeling: Why do we need Bayesian Modeling? What is Bayesian Modeling? What are some examples where we can practically use Bayesian Modeling?
Check out https://www.tensorflow.org/probability for code ...

LIME for local explanations

https://youtu.be/s3J5x5zIU0Y

This video talks about LIME for local interpretability. It explains the motivation for local explanations, followed by how LIME works to provide local explanations, along with information on where ...

Can we use the AUC Metric for an SVM Classifier?
This video explains computing the AUC metric for an SVM classifier, or other classifiers that give the absolute class values as outcomes.
What is Area Under the Curve?
AUC is the ...

MAP at K: An evaluation metric for Ranking

https://youtu.be/QSaK4l9C66c

This video talks about the Mean Average Precision at K (popularly called the MAP@K) metric that is commonly used for evaluating recommender systems and other ranking related problems.

For a ...

Machine Translation

https://youtu.be/T0EKtufdIw0

Here is a high level overview of Machine Translation. This short video covers a brief history of machine translation followed by a quick explanation of SMT (statistical machine translation) and ...

What are the advantages and disadvantages of using Naive Bayes for spam detection?
Disadvantages:
Naive Bayes is based on the conditional independence of features assumption – an assumption that is not valid in many real world scenarios. Hence it sometimes oversimplifies the problem ...

What is PMI?
PMI (Pointwise Mutual Information) is a measure of correlation between two events x and y:

PMI(x, y) = log( p(x, y) / (p(x) p(y)) )

As you can see from above ...

You want to find food related topics in twitter – how do you go about it?
One can use any of the topic models above to get topics. However, to direct the topics to contain food related information, specialized topic modeling algorithms are available.
However, one ...

Suppose you are modeling text with an HMM. What is the complexity of finding the most probable sequence of tags or states from a sequence of text using a brute force algorithm?
Assume there are a total of  states and let  be the length of the longest sequence.
Think how we generate text using an HMM. We first have a state sequence and ...

How do you deal with dataset imbalance in a problem like spam filtering?
Class imbalance is a very common problem when applying ML algorithms. Spam filtering is one such application where class imbalance is apparent. There are many more non-spam emails in a ...

I have designed a 2 layered deep neural network for a classifier with 2 units in the hidden layer. I use linear activation functions with a sigmoid at the final layer. I use a data visualization tool and see that the decision boundary is in the shape of a sine curve. I have tried to train with 200 data points with known class labels and see that the training error is too high. What do I do?
a. Increase number of units in the hidden layer
b. Increase number of hidden layers
c. Increase data set size
d. Change activation function to tanh
e. Try all of the above

The answer is d. When I use a ...

How to find the Optimal Number of Clusters in K-means? Elbow and Silhouette Methods
K-means Clustering Recap
Clustering is the process of finding cohesive groups of items in the data. K-means clustering is the most popular clustering algorithm. It is simple to implement and ...

How does the KNN algorithm work? What are the advantages and disadvantages of KNN?
The KNN algorithm is commonly used in many ML applications – right from supervised settings such as classification and regression, to just retrieving similar items in applications such as recommendation ...

Bias in Machine Learning: How to measure Fairness based on Confusion Matrix?
Machine Learning models often give us unexpected and biased outcomes if the underlying data is biased. Very often, our process of collecting data is incomplete or flawed, leading to data ...

What is the difference between word2vec and GloVe?
Both word2vec and GloVe enable us to represent a word in the form of a vector (often called an embedding). They are the two most popular algorithms for word embeddings that bring ...

What are the different ways of representing documents?
Bag of words: Commonly called BOW, this involves creating a vocabulary of words and representing the document as a count vector, with dimension equal to the vocabulary size – each dimension representing ...

How will you build an auto suggestion feature for a messaging app or Google search?
An auto suggestion feature involves recommending the next word in a sentence or a phrase. For this, we need to build a language model on a large enough corpus of “relevant” data. ...

What are some common tools available for NER (Named Entity Recognition)?
Notable NER platforms include:

GATE supports NER across many languages and domains out of the box, usable via a graphical interface and a Java API.
OpenNLP includes rule-based and statistical named-entity recognition.
SpaCy ...

How to build a Global Surrogate Model for Interpretable AI?

https://youtu.be/uOL-Zb9_DO4

This brief 5 minute video explains building Global Surrogate Models for interpretable AI.

What are the different independence assumptions in HMM & Naive Bayes?
Both the HMM and Naive Bayes make conditional independence assumptions.
The HMM can be expressed by the equation below:

Second equation implies a conditional ...

Do we need to learn Linear Algebra for Machine Learning?
A lot of things we do in the ML pipeline involve vectors and matrices.
Linear Algebra helps us understand how these vectors interact with each other, how to perform vector & ...

What is the difference between supervised and unsupervised learning?
In Supervised Learning the algorithm learns from labeled training data. In other words, each data point is tagged with the answer or the label the algorithm should come up with. ...

Target Encoding for Categorical Features
This video describes target encoding for categorical features, which is more efficient and more effective in several use cases than the popular one-hot encoding.

Recap: Categorical Features and One-hot encoding
Categorical features are ...

Explain Locality Sensitive Hashing for Nearest Neighbour Search
What is Locality Sensitive Hashing (LSH)?
Locality Sensitive Hashing is a technique for creating a hash, or putting items in buckets, such that

similar items are in the same bucket (same ...

You have come up with a Spam classifier. How do you measure accuracy?
Spam filtering is a classification problem. In a classification problem, the following are the common metrics used to measure efficacy:
True positives: Those data points where the outcome ...

Naive Bayes Classifier: Advantages and Disadvantages

https://youtu.be/YuNfG6dFuZo
How does the Naive Bayes Classifier Work? What are the advantages and Disadvantages of using the Naive Bayes classifier?

Recap: Naive Bayes Classifier

Naive Bayes Classifier is a popular model for classification ...

What is overfitting and underfitting? Give examples. How do you overcome them?
ANSWER here

What is the Page Rank Algorithm?
How do search engines find what you want?
When we search on the internet, we want to see the most relevant pages. The Page Rank algorithm is a tool to determine which ...

What is negative sampling when training the skip-gram model?
Recap: The Skip-Gram model is a popular algorithm to train word embeddings such as word2vec. It tries to represent each word in a large text as a lower dimensional vector in ...

What is the difference between deep learning and machine learning?
Deep learning is a subset of Machine Learning. Machine learning is the ability to build “models” that can learn automatically from data, without programming explicit rules. Machine Learning models typically ...

What are the different ways of preventing over-fitting in a deep neural network? Explain the intuition behind each
L2 norm regularization: Making the weights closer to zero prevents overfitting.
L1 norm regularization: Makes the weights closer to zero and also induces sparsity in the weights. Less common ...

What are evaluation metrics for a multi-class classification problem (like positive/negative/neutral sentiment analysis)?
For multiclass classification (MCC) problems, metrics can be derived from the confusion matrix. Let $tp_i, tn_i, fp_i, fn_i$ denote the true positives, true negatives, false positives and false negatives for class $i$, respectively.
MCC problems, usually macro and ...

You are given some documents and asked to find prevalent topics in the documents – how do you go about it?
This is typically called topic modeling. Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents. For instance, two statements ...

How do you measure quality of Machine translation?
BLEU (Bilingual evaluation understudy) score is the most common metric used during machine translation. Typically, it is used to measure a candidate translation against a set of reference translations available ...

What is stratified sampling and why is it important?
Stratified sampling is a sampling method where the population is divided into homogeneous subgroups called strata and the right number of instances are sampled from each stratum. For further explanation visit ...

Can you give an example of a classifier with high bias and high variance?
High bias means the data is being underfit. The decision boundary is usually not complex enough. High variance happens due to overfitting: the decision boundary is more complex than ...

How do you train an HMM model in practice?
The joint probability distribution for the HMM model is given by the following equation where  are the observed data points and  the corresponding latent states:
...

You are building a natural language search box for a website. How do you accommodate spelling errors?
If you have a dictionary of words, edit distance is the simplest way of incorporating this. However, sometimes corrections based on context make sense. For instance, suppose I type “bed ...

What are Isolation Forests? How to use them for Anomaly Detection?
All of us know random forests, one of the most popular ML models. They are a supervised learning algorithm, used in a wide variety of applications for classification and regression.
Can ...

How do you generate text using a Hidden Markov Model (HMM)?
The HMM is a latent variable model where the observed sequence of variables  are assumed to be generated from a set of temporally connected latent  variables .
The joint distribution ...

What are some knowledge graphs you know? What is different between these?
DBPedia: Entities and relationships are automatically extracted from Wikipedia.
WordNet: Lexical database of the English language. Groups English words as synsets and provides various relationships between words in a synset. ...

What is Simpson's Paradox?
Simpson's Paradox occurs when trends in aggregates are reversed when examining trends in subgroups. Data often has biases that might lead to unexpected trends, but digging deeper and ...

Evaluation Metrics for Recommendation Systems
This video explores how one can evaluate recommender systems.
Evaluating a recommender system involves (1) If the right results are being recommended (2) Whether more relevant results are being recommended at ...

What is Elastic Net Regularization for Regression?
Most of us know that ML models often tend to overfit to the training data for various reasons. This could be due to lack of enough training data or the ...

Bayesian Neural Networks
Bayesian Neural Networks enable capturing uncertainty in the parameters of a neural network.
This video contains:

A brief Recap of Feedforward Neural Networks
Motivation behind a Bayesian Neural Network
What is a Bayesian Neural ...

Bias in Machine Learning: Types of Data Biases
Bias in Machine Learning models could often lead to unexpected outcomes. In this brief video we will look at different ways we might end up building biased ML models, with ...

Why do you typically see overflow and underflow when implementing ML algorithms?
A common pre-processing step is to normalize/rescale inputs so that they are not too high or low.

However, even on normalized inputs, overflows and underflows can occur:

Underflow: Joint probability distribution often ...

Learning Feature Importance from Decision Trees and Random Forests
This video shows the process of feature selection with Decision Trees and Random Forests.

Why do we need Feature Selection?
Often we end up with large datasets with redundant features that need ...

Covariance and Correlation
Often in data science, we want to understand how one variable is related to another. These variables could be features for an ML model, or sometimes we might want to ...

Suppose you build word vectors (embeddings) with each word vector having dimensions as the vocabulary size (V) and feature values as pPMI between corresponding words: What are the problems with this approach and how can you resolve them?
Problems

As the vocabulary size (V) is large, these vectors will be large in size.
They will be sparse as a word may not have co-occurred with all possible words.

Resolution

Dimensionality Reduction using ...

Is the run-time of an ML algorithm important? How do I evaluate whether the run-time is OK?
Runtime considerations are often important for many applications. Typically you should look at training time and prediction time for an ML algorithm.
Some common questions to ask include:

Training: Do you want ...

If the average length of a sentence is 100 in all documents, should we build a 100-gram language model?
A 100-gram model will be more complex and will have a lot of parameters.
One way is to start with an n-gram model with different values of n from 2 to ...

What is AUC: Area Under the Curve?
What is AUC?
AUC is the area under the ROC curve. It is a popularly used classification metric.
Classifiers such as logistic regression and Naive Bayes predict class probabilities as the ...

What is One-Class SVM? How to use it for anomaly detection?

https://youtu.be/vmE9ScCb2KY
One-class SVM is a variation of the SVM that can be used in an unsupervised setting for anomaly detection.
Let’s say we are analyzing credit card transactions to identify fraud. We ...

Berkson’s Paradox
This video explains Berkson’s Paradox. Berkson’s Paradox typically arises from selection bias when we create our dataset, which could lead to unintended inferences from our data.
Summary of contents:

Berkson’s Paradox ...

Detecting and Removing Gender Bias in Word Embeddings
What are Word Embeddings?
Word embeddings are vector representations of words that can be used as input (features) to other downstream tasks and ML models. Here is an article that explains ...

How do you use Complement Naive Bayes for Imbalanced Datasets?

https://youtu.be/Rhs3RIECfe4

This brief video explains the Complement Naive Bayes classifier, a modification of the naive bayes classifier that works well for imbalanced datasets.

I have used a 4 layered fully connected network to learn a complex classifier boundary. I have used tanh activations throughout except the last layer where I used sigmoid activation for binary classification. I train for 10K iterations with 100K examples  (my data points are 3 dimensional and I initialized my weights to 0 to begin with). I see that my network is unable to fit the training data and is leading to a high training error. What is the first thing I try ?

(1) Increase the number of training iterations
(2) Make a more complex network – increase hidden layer size
(3) Initialize weights to a random small value instead of zeros
(4) Change tanh activations to relu

Ans: (3) ...

What is speaker segmentation in speech recognition? How do you use it?
Speaker diarization or speaker segmentation is the process of automatically assigning a speaker identity to each segment of the audio file. Segmenting by speaker is very useful in several applications ...

What are the commonly used activation functions? When are they used?
Ans. The commonly used activation functions are

Linear: g(x) = x. This is the simplest activation function. However it cannot model complex decision boundaries. A deep network with linear ...

With the maximum likelihood estimate are we guaranteed to find a global optimum?
Maximum likelihood estimation finds the parameter values that maximize the likelihood. If the likelihood is strictly concave (or the negative of the likelihood is strictly convex), we are guaranteed to find a ...

When are deep learning algorithms more appropriate compared to traditional machine learning algorithms?
Deep learning algorithms are capable of learning arbitrarily complex non-linear functions by using a deep enough and a wide enough network with the appropriate non-linear activation function.
Traditional ML algorithms ...

Global vs Local Interpretability

https://youtu.be/EHAwlKyFwOg
Global vs Local Interpretability: Interpretable AI

This video explains why we need interpretability for our AI and the two approaches typically used for interpretability, namely Global and Local Interpretability. While global ...
``````
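
For the spam filtering question in the list above (precision versus recall), the two metrics can be sketched in a few lines of Python. This is a minimal illustration using the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN). The function name `precision_recall`, the convention that label 1 means spam, and the toy labels are assumptions made for this example and do not come from the original answer.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the given positive class.

    Assumption for this sketch: labels are plain Python values and
    `positive` marks the spam class.
    """
    pairs = list(zip(y_true, y_pred))
    # True positive: spam that was labeled spam.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positive: not spam, but labeled spam.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negative: spam that was not labeled spam.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy data: 1 = spam, 0 = not spam.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

In a spam filter, a false positive sends a legitimate email to the spam folder, so a sketch like this is typically used to check that precision stays high while recall is tuned.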