A compilation of ML interview questions, with answers, that are commonly asked in machine learning interviews. We hope these questions will help you crack your data science interview …

- How do you deal with dataset imbalance in a problem like spam filtering ?
Class imbalance is a very common problem when applying ML algorithms. Spam filtering is one such application where class imbalance is apparent. There are many more non-spam emails in a ...
- How do you generate text using a Hidden Markov Model (HMM) ?
The HMM is a latent variable model where the observed sequence of variables is assumed to be generated from a set of temporally connected latent variables. The joint distribution ...
- What is the difference between deep learning and machine learning?
Deep learning is a subset of Machine Learning. Machine learning is the ability to build “models” that can learn automatically from data, without programming explicit rules. Machine Learning models typically ...
- If the average length of a sentence is 100 in all documents, should we build a 100-gram language model ?
A 100-gram model will be more complex and will have a lot of parameters. One way is to start with an n-gram model with different values of n from 2 to ...
- What is AUC : Area Under the Curve?
What is AUC ? AUC is the area under the ROC curve. It is a popularly used classification metric. Classifiers such as logistic regression and Naive Bayes predict class probabilities as the ...
- How can you increase the recall of a search query (on a search engine or e-commerce site) result without changing the algorithm ?
Since we are not allowed to change the algorithm, we can only play with modifying or augmenting the search query. (Note, we either change the algorithm/model or the data, here ...
- Naive Bayes Classifier : Advantages and Disadvantages
https://youtu.be/YuNfG6dFuZo How does the Naive Bayes Classifier work? What are the advantages and disadvantages of using the Naive Bayes classifier? Recap: Naive Bayes Classifier Naive Bayes Classifier is a popular model for classification ...
- With the maximum likelihood estimate are we guaranteed to find a global optimum ?
The maximum likelihood estimate is the value of the parameters that maximizes the likelihood. If the likelihood is strictly concave (or the negative of the likelihood is strictly convex), we are guaranteed to find a ...
- You are building a natural language search box for a website. How do you accommodate spelling errors?
If you have a dictionary of words, edit distance is the simplest way of incorporating this. However, sometimes corrections based on context make sense. For instance, suppose I type “bed ...
- Anomaly Detection Techniques
https://youtu.be/6q3Lqy56G_w Techniques for Anomaly Detection Anomaly detection is an important task with many applications – right from finding outliers in the data to avoid building bad models to applications such as fraud ...
- Successive Halving For Grid Search
https://youtu.be/DQ-T9aRBM_Q This brief video explains Grid Search with Successive Halving. Grid Search is often very slow and the primary bottleneck in many production pipelines. Successive Halving is a strategy to find ...
- Bias in Machine Learning : Types of Data Biases
Bias in Machine Learning models can often lead to unexpected outcomes. In this brief video we will look at different ways we might end up building biased ML models, with ...
- Bayesian Neural Networks
Bayesian Neural Networks enable capturing uncertainty in the parameters of a neural network. This video contains: A brief Recap of Feedforward Neural Networks; Motivation behind a Bayesian Neural Network; What is a Bayesian Neural ...
- Why are Micro Precision and Micro Recall the Same for Multiclass Models?
https://youtu.be/r2-682JBvIs Precision and recall are popular metrics for classification. For multiclass settings, we often compute the micro and macro precision and recall. However, the micro precision, micro recall and the overall ...
- What are the different ways of representing documents ?
Bag of words: Commonly called BOW, this involves creating a vocabulary of words and representing the document as a count vector whose dimension equals the vocabulary size – each dimension representing ...
- How is Working with Time Series Data different?
https://youtu.be/WuALMDv87y4 This video talks about what time series data is and how working with time series data is different from other forms of data.
- What is the Maximum Likelihood Estimate (MLE)?
Probabilistic Models help us capture the inherent uncertainty in real life situations. Examples of probabilistic models are Logistic Regression, Naive Bayes Classifier and so on. Typically we fit (find parameters) ...
- How to find the Optimal Number of Clusters in K-means? Elbow and Silhouette Methods
K-means Clustering Recap: Clustering is the process of finding cohesive groups of items in the data. K-means clustering is the most popular clustering algorithm. It is simple to implement and ...
- How do you use Complement Naive Bayes for Imbalanced Datasets?
https://youtu.be/Rhs3RIECfe4 This brief video explains the Complement Naive Bayes classifier, a modification of the Naive Bayes classifier that works well for imbalanced datasets.
- What is Simpson’s Paradox ?
Simpson’s Paradox occurs when trends in aggregates are reversed when examining trends in subgroups. Data often has biases that might lead to unexpected trends, but digging deeper and ...
- What is PMI ?
PMI (Pointwise Mutual Information) is a measure of correlation between two events x and y. As you can see from above ...
- Covariance and Correlation
Often in data science, we want to understand how one variable is related to another. These variables could be features for an ML model, or sometimes we might want to ...
- BLEU Score
https://youtu.be/UV2ymKoMcyw This brief video describes the BLEU score, a popular evaluation metric used for several tasks such as machine translation, text summarization and so on. What is BLEU Score? BLEU stands for ...
- What is the difference between supervised and unsupervised learning ?
In Supervised Learning the algorithm learns from labeled training data. In other words, each data point is tagged with the answer or the label the algorithm should come up with. ...
- Differencing Time Series Data to Remove Trend
https://youtu.be/IqM8szMyfeg This brief video explains the differencing operation on time series data. It talks about why differencing is required and how differencing actually removes non-stationarity such as trend when we work ...
- How does KNN algorithm work ? What are the advantages and disadvantages of KNN ?
The KNN algorithm is commonly used in many ML applications – right from supervised settings such as classification and regression, to just retrieving similar items in applications such as recommendation ...
- Can we use the AUC Metric for a SVM Classifier ?
This video explains computing the AUC metric for an SVM classifier, or other classifiers that give the absolute class values as outcomes. What is Area Under the Curve ? AUC is the ...
- Evaluation Metrics for Recommendation Systems
This video explores how one can evaluate recommender systems. Evaluating a recommender system involves checking (1) whether the right results are being recommended (2) whether more relevant results are being recommended at ...
- What is Bayesian Logistic Regression?
Bayesian Logistic Regression: In this video, we try to understand the motivation behind Bayesian Logistic Regression and how it can be implemented. Recap of Logistic Regression: Logistic Regression is one of the most ...
- How do you handle missing data in an ML algorithm ?
Missing data is caused either by issues in data collection or, sometimes, the data model could allow for missing data (for instance, the field ‘maximum credit limit on any ...
- What is the complexity of the Viterbi algorithm ?
The Viterbi algorithm is a dynamic programming approach to find the most probable sequence of hidden states given the observed data, as modeled by an HMM. Without dynamic programming, it becomes an ...
- BERT vs Word2Vec Embeddings
https://youtu.be/9eTeIO6nFTI This video talks about two of the popular word embeddings, BERT and Word2Vec, and explains the differences and when it makes sense to use each.
- What is Rejection Sampling?
https://youtu.be/yQBS0HCWN_8 This short video explains why we need sampling, particularly rejection sampling. It also explains how rejection sampling works and some places where it is used.
- Local Outlier Factor for Anomaly Detection
https://youtu.be/Xl7XVPyvO5U Anomaly detection is an important application used across various verticals like healthcare, finance, manufacturing and so on. Local Outlier Factor is a popular density-based technique for anomaly detection that ...
- Suppose you build word vectors (embeddings) with each word vector having dimensions as the vocabulary size (V) and feature values as pPMI between corresponding words: What are the problems with this approach and how can you resolve them ?
Problems: As the vocabulary size (V) is large, these vectors will be large in size. They will be sparse as a word may not have co-occurred with all possible words. Resolution: Dimensionality Reduction using ...
- What would you care more about – precision or recall for a spam filtering problem?
A false positive means the email was not spam and we called it spam; a false negative means it was spam and we didn’t label it spam. Precision = (TP / TP ...
- Dartboard Paradox: Probability Density Function vs Probability
What is the Dartboard Paradox ? Assume you are throwing a dart at a dartboard such that it hits somewhere on the dartboard. The dartboard paradox: The probability of hitting any specific ...
- Gaussian Processes for Bayesian Hyperparameter Tuning
https://youtu.be/kAQMOujS5YY Grid search is often very expensive for hyper-parameter tuning and a bottleneck in ML pipelines. This brief video explains the Bayesian hyper-parameter optimization technique with Gaussian processes in a simple ...
- Target Encoding for Categorical Features
This video describes target encoding for categorical features, which is more efficient and more effective in several use cases than the popular one-hot encoding. Recap: Categorical Features and One-hot encoding Categorical features are ...
- What are the commonly used activation functions ? When are they used?
Ans. The commonly used activation functions are: Linear : g(x) = x. This is the simplest activation function. However it cannot model complex decision boundaries. A deep network with linear ...
- Stratified Sampling for Imbalanced Datasets
https://youtu.be/hAr985UmQ0c This 5-minute video describes the need for stratified sampling and how to incorporate it in our ML pipelines with scikit-learn.
- Bias in Machine Learning : How to measure Fairness based on Confusion Matrix ?
Machine Learning models often give us unexpected and biased outcomes if the underlying data is biased. Very often, our process of collecting data is incomplete or flawed leading to data ...
- Popular Distance Metrics in ML
https://youtu.be/XlMo0vuhq6w This video talks about popular distance metrics used in Machine Learning algorithms – Euclidean distance, Manhattan distance, Minkowski distance, Hamming distance and Cosine distance.
- Machine Translation
https://youtu.be/T0EKtufdIw0 Here is a high level overview of Machine Translation. This short video covers a brief history of machine translation followed by a quick explanation of SMT (statistical machine translation) and ...
- What is Elastic Net Regularization for Regression?
Most of us know that ML models often tend to overfit to the training data for various reasons. This could be due to lack of enough training data or the ...
- How to tune hyperparameters with Randomized Grid Search?
https://youtu.be/J_tuSp5PzXc Randomized Grid Search is a variation of Grid Search that samples each parameter from a distribution. Conventional grid search evaluates the model at fixed combinations of parameter values and could ...
- You want to find food-related topics in Twitter – how do you go about it ?
One can use any of the topic models above to get topics. However, to direct the topics to contain food-related information, specialized topic modeling algorithms are available. However, one ...
- Avoiding Feedback Loops in Recommender Systems
https://youtu.be/j0lzd-82ENA This video talks about avoiding feedback loops in recommender systems. Recommender systems often suffer from exposure bias, where we have customer feedback from only those items we actually recommend to the ...
- How to measure the performance of the language model ?
While building a language model, we try to estimate the probability of a sentence or a document. Given sequences (sentences or documents) like Language model (bigram language model) will be ...
- I have designed a 2-layered deep neural network for a classifier with 2 units in the hidden layer. I use linear activation functions with a sigmoid at the final layer. I use a data visualization tool and see that the decision boundary is in the shape of a sine curve. I have tried to train with 200 data points with known class labels and see that the training error is too high. What do I do ?
(a) Increase number of units in the hidden layer (b) Increase number of hidden layers (c) Increase data set size (d) Change activation function to tanh (e) Try all of the above. The answer is d. When I use a ...
- Global vs Local Interpretability
https://youtu.be/EHAwlKyFwOg Global vs Local Interpretability: Interpretable AI This video explains why we need interpretability for our AI and the two approaches typically used for interpretability namely Global and Local Interpretability. While global ...
- How will you build an auto suggestion feature for a messaging app or Google search?
An auto suggestion feature involves recommending the next word in a sentence or a phrase. For this, we need to build a language model on a large enough corpus of “relevant” data. ...
- What is One-Class SVM ? How to use it for anomaly detection?
https://youtu.be/vmE9ScCb2KY One-class SVM is a variation of the SVM that can be used in an unsupervised setting for anomaly detection. Let’s say we are analyzing credit card transactions to identify fraud. We ...
- MAP at K : An evaluation metric for Ranking
https://youtu.be/QSaK4l9C66c This video talks about the Mean Average Precision at K (popularly called the MAP@K) metric that is commonly used for evaluating recommender systems and other ranking related problems. Why do ...
- I have used a 4-layered fully connected network to learn a complex classifier boundary. I have used tanh activations throughout except the last layer where I used sigmoid activation for binary classification. I train for 10K iterations with 100K examples (my data points are 3 dimensional and I initialized my weights to 0 to begin with). I see that my network is unable to fit the training data and is leading to a high training error. What is the first thing I try ?
(1) Increase the number of training iterations (2) Make a more complex network – increase hidden layer size (3) Initialize weights to a random small value instead of zeros (4) Change tanh activations to relu. Ans : (3) ...
- What are some knowledge graphs you know? What is different between these ?
DBpedia : Entities and relationships are automatically extracted from Wikipedia. WordNet: Lexical database of the English language. Groups English words as synsets and provides various relationships between words in a synset. ...
- How to do K-fold Cross-validation for Temporal Data?
https://youtu.be/bTSraiFsCi8 Temporal Leakage is a common problem with temporal data. This short video discusses ways to do cross-validation with temporal data without temporal leakage.
- Inverse Propensity Weighting (IPW)
https://youtu.be/1okhwPz7VLM This video explains the technique of Inverse Propensity Weighting (IPW) that is commonly used to address sampling bias in datasets by giving more weight to underrepresented groups.
- What are some common tools available for NER (Named Entity Recognition) ?
Notable NER platforms include: GATE supports NER across many languages and domains out of the box, usable via a graphical interface and a Java API. OpenNLP includes rule-based and statistical named-entity recognition. spaCy ...
- Why do you typically see overflow and underflow when implementing ML algorithms ?
A common pre-processing step is to normalize/rescale inputs so that they are not too high or low. However, even on normalized inputs, overflows and underflows can occur: Underflow: Joint probability distribution often ...
- Where do we use Divide and Conquer in Machine Learning?
https://youtu.be/Scn6y9xLr0I This video talks about examples where we use the Divide and Conquer technique in Machine Learning. Algorithms and data structures are used in various instances while building efficient Machine Learning ...
- How do you measure the quality of machine translation ?
BLEU (Bilingual Evaluation Understudy) score is the most common metric used in machine translation. Typically, it is used to measure a candidate translation against a set of reference translations available ...
- Macro, Micro and Weighted F1 Score
https://youtu.be/N1_3KrC337s This video explains the need for Macro, Micro and Weighted F1 score metrics for multiclass classification problems.
- Feedback Loops: What causes Bias Amplification in Recommender Systems?
https://youtu.be/CnDUINYBeXk This short video talks about feedback loops that often show up in recommender systems – causing frequently recommended items to pop up even more frequently. It is important to be ...
- When are deep learning algorithms more appropriate compared to traditional machine learning algorithms?
Deep learning algorithms are capable of learning arbitrarily complex non-linear functions by using a deep enough and a wide enough network with the appropriate non-linear activation function. Traditional ML algorithms ...
- How do you design a system that reads a natural language question and retrieves the closest FAQ answer?
There are multiple approaches for FAQ-based question answering. Keyword-based search (Information retrieval approach): Tag each question with keywords. Extract keywords from the query and retrieve all relevant questions and answers. Easy ...
- Why Learn Data Structures to be a Data Scientist?
https://youtu.be/c4R_o3BzQPk Why learn data structures to be a data scientist? The video covers examples of where various data structures are used in an ML context to highlight the importance of understanding ...
- What are Isolation Forests? How to use them for Anomaly Detection?
All of us know random forests, one of the most popular ML models. They are a supervised learning algorithm, used in a wide variety of applications for classification and regression. Can ...
- Explain Locality Sensitive Hashing for Nearest Neighbour Search ?
What is Locality Sensitive Hashing (LSH) ? Locality Sensitive Hashing is a technique to enable creating a hash or putting items in buckets such that similar items are in the same bucket (same ...
- What is Autocorrelation?
https://youtu.be/xwabmu5LnU0 Autocorrelation is a useful concept to analyze time-series data. This video explains what autocorrelation is and why we care about it, followed by how we can write simple Python code ...
- What is Stacking ? Ensembling Multiple Dissimilar Models
Many of us have heard of bagging and boosting, commonly used ensemble learning techniques. This video describes ways to combine multiple dissimilar ML models through voting, averaging and stacking to ...
- What is the difference between word2vec and GloVe ?
Both word2vec and GloVe enable us to represent a word in the form of a vector (often called embedding). They are the two most popular algorithms for word embeddings that bring ...
- What are Ball Trees?
https://youtu.be/ZOZqJqGgP1M This video describes what Ball Trees are and how they work, and provides Python snippets to get them working in your code.
- How to build a Global Surrogate Model for Interpretable AI?
https://youtu.be/uOL-Zb9_DO4 This brief 5-minute video explains building Global Surrogate Models for interpretable AI.
- What is stratified sampling and why is it important ?
Stratified sampling is a sampling method where the population is divided into homogeneous subgroups called strata and the right number of instances are sampled from each stratum. For further explanation visit ...
- Suppose you are modeling text with an HMM. What is the complexity of finding the most probable sequence of tags or states from a sequence of text using a brute force algorithm?
Assume there are total states and let be the length of the longest sequence. Think about how we generate text using an HMM. We first have a state sequence and ...
- Moving Average Method for Time Series Modeling
https://youtu.be/ab8vAgd5_ak This short video talks about the moving average technique for time series data and how it can be used for smoothing, forecasting and feature construction with simple Python code.
- What is Bayesian Modeling?
This video explains Bayesian Modeling : Why do we need Bayesian Modeling? What is Bayesian Modeling? What are some examples where we can practically use Bayesian Modeling ? Check out https://www.tensorflow.org/probability for code ...
- What is Temporal Leakage in ML Pipelines?
https://youtu.be/WqjWwRxTjf0 Data leakage is a common problem in ML pipelines due to which we end up with models that do not generalize the way we expect them to. This short video explains ...
- Building ML Models for Mixed Data
https://youtu.be/ZEdahv3Q7Gw Mixed data refers to datasets where different columns have different data types. While mixed data is very common in the real world, many of the commonly used ML models cannot ...
- LIME for local explanations
https://youtu.be/s3J5x5zIU0Y This video talks about LIME for local interpretability. It explains the motivation for local explanations followed by how LIME works to provide local explanations along with information on where ...
- You are given some documents and asked to find prevalent topics in the documents – how do you go about it ?
This is typically called topic modeling. Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents. For instance, two statements ...
- What is speaker segmentation in speech recognition ? How do you use it ?
Speaker diarization or speaker segmentation is the process of automatically assigning a speaker identity to each segment of the audio file. Segmenting by speaker is very useful in several applications ...
- METEOR metric for machine translation
https://youtu.be/FqQbrlEh_b0 This short video describes METEOR, a metric for evaluating machine-generated text. It is used to evaluate whether the candidate text generated by an ML model matches the reference text ...
- Fairness in ML: How to deal with bias in ML pipelines?
https://youtu.be/bxAQcOzHj7k In this 30-minute video, we talk about Bias and Fairness in ML workflows: Why we need to handle fairness in ML models; How biases typically creep into the ML pipeline; How ...
- Recursive Feature Elimination for Feature Selection
This video explains the technique of Recursive Feature Elimination for feature selection when we have data with lots of features. Why do we need Feature Elimination? Often we end up with large ...
- Z-Score for Outlier Detection
https://youtu.be/T0IJT6dDt3c This video explains Z-Score for Anomaly detection with examples and Python code. Datasets often contain anomalies or outliers whose properties are different from those of the regular data points. Such outliers ...
- Missing Value Imputation with Mean, Median and Mode
https://youtu.be/vxNFY6Z6Kv0 This video explains feature imputation for missing values in a dataset, based on other values in the same column. Popular techniques of univariate imputation based on mean, median and mode ...
- What are overfitting and underfitting ? Give examples. How do you overcome them?
- Berkson’s Paradox
This video explains Berkson’s Paradox. Berkson’s Paradox typically arises from selection bias when we create our dataset, which could lead to unintended inferences from our data. Summary of contents: Berkson’s Paradox ...
- Detecting and Removing Gender Bias in Word Embeddings
What are Word Embeddings? Word embeddings are vector representations of words that can be used as input (features) to other downstream tasks and ML models. Here is an article that explains ...
- KDTrees for Nearest Neighbour Search: Advantages and Disadvantages
https://youtu.be/L8jKECGYIpQ This video explains where Kd-trees are used and how they work. It talks about where one can use Kd-trees and where they fail. Also provided are quick Python snippets that help ...
- What are knowledge graphs? When would you need a knowledge graph over say a database to store information?
A knowledge graph organizes real world knowledge as entities and relationships between entities. Creating a knowledge graph often involves scraping / ingesting unstructured data and creating structure out of it ...
- Is the run-time of an ML algorithm important? How do I evaluate whether the run-time is OK?
Runtime considerations are often important for many applications. Typically you should look at training time and prediction time for an ML algorithm. Some common questions to ask include: Training: Do you want ...
- NDCG Evaluation Metric for Recommender Systems
https://youtu.be/J-7HbXW9JpM This video talks about the NDCG metric for recommender systems that takes into account both the degree of relevance and the ranking of items to evaluate recommender systems. What is the ...
- How do you train an HMM model in practice ?
The joint probability distribution for the HMM model is given by the following equation where are the observed data points and the corresponding latent states: ...
- Multivariate Imputation of Missing Values
https://youtu.be/mlk-MheAipE This video describes the process of imputing missing values using multivariate imputation techniques that use the other columns as features to predict the missing values in a particular column. In ...
- What are popular ways of dimensionality reduction in NLP tasks ? Do you think this is even important ?
A common representation is bag of words, which is very high dimensional given a large vocabulary size. Commonly used ways for dimensionality reduction in NLP : TF-IDF : Term frequency, inverse document ...
- How to find the Optimal Threshold from ROC curve?
https://youtu.be/-_9blLWSlBM This brief video talks about how the ROC curve is constructed and how one can find the optimal threshold for a classifier such as logistic regression, from the ROC curve. ...
- You have come up with a Spam classifier. How do you measure accuracy ?
Spam filtering is a classification problem. In a classification problem, the following are the common metrics used to measure efficacy : True positives : Those data points where the outcome ...
- What are the different ways of preventing over-fitting in a deep neural network ? Explain the intuition behind each
L2 norm regularization : Making the weights closer to zero helps prevent overfitting. L1 norm regularization : Makes the weights closer to zero and also induces sparsity in weights. Less common ...
- What is negative sampling when training the skip-gram model ?
Recap: Skip-Gram model is a popular algorithm to train word embeddings such as word2vec. It tries to represent each word in a large text as a lower dimensional vector in ...
- Correlation vs Causation
https://youtu.be/yNYHr8o9IRk This video explains the difference between correlation and causation. How to measure correlation and how to infer if there is a causal relationship between two variables.
- When to Not Remove Stopwords?
https://youtu.be/UySNvVK7B4k What are stopwords? Why do we typically remove them? When does it make sense to remove stop words? When does it not make sense to remove stopwords?
- Stationarity in Time Series Data
https://youtu.be/smetkq85tO4 This brief video explains what stationarity is when we model time series data and why we care about it along with an overview of different kinds of stationarity with examples.
- What is an autoencoder? What are applications of autoencoders?
https://youtu.be/I9TvEa0TV-A This crisp three-minute video explains autoencoders and where they are used.
- Can you give an example of a classifier with high bias and high variance?
High bias means the data is being underfit. The decision boundary is not usually complex enough. High variance happens due to overfitting: the decision boundary is more complex than ...
- Do we need to learn Linear Algebra for Machine Learning ?
A lot of things we do in the ML pipeline involve vectors and matrices. Linear Algebra helps us understand how these vectors interact with each other, how to perform vector & ...
- Gower Distance for Mixed Data
https://youtu.be/PHu8VoPv-o4 Mixed data is data that contains a combination of various types such as integer, categorical, ordinal, nominal and so on. This video talks about the Gower Distance for Mixed ...
- What are the different independence assumptions in HMM & Naive Bayes ?
Both the HMM and Naive Bayes make conditional independence assumptions. The HMM can be expressed by the equation below : Second equation implies a conditional ...
- What are the advantages and disadvantages of using Naive Bayes for spam detection?
Disadvantages: Naive Bayes is based on the conditional independence of features assumption – an assumption that is not valid in many real world scenarios. Hence it sometimes oversimplifies the problem ...
- What is the PageRank Algorithm ?
How do search engines find what you want? When we search on the internet, we want to see the most relevant pages. The PageRank algorithm is a tool to determine which ...
- How many parameters are there for an HMM model?
Let us calculate the number of parameters for a bi-gram HMM given as Let be the total number of states and be the vocabulary size ...
- What are evaluation metrics for a multi-class classification problem (like positive/negative/neutral sentiment analysis) ?
For multiclass classification (MCC) problems, metrics can be derived from the confusion matrix. Let $tp_i,tn_i,fp_i,fn_i$ denote the true positives, true negatives, false positives, false negatives respectively. In MCC problems, usually macro and ...
- Learning Feature Importance from Decision Trees and Random Forests
This video shows the process of feature selection with Decision Trees and Random Forests. Why do we need Feature Selection? Often we end up with large datasets with redundant features that need ...
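As a rough illustration of the idea behind tree-based feature importance, here is a minimal pure-Python sketch. It is a simplified stand-in, not the method from the video: each feature is scored by the largest Gini impurity decrease that a single threshold split (a decision stump) can achieve, whereas a full decision tree or random forest accumulates such decreases over many splits and trees. The function names (`gini`, `best_split_gain`, `stump_importances`) and the toy data are hypothetical and exist only for this example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split_gain(values, labels):
    """Largest Gini impurity decrease from one threshold split on a feature."""
    n = len(labels)
    parent = gini(labels)
    best = 0.0
    # Try every distinct value except the largest as a "<= threshold" split.
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - child)
    return best


def stump_importances(rows, labels):
    """Score each feature by its best single-split gain, normalized to sum to 1."""
    n_features = len(rows[0])
    gains = [best_split_gain([r[j] for r in rows], labels) for j in range(n_features)]
    total = sum(gains)
    return [g / total if total > 0 else 0.0 for g in gains]


# Toy data (made up): feature 0 separates the classes, feature 1 is mostly noise.
rows = [[1.0, 5.0], [2.0, 3.0], [3.0, 6.0], [8.0, 4.0], [9.0, 5.5], [10.0, 3.5]]
labels = [0, 0, 0, 1, 1, 1]
importances = stump_importances(rows, labels)
print(importances)
```

On this toy data the first feature receives the larger score, since one split on it separates the two classes cleanly. Features with low scores are candidates for removal during feature selection, though in practice one would rely on the importances reported by a fitted tree ensemble rather than on single stumps.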