A compilation of ML Interview questions with answers that are popularly asked in Machine Learning Interviews. We hope our questions will help you crack your data science interview …
- What are the different ways of preventing over-fitting in a deep neural network ? Explain the intuition behind each
L2 norm regularization : Makes the weights closer to zero to prevent overfitting. L1 norm regularization : Makes the weights closer to zero and also induces sparsity in the weights. Less common ...
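As a rough illustration of how the two penalties behave, here is a minimal sketch using a scikit-learn logistic regression on synthetic data (a stand-in for a deep network; the same penalties are applied to neural network weights, e.g. as weight decay):

```python
# Hypothetical illustration: compare L1 vs L2 regularization on a toy dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# L2 (ridge-style) penalty: shrinks all weights towards zero.
l2_model = LogisticRegression(penalty="l2", C=0.1).fit(X, y)

# L1 (lasso-style) penalty: shrinks weights and drives many exactly to zero (sparsity).
l1_model = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, y)

print("L2 zero weights:", np.sum(l2_model.coef_ == 0))
print("L1 zero weights:", np.sum(l1_model.coef_ == 0))
```

With the L1 penalty many coefficients typically come out exactly zero, which is the sparsity mentioned above.
- Why are Micro Precision and Micro Recall Same for Multiclass Models?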
https://youtu.be/r2-682JBvIs Precision and recall are popular metrics for classification. For multiclass settings, we often compute the micro and macro precision and recall. However, the micro precision, micro recall and the overall ...- Stratified Sampling for Imbalanced Datasets
https://youtu.be/hAr985UmQ0c This 5 minute video describes the need for stratified sampling and how to incorporate it in our ML pipelines with scikit-learn.
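A minimal sketch of a stratified split with scikit-learn, on made-up imbalanced labels:

```python
# Hypothetical example: stratified train/test split so class proportions are preserved.
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 90% class 0, 10% class 1 (made-up data).
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
print("train class ratio:", y_tr.mean(), "test class ratio:", y_te.mean())
```
- What is overfitting and underfitting ? Give examples. How do you overcome them?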
ANSWER here- Why Learn Data Structures to be a Data Scientist?
https://youtu.be/c4R_o3BzQPk Why learn data structures to be a data scientist? The video covers examples of where various data structures are used in an ML context to highlight the importance of understanding ...- You have come up with a Spam classifier. How do you measure accuracy ?
Spam filtering is a classification problem. In a classification problem, the following are the common metrics used to measure efficacy : True positives : Those data points where the outcome ...- What is the difference between word2Vec and Glove ?
Both word2vec and glove enable us to represent a word in the form of a vector (often called embedding). They are the two most popular algorithms for word embeddings that bring ...- Detecting and Removing Gender Bias in Word Embeddings
What are Word Embeddings? Word embeddings are vector representations of words that can be used as input (features) to other downstream tasks and ML models. Here is an article that explains ...- What is the Page Rank Algorithm ?
How do search engines find what you want? When we search on the internet, we want to see the most relevant pages. Page rank algorithm is a tool to determine which ...- Fairness in ML: How to deal with bias in ML pipelines?
https://youtu.be/bxAQcOzHj7k In this 30 minute video, we talk about Bias and Fairness in ML workflows: Why we need to handle fairness in ML models How biases typically creep into the ML pipeline How ...- What is Batch Normalization
https://youtu.be/j5GUZWgRXBs This video talks about batch normalization in Deep Neural Networks, why it is required, how the batch norm is computed and a small code example.- Explain Locality Sensitive Hashing for Nearest Neighbour Search ?
What is Locality Sensitive Hashing (LSH) ? Locality Sensitive Hashing is a technique for creating a hash or putting items in buckets such that similar items are in the same bucket (same ...- Avoiding Feedback Loops in Recommender Systems
https://youtu.be/j0lzd-82ENA This video talks about avoiding feedback loops in recommender systems. Recommender systems often suffer from exposure bias, where we have customer feedback from only those items we actually recommend to the ...- How do you measure quality of Machine translation ?
The BLEU (Bilingual Evaluation Understudy) score is the most common metric used to evaluate machine translation. Typically, it measures a candidate translation against a set of reference translations available ...
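A small illustrative example of computing a sentence-level BLEU score with NLTK (the tokenized sentences here are made up):

```python
# Hypothetical example: scoring one candidate translation against references with NLTK.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["the", "cat", "is", "on", "the", "mat"]]   # one or more reference translations
candidate = ["the", "cat", "sat", "on", "the", "mat"]      # model output

score = sentence_bleu(references, candidate, smoothing_function=SmoothingFunction().method1)
print("BLEU:", round(score, 3))
```
- What is Bayesian Logistic Regression?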
Bayesian Logistic Regression In this video, we try to understand the motivation behind Bayesian Logistic regression and how it can be implemented. Recap of Logistic Regression Logistic Regression is one of the most ...- What is Autocorrelation?
https://youtu.be/xwabmu5LnU0 Autocorrelation is a useful concept to analyze time-series data. This video explains what autocorrelation is and why we care about it, followed by how we can write simple python code ...- What are some knowledge graphs you know. What is different between these ?
DBPedia : Entities and relationships are automatically extracted from Wikipedia. WordNet : Lexical database of the English language. Groups English words into synsets and provides various relationships between words in a synset. ...- Can you give an example of a classifier with high bias and high variance?
High bias means the data is being underfit; the decision boundary is usually not complex enough. High variance happens due to overfitting; the decision boundary is more complex than ...- How can you increase the recall of a search query (on search engine or e-commerce site) result without changing the algorithm ?
Since we are not allowed to change the algorithm, we can only play with modifying or augmenting the search query. (Note, we either change the algorithm/model or the data, here ...- Inverse Propensity Weighting (IPW)
https://youtu.be/1okhwPz7VLM This video explains the technique of Inverse Propensity Weighting (IPW), which is commonly used to address sampling bias in datasets by giving more weight to underrepresented groups.- What is Median Absolute Deviation
https://youtu.be/8GsnT2X4u_U This short video talks about what the median absolute deviation is. We commonly use the mean absolute deviation and the standard deviation. This video talks about some of the shortcomings of using ...- How do you design a system that reads a natural language question and retrieves the closest FAQ answer?
There are multiple approaches for FAQ based question answering. Keyword based search (information retrieval approach): Tag each question with keywords. Extract keywords from the query and retrieve all relevant question-answer pairs. Easy ...- What are popular ways of dimensionality reduction in NLP tasks ? Do you think this is even important ?
A common representation is bag of words, which is very high dimensional given a large vocabulary size. Commonly used ways for dimensionality reduction in NLP : TF-IDF : Term frequency, inverse document ...
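For instance, one common pipeline (sketched here on a toy corpus) is TF-IDF features followed by truncated SVD, i.e. latent semantic analysis:

```python
# Hypothetical sketch: TF-IDF features followed by TruncatedSVD (LSA) to reduce dimensionality.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat on the mat", "dogs and cats are pets", "stock markets fell today"]

tfidf = TfidfVectorizer().fit_transform(docs)       # sparse, vocabulary-sized vectors
lsa = TruncatedSVD(n_components=2, random_state=0)  # project to a small dense space
reduced = lsa.fit_transform(tfidf)
print(tfidf.shape, "->", reduced.shape)
```
- Correlation vs Causation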
https://youtu.be/yNYHr8o9IRk This video explains the difference between correlation and causation. How to measure correlation and how to infer if there is a causal relationship between two variables.- Naive Bayes Classifier : Advantages and Disadvantages
https://youtu.be/YuNfG6dFuZo How does the Naive Bayes Classifier Work? What are the advantages and Disadvantages of using the Naive Bayes classifier? Recap: Naive Bayes Classifier Naive Bayes Classifier is a popular model for classification ...- Differencing Time Series Data to Remove Trend
https://youtu.be/IqM8szMyfeg This brief video explains the differencing operation on time series data. It talks about why differencing is required and how differencing actually removes non-stationarity such as trend when we work ...- Target Encoding for Categorical Features
This video describes target encoding for categorical features, which is more efficient and more effective in several use cases than the popular one-hot encoding. Recap: Categorical Features and One-hot encoding Categorical features are ...- You are building a natural language search box for a website. How do you accommodate spelling errors?
If you have a dictionary of words, edit distance is the simplest way of incorporating this. However, sometimes corrections based on context make sense. For instance, suppose I type “bed ...
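A plain dynamic-programming edit (Levenshtein) distance, as one simple way to rank dictionary words against a possibly misspelled query term (illustrative only):

```python
# Levenshtein distance between two strings using a single-row dynamic programming table.
def edit_distance(a: str, b: str) -> int:
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, start=1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + (ca != cb))  # substitution
            prev = cur
    return dp[-1]

print(edit_distance("beleive", "believe"))  # 2
```
- Bayesian Neural Networks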
Bayesian Neural Networks enable capturing uncertainty in the parameters of a neural network. This video contains: A brief recap of Feedforward Neural Networks, the motivation behind a Bayesian Neural Network, what is a Bayesian Neural ...- What is Stacking ? Ensembling Multiple Dissimilar Models
Many of us have heard of bagging and boosting, commonly used ensemble learning techniques. This video describes ways to combine multiple dissimilar ML models through voting, averaging and stacking to ...- How to tune hyperparameters with Randomized Grid Search?
https://youtu.be/J_tuSp5PzXc Randomized Grid Search is a variation of Grid Search that samples each parameter from a distribution. Conventional grid search evaluates the model at fixed combinations of parameter values and could ...
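A minimal sketch with scikit-learn's RandomizedSearchCV, sampling random-forest hyperparameters from distributions instead of a fixed grid (the parameter ranges here are arbitrary):

```python
# Hypothetical sketch: sampling hyperparameters from distributions with RandomizedSearchCV.
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, random_state=0)

param_distributions = {
    "n_estimators": randint(50, 300),     # sampled, not enumerated on a fixed grid
    "max_depth": randint(2, 10),
    "min_samples_leaf": randint(1, 10),
}
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_distributions, n_iter=20, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_)
```
- What is the Maximum Likelihood Estimate (MLE)?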
Probabilistic models help us capture the inherent uncertainty in real life situations. Examples of probabilistic models are Logistic Regression, the Naive Bayes Classifier and so on. Typically we fit (find parameters) ...- How do you deal with dataset imbalance in a problem like spam filtering ?
Class imbalance is a very common problem when applying ML algorithms. Spam filtering is one such application where class imbalance is apparent. There are many more non-spam emails in a ...
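Two common, simple knobs for imbalance, sketched on synthetic data (class weights and random oversampling; other options such as undersampling or SMOTE also exist):

```python
# Two common knobs for imbalance (illustrative): class weights and random oversampling.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

# Option 1: ask the model to weight the rare (spam) class more heavily.
clf = LogisticRegression(class_weight="balanced").fit(X, y)

# Option 2: oversample the minority class before training.
X_min, y_min = X[y == 1], y[y == 1]
X_over, y_over = resample(X_min, y_min, n_samples=(y == 0).sum(), random_state=0)
X_bal = np.vstack([X[y == 0], X_over])
y_bal = np.concatenate([y[y == 0], y_over])
print("balanced class counts:", np.bincount(y_bal))
```
- When to Not Remove Stopwords?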
https://youtu.be/UySNvVK7B4k What are stopwords? Why do we typically remove them? When does it make sense to remove stop words? When does it not make sense to remove stopwords?- What is speaker segmentation in speech recognition ? How do you use it ?
Speaker diarization or speaker segmentation is the process of automatically assigning a speaker identity to each segment of the audio file. Segmenting by speaker is very useful in several applications ...- Multi Head Attention : Transformer Architecture
https://www.youtube.com/watch?v=m0MfDpi61wc The Transformer Series: This video explains the multi-head attention of the transformer architecture. In the last few posts we looked at what Skip or Residual connections are and the scaled dot ...- Gower Distance for Mixed Data
https://youtu.be/PHu8VoPv-o4 Mixed data is data that contains a combination of various types such as integer categorical ordinal nominal and so on. This video talks about the Gower Distance for Mixed ...- I have used a 4 layered fully connected network to learn a complex classifier boundary. I have used tanh activations throughout except the last layer where I used sigmoid activation for binary classification. I train for 10K iterations with 100K examples (my data points are 3 dimensional and I initialized my weights to 0 to begin with). I see that my network is unable to fit the training data and is leading to a high training error. What is the first thing I try ?
(1) Increase the number of training iterations
(2) Make a more complex network – increase hidden layer size
(3) Initialize weights to a random small value instead of zeros
(4) Change tanh activations to relu
Ans : (3) ...- Covariance and Correlation
Often in data science, we want to understand how one variable is related to another. These variables could be features for an ML model, or sometimes we might want to ...- How do you handle missing data in an ML algorithm ?
Missing data is caused either by issues in data collection or, sometimes, because the data model allows for missing data (for instance, the field ‘maximum credit limit on any ...- Normalization in Deep Neural Networks
https://youtu.be/lLCSNRzx4F8 Batch norm and Layer norm are common normalization techniques. This brief video talks about the need for normalization and the types of norms in deep neural networks.- What are the different ways of representing documents ?
Bag of words: Commonly called BOW, this involves creating a vocabulary of words and representing the document as a count vector with dimension equal to the vocabulary size – each dimension representing ...
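A quick bag-of-words sketch with scikit-learn's CountVectorizer on two toy documents:

```python
# Hypothetical sketch: a bag-of-words count matrix with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog ate my homework"]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)          # one row per document, one column per vocabulary word

print(vectorizer.get_feature_names_out())     # the learned vocabulary
print(bow.toarray())                          # count of each word in each document
```
- What is Rejection Sampling?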
https://youtu.be/yQBS0HCWN_8 This short video explains why we need sampling, particularly rejection sampling. It also explains how rejection sampling works and some places where it is used.
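A minimal rejection sampler, assuming a Beta(2, 5) target and a Uniform(0, 1) proposal (the envelope constant M is chosen by hand for this example):

```python
# A minimal rejection sampler: draw from a Beta(2, 5) target using a Uniform(0, 1) proposal.
import numpy as np
from scipy.stats import beta

target = beta(2, 5)
M = 2.5  # chosen so that M * proposal_pdf(x) >= target_pdf(x); Beta(2, 5) peaks near 2.46

rng = np.random.default_rng(0)
samples = []
while len(samples) < 1000:
    x = rng.uniform(0, 1)                    # propose from Uniform(0, 1), whose pdf is 1
    if rng.uniform(0, M) <= target.pdf(x):   # accept with probability target(x) / (M * 1)
        samples.append(x)

print("accepted mean:", np.mean(samples), "true mean:", target.mean())
```
- Suppose you build word vectors (embeddings) with each word vector having dimensions as the vocabulary size(V) and feature values as pPMI between corresponding words: What are the problems with this approach and how can you resolve them ?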
Problems: As the vocabulary size (V) is large, these vectors will be large in size. They will be sparse as a word may not have co-occurred with all possible words. Resolution: Dimensionality reduction using ...- Recursive Feature Elimination for Feature Selection
This video explains the technique of Recursive Feature Elimination for feature selection when we have data with lots of features. Why do we need Feature Elimination? Often we end up with large ...- Successive Halving For Grid Search
https://youtu.be/DQ-T9aRBM_Q This brief video explains Grid Search with successive Halving. Grid Search is often very slow and the primary bottleneck in many production pipelines. Successive Halving is a strategy to find ...- What is the difference between deep learning and machine learning?
Deep learning is a subset of Machine Learning. Machine learning is the ability to build “models” that can learn automatically from data, without programming explicit rules. Machine Learning models typically ...- Learning Feature Importance from Decision Trees and Random Forests
This video shows the process of feature selection with Decision Trees and Random Forests. Why do we need Feature Selection? Often we end up with large datasets with redundant features that need ...- What is Layer Normalization
https://www.youtube.com/watch?v=b19rLQUijxI Normalization of features is very common in ML pipelines. In Deep learning models, normalization of the intermediate activations helps combat ‘internal covariate shift’ that might hinder the learning process. This brief ...- Local Outlier Factor for Anomaly Detection
https://youtu.be/Xl7XVPyvO5U Anomaly detection is an important application used across various verticals like healthcare, finance, manufacturing and so on. Local Outlier Factor is a popular density based technique for anomaly detection that ...- The Transformer Architecture
https://youtu.be/lelhu5B0jls This short video talks about the various components of the transformer architecture like the positional encoding, multi head attention, layer norm, skip connections, feedforward network, loss function and so on.- Missing Value Imputation with Mean Median and Mode
https://youtu.be/vxNFY6Z6Kv0 This video explains feature imputation for missing values in a dataset, based on other values in the same column. Popular techniques of univariate imputation based on mean, median and mode ...
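A short sketch of univariate imputation with scikit-learn's SimpleImputer on a toy column:

```python
# Hypothetical sketch: univariate imputation of missing values with mean, median and mode.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [2.0], [np.nan], [100.0]])  # toy column with one missing value

for strategy in ("mean", "median", "most_frequent"):
    imputed = SimpleImputer(strategy=strategy).fit_transform(X)
    print(strategy, "->", imputed.ravel())
```
- What is Temporal Leakage in ML Pipelines?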
https://youtu.be/WqjWwRxTjf0 Data leakage is a common problem in ML pipelines, due to which we end up with models that do not generalize the way we expect them to. This short video explains ...- How will you build an auto suggestion feature for a messaging app or google search?
The auto suggestion feature involves recommending the next word in a sentence or a phrase. For this, we need to build a language model on a large enough corpus of “relevant” data. ...- What is stratified sampling and why is it important ?
Stratified sampling is a sampling method where the population is divided into homogeneous subgroups called strata and the right number of instances is sampled from each stratum. For further explanation visit ...- NDCG Evaluation Metric for Recommender Systems
https://youtu.be/J-7HbXW9JpM This video talks about the NDCG metric for recommender systems that takes into account both the degree of relevance and the ranking of items to evaluate recommender systems. What is the ...- Dartboard Paradox: Probability Density Function vs Probability
What is the Dartboard Paradox ? Assume you are throwing a dart at a dartboard such that it hits somewhere on the dartboard. The dartboard paradox: The probability of hitting any specific ...- How to do Kfold Crossvalidation for Temporal Data?
https://youtu.be/bTSraiFsCi8 Temporal Leakage is a common problem with temporal data. This short video discusses ways to do crossvalidation with temporal data without temporal leakage.- How do you train an hMM model in practice ?
The joint probability distribution for the HMM is given by the following equation, where $x_1, \ldots, x_T$ are the observed data points and $z_1, \ldots, z_T$ the corresponding latent states: ...- Building ML Models for Mixed Data
https://youtu.be/ZEdahv3Q7Gw Mixed data refers to datasets where different columns have different data types. While mixed data is very common in the real world, many of the commonly used ML models cannot ...- Anomaly Detection Techniques
https://youtu.be/6q3Lqy56G_w Techniques for Anomaly Detection Anomaly detection is an important task with many applications – right from finding outliers in the data to avoid building bad models to applications such as fraud ...- Positional Encoding in the Transformer Model
https://youtu.be/5wpzAk4THcI Transformer models are super popular. With the quadratic attention layer, how does the sequential nature of the data get captured? Through Positional Encoding. This video briefly explains the concept of positional encoding ...- Multivariate Imputation of Missing Values
https://youtu.be/mlk-MheAipE This video describes the process of imputing missing values using multivariate imputation techniques that use the other columns as features to predict the missing values in a particular column. In ...- What are the commonly used activation functions ? When are they used.
Ans. The commonly used activation functions are: Linear : g(x) = x. This is the simplest activation function. However it cannot model complex decision boundaries. A deep network with linear ...- What is the difference between supervised and unsupervised learning ?
In Supervised Learning the algorithm learns from labeled training data. In other words, each data point is tagged with the answer or the label the algorithm should come up with. ...- What are the different independence assumptions in hMM & Naive Bayes ?
Both the hMM and Naive Bayes make conditional independence assumptions. The hMM can be expressed by the equations below : The second equation implies a conditional ...- Risks When Building with LLMs and Generative AI
https://youtu.be/D6CheqTPczk LLMs and Generative AI have permeated our lives. This video talks about some risks to keep in mind when we build with LLMs and generative AI.- Is the run-time of an ML algorithm important? How do I evaluate whether the run-time is OK?
Runtime considerations are often important for many applications. Typically you should look at training time and prediction time for an ML algorithm. Some common questions to ask include: Training: Do you want ...- If the average length of a sentence is 100 in all documents, should we build 100-gram language model ?
A 100-gram model will be more complex and will have a lot of parameters. One way is to start with an n-gram model with different values of n from 2 to ...- What is an autoencoder? What are applications of autoencoders?
https://youtu.be/I9TvEa0TV-A This video explains autoencoders in a crisp three minute video and explains where they are used.- The BERT Score – Evaluating Text Generation
https://www.youtube.com/watch?v=4Hv_3Jd2O24 This video talks about the evaluation metric BERTScore, why it is needed over existing metrics such as the BLEU score, and how it is computed and evaluated. Traditional ...- What are knowledge graphs? When would you need a knowledge graph over say a database to store information?
A knowledge graph organizes real world knowledge as entities and relationships between entities. Creating a knowledge graph often involves scraping / ingesting unstructured data and creating structure out of it ...- Machine Translation
https://youtu.be/T0EKtufdIw0 Here is a high level overview of Machine Translation. This short video covers a brief history of machine translation followed by a quick explanation of SMT (statistical machine translation) and ...- What is Elastic Net Regularization for Regression?
Most of us know that ML models often tend to overfit to the training data for various reasons. This could be due to lack of enough training data or the ...- Moving Average Method for Time Series Modeling
https://youtu.be/ab8vAgd5_ak This short video talks about the moving average technique for time series data and how it can be used for smoothing, forecasting and feature construction with simple python code.- What are the advantages and disadvantages of using naive bayes for spam detection?
Disadvantages: Naive bayes is based on the conditional independence of features assumption – an assumption that is not valid in many real world scenarios. Hence it sometimes oversimplifies the problem ...- What is One-Class SVM ? How to use it for anomaly detection?
https://youtu.be/vmE9ScCb2KY One-class SVM is a variation of the SVM that can be used in an unsupervised setting for anomaly detection. Let’s say we are analyzing credit card transactions to identify fraud. We ...- What would you care more about – precision or recall for spam filtering problem?
A false positive means it was not spam and we called it spam; a false negative means it was spam and we didn’t label it spam. Precision = TP / (TP ...- How does KNN algorithm work ? What are the advantages and disadvantages of KNN ?
The KNN algorithm is commonly used in many ML applications – right from supervised settings such as classification and regression, to just retrieving similar items in applications such as recommendation ...- How to find the Optimal Threshold from ROC curve?
https://youtu.be/-_9blLWSlBM This brief video talks about how the ROC curve is constructed and how one can find the optimal threshold for a classifier such as logistic regression, from the ROC curve. ...- How to build a Global Surrogate Model for Interpretable AI?
https://youtu.be/uOL-Zb9_DO4 This brief 5 minute video explains building Global Surrogate Models for interpretable AI.- How is Working with Time Series Data different?
https://youtu.be/WuALMDv87y4 This video talks about what time series data is and how working with time series data is different from other forms of data.- How do you generate text using a Hidden Markov Model (HMM) ?
The HMM is a latent variable model where the observed sequence of variables $x_1, \ldots, x_T$ is assumed to be generated from a set of temporally connected latent variables $z_1, \ldots, z_T$. The joint distribution ...- Z-Score for Outlier Detection
https://youtu.be/T0IJT6dDt3c This video explains Z-Score for Anomaly detection with examples and python code. Datasets often contain anomalies or outliers whose properties are different from those of the regular data points. Such outliers ...
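A minimal z-score outlier check on made-up data, flagging points more than 3 standard deviations from the mean:

```python
# A minimal z-score outlier check on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
x = np.append(rng.normal(loc=10, scale=0.5, size=50), 25.0)  # 50 inliers plus one obvious outlier

z = (x - x.mean()) / x.std()
print("outliers:", x[np.abs(z) > 3])   # expect roughly the single 25.0 point
```
- Knowledge Distillation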
https://www.youtube.com/watch?v=B2wGxgQfKxo This video talks about model compression and what knowledge distillation is. It talks about the distillation loss and the common frameworks employed for knowledge distillation.- With the maximum likelihood estimate are we guaranteed to find a global optimum ?
The maximum likelihood estimate finds the value of the parameters that maximizes the likelihood. If the likelihood is strictly concave (or the negative of the likelihood is strictly convex), we are guaranteed to find a ...- What are Ball Trees?
https://youtu.be/ZOZqJqGgP1M This video describes what Ball Trees are, how they work and python snippets to get them working in your code..- You are given some documents and asked to find prevalent topics in the documents – how do you go about it ?
This is typically called topic modeling. Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents. For instance, two statements ...- What is PMI ?
PMI (Pointwise Mutual Information) is a measure of correlation between two events x and y, defined as $PMI(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)}$. As you can see from the formula ...
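A toy computation of PMI from hypothetical co-occurrence counts (the words and counts below are made up):

```python
# Hypothetical toy example: PMI of two words from co-occurrence counts in a tiny corpus.
import math

n_windows = 10000          # total contexts examined (made-up numbers)
count_x = 300              # windows containing "new"
count_y = 200              # windows containing "york"
count_xy = 150             # windows containing both

p_x, p_y, p_xy = count_x / n_windows, count_y / n_windows, count_xy / n_windows
pmi = math.log2(p_xy / (p_x * p_y))
print("PMI(new, york) =", round(pmi, 2))   # large positive value -> strong association
```
- Bias in Machine Learning : Types of Data Biases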
Bias in Machine Learning models could often lead to unexpected outcomes. In this brief video we will look at different ways we might end up building biased ML models, with ...- How to measure the performance of the language model ?
While building a language model, we try to estimate the probability of a sentence or a document. Given sequences (sentences or documents), the language model (e.g. a bigram language model) will be ...- How many parameters are there for an hMM model?
Let us calculate the number of parameters for a bi-gram hMM. Let $N$ be the total number of states and $V$ be the vocabulary size ...- Stationarity in Time Series Data
https://youtu.be/smetkq85tO4 This brief video explains what stationarity is when we model time series data and why we care about it along with an overview of different kinds of stationarity with examples.- What are evaluation metrics for multi-class classification problem (like positive/negative/neutral sentiment analysis)
For multiclass classification (MCC) problems, metrics can be derived from the confusion matrix. Let $tp_i, tn_i, fp_i, fn_i$ denote the true positives, true negatives, false positives and false negatives for class $i$ respectively. For MCC problems, usually macro and ...
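A short sketch with scikit-learn showing macro vs micro averaging on a toy 3-class problem; note that for single-label multiclass data, micro precision, micro recall and accuracy coincide:

```python
# Hypothetical sketch: macro vs micro precision/recall for a 3-class problem with scikit-learn.
from sklearn.metrics import precision_score, recall_score, accuracy_score

y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 2, 0, 2]

for avg in ("macro", "micro"):
    p = precision_score(y_true, y_pred, average=avg)
    r = recall_score(y_true, y_pred, average=avg)
    print(f"{avg}: precision={p:.2f} recall={r:.2f}")

# For single-label multiclass data, micro precision == micro recall == accuracy.
print("accuracy:", accuracy_score(y_true, y_pred))
```
- What are Isolation Forests? How to use them for Anomaly Detection?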
All of us know random forests, one of the most popular ML models. They are a supervised learning algorithm, used in a wide variety of applications for classification and regression. Can ...- Why do you typically see overflow and underflow when implementing ML algorithms ?
A common pre-processing step is to normalize/rescale inputs so that they are not too high or low. However, even on normalized inputs, overflows and underflows can occur: Underflow: Joint probability distributions often ...
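One standard remedy, sketched below, is to work with log-probabilities and combine them with logsumexp instead of multiplying raw probabilities:

```python
# Illustrative fix for underflow: work with log-probabilities and use logsumexp.
import numpy as np
from scipy.special import logsumexp

log_probs = np.log(np.full(1000, 0.01))       # 1000 independent factors of 0.01

naive_product = np.prod(np.full(1000, 0.01))  # underflows to 0.0 in float64
log_product = log_probs.sum()                 # stays finite: 1000 * log(0.01)

print("naive product:", naive_product)
print("log product:", log_product)
print("normalizing two such terms:", logsumexp([log_product, log_product]))
```
- GPT Model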
https://www.youtube.com/watch?v=PbiJyXZMB9o This video explains the GPT model, where it is used and a small code snippet to understand how to use it in python with a toy example.- Why are Transformers So Successful?
https://www.youtube.com/watch?v=hLvxa3JI4Js This video talks about why the transformer models are successful compared to their predecessors. It talks about various aspects of the transformer model such as self attention, positional encoding and ...- What is AUC : Area Under the Curve?
What is AUC ? AUC is the area under the ROC curve. It is a popularly used classification metric. Classifiers such as logistic regression and naive bayes predict class probabilities as the ...- Do we need to learn Linear Algebra for Machine Learning ?
A lot of things we do in the ML pipeline involve vectors and matrices. Linear Algebra helps us understand how these vectors interact with each other, how to perform vector & ...- BERT Model
https://youtu.be/ZPmQzexoi-Q This video explains the BERT model, its architecture, how it is trained and used. It also talks about when we would want to use the BERT model in comparison with ...- BLEU Score
https://youtu.be/UV2ymKoMcyw This brief video describes the BLEU score, a popular evaluation metric used for several tasks such as machine translation, text summarization and so on. What is BLEU Score? BLEU stands for ...- How do you use Complement Naive Bayes for Imbalanced Datasets?
https://youtu.be/Rhs3RIECfe4 This brief video explains the Complement Naive Bayes classifier, a modification of the naive bayes classifier that works well for imbalanced datasets.- Batch vs Mini-Batch vs Stochastic Gradient Descent
https://www.youtube.com/watch?v=1xMs6A3DLYw Most deep learning architectures use a variation of the Gradient Descent optimization algorithm to come up with the best set of parameters for the network, given the loss function and the ...- Berkson’s Paradox
This video explains Berkson’s Paradox. Berkson’s Paradox typically arises from selection bias when we create our dataset, which could lead to unintended inferences from our data. Summary of contents: Berkson’s Paradox ...- What is Bayesian Modeling?
This video explains Bayesian Modeling : Why do we need Bayesian Modeling? What is Bayesian Modeling? What are some examples where we can practically use Bayesian Modeling ? Check out https://www.tensorflow.org/probability for code ...- BERT vs Word2Vec Embeddings
https://youtu.be/9eTeIO6nFTI This video talks about two of the popular word embeddings BERT and Word2Vec and explains the differences and when it makes sense to use each.- Skip or Residual Connections in Deep Networks
https://www.youtube.com/watch?v=HW7Kv8HGdvM The transformer model uses skip connections to promote accelerated learning through a deep architecture. This video explains Skip or Residual connections to enable building deep neural networks bypassing challenges such ...- Bias in Machine Learning : How to measure Fairness based on Confusion Matrix ?
Machine Learning models often give us unexpected and biased outcomes if the underlying data is biased. Very often, our process of collecting data is incomplete or flawed leading to data ...- Can we use the AUC Metric for a SVM Classifier ?
This video explains computing the AUC metric for an SVM classifier, or other classifiers that output class labels directly rather than class probabilities. What is Area Under the Curve ? AUC is the ...- Popular Distance Metrics in ML
https://youtu.be/XlMo0vuhq6w This video talks about popular distance metrics used in Machine Learning algorithms – Euclidean distance, Manhattan distance, Minkowski distance, Hamming distance and Cosine distance.- KDTrees for Nearest Neighbour Search: Advantages and Disadvantages
https://youtu.be/L8jKECGYIpQ This video explains where Kd-trees are used and how they work. It talks about where one can use Kd-trees and where they fail. Also provided are quick python snippets that help ...- Suppose you are modeling text with an HMM, what is the complexity of finding the most probable sequence of tags or states from a sequence of text using a brute force algorithm?
Assume there are $N$ total states and let $T$ be the length of the longest sequence. Think about how we generate text using an hMM. We first have a state sequence and ...- What is the complexity of Viterbi algorithm ?
Viterbi algorithm is a dynamic programming approach to find the most probable sequence of hidden states given the observed data, as modeled by a HMM. Without dynamic programming, it becomes an ...- LIME for local explanations
https://youtu.be/s3J5x5zIU0Y This video talks about LIME for local interpretability. It explains the motivation for local explanations followed by how LIME works to provide local explanations along with information on where ...- Feedback Loops: What causes Bias Amplification in Recommender Systems?
https://youtu.be/CnDUINYBeXk This short video talks about feedback loops that often show up in recommender systems – causing frequently recommended items to pop up even more frequently. It is important to be ...- You want to find food related topics in twitter – how do you go about it ?
One can use any of the topic models above to get topics. However, to direct the topics to contain food related information, specialized topic modeling algorithms are available. However, one ...- What are some common tools available for NER ? Named Entity Recognition ?
Notable NER platforms include: GATE supports NER across many languages and domains out of the box, usable via a graphical interface and a Java API. OpenNLP includes rule-based and statistical named-entity recognition. SpaCy ...- What is negative sampling when training the skip-gram model ?
Recap: Skip-Gram model is a popular algorithm to train word embeddings such as word2vec. It tries to represent each word in a large text as a lower dimensional vector in ...- How to find the Optimal Number of Clusters in K-means? Elbow and Silhouette Methods
K-means Clustering Recap: Clustering is the process of finding cohesive groups of items in the data. K-means clustering is the most popular clustering algorithm. It is simple to implement and ...
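A minimal sketch of both criteria, the elbow (inertia) and the silhouette score, on synthetic blobs with scikit-learn (the range of k is arbitrary):

```python
# Hypothetical sketch: pick k via inertia (elbow) and silhouette score on toy blobs.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k={k}  inertia={km.inertia_:.1f}  silhouette={silhouette_score(X, km.labels_):.3f}")
```
- I have designed a 2 layered deep neural network for a classifier with 2 units in the hidden layer. I use linear activation functions with a sigmoid at the final layer. I use a data visualization tool and see that the decision boundary is in the shape of a sine curve. I have tried to train with 200 data points with known class labels and see that the training error is too high. What do I do ?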
(a) Increase number of units in the hidden layer
(b) Increase number of hidden layers
(c) Increase data set size
(d) Change activation function to tanh
(e) Try all of the above
The answer is d. When I use a ...- Macro, Micro and Weighted F1 Score
https://youtu.be/N1_3KrC337s This video explains the need for Macro, Micro and Weighted F1 score metrics for multiclass classification problems.- When are deep learning algorithms more appropriate compared to traditional machine learning algorithms?
Deep learning algorithms are capable of learning arbitrarily complex non-linear functions by using a deep enough and a wide enough network with the appropriate non-linear activation function. Traditional ML algorithms ...- Scaled Dot Product Attention
https://www.youtube.com/watch?v=RZN5Pwb4Ywg This video explains the motivation behind scaled dot product attention used in the transformer architecture and how it is computed.- Where do we use Divide and Conquer in Machine Learning?
https://youtu.be/Scn6y9xLr0I This video talks about examples where we use the Divide and Conquer technique in Machine Learning. Algorithms and data structures are used in various instances while building efficient Machine Learning ...- Evaluation Metrics for Recommendation Systems
This video explores how one can evaluate recommender systems. Evaluating a recommender system involves (1) whether the right results are being recommended and (2) whether more relevant results are being recommended at ...- Global vs Local Interpretability
https://youtu.be/EHAwlKyFwOg Global vs Local Interpretability: Interpretable AI This video explains why we need interpretability for our AI and the two approaches typically used for interpretability namely Global and Local Interpretability. While global ...- Gaussian Processes for Bayesian Hyperparameter Tuning
https://youtu.be/kAQMOujS5YY Grid search is often very expensive for hyper-parameter tuning and a bottleneck in ML pipelines. This brief video explains the Bayesian hyper-parameter optimization technique with Gaussian processes in a simple ...- MAP at K : An evaluation metric for Ranking
https://youtu.be/QSaK4l9C66c This video talks about the Mean Average Precision at K (popularly called the MAP@K) metric that is commonly used for evaluating recommender systems and other ranking related problems. Why do ...- METEOR metric for machine translation
https://youtu.be/FqQbrlEh_b0 This short video describes METEOR, a metric for evaluating machine generated text. It is used to evaluate whether the candidate text generated by an ML model matches the reference text ...- What is Simpson’s Paradox ?
Simpson’s Paradox occurs when trends in aggregates are reversed when examining trends in subgroups. Data often has biases that might lead to unexpected trends, but digging deeper and ...