Featured

Suppose you are modeling text with an HMM. What is the complexity of finding the most probable sequence of tags or states from a sequence of text using a brute-force algorithm?

Assume there are N states in total and let T be the length of the longest sequence. Think about how we generate text using an HMM: we first have a state sequence, and each state emits an output. From each state, any word out of V possible outcomes can be generated. Since there are N states, at each possible…
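To make the brute-force cost concrete, here is a minimal sketch that enumerates every state sequence for a toy HMM. The state names, probabilities, and the function name `brute_force_decode` are made up for illustration and are not taken from the article. With N states and a sequence of length T (symbols chosen here for the sketch) there are N^T candidate sequences, and scoring each takes O(T) work, so the total is O(T · N^T).

```python
from itertools import product

def brute_force_decode(obs, states, start_p, trans_p, emit_p):
    """Score every possible state sequence and return the most probable one."""
    best_seq, best_prob, n_checked = None, -1.0, 0
    for seq in product(states, repeat=len(obs)):      # N**T candidate sequences
        prob = start_p[seq[0]] * emit_p[seq[0]][obs[0]]
        for t in range(1, len(obs)):                  # O(T) work per candidate
            prob *= trans_p[seq[t - 1]][seq[t]] * emit_p[seq[t]][obs[t]]
        n_checked += 1
        if prob > best_prob:
            best_seq, best_prob = seq, prob
    return best_seq, best_prob, n_checked

# Toy parameters, invented for this sketch.
states = ("Noun", "Verb")
start_p = {"Noun": 0.7, "Verb": 0.3}
trans_p = {"Noun": {"Noun": 0.3, "Verb": 0.7},
           "Verb": {"Noun": 0.8, "Verb": 0.2}}
emit_p = {"Noun": {"dogs": 0.8, "bark": 0.2},
          "Verb": {"dogs": 0.1, "bark": 0.9}}

best_seq, best_prob, n_checked = brute_force_decode(
    ["dogs", "bark", "dogs"], states, start_p, trans_p, emit_p)
print(best_seq, best_prob, n_checked)
```

Here N = 2 and T = 3, so the loop visits 2^3 = 8 sequences; doubling T squares that count, which is why dynamic programming (the Viterbi algorithm) is normally used instead.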

Explain Locality Sensitive Hashing for Nearest Neighbour Search

What is Locality Sensitive Hashing (LSH)? Locality Sensitive Hashing is a technique for creating a hash, or putting items in buckets, such that similar items are in the same bucket (same hash) with high probability. Dissimilar items are in different buckets, i.e., dissimilar items land in the same bucket only with low probability. Where…
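The bucketing idea can be sketched with one common LSH family, random hyperplanes for cosine similarity. This choice is an assumption made for illustration; the teaser does not say which hash family the article uses, and the helper names below are hypothetical. Each hyperplane contributes one bit (which side of the plane the vector falls on), and the tuple of bits is the bucket key.

```python
import random

def make_hyperplanes(n_planes, dim, seed=0):
    """Draw random Gaussian normal vectors, one per hash bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]

def lsh_signature(vec, planes):
    """One bit per hyperplane: 1 if the dot product is non-negative, else 0."""
    return tuple(
        1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )

def build_index(vectors, planes):
    """Group item names by signature; each distinct signature is a bucket."""
    buckets = {}
    for name, vec in vectors.items():
        buckets.setdefault(lsh_signature(vec, planes), []).append(name)
    return buckets

planes = make_hyperplanes(n_planes=8, dim=3)
vectors = {
    "a": [1.0, 2.0, 3.0],
    "a_scaled": [2.0, 4.0, 6.0],      # same direction as "a"
    "opposite": [-1.0, -2.0, -3.0],   # opposite direction
}
index = build_index(vectors, planes)
sig_a = lsh_signature(vectors["a"], planes)
sig_scaled = lsh_signature(vectors["a_scaled"], planes)
sig_opp = lsh_signature(vectors["opposite"], planes)
print(index)
```

Vectors pointing in the same direction always share a bucket, while the opposite vector flips every bit. For a nearest-neighbour query, only items in the query's bucket need to be compared exactly.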

The Machine Learning Product Lifecycle – Challenges building ML products

Contrary to the popular notion that working on ML products is all about crunching math and stats, there are many steps involved in productionizing ML and creating real products. Here is a brief video that explores the machine learning product development lifecycle and also talks about how it differs from the traditional product development lifecycle…

What is Simpson’s Paradox?

Simpson’s paradox occurs when trends in aggregates are reversed when examining trends in subgroups. Data often has biases that might lead to unexpected trends, but digging deeper, deciphering these biases, and looking at appropriate subgroups leads to drawing the right insights. Why does Simpson’s paradox occur? Arithmetically, when (a1/A1) < (a2/A2)…
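A small numeric sketch shows the reversal. The counts below are illustrative numbers chosen for this example, not data from the article: option A has the higher success rate inside each subgroup, yet option B has the higher rate once the subgroups are pooled, because the two options are applied to the subgroups in very different proportions.

```python
# (successes, trials) per subgroup; illustrative counts only.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def rate(successes, trials):
    return successes / trials

def pooled_rate(groups):
    """Success rate after adding up all subgroups."""
    total_successes = sum(s for s, _ in groups.values())
    total_trials = sum(n for _, n in groups.values())
    return total_successes / total_trials

for option, groups in counts.items():
    per_group = {name: round(rate(*sn), 3) for name, sn in groups.items()}
    print(option, per_group, "pooled:", round(pooled_rate(groups), 3))
```

A wins in both the "small" and "large" subgroups, but B wins overall: A is mostly tried on the harder "large" cases, which drags its pooled rate down.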

What is Elastic Net Regularization for Regression?

Most of us know that ML models often tend to overfit to the training data for various reasons. This could be due to a lack of training data, or to the training data not being representative of the data we expect to apply the model to. But the result is that we end up building an overly…
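As a rough sketch of what the elastic net adds to a regression loss: it penalizes the weights with a mix of an L1 term and a squared L2 term. The parametrization below (an overall strength `alpha` and a mixing weight `l1_ratio`) follows the objective documented for scikit-learn's `ElasticNet`; the function names themselves are hypothetical and written from scratch here.

```python
def elastic_net_penalty(weights, alpha=1.0, l1_ratio=0.5):
    """alpha * (l1_ratio * ||w||_1 + 0.5 * (1 - l1_ratio) * ||w||_2^2)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return alpha * (l1_ratio * l1 + 0.5 * (1.0 - l1_ratio) * l2_squared)

def penalized_loss(X, y, weights, alpha=1.0, l1_ratio=0.5):
    """Half mean squared error plus the elastic net penalty."""
    n = len(y)
    residuals = [yi - sum(w * x for w, x in zip(weights, row))
                 for row, yi in zip(X, y)]
    data_term = sum(r * r for r in residuals) / (2 * n)
    return data_term + elastic_net_penalty(weights, alpha, l1_ratio)

w = [3.0, -4.0]
print(elastic_net_penalty(w, l1_ratio=1.0))   # pure L1 (lasso-style) penalty
print(elastic_net_penalty(w, l1_ratio=0.0))   # pure squared-L2 (ridge-style) penalty
print(elastic_net_penalty(w, l1_ratio=0.5))   # an even mix of the two
```

Setting `l1_ratio` to 1 or 0 recovers the lasso-style or ridge-style penalty respectively; values in between trade sparsity against smooth shrinkage.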

What are Isolation Forests? How to use them for Anomaly Detection?

All of us know random forests, one of the most popular ML models. They are a supervised learning algorithm, used in a wide variety of applications for classification and regression. Can we use random forests in an unsupervised setting (where we have no labeled data)? Isolation forests are a variation of random forests that can…
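The core intuition behind isolation forests is that anomalies are separated from the rest of the data by fewer random splits than normal points. The sketch below is a simplified, from-scratch illustration of that idea, not a full implementation: it follows a single query point down random axis-aligned splits on a subsample and averages the depth at which the point ends up alone. The function names, the depth cap, and the toy data are all assumptions made for this example.

```python
import random

def isolation_depth(point, sample, rng, depth=0, max_depth=10):
    """Number of random splits needed before `point` is alone (or depth cap hit)."""
    if len(sample) <= 1 or depth >= max_depth:
        return depth
    dim = rng.randrange(len(point))             # pick a random feature
    lo = min(row[dim] for row in sample)
    hi = max(row[dim] for row in sample)
    if lo == hi:                                # cannot split on this feature
        return depth
    split = rng.uniform(lo, hi)                 # pick a random split value
    # Keep only the side of the split that contains the query point.
    if point[dim] < split:
        side = [row for row in sample if row[dim] < split]
    else:
        side = [row for row in sample if row[dim] >= split]
    return isolation_depth(point, side, rng, depth + 1, max_depth)

def average_depth(point, data, n_trees=200, sample_size=64, seed=0):
    """Average isolation depth over many random subsamples ("trees")."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_trees):
        sample = rng.sample(data, min(sample_size, len(data)))
        total += isolation_depth(point, sample, rng)
    return total / n_trees

gen = random.Random(42)
cluster = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(200)]
outlier = (8.0, 8.0)
points = cluster + [outlier]

outlier_depth = average_depth(outlier, points)
inlier_depth = average_depth((0.0, 0.0), points)
print(outlier_depth, inlier_depth)
```

The far-away point is isolated after noticeably fewer splits than a point in the middle of the cluster, so a short average depth can be used as an anomaly signal. No labels are needed at any step.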

What is One-Class SVM? How to use it for anomaly detection?

One-class SVM is a variation of the SVM that can be used in an unsupervised setting for anomaly detection. Let’s say we are analyzing credit card transactions to identify fraud. We are likely to have many normal transactions and very few fraudulent transactions. Also, the next fraudulent transaction might be completely different from all previous…

What does the typical day of a data scientist look like?

Being a data scientist is much more than simply churning out models with a lot of math! This video breaks down and explains the tasks in the typical day of a data scientist: communicating with stakeholders, analyzing data, designing the end-to-end data pipeline, building models, tuning models, testing and debugging, evaluating models, measuring…

Can we use the AUC Metric for an SVM Classifier?

What is AUC? AUC is the area under the ROC curve. It is a popularly used classification metric. Classifiers such as logistic regression and naive Bayes predict class probabilities as the outcome instead of predicting the labels themselves. A new data point is classified as positive if the predicted probability of positive class…
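One way to see why AUC does not require probabilities: it depends only on how the scores rank positives against negatives. The sketch below computes AUC directly as the fraction of (positive, negative) pairs in which the positive gets the higher score. The scores are hypothetical signed distances from an SVM decision boundary (the kind of value scikit-learn exposes through `decision_function`), invented for this example.

```python
def auc_from_scores(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 1, 0, 0]
# Hypothetical SVM decision values (signed distances), not probabilities.
svm_scores = [2.3, 0.4, 0.9, 1.1, -0.2, -1.5]

auc = auc_from_scores(labels, svm_scores)
# Any strictly increasing rescaling keeps the ranking, and so keeps the AUC.
auc_rescaled = auc_from_scores(labels, [3.0 * s + 5.0 for s in svm_scores])
print(auc, auc_rescaled)
```

Because the two AUC values agree, raw decision values are enough to draw the ROC curve; calibrated probabilities are not needed for this metric.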

Top 50 Machine Learning Interview Questions

Whether you are kickstarting your interview preparation, or wrapping up your preparation and looking for final touches, here are over 50 must-see questions to prepare for a data science interview. We have put them in five categories for convenience. (Note: There are several more questions along with answers in the main menu “Interview…