What is Simpson’s Paradox?

Simpson’s paradox occurs when a trend seen in aggregated data reverses once the data is split into subgroups. Data often has biases that might lead to unexpected trends, but digging deeper, deciphering these biases, and looking at the appropriate subgroups leads to the right insights. Why does Simpson’s paradox occur? Arithmetically, when (a1/A1) < (a2/A2)…
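The reversal is easy to check with plain arithmetic. The sketch below is a minimal illustration, not taken from the post itself: the counts are illustrative, patterned on the well-known kidney-stone treatment example, and the names `groups`, `rate`, and `aggregate` are hypothetical helpers introduced here.

```python
from fractions import Fraction

# Illustrative (successes, trials) counts per subgroup for two options, A and B.
groups = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def rate(successes, trials):
    """Exact success rate as a fraction, to avoid rounding surprises."""
    return Fraction(successes, trials)


def aggregate(option):
    """Success rate for one option after pooling all of its subgroups."""
    successes = sum(s for s, _ in groups[option].values())
    trials = sum(t for _, t in groups[option].values())
    return rate(successes, trials)


# A has the higher success rate inside every subgroup ...
a_wins_each_subgroup = all(
    rate(*groups["A"][g]) > rate(*groups["B"][g]) for g in ("small", "large")
)
# ... yet B has the higher success rate once the subgroups are pooled.
b_wins_overall = aggregate("B") > aggregate("A")

print(a_wins_each_subgroup, b_wins_overall)
```

The flip happens because the two options have very different subgroup sizes: A is mostly measured on the harder "large" subgroup, while B is mostly measured on the easier "small" one, so the pooled rates are weighted differently.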

What is Elastic Net Regularization for Regression?

Most of us know that ML models often tend to overfit the training data for various reasons. This could be due to a lack of training data, or to training data that is not representative of the data we expect to apply the model to. But the result is that we end up building an overly…

What are Isolation Forests? How to use them for Anomaly Detection?

All of us know random forests, one of the most popular ML models. They are a supervised learning algorithm used in a wide variety of applications for classification and regression. Can we use random forests in an unsupervised setting (where we have no labeled data)? Isolation forests are a variation of random forests that can…

What is One-Class SVM? How to use it for anomaly detection?

One-class SVM is a variation of the SVM that can be used in an unsupervised setting for anomaly detection. Let’s say we are analyzing credit card transactions to identify fraud. We are likely to have many normal transactions and very few fraudulent transactions. Also, the next fraudulent transaction might be completely different from all previous…

What does the typical day of a data scientist look like?

Being a data scientist is much more than simply churning out models with a lot of math! This video breaks down and explains the tasks in the typical day of a data scientist: communicating with stakeholders, analyzing data, designing the end-to-end data pipeline, building models, tuning models, testing and debugging, evaluating models, measuring…

Can we use the AUC Metric for an SVM Classifier?

This video explains computing the AUC metric for an SVM classifier, or other classifiers that give the absolute class values as outcomes. What is Area Under the Curve? AUC is the area under the ROC curve. It is a widely used classification metric. If you want to recap how AUC works, here is a…
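One common way to get an AUC for a classifier that outputs hard class labels is to rank examples by a real-valued score instead, for example the signed distance from the SVM's separating hyperplane (scikit-learn exposes this as `SVC.decision_function`). This is a sketch of that idea and not necessarily the approach the video takes. The helper `auc_from_scores` and the sample labels and scores are hypothetical. It relies on the fact that AUC equals the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half.

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 0/1 ground-truth classes.
    scores: real-valued scores, higher meaning "more likely positive".
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y != 1]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # pair ranked correctly
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(positives) * len(negatives))


# Hypothetical ground truth and signed distances from a decision boundary.
y_true = [1, 1, 0, 0, 1, 0]
margins = [1.2, 0.3, -0.4, 0.5, -0.1, -1.5]

# Thresholding at zero turns the margins into hard class labels.
hard_labels = [1 if m > 0 else 0 for m in margins]

auc_with_margins = auc_from_scores(y_true, margins)
auc_with_hard_labels = auc_from_scores(y_true, hard_labels)
print(auc_with_margins, auc_with_hard_labels)
```

Scoring with hard labels collapses the ROC curve to a single operating point and produces many ties, so it usually understates how well the model ranks examples. The pairwise loop is quadratic and meant only to make the definition visible.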

Finding the Right Data Science Job with Online Networking

When I was graduating from the University of Utah, not many companies turned up for campus placements, since we had a good but very small department with fewer than 20 students across the MS and PhD programs at the time. While I had a few companies that interviewed me, I felt…

What is the difference between a Bar Chart and a Histogram?

A histogram represents the distribution of a numerical variable. A bar chart is typically used to compare numeric values corresponding to categorical variables. To construct a histogram: X-axis: Usually the range of values is binned. In other words, the entire range is divided into a series of intervals and each interval occupies a slot on the…
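The binning step described above can be sketched in a few lines of plain Python. This is a minimal illustration assuming equal-width bins; the function name `histogram_counts` and the sample data are hypothetical and not taken from the post.

```python
def histogram_counts(values, num_bins):
    """Split the range of `values` into equal-width bins and count each bin."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("need at least two distinct values to form bins")
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The maximum value belongs to the last bin rather than a new one.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts


# Histogram input: one numerical variable, binned along the x-axis.
ages = [21, 22, 25, 29, 30, 31, 35, 38, 40, 41]
edges, counts = histogram_counts(ages, num_bins=4)

# Bar chart input: one numeric value per category, with no binning involved.
sales_by_region = {"North": 120, "South": 95, "West": 143}

print(edges, counts)
print(sales_by_region)
```

The contrast is in the inputs: the histogram derives its x-axis slots from the data range, while the bar chart's slots are the categories themselves.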

Learn Data Science and Machine Learning from Scratch

The task of transitioning to a new field is challenging! Not for the faint-hearted… It is not very different from climbing a mountain! To become a data scientist you need to learn some math (stats, linear algebra, optimization), programming (preferably Python / R), and the art of working with and analyzing data. But…