Decoding the Data Scientist Hiring Gap

Demand for AI/ML is growing, and more jobs are being created as data awareness increases and more data is collected. However, hiring data scientists has not been easy: most of these roles are not yet filled. On the other hand, data science is a very popular discipline…

Detecting and Removing Gender Bias in Word Embeddings

What are Word Embeddings? Word embeddings are vector representations of words that can be used as input (features) to downstream tasks and ML models. Here is an article that explains popular word embeddings in more detail. They are used in many NLP applications such as sentiment analysis, document clustering, question answering, paraphrase detection…

Explain Locality Sensitive Hashing for Nearest Neighbour Search?

What is Locality Sensitive Hashing (LSH)? Locality Sensitive Hashing is a technique for hashing items, or putting them in buckets, such that similar items land in the same bucket (same hash) with high probability. Dissimilar items land in different buckets, i.e. dissimilar items are in the same bucket with low probability. Where…
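To make the bucketing idea concrete, here is a minimal sketch of one common LSH family, random hyperplane hashing for cosine similarity. The excerpt does not say which hash family the post uses, so this choice is an assumption, and the names `make_hyperplanes`, `lsh_signature` and `build_buckets` are hypothetical helpers written for illustration only.

```python
import random
from collections import defaultdict


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplanes (normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_signature(vector, hyperplanes):
    """Hash a vector to a tuple of bits: one bit per hyperplane,
    recording which side of the hyperplane the vector falls on."""
    bits = []
    for plane in hyperplanes:
        dot = sum(p * x for p, x in zip(plane, vector))
        bits.append(1 if dot >= 0 else 0)
    return tuple(bits)


def build_buckets(items, hyperplanes):
    """Group item names by signature; each distinct signature is a bucket."""
    buckets = defaultdict(list)
    for name, vector in items.items():
        buckets[lsh_signature(vector, hyperplanes)].append(name)
    return buckets


# Illustrative vectors (assumed data, not from the post).
planes = make_hyperplanes(num_bits=8, dim=3)
items = {
    "a": [1.0, 2.0, 3.0],
    "a_scaled": [2.0, 4.0, 6.0],       # same direction as "a"
    "a_opposite": [-1.0, -2.0, -3.0],  # opposite direction
}
buckets = build_buckets(items, planes)
```

Vectors pointing in the same direction always share a signature here, while vectors at a wide angle are unlikely to. For nearest neighbour search, only the items in the query's bucket need to be compared exactly.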

The Machine Learning Product Lifecycle – Challenges building ML products

Contrary to the popular notion that working on ML products means crunching math and stats, there are many steps involved in productionizing ML and creating real products. Here is a brief video that explores the Machine Learning product development lifecycle and how it differs from the traditional product development lifecycle…

What is Simpson’s Paradox?

Simpson’s Paradox occurs when trends in aggregates are reversed when examining trends in subgroups. Data often has biases that might lead to unexpected trends, but digging deeper, deciphering these biases and looking at appropriate subgroups leads to the right insights. Why does Simpson’s paradox occur? Arithmetically, when (a1/A1) < (a2/A2)…
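A small numeric check can make the reversal easier to see. The sketch below uses the counts from the widely cited kidney stone treatment example (successes and trials for two treatments, split by stone size). The excerpt itself does not include this example, and the helper names `rate` and `aggregate_rate` are hypothetical.

```python
from fractions import Fraction

# (successes, trials) per treatment and subgroup.
data = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def rate(successes, trials):
    """Exact success rate for one subgroup."""
    return Fraction(successes, trials)


def aggregate_rate(groups):
    """Success rate after pooling all subgroups of one treatment."""
    total_successes = sum(s for s, _ in groups.values())
    total_trials = sum(n for _, n in groups.values())
    return Fraction(total_successes, total_trials)


# Treatment with the higher success rate within each subgroup.
subgroup_winner = {
    group: max(data, key=lambda t: rate(*data[t][group]))
    for group in ("small", "large")
}

# Treatment with the higher success rate once subgroups are pooled.
overall_winner = max(data, key=lambda t: aggregate_rate(data[t]))

print(subgroup_winner)  # A wins in both subgroups
print(overall_winner)   # B wins in the aggregate
```

Treatment A has the higher rate within each subgroup, yet B looks better in aggregate because the two treatments were applied to very different mixes of small and large stones.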

What is Elastic Net Regularization for Regression?

Most of us know that ML models often tend to overfit to the training data for various reasons. This could be due to a lack of training data, or to the training data not being representative of the data we expect to apply the model to. But the result is that we end up building an overly…

What are Isolation Forests? How to use them for Anomaly Detection?

All of us know random forests, one of the most popular ML models. They are a supervised learning algorithm used in a wide variety of applications for classification and regression. Can we use random forests in an unsupervised setting (where we have no labeled data)? Isolation forests are a variation of random forests that can…