What are popular methods of dimensionality reduction in NLP tasks? Do you think this is even important?

A common representation is bag of words, which is very high dimensional because its size grows with the vocabulary. Commonly used approaches for dimensionality reduction in NLP: TF-IDF: term frequency, inverse document frequency (link to relevant article). Word2Vec / GloVe: these have become very popular recently. They are obtained by leveraging word co-occurrence, through an encoder –…
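To make the dimensionality point concrete, here is a minimal sketch of a bag-of-words representation in plain Python. The three documents are made-up examples and the helper names (`build_vocab`, `bag_of_words`) are hypothetical; the only point is that every vector has one slot per vocabulary word, so the dimension grows with the vocabulary.

```python
from collections import Counter

def build_vocab(docs):
    """Return a sorted list of the distinct tokens across all documents."""
    return sorted({tok for doc in docs for tok in doc.lower().split()})

def bag_of_words(doc, vocab):
    """Return a count vector with one slot per vocabulary word."""
    counts = Counter(doc.lower().split())
    return [counts[w] for w in vocab]

# Toy corpus (illustrative only, not from the original article).
docs = [
    "red shoes for running",
    "blue shoes for hiking",
    "running and hiking gear",
]

vocab = build_vocab(docs)
vectors = [bag_of_words(d, vocab) for d in docs]

# Each vector is as long as the vocabulary and mostly zeros.
print(len(vocab), vectors[0])
```

With a realistic corpus the vocabulary easily reaches tens of thousands of words, which is why denser, lower-dimensional representations are attractive.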

You are building a natural language search box for a website. How do you accommodate spelling errors?

If you have a dictionary of words, edit distance is the simplest way to handle this. However, sometimes corrections based on context make sense. For instance, suppose I type “bed color shoes”; these are all valid dictionary words, but a sensible model would come up with “red color shoes”. Using the language model to come…
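The dictionary-plus-edit-distance idea can be sketched as follows. This is a minimal illustration using Levenshtein distance; the tiny dictionary, the `suggest` helper, and the `max_distance` threshold are assumptions made for the example, not something the original answer specifies.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum inserts, deletes, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

def suggest(word, dictionary, max_distance=2):
    """Return the closest dictionary word, or the input if nothing is close."""
    if word in dictionary:
        return word  # already a valid word, so nothing gets corrected
    best = min(dictionary, key=lambda w: (edit_distance(word, w), w))
    return best if edit_distance(word, best) <= max_distance else word

# Hypothetical toy dictionary.
dictionary = ["bed", "color", "red", "shoes"]
corrected = [suggest(w, dictionary) for w in "bed colr shoez".split()]
print(corrected)
```

Note that "bed" passes through unchanged because it is a valid dictionary word, which is exactly the limitation the "bed color shoes" example points at: edit distance alone cannot see context.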

What are knowledge graphs? When would you need a knowledge graph over, say, a database to store information?

A knowledge graph organizes real-world knowledge as entities and relationships between entities. Creating a knowledge graph often involves scraping or ingesting unstructured data and creating structure out of it by extracting entities and relationships automatically. Examples of knowledge graphs are WordNet, DBpedia, etc. A database is also a knowledge graph in some sense. Since…
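One common way to picture "entities and relationships" is as a set of (subject, predicate, object) triples. The sketch below is a toy in-memory store written for illustration; the `TripleStore` class and the entities in it are hypothetical, and real knowledge graphs use dedicated storage and query languages.

```python
from collections import defaultdict

class TripleStore:
    """Toy knowledge graph: facts are (subject, predicate, object) triples."""

    def __init__(self):
        self._by_subject = defaultdict(list)

    def add(self, subject, predicate, obj):
        self._by_subject[subject].append((predicate, obj))

    def objects(self, subject, predicate):
        """All objects linked to `subject` through `predicate`."""
        edges = self._by_subject.get(subject, [])
        return [o for p, o in edges if p == predicate]

# Made-up facts, as might be extracted from unstructured text.
kg = TripleStore()
kg.add("Alice", "works_at", "Acme")
kg.add("Acme", "located_in", "Berlin")
kg.add("Alice", "knows", "Bob")

# Two-hop question: in which city does Alice's employer sit?
cities = [
    city
    for employer in kg.objects("Alice", "works_at")
    for city in kg.objects(employer, "located_in")
]
print(cities)
```

New relationship types can be added without changing any schema, and multi-hop questions become simple traversals.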

What are the advantages and disadvantages of using Naive Bayes for spam detection?

Disadvantages: Naive Bayes relies on the assumption that features are conditionally independent given the class, an assumption that does not hold in many real-world scenarios. Hence it sometimes oversimplifies the problem by treating features as independent and gives subpar performance. Advantages: However, Naive Bayes is very efficient. It is a model you can train in…
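A minimal sketch of a multinomial Naive Bayes spam classifier, assuming whitespace tokenization and add-one (Laplace) smoothing, shows both sides of the trade-off. The class name `NaiveBayes` and the four training messages are made up for illustration.

```python
import math
from collections import Counter

class NaiveBayes:
    """Multinomial Naive Bayes with Laplace smoothing (toy version)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength

    def fit(self, docs, labels):
        # Training is a single counting pass over the data.
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.class_counts}
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for c, n_docs in self.class_counts.items():
            denom = sum(self.word_counts[c].values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for w in doc.lower().split():
                if w in self.vocab:  # words never seen in training are skipped
                    # Independence assumption: per-word log-likelihoods just add up.
                    score += math.log((self.word_counts[c][w] + self.alpha) / denom)
            scores[c] = score
        return max(scores, key=scores.get)

# Made-up training messages.
train_docs = [
    "win free money now",
    "free prize claim now",
    "meeting at noon tomorrow",
    "lunch with the team tomorrow",
]
train_labels = ["spam", "spam", "ham", "ham"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict("claim your free money"))
print(model.predict("team meeting tomorrow"))
```

The loop in `predict` simply adds one log-probability per word, which is where the independence assumption enters: word order and word interactions are ignored entirely.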

How can you increase the recall of search query results (on a search engine or e-commerce site) without changing the algorithm?

Since we are not allowed to change the algorithm, we can only modify or augment the search query. (Note: we can change either the algorithm/model or the data; here we can only change the data, in other words, modify the search query.) Modifying the query in a way that we get results relevant to…

How will you build an auto-suggestion feature for a messaging app or Google search?

An auto-suggestion feature involves recommending the next word in a sentence or phrase. For this, we need to build a language model on a large enough corpus of “relevant” data. There are 2 caveats here: a large corpus, because we need to cover almost every case. This is important for recall. Relevant data is useful…
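As a rough illustration of next-word suggestion, the sketch below uses a bigram count model, one of the simplest possible language models. The function names and the four-sentence corpus are hypothetical; a production system would train a far stronger model on far more data.

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count, for each word, which words follow it in the corpus."""
    following = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            following[prev][nxt] += 1
    return following

def suggest_next(following, word, k=3):
    """Return up to k most frequent next words (ties broken alphabetically)."""
    counts = following.get(word.lower())
    if not counts:
        return []  # word never seen in training: nothing to suggest
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:k]]

# Tiny made-up "messaging" corpus.
corpus = [
    "see you soon",
    "see you tomorrow",
    "see you soon then",
    "thank you very much",
]

bigrams = train_bigrams(corpus)
print(suggest_next(bigrams, "you"))
print(suggest_next(bigrams, "pizza"))
```

The empty result for an unseen word shows why corpus size matters for recall, and the suggestions themselves only reflect whatever kind of text the model was trained on, which is the relevance caveat.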