Common representation is bag of words that is very high dimensional given high vocab size. Commonly used ways for dimensionality reduction in NLP : TF-IDF : Term frequency, inverse document frequency (link to relevant article) Word2Vec / Glove : These are very popular recently. They are obtained by leveraging word co-occurrence, through an encoder –…
Category: Machine Learning
You are building a natural language search box for a website. How do you accommodate spelling errors?
If you have a dictionary of words, edit distance is the simplest way of incorporating this. However, sometimes corrections based on context make sense. For instance, suppose I type “bed color shoes” – these are perfect dictionary words, but a sensible model would come up with “red color shoes”. Using the language model to come…
What are knowledge graphs? When would you need a knowledge graph over say a database to store information?
A knowledge graph organizes real world knowledge as entities and relationships between entities. Creating a knowledge graph often involves scraping / ingesting unstructured data and creating structure out of it by extracting entities and relationships automatically. Examples of knowledge graphs are Wordnet, DBpedia,.. A database also is a knowledge graph in some sense. Since…
What are the advantages and disadvantages of using naive bayes for spam detection?
Disadvantages: Naive bayes is based on the conditional independence of features assumption – an assumption that is not valid in many real world scenarios. Hence it sometimes oversimplifies the problem by saying features are independant and gives sub par performance. Advantages: However, naive bayes is very efficient. It is a model you can train in…
How can you increase the recall of a search query (on search engine or e-commerce site) result without changing the algorithm ?
Since we are not allowed to change the algorithm, we can only play with modifying or augmenting the search query. (Note, we either change the algorithm/model or the data, here we can only change the data, in other words modifying the search query.) Modifying the query in a way that we get results relevant to…
How will you build an auto suggestion feature for a messaging app or google search?
Auto Suggestion feature involves recommending the next word in a sentence or a phrase. For this, we need to build a language model on large enough corpus of “relevant” data. There are 2 caveats here – large corpus because we need to cover almost every case. This is important for recall. relevant data is useful…