You are building a natural language search box for a website. How do you accommodate spelling errors?

If you have a dictionary of words, edit distance against that dictionary is the simplest way to handle misspellings. However, sometimes corrections based on context make more sense. For instance, suppose I type “bed color shoes” – these are all perfectly valid dictionary words, but a sensible model would come up with “red color shoes”. Using the language model to come…
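A minimal sketch of the dictionary approach, using a classic dynamic-programming Levenshtein distance (the dictionary and query below are toy examples; a context-aware corrector would rescore candidates with a language model instead of distance alone):

```python
def edit_distance(a: str, b: str) -> int:
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

dictionary = ["red", "bed", "shoes", "color", "colour"]
query = "colr"
# Pick the dictionary word closest to the misspelled query.
print(min(dictionary, key=lambda w: edit_distance(query, w)))  # 'color'
```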

What are knowledge graphs? When would you need a knowledge graph over, say, a database to store information?

A knowledge graph organizes real-world knowledge as entities and the relationships between them. Creating a knowledge graph often involves scraping or ingesting unstructured data and imposing structure on it by automatically extracting entities and relationships. Examples of knowledge graphs are WordNet, DBpedia, … A database is also a knowledge graph in some sense. Since…
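As a toy illustration, a knowledge graph can be thought of as a set of subject-predicate-object triples; the entities and relations below are made up for illustration, not drawn from any real knowledge graph:

```python
# Facts as subject-predicate-object triples (illustrative names).
triples = [
    ("Paris", "capital_of", "France"),
    ("France", "located_in", "Europe"),
    ("Paris", "instance_of", "City"),
]

# Relationship-centric queries are what graphs make cheap:
# e.g. everything directly related to "Paris".
for s, p, o in triples:
    if s == "Paris":
        print(f"{s} --{p}--> {o}")
```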

What is shallow parsing?

Typically we have a generative grammar that tells us how a sentence is generated from a set of rules. Parsing is the process of finding a parse tree that is consistent with the grammar rules – in other words, we want to find the set of grammar rules and the sequence in which they were applied to generate the sentence…
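Shallow parsing (also called chunking), by contrast, does not build the full tree: it only identifies flat, non-recursive phrases such as noun phrases. A minimal sketch using NLTK's RegexpParser, with hand-tagged tokens so no tagger model needs to be downloaded (assumes nltk is installed):

```python
import nltk

# An NP chunk: optional determiner, any adjectives, one or more nouns.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")

tagged = [("the", "DT"), ("quick", "JJ"), ("brown", "JJ"),
          ("fox", "NN"), ("jumps", "VBZ"), ("over", "IN"),
          ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]

print(chunker.parse(tagged))
# (S (NP the/DT quick/JJ brown/JJ fox/NN) jumps/VBZ over/IN
#    (NP the/DT lazy/JJ dog/NN))
```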

What are the advantages and disadvantages of using Naive Bayes for spam detection?

Disadvantages: Naive Bayes is built on the assumption that features are conditionally independent given the class – an assumption that does not hold in many real-world scenarios. It therefore sometimes oversimplifies the problem by treating features as independent and gives subpar performance. Advantages: However, Naive Bayes is very efficient. It is a model you can train in…
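A minimal sketch of a Naive Bayes spam classifier with scikit-learn, assuming a toy labeled corpus (a real system needs far more data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["win a free prize now", "meeting at noon tomorrow",
         "free cash click here", "project update attached"]
labels = ["spam", "ham", "spam", "ham"]

# Bag-of-words counts feeding a multinomial Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["claim your free prize"]))  # -> ['spam']
```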

Explain Latent Dirichlet Allocation – where is it typically used?

Latent Dirichlet Allocation is a probabilistic model that represents a document as a multinomial mixture of topics and each topic as a multinomial mixture of words. Each of these multinomials has a Dirichlet prior. The goal is to learn these multinomial proportions using probabilistic inference techniques, based on the observed data, which is the words/content…
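A minimal sketch of fitting LDA with scikit-learn on a toy corpus; the number of topics (n_components=2) is an assumption chosen for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the cat sat on the mat", "dogs and cats are pets",
        "stocks fell as markets slid", "investors sold shares today"]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)   # per-document topic proportions
print(doc_topics.round(2))
```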

Where would you not want to remove stop words?

In most applications, stop words can be removed when you are using bag-of-words features. One exception is sentiment analysis: ‘not’ appears on most stop-word lists, but removing it flips the meaning of phrases like ‘not good’. When you are not using bag of words – in any model where context is required, say n-grams or sequence-to-sequence models – removing…
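A toy illustration of the pitfall, using a small hypothetical stop-word list:

```python
# Hypothetical stop-word list that, like most real ones, contains "not".
stop_words = {"the", "is", "not", "a"}

review = "the movie is not good"
filtered = [w for w in review.split() if w not in stop_words]
print(filtered)  # ['movie', 'good'] -- the negation is lost
```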

You are trying to cluster documents using a Bag of Words method. Typically words like if, of, is and so on are not great features. How do you make sure you are leveraging the more informative words better during feature engineering?

Words like if, of, … are called stop words. Typical pre-processing in a standard NLP pipeline involves identifying and removing stop words (except in some cases where context/word-adjacency information is important). Common techniques to down-weight or remove stop words include:

- TF-IDF (term frequency–inverse document frequency) weighting
- Leveraging manually curated stop-word lists and eliminating…
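A minimal sketch of the TF-IDF intuition with scikit-learn: words that appear in every document get the lowest IDF, so their TF-IDF features are discounted (toy three-document corpus):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "the market rallied today"]

vec = TfidfVectorizer()
vec.fit(docs)

# Sort terms by IDF: "the" (present in all docs) comes out lowest.
for term, idf in sorted(zip(vec.get_feature_names_out(), vec.idf_),
                        key=lambda pair: pair[1]):
    print(f"{term}: {idf:.3f}")
```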

Which is better to use while extracting features: character n-grams or word n-grams? Why?

Both have their uses. Character n-grams are great where character-level information is important: for example, spelling correction, language identification, writer identification (i.e. fingerprinting), and anomaly detection. Word n-grams are more appropriate for tasks that depend on word co-occurrence, for instance machine translation, spam detection and so on. Character-level n-grams are also much more space-efficient, since the vocabulary is bounded by the alphabet rather than growing with the corpus. However…
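A minimal sketch contrasting the two feature types with scikit-learn's CountVectorizer (toy text; the n-gram ranges are illustrative choices):

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["red color shoes"]

word_vec = CountVectorizer(analyzer="word", ngram_range=(1, 2))
char_vec = CountVectorizer(analyzer="char", ngram_range=(2, 3))

word_vec.fit(text)
char_vec.fit(text)

print(word_vec.get_feature_names_out())  # word uni- and bigrams
print(char_vec.get_feature_names_out())  # character 2- and 3-grams
```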

If you don’t have a stop-word dictionary or are working on a new language, what approach would you take to remove stop words?

TF-IDF (term frequency–inverse document frequency) is a popular approach that can be leveraged to eliminate stop words, and the technique is language-independent. The intuition here is that words that occur commonly across almost all documents are stop words. On the other hand, words that occur commonly, but only in some of the documents…
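A minimal, language-independent sketch: rank words by document frequency and treat the most ubiquitous as stop-word candidates (toy French corpus; the 0.75 threshold is an assumption to tune per corpus):

```python
from collections import Counter

docs = ["la maison est grande", "le chat est noir",
        "la porte est ouverte", "le livre est vieux"]

df = Counter()
for doc in docs:
    df.update(set(doc.lower().split()))   # count each word once per doc

n_docs = len(docs)
# Words appearing in (almost) every document are stop-word candidates.
candidates = [w for w, c in df.most_common() if c / n_docs >= 0.75]
print(candidates)  # ['est'] -- plus whatever else clears the threshold
```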

Can you find the antonyms of a word given a large enough corpus? For example, black => white or rich => poor. If yes, then how; otherwise, justify your answer.

Pre-existing databases: There are several curated lexical databases with antonym relations, such as WordNet, from which you can directly look up the antonyms of a given word. Hearst-style patterns: Given some seed antonym pairs, one can find textual patterns in which known antonyms tend to co-occur. X, not Y: “It…
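A toy sketch of the pattern idea: harvest “X, not Y” pairs with a regex (a real extractor would add POS filtering and induce more patterns from the seed pairs):

```python
import re

corpus = ("The shoes were black, not white. "
          "He felt rich, not poor, after the sale.")

# Capture the words on either side of the ", not" contrast pattern.
pattern = re.compile(r"\b(\w+), not (\w+)\b")
print(pattern.findall(corpus))  # [('black', 'white'), ('rich', 'poor')]
```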