You are trying to cluster documents using a Bag of Words method. Typically, words like "if", "of", "is" and so on are not great features. How do you make sure you are leveraging the more informative words better during feature engineering?

Words like "if", "of", … are called stop words. Typical pre-processing in a standard NLP pipeline involves identifying and removing stop words (except in some cases where context/word-adjacency information is important). Common techniques to remove stop words include:
- TF-IDF (term frequency-inverse document frequency)
- Leveraging manually curated stop-word lists and eliminating…
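A minimal sketch of the curated-list approach. The stop-word set here is a toy illustration; real pipelines use full per-language lists such as those shipped with NLTK or spaCy:

```python
import re

# Toy stop-word list for illustration only; production pipelines use
# curated lists (e.g. NLTK's or spaCy's per-language stop-word sets).
STOP_WORDS = {"if", "of", "is", "the", "a", "an", "and", "in", "that"}

def remove_stop_words(text):
    """Lowercase, tokenize on letter runs, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words("The cat is in the hat"))  # ['cat', 'hat']
```

The surviving tokens ("cat", "hat") are the informative ones that a Bag of Words clustering should weight.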

Which is better to use while extracting features: character n-grams or word n-grams? Why?

Both have their uses. Character n-grams are great where character-level information is important, for example: spelling correction, language identification, writer identification (i.e. fingerprinting), and anomaly detection. Word n-grams are more appropriate for tasks that depend on word co-occurrence, for instance machine translation, spam detection and so on. Character-level n-grams are also much more space-efficient, since the character vocabulary is far smaller than the word vocabulary. However…
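The two feature types can be sketched side by side; both extractors below are simple sliding windows over the input:

```python
def char_ngrams(text, n):
    """All contiguous character n-grams (spaces included)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(text, n):
    """All contiguous word n-grams, joined with a space."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(char_ngrams("spam", 3))            # ['spa', 'pam']
print(word_ngrams("free money now", 2))  # ['free money', 'money now']
```

Note how the character trigrams of "spam" would still partially match a misspelling like "spqm", which is why they suit spelling correction and fingerprinting.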

If you don’t have a stop-word dictionary or are working on a new language, what approach would you take to remove stop words?

TF-IDF (term frequency-inverse document frequency) is a popular approach that can be leveraged to eliminate stop words, and it is language-independent. The intuition here is that words occurring in almost all documents are stop words: their inverse document frequency, and hence their TF-IDF weight, is close to zero. On the other hand, words that occur commonly, but only in some of the documents…
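A from-scratch sketch of the IDF half of this idea, on a hypothetical three-document corpus. Words whose IDF is (near) zero appear in every document and are stop-word candidates in any language:

```python
import math
from collections import Counter

# Hypothetical toy corpus; in practice this would be the new language's
# document collection.
docs = [
    "the cat sat on the mat",
    "the dog ate the bone",
    "the bird sang a song",
]

def doc_frequencies(docs):
    """Number of documents each word appears in."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))
    return df

def idf(word, docs):
    """log(N / df): 0 for words present in every document."""
    return math.log(len(docs) / doc_frequencies(docs)[word])

# "the" occurs in all 3 documents -> idf == log(3/3) == 0 -> stop-word
# candidate; "cat" occurs in only 1 -> idf == log(3) > 0.
```

Ranking the vocabulary by IDF and cutting off the bottom of the list yields a language-independent stop-word list.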

Can you find the antonyms of a word given a large enough corpus? For example, black => white or rich => poor. If yes, then how; otherwise justify your answer.

Pre-existing databases: There are several curated antonym databases, such as WordNet, from which you can directly look up the antonyms of a given word. Hearst-style patterns: Given some seed antonym pairs, one can find patterns in text showing how known antonyms tend to co-occur. X, not Y: “It…
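The "X, not Y" pattern mentioned above can be mined with a simple regular expression. The corpus snippets here are made up for illustration; a real system would scan a large corpus and keep pairs that match several distinct patterns:

```python
import re

# Hypothetical corpus sentences containing the "X, not Y" cue.
corpus = [
    "He was rich, not poor, when he retired.",
    "The dress looked black, not white, under the lights.",
]

PATTERN = re.compile(r"\b(\w+), not (\w+)\b")

def antonym_candidates(sentences):
    """Extract (X, Y) pairs matching the 'X, not Y' pattern."""
    pairs = []
    for s in sentences:
        pairs.extend(PATTERN.findall(s))
    return pairs

print(antonym_candidates(corpus))
# [('rich', 'poor'), ('black', 'white')]
```

Pattern matches are noisy on their own ("tired, not sleepy" is not a true antonym pair), so candidates are usually scored by how many different patterns they appear in.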

How can you increase the recall of a search query (on a search engine or e-commerce site) result without changing the algorithm?

Since we are not allowed to change the algorithm, we can only modify or augment the search query. (Note: we can change either the algorithm/model or the data; here we can only change the data, in other words the search query.) Modifying the query in a way that we get results relevant to…
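One common form of query augmentation is synonym expansion, sketched below. The synonym map is a hypothetical stand-in; production systems derive it from query logs, a thesaurus, or word embeddings:

```python
# Hypothetical synonym map for illustration; real systems would mine
# this from query logs, a thesaurus, or embedding neighbors.
SYNONYMS = {
    "sofa": ["couch", "settee"],
    "sneakers": ["trainers"],
}

def expand_query(query):
    """Augment each query term with its known synonyms (OR semantics).

    More terms match more documents, raising recall without touching
    the ranking algorithm itself.
    """
    terms = []
    for word in query.lower().split():
        terms.append(word)
        terms.extend(SYNONYMS.get(word, []))
    return terms

print(expand_query("red sofa"))  # ['red', 'sofa', 'couch', 'settee']
```

The expanded term list is then submitted to the unchanged search backend as a disjunctive query.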

How will you build the automatic/smart reply feature on an app like Gmail or LinkedIn?

Generating replies on the fly: Smart reply can be built using sequence-to-sequence modeling, where an incoming mail acts as the input to the model and the reply is the output. An encoder-decoder architecture is often used for sequence-to-sequence tasks such as smart reply. Picking one of the pre-existing templates:…
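A toy sketch of the template-picking variant. The templates are made up, and the Jaccard token-overlap score is only a stand-in for the real ranker; a deployed system would score each candidate reply with a trained model (e.g. a seq2seq scorer estimating P(reply | incoming)):

```python
import re

# Hypothetical response templates; real systems cluster large volumes
# of historical replies into a curated response set.
TEMPLATES = [
    "Thanks, sounds good!",
    "Sorry, I can't make it.",
    "Let me get back to you.",
]

def tokens(text):
    return set(re.findall(r"[a-z']+", text.lower()))

def jaccard(a, b):
    """Token-overlap similarity: |A & B| / |A | B|."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

def pick_reply(incoming, candidates=TEMPLATES):
    """Pick the highest-scoring template for an incoming message.

    Jaccard overlap is a toy scoring function here; swap in a learned
    model for real ranking.
    """
    return max(candidates, key=lambda c: jaccard(incoming, c))
```

For example, `pick_reply("Sorry, I can't make it on Friday")` selects the apology template because it shares the most tokens with the incoming message.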

How will you build an auto-suggestion feature for a messaging app or Google Search?

The auto-suggestion feature involves recommending the next word in a sentence or a phrase. For this, we need to build a language model on a large enough corpus of “relevant” data. There are 2 caveats here: a large corpus, because we need to cover almost every case (this is important for recall); relevant data is useful…
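The simplest such language model is a bigram model: count which word follows which, then suggest the most frequent continuation. The corpus below is a tiny illustration; the answer's point is precisely that a real model needs a large, domain-relevant corpus:

```python
from collections import Counter, defaultdict

# Tiny illustrative corpus; a real model trains on a large corpus of
# relevant text (chat logs, query logs, etc.).
corpus = "see you soon . see you later . talk to you soon ."

counts = defaultdict(Counter)
words = corpus.split()
for prev, nxt in zip(words, words[1:]):
    counts[prev][nxt] += 1  # how often `nxt` follows `prev`

def suggest(prev_word):
    """Most frequent next word after prev_word, or None if unseen."""
    nxt = counts.get(prev_word)
    return nxt.most_common(1)[0][0] if nxt else None

print(suggest("you"))  # 'soon' ('soon' follows 'you' twice, 'later' once)
```

Higher-order n-grams (or a neural language model) condition on more context, at the cost of needing far more data.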

Why are bigrams or any n-grams important in NLP (tasks like sentiment classification or spam detection), or important enough to find them explicitly?

There are mainly 2 reasons. Some pairs of words occur together far more often than the frequencies of the individual words would predict, hence it is important to treat such co-occurring words as a single entity or a single token in training. For the named entity recognition problem, tokens such as “United States”, “North America”, “Red Wine” would make sense when…
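One standard way to find such pairs explicitly is pointwise mutual information (PMI), which compares a bigram's observed frequency with what the individual word frequencies would predict. A from-scratch sketch on a made-up corpus:

```python
import math
from collections import Counter

# Toy corpus; "new york" is a genuine collocation, "is big" is not.
tokens = ("new york is big and new york is busy "
          "and the city is big").split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
n = len(tokens)

def pmi(w1, w2):
    """log2( P(w1, w2) / (P(w1) * P(w2)) ): high for true collocations."""
    p_xy = bigrams[(w1, w2)] / (n - 1)
    p_x, p_y = unigrams[w1] / n, unigrams[w2] / n
    return math.log2(p_xy / (p_x * p_y))

# pmi("new", "york") is higher than pmi("is", "big"): "new" and "york"
# always occur together, while "is" and "big" also occur apart.
```

Bigrams whose PMI exceeds a threshold are then merged into single tokens (e.g. "new_york") before training.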

What can you say about the most frequent and most rare words? Why are they important or not important?

Most frequent words are usually stop words like [“in”, “that”, “so”, “what”, “are”, “this”, “the”, “a”, “is”, …etc]. Rare words could be the result of spelling mistakes, or of the word simply being used sparsely in the data set. Usually, neither the most frequent nor the most rare words are useful in providing contextual information. Very frequent words are called stop words. As stop-words…
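Both ends of the frequency distribution are easy to inspect with a counter. The text below is a made-up example in which the head of the distribution is a stop word and the tail contains a deliberate typo:

```python
from collections import Counter

# Toy text; 'catt' is a deliberate typo standing in for noisy rare words.
text = ("the cat sat on the mat and the dog sat on the rug "
        "while the catt slept")

counts = Counter(text.split())

most_frequent = counts.most_common(1)[0][0]          # 'the' (a stop word)
rare = [w for w, c in counts.items() if c == 1]      # includes 'catt'

print(most_frequent, sorted(rare))
```

In practice both ends are often trimmed before feature extraction: the head via a stop-word list or document-frequency cutoff, the tail via a minimum-count threshold.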