Which is better to use while extracting features, character n-grams or word n-grams? Why?

Both have their uses. Character n-grams are great where character-level information is important, for example: spelling correction, language identification, writer identification (i.e. fingerprinting), anomaly detection. Word n-grams are more appropriate for tasks that depend on word co-occurrence, for instance machine translation, spam detection and so on. Character-level n-grams also yield a much smaller vocabulary, which makes them more efficient. However…
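To make the distinction concrete, here is a minimal sketch of extracting both kinds of features from a string (the function names and sample sentence are illustrative, not from the original answer):

```python
def char_ngrams(text, n):
    """Extract overlapping character n-grams from a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(text, n):
    """Extract word n-grams from a whitespace-tokenized string."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

print(char_ngrams("language", 3))
# ['lan', 'ang', 'ngu', 'gua', 'uag', 'age']
print(word_ngrams("language identification with ngrams", 2))
# [('language', 'identification'), ('identification', 'with'), ('with', 'ngrams')]
```

Note how the character trigrams of a single word already capture sub-word spelling information, while word bigrams capture co-occurrence across tokens.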

Can you find the antonyms of a word given a large enough corpus? For example, black => white or rich => poor, etc. If yes, then how; otherwise, justify your answer.

Pre-existing databases: There are several curated lexical databases, such as WordNet, from which you can directly look up the antonyms of a given word. Hearst patterns: Given some seed antonym pairs, one can find patterns in text showing how known antonyms tend to co-occur. X, not Y : “It…
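The pattern-based idea can be sketched with a simple regular expression over a toy corpus. The corpus, the single "X, not Y" pattern, and the function name below are all illustrative assumptions; a real system would combine many contrast patterns and much more text:

```python
import re

# Hypothetical mini-corpus; in practice this would be a large text collection.
corpus = [
    "The house looked black, not white, in the fading light.",
    "He grew up rich, not poor, and it showed.",
    "The test was hard, not easy, for most students.",
]

# "X, not Y" is one classic contrast pattern; real systems use several
# (e.g. "either X or Y", "from X to Y") plus seed antonym pairs.
PATTERN = re.compile(r"\b(\w+), not (\w+)\b")

def extract_antonym_candidates(sentences):
    """Collect (X, Y) word pairs matching the contrast pattern."""
    pairs = set()
    for sentence in sentences:
        for x, y in PATTERN.findall(sentence):
            pairs.add((x.lower(), y.lower()))
    return pairs

print(extract_antonym_candidates(corpus))
# {('black', 'white'), ('rich', 'poor'), ('hard', 'easy')}
```

Candidates extracted this way are noisy, so they are usually filtered or scored against seed pairs before being accepted as antonyms.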

How can you increase the recall of a search query (on a search engine or e-commerce site) results without changing the algorithm?

Since we are not allowed to change the algorithm, we can only modify or augment the search query. (Note: we can change either the algorithm/model or the data; here we can only change the data, in other words the search query.) Modifying the query in a way that we get results relevant to…
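One common way to augment the query is synonym expansion: each term is OR-ed with known variants, so more documents match and recall goes up. The synonym map and function below are a hypothetical sketch; in practice the synonyms could come from WordNet, embeddings, or query logs:

```python
# Hypothetical synonym map; a real system would derive this from
# a lexical resource, word embeddings, or historical query logs.
SYNONYMS = {
    "sofa": ["couch", "settee"],
    "cheap": ["affordable", "budget"],
}

def expand_query(query):
    """Augment each query term with known synonyms (OR semantics)."""
    expanded = []
    for term in query.lower().split():
        variants = [term] + SYNONYMS.get(term, [])
        expanded.append("(" + " OR ".join(variants) + ")")
    return " ".join(expanded)

print(expand_query("cheap sofa"))
# (cheap OR affordable OR budget) (sofa OR couch OR settee)
```

The trade-off is the usual one: expanding the query raises recall but can lower precision, so expansions are often weighted lower than the original terms.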

How will you build the automatic/smart reply feature on an app like Gmail or LinkedIn?

Generating replies on the fly: Smart reply can be built using sequence-to-sequence modeling. An incoming mail acts as the input to the model, and the reply is the model's output. An encoder-decoder architecture is often used for sequence-to-sequence tasks such as smart reply. Picking one of the pre-existing templates:…
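The template-based variant can be sketched without any neural model: score each canned reply against the incoming message and pick the best match. The templates, tokenizer, and similarity measure below are illustrative assumptions (a production system would use a trained ranking model):

```python
import math
from collections import Counter

# Hypothetical canned replies; real systems curate these from reply logs.
TEMPLATES = [
    "Thanks, I will take a look.",
    "Sounds good, see you then.",
    "Sorry, I can't make it.",
]

def bow(text):
    """Very crude bag-of-words: lowercase, strip basic punctuation."""
    cleaned = text.lower().replace(",", "").replace(".", "").replace("'", "").replace("?", "")
    return Counter(cleaned.split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def best_template(incoming):
    """Return the template most similar to the incoming message."""
    scores = [(cosine(bow(incoming), bow(t)), t) for t in TEMPLATES]
    return max(scores)[1]

print(best_template("Can you take a look at the attached report?"))
# Thanks, I will take a look.
```

In practice the seq2seq model and the template approach are often combined: the model retrieves or ranks templates rather than generating free text, which keeps replies safe and grammatical.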

What can you say about the most frequent and most rare words? Why are they important or not important?

Most frequent words are usually stop words such as "in", "that", "so", "what", "are", "this", "the", "a", "is", etc. Rare words could arise from spelling mistakes or from a word simply being sparsely used in the data set. Usually, neither the most frequent nor the most rare words are useful for providing contextual information. As stop-words…
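A quick frequency analysis makes both ends of the distribution visible; the toy corpus below is an illustrative assumption:

```python
from collections import Counter

corpus = (
    "the cat sat on the mat and the dog sat on the rug "
    "the cat and the dog are friends xylograph"
)
counts = Counter(corpus.split())

# Top of the distribution is dominated by stop words like "the".
print(counts.most_common(3))
# Bottom of the distribution (hapaxes) often contains typos or rare terms.
hapaxes = [w for w, c in counts.items() if c == 1]
print(hapaxes)
```

Inspecting both lists before deciding on frequency cutoffs is cheap and often reveals tokenization problems as well.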

What will happen if you do not convert all characters to a single case (either lower or upper) during the pre-processing step of an NLP algorithm?

When all words are not converted to a single case, the vocabulary size will increase drastically, as words like Up/up or Fast/fast or This/this will be treated differently, which isn't correct behaviour for most NLP tasks. Sparsity is also higher when building the language model, since "the cat" is treated differently from "The cat". Suppose…
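The vocabulary-size effect is easy to demonstrate on a small sample (the sentence below is an illustrative assumption):

```python
tokens = "The cat saw the Cat near THE mat".split()

# Without case folding, "The", "the", and "THE" are three distinct types.
vocab_cased = set(tokens)
# With case folding, they collapse into one.
vocab_lower = {t.lower() for t in tokens}

print(len(vocab_cased), len(vocab_lower))
# 8 5
```

On a real corpus the relative shrinkage is smaller, but the principle is the same: case folding merges spurious variants and reduces sparsity. The flip side is that it also merges genuinely distinct tokens (e.g. "US" vs. "us"), so some pipelines apply truecasing instead.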