Both have their uses. Character n-grams are great where character-level information is important, for example spelling correction, language identification, writer identification (i.e. fingerprinting) and anomaly detection. Word n-grams are more appropriate for tasks that depend on word co-occurrence, for instance machine translation, spam detection and so on. Character-level n-grams also operate over a much smaller vocabulary, which makes them more efficient. However…
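As a quick illustration, here is a minimal Python sketch (with made-up example text) contrasting the two tokenizations:

```python
# Minimal sketch contrasting character and word n-grams (illustrative text only).
text = "the cat sat"

def char_ngrams(s, n):
    """All overlapping character n-grams, including spaces."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def word_ngrams(s, n):
    """All overlapping word n-grams."""
    words = s.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(char_ngrams(text, 3))  # ['the', 'he ', 'e c', ' ca', 'cat', 'at ', 't s', ' sa', 'sat']
print(word_ngrams(text, 2))  # ['the cat', 'cat sat']
```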
If you don’t have a stop-word dictionary or are working on a new language, what approach would you take to remove stop words?
TF-IDF (term frequency - inverse document frequency) is a popular, language-independent technique that can be leveraged to eliminate stop words. The intuition here is that words which occur in almost all documents are likely stop words. On the other hand, words that occur commonly, but only in some of the documents…
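A minimal sketch of this idea, using a tiny made-up corpus and an illustrative IDF threshold:

```python
import math
from collections import Counter

# Toy corpus; in practice this would be a large collection of documents.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a cat and a dog played in the park",
    "the stock market closed higher today",
]

# Document frequency: number of documents each word appears in.
df = Counter()
for doc in docs:
    df.update(set(doc.split()))

n_docs = len(docs)
# IDF = log(N / df); words with IDF close to zero occur in almost every
# document and are likely stop-word candidates.
idf = {w: math.log(n_docs / c) for w, c in df.items()}

threshold = 0.2  # illustrative cut-off, would be tuned on real data
stop_candidates = sorted(w for w, v in idf.items() if v <= threshold)
print(stop_candidates)  # ['the'] on this toy corpus
```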
Can you find the antonyms of a word given a large enough corpus? For example, black => white or rich => poor. If yes, then how; otherwise justify your answer.
Pre-existing databases: There are several curated antonym databases, such as WordNet, from which you can directly look up the antonyms of a given word. Hearst-style patterns: Given some seed antonym pairs, one can mine the text for the patterns in which known antonyms tend to occur. X, not Y : “It…
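A rough sketch of the pattern idea, using a hand-written regex and made-up sentences rather than a real corpus:

```python
import re

# Toy sentences; a real corpus would be far larger.
sentences = [
    "The walls were painted black, not white.",
    "He grew up rich, not poor.",
    "The review was positive, not negative.",
]

# "X, not Y" pattern: two single words joined by ", not" often signal antonyms.
pattern = re.compile(r"\b(\w+), not (\w+)\b")

candidate_pairs = []
for s in sentences:
    candidate_pairs.extend(pattern.findall(s.lower()))

print(candidate_pairs)
# [('black', 'white'), ('rich', 'poor'), ('positive', 'negative')]
```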
What are the advantages and disadvantages of using rule-based approaches in NLP?
Cold start: Often, when we face the cold-start problem in machine learning (no data to begin with), rule-based approaches make sense. For example, you want to recommend products to customers, but how do you start without any data? It makes sense to build rules so that the system can start delivering. This will…
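As a toy illustration (the product names and rules below are invented for the example), a rule-based cold-start recommender can be little more than a list of condition/recommendation pairs:

```python
# Minimal sketch of a rule-based recommender for a cold-start setting.
RULES = [
    # (condition on the item just viewed, products to recommend)
    (lambda item: item["category"] == "phone", ["phone case", "screen protector"]),
    (lambda item: item["category"] == "laptop", ["laptop sleeve", "wireless mouse"]),
    (lambda item: item["price"] > 500, ["extended warranty"]),
]

def recommend(item):
    """Return every recommendation whose rule fires for the viewed item."""
    recs = []
    for condition, products in RULES:
        if condition(item):
            recs.extend(products)
    return recs

print(recommend({"category": "phone", "price": 799}))
# ['phone case', 'screen protector', 'extended warranty']
```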
How can you increase the recall of a search query’s results (on a search engine or e-commerce site) without changing the algorithm?
Since we are not allowed to change the algorithm, we can only modify or augment the search query itself. (Note: we can change either the algorithm/model or the data; here we can only change the data, in other words the search query.) Modifying the query in a way that we get results relevant to…
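One common way to augment the query is synonym expansion. Below is a minimal sketch, with a hand-made synonym table standing in for what would normally come from WordNet, query logs or embeddings:

```python
# Minimal sketch of query expansion to raise recall without touching the
# ranking algorithm. The synonym table is made up for illustration.
SYNONYMS = {
    "sneakers": ["trainers", "running shoes"],
    "cheap": ["affordable", "budget"],
}

def expand_query(query):
    """OR each query term with its known synonyms."""
    expanded_terms = []
    for term in query.lower().split():
        variants = [term] + SYNONYMS.get(term, [])
        expanded_terms.append("(" + " OR ".join(variants) + ")")
    return " ".join(expanded_terms)

print(expand_query("cheap sneakers"))
# (cheap OR affordable OR budget) (sneakers OR trainers OR running shoes)
```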
How will you build the automatic/smart reply feature on an app like Gmail or LinkedIn?
Generating replies on the fly: Smart reply can be built using sequence-to-sequence modeling, where the incoming mail is the input to the model and the reply is its output. An encoder-decoder architecture is often used for sequence-to-sequence tasks such as smart reply. Picking one of the pre-existing templates:…
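For the template-picking variant, a minimal sketch can treat reply selection as plain text classification; the training messages and canned replies below are invented for illustration, and a real system would learn from large mail corpora:

```python
# Minimal sketch: classify an incoming message into one of a few canned replies.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_messages = [
    "Are you free for a quick call tomorrow?",
    "Can we meet on Friday afternoon?",
    "Thanks for sending over the report.",
    "Great work on the presentation, thank you!",
    "Please find the invoice attached.",
    "Attached is the signed contract.",
]
train_labels = [
    "Sure, that works for me.",
    "Sure, that works for me.",
    "You're welcome!",
    "You're welcome!",
    "Got it, thanks!",
    "Got it, thanks!",
]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_messages, train_labels)

print(model.predict(["Could we schedule a call next week?"])[0])
# -> "Sure, that works for me."
```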
How will you build an auto-suggestion feature for a messaging app or Google Search?
The auto-suggestion feature involves recommending the next word in a sentence or a phrase. For this, we need to build a language model on a large enough corpus of relevant data. There are two caveats here: a large corpus, because we need to cover almost every case (this is important for recall), and relevant data, which is useful…
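A minimal sketch of next-word suggestion with a simple bigram language model over a toy corpus (a real system would use a far larger corpus, smoothing, or a neural language model):

```python
from collections import Counter, defaultdict

corpus = [
    "see you tomorrow",
    "see you soon",
    "see you later today",
    "talk to you later",
]

# Count which word follows which.
following = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        following[prev][nxt] += 1

def suggest(prev_word, k=2):
    """Top-k most likely next words after prev_word."""
    return [w for w, _ in following[prev_word].most_common(k)]

print(suggest("you"))  # ['later', 'tomorrow'] on this toy corpus
```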
Why are bigrams, or n-grams in general, important in NLP tasks like sentiment classification or spam detection, and important enough to find them explicitly?
There are mainly two reasons. Some pairs of words occur together far more often than they occur individually, hence it is important to treat such co-occurring words as a single entity or a single token in training. For a named entity recognition problem, tokens such as “United States”, “North America”, “Red Wine” would make sense when…
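One standard way to find such bigrams explicitly is to score them with pointwise mutual information (PMI). A minimal sketch on a made-up snippet of text:

```python
# Score bigrams with PMI to surface pairs that co-occur more often than chance.
import math
from collections import Counter

tokens = ("the united states and the north of the united states share a border "
          "with canada and the united states flag is red white and blue").split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
n = len(tokens)

def pmi(pair):
    """PMI of a bigram: log2( P(w1, w2) / (P(w1) * P(w2)) )."""
    w1, w2 = pair
    return math.log2((bigrams[pair] / (n - 1)) / ((unigrams[w1] / n) * (unigrams[w2] / n)))

# Keep bigrams seen at least twice, ranked by PMI.
frequent = [b for b, c in bigrams.items() if c >= 2]
print(sorted(frequent, key=pmi, reverse=True))
# [('united', 'states'), ('the', 'united'), ('and', 'the')] on this toy text
```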
What can you say about the most frequent and the most rare words? Why are they important or not important?
The most frequent words are usually stop words like [“in”, “that”, “so”, “what”, “are”, “this”, “the”, “a”, “is”, …]. Rare words may come from spelling mistakes or from the word being used only sparsely in the data set. Usually neither the most frequent nor the most rare words provide useful contextual information. Very frequent words are called stop words. As stop-words…
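A quick way to inspect both ends of the distribution is a simple frequency count; the text below is a toy example only:

```python
from collections import Counter

text = ("this is the house that jack built this is the malt that lay in "
        "the house that jack built")
counts = Counter(text.split())

print(counts.most_common(3))      # most frequent words: typically stop words
print(counts.most_common()[-3:])  # rarest words: often typos or very specific terms
```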
What will happen if you do not convert all characters to a single case (either lower or upper) during the pre-processing step of an NLP algorithm?
When all words are not converted to a single case, the vocabulary size increases drastically, since words like Up/up, Fast/fast or This/this are treated differently, which is not the desired behaviour for most NLP tasks. Sparsity is also higher when building a language model, since “the cat” is treated differently from “The cat”. Suppose…
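A tiny sketch of the vocabulary-size effect, on a made-up sentence:

```python
# Minimal sketch showing how skipping case folding inflates the vocabulary.
text = "The cat chased the mouse. THE mouse hid. Then the cat slept."
tokens = text.replace(".", "").split()

vocab_cased = set(tokens)
vocab_lower = set(t.lower() for t in tokens)

print(len(vocab_cased), len(vocab_lower))  # 9 7
# "The", "THE" and "the" collapse into a single entry after lowercasing.
```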