Latent Dirichlet Allocation (LDA) is a probabilistic model that represents each document as a multinomial mixture over topics and each topic as a multinomial distribution over words. Each of these multinomials has a Dirichlet prior. The goal is to learn these multinomial proportions using probabilistic inference techniques based on the observed data, which is the words/content…
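The generative story behind LDA can be sketched in a few lines of plain Python. The toy topics, vocabulary, and the `sample_dirichlet`/`generate_document` helpers below are all invented for illustration, not part of any library:

```python
import random

def sample_dirichlet(alpha):
    """Draw one sample from Dirichlet(alpha) via the standard
    gamma-normalisation construction."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, topic_word_probs, doc_topic_alpha):
    """Follow LDA's generative story for one toy document:
    1. draw a per-document topic mixture theta ~ Dirichlet(alpha)
    2. for each word, draw a topic z ~ Multinomial(theta),
       then a word w ~ Multinomial(phi_z)."""
    theta = sample_dirichlet(doc_topic_alpha)
    words = []
    for _ in range(n_words):
        z = random.choices(range(len(theta)), weights=theta)[0]
        vocab = list(topic_word_probs[z])
        probs = list(topic_word_probs[z].values())
        words.append(random.choices(vocab, weights=probs)[0])
    return theta, words

# Two invented topics over a tiny vocabulary.
topics = [
    {"goal": 0.5, "match": 0.3, "team": 0.2},    # a "sports" topic
    {"vote": 0.4, "party": 0.4, "policy": 0.2},  # a "politics" topic
]
theta, doc = generate_document(10, topics, doc_topic_alpha=[0.5, 0.5])
```

Inference in LDA runs this story in reverse: given only the observed words, it recovers the per-document mixtures (theta) and per-topic word distributions.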
Category: Natural Language Processing
Where would you not want to remove stop words?
In most applications, stop words can be removed when you are using bag-of-words features. One exception is sentiment analysis, where ‘not’ should not be removed even though it is a stop word. When you are not using bag of words, in any model where context is required, say n-grams or sequence-to-sequence models, removing…
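A tiny sketch of why blanket stop-word removal hurts sentiment analysis. The stop-word list here is an invented illustrative one; real pipelines use curated lists such as NLTK's:

```python
# Illustrative stop-word list (invented for this example).
STOP_WORDS = {"the", "is", "a", "not", "of", "this"}

def remove_stop_words(tokens):
    """Drop any token that appears in the stop-word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

review = "this movie is not good".split()
print(remove_stop_words(review))  # ['movie', 'good'] -- the negation is lost,
# so a bag-of-words sentiment model would read this as a positive review.
```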
You are trying to cluster documents using a bag-of-words method. Typically, words like if, of, is and so on are not great features. How do you make sure you are leveraging the more informative words better during feature engineering?
Words like if, of, … are called stop words. Typical pre-processing in a standard NLP pipeline involves identifying and removing stop words (except in some cases where context/word-adjacency information is important). Common techniques to remove stop words include: TF-IDF (term frequency-inverse document frequency); leveraging manually curated stop-word lists and eliminating…
Which is better to use while extracting features: character n-grams or word n-grams? Why?
Both have their uses. Character n-grams are great where character-level information is important, for example spelling correction, language identification, writer identification (i.e. fingerprinting) and anomaly detection. Word n-grams are more appropriate for tasks that depend on word co-occurrence, for instance machine translation, spam detection and so on. Character-level n-grams are much more efficient. However…
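A minimal sketch of the difference, using a hypothetical `ngrams` helper that works on both a word list and a raw string:

```python
def ngrams(seq, n):
    """All contiguous n-grams from a sequence (list of words or string)."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

text = "the cat sat"
word_bigrams = ngrams(text.split(), 2)   # operates on word tokens
char_trigrams = ngrams(text, 3)          # operates on the raw character sequence

print(word_bigrams)   # [['the', 'cat'], ['cat', 'sat']]
print(char_trigrams)  # 'the', 'he ', 'e c', ... including whitespace
```

Note the character trigrams cross word boundaries and include whitespace, which is exactly what makes them useful for language identification and fingerprinting.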
If you don’t have a stop-word dictionary or are working on a new language, what approach would you take to remove stop words?
TF-IDF (term frequency-inverse document frequency) is a popular approach that can be leveraged to eliminate stop words. This technique is language-independent. The intuition here is that words that occur in almost all documents are stop words. On the other hand, words that occur commonly, but only in some of the documents…
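The document-frequency intuition can be sketched directly: compute IDF = log(N / df) per word, and words scoring near zero are stop-word candidates in any language. The corpus and the `idf_scores` helper below are invented for illustration:

```python
import math

def idf_scores(documents):
    """IDF = log(N / df). Words appearing in nearly every document
    score near zero and are stop-word candidates."""
    n = len(documents)
    df = {}
    for doc in documents:
        for word in set(doc.lower().split()):
            df[word] = df.get(word, 0) + 1
    return {w: math.log(n / c) for w, c in df.items()}

docs = [
    "the cat sat on the mat",
    "the dog barked at the cat",
    "the sun is bright",
]
scores = idf_scores(docs)
# "the" appears in all 3 documents -> idf = log(3/3) = 0, flagging it
# as a stop word with no dictionary and no language-specific knowledge.
```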
Can you find the antonyms of a word given a large enough corpus? For example, black => white or rich => poor. If yes, how; otherwise, justify your answer.
Pre-existing Databases: There are several curated antonym databases, such as WordNet, from which you can directly check whether you can get the antonyms of a given word. Hearst Patterns: Given some seed antonym pairs, one can find patterns in text for how known antonyms tend to occur. X, not Y: “It…
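The "X, not Y" pattern can be mined with a simple regular expression over the corpus. This is only a sketch (real pattern-based extraction filters candidates against seed pairs and frequency thresholds), and the example sentences are invented:

```python
import re

# The "X, not Y" contrast pattern mentioned above.
PATTERN = re.compile(r"\b(\w+), not (\w+)\b")

def find_antonym_candidates(text):
    """Return (X, Y) pairs matching the 'X, not Y' pattern."""
    return PATTERN.findall(text)

corpus = "It was cheap, not expensive. The room was clean, not dirty."
print(find_antonym_candidates(corpus))
# [('cheap', 'expensive'), ('clean', 'dirty')]
```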
What are the advantages and disadvantages of using rule-based approaches in NLP?
Cold start: Often, when we have the cold-start problem in machine learning (no data to begin with), rule-based approaches make sense. For example, you want to recommend products to customers, but how do you start without data? It makes sense to build rules so that the system can start delivering. This will…
How can you increase the recall of a search query (on a search engine or e-commerce site) result without changing the algorithm?
Since we are not allowed to change the algorithm, we can only play with modifying or augmenting the search query. (Note: we can change either the algorithm/model or the data; here we can only change the data, in other words modify the search query.) Modifying the query in a way that we get results relevant to…
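Query expansion with synonyms is one such augmentation. A minimal sketch, assuming a hand-written synonym table; production systems would instead mine synonyms from query logs, click data, or embeddings:

```python
# Hypothetical synonym table, invented for this example.
SYNONYMS = {
    "sneakers": ["trainers", "running shoes"],
    "couch": ["sofa"],
}

def expand_query(query):
    """Augment the query terms with synonyms so the (unchanged)
    retrieval algorithm matches more relevant documents: higher recall."""
    terms = query.lower().split()
    expanded = list(terms)
    for t in terms:
        expanded.extend(SYNONYMS.get(t, []))
    return expanded

print(expand_query("red sneakers"))
# ['red', 'sneakers', 'trainers', 'running shoes']
```

The trade-off is the usual one: casting a wider net raises recall but can lower precision, so expansions are typically down-weighted relative to the original terms.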
How will you build the automatic/smart reply feature on an app like Gmail or LinkedIn?
Generating replies on the fly: Smart reply can be built using sequence-to-sequence modeling. An incoming mail acts as the input to the model and the reply is the output of the model. Encoder-decoder architectures are often used for sequence-to-sequence tasks such as smart reply. Picking one of the pre-existing templates:…
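The template-picking route can be sketched as retrieval over a fixed set of canned replies. Token overlap below is only a crude stand-in for the trained scoring model a real system would use, and the templates and incoming message are invented:

```python
import re

def tokens(text):
    """Lowercase word tokens, keeping apostrophes (e.g. \"can't\")."""
    return set(re.findall(r"[a-z']+", text.lower()))

def overlap(a, b):
    """Jaccard similarity between the token sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

# Invented reply templates; real systems curate and cluster these.
TEMPLATES = [
    "Sounds good, see you then!",
    "Sorry, I can't make it.",
    "Thanks for the update.",
]

def pick_reply(incoming):
    """Pick the highest-scoring canned reply for an incoming message.
    A production system would rank templates with a model trained on
    (mail, chosen reply) pairs; overlap here is just a placeholder score."""
    return max(TEMPLATES, key=lambda t: overlap(incoming, t))

print(pick_reply("Does tomorrow sound good to you?"))
```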
How will you build an auto-suggestion feature for a messaging app or Google Search?
An auto-suggestion feature involves recommending the next word in a sentence or phrase. For this, we need to build a language model on a large enough corpus of “relevant” data. There are 2 caveats here: a large corpus, because we need to cover almost every case (this is important for recall); relevant data is useful…
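A bigram language model is the simplest version of this idea: count which word follows which, then suggest the most frequent continuations. A minimal sketch on an invented toy corpus:

```python
from collections import defaultdict, Counter

def train_bigram_model(corpus):
    """Count next-word frequencies for every word in the corpus."""
    model = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for w1, w2 in zip(words, words[1:]):
            model[w1][w2] += 1
    return model

def suggest(model, word, k=2):
    """Top-k most likely next words after `word`."""
    return [w for w, _ in model[word.lower()].most_common(k)]

corpus = [
    "how are you doing",
    "how are things",
    "how are you feeling",
]
model = train_bigram_model(corpus)
print(suggest(model, "are"))  # ['you', 'things'] -- "you" follows "are" twice
```

Production systems use longer contexts (higher-order n-grams or neural language models) and smoothing, but the prediction interface is the same: given a prefix, return the top-k continuations.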