A compilation of NLP interview questions with answers that are commonly asked in Natural Language Processing interviews. We hope these questions help you crack your data science interviews …
- What is the difference between paraphrasing and textual entailment? Textual entailment is the task of determining whether a source text T implies a hypothesis text H. It is a unidirectional relationship. Example: text: If you help the needy, God ...
- You are trying to cluster documents using a Bag of Words method. Typically, words like if, of, is and so on are not great features. How do you make sure you are leveraging the more informative words better during feature engineering? Words like if, of, … are called stop words. Typical pre-processing in a standard NLP pipeline involves identifying and removing stop words (except in some cases where context / word adjacency information ...
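As an illustration, here is a minimal sketch using scikit-learn (our choice of library; the answer does not prescribe one). It removes stop words explicitly and also applies a max_df cutoff, a data-driven filter that drops terms appearing in too large a fraction of documents:

```python
# Minimal sketch: drop stop words and down-weight uninformative terms.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell as the market closed",
]

# stop_words='english' removes common function words; max_df=0.8 additionally
# drops any term that appears in more than 80% of documents.
vectorizer = TfidfVectorizer(stop_words="english", max_df=0.8)
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # informative terms only
```

TF-IDF weighting itself also down-weights ubiquitous words, which is why it doubles as a stop-word detector (see the TF-IDF question further down).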
- What will happen if you do not convert all characters to a single case (either lower or upper) during the pre-processing step of an NLP algorithm? When all words are not converted to a single case, the vocabulary size increases drastically, as words like Up/up or Fast/fast or This/this will be treated as different words, which isn’t ...
- What is the state-of-the-art technique for machine translation?
Rule-based machine translation (older technique): uses a dictionary between words of the two languages, along with syntactic, semantic, and morphological analysis of the source sentence to define context. Linguistic rules ...
- What are the advantages and disadvantages of using rule-based approaches in NLP?
Cold start: often, when we face the cold-start problem (no data to begin with) in machine learning, rule-based approaches make sense.
For example, you want to recommend ...
- You want to find food-related topics on Twitter – how do you go about it? One can use any of the standard topic models to get topics. However, to direct the topics to contain food-related information, specialized topic modeling algorithms are available.
However, one ...
- How do you deal with out-of-vocabulary words at run time when you build a language model? Out-of-vocabulary (OOV) words are words that are not in the training set but appear in the test set or real data. The main problem is that the model assigns a ...
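A minimal sketch of the most common fix (our own toy example): map rare training words to an <UNK> token so the model reserves probability mass for words it will not have seen at test time.

```python
from collections import Counter

train_tokens = "the cat sat on the mat the cat ran".split()
counts = Counter(train_tokens)

MIN_COUNT = 2  # hypothetical frequency threshold
vocab = {w for w, c in counts.items() if c >= MIN_COUNT}

def normalize(tokens, vocab):
    # Replace any out-of-vocabulary token with <UNK>.
    return [t if t in vocab else "<UNK>" for t in tokens]

print(normalize("the dog sat".split(), vocab))  # ['the', '<UNK>', '<UNK>']
```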
- Skip or Residual Connections in Deep Networks
https://www.youtube.com/watch?v=HW7Kv8HGdvM
The transformer model uses skip connections to promote accelerated learning through a deep architecture. This video explains skip (residual) connections, which enable building deep neural networks by bypassing challenges such ...
- How is long-term dependency maintained while building a language model?
Language models can be built using the following popular methods –
Using an n-gram language model
n-gram language models make an assumption about the value of n. The larger the value of n, the longer the ...
- Given a bigram language model, in what scenarios do we encounter zero probabilities? How should we handle these situations?
Recall that the bigram model can be expressed as:
P(w_1, …, w_m) ≈ ∏_i P(w_i | w_{i-1}), where P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})
Scenario 1 – Out-of-vocabulary (OOV) words – such words may not be present during training and hence ...
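To make the handling concrete, here is a minimal sketch (toy corpus of our own construction) of a bigram model with add-one (Laplace) smoothing, one standard remedy; unseen words are typically also mapped to <UNK> as in the OOV question above.

```python
from collections import Counter

sentences = [["<s>", "i", "like", "nlp", "</s>"],
             ["<s>", "i", "like", "ml", "</s>"]]

unigrams = Counter(t for s in sentences for t in s)
bigrams = Counter(b for s in sentences for b in zip(s, s[1:]))
V = len(unigrams)  # vocabulary size

def p_laplace(w_prev, w):
    # P(w | w_prev) with add-one smoothing: never exactly zero,
    # even for unseen bigrams.
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

print(p_laplace("like", "nlp"))     # seen bigram: 0.25
print(p_laplace("like", "vision"))  # unseen bigram: small but non-zero
```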
- BLEU Score
https://youtu.be/UV2ymKoMcyw
This brief video describes the BLEU score, a popular evaluation metric used for several tasks such as machine translation, text summarization and so on.
What is BLEU Score?
BLEU stands for ...
- What would you care more about – precision or recall – for a spam filtering problem?
A false positive means the mail was not spam but we called it spam; a false negative means it was spam but we didn’t label it as spam.
Precision = TP / (TP ...
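A tiny worked example (hypothetical counts) showing how the two metrics respond differently to the two error types:

```python
tp, fp, fn = 80, 5, 40  # hypothetical spam-filter counts

precision = tp / (tp + fp)  # of mails flagged spam, how many truly were spam
recall = tp / (tp + fn)     # of actual spam, how much did we catch

print(f"precision={precision:.2f}, recall={recall:.2f}")
# precision=0.94, recall=0.67 -> a cautious filter: few legitimate mails
# lost (good for spam filtering), but a third of spam slips through
```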
- Which is better to use while extracting features: character n-grams or word n-grams? Why? Both have their uses. Character n-grams are great where character-level information is important, e.g. spelling correction, language identification, writer identification (i.e. fingerprinting), anomaly detection. Word n-grams, on the other hand, are ...
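A quick illustration of the two feature types using scikit-learn's CountVectorizer (our choice of tool, not the answer's):

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["natural language"]

char_ngrams = CountVectorizer(analyzer="char", ngram_range=(3, 3)).fit(text)
print(char_ngrams.get_feature_names_out())  # includes 'nat', 'atu', 'tur', ...

word_ngrams = CountVectorizer(analyzer="word", ngram_range=(2, 2)).fit(text)
print(word_ngrams.get_feature_names_out())  # ['natural language']
```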
- What are the advantages and disadvantages of using Naive Bayes for spam detection? Disadvantages:
Naive Bayes is based on the assumption of conditional independence of features – an assumption that is not valid in many real-world scenarios. Hence it sometimes oversimplifies the problem ...
- What is speaker segmentation in speech recognition? How do you use it? Speaker diarization, or speaker segmentation, is the process of automatically assigning a speaker identity to each segment of an audio file. Segmenting by speaker is very useful in several applications ...
- How will you build an auto-suggestion feature for a messaging app or Google Search?
The auto-suggestion feature involves recommending the next word in a sentence or phrase. For this, we need to build a language model on a large enough corpus of “relevant” data. ...
- What is the difference between word2vec and GloVe? Both word2vec and GloVe enable us to represent a word in the form of a vector (often called an embedding). They are the two most popular algorithms for word embeddings that bring ...
- Positional Encoding in the Transformer Model
https://youtu.be/5wpzAk4THcI
Transformer models are super popular. With the quadratic attention layer, how does the sequential nature of the data get captured? Through positional encoding. This video briefly explains the concept of positional encoding ...
- What is the significance of n-grams in a language model? An n-gram is a sequence of n consecutive words/tokens/grams.
In general, n-grams can either preserve the ordering or indicate what level of dependency is required in order to ...
- Knowledge Distillation
https://www.youtube.com/watch?v=B2wGxgQfKxo
This video talks about model compression and what knowledge distillation is. It covers the distillation loss and the common frameworks employed for knowledge distillation.
- Can you find the antonyms of a word given a large enough corpus? For ex., black => white or rich => poor. If yes, then how; otherwise justify your answer.
Pre-existing databases: there are several curated lexical databases with antonym relations, such as WordNet, in which you can directly look up the antonyms of a given word.
Hearst ...
- Given the following two sentences, how do you determine if Teddy is a person or not? “Teddy bears are on sale!” and “Teddy Roosevelt was a great President!”
This is an example of a Named Entity Recognition (NER) problem. One can build a sequence model such as an LSTM to perform this task. However, as shown in both the sentences above, ...
- How do you measure the performance of a language model?
While building a language model, we try to estimate the probability of a sentence or a document.
Given sequences (sentences or documents) like
a language model (e.g. a bigram language model) will be ...
- What are the different ways of representing documents?
Bag of words: commonly called BoW, this involves creating a vocabulary of words and representing the document as a count vector, with dimension equal to the vocabulary size – each dimension representing ...
- What are common tools for speech recognition? What are the advantages and disadvantages of each? There are several ready-made tools for speech recognition that one can use to train custom models given an appropriate dataset.
CMU Sphinx: used more in an academic setting, one ...
- How do you design a system that reads a natural language question and retrieves the closest FAQ answer? There are multiple approaches for FAQ-based question answering.
Keyword-based search (information retrieval approach): tag each question with keywords. Extract keywords from the query and retrieve all relevant question-answer pairs. Easy ...
- You are given some documents and asked to find the prevalent topics in them – how do you go about it? This is typically called topic modeling. Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents. For instance, two statements ...
- How many parameters are there for an HMM model? Let us calculate the number of parameters for a bigram HMM.
Let K be the total number of states and V be the vocabulary size ...
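Under that setup, the free parameters are the initial state distribution, the K × K transition matrix, and the K × V emission matrix (each row constrained to sum to 1). A worked count with hypothetical sizes:

```python
K, V = 10, 5000  # hypothetical number of states and vocabulary size

initial = K - 1           # start distribution over states
transition = K * (K - 1)  # each row of the K x K matrix sums to 1
emission = K * (V - 1)    # each state's distribution over V words

print(initial + transition + emission)  # 50089 free parameters
```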
- Why is smoothing applied in a language model?
Because some n-grams may appear in the test set but not in the training set. For ex., if the training corpus is
and ...
- What are popular ways of dimensionality reduction in NLP tasks? Do you think this is even important? The common representation, bag of words, is very high-dimensional given a large vocabulary. Commonly used ways of dimensionality reduction in NLP:
TF-IDF: term frequency, inverse document ...
- What is the difference between stemming and lemmatisation?
Stemming replaces each word with its stem by stripping suffixes like “es”, “ies”, “s”. For ex., “cats” => “cat”, “computers” => “computer” etc. ...
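A minimal sketch with NLTK (one common toolkit; the answer doesn't name a library, and the lemmatizer needs the WordNet data via nltk.download('wordnet')):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "computers", "better"]:
    print(word, "->", stemmer.stem(word), "|",
          lemmatizer.lemmatize(word, pos="a" if word == "better" else "n"))
# 'studies' stems to 'studi' (not a real word) but lemmatizes to 'study';
# 'better' lemmatizes to 'good' only when given the adjective POS tag.
```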
- How do you find the most probable sequence of POS tags from a sequence of text?
This problem can be solved with an HMM.
Using an HMM involves finding the transition probabilities (the probability of going from one POS tag to another) and emission/output ...
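The usual efficient solution is the Viterbi algorithm, a dynamic program that runs in O(n · K²) for n words and K tags instead of enumerating all Kⁿ tag sequences. A compact sketch with toy, made-up tags and probabilities:

```python
import numpy as np

tags = ["NOUN", "VERB"]
start = np.array([0.7, 0.3])        # hypothetical P(first tag)
trans = np.array([[0.4, 0.6],       # P(next tag | NOUN)
                  [0.8, 0.2]])      # P(next tag | VERB)
emit = {"dogs": np.array([0.9, 0.1]),
        "run":  np.array([0.2, 0.8])}

def viterbi(words):
    n, k = len(words), len(tags)
    score = np.zeros((n, k))            # best log-prob ending in each tag
    back = np.zeros((n, k), dtype=int)  # back-pointers
    score[0] = np.log(start) + np.log(emit[words[0]])
    for t in range(1, n):
        for j in range(k):
            cand = score[t - 1] + np.log(trans[:, j]) + np.log(emit[words[t]][j])
            back[t, j] = np.argmax(cand)
            score[t, j] = cand[back[t, j]]
    path = [int(np.argmax(score[-1]))]  # follow back-pointers from the end
    for t in range(n - 1, 0, -1):
        path.append(back[t, path[-1]])
    return [tags[i] for i in reversed(path)]

print(viterbi(["dogs", "run"]))  # ['NOUN', 'VERB']
```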
- What is PMI? PMI, or Pointwise Mutual Information, is a measure of association between two events x and y:
PMI(x, y) = log [ p(x, y) / (p(x) p(y)) ]
As you can see from the formula above ...
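A small worked example of the formula (all counts made up):

```python
import math

N = 1000               # total observed word pairs
count_xy = 30          # co-occurrences of x and y
count_x, count_y = 100, 50

p_xy = count_xy / N
p_x, p_y = count_x / N, count_y / N

pmi = math.log2(p_xy / (p_x * p_y))
print(round(pmi, 2))  # 2.58 -> x and y co-occur ~6x more often than chance
```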
- Where would you not want to remove stop words? Stop words can, in most applications, be removed when you are using bag-of-words features. Some exceptions involve sentiment analysis, where ‘not’ cannot be removed because it is ...
- What is the difference between translation and transliteration? Transliteration is the process of converting a word written in one script into another script, phoneme by phoneme. Enabling transliteration for your search engine allows your site visitors to type ...
- What is a language model? How do you create one? Why do you need one? A language model is a probability distribution over sequences of words P(w_1, …, w_m). It enables us to measure the relative likelihood of different phrases. Measuring the likelihood of a sequence ...
- The BERT Score – Evaluating Text Generation
https://www.youtube.com/watch?v=4Hv_3Jd2O24
This video talks about the evaluation metric BERTScore, why it is needed over existing metrics such as the BLEU score, and how it is computed and evaluated. Traditional ...
- What can you say about the most frequent and the rarest words? Why are they important or not important?
Most frequent words are usually stop words like if, of, is.
Rare words could be because of spelling mistakes or due to the word being sparsely used in the data set.
Usually ...
- BERT Model
https://youtu.be/ZPmQzexoi-Q
This video explains the BERT model, its architecture, how it is trained and used. It also talks about when we would want to use the BERT model in comparison with ...
- What order of Markov assumption does an n-gram model make? An n-gram model makes a Markov assumption of order n-1. This assumption implies that, given the previous n-1 words, the probability of a word is independent of the words prior to them.
Suppose we have k words ...
- What are the different independence assumptions in an HMM & Naive Bayes? Both the HMM and Naive Bayes make conditional independence assumptions.
An HMM can be expressed by the two equations below:
P(z_t | z_1, …, z_{t-1}) = P(z_t | z_{t-1})
P(x_t | x_1, …, x_{t-1}, z_1, …, z_t) = P(x_t | z_t)
The second equation implies a conditional ...
- If you don’t have a stop-word dictionary or are working on a new language, what approach would you take to remove stop words? TF-IDF (term frequency–inverse document frequency) is a popular approach that can be leveraged to eliminate stop words. This technique is language-independent.
The intuition here is that commonly occurring ...
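A minimal, language-independent sketch of that intuition: compute IDF from raw documents and treat the lowest-IDF (most ubiquitous) terms as stop-word candidates.

```python
import math
from collections import Counter

docs = ["the cat sat", "the dog ran", "the market fell"]
N = len(docs)

df = Counter()
for d in docs:
    df.update(set(d.split()))  # document frequency, not raw term counts

idf = {w: math.log(N / n) for w, n in df.items()}
for w in sorted(idf, key=idf.get)[:3]:
    print(w, round(idf[w], 2))
# 'the' gets IDF 0.0 (it occurs in every document): a stop-word candidate
```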
- You are building a natural language search box for a website. How do you accommodate spelling errors? If you have a dictionary of words, edit distance is the simplest way of incorporating this. However, sometimes corrections based on context make sense. For instance, suppose I type “bed ...
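For the dictionary approach, a minimal sketch of Levenshtein (edit) distance, the standard dynamic program over insertions, deletions, and substitutions:

```python
def edit_distance(a: str, b: str) -> int:
    dp = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,         # delete ca
                dp[j - 1] + 1,     # insert cb
                prev + (ca != cb)  # substitute (free if characters match)
            )
    return dp[-1]

print(edit_distance("recieve", "receive"))  # 2
```

A spell corrector can then suggest the dictionary word with the smallest distance to the typed token.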
- What is negative sampling when training the skip-gram model? Recap: the skip-gram model is a popular algorithm for training word embeddings such as word2vec. It tries to represent each word in a large text as a lower-dimensional vector in ...
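A minimal usage sketch with gensim (one popular implementation; the parameter names below follow gensim 4.x and are our assumption, not part of the answer). sg=1 selects skip-gram; negative=5 draws five negative samples per positive (word, context) pair instead of computing a full softmax over the vocabulary:

```python
from gensim.models import Word2Vec

sentences = [["i", "like", "nlp"], ["i", "like", "deep", "learning"]]
model = Word2Vec(sentences, vector_size=50, window=2, sg=1,
                 negative=5, min_count=1)
print(model.wv["nlp"].shape)  # (50,)
```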
- You have come up with a spam classifier. How do you measure accuracy? Spam filtering is a classification problem. In a classification problem, the following are the common metrics used to measure efficacy:
True positives: those data points where the outcome ...
- Suppose you build word vectors (embeddings) with each word vector having dimension equal to the vocabulary size (V) and feature values given by the PPMI between corresponding words: what are the problems with this approach and how can you resolve them? Problems:
As the vocabulary size (V) is large, these vectors will be large in size.
They will be sparse as a word may not have co-occurred with all possible words.
Resolution
Dimensionality Reduction using ...
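A minimal sketch of that resolution: truncated SVD compresses the sparse V-dimensional PPMI vectors into dense low-dimensional embeddings (the PPMI matrix below is a tiny made-up example):

```python
import numpy as np

ppmi = np.array([[0.0, 2.1, 0.0, 0.7],
                 [2.1, 0.0, 1.5, 0.0],
                 [0.0, 1.5, 0.0, 0.3],
                 [0.7, 0.0, 0.3, 0.0]])  # V x V, mostly zeros in practice

d = 2  # target embedding dimension
U, S, Vt = np.linalg.svd(ppmi)
embeddings = U[:, :d] * S[:d]  # dense d-dimensional word vectors
print(embeddings.shape)        # (4, 2)
```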
- How do you deal with dataset imbalance in a problem like spam filtering? Class imbalance is a very common problem when applying ML algorithms. Spam filtering is one such application where class imbalance is apparent. There are many more non-spam emails in a ...
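One common mitigation among several (our own illustration, using scikit-learn) is to reweight the classes so that errors on the rare spam class cost more:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.1], [0.2], [0.3], [0.4], [0.9]])  # toy features
y = np.array([0, 0, 0, 0, 1])                      # 4:1 non-spam to spam

# class_weight='balanced' scales each class inversely to its frequency,
# so the lone spam example is not drowned out by the majority class.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
print(clf.predict([[0.8]]))
```

Resampling (over-sampling the minority class or under-sampling the majority) is the other common family of fixes.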
- Scaled Dot Product Attention
https://www.youtube.com/watch?v=RZN5Pwb4Ywg
This video explains the motivation behind scaled dot product attention used in the transformer architecture and how it is computed.
- What is shallow parsing? Typically we have a generative grammar that tells us how a sentence is generated from a set of rules. Parsing is the process of finding a parse tree that is ...
- Explain Latent Dirichlet Allocation – where is it typically used? Latent Dirichlet Allocation (LDA) is a probabilistic model that models a document as a multinomial mixture of topics, and each topic as a multinomial mixture of words. Each of these multinomials ...
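A minimal usage sketch with scikit-learn's implementation (one common choice, not prescribed by the answer), on a made-up four-document corpus:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "dogs and cats are pets",
        "stocks fell on monday", "the market rallied after earnings"]

counts = CountVectorizer(stop_words="english").fit(docs)
X = counts.transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
words = counts.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    print(f"topic {k}:", [words[i] for i in topic.argsort()[-3:]])
    # with luck, one pet-themed topic and one finance-themed topic
```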
- Say you’ve generated a language model using Bag of Words (BoW) with 1-hot encoding, and your training set has a lot of sentences with the word “good” but none with the word “great”. Suppose you see the sentence “Have a great day”; p(great) = 0.0 under this language model. How can you solve this problem, leveraging the fact that “good” and “great” are similar words?
BoW with 1-hot encoding doesn’t capture the meaning of sentences; it only captures co-occurrence statistics. We need to build the language model using features that are representative of the meaning ...
- Why are bigrams or other n-grams important in NLP tasks like sentiment classification or spam detection? Why is it worth finding them explicitly?
There are mainly two reasons:
Some pairs of words occur together far more often than their individual frequencies would suggest. Hence it is important to treat such co-occurring words as a single entity ...
- Suppose you are modeling text with an HMM. What is the complexity of finding the most probable sequence of tags or states for a sequence of text using a brute-force algorithm?
Assume there are K total states and let T be the length of the largest sequence.
Think about how we generate text using an HMM. We first have a state sequence and ...
- What are some knowledge graphs you know? What is different between them?
DBpedia: entities and relationships are automatically extracted from Wikipedia.
WordNet: a lexical database of the English language. It groups English words into synsets and provides various relationships between the words in a synset. ...
- How do you generate text using a Hidden Markov Model (HMM)? The HMM is a latent-variable model where the observed sequence of variables x_1, …, x_T is assumed to be generated from a set of temporally connected latent variables z_1, …, z_T.
The joint distribution ...
- How do you train an HMM model in practice? The joint probability distribution for the HMM is given by the following equation, where x_t are the observed data points and z_t the corresponding latent states:
P(x_1, …, x_T, z_1, …, z_T) = P(z_1) ∏_{t=2}^{T} P(z_t | z_{t-1}) ∏_{t=1}^{T} P(x_t | z_t)
...
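As a small illustration of the supervised case (states such as POS tags observed in the training data), maximum-likelihood training reduces to normalized counts; when the states are hidden, the Baum-Welch (EM) algorithm is used instead. A toy sketch of the supervised case:

```python
from collections import Counter

tagged = [[("dogs", "NOUN"), ("run", "VERB")],
          [("cats", "NOUN"), ("sleep", "VERB")]]  # toy tagged corpus

trans, emit, from_totals, tag_totals = Counter(), Counter(), Counter(), Counter()
for sent in tagged:
    tags = [t for _, t in sent]
    trans.update(zip(tags, tags[1:]))
    from_totals.update(tags[:-1])  # tags that have a successor
    emit.update(sent)              # (word, tag) pairs
    tag_totals.update(tags)

# Maximum-likelihood estimates are just normalized counts:
p_trans = {pair: c / from_totals[pair[0]] for pair, c in trans.items()}
p_emit = {pair: c / tag_totals[pair[1]] for pair, c in emit.items()}
print(p_trans)                   # {('NOUN', 'VERB'): 1.0}
print(p_emit[("dogs", "NOUN")])  # 0.5
```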
- What are some common tools available for NER (Named Entity Recognition)? Notable NER platforms include:
GATE supports NER across many languages and domains out of the box, usable via a graphical interface and a Java API.
OpenNLP includes rule-based and statistical named-entity recognition.
spaCy ...
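A minimal spaCy usage sketch (assuming the small English model has been installed via python -m spacy download en_core_web_sm), run on the Teddy example from earlier:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Teddy Roosevelt was a great President!")
print([(ent.text, ent.label_) for ent in doc.ents])
# e.g. [('Teddy Roosevelt', 'PERSON')] -- exact output depends on the model
```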
- How will you build the automatic/smart reply feature on an app like Gmail or LinkedIn?
Generating replies on the fly: smart reply can be built using sequence-to-sequence modeling. An incoming mail acts as the input to the model and the reply will be ...
- What are knowledge graphs? When would you need a knowledge graph over, say, a database to store information? A knowledge graph organizes real-world knowledge as entities and relationships between entities. Creating a knowledge graph often involves scraping / ingesting unstructured data and creating structure out of it ...
- How can you increase the recall of a search query's results (on a search engine or e-commerce site) without changing the algorithm?
Since we are not allowed to change the algorithm, we can only modify or augment the search query. (Note: we either change the algorithm/model or the data; here ...
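One concrete query-augmentation tactic is synonym expansion. A minimal sketch using WordNet via NLTK (our choice of tool; requires nltk.download('wordnet')):

```python
from nltk.corpus import wordnet

def expand(query):
    # Add every WordNet synonym of every query term.
    terms = set(query.split())
    for word in query.split():
        for syn in wordnet.synsets(word):
            terms.update(l.name().replace("_", " ") for l in syn.lemmas())
    return terms

print(expand("cheap phone"))  # may add e.g. 'telephone'; exact expansions
                              # depend on the WordNet version
```

More synonyms match more documents, so recall rises (usually at some cost to precision).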