Given a bigram language model, in what scenarios do we encounter zero probabilities? How should we handle these situations?

Recall that the bigram model can be expressed as P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}), and the probability of a sequence is the product of these conditional terms, so a single zero term makes the whole sequence probability zero.

Scenario 1 – Out-of-vocabulary (OOV) words: such words never appear during training, so any probability term involving an OOV word is 0.0, which drives the probability of the entire sequence to zero. This is handled by replacing rare or unseen words with an UNK token in both the training data and the test data, so that OOV words at test time map to a token the model has actually seen.

Scenario 2 – Unseen bigrams: even when both words are in the vocabulary, the particular bigram may never occur in the training corpus; this is addressed by smoothing (see the next question).
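As a minimal sketch of the UNK approach (not from the original answer), the example below maps words seen fewer than a threshold number of times to an "<UNK>" token before counting bigrams; the threshold, token names, and function names are assumptions made for illustration.

```python
from collections import Counter

def build_bigram_counts(sentences, min_count=2):
    """Count unigrams and bigrams after mapping rare words to <UNK>."""
    word_counts = Counter(w for sent in sentences for w in sent)
    # Words seen fewer than min_count times are treated as out-of-vocabulary.
    vocab = {w for w, c in word_counts.items() if c >= min_count}
    mapped = [[w if w in vocab else "<UNK>" for w in sent] for sent in sentences]

    unigrams, bigrams = Counter(), Counter()
    for sent in mapped:
        tokens = ["<START>"] + sent + ["<END>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return vocab, unigrams, bigrams

def bigram_prob(w_prev, w, vocab, unigrams, bigrams):
    """MLE estimate of P(w | w_prev); OOV words are mapped to <UNK> at query time too."""
    special = {"<START>", "<END>"}
    w_prev = w_prev if (w_prev in vocab or w_prev in special) else "<UNK>"
    w = w if (w in vocab or w in special) else "<UNK>"
    if unigrams[w_prev] == 0:
        return 0.0
    return bigrams[(w_prev, w)] / unigrams[w_prev]
```

Because unseen test words collapse onto "<UNK>", their probability terms are estimated from the counts of rare training words rather than being zero.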

Why is smoothing applied in language models?

Because some n-grams may appear in the test set but not in the training set. For example, suppose a test sequence begins with a bigram (<START>, w), where <START> is the token prepended at the beginning of the document, and that bigram never occurs in the training corpus. Then its maximum-likelihood probability is zero, which makes the probability of the entire sequence zero. Smoothing redistributes a small amount of probability mass to such unseen n-grams so the model never assigns exactly zero probability to a test sequence.
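To make this concrete, here is a hedged sketch of add-one (Laplace) smoothing on top of the bigram counts from the previous example; the formula P(w_i | w_{i-1}) = (count(w_{i-1}, w_i) + 1) / (count(w_{i-1}) + |V|) is the standard one, but the function and variable names are illustrative assumptions.

```python
def laplace_bigram_prob(w_prev, w, unigrams, bigrams, vocab_size):
    """Add-one (Laplace) smoothed estimate of P(w | w_prev).

    Every possible bigram receives a pseudo-count of 1, so bigrams that never
    occurred in training still get a small non-zero probability. vocab_size
    should be the size of the vocabulary used in the denominator, including
    the <UNK> and boundary tokens if they can appear as the next word.
    """
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vocab_size)

# Example: a bigram never seen in training no longer has probability 0.
# p = laplace_bigram_prob("<START>", "some_rare_word", unigrams, bigrams, len(vocab) + 2)
```

Add-one smoothing is the simplest choice; in practice add-k, Good-Turing, or Kneser-Ney smoothing redistribute the probability mass less aggressively.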