- Because some n-grams may appear in the test set but not in the training set. For example, if the training corpus is
\[w_{train} = \text{This is the only sentence in the corpus}\]
and you need to find the probability of a sequence like
\[w_{test} = \text{This is the sentence in the corpus}\]
\[P(w_{test}) = P(This \mid {<}START{>})\,P(is \mid This)\,P(the \mid is)\,P(sentence \mid the)\,P(in \mid sentence)\,P(the \mid in)\,P(corpus \mid the)\]
where <START> is the token added at the beginning of the document.
Then
\[P(sentence \mid the) = 0 \implies P(w_{test}) = 0\]
since the bi-gram “the sentence” does not occur in the training set, even though the test sequence is highly plausible given the training corpus. To avoid such situations, add-k or other smoothing techniques are used so that every conditional probability is non-zero.
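Below is a minimal Python sketch of this idea (the `bigram_prob` helper and the tokenization are illustrative, not part of the original post): the unsmoothed estimate of P(sentence | the) collapses to zero on this corpus, while the add-k (here add-1) smoothed estimate stays non-zero.

```python
from collections import Counter

# Add-k smoothed bigram probability P(word | prev):
# (count(prev, word) + k) / (count(prev) + k * |V|)
def bigram_prob(word, prev, unigram_counts, bigram_counts, vocab_size, k=1.0):
    return (bigram_counts[(prev, word)] + k) / (unigram_counts[prev] + k * vocab_size)

# Training corpus from the example above, with the <START> token prepended
train = "<START> This is the only sentence in the corpus".split()
unigram_counts = Counter(train)
bigram_counts = Counter(zip(train, train[1:]))
vocab_size = len(set(train))

# Unsmoothed MLE assigns zero probability to the unseen bigram "the sentence"
mle = bigram_counts[("the", "sentence")] / unigram_counts["the"]
print(mle)  # 0.0

# Add-1 (Laplace) smoothing keeps the conditional probability non-zero
print(bigram_prob("sentence", "the", unigram_counts, bigram_counts, vocab_size, k=1.0))
# (0 + 1) / (2 + 1 * 8) = 0.1
```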
A related question could be this.