- Because some n-grams in the test set may not be present in the training set. For example, if the training corpus is
and you need to find the probability of a sequence like
where <START> is the token added at the beginning of the document.
Then
so the probability of the whole sequence comes out to zero, since the bi-gram “the sentence” doesn’t occur in the training set, even though the test sequence is quite plausible given the training data. To avoid such situations, add-k or other smoothing techniques are used so that every conditional probability is non-zero.
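As an illustration, here is a minimal sketch of add-k smoothing for bigram probabilities, P(w_i | w_{i-1}) = (C(w_{i-1}, w_i) + k) / (C(w_{i-1}) + k·V). The tiny corpus, the value of k, and the function name below are made up for the example, not taken from the question:

```python
from collections import Counter

def bigram_prob_add_k(w_prev, w, tokens, k=1.0):
    """Add-k smoothed estimate of P(w | w_prev) from a list of tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)
    # Unseen bigrams still receive a small non-zero probability thanks to +k.
    return (bigrams[(w_prev, w)] + k) / (unigrams[w_prev] + k * vocab_size)

# The bigram ("the", "sentence") never appears in this tiny corpus,
# yet its smoothed probability is non-zero instead of 0.
corpus = "<START> the cat sat on the mat".split()
print(bigram_prob_add_k("the", "sentence", corpus))  # unseen bigram, still > 0
print(bigram_prob_add_k("the", "cat", corpus))       # seen bigram, higher probability
```

With k = 1 this reduces to Laplace smoothing; smaller k values shift less probability mass away from observed bigrams.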
A related question could be this.