Why is smoothing applied in language models?

  • Because some n-grams may appear in the test set but never occur in the training set. For example, if the training corpus is

        \[w_{train} = \text{This is the only sentence in the corpus}\]

    and you need to find the probability of a sequence like

        \[w_{test} = \text{This is the sentence in the corpus}\]

    \[p(w_{test}) = p(\text{this} \mid \text{<START>}) \cdot p(\text{is} \mid \text{this}) \cdot \ldots \cdot p(\text{sentence} \mid \text{the}) \cdot \ldots \cdot p(\text{corpus} \mid \text{the})\]

where <START> is the special token added at the beginning of the document.

Then

    \[p(\text{sentence} \mid \text{the}) = 0\]

because the bigram “the sentence” never occurs in the training set, so the whole product p(w_{test}) collapses to zero, even though the test sequence is highly plausible given the training data. To avoid such situations, add-k or other smoothing techniques are used so that every conditional probability is non-zero.
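
For reference, the usual add-k estimate for a bigram probability (a standard textbook formula, stated here for completeness rather than taken from the post above) is

    \[p_{add\text{-}k}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\,w_i) + k}{C(w_{i-1}) + kV}\]

where C(·) denotes counts in the training corpus, V is the vocabulary size, and k > 0 is the smoothing constant. Below is a minimal Python sketch of this idea applied to the example above; the function names, the lower-casing of tokens, and the choice k = 1 are assumptions made for illustration, not details from the original post.

    from collections import Counter

    def train_bigram_counts(tokens):
        """Count unigrams and bigrams over the training tokens, with a <START> pad."""
        padded = ["<START>"] + tokens
        unigram_counts = Counter(padded)
        bigram_counts = Counter(zip(padded, padded[1:]))
        return unigram_counts, bigram_counts

    def add_k_prob(word, prev, unigram_counts, bigram_counts, vocab_size, k=1.0):
        """Add-k smoothed p(word | prev): strictly positive for any k > 0."""
        return (bigram_counts[(prev, word)] + k) / (unigram_counts[prev] + k * vocab_size)

    train_tokens = "This is the only sentence in the corpus".lower().split()
    test_tokens = "This is the sentence in the corpus".lower().split()

    unigrams, bigrams = train_bigram_counts(train_tokens)
    vocab_size = len(unigrams)  # 8 word types plus <START>

    # Multiply the smoothed bigram probabilities along the test sequence.
    prob = 1.0
    prev = "<START>"
    for word in test_tokens:
        prob *= add_k_prob(word, prev, unigrams, bigrams, vocab_size)
        prev = word

    print(f"p(w_test) = {prob:.3e}")  # small but non-zero

Without smoothing, the unseen bigram “the sentence” would contribute a factor of zero; with add-1 smoothing it contributes 1/(C(the) + V) = 1/11 instead, so the product stays positive.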

A related question could be this.
