An n-gram model makes an order n-1 Markov assumption. This assumption implies that, given the previous n-1 words, the probability of a word is independent of all words before those n-1 words.
Suppose we have k words in a sentence, \(w_{1}, \dots, w_{k}\), followed by a special STOP symbol \(w_{k+1}\) that marks the end of the sentence (this is why the products below run to k+1). Their joint probability can be expressed as follows using the chain rule:
\[ p(w) = p(w_{1}, \dots, w_{k}, w_{k+1}) = \prod_{i=1}^{k+1} p(w_{i} \mid w_{1}, \dots, w_{i-1}) \]
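As an illustration (with a made-up three-word sentence), the chain rule expands to:

\[ p(\text{the cat sat}) = p(\text{the}) \, p(\text{cat} \mid \text{the}) \, p(\text{sat} \mid \text{the}, \text{cat}) \, p(\text{STOP} \mid \text{the}, \text{cat}, \text{sat}) \]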
Now, the Markov assumption can be used to simplify the above factorization: in an n-gram model, each word in the sequence depends only on the previous n-1 words.
For a bi-gram model (n = 2), a first-order Markov assumption is made, and the above expression becomes
\[ p(w) = \prod_{i=1}^{k+1} p(w_{i} \mid w_{i-1}) \]
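Here is a minimal sketch of how these bigram probabilities might be estimated from counts (maximum-likelihood estimation). The toy corpus, the `<s>`/`</s>` padding tokens, and the function names are illustrative assumptions, not part of the original derivation:

```python
from collections import Counter

def train_bigram(sentences):
    """Estimate bigram probabilities p(w_i | w_{i-1}) by maximum likelihood.

    Each sentence is padded with <s> (playing the role of w_0) and
    </s> (playing the role of w_{k+1}, the STOP symbol).
    """
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            unigrams[prev] += 1
            bigrams[(prev, cur)] += 1
    # p(cur | prev) = count(prev, cur) / count(prev)
    return {bg: c / unigrams[bg[0]] for bg, c in bigrams.items()}

def sentence_prob(model, sent):
    """p(w) = product over i of p(w_i | w_{i-1}), including the STOP term."""
    tokens = ["<s>"] + sent + ["</s>"]
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        prob *= model.get((prev, cur), 0.0)  # unseen bigrams get probability 0
    return prob

# Toy corpus, made up for illustration
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
model = train_bigram(corpus)
print(sentence_prob(model, ["the", "cat", "sat"]))  # 1.0 * 0.5 * 1.0 * 1.0 = 0.5
```

Note that padding with `<s>` is one common way to give the first word a conditioning context; the thinking exercise below returns to exactly this question.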
For a tri-gram model (n = 3), a second-order Markov assumption is made, which means the probability of a word depends on the previous two words, hence second order.
\[ p(w) = \prod_{i=1}^{k+1} p(w_{i} \mid w_{i-1}, w_{i-2}) \]
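Writing out the trigram product for the same made-up sentence makes the boundary terms explicit:

\[ p(\text{the cat sat}) = p(\text{the} \mid w_{0}, w_{-1}) \, p(\text{cat} \mid \text{the}, w_{0}) \, p(\text{sat} \mid \text{cat}, \text{the}) \, p(\text{STOP} \mid \text{sat}, \text{cat}) \]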
Thinking exercise – how do you handle words like \(w_{0}\) and \(w_{-1}\), the conditioning context at the start of the sentence?