Why are bigrams, or n-grams in general, important in NLP tasks such as sentiment classification or spam detection, and why is it worth finding them explicitly?

There are two main reasons:
- Some pairs of words occur together far more often than chance would predict, and the pair carries a meaning the individual words do not. It therefore helps to treat such co-occurring words as a single entity, i.e. a single token, during training. In a named entity recognition problem, tokens such as “United States”, “North America”, and “Red Wine” only make sense when recognised as bigrams. n-grams extend the same idea from pairs to longer sequences.
- With Bag of Words features, using only single words loses the ordering of the sequence. To preserve some of that local ordering, one can also use n-grams as features in the BoW approach (see the first sketch after this list).
- Note that frequently co-occurring sequences of words (not only bigrams) are called collocations.
- NLTK's Text class provides a collocations() method to find frequent bigrams (see the second sketch below).
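As a rough illustration of the second point, here is a minimal sketch of using n-grams as Bag of Words features with scikit-learn's CountVectorizer; the sample sentences and the ngram_range setting are only assumptions for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents; in practice these would come from the training corpus.
docs = [
    "red wine is produced in north america",
    "the united states imports red wine",
]

# ngram_range=(1, 2) keeps single words and adds bigrams such as
# "red wine" and "united states", so some local word order survives
# in the Bag of Words representation.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # unigram + bigram vocabulary
print(X.toarray())                         # document-term count matrix
```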
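And a short sketch of finding collocations with NLTK, assuming the genesis and stopwords corpora have been downloaded; the frequency filter and the choice of PMI scoring are illustrative, not the only option.

```python
import nltk
from nltk.corpus import genesis
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Run nltk.download('genesis') and nltk.download('stopwords') once beforehand.
words = genesis.words("english-web.txt")

# Text.collocations() prints the most frequent bigram collocations directly.
nltk.Text(words).collocations()

# BigramCollocationFinder gives more control, e.g. ranking bigrams by PMI.
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(3)  # ignore bigrams seen fewer than 3 times
print(finder.nbest(bigram_measures.pmi, 10))
```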