Which is better to use when extracting features: character n-grams or word n-grams? Why?

Posted on February 9, 2019 by MLNerds

Both have their uses. Character n-grams work well where character-level information matters, for example spelling correction, language identification, writer identification (i.e. fingerprinting), and anomaly detection. Word n-grams are more appropriate for tasks that depend on word co-occurrence, for instance machine translation and spam detection.
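To make the distinction concrete, here is a minimal sketch of both kinds of extraction in plain Python. The function names `char_ngrams` and `word_ngrams` are illustrative choices, and splitting on whitespace is a simplifying assumption; a real pipeline would normally use a proper tokenizer.

```python
def char_ngrams(text, n):
    """Return all contiguous character n-grams of `text`, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return all contiguous word n-grams of `text` as tuples.

    Assumption: words are separated by whitespace (no real tokenizer).
    """
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(char_ngrams("cash", 3))           # ['cas', 'ash']
print(word_ngrams("free cash now", 2))  # [('free', 'cash'), ('cash', 'now')]
```

A misspelling such as "cahs" still shares no whole word with "cash", but it does share characters and short character sequences, which is roughly why character n-grams suit tasks like spelling correction.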

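Character-level n-grams are generally much more efficient: they are built from a small alphabet, so the number of distinct n-grams stays manageable. With word-level n-grams, however, it is much harder to go beyond 3-grams, because the number of possible word combinations grows very quickly with n and most longer n-grams are seen only once or not at all.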
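The difficulty with longer word n-grams can be illustrated on a toy example. The sketch below counts how many distinct word n-grams occur, and how many of them occur exactly once, as n grows. The corpus and the helper name `ngram_stats` are made up for illustration, and a dozen words is far too little data to draw real conclusions; it only shows the direction of the effect.

```python
from collections import Counter


def ngram_stats(tokens, n):
    """Return (distinct n-grams, n-grams seen exactly once) for a token list."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    seen_once = sum(1 for c in counts.values() if c == 1)
    return len(counts), seen_once


# Toy corpus, assumed only for this illustration.
tokens = "the cat sat on the mat and the cat ate the rat".split()

for n in (1, 2, 3):
    distinct, seen_once = ngram_stats(tokens, n)
    print(n, distinct, seen_once)
# 1 8 6
# 2 10 9
# 3 10 10
```

Already at n = 3 every n-gram in this small corpus is unique, so the counts carry almost no statistical signal. On real data the same sparsity appears at larger scale.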

Sometimes, though not very commonly, byte-level n-grams are also used.
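As a rough sketch of what byte-level n-grams look like, the snippet below encodes the text and slides a window over the resulting bytes. The function name `byte_ngrams` and the choice of UTF-8 are assumptions made for illustration, not something the post prescribes.

```python
def byte_ngrams(text, n, encoding="utf-8"):
    """Return all contiguous byte n-grams of `text` after encoding it."""
    data = text.encode(encoding)
    return [data[i:i + n] for i in range(len(data) - n + 1)]


# "ï" takes two bytes in UTF-8, so "naïve" is 6 bytes and yields 5 bigrams.
print(byte_ngrams("naïve", 2))
```

Because the features are raw bytes, the same code works for any script or language without a tokenizer, at the cost of n-grams that may split a single character in two.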
