Which is better to use when extracting features: character n-grams or word n-grams? Why?

Both have their uses. Character n-grams work well where character-level information matters, for example spelling correction, language identification, writer identification (i.e. fingerprinting), and anomaly detection. Word n-grams are more appropriate for tasks that depend on word co-occurrence, for instance machine translation and spam detection.
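To make the difference concrete, here is a minimal sketch of both kinds of extraction in plain Python. The function names `char_ngrams` and `word_ngrams` and the sample sentence are made up for illustration, and splitting on whitespace is a simplifying assumption; a real pipeline would normally use a proper tokenizer.

```python
def char_ngrams(text, n):
    """Return all contiguous character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return all contiguous word n-grams as tuples.

    Assumption: words are separated by whitespace only.
    """
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sample = "the cat sat"  # hypothetical example input
char_trigrams = char_ngrams(sample, 3)  # starts with 'the', 'he ', 'e c'
word_bigrams = word_ngrams(sample, 2)   # pairs of adjacent words
```

Note that the character n-grams include spaces, so they also capture some information about word boundaries.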

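Character-level n-grams are generally much cheaper to work with, because the set of distinct characters is small. With word-level n-grams, however, the number of possible combinations grows so quickly that it is much harder to go beyond 3-grams.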
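One common reason word n-grams get hard to use at higher orders is sparsity: as n grows, more and more of the distinct n-grams are seen only once. The sketch below measures that on a toy token list. The helper `singleton_fraction` and the corpus are hypothetical, and a corpus this small only hints at the trend.

```python
from collections import Counter


def singleton_fraction(tokens, n):
    """Fraction of distinct word n-grams that occur exactly once."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(grams)
    if not counts:
        return 0.0
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)


# Hypothetical toy corpus, tokenized on whitespace.
tokens = "the cat sat on the mat and the cat sat on the rug".split()

# Share of n-grams seen only once, for n = 1 .. 4.
fractions = [singleton_fraction(tokens, n) for n in range(1, 5)]
```

On this corpus the fraction rises with n, which means fewer repeated n-grams are available to estimate reliable statistics from.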

Sometimes, though not very commonly, byte-level n-grams are also used.
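A byte-level variant can be sketched the same way by encoding the text first. The function name `byte_ngrams` is made up for illustration, and UTF-8 is an assumed encoding; the point is that a single non-ASCII character may span more than one byte, so byte n-grams can cut through a character.

```python
def byte_ngrams(text, n, encoding="utf-8"):
    """Return all contiguous byte n-grams of the encoded text.

    Assumption: UTF-8 by default; other encodings give different bytes.
    """
    data = text.encode(encoding)
    return [data[i:i + n] for i in range(len(data) - n + 1)]


# 'é' takes two bytes in UTF-8, so this 5-character word has 6 bytes.
byte_bigrams = byte_ngrams("héllo", 2)
```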
