- The most frequent words are usually stop words such as "in", "that", "so", "what", "are", "this", "the", "a", "is", etc.
- Rare words may result from spelling mistakes, or the word may simply be used sparsely in the data set.
- Usually, neither the most frequent nor the rarest words are useful in providing contextual information. Very frequent words are called stop words. Because stop words occur in almost every sentence or document, they do not help in uniquely identifying the content of a sentence or document. Very rare words can sometimes be very useful, but they are often so sparse that it is hard to draw insights from them.
- The most frequent and the rarest words can be handled by using tf-idf instead of raw frequency counts to construct the feature vector in text processing. The tf-idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, so a word that appears in almost every document receives a weight close to zero.
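
To make the tf-idf weighting concrete, here is a minimal sketch in plain Python. It assumes whitespace tokenization, term frequency as count divided by document length, and the unsmoothed inverse document frequency log(N / df); the function names and the toy corpus are made up for illustration, and libraries such as scikit-learn use smoothed variants that give slightly different numbers.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; real pipelines do more.
    return text.lower().split()


def tf_idf(docs):
    """Return one {word: tf-idf score} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            # tf = count / document length, idf = log(N / df)
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores


# Hypothetical toy corpus, for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
scores = tf_idf(docs)
```

In this toy corpus, "the" occurs in every document, so its idf is log(3/3) = 0 and its score is zero even though it is the most frequent word. "mat" occurs in only one document and therefore scores higher than "cat", which occurs in two.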
What can you say about the most frequent and the rarest words? Why are they important or not important?