You are trying to cluster documents using a bag-of-words method. Words like if, of, is and so on are typically not great features. How do you make sure you leverage the more informative words during feature engineering?

Words like if, of, … are called stop words. Standard NLP pre-processing usually identifies and removes stop words, except in cases where context or word-adjacency information is important. Common techniques for reducing their influence and emphasizing the more informative words include:

  1. TF-IDF (term frequency-inverse document frequency): weight each word by how often it occurs in a document, discounted by how many documents contain it, so that words appearing almost everywhere receive little weight.
  2. Manually curated stop word lists: eliminate every word that appears on the list.
  3. Lemmatization (or stemming): reduce words to their root form. This ensures that a word occurring several times receives more weight even if the occurrences have different endings, for example: teach, teaching, teaches.
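
The three techniques can be combined in one small pipeline. The sketch below is illustrative only and uses just the Python standard library: the stop word list is a tiny hand-picked set, `normalize` is a crude suffix stripper standing in for a real lemmatizer, and the weighting is the plain unsmoothed form tf × log(N / df). All function names and the sample documents are made up for this example; NLP libraries typically use longer stop word lists, proper lemmatizers, and smoothed variants of the formula.

```python
import math
import re
from collections import Counter

# Tiny hand-picked stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"if", "of", "is", "the", "a", "an", "and", "to", "in"}


def normalize(token):
    """Crude suffix stripping, a stand-in for a real lemmatizer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(text):
    """Lowercase, split into alphabetic tokens, drop stop words, normalize."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [normalize(t) for t in tokens if t not in STOP_WORDS]


def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


docs = [
    "He teaches the art of cooking",
    "Teaching is the art of patience",
    "The cooking of rice is an art",
]
vectors = tfidf(docs)
```

In this toy corpus, "teaches" and "teaching" collapse to the same feature, stop words never reach the vectors, and "art", which occurs in every document, ends up with a weight of zero because log(3 / 3) = 0. The resulting dictionaries can then be turned into vectors for a clustering algorithm.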
