Words like if, of, … are called stop words. Typical pre-processing in a standard NLP pipeline involves identifying and removing stop words (except in cases where context or word-adjacency information is important). Common techniques to identify and remove stop words include:
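- TF-IDF (term frequency-inverse document frequency): words that appear in nearly every document receive a low inverse document frequency, so they can be down-weighted or dropped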
- Using manually curated stop-word lists and eliminating the listed words
- We also reduce words to their root forms; this is called lemmatization. It ensures that a word occurring several times receives more weight even when the occurrences have different endings, for example: teach, teaching, teaches.
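
The steps in the list above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `STOP_WORDS` set, the `LEMMAS` lookup table, and the three sample sentences are all made up for the example, and a real pipeline would more likely rely on an NLP library's stop-word list and lemmatizer. The `idf` helper uses the plain `log(N / df)` form of inverse document frequency; libraries often apply smoothed variants.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; curated lists are much longer.
STOP_WORDS = {"if", "of", "the", "a", "is", "to", "and", "in"}

# Toy lookup table standing in for a real lemmatizer (an assumption for this sketch).
LEMMAS = {"teaching": "teach", "teaches": "teach", "taught": "teach"}


def tokenize(text):
    """Lowercase the text and split on whitespace."""
    return text.lower().split()


def remove_stop_words(tokens):
    """Drop every token found in the curated stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def lemmatize(tokens):
    """Map each token to its root form when the lookup table knows it."""
    return [LEMMAS.get(t, t) for t in tokens]


def idf(tokenized_docs):
    """Inverse document frequency, log(N / df), for every term in the corpus."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in doc_freq.items()}


docs = [
    "the teacher teaches the basics of grammar",
    "teaching is the art of patience",
    "if the student asks the mentor answers",
]
tokenized = [tokenize(d) for d in docs]

# A term present in every document has IDF log(N / N) = 0, which flags it
# as a stop-word candidate without any curated list.
weights = idf(tokenized)
low_idf = sorted(t for t, w in weights.items() if w == 0.0)
print(low_idf)  # ['the']

# Curated-list filtering followed by lemmatization.
cleaned = [lemmatize(remove_stop_words(doc)) for doc in tokenized]
print(cleaned[0])  # ['teacher', 'teach', 'basics', 'grammar']
print(cleaned[1])  # ['teach', 'art', 'patience']
```

After cleaning, "teaches" and "teaching" both count toward the single term "teach", which is the weighting effect described in the last bullet. Note that "of" survives the IDF check here because it appears in only two of the three toy documents; on a realistic corpus a threshold on IDF, not an exact zero, would be the more practical cutoff.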