A common representation is bag of words, which is very high dimensional given a large vocabulary. Commonly used ways to reduce dimensionality in NLP (a short code sketch for each follows the list):
- TF-IDF: term frequency-inverse document frequency, which reweights raw term counts so that terms frequent in a document but rare across the collection carry more weight (link to relevant article)
- Word2Vec / GloVe: These embeddings have become very popular recently. Both are learned from word co-occurrence: Word2Vec trains a shallow neural network to predict a word from its context (or the reverse), while GloVe factorizes a global co-occurrence matrix (** give references). A document embedding is obtained by averaging the embeddings of all words in the document.
- ELMo embeddings: deep contextual embeddings; unlike the static embeddings above, ELMo gives a different embedding for each context a word occurs in
- LSI: Latent Semantic Indexing, which applies a truncated Singular Value Decomposition (SVD) to the term-document matrix
- Topic modeling: techniques such as Latent Dirichlet Allocation (LDA) that find latent topics in a document collection and represent each document as a reduced-dimensional vector of topic strengths
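First, a minimal sketch of bag of words versus TF-IDF using scikit-learn; the toy corpus is made up for illustration. Both produce one dimension per vocabulary term (which is exactly why the representation is so high dimensional); TF-IDF only changes the weights, not the dimensionality.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

# Bag of words: one dimension per vocabulary term, raw counts.
bow = CountVectorizer()
X_bow = bow.fit_transform(corpus)        # sparse matrix, shape (3, vocab_size)

# TF-IDF: same dimensionality, but counts are reweighted so that
# terms common across all documents contribute less.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)

print(X_bow.shape, X_tfidf.shape)
```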
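A minimal sketch of the Word2Vec averaging idea using gensim (4.x API); the corpus and hyperparameters are illustrative assumptions, not recommendations.

```python
import numpy as np
from gensim.models import Word2Vec

docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
]

# Train a small Word2Vec model on the toy corpus.
model = Word2Vec(docs, vector_size=50, window=3, min_count=1, epochs=50)

def doc_embedding(tokens, wv):
    """Average the vectors of all in-vocabulary tokens."""
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

print(doc_embedding(docs[0], model.wv).shape)  # (50,)
```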
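A hedged sketch of ELMo's contextual behavior using allennlp's ElmoEmbedder (the API assumed here is from allennlp 0.x/1.x and may differ in newer releases; the first call downloads pretrained weights). The point is that the same word, "bank", gets a different vector in each sentence.

```python
from allennlp.commands.elmo import ElmoEmbedder

elmo = ElmoEmbedder()  # default pretrained ELMo weights

bank_river = elmo.embed_sentence(["the", "bank", "of", "the", "river"])
bank_money = elmo.embed_sentence(["the", "bank", "approved", "the", "loan"])

# Output shape is (3 layers, num_tokens, 1024); compare the top-layer
# vector for "bank" (token index 1) across the two contexts.
print(bank_river[2, 1][:5])
print(bank_money[2, 1][:5])
```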
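A minimal sketch of LSI as truncated SVD on a TF-IDF term-document matrix, via scikit-learn's TruncatedSVD; the corpus and the choice of 2 components are illustrative.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

X = TfidfVectorizer().fit_transform(corpus)   # (3, vocab_size)

# Truncated SVD projects each document onto the top singular directions.
lsi = TruncatedSVD(n_components=2)
X_reduced = lsi.fit_transform(X)              # (3, 2)
print(X_reduced.shape)
```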
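Finally, a minimal sketch of topic modeling with scikit-learn's LatentDirichletAllocation; the corpus and topic count are illustrative. Each document comes out as a short vector of topic strengths, which is the reduced-dimensional representation.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell as markets reacted to the earnings report",
]

# LDA operates on raw counts, so use a bag-of-words matrix.
X = CountVectorizer(stop_words="english").fit_transform(corpus)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)   # each row: topic-strength vector
print(doc_topics.shape)             # (3, 2)
```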