What are the different ways of representing documents?

- Bag of words: Commonly called BOW, this involves creating a vocabulary of words and representing the document as a count vector whose dimension equals the vocabulary size, each dimension holding the number of times a specific word occurred in the document. Sometimes TF-IDF weighting is used to reduce the number of dimensions/features by keeping only the words that are most relevant (see the first sketch after this list).
- Aggregated word embeddings: Use word embeddings such as word2vec / GloVe for each word in the document, and take the document embedding to be the average of the embeddings of all words in the document. This works well for short documents; for long documents there are problems due to the averaging-out effect. An advantage is that you can use pre-trained embeddings such as those trained on the Google News dataset (see the second sketch after this list).
- Phrase embeddings, document embeddings: There are many techniques for embedding an entire document. One technique is to feed the sentence into an RNN with memory, such as an LSTM, and take the contents of the last hidden state vector as a representation of the entire sentence, since the hidden state accumulates richer and richer information along the sequence (see the LSTM sketch after this list).
- Directly use the sequence of words as input to a deep learning model such as an LSTM for the end task (the classification head in the same LSTM sketch below illustrates this).
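
A minimal sketch of the bag-of-words and TF-IDF representations using scikit-learn; the three-document toy corpus and the `max_features=5` cut-off are made-up illustrative values, not from the original post.

```python
# Minimal sketch of bag-of-words and TF-IDF document vectors using scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

# Plain bag of words: one dimension per vocabulary word, values are raw counts.
bow = CountVectorizer()
bow_matrix = bow.fit_transform(docs)           # shape: (n_docs, vocab_size)
print(bow.get_feature_names_out())
print(bow_matrix.toarray())

# TF-IDF re-weights counts so that words common across the corpus contribute less;
# max_features keeps only the top-weighted terms, shrinking the feature space.
tfidf = TfidfVectorizer(max_features=5)
tfidf_matrix = tfidf.fit_transform(docs)
print(tfidf.get_feature_names_out())
print(tfidf_matrix.toarray())
```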
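
A sketch of the averaged-word-embedding representation. The tiny hand-made embedding table here stands in for real pre-trained vectors; in practice one would instead load, for example, the Google News word2vec vectors (e.g. via `gensim.downloader.load('word2vec-google-news-300')`) and average those.

```python
# Sketch of averaged word embeddings as a document vector.
# The embedding table below is made up purely for illustration.
import numpy as np

dim = 3
embeddings = {
    "cat": np.array([0.2, 0.1, 0.5]),
    "dog": np.array([0.3, 0.0, 0.4]),
    "sat": np.array([0.0, 0.7, 0.1]),
    "mat": np.array([0.1, 0.6, 0.2]),
}

def document_vector(text):
    """Average the embeddings of all in-vocabulary words in the document."""
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:                 # no known words: fall back to a zero vector
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

print(document_vector("The cat sat on the mat"))
```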
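
A sketch, in PyTorch, of the two LSTM-based options above: the final hidden state of the LSTM is taken as the document embedding, and a linear layer on top of it handles a hypothetical end task (binary classification here). The vocabulary size, dimensions, and the random input batch are all assumptions for illustration.

```python
# Sketch: LSTM over the word sequence; the last hidden state is the document
# embedding, and a linear layer on top serves a hypothetical classification task.
import torch
import torch.nn as nn

class LSTMDocModel(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=100, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer word indices
        embedded = self.embed(token_ids)            # (batch, seq_len, embed_dim)
        _, (last_hidden, _) = self.lstm(embedded)   # last_hidden: (1, batch, hidden_dim)
        doc_embedding = last_hidden[-1]             # (batch, hidden_dim) document vector
        return doc_embedding, self.classifier(doc_embedding)

model = LSTMDocModel()
fake_batch = torch.randint(0, 10000, (4, 20))       # 4 documents, 20 tokens each
doc_vecs, logits = model(fake_batch)
print(doc_vecs.shape, logits.shape)                  # [4, 128] and [4, 2]
```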