Problems
- Because the vocabulary size (V) is large, each word vector has V dimensions, which makes the vectors expensive to store and compare.
- The vectors are sparse, because most words co-occur with only a small fraction of the vocabulary.
Resolution
- Dimensionality reduction, using approaches such as:
  - Singular Value Decomposition (SVD) of the term-document matrix to get a K-dimensional approximation.
  - Other matrix factorisation techniques.
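
The SVD-based reduction can be sketched roughly as follows. This is a minimal illustration using NumPy, not a reference implementation: the helper name `reduce_with_svd` and the toy count matrix are hypothetical, and scaling the left singular vectors by the singular values is one common choice among several.

```python
import numpy as np

def reduce_with_svd(matrix, k):
    """Return k-dimensional row vectors from a truncated SVD of `matrix`."""
    # Compact SVD: matrix = u @ diag(s) @ vt, singular values in descending order
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # Keep the k largest singular values and scale the matching columns of u
    return u[:, :k] * s[:k]

# Toy 5 x 5 count matrix, one row per word (hypothetical values, mostly zeros)
counts = np.array([
    [0, 2, 1, 0, 0],
    [2, 0, 0, 1, 0],
    [1, 0, 0, 0, 3],
    [0, 1, 0, 0, 1],
    [0, 0, 3, 1, 0],
], dtype=float)

word_vectors = reduce_with_svd(counts, k=2)
print(word_vectors.shape)  # (5, 2)
```

Each row of `word_vectors` is a dense K-dimensional representation of the corresponding word (here K = 2). For a realistically large, sparse matrix, a sparse truncated solver would likely be a better fit than the dense `np.linalg.svd` used here.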
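Possible follow-up question: What information is lost when a V-dimensional word representation is approximated by a K-dimensional one? Answer: The truncated SVD gives the best rank-K approximation of the term-document matrix in the least-squares (Frobenius norm) sense; this is the Eckart-Young theorem. What is lost is the variation along the discarded directions: the Frobenius-norm error of the approximation equals the square root of the sum of the squared discarded singular values.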
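
The size of the loss can be checked numerically. The sketch below assumes NumPy; the helper name `truncation_error` and the random count matrix are hypothetical. It compares the actual Frobenius-norm error of a rank-K reconstruction with the value computed from the discarded singular values.

```python
import numpy as np

def truncation_error(matrix, k):
    """Return (actual, predicted) Frobenius error of the rank-k truncated SVD."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # Rank-k reconstruction from the k largest singular values
    approx = (u[:, :k] * s[:k]) @ vt[:k, :]
    actual = np.linalg.norm(matrix - approx, ord="fro")
    # Error computed from the discarded singular values alone
    predicted = np.sqrt(np.sum(s[k:] ** 2))
    return actual, predicted

rng = np.random.default_rng(0)
# Hypothetical 8 x 6 count matrix with many zeros
matrix = rng.poisson(0.5, size=(8, 6)).astype(float)

actual, predicted = truncation_error(matrix, k=2)
print(np.isclose(actual, predicted))  # True
```

If the discarded singular values are small relative to the retained ones, little of the matrix is lost by keeping only K dimensions.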