Out of vocabulary words are words that are not in the training set but appear in the test set or in real data. The main problem is that the model assigns a probability of zero to out of vocabulary words, resulting in a zero likelihood. This is a common problem, especially when you have not trained on a…
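For illustration, here is a minimal sketch of the problem (the toy corpus and unigram model are made up for the example): a maximum-likelihood model gives zero probability to any word it never saw, so one unseen word zeroes out the likelihood of the whole sentence.

```python
from collections import Counter

# Illustrative toy corpus; any word absent from it is "out of vocabulary".
train_tokens = "the cat sat on the mat".split()
counts = Counter(train_tokens)
total = sum(counts.values())

def unigram_prob(word):
    # Maximum-likelihood estimate: count(word) / total tokens.
    return counts[word] / total

test_sentence = "the dog sat".split()
probs = [unigram_prob(w) for w in test_sentence]
print(probs)  # "dog" was never seen, so its probability is 0.0

sentence_likelihood = 1.0
for p in probs:
    sentence_likelihood *= p
print(sentence_likelihood)  # the single zero makes the whole likelihood zero
```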
Suppose you build word vectors (embeddings) with each word vector having dimensions equal to the vocabulary size (V) and feature values given by the positive PMI (PPMI) between the corresponding words: What are the problems with this approach and how can you resolve them?
Problems: As the vocabulary size (V) is large, these vectors will be large in size. They will also be sparse, as a word may not have co-occurred with all possible words. Resolution: Dimensionality reduction using approaches like Singular Value Decomposition (SVD) of the term-document matrix to get a K-dimensional approximation. Other matrix factorisation techniques…
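As a rough sketch of the SVD-based resolution (the co-occurrence counts and the choice of K below are made up for the example, not part of the original answer): build a small PPMI matrix and keep only the top K singular directions to get dense, low-dimensional word vectors.

```python
import numpy as np

# Illustrative word-word co-occurrence counts (rows/columns are words in a tiny vocabulary).
C = np.array([[0, 4, 1, 0],
              [4, 0, 2, 1],
              [1, 2, 0, 3],
              [0, 1, 3, 0]], dtype=float)

# PPMI: max(0, log( p(i,j) / (p(i) * p(j)) )).
total = C.sum()
p_ij = C / total
p_i = C.sum(axis=1, keepdims=True) / total
p_j = C.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log(p_ij / (p_i * p_j))
ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

# Truncated SVD: keep the top K singular vectors as dense word embeddings.
K = 2
U, S, Vt = np.linalg.svd(ppmi)
word_vectors = U[:, :K] * S[:K]   # V x K instead of V x V, and no longer sparse
print(word_vectors.shape)
```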
What is negative sampling when training the skip-gram model?
Recap: The skip-gram model is a popular algorithm for training word embeddings such as word2vec. It tries to represent each word in a large text corpus as a lower-dimensional vector in a space of K dimensions such that similar words are closer to each other. This is achieved by training a feed-forward network where we try…
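To make the idea concrete, here is a minimal sketch of one skip-gram-with-negative-sampling update (the sizes, learning rate, and sampled ids are illustrative assumptions, not code from the original answer): instead of normalising over the whole vocabulary, the true context word and a handful of randomly sampled "negative" words are scored with a sigmoid and updated.

```python
import numpy as np

# Illustrative sizes; a real vocabulary and embedding dimension would be much larger/tuned.
V, K = 10000, 100
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.01, size=(V, K))    # center-word ("input") embeddings
W_out = rng.normal(scale=0.01, size=(V, K))   # context-word ("output") embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, neg_ids, lr=0.025):
    """One SGD step of skip-gram with negative sampling for a single
    (center, context) pair plus a few sampled negative word ids."""
    v = W_in[center]                               # (K,)
    ids = np.concatenate(([context], neg_ids))     # true context first, then negatives
    labels = np.zeros(len(ids))
    labels[0] = 1.0                                # 1 = real context, 0 = negative sample
    u = W_out[ids]                                 # (1 + num_negatives, K)
    scores = sigmoid(u @ v)                        # predicted probability of being a real pair
    grad = scores - labels                         # gradient of the logistic loss w.r.t. scores
    W_out[ids] -= lr * grad[:, None] * v           # update output vectors
    W_in[center] -= lr * (grad @ u)                # update the center word's vector

# Negatives are usually drawn from the unigram distribution raised to the 0.75 power;
# uniform sampling here just keeps the sketch short.
sgns_step(center=42, context=7, neg_ids=rng.integers(0, V, size=5))
```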
How do you design a system that reads a natural language question and retrieves the closest FAQ answer?
There are multiple approaches for FAQ-based question answering. Keyword-based search (information retrieval approach): tag each question with keywords, extract keywords from the query, and retrieve all relevant question-answer pairs. Easy to scale with appropriate indexes (reverse indexing). Lexical matching approach: word-level overlap between the query and the question. These approaches might be harder to…
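A minimal sketch of the lexical-matching flavour of this idea, assuming scikit-learn and a made-up FAQ list (none of the entries or the query come from the original answer): represent the FAQ questions with TF-IDF and return the answer whose question is most similar to the query by cosine similarity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical FAQ: each entry is (question, answer); contents are made up.
faq = [
    ("How do I reset my password?", "Use the 'Forgot password' link on the login page."),
    ("How do I cancel my subscription?", "Go to Billing > Cancel subscription."),
    ("How do I contact support?", "Email support@example.com."),
]

questions = [q for q, _ in faq]
vectorizer = TfidfVectorizer(stop_words="english")
question_matrix = vectorizer.fit_transform(questions)

def retrieve(query):
    # Score the query against every FAQ question and return the best-matching answer.
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, question_matrix)[0]
    best = scores.argmax()
    return faq[best][1], scores[best]

print(retrieve("I forgot my password, how can I reset it?"))
```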
What are the different ways of representing documents?
Bag of words: Commonly called BOW, this involves creating a vocabulary of words and representing the document as a count vector with dimension equal to the vocabulary size, each dimension representing the number of times a specific word occurred in the document. Sometimes, TF-IDF is used to reduce the number of dimensions/features by…
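A small sketch of the BOW representation, assuming a recent scikit-learn (the documents are made up for the example): the vocabulary is learned from the corpus and each document becomes a count vector over that vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative documents; the vocabulary is built from these alone.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)          # shape: (num_docs, vocabulary_size)

print(vectorizer.get_feature_names_out())     # the learned vocabulary
print(bow.toarray())                          # each row is a document's count vector
```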
What are popular ways of dimensionality reduction in NLP tasks? Do you think this is even important?
A common representation is bag of words, which is very high dimensional given a large vocabulary. Commonly used ways of dimensionality reduction in NLP: TF-IDF: term frequency, inverse document frequency (link to relevant article). Word2Vec / GloVe: these have become very popular recently. They are obtained by leveraging word co-occurrence, through an encoder –…
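As a hedged sketch of how word vectors reduce dimensionality in practice, assuming gensim 4.x (the toy sentences, K = 50, and the averaging scheme are illustrative choices, not prescribed by the original answer): train Word2Vec and represent a document as the average of its word vectors, a K-dimensional vector instead of a V-dimensional bag of words.

```python
import numpy as np
from gensim.models import Word2Vec

# Illustrative tokenised corpus; in practice this would be a large text collection.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["dogs", "and", "cats", "are", "pets"],
]

# K = 50 dimensions instead of a V-dimensional bag-of-words vector.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, negative=5, seed=0)

def doc_vector(tokens):
    # A simple low-dimensional document representation: average of its word vectors.
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

print(doc_vector(["the", "cat", "sat"]).shape)  # (50,)
```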