BLEU Score

This brief video describes the BLEU score, a popular evaluation metric used for several tasks such as machine translation, text summarization, and so on. What is the BLEU score? BLEU stands for Bilingual Evaluation Understudy. It is a metric used to evaluate the quality of machine-generated text by comparing it with a reference text that…
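As a quick illustration, here is a minimal sketch of computing sentence-level BLEU with NLTK's `sentence_bleu`; the reference and candidate sentences are made up for the example:

```python
# A minimal sketch of sentence-level BLEU with NLTK (assumes nltk is installed).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "is", "on", "the", "mat"]]  # list of reference token lists
candidate = ["the", "cat", "sat", "on", "the", "mat"]   # machine-generated tokens

# Smoothing avoids a zero score when some higher-order n-gram has no match.
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```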

You want to find food-related topics on Twitter – how do you go about it?

One can use any of the topic models above to get topics. To direct the topics to contain food-related information, specialized topic modeling algorithms are available. However, one simple way to steer the topics toward food-related things is to filter tweets by a limited set of food-related keywords (food, meal, dinner,…
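A minimal sketch of that keyword-filtering step is below; the keyword list and the example tweets are illustrative placeholders:

```python
# Filter tweets down to food-related ones before running a topic model.
food_keywords = {"food", "meal", "dinner", "lunch", "breakfast", "recipe"}

tweets = [
    "Just had an amazing dinner at the new place downtown!",
    "Traffic on the highway is terrible today.",
    "Trying a new recipe for lunch, wish me luck.",
]

food_tweets = [
    t for t in tweets
    if any(kw in t.lower() for kw in food_keywords)
]
print(food_tweets)  # only the two food-related tweets remain
```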

Suppose you build word vectors (embeddings) with each word vector having dimensions equal to the vocabulary size (V) and feature values given by the positive PMI (PPMI) between the corresponding words: what are the problems with this approach and how can you resolve them?

Problems: As the vocabulary size (V) is large, these vectors will be large. They will also be sparse, as a word may not have co-occurred with all possible words. Resolution: Dimensionality reduction using approaches like Singular Value Decomposition (SVD) of the term-document matrix to get a K-dimensional approximation. Other matrix factorisation techniques…
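Here is a minimal sketch of the SVD-based reduction, assuming SciPy is available; the sparse random matrix is just a stand-in for a real PPMI matrix:

```python
# Reduce a sparse V x V matrix to K dense dimensions via truncated SVD.
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

V, K = 1000, 50
ppmi = sparse_random(V, V, density=0.01, format="csr")  # stand-in for a real PPMI matrix

# Keep only the top-K singular vectors; rows of U * S give dense K-dim word vectors.
U, S, Vt = svds(ppmi, k=K)
word_vectors = U * S          # shape (V, K)
print(word_vectors.shape)
```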

What is PMI?

PMI, or Pointwise Mutual Information, is a measure of correlation between two events x and y: PMI(x, y) = log [ p(x, y) / (p(x) p(y)) ]. As you can see from this expression, PMI is directly proportional to the number of times both events occur together and inversely proportional to the individual counts, which are in the denominator. This expression ensures high…
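For concreteness, here is a minimal sketch of computing PMI from raw counts; all the counts are made up for illustration:

```python
# PMI from co-occurrence counts: log of observed joint probability
# over the probability expected if x and y were independent.
import math

N = 10_000         # total number of observations
count_x = 200      # occurrences of event x
count_y = 300      # occurrences of event y
count_xy = 50      # co-occurrences of x and y

pmi = math.log((count_xy / N) / ((count_x / N) * (count_y / N)))
print(f"PMI(x, y) = {pmi:.3f}")  # positive => x and y co-occur more than chance
```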

Given the following two sentences, how do you determine whether “Teddy” refers to a person? “Teddy bears are on sale!” and “Teddy Roosevelt was a great President!”

This is an example of a Named Entity Recognition (NER) problem. One can build a sequence model such as an LSTM to perform this task. However, as the two sentences above show, a forward-only LSTM might fail here. Using only the forward direction might result in a model which recognises Teddy as a product, a “bear”, which is on…
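A minimal sketch of a bidirectional LSTM tagger, assuming TensorFlow/Keras; the vocabulary size, embedding size, and number of tags are placeholders:

```python
# Bidirectional LSTM for per-token tagging: the backward pass lets the model
# see "Roosevelt" / "bears" when deciding how to tag "Teddy".
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, TimeDistributed

vocab_size, embed_dim, num_tags = 10_000, 100, 5  # assumed values

model = Sequential([
    Embedding(vocab_size, embed_dim),
    Bidirectional(LSTM(64, return_sequences=True)),
    TimeDistributed(Dense(num_tags, activation="softmax")),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

dummy = np.random.randint(0, vocab_size, size=(2, 10))  # 2 sentences of 10 tokens
print(model(dummy).shape)  # (2, 10, num_tags): one tag distribution per token
```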

Say you’ve generated a language model using Bag of Words (BoW) with 1-hot encoding, and your training set has a lot of sentences with the word “good” but none with the word “great”. Suppose you see the sentence “Have a great day”: p(great) = 0.0 under this language model. How can you solve this problem by leveraging the fact that “good” and “great” are similar words?

BoW with 1-hot encoding doesn’t capture the meaning of sentences; it only captures co-occurrence statistics. We need to build the language model using features which are representative of the meaning of the words. A simple solution could be to cluster the word embeddings and group synonyms into a unique token. Alternatively, when a word has…
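Here is a minimal sketch of the clustering idea using k-means over word vectors; the tiny 2-D vectors below are made up so that “good” and “great” land in the same cluster:

```python
# Group similar words by clustering their embeddings, so one cluster id
# can serve as a shared token for synonyms like "good" and "great".
import numpy as np
from sklearn.cluster import KMeans

words = ["good", "great", "car", "truck"]
vectors = np.array([[0.90, 0.10], [0.85, 0.15], [0.10, 0.90], [0.12, 0.88]])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
clusters = {w: int(l) for w, l in zip(words, labels)}
print(clusters)  # "good" and "great" share a cluster id and can share one token
```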

Suppose you are modeling text with an HMM. What is the complexity of finding the most probable sequence of tags or states for a sequence of text using a brute-force algorithm?

Assume there are N total states and let T be the length of the sequence. Think about how we generate text using an HMM: we first have a state sequence, and from each state we emit an output. From each state, any word out of M possible outcomes can be generated. Since there are N states, at each possible…
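A minimal sketch of why brute force costs O(N^T): it enumerates every possible state sequence, as below (N and T are small illustrative values):

```python
# Brute force over an HMM tag sequence: enumerate all N**T state sequences.
from itertools import product

N, T = 3, 4
states = range(N)

# Every candidate tag sequence for a length-T sentence:
all_sequences = list(product(states, repeat=T))
print(len(all_sequences))  # N ** T = 81; scoring each sequence adds an O(T) factor
```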