This is an example of a Named Entity Recognition (NER) problem. One can build a sequence model such as an LSTM to perform this task. However, as shown in both the sentences above, a forward-only LSTM might fail here. It might produce a model that recognises Teddy as a product: “bear”, which is on…
Author: MLNerds
Say you’ve generated a language model using Bag of Words (BoW) with 1-hot encoding, and your training set has a lot of sentences with the word “good” but none with the word “great”. Suppose you see the sentence “Have a great day”: under this language model, p(great) = 0.0. How can you solve this problem by leveraging the fact that “good” and “great” are similar words?
BoW with 1-hot encoding doesn’t capture the meaning of sentences; it only captures co-occurrence statistics. We need to build the language model using features that are representative of the meaning of the words. A simple solution could be to cluster the word embeddings and group synonyms into a single token. Alternatively, when a word has…
What is the complexity of the Viterbi algorithm?
The Viterbi algorithm is a dynamic programming approach to finding the most probable sequence of hidden states given the observed data, as modeled by an HMM. Without dynamic programming, the problem becomes exponential, since the number of possible state sequences for a given observation grows exponentially with its length (how is explained in the answer below). Let the transition probabilities (state transition)…
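The dynamic programming idea can be sketched in a few lines. This is a minimal illustration, not the post's own code: the names `pi`, `A`, `B` and `obs` are assumed here for the initial, transition and emission probabilities and the observation indices. With N states and T observations, the loop runs T times and fills an N x N score table each time, which gives O(T·N²) time and O(T·N) memory.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most probable state path for an observation sequence.

    pi:  (N,)   initial state probabilities (assumed name)
    A:   (N, N) A[i, j] = p(state j at t | state i at t-1)
    B:   (N, V) B[i, k] = p(observation k | state i)
    obs: sequence of T observation indices
    Returns (path, log-probability of that path).
    """
    N, T = len(pi), len(obs)
    with np.errstate(divide="ignore"):  # log(0) = -inf is acceptable here
        log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)

    delta = np.empty((T, N))            # best log-prob of a path ending in each state
    back = np.zeros((T, N), dtype=int)  # best predecessor of each state
    delta[0] = log_pi + log_B[:, obs[0]]

    for t in range(1, T):               # T steps ...
        scores = delta[t - 1][:, None] + log_A  # ... each over an N x N table
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]

    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):       # follow the back-pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(delta[-1].max())
```

Working in log space is a design choice to avoid underflow on long sequences; it does not change which path is selected.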
Suppose you are modeling text with an HMM. What is the complexity of finding the most probable sequence of tags or states for a sequence of text using a brute-force algorithm?
Assume there are $N$ states in total and let $T$ be the length of the longest sequence. Think about how we generate text using an HMM: we first have a state sequence, and each state emits an output. From each state, any word out of the possible outcomes can be generated. Since there are $N$ states, at each possible…
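To make the exponential blow-up concrete, here is a hedged sketch of brute-force decoding. The function name and the parameters `pi`, `A`, `B`, `obs` are assumptions for illustration. It enumerates every one of the N^T state sequences and scores each in O(T), so the total work is roughly O(T·N^T).

```python
from itertools import product

def brute_force_decode(pi, A, B, obs):
    """Score every possible state sequence and keep the best one.

    pi[i]:   initial probability of state i
    A[i][j]: probability of moving from state i to state j
    B[i][k]: probability of emitting observation k from state i
    Returns (best_path, best_prob, number_of_paths_tried).
    """
    N, T = len(pi), len(obs)
    best_path, best_prob, n_paths = None, -1.0, 0
    for path in product(range(N), repeat=T):  # N**T candidate sequences
        n_paths += 1
        p = pi[path[0]] * B[path[0]][obs[0]]
        for t in range(1, T):                 # O(T) work per candidate
            p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
        if p > best_prob:
            best_path, best_prob = list(path), p
    return best_path, best_prob, n_paths
```

Even for a modest tag set the count grows quickly: with 10 states and a 20-word sentence there are 10^20 candidate sequences.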
How do you find the most probable sequence of POS tags from a sequence of text?
This problem can be solved with an HMM. Using an HMM involves finding the transition probabilities (the probability of going from one POS tag to another) and the emission/output probabilities (the probability of observing a word given a POS tag), as explained in the question “How do you train an HMM”. Once…
How do you train an HMM in practice?
The joint probability distribution for the HMM is given by the following equation, where $x_1, \dots, x_T$ are the observed data points and $z_1, \dots, z_T$ the corresponding latent states: $p(x_1, \dots, x_T, z_1, \dots, z_T) = p(z_1) \prod_{t=2}^{T} p(z_t \mid z_{t-1}) \prod_{t=1}^{T} p(x_t \mid z_t)$. Before proceeding to answer the question of how to train an HMM, it makes sense to ask the following questions: What is the problem at hand for which we are training…
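The excerpt is cut off before it describes a training procedure, so the following is only one common case, stated as an assumption: when the training data comes with state labels (for example POS-tagged sentences), maximum-likelihood estimates reduce to normalized counts. Without labels, an EM procedure such as Baum-Welch is typically used instead. The function name and data layout below are hypothetical.

```python
from collections import Counter, defaultdict

def train_hmm_supervised(tagged_sentences):
    """Estimate HMM probabilities by counting (no smoothing).

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    Returns (initial, transition, emission) as nested dicts of probabilities.
    """
    init = Counter()
    trans = defaultdict(Counter)
    emit = defaultdict(Counter)
    for sentence in tagged_sentences:
        prev = None
        for word, tag in sentence:
            if prev is None:
                init[tag] += 1         # tag that starts a sentence
            else:
                trans[prev][tag] += 1  # tag-to-tag transition
            emit[tag][word] += 1       # word emitted from this tag
            prev = tag

    def normalize(counter):
        total = sum(counter.values())
        return {key: count / total for key, count in counter.items()}

    return (normalize(init),
            {tag: normalize(c) for tag, c in trans.items()},
            {tag: normalize(c) for tag, c in emit.items()})
```

In practice some form of smoothing is usually added so that unseen words or transitions do not receive zero probability.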
What are the different independence assumptions in HMM and Naive Bayes?
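Both the HMM and Naive Bayes make conditional independence assumptions. The HMM can be expressed by the equations below: $p(z_t \mid z_1, \dots, z_{t-1}) = p(z_t \mid z_{t-1})$ and $p(x_t \mid z_1, \dots, z_t, x_1, \dots, x_{t-1}) = p(x_t \mid z_t)$. The second equation implies a conditional independence assumption: given the state $z_t$, the observed variable $x_t$ is conditionally independent of the previous observed variables, i.e. $x_t \perp x_1, \dots, x_{t-1} \mid z_t$. The Naive Bayes model is expressed as: $p(y, x_1, \dots, x_n) = p(y) \prod_{i=1}^{n} p(x_i \mid y)$, where $x_i$ is the feature…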
How many parameters are there for an HMM?
Let us calculate the number of parameters for a bigram HMM, given as $p(x_1, \dots, x_T, z_1, \dots, z_T) = p(z_1) \prod_{t=2}^{T} p(z_t \mid z_{t-1}) \prod_{t=1}^{T} p(x_t \mid z_t)$. Let $N$ be the total number of states, $V$ the vocabulary size and $T$ the length of the sequence. Before directly estimating the number of parameters, let us first see what the different probabilities, or rather probability matrix…
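As a rough sketch of the bookkeeping, assuming a standard bigram HMM with an initial distribution, a transition matrix and an emission matrix (the function and argument names are illustrative, and the post's own final count is not visible in this excerpt):

```python
def hmm_parameter_count(n_states, vocab_size):
    """Count the entries of each probability table in a bigram HMM.

    'total' counts every table entry; 'free' subtracts one entry per
    distribution, because each distribution must sum to 1.
    """
    initial = n_states                 # p(first state)
    transition = n_states * n_states   # p(next state | current state)
    emission = n_states * vocab_size   # p(word | state)
    total = initial + transition + emission
    free = (n_states - 1) + n_states * (n_states - 1) + n_states * (vocab_size - 1)
    return {"initial": initial, "transition": transition,
            "emission": emission, "total": total, "free": free}
```

Note that the sequence length does not appear in the count: the same tables are reused at every position.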
How do you generate text using a Hidden Markov Model (HMM)?
The HMM is a latent variable model in which the observed sequence of variables $x_1, \dots, x_T$ is assumed to be generated from a set of temporally connected latent variables $z_1, \dots, z_T$. The joint distribution of the observed variables (the data) and the latent variables can be written as: $p(x_1, \dots, x_T, z_1, \dots, z_T) = p(z_1) \prod_{t=2}^{T} p(z_t \mid z_{t-1}) \prod_{t=1}^{T} p(x_t \mid z_t)$. One possible interpretation of the latent variables in…
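The generative reading of the model can be sketched as ancestral sampling: draw a first state, emit a word from it, move to the next state, and repeat. The function name, the parameters `pi`, `A`, `B` and the `vocab` list are assumptions made for this illustration.

```python
import numpy as np

def sample_hmm(pi, A, B, length, vocab, seed=None):
    """Sample a state sequence and the words it emits.

    pi:    (N,)   initial state probabilities
    A:     (N, N) transition probabilities, rows sum to 1
    B:     (N, V) emission probabilities, rows sum to 1
    vocab: list of V words; B[:, k] refers to vocab[k]
    """
    pi, A, B = np.asarray(pi), np.asarray(A), np.asarray(B)
    rng = np.random.default_rng(seed)
    states, words = [], []
    state = rng.choice(len(pi), p=pi)                  # first latent state
    for _ in range(length):
        states.append(int(state))
        word_idx = rng.choice(len(vocab), p=B[state])  # emit a word from the state
        words.append(vocab[int(word_idx)])
        state = rng.choice(len(pi), p=A[state])        # move to the next state
    return states, words
```

Passing a fixed `seed` makes a sampled sequence reproducible, which helps when inspecting what a trained model generates.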
What order of Markov assumption does an n-gram model make?
An n-gram model makes an order n-1 Markov assumption. This assumption implies that, given the previous n-1 words, the probability of a word $w_i$ is independent of the words prior to those n-1 words. Suppose we have k words in a sentence; their joint probability can be expressed as follows using the chain rule: $p(w_1, \dots, w_k) = \prod_{i=1}^{k} p(w_i \mid w_1, \dots, w_{i-1})$. Now, the Markov assumption can be used to make…
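For reference, the effect of the assumption can be written out as follows. The notation is assumed here rather than taken from the post: $w_i$ is the $i$-th word of a $k$-word sentence, and context positions with index below 1 are simply dropped (or replaced by padding tokens).

```latex
% Order n-1 Markov assumption: each word depends only on the previous n-1 words
p(w_i \mid w_1, \dots, w_{i-1}) \approx p(w_i \mid w_{i-n+1}, \dots, w_{i-1})

% Joint probability of a k-word sentence under that assumption
p(w_1, \dots, w_k) \approx \prod_{i=1}^{k} p(w_i \mid w_{i-n+1}, \dots, w_{i-1})

% Bigram case (n = 2): a first-order Markov chain over words
p(w_1, \dots, w_k) \approx p(w_1) \prod_{i=2}^{k} p(w_i \mid w_{i-1})
```

A unigram model (n = 1) is the order-0 case, in which every word is treated as independent of all the others.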