MLNerds

How is long term dependency maintained while building a language model?

Posted on February 16, 2019 (updated March 8, 2019) by MLNerds

Language models can be built using the following popular methods:

Using an n-gram language model: n-gram language models assume a fixed value of n. The larger the value of n, the longer the dependency. See "What is the significance of n-grams in a language model?" for further reading.

Using a hidden Markov model (HMM): HMM maintains long…

What is the significance of n-grams in a language model?

Posted on February 16, 2019 by MLNerds

An n-gram is a sequence of n consecutive words (tokens, or "grams"). In general, n-grams either preserve word ordering or indicate what level of dependency is assumed in order to simplify the modeling task. With a bag-of-words representation, n-grams come in handy for preserving the ordering between words, but for language modeling they signify the…
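As a small illustration of the definition above, here is a minimal sketch that extracts n-grams from a token list. The function name `ngrams` and the example sentence are made up for illustration; they do not come from the original post.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, each as a tuple."""
    # If n is larger than the number of tokens, the range is empty
    # and no n-grams are produced.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the cat sat on the mat".split()
bigrams = ngrams(tokens, 2)   # pairs of neighbouring words
trigrams = ngrams(tokens, 3)  # triples of neighbouring words
```

A sentence of 6 tokens yields 5 bigrams and 4 trigrams, so the number of distinct n-grams a model has to account for grows quickly with n.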

Given a bigram language model, in what scenarios do we encounter zero probabilities? How should we handle these situations?

Posted on February 16, 2019 (updated March 8, 2019) by MLNerds

Recall that the bigram model can be expressed as: $P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_{i-1})$, where $w_0$ is a start token. Scenario 1: out-of-vocabulary (OOV) words. Such words may not be present during training, so any probability term involving an OOV word will be 0.0, making the entire product zero. This is solved by replacing OOV words with UNK tokens in both…
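The UNK replacement described above can be sketched as follows. This is only an illustration under stated assumptions: the helper `replace_oov`, the `<UNK>` spelling, and the rule that words seen once in training are treated as out of vocabulary are all hypothetical choices, not something the excerpt specifies.

```python
from collections import Counter


def replace_oov(tokens, vocab, unk="<UNK>"):
    """Map every token that is not in the vocabulary to the UNK token."""
    return [t if t in vocab else unk for t in tokens]


train = "the cat sat on the mat".split()
counts = Counter(train)

# Assumption: words seen only once in training are treated as OOV, so the
# model gets to learn statistics for <UNK> from the training data itself.
vocab = {w for w, c in counts.items() if c > 1}

train_unk = replace_oov(train, vocab)
test_unk = replace_oov("the dog sat".split(), vocab)
```

After the replacement, an unseen test word such as "dog" is scored through the `<UNK>` statistics rather than producing a zero term.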

Why is smoothing applied in a language model?

Posted on February 16, 2019 (updated March 8, 2019) by MLNerds

Because some n-grams in the test set may not be present in the training set. For example, if the training corpus is and you need to find the probability of a sequence like where <START> is the token applied at the beginning of the document. Then…
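To make the effect of smoothing concrete, here is a minimal sketch using add-k (Laplace) smoothing on bigram counts. The excerpt does not say which smoothing method the post uses, so add-k is an assumption chosen for simplicity; the toy corpus and the function name `bigram_prob` are also hypothetical.

```python
from collections import Counter


def bigram_prob(prev, word, unigram_counts, bigram_counts, vocab_size, k=1.0):
    """Add-k smoothed estimate of P(word | prev)."""
    # Counter returns 0 for unseen bigrams, so the numerator is never zero.
    return (bigram_counts[(prev, word)] + k) / (unigram_counts[prev] + k * vocab_size)


# Toy training corpus (illustrative only): two short documents.
train = ["<START>", "i", "like", "tea", "<START>", "i", "like", "coffee"]
unigram_counts = Counter(train)
bigram_counts = Counter(zip(train, train[1:]))
V = len(unigram_counts)  # vocabulary size, including <START>

p_seen = bigram_prob("i", "like", unigram_counts, bigram_counts, V)
p_unseen = bigram_prob("like", "i", unigram_counts, bigram_counts, V)
```

The unseen bigram ("like", "i") gets a small non-zero probability instead of 0, so a test sequence containing it no longer has zero probability overall.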

If the average length of a sentence is 100 in all documents, should we build a 100-gram language model?

Posted on February 16, 2019 by MLNerds

A 100-gram model would be far more complex and have a very large number of parameters. One approach is to try n-gram models with different values of n, from 2 up to 10 in the worst case. After some value of n, say n = 7, the accuracy of the model becomes almost stagnant. One reason for this could be that…

How to measure the performance of a language model?

Posted on February 16, 2019 (updated February 21, 2019) by MLNerds

While building a language model, we try to estimate the probability of a sentence or a document. Given sequences (sentences or documents) $w_1, w_2, \dots, w_n$, the bigram language model assigns each sequence the probability $P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_{i-1})$. Once we apply maximum likelihood estimation (MLE), we have a value for each term $P(w_i \mid w_{i-1})$. Perplexity…
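Since the excerpt breaks off at perplexity, here is a minimal sketch using the standard definition: the exponential of the average negative log-probability per token. The function name and the input format (a list of per-token conditional probabilities already produced by some model) are assumptions made for illustration.

```python
import math


def perplexity(token_probs):
    """Perplexity of a sequence given the model's probability for each token."""
    n = len(token_probs)
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_neg_log_prob)


# A model that assigns probability 0.25 to every token is as uncertain
# as a uniform choice among 4 words.
uniform = perplexity([0.25] * 8)
```

Lower perplexity means the model assigns higher probability to the held-out text. Note that a single zero probability makes the logarithm undefined, which is one more reason smoothing is needed.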

What would you care more about – precision or recall for a spam filtering problem?

Posted on February 16, 2019 by MLNerds

A false positive (FP) means the message was not spam but we called it spam; a false negative (FN) means it was spam but we didn’t label it as spam. Precision = TP / (TP + FP) and Recall = TP / (TP + FN). Increasing precision involves decreasing FP, and increasing recall means decreasing FN. We don’t want…
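The two formulas above can be checked with a short sketch. The counts used here are invented for illustration and do not come from any real spam filter.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp)  # of everything flagged as spam, how much was spam
    recall = tp / (tp + fn)     # of all actual spam, how much was flagged
    return precision, recall


# Hypothetical counts: 90 spam messages caught, 10 legitimate messages
# wrongly flagged, 30 spam messages missed.
precision, recall = precision_recall(tp=90, fp=10, fn=30)
```

With these counts, precision is 0.9 and recall is 0.75: lowering FP raises precision, while lowering FN raises recall.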

What is the difference between stemming and lemmatisation?

Posted on February 16, 2019 (updated February 21, 2019) by MLNerds

Stemming replaces each word with its stem by removing suffixes such as “es”, “ies” and “s”. For example, “cats” => “cat” and “computers” => “computer”. It is a heuristic approach that does not use any grammar or dictionary. Lemmatisation has the same purpose but does it properly…
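To show how heuristic stemming is, here is a minimal sketch of a rule-based plural stripper. The rules follow step 1a of the Porter stemmer (SSES -> SS, IES -> I, SS -> SS, S -> nothing); this is only one fragment of a real stemmer, and the function name `stem_plural` is made up for this example.

```python
def stem_plural(word):
    """Strip plural suffixes using the rules of Porter stemmer step 1a."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word


stems = [stem_plural(w) for w in ["cats", "computers", "ponies", "caresses"]]
```

Note that "ponies" becomes "poni", which is not a dictionary word. That is the kind of result a purely rule-based approach accepts, and what a dictionary-backed lemmatiser is meant to avoid.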

What are the optimization algorithms typically used in a neural network?

Posted on February 14, 2019 by MLNerds

Gradient descent is the most commonly used training algorithm. Momentum is a common way to augment gradient descent: the gradient at each step is accumulated over past steps, which lets the algorithm proceed more smoothly towards the minimum. RMSProp attempts to adjust the learning rate for each iteration in an automated…
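The momentum idea described above can be sketched on a one-dimensional problem. This is an illustrative toy, not a neural-network optimizer: the function name `minimize`, the hyperparameter values, and the quadratic objective are all assumptions chosen to keep the example small.

```python
def minimize(grad, x0, lr=0.1, beta=0.9, steps=200):
    """Gradient descent with momentum on a scalar parameter."""
    x, velocity = x0, 0.0
    for _ in range(steps):
        # Accumulate the current gradient into a decaying sum of past gradients.
        velocity = beta * velocity + grad(x)
        # Step against the accumulated direction rather than the raw gradient.
        x = x - lr * velocity
    return x


# Toy objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = minimize(lambda x: 2 * (x - 3), x0=0.0)
```

Setting `beta=0.0` recovers plain gradient descent, which makes it easy to compare the two on the same objective.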

Given a deep learning model, what are the considerations to set mini-batch size?

Posted on February 14, 2019 by MLNerds

The batch size is a hyperparameter. Usually people try various values to see what works best in terms of speed and accuracy. Suppose you have M training instances and a batch size of k. A pass over the entire dataset then takes M/k mini-batch iterations, so a larger batch size completes a pass in fewer iterations. As long as the data…
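The relation between dataset size, batch size, and iterations per pass can be sketched as follows. The helper names and the numbers are hypothetical; the rounding up covers the case where M is not an exact multiple of the batch size, assuming the last partial batch is kept.

```python
import math


def iterations_per_epoch(num_examples, batch_size):
    """Number of mini-batch iterations needed for one pass over the data."""
    return math.ceil(num_examples / batch_size)


def mini_batches(data, batch_size):
    """Yield consecutive slices of the data, each at most batch_size long."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


small = iterations_per_epoch(1000, 32)   # smaller batches, more iterations
large = iterations_per_epoch(1000, 100)  # larger batches, fewer iterations
batches = list(mini_batches(list(range(10)), 4))
```

Fewer iterations per pass does not by itself mean better accuracy, which is why the batch size is usually tuned empirically.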
