How are long-term dependencies maintained while building a language model?

Language models can be built using the following popular methods:

Using an n-gram language model: n-gram language models make an assumption about the value of n. The larger the value of n, the longer the dependency. See "What is the significance of n-grams in a language model?" for further reading.

Using a hidden Markov model (HMM): HMM maintains long…

What is the significance of n-grams in a language model?

An n-gram is a sequence of n consecutive words, tokens, or grams. In general, n-grams can either preserve ordering or indicate what level of dependency is assumed in order to simplify the modeling task. With a bag-of-words representation, n-grams come in handy for preserving the ordering between words, but for language modeling, they signify the…
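As a minimal sketch of the definition above, the following Python extracts every run of n consecutive tokens from a sentence. The function name `ngrams` and the whitespace tokenization are assumptions made for illustration only.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, in their original order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence of length L contains L - n + 1 n-grams (none if n > L).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical example sentence, tokenized on whitespace.
tokens = "the cat sat on the mat".split()
bigrams = ngrams(tokens, 2)   # pairs such as ("the", "cat")
trigrams = ngrams(tokens, 3)  # triples such as ("the", "cat", "sat")
```

A larger n keeps more of the word order inside each unit, at the cost of many more distinct n-grams to count.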

Given a bigram language model, in what scenarios do we encounter zero probabilities? How should we handle these situations?

Recall that the bigram model can be expressed as:

$$P(w_1, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_{i-1})$$

where w_0 is a start-of-sequence token.

Scenario 1: Out-of-vocabulary (OOV) words. Such words may not be present during training, and hence any probability term involving an OOV word will be 0.0, making the entire product zero. This is solved by replacing OOV words by UNK tokens in both…
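One possible way to carry out the UNK replacement is sketched below. The `<UNK>` spelling, the helper names, and the `min_count` threshold (treating rare training words as unknown so that UNK itself receives counts) are assumptions for illustration, not details taken from the answer.

```python
from collections import Counter

UNK = "<UNK>"  # assumed spelling of the unknown-word token


def build_vocab(train_tokens, min_count=2):
    """Keep words seen at least min_count times; the rest are treated as OOV."""
    counts = Counter(train_tokens)
    return {word for word, count in counts.items() if count >= min_count}


def replace_oov(tokens, vocab):
    """Map every token outside the vocabulary to the UNK token."""
    return [word if word in vocab else UNK for word in tokens]


# Hypothetical toy data.
train = "the cat sat on the mat the cat ran".split()
vocab = build_vocab(train, min_count=2)
train_unk = replace_oov(train, vocab)
test_unk = replace_oov("the dog sat".split(), vocab)
```

Because the same mapping is applied to every corpus it is run on, any bigram involving an unseen word becomes a bigram involving `<UNK>`, which the model can assign a count.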

Why is smoothing applied in a language model?

Because some n-grams in the test set may not be present in the training set. For example, if the training corpus is      and you need to find the probability of a sequence like         where <START> is the token applied at the beginning of the document. Then…
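To illustrate the idea, here is a hedged sketch of add-k (Laplace) smoothing for a bigram model. Add-k is only one of several smoothing schemes and is chosen here as an assumption; the two-sentence corpus is hypothetical and is not the example from the original answer.

```python
from collections import Counter

START = "<START>"


def train_counts(sentences):
    """Count bigrams, their left contexts, and the predictable vocabulary."""
    context_counts, bigram_counts, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = [START] + sentence.split()
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
            vocab.add(word)
    return context_counts, bigram_counts, vocab


def smoothed_prob(prev, word, context_counts, bigram_counts, vocab, k=1.0):
    """Add-k estimate of P(word | prev); never zero when k > 0."""
    return (bigram_counts[(prev, word)] + k) / (context_counts[prev] + k * len(vocab))


# Hypothetical toy corpus.
counts = train_counts(["i like cats", "i like dogs"])
p_seen = smoothed_prob("like", "cats", *counts)  # bigram observed in training
p_unseen = smoothed_prob("i", "dogs", *counts)   # bigram never observed
```

With k = 1, the unseen bigram gets a small non-zero probability, while each conditional distribution still sums to one over the vocabulary.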

How do you measure the performance of a language model?

While building a language model, we try to estimate the probability of a sentence or a document. Given sequences (sentences or documents) like     the language model (a bigram language model) will be:     for each sequence given by the above equation. Once we apply maximum likelihood estimation (MLE), we should have a value for the term . Perplexity…
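As a rough sketch, perplexity under a bigram model can be computed as the exponential of the negative average log-probability per predicted token. The function signature and the `prob(prev, word)` callback are assumptions for illustration; any smoothed bigram estimate could be passed in.

```python
import math

START = "<START>"


def perplexity(sentences, prob):
    """exp(-(1/N) * sum of log P(w_i | w_{i-1})) over all N predicted tokens.

    prob must return a strictly positive value for every bigram, which is
    why an unsmoothed model cannot be evaluated on unseen bigrams.
    """
    log_prob_sum = 0.0
    n_tokens = 0
    for sentence in sentences:
        tokens = [START] + sentence.split()
        for prev, word in zip(tokens, tokens[1:]):
            log_prob_sum += math.log(prob(prev, word))
            n_tokens += 1
    return math.exp(-log_prob_sum / n_tokens)


def uniform(prev, word):
    """Hypothetical model: every word equally likely over a 4-word vocabulary."""
    return 0.25


pp = perplexity(["a b c", "d a"], uniform)  # approximately 4.0
```

The uniform case is a useful sanity check: a model that spreads probability evenly over V words has perplexity V, and a better model scores lower.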

What is the difference between stemming and lemmatisation?

Stemming replaces each word with its stem by removing suffixes such as “es”, “ies”, and “s”. For example, “cats” => “cat” and “computers” => “computer”. This is a heuristic approach that does not use any grammar or dictionary. Lemmatisation has the same purpose as above but doing it properly…
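The heuristic nature of stemming can be seen in a toy suffix stripper like the one below. This is a deliberately naive sketch with made-up rules, not the Porter algorithm or any library's stemmer.

```python
def naive_stem(word):
    """Strip a few plural suffixes using fixed rules, with no dictionary lookup."""
    word = word.lower()
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"   # "ponies" -> "pony"
    if word.endswith("es") and word[:-2].endswith(("s", "x", "z", "ch", "sh")):
        return word[:-2]         # "boxes" -> "box"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]         # "cats" -> "cat"
    return word


stems = [naive_stem(w) for w in ["cats", "computers", "ponies", "boxes", "class"]]
mistake = naive_stem("news")  # the rules wrongly strip the final "s"
```

The last call shows why rules without a dictionary are error-prone: "news" is not a plural, yet the rule still removes its final "s".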

You are given some documents and asked to find the prevalent topics in them. How do you go about it?

This is typically called topic modeling. Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents. For instance, two statements, one about meals and one about food, can probably be characterized by the same topic even though they do not necessarily use the same vocabulary. Topic models typically…

What is speaker segmentation in speech recognition? How do you use it?

Speaker diarization, or speaker segmentation, is the process of automatically assigning a speaker identity to each segment of an audio file. Segmenting by speaker is useful in several applications for understanding who said what in a conversation. Typically, speaker information is crucial for applications such as emotion detection, behavioural analysis or topic analysis of…

What is a language model? How do you create one? Why do you need one?

A language model is a probability distribution over sequences of words P(w_1, …, w_m). It enables us to measure the relative likelihood of different phrases. Measuring the likelihood of a sequence of words is useful in many NLP tasks such as speech recognition, machine translation, POS tagging, parsing, and so on. Example: In any generative…
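For reference, the joint probability P(w_1, …, w_m) can always be factored with the chain rule of probability, which is the usual starting point for building such a model. The block below is a standard identity rather than something taken from the answer; an n-gram model approximates each factor by conditioning only on the previous n-1 words.

```latex
P(w_1, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})
\approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
```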