You want to find food related topics in twitter – how do you go about it ?

One can use any of the topic models above to get topics. However, to direct the topics to contain food related information, specialized topic modeling algorithms are available. However, one simple way to direct the topics to food related things is : Filter tweets by a limited set of food related keywords (food, meal, dinner,…

Suppose you build word vectors (embeddings) with each word vector having dimensions as the vocabulary size(V) and feature values as pPMI between corresponding words: What are the problems with this approach and how can you resolve them ?

Problems As the vocabulary size (V) is large, these vectors will be large in size. They will be sparse as a word may not have co-occurred with all possible words. Resolution Dimensionality Reduction using approaches like Singular Value Decomposition (SVD) of the term document matrix to get a K dimensional approximation. Other Matrix factorisation techniques…

What is speaker segmentation in speech recognition ? How do you use it ?

Speaker diarization or speaker segmentation is the process of automatically assigning a speaker identity to each segment of the audio file. Segmenting by speaker is very useful in several applications  to understand who said what in a conversation. Typically speaker information is crucial for applications such as emotion detection, behavioural analysis or topic analysis of…

What is a language model ? How do you create one ? Why do you need one ?

A language model is a probability distribution over sequences of words P(w_1,… ,w_m). It enables us to measure the relative likelihood of different phrases. Measuring the likelihood of a sequence of words is useful  in many NLP tasks such as speech recognition, machine translation, POS tagging, parsing, and so on. Example :  In any generative…

What is the state of the art technique for Machine Translation ?

Rule based machine translation (Older techniques) : Uses dictionary between words of the two languages along with syntactic, semantic morphological analysis of the source sentence to define  context. Linguistic Rules are defined to translate a specific word in a given context into target language. https://en.wikipedia.org/wiki/Rule-based_machine_translation Advantages of this approach : No requirement of parallel corpora…

How do you design a system that reads a natural language question and retrieves the closest FAQ answer?

There are multiple approaches for FAQ based question answering Keyword based search (Information retrieval approach): Tag each question with keywords. Extract keywords from query and retrieve all relevant questions answers. Easy to scale with appropriate indexes reverse indexing. Lexical matching approach : word level overlap between query and question. These approaches might be harder to…

How do you deal with dataset imbalance in a problem like spam filtering ?

Class imbalance is a very common problem when applying ML algorithms. Spam filtering is one such application where class imbalance is apparent. There are many more non-spam emails in a typical inbox than spam emails. The following approaches can be used to address the class imbalance problem. Designing an Assymetric cost function where the cost…