Speaker diarization, or speaker segmentation, is the process of automatically assigning a speaker identity to each segment of an audio file. Segmenting by speaker is useful in many applications for understanding who said what in a conversation. Typically, speaker information is crucial for applications such as emotion detection, behavioural analysis or topic analysis of…
Category: Natural Language Processing
What is a language model? How do you create one? Why do you need one?
A language model is a probability distribution over sequences of words, P(w_1, …, w_m). It enables us to measure the relative likelihood of different phrases. Measuring the likelihood of a sequence of words is useful in many NLP tasks such as speech recognition, machine translation, POS tagging, parsing, and so on. Example: In any generative…
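As a rough illustration of assigning a probability to a word sequence, here is a minimal sketch of a bigram model estimated by counting. The bigram assumption, the toy corpus, and the function names `train_bigram_model` and `sentence_probability` are illustrative choices, not something the post prescribes; no smoothing is applied, so any unseen bigram gives probability zero.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Count contexts and bigrams over whitespace-tokenized sentences."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        # Every token except the last one acts as a context for the next word.
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))
    return context_counts, bigram_counts


def sentence_probability(sentence, context_counts, bigram_counts):
    """P(w_1, ..., w_m) approximated as a product of P(w_i | w_{i-1})."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if context_counts[prev] == 0:
            return 0.0  # context never seen in training
        prob *= bigram_counts[(prev, word)] / context_counts[prev]
    return prob


# Toy corpus (hypothetical data, for illustration only).
corpus = ["the cat sat", "the cat ran", "the dog sat"]
context_counts, bigram_counts = train_bigram_model(corpus)

p_seen = sentence_probability("the cat sat", context_counts, bigram_counts)
p_unseen = sentence_probability("the dog ran", context_counts, bigram_counts)
print(p_seen, p_unseen)
```

The seen sentence gets a higher probability than the one containing an unseen bigram, which is the "relative likelihood" comparison described above. A practical model would add smoothing so that unseen sequences do not collapse to zero.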
What are some common tools available for NER (Named Entity Recognition)?
Notable NER platforms include: GATE, which supports NER across many languages and domains out of the box and is usable via a graphical interface and a Java API; OpenNLP, which includes rule-based and statistical named-entity recognition; and spaCy, which features fast statistical NER as well as an open-source named-entity visualizer.
What is the difference between word2vec and GloVe?
Both word2vec and GloVe enable us to represent a word in the form of a vector (often called an embedding). They are the two most popular algorithms for word embeddings; both bring out the semantic similarity of words, capturing different facets of the meaning of a word. They are used in many NLP applications such as sentiment…
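To make "semantic similarity" between word vectors concrete, here is a minimal sketch using cosine similarity. The three-dimensional vectors below are made-up toy values, not real word2vec or GloVe output, and `cosine_similarity` is a helper written for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical embeddings; real ones typically have 50 to 300 dimensions.
embeddings = {
    "king": [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.0, 0.9],
}

sim_king_queen = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_king_apple = cosine_similarity(embeddings["king"], embeddings["apple"])
print(sim_king_queen, sim_king_apple)
```

With embeddings trained by either algorithm, related words are expected to score closer to 1 than unrelated ones, as the toy vectors are constructed to show.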
What are some knowledge graphs you know? What is the difference between them?
DBpedia: Entities and relationships are automatically extracted from Wikipedia. WordNet: Lexical database of the English language. Groups English words into synsets and provides various relationships between them. It is a knowledge base that tracks specific kinds of relationships like synonymy, antonymy, hyponymy and so on. http://wordnetcode.princeton.edu/5papers.pdf YAGO: Also extracts knowledge from…
What is the difference between paraphrasing and textual entailment?
Textual entailment is the process of determining whether a source text T implies a hypothesis text H. It is a unidirectional relationship. Example: text: If you help the needy, God will reward you. hypothesis: Giving money to a poor man has good consequences. Some techniques for textual entailment include lexical-similarity-based techniques to identify…
What is the state-of-the-art technique for machine translation?
Rule-based machine translation (older technique): Uses a dictionary between the words of the two languages, along with syntactic, semantic and morphological analysis of the source sentence, to define context. Linguistic rules are defined to translate a specific word in a given context into the target language. https://en.wikipedia.org/wiki/Rule-based_machine_translation Advantages of this approach: No requirement of parallel corpora…
How do you design a system that reads a natural language question and retrieves the closest FAQ answer?
There are multiple approaches for FAQ-based question answering. Keyword-based search (information retrieval approach): Tag each question with keywords. Extract keywords from the query and retrieve all relevant question-answer pairs. Easy to scale with appropriate indexes, such as an inverted (reverse) index. Lexical matching approach: word-level overlap between the query and the question. These approaches might be harder to…
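The lexical matching idea can be sketched as follows, assuming Jaccard overlap between word sets as the similarity measure. That choice, the toy FAQ entries, and the helper names `tokenize`, `jaccard` and `closest_faq` are assumptions made for this example.

```python
def tokenize(text):
    """Lowercase, drop question marks, and split into a set of words."""
    return set(text.lower().replace("?", " ").split())


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def closest_faq(query, faqs):
    """Return the (question, answer) pair whose question overlaps most."""
    query_tokens = tokenize(query)
    return max(faqs, key=lambda qa: jaccard(query_tokens, tokenize(qa[0])))


# Hypothetical FAQ entries, for illustration only.
faqs = [
    ("How do I reset my password?", "Use the 'Forgot password' link."),
    ("How do I delete my account?", "Contact support."),
]

best_question, best_answer = closest_faq("reset password", faqs)
print(best_question, "->", best_answer)
```

Pure word overlap misses paraphrases that share no words with the stored question, so this is only a baseline.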
How do you deal with dataset imbalance in a problem like spam filtering?
Class imbalance is a very common problem when applying ML algorithms. Spam filtering is one such application where class imbalance is apparent. There are many more non-spam emails in a typical inbox than spam emails. The following approaches can be used to address the class imbalance problem. Designing an asymmetric cost function where the cost…
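One possible reading of an asymmetric cost function is a class-weighted log loss. The sketch below assumes that errors on the rarer spam class are weighted more heavily; the weight values and the function name `weighted_log_loss` are hypothetical, and the post's own definition may differ.

```python
import math


def weighted_log_loss(labels, probs, spam_weight=5.0, ham_weight=1.0):
    """Mean log loss with a separate weight per class.

    labels: 1 for spam, 0 for non-spam. probs: predicted P(spam).
    The default weights are assumptions chosen for illustration.
    """
    eps = 1e-12
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        if y == 1:
            total += -spam_weight * math.log(p)
        else:
            total += -ham_weight * math.log(1 - p)
    return total / len(labels)


labels = [1, 0, 0, 0]  # one spam email among four

# Case A: the spam email is missed (low predicted spam probability).
missed_spam = weighted_log_loss(labels, [0.1, 0.1, 0.1, 0.1])
# Case B: the spam is caught, but one non-spam email is flagged.
flagged_ham = weighted_log_loss(labels, [0.9, 0.9, 0.1, 0.1])
print(missed_spam, flagged_ham)
```

With equal weights the two cases cost the same; with the asymmetric weights, missing the minority-class example is penalized more. Which direction to weight depends on the application, since flagging legitimate mail can also be costly.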
You have come up with a spam classifier. How do you measure accuracy?
Spam filtering is a classification problem. In a classification problem, the following are the common metrics used to measure efficacy: True positives: Those data points where the predicted outcome is spam and the document is actually spam. True negatives: Those data points where the predicted outcome is not spam and the document is actually not…
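The counts above can be computed directly from paired lists of actual and predicted labels. This is a minimal sketch: the label strings, the toy data and the function name `confusion_counts` are assumptions, and the false positive and false negative counts follow the standard definitions.

```python
def confusion_counts(actual, predicted, positive="spam"):
    """Return (tp, tn, fp, fn), treating `positive` as the spam class."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted spam, actually spam
        elif p != positive and a != positive:
            tn += 1  # predicted not spam, actually not spam
        elif p == positive:
            fp += 1  # predicted spam, actually not spam
        else:
            fn += 1  # predicted not spam, actually spam
    return tp, tn, fp, fn


# Hypothetical labels, for illustration only.
actual = ["spam", "ham", "ham", "spam", "ham", "ham"]
predicted = ["spam", "ham", "spam", "ham", "ham", "ham"]

tp, tn, fp, fn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(actual)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
print(tp, tn, fp, fn, accuracy, precision, recall)
```

On imbalanced data, accuracy can look high even when most spam is missed, so precision and recall are usually reported alongside it.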