MLNerds, Author at Ace the Data Science Interview!

You are given some documents and asked to find prevalent topics in the documents – how do you go about it?

Posted on February 14, 2019 by MLNerds

This is typically called topic modeling. Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents. For instance, two statements, one about meals and one about food, can probably be characterized by the same topic even though they do not necessarily use the same vocabulary. Topic models typically…

What is speaker segmentation in speech recognition? How do you use it?

Posted on February 14, 2019 by MLNerds

Speaker diarization or speaker segmentation is the process of automatically assigning a speaker identity to each segment of the audio file. Segmenting by speaker is very useful in several applications to understand who said what in a conversation. Typically speaker information is crucial for applications such as emotion detection, behavioural analysis or topic analysis of…

What is a language model? How do you create one? Why do you need one?

Posted on February 14, 2019 (updated February 16, 2019) by MLNerds

A language model is a probability distribution over sequences of words, P(w_1, …, w_m). It enables us to measure the relative likelihood of different phrases. Measuring the likelihood of a sequence of words is useful in many NLP tasks such as speech recognition, machine translation, POS tagging, parsing, and so on. Example: In any generative…
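
To make the idea of scoring word sequences concrete, here is a minimal sketch of a bigram language model with add-alpha smoothing. It approximates P(w_1, …, w_m) as a product of P(w_i | w_{i-1}). The toy corpus, the function names, and the smoothing choice are illustrative assumptions, not taken from the post.

```python
from collections import Counter

def train_bigram_model(sentences):
    """Count context unigrams and bigrams, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent word pairs
    return unigrams, bigrams

def sentence_probability(sentence, unigrams, bigrams, vocab_size, alpha=1.0):
    """Approximate P(w_1..w_m) as a product of smoothed P(w_i | w_{i-1})."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab_size)
    return prob

# Hypothetical toy corpus, for illustration only.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat chased the dog",
]
unigrams, bigrams = train_bigram_model(corpus)
vocab = {w for s in corpus for w in s.lower().split()} | {"</s>"}

p_fluent = sentence_probability("the cat sat on the rug", unigrams, bigrams, len(vocab))
p_scrambled = sentence_probability("rug the on sat cat the", unigrams, bigrams, len(vocab))
print(p_fluent > p_scrambled)
```

The word order seen in training scores higher than the scrambled one, which is the "relative likelihood of different phrases" that a speech recognizer or translation system can use to rank candidate outputs.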

What are some common tools available for Named Entity Recognition (NER)?

Posted on February 14, 2019 by MLNerds

Notable NER platforms include: GATE supports NER across many languages and domains out of the box, usable via a graphical interface and a Java API. OpenNLP includes rule-based and statistical named-entity recognition. spaCy features fast statistical NER as well as an open-source named-entity visualizer.

What is the difference between word2vec and GloVe?

Posted on February 14, 2019 (updated October 26, 2020) by MLNerds

Both word2vec and GloVe enable us to represent a word in the form of a vector (often called an embedding). They are the two most popular algorithms for word embeddings; both bring out the semantic similarity of words and capture different facets of a word's meaning. They are used in many NLP applications such as sentiment…
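
As a rough illustration of how a vector representation can expose semantic similarity, the sketch below builds raw co-occurrence count vectors from a toy corpus and compares them with cosine similarity. This is neither word2vec nor GloVe (both learn dense vectors rather than using raw counts); the corpus, window size, and function names are assumptions chosen only to show that words used in similar contexts end up with similar vectors.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a sparse vector of counts of its neighbouring words."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start, end = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Hypothetical toy corpus, for illustration only.
corpus = [
    "i drink hot coffee every morning",
    "i drink hot tea every morning",
    "the car needs new tires today",
]
vecs = cooccurrence_vectors(corpus)
sim_related = cosine(vecs["coffee"], vecs["tea"])
sim_unrelated = cosine(vecs["coffee"], vecs["tires"])
print(sim_related > sim_unrelated)
```

Here "coffee" and "tea" share all their context words, so their vectors point in the same direction, while "coffee" and "tires" share none.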

What are some knowledge graphs you know? What are the differences between them?

Posted on February 14, 2019 by MLNerds

DBpedia: Entities and relationships are automatically extracted from Wikipedia. WordNet: Lexical database of the English language. Groups English words into synsets and provides various relationships between words in a synset. It is a knowledge base that tracks specific kinds of relationships such as synonymy, antonymy, hyponymy and so on. http://wordnetcode.princeton.edu/5papers.pdf YAGO: Also extracts knowledge from…

What is the difference between paraphrasing and textual entailment?

Posted on February 14, 2019 by MLNerds

Textual entailment is the process of determining whether a source text T implies the hypothesis text H. It is a unidirectional relationship. Example: text: If you help the needy, God will reward you. hypothesis: Giving money to a poor man has good consequences. Some techniques for textual entailment include lexical-similarity-based techniques to identify…

What is the state-of-the-art technique for Machine Translation?

Posted on February 14, 2019 by MLNerds

Rule-based machine translation (older technique): Uses a dictionary between the words of the two languages along with syntactic, semantic, and morphological analysis of the source sentence to define context. Linguistic rules are defined to translate a specific word in a given context into the target language. https://en.wikipedia.org/wiki/Rule-based_machine_translation Advantages of this approach: No requirement of parallel corpora…

How do you design a system that reads a natural language question and retrieves the closest FAQ answer?

Posted on February 14, 2019 (updated February 21, 2019) by MLNerds

There are multiple approaches for FAQ-based question answering. Keyword-based search (information retrieval approach): Tag each question with keywords. Extract keywords from the query and retrieve all relevant question-answer pairs. Easy to scale with appropriate indexes (reverse indexing). Lexical matching approach: word-level overlap between the query and the question. These approaches might be harder to…
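
The keyword-based and lexical matching approaches above can be sketched together in a few lines: an inverted (reverse) index narrows the candidate questions, and word-level overlap ranks them. The stopword list, the Jaccard overlap score, the sample FAQ entries, and all function names here are assumptions for illustration, not the post's own design.

```python
import re

# Hypothetical minimal stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "do", "i", "my", "how", "what", "to", "can", "of", "for"}

def keywords(text):
    """Lowercase, tokenize, and drop stopwords."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}

def build_inverted_index(faq):
    """Map each keyword to the set of FAQ entries whose question contains it."""
    index = {}
    for i, (question, _answer) in enumerate(faq):
        for kw in keywords(question):
            index.setdefault(kw, set()).add(i)
    return index

def retrieve(query, faq, index):
    """Return (answer, score) for the FAQ question with the highest word overlap."""
    q = keywords(query)
    candidates = set().union(*(index.get(kw, set()) for kw in q))
    best, best_score = None, 0.0
    for i in sorted(candidates):
        fk = keywords(faq[i][0])
        score = len(q & fk) / len(q | fk)  # Jaccard overlap between query and question
        if score > best_score:
            best, best_score = i, score
    return (faq[best][1], best_score) if best is not None else (None, 0.0)

# Hypothetical FAQ entries, for illustration only.
faq = [
    ("How do I reset my password?", "Use the 'Forgot password' link on the login page."),
    ("How can I change my email address?", "Edit it under account settings."),
    ("What is the refund policy?", "Refunds are available within 30 days."),
]
index = build_inverted_index(faq)
answer, score = retrieve("I forgot my password, how to reset it?", faq, index)
print(answer, score)
```

Only questions that share at least one keyword with the query are scored, which is what keeps this approach cheap as the FAQ grows. A query with no shared keywords returns no answer, a limitation of purely lexical matching.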

How do you deal with dataset imbalance in a problem like spam filtering?

Posted on February 14, 2019 (updated April 4, 2019) by MLNerds

Class imbalance is a very common problem when applying ML algorithms. Spam filtering is one such application where class imbalance is apparent. There are many more non-spam emails in a typical inbox than spam emails. The following approaches can be used to address the class imbalance problem. Designing an asymmetric cost function where the cost…
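
One possible way to make a cost function asymmetric is to weight errors on each class differently. The sketch below uses a class-weighted log loss with inverse-frequency weights; the weighting scheme, the toy labels, and the function names are assumptions, and the post may go on to describe a different cost.

```python
import math

def weighted_log_loss(y_true, p_spam, w_spam, w_ham, eps=1e-12):
    """Log loss where errors on spam (label 1) and ham (label 0) cost different amounts."""
    total = 0.0
    for y, p in zip(y_true, p_spam):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -w_spam * math.log(p) if y == 1 else -w_ham * math.log(1 - p)
    return total / len(y_true)

def inverse_frequency_weights(y_true):
    """Weight each class by n / (2 * class_count), so the rarer class weighs more."""
    n = len(y_true)
    n_spam = sum(y_true)
    return n / (2 * n_spam), n / (2 * (n - n_spam))

# Hypothetical toy inbox: 9 non-spam emails and 1 spam email.
y_true = [0] * 9 + [1]
always_ham = [0.05] * 10  # a lazy model that predicts "not spam" for everything

w_spam, w_ham = inverse_frequency_weights(y_true)
plain = weighted_log_loss(y_true, always_ham, 1.0, 1.0)
weighted = weighted_log_loss(y_true, always_ham, w_spam, w_ham)
print(plain, weighted)
```

Under the unweighted loss the lazy model looks acceptable because it is right on the majority class; the weighted loss penalizes the missed spam email much more heavily, so training against it pushes the model to pay attention to the minority class.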
