How does the KNN algorithm work? What are the advantages and disadvantages of KNN?

The KNN algorithm is commonly used in many ML applications – from supervised settings such as classification and regression to simply retrieving similar items in applications such as recommendation systems, search, question answering and so on. What is the KNN algorithm? KNN for nearest neighbour search: The KNN algorithm involves retrieving the K datapoints that are…
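As a minimal sketch of the two uses mentioned above (retrieval and classification), the snippet below ranks points by distance to a query and takes a majority vote over the top K. It assumes Euclidean distance and a brute-force scan; the function names `knn_search` and `knn_classify` and the toy data are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def knn_search(points, query, k):
    """Return the k points closest to `query` (Euclidean distance assumed)."""
    # Brute force: sort every point by its distance to the query, keep the first k.
    ranked = sorted(points, key=lambda p: math.dist(p, query))
    return ranked[:k]


def knn_classify(points, labels, query, k):
    """Predict a label by majority vote among the k nearest points."""
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], query))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy data, made up for this example.
points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
labels = ["a", "a", "b", "b"]

nearest = knn_search(points, (0.9, 1.2), k=2)
predicted = knn_classify(points, labels, (5.5, 5.0), k=3)
```

The brute-force scan costs one distance computation per stored point for every query, which is one reason larger systems often switch to an index or an approximate nearest neighbour method.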

You are given some documents and asked to find prevalent topics in the documents – how do you go about it?

This is typically called topic modeling. Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents. For instance, two statements, one about meals and one about food, can probably be characterized by the same topic even though they do not necessarily use the same vocabulary. Topic models typically…

What is speaker segmentation in speech recognition? How do you use it?

Speaker diarization, or speaker segmentation, is the process of automatically assigning a speaker identity to each segment of an audio file. Segmenting by speaker is very useful in several applications for understanding who said what in a conversation. Typically, speaker information is crucial for applications such as emotion detection, behavioural analysis or topic analysis of…

What are some knowledge graphs you know? What are the differences between them?

DBpedia: Entities and relationships are automatically extracted from Wikipedia. WordNet: Lexical database of the English language. Groups English words into synsets and provides various relationships between words and between synsets. It is a knowledge base that tracks specific kinds of relationships such as synonymy, antonymy, hyponymy and so on. http://wordnetcode.princeton.edu/5papers.pdf Yago: Also extracts knowledge from…

How do you design a system that reads a natural language question and retrieves the closest FAQ answer?

There are multiple approaches for FAQ-based question answering. Keyword-based search (information retrieval approach): Tag each question with keywords. Extract keywords from the query and retrieve all relevant question-answer pairs. Easy to scale with appropriate indexes (inverted indexing). Lexical matching approach: word-level overlap between query and question. These approaches might be harder to…
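To make the lexical matching idea concrete, here is a small sketch that scores each stored FAQ question by word-level overlap with the query and returns the best match. Jaccard similarity is one possible overlap measure and is an assumption here, as are the helper names (`tokenize`, `jaccard`, `closest_faq`) and the sample FAQ entries.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Word-level overlap: |intersection| / |union| of two token sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def closest_faq(query, faq):
    """Return the (question, answer) pair whose question overlaps most with the query."""
    query_tokens = tokenize(query)
    best = max(faq, key=lambda question: jaccard(query_tokens, tokenize(question)))
    return best, faq[best]


# Hypothetical FAQ entries, made up for this example.
faq = {
    "How do I reset my password?": "Use the 'Forgot password' link on the login page.",
    "How do I delete my account?": "Go to Settings and choose 'Delete account'.",
}

question, answer = closest_faq("reset password", faq)
```

Because the score depends only on shared surface words, a query phrased with different vocabulary (for example "change my login credentials") would not match the password question under this measure.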

How do you deal with dataset imbalance in a problem like spam filtering?

Class imbalance is a very common problem when applying ML algorithms. Spam filtering is one such application where class imbalance is apparent. There are many more non-spam emails in a typical inbox than spam emails. The following approaches can be used to address the class imbalance problem. Designing an asymmetric cost function where the cost…
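One common way to realise an asymmetric cost is a class-weighted log loss, sketched below. This is only an illustration: the function name `weighted_log_loss`, the weights `w_pos` and `w_neg`, the choice of which class to penalise more heavily, and the sample labels and probabilities are all assumptions, not something the text above specifies.

```python
import math


def weighted_log_loss(y_true, p_pred, w_pos=1.0, w_neg=1.0):
    """Mean binary log loss with separate weights for positive and negative examples.

    y_true: 0/1 labels (1 = spam is assumed here); p_pred: predicted P(label == 1).
    """
    eps = 1e-12  # clip probabilities to avoid log(0)
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        if y == 1:
            total += -w_pos * math.log(p)        # cost of under-predicting the positive class
        else:
            total += -w_neg * math.log(1.0 - p)  # cost of over-predicting the positive class
    return total / len(y_true)


# Made-up example: one spam message among four, scored with a low spam probability.
y_true = [1, 0, 0, 0]
p_pred = [0.2, 0.1, 0.1, 0.1]

symmetric = weighted_log_loss(y_true, p_pred)
asymmetric = weighted_log_loss(y_true, p_pred, w_pos=3.0)  # errors on the rare class cost 3x
```

With `w_pos` raised, the same missed spam message contributes more to the loss, so a model trained on this objective is pushed harder to get the minority class right.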

What are the different ways of representing documents?

Bag of words: Commonly called BOW, this involves creating a vocabulary of words and representing the document as a count vector whose dimension equals the vocabulary size, with each dimension representing the number of times a specific word occurred in the document. Sometimes, TF-IDF is used to reduce the number of dimensions/features by…
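The count-vector construction described above can be sketched in a few lines. The tokenizer (lowercased alphanumeric words), the helper names `build_vocabulary` and `bow_vector`, and the two sample documents are assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(docs):
    """Collect every distinct word across the documents, in sorted order."""
    return sorted({token for doc in docs for token in tokenize(doc)})


def bow_vector(doc, vocab):
    """Represent a document as word counts, one dimension per vocabulary word."""
    counts = Counter(tokenize(doc))
    return [counts[word] for word in vocab]


# Toy documents, made up for this example.
docs = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocabulary(docs)
vectors = [bow_vector(doc, vocab) for doc in docs]
```

Each vector has as many entries as the vocabulary has words, so most entries are zero for realistic vocabularies, and word order within the document is discarded.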