There are multiple approaches for FAQ-based question answering. Keyword-based search (information retrieval approach): tag each question with keywords, extract keywords from the query, and retrieve all relevant question-answer pairs. This is easy to scale with appropriate indexes, such as an inverted (reverse) index. Lexical matching approach: word-level overlap between query and question. These approaches might be harder to…
Category: Natural Language Processing
How do you deal with dataset imbalance in a problem like spam filtering?
Class imbalance is a very common problem when applying ML algorithms. Spam filtering is one such application where class imbalance is apparent: there are many more non-spam emails in a typical inbox than spam emails. The following approaches can be used to address the class imbalance problem. Designing an asymmetric cost function where the cost…
You have come up with a spam classifier. How do you measure accuracy?
Spam filtering is a classification problem. In a classification problem, the following are the common metrics used to measure efficacy. True positives: those data points where the predicted outcome is spam and the document is actually spam. True negatives: those data points where the predicted outcome is not spam and the document is actually not…
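As a rough illustration of the counts described above, here is a minimal sketch that tallies true/false positives and negatives from two lists of labels. The function names, the `"spam"`/`"ham"` label strings, and the precision/recall helper are assumptions made for this example, not part of the original post.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Count TP, FP, TN, FN, treating `positive` as the spam class."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1  # predicted spam, actually spam
        elif predicted == positive:
            fp += 1  # predicted spam, actually not spam
        elif actual == positive:
            fn += 1  # predicted not spam, actually spam
        else:
            tn += 1  # predicted not spam, actually not spam
    return tp, fp, tn, fn


def precision_recall(tp, fp, fn):
    """Standard derived metrics; return 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels for five emails.
y_true = ["spam", "ham", "spam", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "ham", "ham"]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
precision, recall = precision_recall(tp, fp, fn)
```

With imbalanced data such as spam, these per-class counts are usually more informative than a single overall accuracy figure.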
What is the difference between translation and transliteration?
Transliteration is the process of converting a word written in one language into another language, phoneme by phoneme. Enabling transliteration for your search engine allows your site visitors to type a query phonetically in one language and have that query appear in another language. Translation helps convert text in one language to text in another…
What are the different ways of representing documents?
Bag of words: Commonly called BOW, this involves creating a vocabulary of words and representing the document as a count vector whose dimension equals the vocabulary size, with each dimension representing the number of times a specific word occurred in the document. Sometimes, TF-IDF is used to reduce the number of dimensions/features by…
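The bag-of-words count vector described above can be sketched in a few lines. This is a minimal illustration that assumes whitespace tokenization and lowercasing; `build_vocab` and `bow_vector` are hypothetical helper names chosen for the example.

```python
from collections import Counter


def build_vocab(docs):
    """Map each distinct word to a fixed index (sorted for reproducibility)."""
    words = sorted({w for doc in docs for w in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}


def bow_vector(doc, vocab):
    """Return a count vector with one dimension per vocabulary word."""
    vector = [0] * len(vocab)
    for word, count in Counter(doc.lower().split()).items():
        if word in vocab:  # words outside the vocabulary are ignored
            vector[vocab[word]] = count
    return vector


docs = ["the cat sat", "the cat saw the dog"]
vocab = build_vocab(docs)  # indices: cat=0, dog=1, sat=2, saw=3, the=4
vec = bow_vector(docs[1], vocab)  # [1, 1, 0, 1, 2]
```

The vector length grows with the vocabulary, which is why this representation becomes very high dimensional on real corpora.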
What are popular ways of dimensionality reduction in NLP tasks? Do you think this is even important?
A common representation is bag of words, which is very high dimensional given a large vocabulary. Commonly used ways for dimensionality reduction in NLP: TF-IDF: term frequency, inverse document frequency. Word2Vec / GloVe: these have become very popular recently. They are obtained by leveraging word co-occurrence, through an encoder –…
You are building a natural language search box for a website. How do you accommodate spelling errors?
If you have a dictionary of words, edit distance is the simplest way of handling spelling errors. However, sometimes corrections based on context make sense. For instance, suppose I type “bed color shoes”: these are all valid dictionary words, but a sensible model would come up with “red color shoes”. Using the language model to come…
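To make the dictionary-plus-edit-distance idea concrete, here is a minimal sketch using Levenshtein distance (insertions, deletions, and substitutions each cost 1). The choice of Levenshtein distance, the `suggest` helper, and the tiny dictionary are assumptions for illustration; this handles only the non-contextual case, so it would not turn “bed” into “red”.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, char_a in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # delete char_a
                curr[j - 1] + 1,     # insert char_b
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def suggest(word, dictionary):
    """Return the dictionary word closest to `word` (first one wins ties)."""
    return min(dictionary, key=lambda candidate: edit_distance(word, candidate))


# Hypothetical dictionary for a shoe store search box.
dictionary = ["shoes", "shirt", "color"]
correction = suggest("shoez", dictionary)  # "shoes"
```

A linear scan like this is fine for a small dictionary; a large vocabulary would typically need an index over candidates.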
What are knowledge graphs? When would you need a knowledge graph over say a database to store information?
A knowledge graph organizes real-world knowledge as entities and relationships between entities. Creating a knowledge graph often involves scraping / ingesting unstructured data and creating structure out of it by extracting entities and relationships automatically. Examples of knowledge graphs are WordNet, DBpedia, etc. A database is also a knowledge graph in some sense. Since…
What is shallow parsing?
Typically we have a generative grammar that tells us how a sentence is generated from a set of rules. Parsing is the process of finding a parse tree that is consistent with the grammar rules – in other words, we want to find the set of grammar rules and their sequence that generated the sentence….
What are the advantages and disadvantages of using Naive Bayes for spam detection?
Disadvantages: Naive Bayes is based on the assumption that features are conditionally independent, an assumption that is not valid in many real-world scenarios. Hence it sometimes oversimplifies the problem by treating features as independent and gives subpar performance. Advantages: However, Naive Bayes is very efficient. It is a model you can train in…