NLP Archives - Page 2 of 3 - Ace the Data Science Interview!

You have come up with a spam classifier. How do you measure accuracy?

Posted on February 14, 2019 by MLNerds

Spam filtering is a classification problem. In a classification problem, the following metrics are commonly used to measure efficacy. True positives: those data points where the predicted outcome is spam and the document is actually spam. True negatives: those data points where the predicted outcome is not spam and the document is actually not…
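A minimal sketch of how these counts can be tallied, assuming labels are the strings "spam" and "ham". The function names and the toy data are hypothetical, and precision, recall and accuracy are the standard metrics derived from the four counts rather than something the excerpt above spells out.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Count true positives, true negatives, false positives, false negatives."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted spam, actually spam
            else:
                fp += 1  # predicted spam, actually not spam
        else:
            if actual == positive:
                fn += 1  # predicted not spam, actually spam
            else:
                tn += 1  # predicted not spam, actually not spam
    return tp, tn, fp, fn


def precision_recall_accuracy(tp, tn, fp, fn):
    """Derive the usual summary metrics; return 0.0 when a denominator is zero."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return precision, recall, accuracy


# Toy labels, made up for illustration only.
y_true = ["spam", "spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "ham", "spam", "spam", "ham"]

counts = confusion_counts(y_true, y_pred)
precision, recall, accuracy = precision_recall_accuracy(*counts)
print(counts, precision, recall, accuracy)
```

For a spam filter, precision is often watched closely, since a false positive means a legitimate message lands in the spam folder.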

What is the difference between translation and transliteration?

Posted on February 14, 2019 by MLNerds

Transliteration is the process of converting a word written in one language into another language, phoneme by phoneme. Enabling transliteration for your search engine allows your site visitors to type a query phonetically in one language and have that query appear in another language. Translation helps convert text in one language to text in another…

What are the different ways of representing documents?

Posted on February 14, 2019 by MLNerds

Bag of words: Commonly called BOW, this involves creating a vocabulary of words and representing the document as a count vector whose dimension equals the vocabulary size, each dimension representing the number of times a specific word occurred in the document. Sometimes, TF-IDF is used to reduce the number of dimensions/features by…
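The count-vector idea can be sketched as follows. This is a minimal illustration that assumes lowercasing and whitespace tokenization; the helper names and the two toy documents are hypothetical.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct word to a fixed vector index (sorted for determinism)."""
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}


def to_count_vector(document, vocabulary):
    """Represent a document as per-word counts, one dimension per vocabulary word."""
    counts = Counter(document.lower().split())
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:  # words outside the vocabulary are ignored
            vector[vocabulary[word]] = count
    return vector


docs = ["the cat sat on the mat", "the dog sat"]
vocabulary = build_vocabulary(docs)
vectors = [to_count_vector(doc, vocabulary) for doc in docs]
print(vocabulary)
print(vectors)
```

Each vector has as many entries as the vocabulary has words, which is why this representation becomes very high dimensional on real corpora.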

What are popular ways of dimensionality reduction in NLP tasks? Do you think this is even important?

Posted on February 14, 2019 by MLNerds

A common representation is bag of words, which is very high dimensional given a large vocabulary. Commonly used ways of dimensionality reduction in NLP: TF-IDF: term frequency, inverse document frequency. Word2Vec / GloVe: these have become very popular recently. They are obtained by leveraging word co-occurrence, through an encoder –…

You are building a natural language search box for a website. How do you accommodate spelling errors?

Posted on February 14, 2019 by MLNerds

If you have a dictionary of words, edit distance is the simplest way of incorporating this. However, sometimes corrections based on context make sense. For instance, suppose I type “bed color shoes” – these are perfect dictionary words, but a sensible model would come up with “red color shoes”. Using the language model to come…
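As a rough sketch of the dictionary-plus-edit-distance approach, the code below computes Levenshtein distance with dynamic programming and returns the closest dictionary word. The `suggest` helper, its `max_distance` threshold and the tiny dictionary are assumptions made for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def suggest(word, dictionary, max_distance=2):
    """Return the closest dictionary word, or None if nothing is close enough."""
    if not dictionary:
        return None
    # Ties are broken alphabetically so the result is deterministic.
    best = min(dictionary, key=lambda candidate: (edit_distance(word, candidate), candidate))
    return best if edit_distance(word, best) <= max_distance else None


dictionary = ["bed", "red", "color", "shoes", "shirt"]
print(suggest("shose", dictionary))  # misspelling, corrected via the dictionary
print(suggest("bed", dictionary))    # already a valid word, so it is left alone
```

The second call shows the limitation described above: "bed" is a valid dictionary word, so edit distance alone never proposes "red". Catching that needs context, for example from a language model.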

Explain Latent Dirichlet Allocation – where is it typically used?

Posted on February 10, 2019, updated February 14, 2019, by MLNerds

Latent Dirichlet Allocation is a probabilistic model that models a document as a multinomial mixture of topics and each topic as a multinomial mixture of words. Each of these multinomials has a Dirichlet prior. The goal is to learn these multinomial proportions using probabilistic inference techniques based on the observed data, which is the words/content…
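To make the generative story concrete, here is a small sketch that samples one document from the model just described. It only illustrates the forward (generative) direction, not the inference step that learns the proportions from data. The vocabulary, the number of topics and the Dirichlet parameters are made-up assumptions, and the helper names are hypothetical.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet by normalizing independent Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
    total = sum(draws)
    return [draw / total for draw in draws]


def sample_index(probabilities, rng):
    """Draw one index from a categorical (single-trial multinomial) distribution."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]


def generate_document(topic_word, doc_alpha, length, vocabulary, rng):
    """Sample topic proportions for a document, then a topic and a word per position."""
    theta = sample_dirichlet(doc_alpha, rng)  # document's mixture over topics
    words = []
    for _ in range(length):
        topic = sample_index(theta, rng)                   # pick a topic
        word_index = sample_index(topic_word[topic], rng)  # pick a word from that topic
        words.append(vocabulary[word_index])
    return theta, words


rng = random.Random(0)  # fixed seed so repeated runs give the same sample
vocabulary = ["goal", "match", "team", "vote", "party", "election"]
num_topics = 2

# Each topic is a distribution over the vocabulary, drawn from a Dirichlet prior.
topic_word = [sample_dirichlet([0.5] * len(vocabulary), rng) for _ in range(num_topics)]
theta, words = generate_document(topic_word, [0.5] * num_topics, 8, vocabulary, rng)
print(theta)
print(words)
```

Inference runs this story in reverse: given only the words, it estimates the per-document topic proportions and the per-topic word distributions.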

You are trying to cluster documents using a bag-of-words method. Typically, words like if, of, is and so on are not great features. How do you make sure you are leveraging the more informative words better during feature engineering?

Posted on February 9, 2019 by MLNerds

Words like if, of, … are called stop words. Typical pre-processing in a standard NLP pipeline involves identifying and removing stop words (except in some cases where context/word adjacency information is important). Common techniques to remove stop words include: TF-IDF – term frequency, inverse document frequency. Leveraging manually curated stop word lists and eliminating…
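Both techniques named above can be sketched briefly. The stop word list below is a tiny illustrative set, not a real curated list, and the IDF formula used is the plain log(N / document frequency) variant, which is an assumption since several variants exist. Note that IDF down-weights words that appear everywhere rather than deleting them outright.

```python
import math

# Tiny illustrative stop word list; real curated lists are much longer.
STOP_WORDS = {"if", "of", "is", "the", "a"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop word list."""
    return [token for token in tokens if token not in stop_words]


def inverse_document_frequency(tokenized_docs):
    """IDF(term) = log(N / number of documents containing the term)."""
    n_docs = len(tokenized_docs)
    doc_freq = {}
    for tokens in tokenized_docs:
        for term in set(tokens):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


docs = [
    "the price of gold is rising".split(),
    "the history of jazz is long".split(),
    "the rules of chess".split(),
]

idf = inverse_document_frequency(docs)
print(idf["of"], idf["is"], idf["gold"])
print(remove_stop_words(docs[2]))
```

Words that occur in every document get an IDF of zero, so they contribute nothing to a TF-IDF weighted vector, while rarer words such as "gold" get the largest weights.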

Which is better to use while extracting features: character n-grams or word n-grams? Why?

Posted on February 9, 2019 by MLNerds

Both have their uses. Character n-grams are great where character-level information is important, for example spelling correction, language identification, writer identification (i.e. fingerprinting) and anomaly detection. Word n-grams are more appropriate for tasks that depend on word co-occurrence, for instance machine translation, spam detection and so on. Character level n-grams are much more efficient. However…
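The two feature types can be compared with a short sketch. The helper names are hypothetical, and word tokenization is assumed to be a plain whitespace split.

```python
def char_ngrams(text, n):
    """All contiguous character sequences of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """All contiguous word sequences of length n, as tuples."""
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(char_ngrams("color", 3))
print(word_ngrams("red color shoes", 2))

# Spelling variants still share character trigrams, which is why
# character n-grams suit tasks such as spelling correction.
shared = set(char_ngrams("color", 3)) & set(char_ngrams("colour", 3))
print(sorted(shared))
```

As whole words, "color" and "colour" are unrelated features, but their character trigrams overlap, so a character-level model can still relate them.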

Can you find the antonyms of a word given a large enough corpus? For example, black => white or rich => poor. If yes, then how? Otherwise, justify your answer.

Posted on February 9, 2019, updated March 10, 2019, by MLNerds

Pre-existing databases: There are several curated antonym databases, such as WordNet, in which you can directly look up the antonyms of a given word. Hearst patterns: Given some seed antonym pairs, one can find patterns in text that show how known antonyms tend to occur. X, not Y : “It…
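The "X, not Y" pattern mentioned above can be turned into a simple candidate extractor. This is only a sketch: the regular expression, the function name and the example sentences are assumptions for illustration, and the output is a list of candidates that still needs filtering, since many "X, not Y" pairs are not antonyms.

```python
import re

# Matches "<word>, not <word>", e.g. "hot, not cold".
X_NOT_Y = re.compile(r"\b(\w+), not (\w+)\b")


def candidate_antonym_pairs(sentences):
    """Collect (X, Y) pairs from every 'X, not Y' occurrence in the sentences."""
    pairs = []
    for sentence in sentences:
        pairs.extend(X_NOT_Y.findall(sentence.lower()))
    return pairs


# Made-up example sentences.
sentences = [
    "The room was hot, not cold.",
    "He is rich, not poor, and he knows it.",
    "I ordered tea, not coffee.",
]
pairs = candidate_antonym_pairs(sentences)
print(pairs)
```

The last pair shows why the output is only a candidate list: "tea" and "coffee" fit the pattern but are not antonyms, so counting how often a pair recurs across a large corpus, or checking it against seed pairs, is a natural next step.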

How can you increase the recall of search query results (on a search engine or e-commerce site) without changing the algorithm?

Posted on February 9, 2019, updated March 9, 2019, by MLNerds

Since we are not allowed to change the algorithm, we can only modify or augment the search query. (Note: we can change either the algorithm/model or the data. Here we can only change the data, in other words modify the search query.) Modifying the query in a way that we get results relevant to…
