What is the difference between translation and transliteration

Transliteration is the process of converting a word written in one language into another language, phoneme by phoneme. Enabling transliteration for your search engine allows your site visitors to type a query phonetically in one language and have that query appear in another language. Translation helps convert text in one language to text in another…

What are the different ways of representing documents ?

Bag of words: Commonly called BOW involves creating a vocabulary of words and representing the document as a count vector, dimension equivalent to the vocabulary size – each dimension representing the number of times a specific word occured in the document. Sometimes, TF-IDF is used to reduce the dimensionality of the number of dimensions/features by…

What are popular ways of dimensionality reduction in NLP tasks ? Do you think this is even important ?

Common representation is bag of words that is very high dimensional given high vocab size. Commonly used ways for dimensionality reduction in NLP : TF-IDF : Term frequency, inverse document frequency (link to relevant article) Word2Vec / Glove : These are very popular recently. They are obtained  by leveraging word co-occurrence, through an encoder –…

You are building a natural language search box for a website. How do you accommodate spelling errors?

If you have a dictionary of words, edit distance is the simplest way of incorporating this. However, sometimes corrections based on context make sense. For instance, suppose I type “bed color shoes” – these are perfect dictionary words, but a sensible model would come up with “red color shoes”. Using the language model to come…

Explain latent dirichlet allocation – where is it typically used ?

Latent Dirichlet Allocation is a probabilistic model that models a document as a multinomial mixture of topics and the topics as a multinomial mixture of words. Each of these multinomials have a dirichlet prior. The goal is to learn these multinomial proportions using probabilistic inference techniques based on the observed data which is the words/content…

You are trying to cluster documents using a Bag of Words method. Typically words like if, of, is and so on are not great features. How do you make sure you are leveraging the more informative words better during the feature Engineering?

 Words like if, of, … are called stop words. Typical pre-processing in standard NLP pipeline involves identifying and removing stop-words (except in some cases where context/ word adjacency information is important). Common techniques to remove stop words include :   TF-IDF – Term frequency inverse document frequency Leveraging manually curated stop word lists and eliminating…

Which is better to use while extracting features character n-grams or word n-grams? Why?

Both have their uses. Character n-grams are great where character level information is important : Example:  spelling correction,  language identification, writer identification (i.e. fingerprinting), anomaly detection. While word n-grams are more appropriate for tasks that understand word co-occurance, for instance machine translation, spam detection and so on. Character level n-grams are much more efficient. However…

Can you find the antonyms of a word given a large enough corpus? For ex. Black => white or rich => poor etc. If yes then how, otherwise justify your answer.

Pre existing Databases: There are several curated antonym databases such as wordpress and so on from which you can directly check if you can get antonyms of a given word. Hearst Patterns: Given some seed antonym pairs of words,  one can find patterns in text, how known antonyms tend to occur. X, not Y : “It…

How can you increase the recall of a search query (on search engine or e-commerce site) result without changing the algorithm ?

Since we are not allowed to change the algorithm, we can only play with modifying or augmenting the search query. (Note, we either change the algorithm/model or the data, here we can only change the data, in other words modifying the search query.) Modifying the query in a way that we get results relevant to…