MLNerds

What is the difference between word2Vec and Glove ?

Posted on February 14, 2019October 26, 2020 by MLNerds

Both word2vec and glove enable us to represent a word in the form of a vector (often called embedding). They are the two most popular algorithms for word embeddings that bring out the semantic similarity of words that captures different facets of the meaning of a word. They are used in many NLP applications such as sentiment…

What are some knowledge graphs you know. What is different between these ?

Posted on February 14, 2019February 14, 2019 by MLNerds

DBPedia : Entities and relationships are automatically extracted from wikipedia. Wordnet: Lexical database of english language. Groups english words as synsets and provides various relationships between words in a synset. It is a knowledge base that tracks specific kinds of relationships like synonym, antonym, hyponymy and so on. http://wordnetcode.princeton.edu/5papers.pdf Yago : Also extracts knowledge from…

What is the difference between paraphrasing and textual entailment ?

Posted on February 14, 2019 by MLNerds

Textual entailment is the process of determining if a source T implies the hypothesis text H. Example :It is a unidirectional relationship : text: If you help the needy, God will reward you. hypothesis: Giving money to a poor man has good consequences. Some techniques for textual entailment include lexical similarity based techniques to identify…

What is the state of the art technique for Machine Translation ?

Posted on February 14, 2019 by MLNerds

Rule based machine translation (Older techniques) : Uses dictionary between words of the two languages along with syntactic, semantic morphological analysis of the source sentence to define context. Linguistic Rules are defined to translate a specific word in a given context into target language. https://en.wikipedia.org/wiki/Rule-based_machine_translation Advantages of this approach : No requirement of parallel corpora…

How do you design a system that reads a natural language question and retrieves the closest FAQ answer?

Posted on February 14, 2019February 21, 2019 by MLNerds

There are multiple approaches for FAQ based question answering Keyword based search (Information retrieval approach): Tag each question with keywords. Extract keywords from query and retrieve all relevant questions answers. Easy to scale with appropriate indexes reverse indexing. Lexical matching approach : word level overlap between query and question. These approaches might be harder to…

How do you deal with dataset imbalance in a problem like spam filtering ?

Posted on February 14, 2019April 4, 2019 by MLNerds

Class imbalance is a very common problem when applying ML algorithms. Spam filtering is one such application where class imbalance is apparent. There are many more non-spam emails in a typical inbox than spam emails. The following approaches can be used to address the class imbalance problem. Designing an Assymetric cost function where the cost…

You have come up with a Spam classifier. How do you measure accuracy ?

Posted on February 14, 2019 by MLNerds

Spam filtering is a classification problem. In a classification problem, the following are the common metrics used to measure efficacy : True positives : Those data points where the outcome is spam and the document is actually spam. True Negatives: Those data points where the outcome is not spam and the document is actually not…

What is the difference between translation and transliteration

Posted on February 14, 2019 by MLNerds

Transliteration is the process of converting a word written in one language into another language, phoneme by phoneme. Enabling transliteration for your search engine allows your site visitors to type a query phonetically in one language and have that query appear in another language. Translation helps convert text in one language to text in another…

What are the different ways of representing documents ?

Posted on February 14, 2019 by MLNerds

Bag of words: Commonly called BOW involves creating a vocabulary of words and representing the document as a count vector, dimension equivalent to the vocabulary size – each dimension representing the number of times a specific word occured in the document. Sometimes, TF-IDF is used to reduce the dimensionality of the number of dimensions/features by…

What are popular ways of dimensionality reduction in NLP tasks ? Do you think this is even important ?

Posted on February 14, 2019February 14, 2019 by MLNerds

Common representation is bag of words that is very high dimensional given high vocab size. Commonly used ways for dimensionality reduction in NLP : TF-IDF : Term frequency, inverse document frequency (link to relevant article) Word2Vec / Glove : These are very popular recently. They are obtained by leveraging word co-occurrence, through an encoder –…

← Newer posts Older posts →