# Category: Deep Learning

## Skip or Residual Connections in Deep Networks

## GPT Model

## The BERT Score – Evaluating Text Generation

This video covers the BERTScore evaluation metric: why it is needed over existing metrics such as the BLEU score, and how it is computed and evaluated. Traditional metrics look for exact text matches. BERTScore instead measures semantic similarity using contextual embeddings of the words in the candidate and reference sentences.
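As a rough illustration of the matching step, the sketch below computes precision, recall and F1 from token embeddings by greedy cosine-similarity matching. It assumes the contextual embeddings have already been produced by a BERT-style model; the function name `bertscore_from_embeddings` and the toy 3-dimensional vectors are hypothetical, and refinements such as idf weighting and baseline rescaling are left out.

```python
import numpy as np

def bertscore_from_embeddings(cand_emb, ref_emb):
    """Greedy-matching precision/recall/F1 over token embeddings.

    cand_emb: (m, d) array, one row per candidate token.
    ref_emb:  (n, d) array, one row per reference token.
    """
    # Normalize rows so that dot products are cosine similarities.
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = cand @ ref.T  # (m, n) pairwise cosine similarities

    # Each candidate token is matched to its most similar reference token.
    precision = sim.max(axis=1).mean()
    # Each reference token is matched to its most similar candidate token.
    recall = sim.max(axis=0).mean()
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy embeddings (made up for illustration, not real BERT outputs).
ref_emb = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0]])
cand_emb = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])

p, r, f = bertscore_from_embeddings(cand_emb, ref_emb)
print(p, r, f)
```

In this toy case every candidate token has a perfect match in the reference, so precision is high, while one reference token has no counterpart in the candidate, which lowers recall.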

## BERT Model

## Batch vs Mini-Batch vs Stochastic Gradient Descent

## Normalization in Deep Neural Networks

Batch norm and layer norm are common normalization techniques. This brief video covers why normalization is needed and the types of normalization used in deep neural networks.
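The difference between the two techniques comes down to the axis over which statistics are computed. The following is a minimal sketch, assuming a 2-D input of shape `(batch, features)` and omitting the learnable scale and shift parameters and the running statistics that practical implementations keep; the function names and input values are illustrative.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Normalize each feature (column) using statistics over the batch.
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def layer_norm(x, eps=1e-5):
    # Normalize each sample (row) using statistics over its features.
    mean = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

# Four samples, three features on very different scales (made-up values).
x = np.array([[1.0, 200.0, 3.0],
              [2.0, 400.0, 6.0],
              [3.0, 600.0, 9.0],
              [4.0, 800.0, 12.0]])

bn = batch_norm(x)
ln = layer_norm(x)
print(bn.mean(axis=0))  # per-feature means, close to zero
print(ln.mean(axis=1))  # per-sample means, close to zero
```

Batch norm depends on the other samples in the batch, whereas layer norm treats each sample independently, which is one reason layer norm is common when batch sizes are small or variable.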

## When are deep learning algorithms more appropriate compared to traditional machine learning algorithms?

Deep learning algorithms can learn arbitrarily complex non-linear functions, given a network that is deep enough and wide enough and uses an appropriate non-linear activation function. Traditional ML algorithms often require feature engineering to find the subset of meaningful features to use. Deep learning algorithms often avoid the need for the feature engineering step….

## Why do you typically see overflow and underflow when implementing ML algorithms?

A common pre-processing step is to normalize or rescale inputs so that their values are neither too large nor too small. However, even with normalized inputs, overflow and underflow can occur.

Underflow: Computing a joint probability often involves multiplying many small individual probabilities. Many probabilistic algorithms multiply the probabilities of individual data points, which leads to underflow. Example: Suppose you…
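A small illustration of the underflow case, using made-up values: multiplying 2000 probabilities of 0.01 gives a result far below the smallest positive double-precision number, so the product collapses to zero. Summing log-probabilities instead is a common workaround and keeps the quantity finite.

```python
import math

# Hypothetical per-data-point probabilities (values chosen for illustration).
probs = [0.01] * 2000

# Naive joint probability: the true value is 1e-4000, which cannot be
# represented as a 64-bit float, so the running product underflows to 0.0.
product = 1.0
for p in probs:
    product *= p

# Working in log space: the sum of logs stays comfortably representable.
log_likelihood = sum(math.log(p) for p in probs)

print(product)
print(log_likelihood)
```

Comparing models by log-likelihood rather than raw likelihood preserves the ordering, because the logarithm is monotonic.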

## Suppose you build word vectors (embeddings) with each word vector having dimensions as the vocabulary size (V) and feature values as pPMI between corresponding words: What are the problems with this approach and how can you resolve them?

Problems: As the vocabulary size (V) is large, these vectors will be large. They will also be sparse, since a given word will not have co-occurred with most other words.

Resolution: Dimensionality reduction using approaches like Singular Value Decomposition (SVD) of the term document matrix to get a K-dimensional approximation. Other matrix factorisation techniques…
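The idea can be sketched on a toy example. The code below builds positive PMI values from a small, made-up word-by-word co-occurrence count matrix and then applies a truncated SVD to obtain K-dimensional dense vectors. Applying SVD directly to the PMI matrix, the helper names `ppmi_matrix` and `reduce_with_svd`, and the choice of K = 2 are assumptions made for illustration.

```python
import numpy as np

def ppmi_matrix(counts):
    """Positive PMI from a (V, V) co-occurrence count matrix."""
    total = counts.sum()
    p_wc = counts / total                    # joint probabilities
    p_w = p_wc.sum(axis=1, keepdims=True)    # row marginals
    p_c = p_wc.sum(axis=0, keepdims=True)    # column marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0             # zero counts give -inf; treat as 0
    return np.maximum(pmi, 0.0)              # keep only positive associations

def reduce_with_svd(matrix, k):
    """Rank-k dense word vectors from a truncated SVD."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]

# Toy symmetric co-occurrence counts for a 4-word vocabulary (made up).
counts = np.array([[0, 4, 1, 0],
                   [4, 0, 0, 1],
                   [1, 0, 0, 5],
                   [0, 1, 5, 0]], dtype=float)

ppmi = ppmi_matrix(counts)
dense = reduce_with_svd(ppmi, k=2)
print(ppmi.shape, dense.shape)
```

With a realistic vocabulary the PMI matrix would be stored in a sparse format and factorised with a sparse or randomized SVD routine rather than a dense one.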