Let us calculate the number of parameters for a bi-gram HMM. Let N be the total number of states, V be the vocabulary size, and T be the length of the sequence. Before directly estimating the number of parameters, let us first try to see what the different probabilities, or rather probability matrices, are…
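As a sketch of the count the excerpt is working toward: with N states and a vocabulary of size V, a first-order HMM has an initial distribution, a transition matrix, and an emission matrix, each row constrained to sum to 1 (the numbers below are illustrative, not from the original answer).

```python
# Sketch: free-parameter count for a first-order (bi-gram) HMM.
# Assumed notation: N = number of hidden states, V = vocabulary size.
def hmm_param_count(N, V):
    initial = N - 1           # pi: N entries that sum to 1 -> N-1 free
    transition = N * (N - 1)  # A: N rows of N entries, each row sums to 1
    emission = N * (V - 1)    # B: N rows over V symbols, each row sums to 1
    return initial + transition + emission

print(hmm_param_count(4, 10))  # 3 + 12 + 36 = 51
```

Note the sequence length T does not appear: the same matrices are reused at every time step.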
How do you generate text using a Hidden Markov Model (HMM)?
The HMM is a latent variable model where the observed sequence of variables is assumed to be generated from a set of temporally connected latent variables. The joint distribution of the observed variables (the data) and the latent variables can be written as: P(x_1, …, x_T, z_1, …, z_T) = P(z_1) P(x_1 | z_1) ∏_{t=2}^{T} P(z_t | z_{t−1}) P(x_t | z_t). One possible interpretation of the latent variables in…
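Generation then proceeds by ancestral sampling: draw the first state from P(z_1), emit a word from P(x_t | z_t), and transition via P(z_t | z_{t−1}). A minimal sketch, with made-up states, vocabulary, and probabilities:

```python
import random

# Illustrative HMM: 2 hypothetical states, 4-word vocabulary.
states = ["NOUNish", "VERBish"]
vocab = ["dog", "runs", "cat", "sleeps"]
pi = [0.8, 0.2]                                    # P(z_1)
A = [[0.1, 0.9], [0.9, 0.1]]                       # P(z_t | z_{t-1})
B = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]   # P(x_t | z_t)

def sample(dist):
    return random.choices(range(len(dist)), weights=dist)[0]

def generate(T):
    z = sample(pi)                       # draw the first hidden state
    words = []
    for _ in range(T):
        words.append(vocab[sample(B[z])])  # emit a word given the state
        z = sample(A[z])                   # transition to the next state
    return words

random.seed(0)
print(" ".join(generate(5)))
```

Each run yields a different sequence; the transition matrix above makes the states tend to alternate, so the output loosely alternates noun-like and verb-like words.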
If the average length of a sentence is 100 in all documents, should we build a 100-gram language model?
A 100-gram model will be more complex and will have a lot of parameters. One approach is to start with an n-gram model and try different values of n, from 2 up to 10 in the worst case. After some value of n, say n = 7, the accuracy of the model becomes almost stagnant. One reason for this could be that…
How do you measure the performance of a language model?
While building a language model, we try to estimate the probability of a sentence or a document. Given sequences (sentences or documents) w_1, w_2, …, w_T, a bigram language model assigns each sequence the probability P(w_1, …, w_T) = ∏_i P(w_i | w_{i−1}). Once we apply Maximum Likelihood Estimation (MLE), we should have a value for each term P(w_i | w_{i−1}). Perplexity…
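Perplexity is the inverse probability of the sequence, normalized by its length (the geometric mean of the inverse bigram probabilities). A small sketch with toy, hand-picked bigram probabilities:

```python
import math

# Toy bigram probabilities (made up for illustration; "<s>" marks
# the start of the sentence).
bigram_prob = {
    ("<s>", "the"): 0.5,
    ("the", "cat"): 0.25,
    ("cat", "sat"): 0.5,
}

def perplexity(tokens):
    log_sum = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        log_sum += math.log(bigram_prob[(prev, cur)])
    n = len(tokens) - 1                # number of predicted tokens
    return math.exp(-log_sum / n)      # exp of mean negative log-prob

print(perplexity(["<s>", "the", "cat", "sat"]))  # (0.5*0.25*0.5)^(-1/3) ~ 2.52
```

Lower perplexity means the model assigns higher probability to the observed text, so it is the standard held-out evaluation metric for language models.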
What would you care more about – precision or recall – for a spam filtering problem?
A false positive means the message was not spam and we called it spam; a false negative means it was spam and we didn't label it spam. Precision = TP / (TP + FP) and Recall = TP / (TP + FN). Increasing precision involves decreasing FP, and increasing recall means decreasing FN. We don't want…
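The two formulas in a runnable form, with toy confusion counts (the numbers are invented; positive class = "spam"):

```python
# Sketch: precision and recall from raw confusion counts.
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp)  # of everything we flagged, how much was spam
    recall = tp / (tp + fn)     # of all real spam, how much we caught
    return precision, recall

p, r = precision_recall(tp=90, fp=10, fn=30)
print(p, r)  # 0.9 0.75
```

For spam filtering, a false positive (a real email sent to the spam folder) is usually the costlier mistake, which is why the answer leans toward precision.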
What are the commonly used activation functions? When are they used?
Ans. The commonly used activation functions are:
- Linear: g(x) = x. This is the simplest activation function. However, it cannot model complex decision boundaries: a deep network with only linear activations can be shown to be incapable of handling non-linear decision boundaries.
- Sigmoid: This is a common activation function in the last layer of the neural…
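Scalar versions of the activations this answer discusses (linear and sigmoid from the excerpt, plus tanh and relu, which the surrounding questions also mention):

```python
import math

def linear(x):
    return x                            # identity: no non-linearity

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))   # squashes to (0, 1)

def tanh(x):
    return math.tanh(x)                 # squashes to (-1, 1), zero-centered

def relu(x):
    return max(0.0, x)                  # zero for negatives, identity otherwise

print(sigmoid(0.0), tanh(0.0), relu(-2.0))  # 0.5 0.0 0.0
```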
I have used a 4-layered fully connected network to learn a complex classifier boundary. I have used tanh activations throughout, except the last layer, where I used a sigmoid activation for binary classification. I train for 10K iterations with 100K examples (my data points are 3-dimensional, and I initialized my weights to 0 to begin with). I see that my network is unable to fit the training data and is leading to a high training error. What is the first thing I try?
1. Increase the number of training iterations
2. Make a more complex network – increase hidden layer size
3. Initialize weights to a random small value instead of zeros
4. Change tanh activations to relu
Ans: (3). I will initialize weights to a non-zero value, since changing all the weights in the same…
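The symmetry problem behind answer (3) can be demonstrated directly. Below is a hand-coded sketch of a tiny 1-hidden-layer net (2 tanh hidden units, sigmoid output) trained on one made-up example. With all-zero initialization the hidden activations are zero and the output weights are zero, so every gradient is zero: the two hidden units not only stay identical, here they never move at all.

```python
import math

w_h = [[0.0, 0.0], [0.0, 0.0]]   # hidden weights, one row per unit
w_o = [0.0, 0.0]                 # output weights
x, y = [1.0, -2.0], 1.0          # one made-up training example

for _ in range(10):              # a few gradient steps, lr = 0.1
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_h]
    p = 1 / (1 + math.exp(-sum(w * hi for w, hi in zip(w_o, h))))
    d_out = p - y                                      # dLoss/d(pre-sigmoid)
    d_h = [d_out * w * (1 - hi * hi) for w, hi in zip(w_o, h)]
    w_o = [w - 0.1 * d_out * hi for w, hi in zip(w_o, h)]
    w_h = [[w - 0.1 * dh * xi for w, xi in zip(row, x)]
           for row, dh in zip(w_h, d_h)]

print(w_h[0] == w_h[1])  # True: the two hidden units never diverge
```

With small random initialization the units receive different gradients from the first step and can learn different features.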
What are the different ways of preventing over-fitting in a deep neural network? Explain the intuition behind each.
- L2 norm regularization: makes the weights closer to zero, which prevents overfitting.
- L1 norm regularization: makes the weights closer to zero and also induces sparsity in the weights; a less common form of regularization.
- Dropout regularization: ensures some of the hidden units are dropped out at random, so that the network does not overfit by…
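A minimal sketch of the dropout idea at training time, in its common "inverted" form: each unit is zeroed with probability p, and survivors are scaled by 1/(1−p) so the expected activation is unchanged and no rescaling is needed at test time (the activations below are made up).

```python
import random

def dropout(activations, p, rng):
    keep = 1.0 - p
    # Zero each unit with probability p; scale survivors by 1/keep.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
h = [0.5, -1.2, 0.8, 2.0, -0.3]
print(dropout(h, p=0.5, rng=rng))   # roughly half the units zeroed
```

Because a different random subset of units is active on each pass, the network cannot rely on any one unit, which acts like training an ensemble of thinned networks.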
I have designed a 2-layered deep neural network for a classifier, with 2 units in the hidden layer. I use linear activation functions, with a sigmoid at the final layer. I use a data visualization tool and see that the decision boundary is in the shape of a sine curve. I have tried to train with 200 data points with known class labels and see that the training error is too high. What do I do?
a. Increase the number of units in the hidden layer
b. Increase the number of hidden layers
c. Increase the data set size
d. Change the activation function to tanh
e. Try all of the above
The answer is (d). When I use a linear activation function, the deep neural network realizes a linear combination of linear functions, which leads to modeling only…
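The collapse of stacked linear layers into one linear map can be checked numerically. A sketch with two made-up 2x2 weight matrices: applying the layers one after another gives exactly the same result as applying their product once.

```python
def matmul(A, B):
    # Plain 2D matrix product.
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def apply(W, x):
    # Apply a weight matrix to a vector (no non-linearity).
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

W1 = [[1.0, 2.0], [0.0, -1.0]]   # layer 1 weights (made up)
W2 = [[0.5, 1.0], [2.0, 0.0]]    # layer 2 weights (made up)
x = [3.0, -1.0]

two_layers = apply(W2, apply(W1, x))   # h = W1 x, then y = W2 h
one_layer = apply(matmul(W2, W1), x)   # y = (W2 W1) x
print(two_layers == one_layer)         # True: same linear map
```

Since the composite is a single linear map, no amount of extra linear layers or units can produce a sine-shaped boundary; only a non-linear activation such as tanh can.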
Can you give an example of a classifier with high bias and high variance?
High bias means the data is being underfit: the decision boundary is usually not complex enough. High variance happens due to overfitting: the decision boundary is more complex than it should be. High bias and high variance happen when you fit a complex decision boundary that is also not fitting the training set…