Data Science Preparation
This quiz lets you practice your data science skills and check how well prepared you are for your upcoming data science interview.
Good luck!
We wish you good luck for any upcoming interviews!
Let us know if you have any questions about data science interviews at hello@machinelearninginterview.com

Question 1 of 20
1 point. Category: Optimization
With the maximum likelihood estimate, are we guaranteed to find a global optimum?
Explanation:
Maximum likelihood estimation finds the parameter values that maximize the likelihood. If the likelihood is strictly concave (equivalently, the negative likelihood is strictly convex), any optimum we find is the unique global optimum. This is usually not the case, and we may end up at a local optimum.

Question 2 of 20
1 point. Category: Machine learning models
Will logistic regression always reach the optimum in a binary classification problem where the dataset is perfectly linearly separable?
Explanation:
While the objective function is convex, the optimum lies at infinity when the data is perfectly separable. Hence logistic regression need not converge.
To understand this intuitively: if the points are perfectly separable, there are infinitely many sigmoids that separate them, and a steeper sigmoid always fits the training data better.
Logistic regression keeps moving toward the ideal step function, and the weights keep growing without bound. In this case, adding a regularization term helps.
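The behaviour described above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the four 1-D points, the learning rate, and the helper name `fit_weight` are all assumptions made for the example. Without regularization the weight keeps growing as training continues; with a small L2 penalty it settles at a finite value.

```python
import math

# Hypothetical, perfectly separable 1-D dataset: negatives left of 0, positives right.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_weight(steps, l2=0.0, lr=0.5):
    """Gradient descent on the mean log-loss of p = sigmoid(w * x), plus (l2 / 2) * w**2."""
    w = 0.0
    for _ in range(steps):
        grad = sum((sigmoid(w * x) - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * (grad + l2 * w)
    return w

# Columns: number of steps, unregularized weight, L2-regularized weight.
for steps in (100, 1_000, 10_000):
    print(steps, round(fit_weight(steps), 3), round(fit_weight(steps, l2=0.1), 3))
```

The unregularized column keeps increasing with the number of steps, while the regularized column stops changing once the penalty balances the data term.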

Question 3 of 20
1 point. Category: Machine learning models
We know height and weight have a linear relationship. Suppose we fit a straight line using ridge regression. After visualizing the result of the model, we realize that the model is overfitting. If 20 data points were used in training, what should be done to reduce overfitting?
Explanation:
We need more data to reduce overfitting. Repeating the same data does not help, however: duplicating every point doubles the weight of the data term, which has the same effect as halving the regularization weight and leads to more overfitting. Repeated data adds no new information.
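The claim about duplicated data can be checked with the closed-form ridge solution. The sketch below assumes a one-feature model without intercept, y ≈ w * x, for which w = Σxy / (Σx² + λ); the data values, λ, and the function name `ridge_slope` are made up for illustration.

```python
def ridge_slope(xs, ys, lam):
    """Closed-form ridge solution for y ≈ w * x (no intercept): w = Σxy / (Σx² + λ)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

# Hypothetical training data and regularization weight.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.8]
lam = 2.0

w_original = ridge_slope(xs, ys, lam)
w_duplicated = ridge_slope(xs * 2, ys * 2, lam)   # every point repeated twice
w_half_lambda = ridge_slope(xs, ys, lam / 2)      # original data, half the penalty
print(w_original, w_duplicated, w_half_lambda)
```

Fitting on the duplicated data gives the same slope as fitting on the original data with half the penalty, so the duplication only weakens the regularizer.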

Question 4 of 20
1 point. Category: Machine learning models
How do you eliminate underfitting?
Explanation:
Underfitting is the opposite of overfitting. It occurs when the model is too simple to fit the data well. This can happen if the right features were not selected or extracted, or if the regularization is too strong.
 One way to avoid underfitting is to make the model more complex, for instance by adding more parameters.
 If you are convinced that your model is complex enough, try increasing the number of features or extracting new features from the existing ones.
 Once you have tried both of these steps, reduce the regularization hyperparameter.

Question 5 of 20
2 points. Category: ML Application
What are the different ways in which you can cluster movies into genres for a movie recommendation engine?
Explanation:
K-means is a clustering algorithm, while PCA is used for dimensionality reduction (not necessarily clustering). Hence options 1 and 3 can be used to find movie clusters, while the 2nd option does not make sense.

Question 6 of 20
1 point. Category: Linear Algebra
What are the eigenvalues and eigenvectors of the following matrix?
Explanation:
The eigenvalues of a diagonal matrix are its diagonal elements. In general, the eigenvalues of a matrix $A$ are all those $\lambda$ such that $Ax = \lambda x$ for some nonzero vector $x$. Consider vector . Then . Similarly for , . Hence the answer is the 2nd option, i.e., the eigenvalue is 1 and is the corresponding eigenvector.
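The rule for diagonal matrices can be verified directly from the definition Ax = λx. The matrix below is a hypothetical 2x2 example chosen only for illustration; `matvec` is a small helper written for this sketch.

```python
# Hypothetical 2x2 diagonal matrix, chosen only for illustration.
A = [[2.0, 0.0],
     [0.0, 3.0]]

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

# For a diagonal matrix, each standard basis vector e_i is an eigenvector
# whose eigenvalue is the i-th diagonal element.
eigenpairs = [(A[0][0], [1.0, 0.0]), (A[1][1], [0.0, 1.0])]
for lam, v in eigenpairs:
    assert matvec(A, v) == [lam * c for c in v]   # checks A v = lambda v
    print(lam, v)
```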

Question 7 of 20
1 point. Category: Probability
In a college, 200 students are randomly selected. 140 like tea, 120 like coffee and 80 like both tea and coffee. Which of the following is true?
Explanation:
140 like tea and 80 like both tea and coffee, so 60 like only tea. 120 like coffee and 80 like both, so 40 like only coffee. Then 60 + 40 + 80 = 180 like either tea or coffee.
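The same count follows from inclusion-exclusion, |tea ∪ coffee| = |tea| + |coffee| - |both|. A short check of the arithmetic, using only the numbers given in the question:

```python
total, tea, coffee, both = 200, 140, 120, 80

only_tea = tea - both                     # 60
only_coffee = coffee - both               # 40
either = only_tea + only_coffee + both    # 180, same as tea + coffee - both
neither = total - either                  # 20

print(either / total, neither / total)    # fractions of the 200 students
```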

Question 8 of 20
3 points. Category: Timeseries
How do you forecast electricity demand based on past usage (a typical time series problem) and additional features such as time of day, temperature and so on? Which of the following are reasonable models to try for this use case?
Explanation:
 Taking a moving average of the previous readings is a reasonable first thing to try, but it does not capture changes due to external conditions. It is reactive rather than predictive.
 Learning a regression model from readings in past intervals plus external factors addresses this problem. But time series data is typically autocorrelated, which means the error term (residual) of the linear regression is not independent of the previous error terms.
 ARIMA, which stands for Autoregressive Integrated Moving Average, is a model for time series data that incorporates both autoregressive and moving average features, along with detrending of the data. The AR part means that the values are regressed on their own lagged values. The MA part means that the regression error is a linear combination of past error terms. The I part means that the data have been differenced to remove trend. This is a common model for time series analysis.
 More recently, LSTMs have been used for time series models when there is enough training data.
 Principal component analysis (PCA) is a dimensionality reduction technique, not necessarily used for regression.
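Two of the points above, the reactive nature of a moving average and the differencing step in ARIMA, can be illustrated on a toy series. This is a minimal sketch in plain Python, not an ARIMA implementation; the demand values and both function names are assumptions made for the example.

```python
def moving_average_forecast(series, window):
    """Forecast each point as the mean of the previous `window` readings."""
    return [sum(series[t - window:t]) / window for t in range(window, len(series))]

def difference(series):
    """First-order differencing, the 'I' step in ARIMA."""
    return [b - a for a, b in zip(series, series[1:])]

# Hypothetical demand readings with a steady upward trend of 2 units per step.
demand = [100.0 + 2.0 * t for t in range(10)]

forecast = moving_average_forecast(demand, window=3)
errors = [actual - pred for actual, pred in zip(demand[3:], forecast)]
print(errors)               # constant positive error: the average lags the trend
print(difference(demand))   # differencing leaves a constant, trend-free series
```

On a trending series the moving average always under-predicts by the same amount, while the differenced series has no trend left for the AR and MA terms to deal with.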

Question 9 of 20
1 point. Category: ML Application
What are some practical problems you are likely to encounter while building a recommendation system for an application like Netflix?
Explanation:
Data about movies is available from many public sources, including IMDb, Wikimedia and so on. For a particular customer, however, we cannot get enough data about their preferences (for personalization) until they start using the application. This is known as the cold-start problem.

Question 10 of 20
2 points. Category: Machine learning models
We know height and weight have a linear relationship. Suppose we fit a straight line (curve fitting) using ridge regression (with an L2 regularizer). After visualizing the result of the model, we realize that the model is fitting a nonlinear curve. If 20 data points were used in training, what should be done to correct the model from nonlinear to linear?
Explanation:
The model has overfit the data, learning a nonlinear relationship that is more complex than required to explain the data. To reduce overfitting, one can either increase the weight of the regularizer (the L2 regularizer in this case) or add more training data.

Question 11 of 20
1 point. Category: Online Learning
Online learning is called online because the system is always up and training is done on the live system.
Explanation:
Online learning can also be done offline. Here the word online means incremental: you train the algorithm by feeding it the data sequentially. Incremental or online learning is useful when the algorithm needs to adapt to rapidly changing data, or when resources such as training time or computing power are limited. Often, using the entire dataset for training at once is not feasible.
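Incremental training can be sketched as one small update per example, with the data arriving from a generator so that the full dataset is never held in memory. The linear model, the learning rate, and the synthetic stream drawn from y = 3x + 1 are all assumptions made for this illustration.

```python
def sgd_step(w, b, x, y, lr=0.05):
    """One incremental update of a 1-D linear model y ≈ w * x + b on a single example."""
    error = (w * x + b) - y
    return w - lr * error * x, b - lr * error

def stream():
    # Hypothetical data source: yields one (x, y) pair at a time from y = 3x + 1.
    for i in range(5000):
        x = (i % 10) / 10.0
        yield x, 3.0 * x + 1.0

w, b = 0.0, 0.0
for x, y in stream():        # examples are consumed one by one
    w, b = sgd_step(w, b, x, y)
print(round(w, 2), round(b, 2))
```

Nothing in this loop requires a live system: the same code works whether the examples come from a production event stream or from a file read line by line.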

Question 12 of 20
1 point. Category: Machine learning models
What is naive about the naive Bayes binary classifier?
Explanation:
The naive Bayes classifier makes the naive assumption that the feature dimensions are independent of each other, conditioned on the class.
Naive Bayes classifier: suppose you have M-dimensional data of the form $x = (x_1, x_2, \ldots, x_M)$ and you want to predict the class $y$ for this data. From Bayes' rule we have:
$$P(y \mid x_1, \ldots, x_M) = \frac{P(x_1, \ldots, x_M \mid y)\, P(y)}{P(x_1, \ldots, x_M)}$$
Using the naive Bayes assumption of independence of all features given $y$, we can write:
$$P(y \mid x_1, \ldots, x_M) = \frac{P(y) \prod_{i=1}^{M} P(x_i \mid y)}{P(x_1, \ldots, x_M)}$$
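The factorization P(y) ∏ P(x_i | y) translates almost directly into code. The sketch below is a minimal counting-based classifier for binary features with add-one smoothing; the toy "spam"/"ham" data, the feature meanings, and the function names `train` and `predict` are hypothetical.

```python
from collections import Counter, defaultdict

def train(rows, labels):
    """Estimate class counts and per-class feature value counts from the data."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)   # (class, feature index) -> value counts
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            feature_counts[(y, i)][v] += 1
    return class_counts, feature_counts

def predict(row, class_counts, feature_counts):
    """Return the class maximizing P(y) * prod_i P(x_i | y); the shared denominator is ignored."""
    total = sum(class_counts.values())
    scores = {}
    for y, n_y in class_counts.items():
        score = n_y / total                                        # P(y)
        for i, v in enumerate(row):
            # P(x_i = v | y) with add-one smoothing for a binary feature.
            score *= (feature_counts[(y, i)][v] + 1) / (n_y + 2)
        scores[y] = score
    return max(scores, key=scores.get)

# Hypothetical binary features: (contains "free", contains "meeting").
rows = [(1, 0), (1, 0), (1, 1), (0, 1), (0, 1), (0, 0)]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
model = train(rows, labels)
print(predict((1, 0), *model), predict((0, 1), *model))
```

Each feature contributes its own factor to the score, which is exactly where the conditional independence assumption enters.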

Question 13 of 20
1 point. Category: Statistics
If the correlation coefficient between 2 random variables is 0, which of the following statements is true?
Explanation:
The correlation coefficient only measures linear correlation and may completely miss nonlinear relationships. A coefficient of 0.0 does not mean there is definitely a nonlinear relationship. It means only that there is no linear relationship; a nonlinear relationship may or may not exist, and the coefficient cannot detect it.
Note that the correlation coefficient can still be a powerful tool in feature engineering, for examining the relationship between the true labels and any single feature in the training data, or between any two features.
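A small example makes the limitation concrete. Below, y is completely determined by x through y = x², yet the Pearson coefficient is zero because the relationship is symmetric rather than linear. The data points and the helper `pearson` are chosen for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
quadratic = [x ** 2 for x in xs]       # fully determined by x, but not linearly
linear = [2 * x + 1 for x in xs]       # exact linear relationship

print(pearson(xs, quadratic))
print(pearson(xs, linear))
```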

Question 14 of 20
1 point. Category: Model tuning
What is the bias-variance tradeoff?
Explanation:
The bias-variance tradeoff is a core concept in supervised learning.
We want to design models that fit the training data well, capturing all its subtleties, and at the same time generalize well to unseen test data. The bias-variance tradeoff says we cannot do both well simultaneously.
→ If we fit the training data very well, we might end up overfitting to it. This can cause high variance in predictions when we try the model on different versions of test data.
→ If we avoid overfitting by making the model simpler, say by using a regularizer, we might end up underfitting the training data. We end up with a biased predictor, but it might work well on unseen test data: the variability in predictions across different test data is low (low variance).
Ideally we want low bias (works well on training data) and low variance (generalizes well to test data), but in practice we need to pick a tradeoff point.

Question 15 of 20
2 points. Category: NLP
What are the challenges in building word embeddings from tweets compared with Wikipedia data?
Explanation:
Twitter data differs from Wikipedia data in a number of ways:
 Twitter data is very noisy:
  Spelling errors
  Abbreviations
  Code mixing, where multiple languages are used
  Grammatical mistakes
 Tweets are very short.
 There is a lot of variability in the way the language is used, arising from informal and colloquial usage.
 Many out-of-vocabulary words that do not appear in Wikipedia.
Wikipedia data, on the other hand, is characterized by long, well-formed sentences that are carefully curated, and it is much cleaner than tweet data. However, Wikipedia data is crowdsourced and hence not fully reliable either.
From the perspective of building word embeddings, pretrained models such as word2vec trained on the Google News dataset and GloVe transfer much more readily to Wikipedia data than to Twitter data. With Twitter data, existing models fall short and embeddings must be trained on tweets.
Note that the reliability of the data is a less important factor in building word embeddings as long as the co-occurrence characteristics are met, so this is not necessarily a challenge.

Question 16 of 20
1 point. Category: Model tuning
Of the bias and variance errors, which one is introduced by the data and the algorithm?
Explanation:
If your model is overfitting, it does not generalize well to new, unseen data. An overfitting model is highly complex because it tries to fit every example, including the noise in the data. So the prediction error on new data is due to the complexity of the algorithm. Increasing the number of parameters in the model is one way of making it more complex.
Bias error arises from biases toward certain kinds of data, for instance because we cannot access every kind of data. Because of this bias, the model does not have the right features to fit the data accurately.

Question 17 of 20
1 point. Category: Timeseries
In which of the following cases should the dataset NOT be shuffled?
Explanation:
Time series data is contextual: shuffling makes the samples look random and independent, so the relation between adjacent samples is lost. In other settings, shuffling should be done before the train, validation and test split, to avoid biases in the dataset or any ordering introduced by the way the data was generated. For some online algorithms, such as an SGD classifier, shuffling is necessary because their efficiency relies on randomness in the training set.
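For time series, the usual alternative to a shuffled split is a chronological one, where the model is trained on the past and evaluated on the future. A minimal sketch, with a hypothetical helper `chronological_split` and made-up readings:

```python
def chronological_split(series, train_fraction=0.8):
    """Split an ordered series so that every test point comes after every training point."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]

readings = list(range(10))   # hypothetical readings, already ordered by time
train, test = chronological_split(readings)
print(train, test)
```

Because the order is preserved, the relation between adjacent samples stays intact in both parts.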

Question 18 of 20
1 point. Category: Model tuning
Suppose you are building a machine learning model for anomaly detection on a website. Which of the following metrics would you NOT choose to evaluate your model?
Explanation:
Website traffic is usually huge, with a large number of normal requests and very few anomalies. This produces a large, skewed dataset. Accuracy should not be used on a skewed dataset, because predicting the majority class for every example gives high accuracy while producing a large number of false positives or false negatives.
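The problem with accuracy is easy to reproduce. The sketch below assumes a hypothetical sample of 1000 requests with 10 anomalies and a trivial "model" that always predicts the majority class; recall is shown as one example of a metric that exposes the failure.

```python
# Hypothetical traffic sample: 990 normal requests (0) and 10 anomalies (1).
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 1000          # a "model" that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
false_neg = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = true_pos / (true_pos + false_neg)

print(accuracy, recall)
```

The accuracy looks excellent even though every single anomaly is missed, which the recall makes obvious.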

Question 19 of 20
1 point. Category: Model tuning
Which of the following is not a tradeoff in machine learning?
Explanation:
MLE and MAP can both increase or decrease at the same time, whereas in the other cases increasing one quantity decreases the other, and vice versa.

Question 20 of 20
1 point. Category: Model tuning
Pick the true statements.
Explanation:
The validation set is used for tuning the model's hyperparameters. The test set is used to estimate how well the model generalizes. Cross-validation is a strategy that avoids shrinking the training set through repeated splitting into test and validation sets. Hence cross-validation does not reduce the training set, whereas holding out a separate validation set (without cross-validation) does.
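The way cross-validation reuses the data can be seen from the fold indices alone. This is a minimal sketch with contiguous, unshuffled folds; the function name `k_fold_indices` and the sizes are assumptions made for the example.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k contiguous folds."""
    fold_size = n_samples // k
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any leftover samples.
        stop = start + fold_size if fold < k - 1 else n_samples
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, validation

for train, validation in k_fold_indices(10, k=5):
    print(len(train), validation)
```

Every sample is used for validation exactly once and for training in the remaining folds, so no part of the data is permanently set aside for validation.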