Data Science Interview Preparation
This quiz is for you to practice and learn your data science skills and check how well you are prepared for your upcoming data science interview.
Good luck!
Keep preparing for your interview on machinelearninginterview.com
We wish you good luck for any upcoming interviews!
If you have any questions about data science interviews, let us know at hello@machinelearninginterview.com

Question 1 of 57
1. Question
1 point. Category: Machine learning models
Will logistic regression always reach the optimum in a binary classification problem where the dataset is perfectly linearly separable?
Correct
While the objective function is convex, its optimum is attained only as the weights go to infinity when the data is perfectly separable. Hence logistic regression need not converge.
To understand this intuitively, note that if the points are perfectly separable, an infinite number of sigmoids can separate the same set of points, and a steeper sigmoid always fits better than a flatter one.
Logistic regression keeps moving toward the ideal step function, so the weights keep growing without bound. In this case, adding a regularization term helps.
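The effect can be reproduced with a few lines of NumPy. This is a minimal sketch, assuming plain gradient descent on a toy dataset; the function name `fit_logistic`, the learning rate and the data are illustrative choices, not part of the quiz.

```python
import numpy as np

def fit_logistic(X, y, l2=0.0, lr=0.5, steps=1000):
    """Plain gradient descent on the mean logistic loss plus (l2 / 2) * ||w||^2."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))      # predicted probabilities
        grad = X.T @ (p - y) / len(y) + l2 * w  # gradient of the penalized loss
        w -= lr * grad
    return w

# Toy, perfectly separable 1-D data; the second column is a constant bias feature.
X = np.array([[-2.0, 1.0], [-1.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

# Without regularization the weight norm keeps growing with more steps.
norm_100 = np.linalg.norm(fit_logistic(X, y, steps=100))
norm_10000 = np.linalg.norm(fit_logistic(X, y, steps=10000))

# With an L2 penalty the weights settle at a finite value.
w_reg_a = fit_logistic(X, y, l2=0.1, steps=10000)
w_reg_b = fit_logistic(X, y, l2=0.1, steps=20000)

print(norm_100, norm_10000)
print(w_reg_a, w_reg_b)
```

Running longer without the penalty only makes the weights larger, while the regularized runs agree with each other regardless of the extra steps.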

Question 2 of 57
2. Question
3 points. Category: Timeseries
How do you forecast electricity demand based on past usage (a typical time series problem) and additional features such as time of day, temperature and so on? Which of the following are reasonable models to try for this use case?
Correct 3 / 3 Points
 Taking a moving average of the previous readings is a reasonable first thing to try, except that it does not capture changes due to external conditions. It is reactive rather than predictive.
 Learning a regression model from readings in past intervals plus external factors addresses this problem. But time series data is typically autocorrelated, which means the error term (residual) of the linear regression is not independent of the error terms at previous time steps.
 ARIMA, which stands for Autoregressive Integrated Moving Average, is a model for time series data that incorporates both autoregressive and moving average features, along with detrending of the data. The AR part means that the values are regressed on their own lagged values, the MA part means that the regression error is a linear combination of past error terms. The I part means that the data have been differenced to remove trend. This is a common model for time series analysis.
 More recently, LSTMs have been used for time series models when there is enough training data.
 Principal component analysis (PCA) is a dimensionality reduction technique, not a regression model by itself.

Question 3 of 57
3. Question
1 point. Category: Machine learning models
We know height and weight have a linear relationship. Let us say we are trying to fit a straight line through ridge regression. After visualizing the result of the model, we realise that the model is overfitting. If there are 20 data points used in training, what should be done to reduce overfitting?
Correct
We need more data to avoid overfitting. But repeating the same data does not help: duplicating every point doubles the data-fit term relative to the penalty, which is equivalent to halving the regularization weight, and so leads to more overfitting. Repeating data does not add new information.
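The equivalence between duplicating the data and halving the regularization weight can be checked numerically. The sketch below assumes the closed-form ridge solution and synthetic height/weight numbers; the helper `ridge` and all values are illustrative, and for simplicity the bias term is penalized too.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge solution: argmin ||Xw - y||^2 + lam * ||w||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(0)
height = rng.normal(170.0, 10.0, size=20)                    # synthetic heights
X = np.column_stack([np.ones(20), (height - 170.0) / 10.0])  # bias + scaled height
y = 65.0 + 8.0 * X[:, 1] + rng.normal(0.0, 3.0, size=20)     # synthetic weights

lam = 5.0
w_dup = ridge(np.vstack([X, X]), np.concatenate([y, y]), lam)  # every point repeated
w_half = ridge(X, y, lam / 2.0)                                # original data, half penalty

print(w_dup, w_half)
```

Both solutions coincide because (2XᵀX + λI)⁻¹(2Xᵀy) = (XᵀX + (λ/2)I)⁻¹Xᵀy.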

Question 4 of 57
4. Question
2 points. Category: Machine learning models
We know height and weight have a linear relationship. Let us say we are trying to fit a straight line (curve fitting) through ridge regression (using an L2 regularizer). After visualizing the result of the model, we realise that the model is fitting a nonlinear curve. If there are 20 data points used in training, what should be done to correct the model from nonlinear to linear?
Correct 2 / 2 Points
The model has overfit the data, fitting a nonlinear curve and thus learning a more complex model than required to explain the data. To reduce overfitting one can either increase the regularizer weight (the L2 regularizer in this case) or add more training data.

Question 5 of 57
5. Question
1 point. Category: Linear Algebra
What are the eigenvalues and eigenvectors of the following matrix?
Correct
Eigenvalues of a diagonal matrix are its diagonal elements. In general, the eigenvalues of a matrix A are all those $\lambda$ such that $Ax = \lambda x$ for some nonzero vector x. Consider vector . Then . Similarly for , . Hence the answer is the 2nd option, i.e., the eigenvalue is 1 and is the corresponding eigenvector.
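The claim about diagonal matrices is easy to verify with NumPy. The matrix below is a hypothetical example chosen for illustration; it is not the matrix from the question, which is not reproduced here.

```python
import numpy as np

A = np.diag([2.0, 3.0])  # hypothetical diagonal matrix, not the one from the question
eigenvalues, eigenvectors = np.linalg.eig(A)

# Each column v of `eigenvectors` satisfies A v = lambda v.
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(A @ v, lam * v)

print(sorted(eigenvalues))
print(eigenvectors)
```

The eigenvalues returned are the diagonal entries, and the eigenvectors are the standard basis vectors.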

Question 6 of 57
6. Question
1 point. Category: Probability
In a college, 200 students are randomly selected. 140 like tea, 120 like coffee and 80 like both tea and coffee. Which of the following is true?
Correct
140 like tea while 80 like both tea and coffee. So 60 like only tea. 120 like coffee and 80 like both tea and coffee, so 40 like only coffee. Then 60 + 40 + 80 = 180 like either tea or coffee.
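The same arithmetic, written out as a short script. The variable names are illustrative; the last line additionally derives how many students like neither drink, which follows from the total of 200.

```python
total, tea, coffee, both = 200, 140, 120, 80

only_tea = tea - both                    # like tea but not coffee
only_coffee = coffee - both              # like coffee but not tea
either = only_tea + only_coffee + both   # like at least one of the two
neither = total - either                 # like neither drink

print(only_tea, only_coffee, either, neither)
```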

Question 7 of 57
7. Question
2 points. Category: ML Application
What are the different ways in which you can cluster movies into different genres for a movie recommendation engine?
Correct
K-means is a clustering algorithm, while PCA is used for dimensionality reduction (not necessarily clustering). Hence options 1 and 3 can be used to find movie clusters, while the 2nd option does not make sense.

Question 8 of 57
8. Question
1 point. Category: ML Application
What are some practical problems you are likely to encounter while building a recommendation system for an application like Netflix?
Correct
Data about movies is available from many public sources, including IMDb, Wikimedia and so on. But for a particular customer, we cannot get enough data about their preferences (for personalisation) until they start using the application.
Incorrect

Question 9 of 57
9. Question
1 point. Category: Machine learning models
How do you eliminate underfitting?
Correct
Underfitting is the opposite of overfitting: it occurs when the model is too simple to fit the data well enough. This can happen if the right features were not selected or extracted, or if the regularization is too strong.
 To avoid underfitting, one way is to make the model more complex, for instance by adding more parameters.
 If you are convinced that your model is complex enough, try increasing the number of features or extracting new features from the existing ones.
 Once you have performed both steps, reduce the regularization hyperparameter.

Question 10 of 57
10. Question
1 point. Category: Online Learning
Online learning is called online as the system is always up and training is done on the live system.
Correct
Online learning can also be done offline. Here the word online means incremental: you train the algorithm by feeding it the data sequentially. Incremental or online learning is useful when the algorithm needs to adapt to rapidly changing data, or when resources, such as training time or computing resources, are limited. Often, using the entire dataset for training at once is not feasible.
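To make "incremental" concrete, here is a minimal sketch of online learning for linear regression, where the model is updated from one example at a time and never sees the whole dataset at once. The function `online_update`, the learning rate and the synthetic data stream are assumptions made for illustration.

```python
import numpy as np

def online_update(w, x, y, lr=0.05):
    """One stochastic gradient step for squared error on a single (x, y) example."""
    error = w @ x - y
    return w - lr * error * x

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])  # hypothetical weights that generate the stream
w = np.zeros(2)

# Examples arrive one at a time; each is used for one update and then discarded.
for _ in range(5000):
    x = rng.normal(size=2)
    y = true_w @ x              # noiseless target, to keep the sketch simple
    w = online_update(w, x, y)

print(w)
```

Nothing here requires a live system: the same loop works over a file read sequentially from disk.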

Question 11 of 57
11. Question
1 point. Category: Model tuning
What is the bias-variance tradeoff?
Correct
The bias-variance tradeoff is a core concept in supervised learning.
We want to design models that fit the training data well, capturing all its subtleties, and at the same time generalize well to unseen test data. The bias-variance tradeoff says we cannot do both well simultaneously.
→ If we fit the training data very well, we might end up overfitting to it. This can cause high variance in predictions when we try the model on various versions of test data.
→ If we avoid overfitting the training data by making the model simpler, say by using a regularizer, we might end up underfitting the training data. We end up with a biased predictor, but it might work well on unseen test data: the variability in predictions across different test data is low (low variance).
Ideally we want low bias (fits the training data well) and low variance (generalizes well to test data). But we need to pick a tradeoff point.

Question 12 of 57
12. Question
1 point. Category: Machine learning models
What is naive about the naive Bayes binary classifier?
Correct
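The naive Bayes classifier makes the naive assumption that the various feature dimensions are independent conditioned on the class $y$.
Naive Bayes classifier: Suppose you have M-dimensional data of the form $x = (x_1, \dots, x_M)$ and you want to predict the class $y$ for this data. From Bayes' rule we have:
$$P(y \mid x_1, \dots, x_M) = \frac{P(x_1, \dots, x_M \mid y)\, P(y)}{P(x_1, \dots, x_M)}$$
Using the naive Bayes assumption of independence of all features given y, we can write:
$$P(y \mid x_1, \dots, x_M) = \frac{P(y) \prod_{i=1}^{M} P(x_i \mid y)}{P(x_1, \dots, x_M)}$$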

Question 13 of 57
13. Question
2 points. Category: NLP
What are the challenges of building word embeddings from tweets compared with Wikipedia data?
Correct
Twitter data differs from wikipedia data in a number of ways :
 Twitter data is very noisy.
 Spelling errors
 Abbreviations
 Code mixing – where multiple languages are used
 Grammatical mistakes
 Tweets are very short
 There is a lot of variability in the way the language is used, arising from informal and colloquial usages.
 Many out of vocabulary words that are not in wiki.
Wikipedia data, on the other hand, is characterized by long, well-formed sentences that are carefully curated, and it is much cleaner than tweet data. However, wiki data is crowdsourced, and hence not fully reliable either.
From the perspective of building word embeddings, pretrained models such as word2vec trained on the Google News dataset and GloVe transfer much more readily to Wikipedia data than to Twitter data. With Twitter data, existing models fall short and embeddings must be trained on tweets.
Note that the reliability of the data is a less important factor in building word embeddings, as long as the co-occurrence characteristics are met. So this is not necessarily a challenge.

Question 14 of 57
14. Question
1 point. Category: Model tuning
Out of the bias and variance errors, which one of them is introduced due to the data and the algorithm?
Correct
If your model is overfitting, it does not generalize well to new, unseen data. An overfitting model is highly complex, as it tries to fit every example and even the noise in the data. So the prediction error on new data is due to the complexity of the algorithm. Increasing the number of parameters in the model is one way of making it more complex.
Bias error arises from biases we have towards certain kinds of data, for instance because we cannot access all kinds of data. Because of this bias, the model may not have the right features to fit the data accurately.

Question 15 of 57
15. Question
1 point. Category: Statistics
If the correlation coefficient between 2 random variables is 0, which of the following statements is true?
Correct
The correlation coefficient measures only linear correlation and may completely miss nonlinear relationships. But a coefficient of 0.0 does not mean there is definitely a nonlinear relationship. Instead, there could be a nonlinear relationship that the coefficient cannot detect.
Note that the correlation coefficient can be a powerful tool for examining the relationship between the true labels and any single feature in the training data, or between any two features, as part of a feature engineering task.
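A small example of a nonlinear relationship that the correlation coefficient misses: y is fully determined by x, yet the coefficient is zero up to rounding. The choice of y = x² on a symmetric grid is an illustrative assumption.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)  # symmetric around zero
y = x ** 2                       # y depends on x exactly, but not linearly

r = np.corrcoef(x, y)[0, 1]      # Pearson correlation coefficient
print(r)
```

Because the grid is symmetric, the positive and negative halves cancel and the linear correlation vanishes.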

Question 16 of 57
16. Question
1 point. Category: Timeseries
In which of the following cases should the dataset NOT be shuffled?
Correct
Time series data is contextual, and shuffling destroys the temporal order, i.e. the relation between adjacent data samples is lost after shuffling. On the other hand, shuffling should be done while making the train, test and validation split, to avoid any biases in the dataset or any ordering due to the way the data was generated. For some online algorithms such as an SGD classifier, shuffling is necessary because their efficiency relies on the randomness of the training set.

Question 17 of 57
17. Question
1 point. Category: Model tuning
Suppose you are building a machine learning model for anomaly detection on a website. Which of the following metrics would you NOT choose to evaluate your model?
Correct
Traffic on a website is usually huge, with a large number of normal requests but very few anomalies. This leads to a large amount of data with very few anomalies, i.e. a skewed dataset problem. On any skewed dataset, accuracy should not be used as the measure. This is because even a very bad classifier that predicts the majority class for all examples gets a good accuracy score, but this does not mean we want to pick this classifier.
The F1 score, based on precision and recall, and the AUC metric are better metrics on an imbalanced dataset.
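The following sketch makes the point with a hypothetical dataset of 1000 requests, 1% of them anomalies, and a classifier that always predicts "normal". The helper `f1_score` here is a hand-written illustration, not a library function, and it returns 0 by convention when there are no true positives.

```python
import numpy as np

def f1_score(y_true, y_pred):
    """F1 from precision and recall; returns 0.0 when there are no true positives."""
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = np.array([1] * 10 + [0] * 990)  # hypothetical: 10 anomalies, 990 normal requests
y_pred = np.zeros_like(y_true)           # bad classifier: always predicts the majority class

accuracy = float(np.mean(y_pred == y_true))
f1 = f1_score(y_true, y_pred)
print(accuracy, f1)
```

The accuracy looks excellent even though the classifier never finds a single anomaly, which the F1 score exposes.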

Question 18 of 57
18. Question
1 point. Category: Model tuning
Which of the following is not really a tradeoff while developing an ML algorithm?
Correct
When building an ML model, one often needs to trade off between
 Bias and variance: the tradeoff between how well the model fits the existing training data and how well we want the model to generalize to new data.
 Precision and recall: Recall measures how many of all the positives are actually classified as positive. Precision measures how many of the items identified as positive are actually positive. It is easy to get a recall of 1 or a precision of 1, but hard to get both high precision and high recall. We want to pick the option that leads to the best combination of precision and recall; the F1 score is a metric that measures this.
 MLE and MAP: MLE stands for maximum likelihood estimate, the value that maximises the likelihood function. MAP refers to the maximum a posteriori estimate, the value that maximises the posterior distribution. There is no tradeoff here since, in the ideal case with infinite data, both estimates converge.
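The convergence of the MLE and MAP estimates can be seen in a coin-flip example. This sketch assumes a Bernoulli likelihood with a Beta(2, 2) prior, for which both estimates have closed forms; the prior and the 70% heads rate are illustrative assumptions.

```python
def mle(heads, n):
    """Maximum likelihood estimate of the heads probability."""
    return heads / n

def map_estimate(heads, n, a=2.0, b=2.0):
    """Mode of the Beta(a + heads, b + n - heads) posterior (assumes a, b > 1)."""
    return (heads + a - 1.0) / (n + a + b - 2.0)

gaps = []
for n in (10, 1000, 100000):
    heads = 7 * n // 10  # pretend 70% of the flips came up heads
    gaps.append(abs(mle(heads, n) - map_estimate(heads, n)))

print(gaps)
```

The gap between the two estimates shrinks as the number of flips grows, because the data overwhelms the prior.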

Question 19 of 57
19. Question
1 point. Category: Model tuning
Which of the following makes sense during k-fold cross-validation on a relatively small training dataset of 100 samples?
Correct
Cross-validation involves partitioning the training data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (the validation set). The validation set is rotated across the partitions to reduce the variance of the estimated model performance.
With a very small dataset, when K is small, say 2, the training set has only half the available data, leading to the possibility of overfitting. A higher K, such as 10, leaves 90 data samples for training, which is better than just 50 samples. Hence, for this question with just 100 data points, a higher value of K makes sense.
With a larger dataset, a small value of K might suffice if each fold has enough data points. With a lower K in such a case, the training time might be lower.
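The training-set sizes quoted above can be reproduced with a small fold generator. This is a minimal sketch using only NumPy; the function name `kfold_indices` and the fixed seed are illustrative choices.

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs, rotating the validation fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

for k in (2, 10):
    sizes = [(len(train), len(val)) for train, val in kfold_indices(100, k)]
    print(k, sizes[0])
```

With 100 samples, K = 2 trains on 50 points per fold and K = 10 trains on 90, at the cost of fitting the model ten times instead of twice.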

Question 20 of 57
20. Question
1 point. Category: Linear Algebra
If , then which of the following is true?
Correct
Matrix multiplication is associative, i.e. (AB)C = A(BC)
Matrix multiplication is distributive, i.e. A(B+C) = AB+AC
Matrix multiplication is not commutative in general: in the given case BA is defined only if m=q, and even then AB = BA need not hold.
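These three properties can be checked numerically. The shapes and the matrices P and Q below are arbitrary examples chosen for illustration, not the ones from the question.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))
B = rng.normal(size=(3, 4))
C = rng.normal(size=(4, 2))
D = rng.normal(size=(3, 4))  # same shape as B, so B + D is defined

associative = np.allclose((A @ B) @ C, A @ (B @ C))
distributive = np.allclose(A @ (B + D), A @ B + A @ D)

# Two square matrices for which both products exist but differ.
P = np.array([[1.0, 2.0], [3.0, 4.0]])
Q = np.array([[0.0, 1.0], [1.0, 0.0]])
commutative = np.allclose(P @ Q, Q @ P)

print(associative, distributive, commutative)
```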
Question 21 of 57
21. Question
1 point. Category: Optimization
For f(x) = |x|,
Correct
|x| is a piecewise linear function with a unique minimum at x=0, which is easy to see by plotting its graph. Applying the definition of a convex function also shows that |x| is convex. Note that the definition of a convex function has nothing to do with differentiability.
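The definition of convexity, f(ta + (1-t)b) <= t f(a) + (1-t) f(b) for all t in [0, 1], can be spot-checked on a grid for the absolute value function. The grid and tolerance are arbitrary choices; a grid check illustrates the inequality and is not a proof.

```python
import numpy as np

xs = np.linspace(-3.0, 3.0, 61)
ts = np.linspace(0.0, 1.0, 11)

# Broadcast over all pairs (a, b) and all mixing weights t.
a = xs[:, None, None]
b = xs[None, :, None]
t = ts[None, None, :]

lhs = np.abs(t * a + (1.0 - t) * b)             # f at the mixed point
rhs = t * np.abs(a) + (1.0 - t) * np.abs(b)     # mix of f at the endpoints
is_convex_on_grid = bool(np.all(lhs <= rhs + 1e-12))

x_min = xs[np.argmin(np.abs(xs))]               # grid point with the smallest value
print(is_convex_on_grid, x_min)
```

No derivative is used anywhere, which matches the remark that convexity does not depend on differentiability.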

Question 22 of 57
22. Question
1 point. Category: Optimization
For f(x) = , which of the following is true?
Correct
The exponential function is convex on any open interval.
Apply the definition of a convex function to verify that this function is convex.
This shows that a monotonically non-decreasing function can be convex too.

Question 23 of 57
23. Question
1 point. Category: Optimization
For f(x) = , which of the following is true?
Correct
Note that double differentiation of this function is f”(x) =
f”(x) >= 0 for all values of x because of
A strictly convex function has at most one minimum.
This function has only one minimum which is at x = 0 as f'(x) = = 0 at x=0

Question 24 of 57
24. Question
3 points
Which of the following are true?
Correct
Incorrect
Hint
Try constructing a 2 by 2 matrix and infer the results

Question 25 of 57
25. Question
1 point. Category: Generalization
While training a machine learning model
Correct
Preprocessing is done after the train-test split. The purpose of the split is to measure generalization: if the test set is included in preprocessing, i.e. preprocessing is done on the entire dataset, the test set no longer serves its purpose of mimicking new, unseen data. Hence the data is first split into train and test sets, and then preprocessing is fit on the train set.

Question 26 of 57
26. Question
1 point. Category: Feature engineering
Pick the right statement
Correct
Feature engineering includes feature selection and feature extraction. Features are simply the attributes of each data point. Feature engineering concerns the entire problem and requires domain expertise; it depends less on the dataset and more on the problem at hand. Data preprocessing is performed on the entire training set. If during data preprocessing one finds that two features are correlated, one can combine them into a single feature. Hence the two are different tasks: feature engineering works on the features (selection and extraction), while data preprocessing works on the dataset, for example by removing or filling in missing values for each feature, removing irrelevant features, etc.
Hint
Feature Engineering includes feature selection and feature extraction

Question 27 of 57
27. Question
1 point. Category: Generalization
Pick true statements
Correct
The training set, test set and validation set must all have the same features. The only difference is that data preprocessing is fit on the training set (combined with the validation set) and only applied (transform) to the test set. The actual steps involved are
 1. Split the dataset into train and test sets
 2. Fit the preprocessing on the training set
 3. Split the training set into training and validation sets
 4. Transform the test set using the same preprocessing fit in (2)
 5. Evaluate the model on the transformed test set
Although the test set is used to test generalization, it need not have the same number of points as the training set. The standard is to keep 20% of the entire dataset for the test set, but this can change depending on the total number of points. For example, if there are more than a million samples, even 10% can be enough for the test set, but if there are 10,000 samples, 20% is a good choice.
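The split-then-preprocess order can be sketched with NumPy alone. This assumes standardization (subtract the mean, divide by the standard deviation) as the preprocessing step and synthetic data; the model fitting and evaluation of the last step are left out, and all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(5.0, 2.0, size=(1000, 3))  # synthetic feature matrix

# Split into train and test sets (80% / 20%) after shuffling.
idx = rng.permutation(len(X))
n_test = len(X) // 5
X_test, X_train_full = X[idx[:n_test]], X[idx[n_test:]]

# Fit the preprocessing on the training portion only.
mean = X_train_full.mean(axis=0)
std = X_train_full.std(axis=0)
X_train_full_scaled = (X_train_full - mean) / std

# Split the preprocessed training portion into training and validation sets.
n_val = len(X_train_full_scaled) // 5
X_val, X_train = X_train_full_scaled[:n_val], X_train_full_scaled[n_val:]

# Transform the test set with the statistics learned from the training portion.
X_test_scaled = (X_test - mean) / std

print(X_train.shape, X_val.shape, X_test_scaled.shape)
```

The test set is scaled with the training statistics, so its own mean and standard deviation never leak into the preprocessing.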

Question 28 of 57
28. Question
1 point
Which of the tasks can be skipped in any Machine Learning task?
Correct
All the tasks are important to the success of an ML algorithm. Without knowing the problem at hand and what the dataset looks like, it is difficult to say which steps can be skipped.

Question 29 of 57
29. Question
1 point. Category: Statistics
Which of the following is true in supervised learning after all data preprocessing and feature engineering?
Correct
The true labels should be correlated with at least some of the features. If they are not, the features carry no information about the labels and the correct values cannot be predicted from them. Supervised learning models work only because of this relationship, which can be linear or nonlinear.

Question 30 of 57
30. Question
1 point. Category: python concepts
Why does python convention not permit local variables to begin with an underscore?
Correct
By convention, non-public ("private") attributes of a class begin with an underscore. Hence local variables do not usually start with one. Refer to PEP 8, the style guide for Python code, for more details: https://www.python.org/dev/peps/pep-0008/

Question 31 of 57
31. Question
1 point. Category: python concepts
Which of the following is not a built-in type in python?
Correct
Python has no built-in array type. It has the list, a generalized array of dynamic length that can hold dissimilar elements. (The standard library's array module provides typed arrays, but it must be imported and is not a built-in type.) NumPy provides an array that is optimized for space and speed. All the remaining choices are built-in types in Python.

Question 32 of 57
32. Question
1 point. Category: python concepts
What does the map function in python do?
Correct
The map function applies the function given as its first argument to every element of the iterable (for example, a list) given as its second argument. In Python 3 it returns a lazy iterator rather than a list. To learn more about map, see https://www.geeksforgeeks.org/pythonmapfunction/.
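A short illustration of map, assuming Python 3. The function square and the sample lists are made up for the example.

```python
def square(x):
    """Return x squared."""
    return x * x

numbers = [1, 2, 3, 4]

# map returns a lazy iterator in Python 3; list() forces evaluation.
squares = list(map(square, numbers))

# With several iterables, the function receives one element from each.
sums = list(map(lambda a, b: a + b, [1, 2, 3], [10, 20, 30]))
```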

Question 33 of 57
33. Question
1 point. Category: Data Wrangling & Cleanup
Which of the following are valid ways of handling missing data when some feature values are missing in some data instances?
Correct
There is no perfect way to deal with missing data; in practice one uses heuristics such as those mentioned above.
 The most common way of dealing with missing data is to remove all rows with missing values, provided there are not too many such rows.
 If more than 50-60% of the rows are missing a value for a specific column, it is common to remove the column. The main problem with removing data, though, is that it can introduce substantial bias.
 Imputation is also a common technique, in which the missing value is substituted with a best guess.
 Imputation with mean: missing data is replaced by the mean of the column
 Imputation with median: missing data is replaced by the median of the column
 Imputation with mode: missing data is replaced by the mode of the column
 Imputation with linear regression: with real-valued data, this is a common technique. The missing value is replaced by the prediction of a linear regression on the other feature values.
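The three simple imputation strategies above can be sketched with the standard library. The function name impute, the use of None to mark a missing value and the sample column are assumptions for illustration.

```python
import statistics

def impute(column, strategy="mean"):
    """Replace None entries with the mean, median or mode of the observed values."""
    observed = [v for v in column if v is not None]
    fill_functions = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }
    fill = fill_functions[strategy](observed)
    return [fill if v is None else v for v in column]

ages = [20, None, 30, 30, None, 100]   # hypothetical column with two missing values

mean_imputed = impute(ages, "mean")
median_imputed = impute(ages, "median")
mode_imputed = impute(ages, "mode")
```

Note how the outlier (100) pulls the mean well above the median and mode, which is one reason the median is often preferred for skewed columns.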

Question 34 of 57
34. Question
1 point. Category: Big Data
Why would one use Scala vs python on spark?
Correct
Python is more prevalent as the de facto language for data science, and it has better ML libraries and visualization tools.
However, Scala is often more efficient to use on Spark. Scala compiles to bytecode that runs on the Java Virtual Machine (JVM), the same runtime Spark itself runs on, which gives it a speed advantage over Python in most cases. Python is dynamically typed and interpreted, which adds run-time overhead; compiled, statically typed languages are generally faster.

Question 35 of 57
35. Question
2 points. Category: python concepts
Which of the following about python is true? (Select all that apply)
Correct
 Python is dynamically typed (unlike Java, C and C++). The type of a variable is not known until the code runs; names are bound to objects on the go.
 Python is an interpreted language: it is not compiled directly to machine code. The source is compiled to bytecode, which an interpreter then executes statement by statement. In compiled languages like C++, on the other hand, the compiler processes the entire program and reports compile-time errors before creating a binary.
 Python has automatic garbage collection, unlike C++, where badly written programs can leak memory.

Question 36 of 57
36. Question
1 point. Category: ML fundamentals
You observe low error on training set and high error on test set. Your ML model is most likely:
Correct
The model is overfitting the training data: it achieves low error on the training data but is not able to generalize well to the test data.
Note that the other choice, “underfitting”, is wrong, since underfitting the training data would have led to high training error.
A fundamental aspect of Machine Learning is the ability to learn from the available training data and “generalize” to new, unseen test data.

Question 37 of 57
37. Question
1 point. Category: ML fundamentals
What is the curse of dimensionality?
Correct
The curse of dimensionality is a core concept in Machine Learning. As the number of dimensions (i.e. the number of features) grows, we need exponentially more data to fit the model.
 As the number of dimensions grows, much more computational effort is needed to search the higher-dimensional space through more parameter combinations.
 If we increase the dimensionality of the function we are trying to fit, the data points in the training data become sparser in the higher-dimensional space than in the lower-dimensional one (the average distance between points increases), so more data is needed.
A common technique to combat this is dimensionality reduction, which removes redundant dimensions.
For more information, see this article: https://medium.freecodecamp.org/thecurseofdimensionalityhowwecansavebigdatafromitselfd9fa0f872335
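The claim that points become sparser as the dimension grows can be checked empirically. The sketch below, which assumes points drawn uniformly from the unit hypercube and uses a hypothetical helper mean_pairwise_distance, compares the average distance between the same number of random points in 2 and in 100 dimensions.

```python
import math
import random

def mean_pairwise_distance(n_points, dim, seed=0):
    """Average Euclidean distance between n_points uniform random points in [0, 1]^dim."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    distances = [
        math.dist(p, q)
        for i, p in enumerate(points)
        for q in points[i + 1:]
    ]
    return sum(distances) / len(distances)

low_dim = mean_pairwise_distance(50, 2)      # 50 points in 2 dimensions
high_dim = mean_pairwise_distance(50, 100)   # 50 points in 100 dimensions
```

With the same number of points, the average distance is much larger in the high-dimensional space, so each point has fewer close neighbours to learn from.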

Question 38 of 57
38. Question
1 point. Category: ML fundamentals
What is the difference between test data set and validation data set?
Correct
Train data is used for training the model.
Validation data is used for hyperparameter tuning. Usually the validation data is carved out of the training data. Often k-fold cross-validation is used, where each of the k parts of the training data is used in turn for validation.
Test data is usually held-out data with gold-standard labels. Once the hyperparameters have been chosen on the validation data, a final metric is computed on the test data, and on that basis one decides whether the model is acceptable.
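As a rough sketch of how k-fold cross-validation partitions the training data, the generator below yields index lists for each fold. The name k_fold_indices is hypothetical, and shuffling is omitted for brevity.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(10, 3))
```

Each sample lands in the validation part of exactly one fold and in the training part of the other k - 1 folds; the test set is kept out of this loop entirely.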

Question 39 of 57
39. Question
1 point. Category: Machine learning models
Which of the following metrics would you measure w.r.t. a logistic regression classifier? (Select all that apply)
Correct
Of the choices, F1 score is the only classification metric. Note that logistic regression is a classification algorithm. All the other metrics (mean absolute error, root mean square error, goodness of fit) are regression metrics.
For more information on F1 score and classification metrics, refer to:
https://mlitest224c419.ingressbaronn.easywp.com/topics/machinelearning/youhavecomeupwithaspamclassifierhowdoyoumeasureaccuracy/
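For reference, F1 is the harmonic mean of precision and recall. A minimal sketch for binary labels (1 = positive, 0 = negative) follows; the function name and the sample labels are assumptions for illustration.

```python
def f1_score(y_true, y_pred):
    """F1 = 2 * precision * recall / (precision + recall) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]   # two true positives, one false positive, one false negative
score = f1_score(y_true, y_pred)
```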

Question 40 of 57
40. Question
1 point. Category: Machine learning models
What is the time complexity of the K-means clustering algorithm (let N be the number of data points, K the number of clusters, D the number of features or dimensions and I the number of iterations)?
Correct
Each iteration of the K-means algorithm has two steps:
 For each point, compute the closest cluster center. With K clusters and N data points of D dimensions each, this step takes O(N*K*D).
 For each cluster, compute the new cluster mean. Each point contributes to exactly one mean, so this takes O(N*D) overall, which is dominated by the first step.
With I iterations, we repeat this process I times, making the overall complexity O(K*N*D*I).
For more information on the K-means clustering algorithm, please refer to:
https://www.datascience.com/blog/kmeansclustering
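The two steps can be sketched in plain Python, with comments marking where each factor of the complexity comes from. This is an illustrative sketch that assumes the initial centers are supplied by the caller; the function names and toy points are made up.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points: O(D)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centers, iterations):
    """Run a fixed number of K-means iterations from the given initial centers."""
    centers = [tuple(c) for c in centers]
    k, d = len(centers), len(centers[0])
    labels = []
    for _ in range(iterations):                                  # I iterations
        # Assignment step: N points x K centers x D dimensions.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Update step: one pass over the N points, D work per point.
        sums = [[0.0] * d for _ in range(k)]
        counts = [0] * k
        for p, label in zip(points, labels):
            counts[label] += 1
            for i, value in enumerate(p):
                sums[label][i] += value
        # Keep the old center if a cluster ends up empty.
        centers = [
            tuple(s / counts[j] for s in sums[j]) if counts[j] else centers[j]
            for j in range(k)
        ]
    return centers, labels

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, labels = kmeans(points, [(0, 0), (10, 10)], iterations=5)
```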
Question 41 of 57
41. Question
1 point. Category: Deep Learning
Why are dropouts used in a deep neural network?
Correct
Dropout drops some of the hidden units at random during training, so that the network does not become too reliant on any single neuron and overfit.
Note that dropout is often considered a form of regularization that keeps the network from memorizing the targets during learning.
Question 42 of 57
42. Question
1 point. Category: Machine learning models
When would you use SVM over KNN for classification?
Correct
Both SVM and KNN can fit nonlinear decision boundaries. SVM can produce arbitrarily complex decision boundaries with the kernel trick. KNN can also fit nonlinear decision boundaries, with the value of K controlling how smooth they are.
SVM is relatively fast at prediction time (depending on the kernel used). The prediction runtime is O(number of support vectors * number of features), since prediction involves taking the dot product of the query data point with each support vector.
Prediction with KNN can be slow. The cost of finding the K nearest neighbours of a new data point depends on the data structure used (a k-d tree is a common choice). However, even then the prediction complexity has a log(n) term, where n is the number of data points, so KNN does not work well for huge datasets.
Note that approximate KNN techniques based on Locality Sensitive Hashing have recently also become common.
Question 43 of 57
43. Question
3 points. Category: ML fundamentals
You see that the training error is high, and the test error is high when you trained your deep learning model. Which of the following makes sense?
Correct
Training error is high: you have probably built a bad model. You have underfit at least some portion of your training data (it is possible that you have simultaneously overfit other portions, leading to high variance!).
You can:
 Make the model more complex by increasing the number of parameters (for example, the number of neurons or hidden layers in a deep learning model)
 Use more features to make the model more complex
 Clean up your data: it could be inconsistent and full of outliers that cause a bad fit. Cleaning the data might result in a better model.
 Decrease the regularization strength if you used very high regularization

Question 44 of 57
44. Question
1 point. Category: Machine learning models
What is “naive” about the naive Bayes binary classifier?
Correct
The naive Bayes classifier makes the naive assumption that the feature dimensions are independent of one another given the class.
Naive Bayes classifier: suppose you have M-dimensional data of the form $x = (x_1, x_2, \dots, x_M)$ and you want to predict the class $y$ for this data.
From Bayes' rule we have:
$$P(y \mid x_1, \dots, x_M) = \frac{P(x_1, \dots, x_M \mid y)\, P(y)}{P(x_1, \dots, x_M)}$$
Using the naive Bayes assumption of independence of all features given $y$, we can write:
$$P(y \mid x_1, \dots, x_M) = \frac{P(y) \prod_{i=1}^{M} P(x_i \mid y)}{P(x_1, \dots, x_M)}$$
The values of $P(x_i \mid y)$ for each feature and class, together with the class priors $P(y)$, are parameters of the algorithm that can be learnt. Once these parameters are learnt, we can use the formula above to make a prediction for a new data point.
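To make the prediction step concrete, here is a small sketch for binary features. The probabilities are made-up numbers standing in for learnt parameters, and the function name and the "spam"/"ham" class labels are hypothetical.

```python
def naive_bayes_posterior(x, prior, likelihood):
    """Posterior over classes for a binary feature vector x.

    prior:      {class: P(y)}
    likelihood: {class: [P(x_i = 1 | y) for each feature i]}
    """
    scores = {}
    for y, p_y in prior.items():
        score = p_y
        # Naive assumption: multiply the per-feature likelihoods.
        for x_i, p in zip(x, likelihood[y]):
            score *= p if x_i == 1 else (1 - p)
        scores[y] = score
    total = sum(scores.values())          # the evidence term, used to normalize
    return {y: s / total for y, s in scores.items()}

prior = {"spam": 0.4, "ham": 0.6}                          # assumed values
likelihood = {"spam": [0.8, 0.1], "ham": [0.2, 0.5]}       # assumed values
posterior = naive_bayes_posterior([1, 0], prior, likelihood)
```

For the feature vector [1, 0], the unnormalized scores are 0.4 * 0.8 * 0.9 = 0.288 for "spam" and 0.6 * 0.2 * 0.5 = 0.06 for "ham", so "spam" is predicted.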

Question 45 of 57
45. Question
1 point. Category: python concepts
a = np.arange(3)
b = np.arange(3)[:, np.newaxis]
print(a)
print(b)
[0 1 2]
[[0]
 [1]
 [2]]
c = a + b
Correct
In NumPy, when two arrays have different shapes, the smaller array is “broadcast” across the larger one. In the above example, a is broadcast to
array([[0, 1, 2], [0, 1, 2], [0, 1, 2]])
b is broadcast to
array([[0, 0, 0], [1, 1, 1], [2, 2, 2]])
The resulting array is the following:
array([[0, 1, 2], [1, 2, 3], [2, 3, 4]])
Note the following rules of broadcasting (https://docs.scipy.org/doc/numpy-1.13.0/user/basics.broadcasting.html):
When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing dimensions and works its way forward. Two dimensions are compatible when
 they are equal, or
 one of them is 1
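The intermediate broadcast arrays can be inspected directly with np.broadcast_to. The sketch below assumes NumPy is installed; the names a_full and b_full are introduced only for this example.

```python
import numpy as np

a = np.arange(3)                      # shape (3,)
b = np.arange(3)[:, np.newaxis]       # shape (3, 1)

# Shapes are compared from the trailing dimension: (3,) vs (3, 1) -> (3, 3).
c = a + b

# Explicit views of what each operand looks like after broadcasting.
a_full = np.broadcast_to(a, (3, 3))   # every row is [0, 1, 2]
b_full = np.broadcast_to(b, (3, 3))   # every column is [0, 1, 2]
```

Adding a_full and b_full element by element gives the same result as a + b; broadcasting simply avoids materializing the repeated copies.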

Question 46 of 57
46. Question
1 point. Category: ML Tools
Which of the following deep learning frameworks is hardest to debug with?
Correct
Keras provides many high-level abstractions, making it easy to get going with deep learning models with minimal code. However, debugging can be hard: it is more difficult to locate the actual line of code that breaks, since one cannot easily examine the values of intermediate results in the model.
PyTorch, on the other hand, while more verbose for building models, is much more convenient to debug.
Check out https://deepsense.ai/kerasorpytorch/

Question 47 of 57
47. Question
1 point. Category: Big Data
Which of the following is true of spark vs Hadoop Map Reduce?
Correct
Spark is a framework for general data analytics on a distributed computing cluster such as Hadoop. Note that Spark does not have its own file system and typically leverages HDFS as the underlying storage. It provides in-memory computation, which makes data processing faster than with Hadoop's MapReduce framework.
The key difference between Hadoop MapReduce and Spark is that Spark can do its processing in memory, while Hadoop MapReduce has to read from and write to disk. As a result, processing is faster with Spark. For the same reason, MapReduce is typically more suited to batch processing.

Question 48 of 57
48. Question
1 point. Category: ML Tools
Which of the following is true when you compare a docker to a VM?
Correct
A VM runs its own complete guest OS, including its own kernel and memory management. Multiple VMs share only the hardware resources.
Docker containers are typically more lightweight than VMs. Containers share the host OS kernel and do not have their own memory management layer. As a result, isolation is weaker: a kernel bug can be exploited from an app running in a container, and data in another container can be read or changed if such a bug allows access to its contents.
Further, Docker images are usually smaller than VM images, which makes them easy to build, copy and share, and Docker containers can start in a fraction of a second, while a VM typically takes seconds.

Question 49 of 57
49. Question
1 point. Category: python concepts
What does the "=" operator in python do?
Correct
The "=" operator in Python is assignment: it makes both variables refer to the same object. It does not copy. (Comparison for equality is done with "==".) To copy an object in Python, one can make a shallow copy or a deep copy.
A shallow copy creates a new object and copies into it references to the constituents of the original object. For instance, if you have a list of lists, shallow copying it creates a new top-level list, but its elements still refer to the same inner list objects as the original. list.copy() makes a shallow copy.
A deep copy not only copies the top-level object but also recursively deep copies all constituent objects. For more details see the article below:
https://medium.com/@thawsitt/assignmentvsshallowcopyvsdeepcopyinpythonf70c2f0ebd86
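The three behaviours can be compared side by side using the standard copy module. The variable names below are chosen only for this illustration.

```python
import copy

original = [[1, 2], [3, 4]]

alias = original                   # assignment: a second name for the same list
shallow = copy.copy(original)      # new outer list, shared inner lists
deep = copy.deepcopy(original)     # new outer list and new inner lists

original[0].append(99)             # mutate an inner list
original.append([5])               # mutate the outer list
```

After these mutations, alias sees both changes, shallow sees only the change to the shared inner list, and deep sees neither.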

Question 50 of 57
50. Question
1 point. Category: python concepts
Which of the following is true w.r.t. Numpy arrays and Pandas dataframes?
Correct
Pandas is built on top of NumPy and makes it easy to work with tabular data: column names, SQL-like queries, and merging and joining dataframes.
NumPy provides highly optimized multidimensional numeric arrays for high-performance computing. For performance-optimized ML code, NumPy is often preferred over Pandas.
Lists are a native data type in Python. They are dynamic in size and need not contain homogeneous elements. NumPy arrays are homogeneously typed and are not built on top of lists.

Question 51 of 57
51. Question
1 point. Category: Exploratory Data Analysis
Why do you need to do correlation analysis on data? (Select all that apply)
Correct
Analyzing which variables correlate most with the target variable we are trying to predict is a standard technique in supervised ML. The Pearson correlation coefficient is the most commonly used measure of correlation; it ranges between -1 and 1.
 A highly positive or highly negative value indicates a strong relationship
  Features that are highly correlated with the target are important.
  If two features are very highly correlated with each other, one of them might be redundant. Example: age and salary might be highly correlated. Age+2 and salary are then just as correlated!
 A value close to 0 between a feature and the target indicates very little relationship
  Such features can be removed to reduce the dimension of the data
  Derived features that correlate better with the target can be computed to replace features that show very little correlation. Example: you are predicting whether a person is claiming unemployment allowance. The salary of the person might be correlated with the unemployment allowance received, but a derived feature (salary>0) that indicates employment is a better predictor.
 Note: the Pearson correlation coefficient captures only linear correlation.
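A small sketch of the Pearson coefficient, written out from its definition, illustrates both the -1 to 1 range and the linear-only caveat. The function name and the toy data are assumptions for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

xs = [1, 2, 3, 4, 5]
r_positive = pearson(xs, [2 * x + 1 for x in xs])    # perfectly linear, increasing
r_negative = pearson(xs, [-3 * x for x in xs])       # perfectly linear, decreasing

# y = x^2 on a symmetric range: y is fully determined by x, yet the
# linear correlation is zero.
symmetric = [-2, -1, 0, 1, 2]
r_quadratic = pearson(symmetric, [x * x for x in symmetric])
```

The last case shows why a near-zero coefficient does not by itself prove that a feature is useless: the relationship may simply be nonlinear.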

Question 52 of 57
52. Question
1 point. Category: Big Data
Is Hive useful for OLTP systems (Online Transaction Processing)?
Correct
Online transaction processing (OLTP) systems facilitate and manage transaction-oriented applications, which need to retrieve information and update the database atomically for each transaction.
Hive is not suited to OLTP systems: it is designed for batch-oriented analytical queries, and it has traditionally not supported insert and update at the row level (newer versions support them only on transactional tables, and with high latency).
For more information on Hive, take a look at https://cwiki.apache.org/confluence/display/Hive/Home

Question 53 of 57
53. Question
1 pointsCategory: Data Wrangling & CleanupSuppose while scaling the training data during data preprocessing, you want your feature to be bounded within some fixed range, which of these scaling techniques you’d use ?
Correct
MinMax scaling produces values bounded between 0 and 1. It is performed by subtracting the minimum value from each value and dividing the result by the maximum value minus the minimum value, i.e.
processed_value = (original_value - minimum_value) / (maximum_value - minimum_value).
Standardization, which subtracts the mean and divides by the standard deviation (not the variance), produces values with zero mean and unit variance, but they are not bounded.

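
As a minimal sketch of the two techniques, assuming a single numeric feature with at least two distinct values (the function names and sample values are illustrative, not from the quiz):

```python
import statistics

def min_max_scale(values):
    """Rescale values into [0, 1]; assumes max(values) != min(values)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Hypothetical feature values with one large outlier.
values = [10.0, 20.0, 30.0, 40.0, 1000.0]

scaled = min_max_scale(values)  # every value lies in [0, 1]
z = standardize(values)         # zero mean, but values fall outside [0, 1]

print(min(scaled), max(scaled))
print(min(z), max(z))
```

Note that the outlier squeezes the other MinMax-scaled values close to 0, which is a known weakness of this technique.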
Question 54 of 57
54. Question
1 point
Category: Data Wrangling & Cleanup
Suppose you are doing regression analysis and one of the features contains categorical text data. How do you convert it to numeric data for analysis?
Correct
Categorical features are usually present in textual format. For example, the three categories in a size column could be LOW, MEDIUM, HIGH. If these categories are converted directly to 0, 1, 2 within the same column, the machine learning model will assume the ordering 0 < 1 < 2. This example happens to have a natural ordering, but in general categories may have none. To avoid this problem, one-hot encoding is used: one column is added for each category, and the original column with all categories is removed. Each new column takes the value 0 or 1 for each row, depending on whether that row has that category.
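
The encoding described above can be sketched in a few lines. The `one_hot_encode` helper is hypothetical and written only for this illustration; in practice a library routine such as pandas' `get_dummies` or scikit-learn's `OneHotEncoder` would typically be used.

```python
def one_hot_encode(column):
    """Return (categories, rows); each row holds one 0/1 indicator per category."""
    categories = sorted(set(column))
    rows = [[1 if value == cat else 0 for cat in categories] for value in column]
    return categories, rows

# The size column from the explanation above.
sizes = ["LOW", "HIGH", "MEDIUM", "LOW"]
categories, encoded = one_hot_encode(sizes)

print(categories)  # ['HIGH', 'LOW', 'MEDIUM']
print(encoded)     # [[0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
```

Each row contains exactly one 1, so no artificial ordering between categories is introduced.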

Question 55 of 57
55. Question
1 point
Category: Machine learning models
Which of the following models scales poorly with the number of examples used in training?
Correct
The complexity of the SVM training algorithm can be quadratic in the size of the training set. This is especially true for kernel SVMs, because the kernel matrix has one entry for every pair of training examples. Note that most of the listed models will take more time as the number of examples increases; the question asks which of them scales poorly relative to the others.
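
To make the quadratic growth concrete, here is a rough back-of-the-envelope sketch. It assumes a dense kernel matrix stored as 8-byte floats, which is an assumption made for illustration; real solvers generally avoid materializing the full matrix, so treat the numbers as an upper bound rather than a measurement.

```python
def kernel_matrix_entries(n_examples):
    """Number of pairwise kernel values for n training examples."""
    return n_examples * n_examples

def kernel_matrix_gib(n_examples, bytes_per_entry=8):
    """Memory of a dense kernel matrix in GiB (8-byte floats assumed)."""
    return kernel_matrix_entries(n_examples) * bytes_per_entry / 2**30

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} examples -> {kernel_matrix_entries(n):>15,} entries, "
          f"{kernel_matrix_gib(n):8.2f} GiB")
```

Multiplying the number of examples by 10 multiplies the kernel matrix size by 100.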

Question 56 of 57
56. Question
2 points
Category: ML fundamentals
Arrange the following in chronological order:

Focus on Problem statement

Select whether to apply Supervised or Unsupervised learning

Train-Test Split

Data Preprocessing

Model Selection
Correct
For any machine learning task, we first need to spend some time understanding the problem, sometimes with the help of domain experts. Once the problem is understood, we check whether it is a supervised or unsupervised learning problem. This is determined by the desired goal and by whether labels are available. Once we know whether it is, for example, a clustering or a classification problem, we do the train-test split so that performance can be measured on data the model has not seen. In most cases this is a train-validation-test split, but for simplicity we use a train-test split here. Notice that the train-test split happens before data preprocessing, because we do not want the test data to influence the preprocessing. Whatever preprocessing is learned on the training data (sklearn's fit function) is applied unchanged to the test data (sklearn's transform function). After the train-test split, we select the right features and then preprocess the data by scaling, normalizing, replacing missing values in the features, and any other required processing. Lastly, we select the right model. Note that model selection also includes deciding which loss function to use.
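
The "split first, then fit preprocessing on the training data only" order can be sketched as follows. The `MeanStdScaler` class is a hypothetical stand-in that only mimics the fit/transform pattern mentioned above; it is not sklearn's implementation, and the data is made up.

```python
import random
import statistics

class MeanStdScaler:
    """Hypothetical minimal scaler mimicking the fit/transform pattern."""

    def fit(self, values):
        # Learn the statistics from the data passed in (training data only).
        self.mean_ = statistics.fmean(values)
        self.std_ = statistics.pstdev(values)
        return self

    def transform(self, values):
        # Apply the already-learned statistics; nothing is re-estimated here.
        return [(v - self.mean_) / self.std_ for v in values]

# Hypothetical single-feature dataset.
data = [float(v) for v in range(1, 21)]

# 1. Split first, so the test rows stay unseen during preprocessing.
rng = random.Random(0)
shuffled = data[:]
rng.shuffle(shuffled)
split = len(shuffled) * 4 // 5  # 80% train, 20% test
train, test = shuffled[:split], shuffled[split:]

# 2. Fit the preprocessing on the training portion only.
scaler = MeanStdScaler().fit(train)

# 3. Transform both portions with the training statistics.
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)
```

Fitting the scaler on all of `data` before splitting would let the test rows influence the mean and standard deviation, which is the leakage the explanation warns about.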


Question 57 of 57
57. Question
1 point
Category: Data Wrangling & Cleanup
Suppose you are doing regression analysis on a dataset with D features. One feature is categorical and has K categories in text form. What is the total number of features once all categories have been converted using one-hot encoding?
Correct
Categorical features are usually present in textual format. For example, the three categories in a size column could be LOW, MEDIUM, HIGH. If these categories are converted directly to 0, 1, 2 within the same column, the machine learning model will assume the ordering 0 < 1 < 2. This example happens to have a natural ordering, but in general categories may have none. To avoid this problem, one-hot encoding is used. In one-hot encoding, one column is added for each category, giving K extra columns for K categories, and the original column with all categories is removed, leading to D + K - 1 features. Each new column takes the value 0 or 1 for each row, depending on whether that row has that category.
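
The D + K - 1 count can be verified on a tiny made-up dataset. The column names, values, and the `one_hot_rows` helper below are assumptions chosen for illustration, with D = 3 features and K = 3 categories.

```python
def one_hot_rows(rows, column):
    """Replace one categorical column with K indicator columns in each row."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Keep the other D - 1 features, dropping the original column.
        new_row = {key: value for key, value in row.items() if key != column}
        # Add one 0/1 column per category (K columns in total).
        for cat in categories:
            new_row[f"{column}_{cat}"] = 1 if row[column] == cat else 0
        encoded.append(new_row)
    return encoded

# Hypothetical dataset: D = 3 features, one of them categorical with K = 3.
rows = [
    {"age": 25, "salary": 30, "size": "LOW"},
    {"age": 41, "salary": 55, "size": "MEDIUM"},
    {"age": 58, "salary": 75, "size": "HIGH"},
]

d = len(rows[0])                        # 3
k = len({row["size"] for row in rows})  # 3
encoded = one_hot_rows(rows, "size")

print(len(encoded[0]), d + k - 1)  # 5 5
```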