Missing data arises either from problems in data collection or because the data model itself allows a field to be empty (for instance, the field ‘maximum credit limit on any of your cards’ does not make sense for someone who has no credit cards). Many ML algorithm implementations fail with an error when they encounter unexpected values or blanks in the data set. Hence missing data must be dealt with before applying an ML algorithm.
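Before choosing how to handle missing data, it helps to know how much of it each column contains. The following is a minimal sketch in plain Python, assuming missing values are represented as `None`; the records, the column names, and the helper `missing_fraction` are hypothetical and only serve as illustration.

```python
# Hypothetical toy records; None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "max_credit_limit": 5000},
    {"age": 41, "income": None, "max_credit_limit": None},
    {"age": None, "income": 61000, "max_credit_limit": 8000},
    {"age": 29, "income": 48000, "max_credit_limit": None},
]


def missing_fraction(rows):
    """Return, for each column, the fraction of rows whose value is None."""
    columns = rows[0].keys()
    return {
        col: sum(row[col] is None for row in rows) / len(rows)
        for col in columns
    }


fractions = missing_fraction(rows)
for col, frac in fractions.items():
    print(f"{col}: {frac:.0%} missing")
```

A per-column summary like this gives a rough basis for deciding whether a column is worth keeping or repairing.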
There is no fixed rule for dealing with missing data, but any of the heuristics below can be used.
- The most common way of dealing with missing data is to remove all rows that contain missing values, provided there are not too many such rows.
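- If more than 50-60% of the rows are missing a value for a specific column, it is common to remove that column. The main problem with removing rows or columns in this way is that it could introduce substantial bias, since the data that remains may no longer be representative of the full data set.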
- Imputation is also a common technique for dealing with missing data: each missing value is substituted with a best guess.
- Imputation with mean: Missing data is replaced by the mean of the column. This is a commonly used technique. However, it might not be appropriate if the data is not unimodal (for example, when filling in missing weights, the mean weight for males might differ from that for females, so the overall distribution might not be unimodal).
- Imputation with median: Missing data is replaced by the median of the column. The median is a better choice than the mean when there are outliers, but once again, if the data is multimodal with multiple clusters, the median might not work.
- Imputation with mode: Missing data is replaced with the mode of the column. This leads to problems similar to those of the above two methods.
- Imputation with linear regression: With real-valued data, this is another common technique. The missing value is replaced by a prediction from a linear regression on the other feature values. Because the guess depends on the rest of the row, this can overcome some of the problems with the simpler forms of imputation above.
For more information, see: https://en.wikipedia.org/wiki/Missing_data