Some of the problems a data analyst encounters are:
- Biased data: Data can be biased by the source from which it is collected. For instance, if you collect data to predict the winner of an electoral campaign, sampling from a single region introduces one form of bias, while sampling from a single income group introduces another.
- Duplicates in the data: Data can contain duplicate records, which may skew the results of the analysis.
- Missing data: Not every data point may have values for all of the attributes you are analyzing.
- Noisy data: The data can be noisy. Unusually high variance is often a sign of noise, although it can also reflect genuine variability in what is being measured.
- Outliers in the data: Points that lie outside the expected range of the data can introduce inconsistencies in the model.
- Differences in format across data sources: Some data may be crawled and collected as HTML, other data may come from online reviews as plain text, and a third source may be structured data already in a database. A data analyst usually has to ingest several data sources to get richer data.
- Data volume: A large amount of data requires a different class of algorithms to process it efficiently.
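
Three of the problems listed above (duplicates, missing values, and outliers) can be checked with a few lines of code. The following is a minimal sketch using only the Python standard library. The `records` data, the function names, and the choice of the 1.5 × IQR rule for flagging outliers are all assumptions made for illustration; other rules and thresholds are equally reasonable depending on the data.

```python
from statistics import quantiles

# Hypothetical sample: (respondent_id, age). None marks a missing value.
records = [
    (1, 34), (2, 29), (2, 29), (3, None), (4, 31), (5, 30), (6, 120),
]

def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

def count_missing(rows):
    """Count rows whose value field is missing (None)."""
    return sum(1 for _, value in rows if value is None)

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR].

    k=1.5 is a common convention, not a universal rule.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

unique = drop_duplicates(records)
missing = count_missing(unique)
ages = [age for _, age in unique if age is not None]
outliers = iqr_outliers(ages)

print("rows after de-duplication:", len(unique))
print("rows with a missing age:", missing)
print("ages flagged as outliers:", outliers)
```

Whether a flagged value is an error or a genuine extreme observation is a judgment call that depends on the domain, so a check like this is best treated as a prompt for inspection rather than an automatic filter.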