1. Remove duplicate or irrelevant observations:
Duplicates are observations that repeat the same record, often introduced when data is combined from multiple sources; keep one copy and remove the rest, since repeated rows add no information and can bias counts and averages. Irrelevant observations are rows that do not pertain to the question being asked (for example, records from a region or time period outside the scope of the analysis) and should be filtered out before analysis.
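A minimal sketch with pandas, assuming a small DataFrame with hypothetical columns (customer_id, region, spend) and an assumed relevance filter; the actual columns and filter depend on your data set:

```python
import pandas as pd

# Hypothetical example data; in practice this comes from your own source.
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "region": ["EU", "EU", "US", "APAC"],
    "spend": [100.0, 100.0, 250.0, 80.0],
})

# Drop exact duplicate rows, keeping the first occurrence.
df = df.drop_duplicates()

# Drop observations outside the scope of the question being asked,
# e.g. an analysis restricted to EU and US customers (assumed filter).
df = df[df["region"].isin(["EU", "US"])]
```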
2. Fix structural errors in data:
Structural errors arise from typing mistakes, inconsistent capitalization, stray whitespace, or mislabeled categories (for example, "N/A" and "Not Applicable" recorded as two different classes). These errors should be identified and corrected before analysis; many can be normalized programmatically, but some require manual inspection to resolve.
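A sketch of programmatic normalization, assuming a hypothetical country column with case, whitespace, and spelling inconsistencies:

```python
import pandas as pd

# Hypothetical column with inconsistent capitalization, whitespace, and typos.
df = pd.DataFrame({"country": [" usa", "USA", "U.S.A.", "Canada ", "canda"]})

# Normalize whitespace and case so identical categories compare equal.
df["country"] = df["country"].str.strip().str.upper()

# Map known misspellings and variant spellings to a canonical label;
# the mapping itself usually has to be built by manual inspection.
corrections = {"U.S.A.": "USA", "CANDA": "CANADA"}
df["country"] = df["country"].replace(corrections)
```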
3. Identify and correct inaccurate or missing values:
Inaccurate or incomplete data should be identified and corrected before analysis. This includes checking for values outside expected ranges, invalid formats, and incorrect units of measure. It is also important to flag missing data (e.g., empty cells), since unhandled gaps can distort results.
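A sketch of simple validity checks, assuming hypothetical columns age and signup_date and an assumed valid age range of 0-120:

```python
import pandas as pd

# Hypothetical data with an impossible age, a mis-formatted date, and gaps.
df = pd.DataFrame({
    "age": [34, -2, 150, None],
    "signup_date": ["2023-01-15", "15/01/2023", None, "2023-06-30"],
})

# Values outside the expected range (0-120 is an assumed valid range).
bad_age = df["age"].notna() & ~df["age"].between(0, 120)

# Dates that do not parse in the expected ISO format.
parsed = pd.to_datetime(df["signup_date"], format="%Y-%m-%d", errors="coerce")
bad_date = df["signup_date"].notna() & parsed.isna()

# Rows to review, plus a per-column count of missing values.
print(df[bad_age | bad_date])
print(df.isna().sum())
```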
4. Check for outliers, incorrect formats, and inconsistencies in units of measurement:
Outliers can skew summary statistics and model fits, so check for them before starting any analysis work. Incorrect formats (e.g., dates not in the specified format) and inconsistent units of measurement (e.g., some values recorded in metres and others in feet) should also be flagged so they can be addressed properly.
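One common way to flag outliers is the interquartile-range rule; a sketch assuming a hypothetical order_value column:

```python
import pandas as pd

# Hypothetical order values with one suspiciously large entry.
df = pd.DataFrame({"order_value": [20.0, 22.5, 19.0, 21.0, 18.5, 500.0]})

# Interquartile-range rule: flag points far outside the bulk of the data.
q1, q3 = df["order_value"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["order_value"] < q1 - 1.5 * iqr) | (df["order_value"] > q3 + 1.5 * iqr)

print(df[is_outlier])  # rows to review before analysis starts
```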
5. Ensure all data is consistent with established standards and conventions:
Data should adhere to an accepted set of standards and conventions, such as ISO 8601 date formats or consistent naming conventions for columns. This makes different sources of data directly comparable and helps teams spot potential errors quickly.
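A sketch of enforcing two such conventions, snake_case column names and ISO 8601 dates, assuming hypothetical columns and an assumed day-first source date format:

```python
import pandas as pd

df = pd.DataFrame({"Order Date": ["03/07/2023"], "Total ($)": [99.5]})

# Rename columns to a consistent snake_case convention.
df.columns = (
    df.columns.str.strip()
    .str.lower()
    .str.replace(r"[^a-z0-9]+", "_", regex=True)
    .str.strip("_")
)

# Store dates in ISO 8601 (YYYY-MM-DD) regardless of the source format
# (the day-first source format here is an assumption).
df["order_date"] = pd.to_datetime(df["order_date"], dayfirst=True).dt.strftime("%Y-%m-%d")
```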
6. Evaluate the quality of the source data and assess whether it is suitable for intended use:
It’s important to evaluate the quality of source data before using it for analysis work, as poor-quality data could lead to unreliable results. Source data should be assessed against criteria such as accuracy, completeness, timeliness and relevance.
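A quick profiling pass can make this assessment concrete; a sketch assuming hypothetical source columns, where a timestamp column stands in for a timeliness check:

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize completeness and cardinality per column as a quick quality check."""
    return pd.DataFrame({
        "non_null_pct": df.notna().mean() * 100,
        "unique_values": df.nunique(),
        "dtype": df.dtypes.astype(str),
    })

# Hypothetical source data; in practice this would be loaded from the source system.
source = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "email": ["a@x.com", None, "c@x.com", None],
    "updated_at": pd.to_datetime(["2024-01-02", "2024-01-03", None, "2024-01-05"]),
})

print(profile(source))
print("most recent update:", source["updated_at"].max())  # rough timeliness check
```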
7. Document the various algorithms and methods used to cleanse the data:
Keep a record of all the algorithms and methods used to cleanse the data so that changes or adjustments to these processes can be tracked over time. Such a record also helps teams trace errors back to the cleaning step that introduced them and makes the process reproducible.
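One lightweight way to keep such a record is a machine-readable log written alongside the cleaning run; a sketch where the step names and parameters are purely illustrative:

```python
import json
from datetime import datetime, timezone

cleaning_log = []

def log_step(name, **params):
    """Record one cleaning step and its parameters so the process can be reproduced."""
    cleaning_log.append({
        "step": name,
        "params": params,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })

# Example entries for steps like those described above (names are illustrative).
log_step("drop_duplicates", subset=None)
log_step("filter_outliers", method="IQR", factor=1.5)
log_step("impute_missing", column="age", strategy="median")

with open("cleaning_log.json", "w") as f:
    json.dump(cleaning_log, f, indent=2)
```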
8. Filter unwanted outliers:
Outliers that were flagged earlier and judged to be erroneous or unrepresentative should be removed (or capped) so they do not distort the analysis. Removal is not automatic: a genuine but extreme value may be exactly what the analysis needs to capture, so each outlier should be reviewed before it is dropped.
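A sketch of a z-score filter as one way to drop extreme points, assuming a hypothetical response-time column and a conventional cutoff of 3 standard deviations:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 200 typical response times plus two extreme spikes.
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(130, 10, 200), [4200.0, 3900.0]])
df = pd.DataFrame({"response_time_ms": values})

# Z-score filter: keep rows within 3 standard deviations of the mean.
col = df["response_time_ms"]
z = (col - col.mean()) / col.std()
df_filtered = df[z.abs() <= 3]

print(len(df), "->", len(df_filtered))  # the two spikes are dropped
```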
9. Handle missing data:
Any missing data should be addressed before the data set is analyzed. Common options are imputing missing values with an estimate such as the mean or median, or removing rows that are too incomplete to be useful; the right choice depends on how much data is missing and why.
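A sketch of both basic strategies, imputation and row removal, assuming hypothetical age and income columns:

```python
import pandas as pd

# Hypothetical data with gaps in both columns.
df = pd.DataFrame({
    "age": [34, None, 29, 41],
    "income": [52000, 61000, None, 58000],
})

# Option 1: impute numeric gaps with a simple estimate such as the median.
df["age"] = df["age"].fillna(df["age"].median())

# Option 2: drop rows that are still incomplete in columns that matter.
df = df.dropna(subset=["income"])
```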
10. Validate and QA:
Validate and quality-assure all the cleansed data prior to analysis. This catches mistakes introduced during the cleaning process itself and provides assurance that the cleaned data can be relied on. Techniques such as unit tests on the cleaning code, cross-checks against the source data, and automated validation rules can all be employed to further improve the accuracy of results.
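A sketch of assertion-style validation that fails loudly if the cleaning left problems behind; the columns and rules here are assumptions mirroring the checks used in earlier steps:

```python
import pandas as pd

def validate(df: pd.DataFrame) -> None:
    """Assert basic invariants on the cleansed data before it is analyzed."""
    assert not df.duplicated().any(), "duplicate rows remain"
    assert df["age"].notna().all(), "missing ages remain"
    assert df["age"].between(0, 120).all(), "age outside expected range"
    assert (df["signup_date"] <= pd.Timestamp.today()).all(), "signup date in the future"

# Hypothetical cleansed data; validate() raises AssertionError if any check fails.
cleaned = pd.DataFrame({
    "age": [34.0, 29.0],
    "signup_date": pd.to_datetime(["2023-01-15", "2023-06-30"]),
})
validate(cleaned)
```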