How do I clean the data, and what kind of data do I remove?
There are two rounds of data transformation within Mimic Analytics. In each one, you are asked to clean the datasets: remove outliers and bad data, and replace missing data. Be sure to review the data transformation requirements for each dataset.
Once you have downloaded the Excel file for the dataset and opened it, you may start cleaning the data.
The best way to find the data that needs to be cleaned is to use Excel's built-in features, such as sorting and filtering a column.
Here is where you can find helpful videos on how to use Excel’s different functionalities:
Here is how you can clear cells:
There are four types of data that need to be removed:
- Impossible Values - These are values that should be removed because they aren't possible. Examples include a date range that ends before its start date, or a negative number of purchased items.
- Erroneous Formats - This is data that does not follow the format of its column, such as a date in the gender column or a money amount in the state column.
- Extreme Outliers - Data that falls far outside the typical range for its column is considered an extreme outlier. An example would be an amount purchased of 1200 when most users purchased 1-10 items.
- Complex Problems - In the second round of Data Transformation, some datasets have complex problems that need to be solved. These take a bit more digging through the data to find. An example would be a row where the state is Montana but the country is Canada.
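If you prefer to see the four categories as explicit checks, the following Python sketch may help. It is only an illustration and not part of Mimic Analytics: the column names, the allowed gender codes, the short list of US states, and the "more than 10 times the median" outlier rule are all assumptions made for this example.

```python
from datetime import date
from statistics import median

# Hypothetical sample rows; the column names are assumptions for this sketch.
rows = [
    {"start": date(2023, 1, 1), "end": date(2023, 1, 9), "items": 3, "gender": "F", "state": "Ohio", "country": "USA"},
    {"start": date(2023, 3, 10), "end": date(2023, 3, 1), "items": -2, "gender": "M", "state": "Texas", "country": "USA"},
    {"start": date(2023, 2, 1), "end": date(2023, 2, 5), "items": 1200, "gender": "F", "state": "Utah", "country": "USA"},
    {"start": date(2023, 4, 1), "end": date(2023, 4, 3), "items": 5, "gender": "2023-04-01", "state": "Iowa", "country": "USA"},
    {"start": date(2023, 5, 1), "end": date(2023, 5, 2), "items": 4, "gender": "M", "state": "Montana", "country": "Canada"},
]

ALLOWED_GENDERS = {"F", "M", "Other"}  # assumed set of valid codes
US_STATES = {"Ohio", "Texas", "Utah", "Iowa", "Montana"}  # shortened list for the example


def flag_row(row, typical_items):
    """Return the problem categories that apply to one row."""
    problems = []
    # Impossible values: the range ends before it starts, or a negative count.
    if row["end"] < row["start"] or row["items"] < 0:
        problems.append("impossible value")
    # Erroneous format: the gender column holds something other than a known code.
    if row["gender"] not in ALLOWED_GENDERS:
        problems.append("erroneous format")
    # Extreme outlier: far above the typical amount (assumed rule: > 10x the median).
    if row["items"] > 10 * typical_items:
        problems.append("extreme outlier")
    # Complex problem: a US state paired with a different country.
    if row["state"] in US_STATES and row["country"] != "USA":
        problems.append("complex problem")
    return problems


typical = median(r["items"] for r in rows)
flagged = {}
for index, row in enumerate(rows):
    problems = flag_row(row, typical)
    if problems:
        flagged[index] = problems

for index, problems in flagged.items():
    print(index, problems)
```

Each flagged row corresponds to one of the examples in the list above. In Excel you would look for the same patterns by sorting or filtering the relevant column.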
For more information on how to interpret your results, see this article: