Clean Data

#datafriday

I was going to have an awesome post for you today, showing you how to import and slice-and-dice data. But I couldn’t make it work. The data I was working with wasn’t in the right format, my database wouldn’t accept it, and my analysis died before it could even get started.

My experience highlights the reality that when working with data, the hardest part is not drawing insights from it. The most challenging leg of the journey is getting the right data into the right place so that you can start asking the right questions of it. Once you can ask the right questions, the insights are relatively easy to find.

It is reported that data scientists spend 80% of their time cleaning their data. The rest of the time is spent analyzing, reporting, and looking for additional data that can add value to what you have already learned.

Data Cleaning

Once you have your data file and you are ready to do something with it, here are a few things to consider to make sure that your dataset is complete.

  1. File type and encoding. If your file is UTF-16 encoded but your database is expecting ISO-8859-1, you may find yourself with odd-looking data. Or you may not be able to move the data into the new location at all.
  2. Missing data pieces. Is there a piece of data for every field in every row? Does each row have the same number of pieces of data? Inconsistent columns aren’t necessarily a problem, but you will need to make sure that the right data is aligned with the correct label (column heading).
  3. Inconsistent data. Is a piece of data sometimes an integer (no decimals) and sometimes a number with up to three decimal places? What does your database do with the numbers that don’t have any decimal places? When you sum or average those numbers, will it give you an integer, or carry the highest number of decimal places? Knowing how your system handles inconsistent data, and what that inconsistent data represents, is essential before you start your analysis.
  4. The right data type defined for each column. An employee number field might start off looking like a number, but further down you may run into the convention of lettering individual employees (e.g. 10001A, 10001B). If you try to treat the letters like numbers, the data might disappear, or your queries may fail. Be sure that you have the right data type associated with each column so that you get the best results when you pull that data.
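The first item on the list is easy to see for yourself. Here is a minimal Python sketch (the sample string is mine, not from any real dataset) showing how the same bytes, decoded with the wrong encoding, turn into the kind of odd-looking data described above:

```python
# Hypothetical example: text saved as UTF-8 but read back as if it
# were a Windows-1252 (cp1252) file, a close cousin of ISO-8859-1.
text = "Zoë spent €40"
data = text.encode("utf-8")

# Right encoding: the bytes round-trip cleanly back to the original.
assert data.decode("utf-8") == text

# Wrong encoding: no error is raised -- the data "loads" successfully,
# but every non-ASCII character has been silently corrupted.
wrong = data.decode("cp1252")
print(wrong)  # ZoÃ« spent â‚¬40
```

Note that the wrong decode does not fail loudly; silent mojibake like this is exactly why you check encodings before the data lands in the database.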
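Checks 2 through 4 can be run before the data ever reaches the database. Here is a short sketch, using only Python's standard library, against a made-up employee CSV (the column names and sample rows are hypothetical) that deliberately contains a missing value, a short row, and a lettered employee ID:

```python
import csv
import io

# Hypothetical sample with the problems described above:
# an empty field, a row with a missing column, and mixed IDs.
raw = """emp_id,hours,rate
10001,40,21.5
10001A,38,
10002,35.25
"""

rows = list(csv.reader(io.StringIO(raw)))
header, data = rows[0], rows[1:]

issues = []
for lineno, row in enumerate(data, start=2):
    # Check 2: every row should have one value per column heading.
    if len(row) != len(header):
        issues.append(f"line {lineno}: expected {len(header)} fields, got {len(row)}")
    # Check 2 (continued): flag empty fields so they can be filled or excluded.
    for col, value in zip(header, row):
        if value == "":
            issues.append(f"line {lineno}: missing value for '{col}'")

def looks_numeric(value):
    """Checks 3 and 4: would this value survive in a numeric column?"""
    try:
        float(value)
        return True
    except ValueError:
        return False

# Check 4: emp_id mixes plain numbers with lettered IDs (10001A),
# so it must be stored as text, not as an integer column.
id_values = [row[0] for row in data if row]
if not all(looks_numeric(v) for v in id_values):
    issues.append("column 'emp_id' is not purely numeric; store it as text")

for issue in issues:
    print(issue)
```

Running checks like these on the raw file tells you what to fix, and which column types to declare, before a failed load stops your analysis cold.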

There are many other considerations when you have data to analyze. Once you know that you have accurate, consistent, and relevant data, you can start to look for the insights that will move you forward.