Data cleansing (or cleaning), is used to refer to the process of detecting and correcting inaccurate, corrupt or unusable data. It is an essential step before any data analysis project, since every step after it assumes the data is “clean” or, in other words, trustworthy and accurate.
In this 2-part series, we'll be going over the steps required for understanding what makes data "dirty", and how to clean it. Get your rubber gloves ready!
There’s a high chance you’ll always have to do some cleaning. When working with datasets, especially large ones, you will almost inevitably find errors among the records- either because of human error or other issues during data collection and storage. The data can also be messy or unstructured in its raw form — which is often the case with data extracted from websites via web scraping, for example.
It is thus good practice to always take the time to familiarize yourself with the dataset you’re working with, examine it for potential errors and corrupt records, and deal with them before actually trying to make sense of the data.
Most small to medium sized datasets can be properly analyzed and cleansed using Google Sheets or Excel. In this article, we will work with a small a dataset in Google Sheets- think a few hundreds or thousands of records at most.
The dataset we will be using as an example (as well as the final, "cleansed", version) can be found here.
Before we do any cleaning, let's take the time to familiarize ourselves with the dataset. In this case, we will use a sample of records from the OpenFoodFacts database, containing information about a variety of food products and their nutritional values, brand, etc.
The first step to cleaning your data is being able to verbally describe your dataset, or in other words, speak to its characteristics.
To describe your data, you will need to know:
Knowing how many records you’re working with will not only give you an idea of how much data you have available, but it will also let you know how much data was expunged after you are done cleaning.
Specifically, we need to know how many columns (features, or properties) and rows (records) are in a dataset.
You can easily get the # of columns and rows by using the COUNTA formula, which returns the number of cells with values in a given range.
For rows, it should look something like:
=counta(A:A)
You should see a value of 2000. What's happening here is we're asking COUNTA to give us the amount of populated cells on column A.
Now, let's try the same for getting the number of columns:
=counta(A1:EY1)
Protip: You can avoid having to scroll to the right to get the last column by just clicking on row 1. That will select the entire first row for you, with all the columns it spans across.
You should get 155. That means the sample we’re using contains 155 columns and 2000 rows.
Great! Now we know how big our dataset is.
Now that we've taken a general glance at the dataset, we need to start looking at the types of data contained in each column. Most data can be categorized into 4 basic types:
Once you understand the contents of your data, the data types you will be dealing with, and the general shape of your dataset, you are ready to tackle the real cleaning. It's now time to head on over to Part 2 of this article and find out how you can polish raw data into a jewel of insights.