Introduction
Every single company I’ve worked at and talked to has the same problem without a single exception so far - poor data quality, especially tracking data. Either there’s incomplete data, missing tracking data, duplicative tracking data. - DJ Patil
I spend a lot of my time digging into data at various companies. Most of the time I’m surprised by what I see and so are the engineers and analysts that work at these companies. I’ve seen missing data, bad data, data nobody knows anything about, and many other oddities.
As a data scientist or an analyst, the quality of the data you work with is crucial to your success. The old GIGO acronym, which stands for “garbage in, garbage out”, is very true. In this blog post, we’ll discuss methods and practices that will make your first contact with data successful and will save you from a lot of grief down the road.
I’ll be using Python 3.8 and pandas version 1.1 in this post. The code shown here is an IPython session.
Schema
All data has a schema in one way or another. Sometimes it's properly documented, and sometimes it's spread out across a thousand different places all over the code. It's important to find this documentation, but don't trust it until you've compared it with the real data.
Let’s have a look at weather data from NOAA. The data is provided by NOAA in a CSV format, but I’ve changed the format here for readability.
Listing 1: Weather Data
DATE SNOW TMAX TMIN PGTM
2000-01-01 0 100 11 1337
2000-01-02 0 156 61 2313
2000-01-03 0 178 106 320
2000-01-04 0 156 78 1819
2000-01-05 0 83 -17 843
Without a properly documented schema, it's hard to know what's going on. The DATE column is obvious, but it's unclear what the rest of the fields mean.
- What does SNOW represent? Is it describing how much snow fell? If so, is that measurement in inches or maybe centimeters? Maybe it's a boolean value representing yes or no?
- TMAX and TMIN are probably the maximum and minimum temperature for the day. Once again, what are the units? Neither Celsius nor Fahrenheit makes sense given there's an 89 degree difference on 2000-01-01.
- I have no clue what PGTM stands for.
It’s clear that only knowing the type of the data (string, integer, float …) isn’t enough. We need to know the units and sometimes more to understand the information.
Listing 2: Weather Schema
TMAX - Maximum temperature (tenths of degrees C)
TMIN - Minimum temperature (tenths of degrees C)
SNOW - Snowfall (mm)
PGTM - Peak gust time (hours and minutes, i.e., HHMM)
NOAA did publish a schema, and after reading it, the representation of the data is now clear. TMAX and TMIN are Celsius temperatures, but in tenths of a degree. SNOW represents snowfall, but in millimeters. Finally, PGTM represents the time of day when the peak wind gust occurred.
If you have a say in your company, do your best to make sure all the data you work with has a formal, written schema, and that this schema is kept up to date.
Even if your company maintains schemas and keeps them up to date, the raw data you’re processing can still have errors. Always look at the raw data and check if it matches your documented schemas. As agent Mulder said: “Trust no one!”. It’s a good practice to have data validation and data quality metrics as part of your data pipeline.
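As a minimal sketch of such a validation check, here's one way to compare a loaded DataFrame against a documented schema. The expected_schema dict and the validate helper are my own, hypothetical names; the real rules should come from your documented schema.
import pandas as pd

# Hypothetical documented schema for the NOAA data: column name -> expected dtype.
expected_schema = {
    'DATE': 'object',   # dates arrive as text in the raw CSV
    'SNOW': 'int64',    # snowfall in mm
    'TMAX': 'int64',    # tenths of degrees C
    'TMIN': 'int64',    # tenths of degrees C
    'PGTM': 'int64',    # peak gust time as HHMM
}

def validate(df: pd.DataFrame, schema: dict) -> list:
    """Return a list of human readable problems; an empty list means the data matches."""
    problems = []
    missing = set(schema) - set(df.columns)
    extra = set(df.columns) - set(schema)
    if missing:
        problems.append(f'missing columns: {sorted(missing)}')
    if extra:
        problems.append(f'undocumented columns: {sorted(extra)}')
    for col, dtype in schema.items():
        if col in df.columns and str(df[col].dtype) != dtype:
            problems.append(f'{col}: expected {dtype}, got {df[col].dtype}')
    return problems
A check like this can run at the start of your pipeline and fail loudly when the raw data drifts away from the documentation.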
Size Matters
pandas by default loads all of the data you're working with into memory. Some datasets are too big to fit, and once you exhaust the computer's physical memory and start swapping to disk, processing performance goes down the drain.
My advice is to load a small amount of the data into memory initially, and then extrapolate to figure out how much memory you will need for the entire dataset. If the dataset is coming from a database, add a LIMIT clause to your SELECT statement to reduce the initial size of the dataset. If the dataset is coming from a file, you'll need a different strategy, like reading a limited number of lines of text or a limited number of parsable documents.
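Here's a quick sketch of both approaches; the database file, table name, and CSV file name are hypothetical.
import sqlite3
import pandas as pd

# From a database: let the server return only the first rows.
conn = sqlite3.connect('trips.db')
sample = pd.read_sql('SELECT * FROM trips LIMIT 10000', conn)

# From a file: let pandas stop after a fixed number of rows.
sample = pd.read_csv('trips.csv', nrows=10_000)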
Let’s have a look at a different dataset that is part of the NYC Taxi Dataset. This data comes as a compressed CSV and I’d like to know how much memory I’ll need to load all of it in pandas.
First, let’s look at how large the dataset is on disk.
Listing 3: Disk Size
In [1]: csv_file = 'yellow_tripdata_2018-05.csv.bz2'
In [2]: from pathlib import Path
In [3]: MB = 2**20
In [4]: Path(csv_file).stat().st_size / MB
Out[4]: 85.04909038543701
The compressed CSV consumes about 85MB of disk space. bz2 typically compresses text to around 10-15% of its original size, which means the uncompressed data will consume around 780MB of disk space.
The next thing to find out is how many lines of text there are in the file.
Listing 4: Line Count
In [5]: import bz2
In [6]: with bz2.open(csv_file, 'rt') as fp:
...: num_lines = sum(1 for _ in fp)
In [7]: num_lines
Out[7]: 9224065
In [8]: f'{num_lines:,}'
Out[8]: '9,224,065'
- [5] we import the bz2 library.
- [6] we use a generator expression to count the number of lines of text in the file. This took about 40 seconds to run on my machine.
- [8] I use an f-string to print the result in a human readable format.
We have 9.2 million lines of text in the file. Let’s load the first 10,000 lines into pandas and measure the amount of memory being used. Then we can calculate the total amount of memory needed to load the complete file.
Listing 5: Calculating Size
In [9]: import pandas as pd
In [10]: nrows = 10_000
In [11]: df = pd.read_csv(csv_file, nrows=nrows)
In [12]: df.memory_usage(deep=True).sum() / MB
Out[12]: 3.070953369140625
In [13]: Out[12] * (num_lines / nrows)
Out[13]: 2832.667348892212
- [11] we load 10,000 rows into a DataFrame.
- [12] we calculate how much memory the DataFrame is consuming, in MB.
- [13] we calculate the total memory consumption for the whole data file: about 2.8GB.
It’s safe to load all of the data into memory.
Note: If the data doesn't fit in your computer's memory, don't despair! There are ways to load parts of the data at a time and reduce memory consumption. But probably the most effective solution is to lease a cloud machine with a lot of memory. Some cloud providers have machines with several terabytes of memory.
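One such approach, sketched below, is to process the file in chunks so only a slice of it is in memory at any time; the aggregation here is just an example.
import pandas as pd

csv_file = 'yellow_tripdata_2018-05.csv.bz2'
total = 0.0
# Read 100,000 rows at a time; pandas infers bz2 compression from the file name.
for chunk in pd.read_csv(csv_file, chunksize=100_000):
    total += chunk['total_amount'].sum()
print(f'total fares: {total:,.2f}')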
Raw Data
Before you load any data, it's a good idea to look at it in its raw format and see if it matches your understanding of the schema.
Listing 6: Raw Data
In [14]: with bz2.open(csv_file, 'rt') as fp:
...: for i, line in enumerate(fp):
...: print(line.strip())
...: if i == 3:
...: break
...:
Output:
VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
1,2018-05-01 00:13:56,2018-05-01 00:22:46,1,1.60,1,N,230,50,1,8,0.5,0.5,1.85,0,0.3,11.15
1,2018-05-01 00:23:26,2018-05-01 00:29:56,1,1.70,1,N,263,239,1,7.5,0.5,0.5,2,0,0.3,10.8
It looks like this file is a CSV with a header on the first line. Let's use the csv module to parse the first few records and see what they look like.
Listing 7: Raw Data - CSV
In [15]: import csv
...: from pprint import pprint
In [16]: with bz2.open(csv_file, 'rt') as fp:
...: rdr = csv.DictReader(fp)
...: for i, row in enumerate(rdr):
...: pprint(row)
...: if i == 3:
...: break
...:
Output:
{'DOLocationID': '50',
'PULocationID': '230',
'RatecodeID': '1',
'VendorID': '1',
'extra': '0.5',
'fare_amount': '8',
'improvement_surcharge': '0.3',
'mta_tax': '0.5',
'passenger_count': '1',
'payment_type': '1',
'store_and_fwd_flag': 'N',
'tip_amount': '1.85',
'tolls_amount': '0',
'total_amount': '11.15',
'tpep_dropoff_datetime': '2018-05-01 00:22:46',
'tpep_pickup_datetime': '2018-05-01 00:13:56',
'trip_distance': '1.60'}
...
- [15] import the csv module and pprint for human readable printing.
- [16] use a csv.DictReader to read the first few records and print them.
Looking at the data, things seem OK. The datetime fields look like dates and times, and the amounts look like floating point numbers.
Data Types
Once you've seen the raw data and verified it will fit into memory, you can load it into pandas. However, remember that in a CSV everything is text, and pandas guesses the types for you. After loading the data, check that the column types match what you expect.
Listing 8: Checking Types
In [16]: df = pd.read_csv(csv_file)
In [17]: df.dtypes
Out[17]:
VendorID int64
tpep_pickup_datetime object
tpep_dropoff_datetime object
passenger_count int64
trip_distance float64
RatecodeID int64
store_and_fwd_flag object
PULocationID int64
DOLocationID int64
payment_type int64
fare_amount float64
extra float64
mta_tax float64
tip_amount float64
tolls_amount float64
improvement_surcharge float64
total_amount float64
dtype: object
- [16] we load the whole dataset into a DataFrame; this took about 45 seconds on my machine.
- [17] we print out the data type for each column.
Most of the column types seem OK, but tpep_pickup_datetime and tpep_dropoff_datetime are of type object. The object type usually means a string, but in this case we'd like these columns to be timestamps. This is a case where we need to help pandas figure out the correct type for these columns.
Note: I hate the CSV format with a passion - there's no type information, no formal specification, and don't get me started on Unicode … If you have a say, pick a different format that carries type information. My default storage format is SQLite, which is a single-file SQL database.
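As a rough sketch of that approach, here's how the taxi DataFrame could be stored in SQLite and read back; the file and table names are my own choices.
import sqlite3

conn = sqlite3.connect('taxi.db')        # a single-file database
df.to_sql('trips', conn, index=False)    # write the DataFrame to a table
# SQLite has no native datetime type, so ask pandas to parse those columns on the way back.
trips = pd.read_sql(
    'SELECT * FROM trips',
    conn,
    parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'],
)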
Let’s help pandas figure out the correct types.
Listing 9: Fixing Types
In [18]: time_cols = ['tpep_pickup_datetime', 'tpep_dropoff_datetime']
In [19]: df = pd.read_csv(csv_file, parse_dates=time_cols)
In [20]: df.dtypes
Out[20]:
VendorID int64
tpep_pickup_datetime datetime64[ns]
tpep_dropoff_datetime datetime64[ns]
passenger_count int64
trip_distance float64
RatecodeID int64
store_and_fwd_flag object
PULocationID int64
DOLocationID int64
payment_type int64
fare_amount float64
extra float64
mta_tax float64
tip_amount float64
tolls_amount float64
improvement_surcharge float64
total_amount float64
dtype: object
- [19] we tell pandas to parse the two time columns as dates.

Looking at the output of [20], we see that we now have the right types.
Looking for Bad Values
Once the data is loaded and the correct types are being used, it's time to look for "bad" values. The definition of a "bad" value depends on the data you're working with. For example, if you have a temperature column, the maximal value probably shouldn't be more than 60°C (the highest temperature ever recorded on earth was 56.7°C). However, if the data represents engine temperatures, the "bad" value markers would need to be much higher.
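Such a check is often a one-line boolean expression. For example, assuming the NOAA data above is loaded into a hypothetical DataFrame called weather (and remembering that TMAX is in tenths of degrees C):
# 'weather' is a hypothetical DataFrame holding the NOAA data from Listing 1.
# Flag readings where the maximum temperature exceeds 60°C - almost certainly bad data.
bad_tmax = weather[weather['TMAX'] / 10 > 60]
print(f'{len(bad_tmax)} suspicious TMAX readings')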
One of the easiest ways to look for bad data is to use the DataFrame's describe method. Since our DataFrame has many columns, I'm going to look at a subset of the columns.
Listing 10: Looking for Bad Data
In [21]: df[['trip_distance', 'total_amount', 'passenger_count']].describe()
Out[21]:
trip_distance total_amount passenger_count
count 9.224063e+06 9.224063e+06 9.224063e+06
mean 3.014031e+00 1.681252e+01 1.596710e+00
std 3.886332e+00 7.864489e+01 1.245703e+00
min 0.000000e+00 -4.858000e+02 0.000000e+00
25% 1.000000e+00 8.760000e+00 1.000000e+00
50% 1.650000e+00 1.225000e+01 1.000000e+00
75% 3.100000e+00 1.835000e+01 2.000000e+00
max 9.108000e+02 2.346327e+05 9.000000e+00
Right away we see some fishy data:
- The minimal total_amount is negative
- The maximal trip_distance is 910 miles
- There are rides with 0 passengers
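A few boolean masks will show how widespread these issues are; the 100 mile cutoff for a "long" trip below is my own arbitrary threshold.
print((df['total_amount'] < 0).sum())      # rides with a negative total
print((df['trip_distance'] > 100).sum())   # implausibly long trips
print((df['passenger_count'] == 0).sum())  # rides with no passengers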
Sometimes you’ll need to run a calculation to find bad data.
Listing 11: Trip Duration
In [22]: (df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']).describe()
Out[22]:
count 9224063
mean 0 days 00:16:29.874877155
std 2 days 10:35:51.816665095
min -7410 days +13:10:08
25% 0 days 00:06:46
50% 0 days 00:11:28
75% 0 days 00:19:01
max 0 days 23:59:59
dtype: object
- [22] we calculate the trip duration and use describe to display statistics on it.

Again, two values stand out:
- The minimal duration is negative (maybe someone invented a time machine?)
- The maximal duration is a full day
In some cases, you'll need to run more complex queries to find bad data. For example, a taxi's average speed can't be more than 55mph. Or, if you look at the weather data, there shouldn't be any snow when the temperature is above 20°C.
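As a sketch of the speed check on the taxi data, using the 55mph limit mentioned above:
# Average speed in mph: trip_distance is in miles, duration converted to hours.
duration_hours = (
    (df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']).dt.total_seconds() / 3600
)
speed_mph = df['trip_distance'] / duration_hours
too_fast = df[speed_mph > 55]
print(f'{len(too_fast):,} trips with an average speed above 55mph')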
Conclusion
I haven't worked with real data that didn't have errors in it. I've learned to keep my eyes open and challenge everything I think I know about the data before processing it to make decisions. I urge you to follow these steps every time you start working with a new dataset:
- Find out the schema
- Calculate data size
- Look at the raw data
- Check data types
- Look for bad data
This might seem like a lot of work, but I guarantee it’ll save you much more work down the road when the models you’ve worked hard to develop start to misbehave.
I’d love to hear your data horror stories, and how you handled them. Reach out to me at miki@353solutions and amaze me.