Data work is dirty work

"The government are very keen on amassing statistics. They collect them, add them, raise them to the nth power, take the cube root and prepare wonderful diagrams. But you must never forget that every one of these figures comes in the first instance from the chowky dar (village watchman in India), who just puts down what he damn pleases."

Stamp's Law, Sir Josiah Stamp
Quoted often in my Stats classes

The recent popularity of something called “data science” has created an environment where every day I run into some post about some data set and how you can analyze it in a few lines of Python or R or Julia or something else. These posts come with pretty and colorful pictures. They ever so subtly seem to send the message that anyone can become a data “scientist”.

I hate to break it to you, no matter how good it sounds, the phrase “data science” does not make sense.

The definition of the word science seems straightforward enough:

The intellectual and practical activity encompassing the systematic study of the structure and behaviour of the physical and natural world through observation and experiment.

Data are what you generate “through observation and experiment.”

Data are not the objective of science. Every field adopts certain techniques suitable for the analysis of the types of data studies in such fields generate. For example, most data economists have to work with are observational and hence we have econometrics.

Dealing with observational data, organizing them, double-checking them, reshaping them to something suitable for analysis is hard work. In fact, if I had one-tenth the talent of James Mickens, I would write something like this about data work.

I started working with real, observational data some decades ago. A popular assignment when teaching Linear Programming is to calculate the minimum cost of obtaining a healthy diet (since both the objective function and constraints are linear), assume about 40% of one’s income goes to nutrition, and calculate a “living wage” on the basis of that. To complete the assignment, students are supposed to collect data on prices of main food items available locally etc. It’s a fun way to introduce people to all sorts of issues they should be familiar with if they are going to deal with empirical models without being too overwhelming.

After that, I dealt with some micro- and some macro-economic data. Then, during my senior year, I worked as an analyst at the Central Bank of Turkey where I was responsible for the specification of the labor market component of the annual macroeconometric model of the Turkish economy. Of course, I was not the first person who had dealt with this model. But, I think I was the first person who had decided to check the underlying data.

Up to that point, the underlying data tables had been entered by hand from the publications of the State Planning Organization and the State Institute of Statistics (DİE, now TÜİK). I decided to double check some past entries, and I realized the numbers we had in our tables did not match the numbers of found in some of those publications. That led me to the realization that central planning is impossible: Central planning requires timely, extensive, and correct data. On the other hand, keeping one’s job in the state apparatus requires one to please one’s political bosses (one of the reasons I decided not to stick with the lucrative position with the prestiguous Research Department of the Central Bank). That means, you write a five year plan that forecasts impressive wage and productivity gains, great increases in employment etc. If, within the horizon of the plan, your political bosses change, past forecasts must be modified to make sure the new administration starts from a lower base so their achievements look more impressive. Or, if important measures are falling behind the rosy forecasts, you may want to make sure unfavorable data are not included in a report right before a crucial debate in the parliament or before a visit by the IMF.

So, I decided to look into the actual data underlying these published reports. It took some doing, but I got access to the original tabulations still stored in a basement at the State Institute of Statistics. Those archived pieces of yellow, low quality paper were moldy. They were falling apart. I gleaned whatever I could from them but that wasn’t enough to come up with a consistent and well defined series for the cost of labor in industry.

The effort was not wasted though: I learned the lesson of Stamp’s Law first hand before I had even heard of it.

By the mid-90s, in the United States, putting together data tables, at least for data sets produced by the Census Bureau, Bureau of Labor Statistics, and Bureau of Economic Analysis had become considerably easier. You could actually FTP stuff. Many more data sets were now coming on CD-ROMs and personal computers with CD drives were everywhere. But, intuitively understanding how individual pieces of information ended up as bits and bytes on digital media has been invaluable to me.

These days, the power to tabulate, aggregate, and collate data sets of sizes previously unimaginable is available to pretty much anyone. On the one hand, this democratization of the ability to access and analyze data is a good thing. On the other hand, the ease with which data flow from sources through analysis tools into pretty pictures has created an environment where really hard questions are cast aside in favor of pretty graphs and over-glamourization of applying canned techniques to data which may or may not fit assumptions implicit in those techniques.

This Monday’s Perl Weekly featured a post by James Keenan on the so-called Gun Violence Archive (GVA) published by the Guardian. As far as I understand, the data are produced using an attempt at a census of all incidents in the United States. An incomplete census may have biases that cannot be controlled for using straightforward statistical techniques. Ultimately, the underlying data come from individual cities and counties and their police departments etc. These data are voluntarily reported (or not reported, as the case might be) by those units. One cannot assume that missing data are missing at random. So, that’s the first problem, from the get go, with using that data set to do serious analysis.

In his post, Jim notices some oddities. For example:

If we look at our report more closely, we see that there are many large cities where the 2014 Murder Rate was listed as zero.

The problem is simple. Not all departments report rates. When a rate is not reported, it will be a missing entry in the data set (see also my comment on Twitter for a screenshot). Clearly, Honolulu reported a positive count of murders, but did not report a murder rate. The value was missing in the original data set provided by the FBI, but Guardian’s “data scientists” replaced missing values with zeros in putting together this data set.

So, lesson one in data work: Do not ever trust regurgitated data, no matter how much you trust the entity doing the processing. We are not little birds.

If you use FBI’s oddly restricted data access tool (which means you’d have to write a bot to gather all available data, or, order a CD from FBI’s CJIS division if you can figure out whom to contact), you find out that Honolulu’s reporting has become sparser since 2011 (see CSV file produced by the extraction tool. It is horribly formatted, but I left it untouched for your reference). The table below shows the number of months in each year in which Honolulu Police Department submitted data to the UCR:

Year    Months
1985    12
1986    12
1987    12
1988    12
1990    12
1991    12
1992    12
1993    12
1994    12
1995    12
1996    12
1997    12
1998    12
1999    12
2000    12
2001    12
2002    12
2003    12
2004    12
2005    12
2006    12
2007    12
2008    12
2009    12
2010    12
2011    5
2012    3
2013    6
2014    4

So, immediately you run into the question of what determines which months were reported and why. I can’t answer that, so we’ll move on, but there can be no analysis without a satisfactory answer to that question. A cursory look suggests that the sparse reporting regime started with the previous mayor of Honolulu and continued with the current one. Also, it seems to have started right after the former police chief took the position. He retired early this year. We’ll have to wait and see if data for 2015, 2016, and 2017 will be more complete. The important thing to keep in mind is that if the probability of month’s data not being reported is not independent of the phenomena we want to analyze, then straightforward inference or comparisons are really difficult. We first need to model the process of sparse reporting.

Guardian also provides an incident database which seems to have been derived from FBI’s NIBRS. There is a problem though. The Guardian article says this:

Incident-level data – each row includes a gun homicide incident.

Well, try this:

$ curl | wc -l

Yes, there are zero rows in the incident file!

The file is not empty though. No, it is a single 2,302,651 byte string with no newlines in it! Seriously?!

What is going on here?

Let’s look at the underlying bytes:

$ xxd gva_release_2015_raw_incidents.csv | head -n 25
00000000: 6776 615f 6964 2c69 6e63 6964 656e 745f  gva_id,incident_
00000010: 6461 7465 2c73 7461 7465 2c63 6974 795f  date,state,city_
00000020: 6f72 5f63 6f75 6e74 795f 6775 6172 6469  or_county_guardi
00000030: 616e 5f63 6f72 7265 6374 6564 2c63 6974  an_corrected,cit
00000040: 795f 6f72 5f63 6f75 6e74 795f 6f72 6967  y_or_county_orig
00000050: 696e 616c 5f67 7661 2c61 6464 7265 7373  inal_gva,address
00000060: 2c6e 756d 5f6b 696c 6c65 642c 6e75 6d5f  ,num_killed,num_
00000070: 696e 6a75 7265 642c 6c61 7469 7475 6465  injured,latitude
00000080: 2c6c 6f6e 6769 7475 6465 2c67 7661 5f75  ,longitude,gva_u
00000090: 726c 2c66 6970 735f 6675 6c6c 2c66 6970  rl,fips_full,fip
000000a0: 735f 7374 6174 652c 6669 7073 5f63 6f75  s_state,fips_cou
000000b0: 6e74 792c 6669 7073 5f74 7261 6374 2c66  nty,fips_tract,f
000000c0: 6970 735f 626c 6f63 6b2c 6669 7073 5f66  ips_block,fips_f
000000d0: 756c 6c5f 7472 6163 742c 7472 6163 745f  ull_tract,tract_
000000e0: 6c61 6e64 5f73 7175 6172 655f 6d69 6c65  land_square_mile
000000f0: 732c 7472 6163 745f 7761 7465 725f 7371  s,tract_water_sq
00000100: 7561 7265 5f6d 696c 6573 0d33 3736 3435  uare_miles.37645
00000110: 372c 372f 3136 2f31 352c 416c 6162 616d  7,7/16/15,Alabam
00000120: 612c 416e 6461 6c75 7369 612c 416e 6461  a,Andalusia,Anda
00000130: 6c75 7369 612c 4c65 6f6e 2057 6967 6769  lusia,Leon Wiggi
00000140: 6e73 2052 6f61 642c 312c 302c 3331 2e32  ns Road,1,0,31.2
00000150: 3631 362c 2d38 362e 3337 3639 2c68 7474  616,-86.3769,htt
00000160: 703a 2f2f 7777 772e 6775 6e76 696f 6c65  p://www.gunviole
00000170: 6e63 6561 7263 6869 7665 2e6f 7267 2f69
00000180: 6e63 6964 656e 742f 3337 3634 3537 2c30  ncident/376457,0

The header ends at byte 266 where there is a single solitary carriage return, 0x0d:

00000100: 7561 7265 5f6d 696c 6573 *0d*33 3736 3435  uare_miles.37645

In fact, this file uses carriage returns for line endings. Curious. Either Guardian did all this work on a Macintosh, or, more likely, someone tried to replace CRLFs with LFs but instead of using a capable program such as dos2unix, used a one-liner incorrectly.

Let’s fix that:

$ perl -pi.bak -e "s/\R/\n/g" gva_release_2015_raw_incidents.csv

$ head gva_release_2015_raw_incidents.csv
376457,7/16/15,Alabama,Andalusia,Andalusia,Leon Wiggins Road,1,0,31.2616,-86.3769,,010399623003016,01,039,962300,3016,01039962300,107.68,1.04

Still, I would rather try to get source data from the FBI for anything serious. I am not going to cross check each individual record. Given these amateur hour mistakes by Guardian’s “data science” team, I don’t have much trust in the rest of the data set (replacing missing values with zeros is extremely problematic).

I am going to close with some of my own recommendations:

Make sure all steps in your analysis, including data retrieval are replicable

Don’t issue individual command lines in a console. If possible put raw data files in version control, so you notice changes when they occur. Write a script to retrieve the data file. In this particular case, something like this should work:

#!/usr/bin/env perl

use v5.24; # why not?!
use warnings;

use Const::Fast;
use File::Basename qw( basename );
use HTTP::Tiny;

const my @URI => qw(

my $http = HTTP::Tiny->new;

for my $uri ( @URI ) {
    my $file = basename $uri;
    my $response = $http->mirror( $uri => $file );
    unless ( $response->{success} ) {
        warn "Problem fetching '$uri': $response->{status} $response->{reason}\n";

If the source files are too large for source control, have the download script generate hashes and save them in a file. Then, you can track that file.

When you download files, run wc -l on them

It is important to ensure that the number of records in your extract are within the ballpark of your expectations.

Don’t parse the source data repeatedly

Instead, write a script to put the data in a database. I prefer SQLite at this point. SQL is easier than custom programs in any programming language for ad hoc inspection of a data set.

For this purpose, I find it handy to keep around an sqlite3 binary which is compiled with my favorite options. If you are using gcc, make sure taking advantage of -O3 -march=native. With Visual C, I like /Ox /favor:INTEL64. If your CPU supports it, also use /arch:AVX2.

Note that both gva_release_2015_grouped_by_city_and_ranked.csv and gva_release_2015_grouped_by_tract.csv also suffer from having bare carriage returns (0x0d aka Macintosh line endings), so make sure to fix that before running this step.

After correcting the EOL problems, you can import the all the tables using this simple SQLite script:


.separator ,

.import gva_release_2015_grouped_by_city_and_ranked.csv city_level

.import gva_release_2015_grouped_by_tract.csv tract_level

.import gva_release_2015_raw_incidents.csv incidents

.import UCR-1985-2015.csv ucr

run it using sqlite3 gva_20170329.db < import.sql. By distinguishing the database using the date it was created, you retain the option of comparing differences in case source files are updated.

Come up with short, mnemonic variable names

While the Guardian data set already has short variable names, the CSV file I downloaded from FBI has column headings such as “Murder and nonnegligent Manslaughter” which is not easy to work with. Come up with a short name for it. Other data sets come with codebooks which list the mnemonic for each variable along with the possible values it can take. E.g, see BRFSS.

Reshape the data tables

For example, the UCR data contains umpteen columns named:

1985_raw_murder_num, 1986_raw_murder_num, …, 2014_raw_murder_num, 2015_raw_murder_num, 1985_murder_rate, 1986_murder_rate, …, 2014_murder_rate, 2015_murder_rate

Your life will be simpler if you instead transform the table so that you have the columns agency, city, state, state_short, year, murder_count, and murder_rate.

This data table is completely unsuitable for anything even semi-serious because it does not include a variable telling us how many months of data are missing for each record. The numbers from agencies that reported all 12 months of data each year are mingled with numbers from agencies that only reported data for a few months in some years.

To actually analyze the data, make inferences and arrive at conclusion and predictions, you need to develop a model and combine these observations with other relevant sociological, demographic, and economic information. You need to build your model before looking at the data: If you let the data tell you which model to choose, you are not going anything remotely scientific. That kind of data torture has its uses, but it is not science.

Finally, collating, tabulating, and graphing data is not analysis. Those comprise just the dirty work we must do to get to a point where it is possible to analyze data.

PS: You can discuss this post on r/perl.