<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[README]]></title><description><![CDATA[README]]></description><link>https://readme.codedthinking.com</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1696062037300/qICXj87l_.png</url><title>README</title><link>https://readme.codedthinking.com</link></image><generator>RSS for Node</generator><lastBuildDate>Sat, 18 Apr 2026 10:32:19 GMT</lastBuildDate><atom:link href="https://readme.codedthinking.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Automate Your Data Work With Make]]></title><description><![CDATA[I like to think that you can remain productive over 40. Make is 43 this year and it is still my tool of choice to automate my data cleaning or data analysis. It is versatile and beautifully simple. (At first.) Yet, in a recent survey, we found that l...]]></description><link>https://readme.codedthinking.com/automate-your-data-work-with-make</link><guid isPermaLink="true">https://readme.codedthinking.com/automate-your-data-work-with-make</guid><category><![CDATA[Makefile]]></category><category><![CDATA[Data Science]]></category><dc:creator><![CDATA[Miklós Koren]]></dc:creator><pubDate>Thu, 25 Nov 2021 15:41:43 GMT</pubDate><content:encoded><![CDATA[<hr />
<p>I like to think that you can remain productive over 40. <a target="_blank" href="https://en.wikipedia.org/wiki/Make_(software)">Make</a> is 43 this year and it is still my tool of choice for automating data cleaning and data analysis. It is versatile and beautifully simple. (At first.) Yet, <a target="_blank" href="https://gist.github.com/csokaimola/219911140de94e01851cc621f50ea794">in a recent survey</a>, we found that less than 5 percent of data-savvy economists use Make regularly.</p>
<h2 id="heading-what-is-make">What is Make?</h2>
<p>Most build systems are meant to, well, build things. Compile code in Java, C, and the like. Make is supposed to do that, too, and most tutorials and StackOverflow questions will feature examples about how to build C code.</p>
<p>But at its most basic, Make is indeed beautifully simple. I create a text file called <code>Makefile</code> in my folder with the following content.</p>
<pre><code class="lang-makefile"><span class="hljs-section">clean_data.csv: raw_data.csv data_cleaner.py</span>
    python data_cleaner.py
</code></pre>
<p>Then I say <code>make</code> in the shell and Make creates <code>clean_data.csv</code> from <code>raw_data.csv</code>.</p>
<p>In other words, I need to specify</p>
<pre><code class="lang-makefile"><span class="hljs-section">target: source</span>
    recipe
</code></pre>
<p>and Make will run the recipe for me.</p>
<p>This information is something I want to note for my documentation anyway. What does my script need and what does it produce? I might as well put it in a Makefile.</p>
<p>This way, I can link up a chain of data work,</p>
<pre><code class="lang-makefile"><span class="hljs-section">visualization.pdf: clean_data.csv visualize.py</span>
    python visualize.py
<span class="hljs-section">clean_data.csv: raw_data.csv data_cleaner.py</span>
    python data_cleaner.py
</code></pre>
<p>When I enter <code>make</code> in the shell, I get my <code>visualization.pdf</code> recreated right from raw data.</p>
<blockquote>
<p>Order matters here. Typing <code>make</code> without any arguments recreates the <em>first</em> target found in the file called <code>Makefile</code>. I can also type <code>make clean_data.csv</code> if I want to recreate a specific target.</p>
</blockquote>
<h2 id="heading-only-do-what-is-needed">Only do what is needed</h2>
<p>Suppose I don't like the color in my graph and decide to edit <code>visualize.py</code>. But data cleaning takes a lot of time! If <code>clean_data.csv</code> is already up to date (relative to the time stamps of <code>raw_data.csv</code> and <code>data_cleaner.py</code>), Make will skip that step and only redo the visualization recipe.</p>
<p>You don't have to rerun everything. Lazy is good. One more reason why you want to write <a target="_blank" href="https://dev.to/korenmiklos/the-tupperware-approach-to-coding-1g74">modular code</a>.</p>
<h2 id="heading-variables-and-functions">Variables and functions</h2>
<p>As soon as you feel the power of your first few simple Makefiles, you will crave more. Can I do this? Can I do that? The answer is <em>yes, you can, but it will take a lot of searching on StackOverflow</em>.</p>
<p>One feature I use regularly is <em>automatic variables</em>. If I don't want to hard code file names into my neat Python script (you'll see shortly why), I can pass the names of target and source as variables.</p>
<pre><code class="lang-makefile"><span class="hljs-section">clean_data.csv: raw_data.csv data_cleaner.py</span>
    python data_cleaner.py &lt; <span class="hljs-variable">$&lt;</span> &gt; <span class="hljs-variable">$@</span>
</code></pre>
<p>This passes <code>raw_data.csv</code> (the variable <code>$&lt;</code> refers to the first source file) to the STDIN of <code>data_cleaner.py</code> and saves the output on STDOUT to <code>clean_data.csv</code> (the variable <code>$@</code> denotes the target).</p>
<p>Why these symbols? Don't ask me. They don't look pretty but they get the job done.</p>
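<p>To make the redirection concrete, here is a minimal sketch of what such a <code>data_cleaner.py</code> could look like. This is my illustration, not code from the Makefile above; the cleaning rule (drop rows with an empty first field) is a stand-in for real cleaning logic.</p>
<pre><code class="lang-python">import csv
import sys

def clean(rows):
    # Stand-in cleaning rule: drop rows whose first field is empty.
    return [row for row in rows if row and row[0].strip()]

if __name__ == "__main__":
    # Make redirects STDIN and STDOUT for us, so the script never
    # needs to know the actual file names.
    writer = csv.writer(sys.stdout)
    writer.writerows(clean(csv.reader(sys.stdin)))
</code></pre>
<p>Because the file names only live in the Makefile, renaming <code>raw_data.csv</code> later means editing one line, not hunting through scripts.</p>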
<p>I can also use <a target="_blank" href="https://www.gnu.org/software/make/manual/html_node/Functions.html#Functions">functions</a> like</p>
<pre><code class="lang-makefile"><span class="hljs-section">clean_data.csv: input/complicated-path/raw_data.csv data_cleaner.py</span>
    python data_cleaner.py <span class="hljs-variable">$(<span class="hljs-built_in">basename</span> $(<span class="hljs-built_in">notdir</span> <span class="hljs-variable">$@</span>)</span>)
</code></pre>
<p>and many more.</p>
<h2 id="heading-parallel-execution">Parallel execution</h2>
<p>And now for the best part. Make can execute my jobs in parallel. On a nicely equipped AWS server, I gladly launch <code>make -j60</code> to do the tasks on 60 threads. Make serves as a job scheduler. Because it knows what depends on what, I will not run into a race condition.</p>
<blockquote>
<ul>
<li><p>Knock, knock.</p>
</li>
<li><p>Race condition.</p>
</li>
<li><p>Who's there?</p>
</li>
</ul>
</blockquote>
<p>Parallel execution doesn't help if I have a linear chain of recipes as above. But if I can split my dependency graph into parallel branches, they will be executed in the correct order.</p>
<p>So suppose my data is split into two (or many more). The following code would allow for parallel execution of the data cleaning recipe.</p>
<pre><code class="lang-makefile"><span class="hljs-section">visualization.pdf: merged_data.csv visualize.py</span>
    python visualize.py
<span class="hljs-section">merged_data.csv: clean_data_1.csv clean_data_2.csv merge_data.py</span>
    python merge_data.py
<span class="hljs-section">clean_data_%.csv: raw_data_%.csv data_cleaner.py</span>
    python data_cleaner.py &lt; <span class="hljs-variable">$&lt;</span> &gt; <span class="hljs-variable">$@</span>
</code></pre>
<p>I have used the <em>pattern matching</em> character <code>%</code> to match both <code>clean_data_1.csv</code> and <code>clean_data_2.csv</code>.</p>
<p>Invoking Make with the <code>-j</code> option, as in <code>make -j2</code>, will start two processes to clean the data. When <em>both</em> have finished, the merge data recipe runs, then the visualization. (These will be single-threaded.)</p>
<p>I regularly use parallel execution to do Monte Carlo simulations or draw bootstrap samples. Even if I have 500 parallel tasks and only 40 processors, <code>make -j40</code> will patiently grind away at those tasks. And if I kill my jobs to let someone run Matlab for the weekend (why would they do that?), I can simply restart on Monday with only 460 tasks to go.</p>
<ul>
<li><p><a target="_blank" href="https://github.com/korenmiklos/per-shipment-costs-replication/blob/master/Makefile">Simple real-world Makefile</a> with variables and for loops.</p>
</li>
<li><p><a target="_blank" href="https://github.com/korenmiklos/imported-inputs-and-productivity-replication/blob/master/code/Makefile">Not-so simple Makefile</a> with variables, for loops, functions and pattern matching.</p>
</li>
</ul>
<p>Those who still don't like Make? <code>$&lt; $@</code> them.</p>
]]></content:encoded></item><item><title><![CDATA[Wish I Could Be Like David Watts]]></title><description><![CDATA[
Which David Watts? Names are not unique and we want to use keys instead....]]></description><link>https://readme.codedthinking.com/wish-i-could-be-like-david-watts</link><guid isPermaLink="true">https://readme.codedthinking.com/wish-i-could-be-like-david-watts</guid><dc:creator><![CDATA[Miklós Koren]]></dc:creator><pubDate>Tue, 23 Apr 2019 19:30:31 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062847324/c5f0e1d0-f5b3-4d23-94a9-5e359b4924ab.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<hr />
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062847324/c5f0e1d0-f5b3-4d23-94a9-5e359b4924ab.png" alt /></p>
<p>Which David Watts? Names are not unique and we want to <a target="_blank" href="https://medium.com/data-architect/choose-great-keys-d9ebe0485ec5">use keys instead</a>. But how does David Watts become <code>P-12345678</code>? More importantly, how do we know that <em>this</em> David Watts is the same as <em>that</em> David Watts?</p>
<p>This problem is known as <strong>entity resolution</strong> (ER), a.k.a. record linkage, deduplication, or fuzzy matching. (It is different from <em>named entity recognition</em>, where you have to recognize entities in flow text.) It is as complicated as it looks. Names and other fields are misspelled, so if you are too strict, you fail to link two related observations. If you are too fuzzy, you mistakenly link unrelated observations.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062849905/9b1c22f2-8eb3-4d36-9767-e0ff527232da.jpeg" alt />
Photo by Steve Harvey on Unsplash</p>
<p>The first guiding principle of entity resolution is to embrace the imperfections. There is no perfect method, you are just balancing two types of error. <em>False positives</em> occur when you link two observations that, in reality, refer to two different entities. <em>False negatives</em> occur when you fail to link two observations that, in reality, represent the same entity. You can always decrease one type of error at the expense of the other by selecting a more or less stringent matching method.</p>
<p>The second guiding principle is to appreciate the computational complexity. If you are unsure about your data, you have to compare every observation with every other, making <code>N(N-1)/2</code> comparisons in a dataset with <code>N</code> observations. (See box on why it is sufficient to make <em>pairwise</em> comparisons.) In a large dataset this becomes prohibitively many comparisons. For example, if you want to deduplicate users from a dataset with 100,000 observations (a small dataset), you have to make about 5 <em>billion</em> comparisons. Throughout the ER process, you should be looking for ways to reduce the number of necessary comparisons.</p>
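<p>The quadratic blow-up is easy to verify with a two-line sketch (plain arithmetic, nothing specific to any ER library):</p>
<pre><code class="lang-python">def pairwise_comparisons(n):
    # Number of unordered pairs among n observations: n(n-1)/2
    return n * (n - 1) // 2

print(pairwise_comparisons(100_000))  # 4,999,950,000: about 5 billion
</code></pre>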
<blockquote>
<h2 id="heading-methods-aside">Methods aside</h2>
<p>An entity resolution defines groups of observations that belong to the same entity: <code>e = {o1,o2,o3,...}</code>. Maybe surprisingly, it is sufficient to define when a <em>pair of observations</em> denote the same entity, when <code>e(o1) = e(o2)</code>. Because equality is <em>transitive</em>, we can propagate the pairwise relation to the entire dataset: if <code>e(o1) = e(o2)</code> and <code>e(o2) = e(o3)</code> then <code>e(o1) = e(o3)</code> and <code>e = {o1,o2,o3}</code>.</p>
<p>With fuzzy matching, we cannot tell precisely whether the entities behind two observations are <em>equal</em>. We can just calculate a <em>distance</em> between the two observations, <code>d(o1,o2) ≥ 0</code>. The problem with this is that distances are not transitive: if <code>o1</code> and <code>o2</code> are "very close" and so are <code>o2</code> and <code>o3</code>, that does not make <code>o1</code> and <code>o3</code> "very close." We have the <em>triangle inequality</em>, <code>d(o1,o2) + d(o2,o3) ≥ d(o1,o3)</code>, but this is much weaker than transitivity. </p>
<p>The goal of fuzzy matching is to transform a distance into an equality relation. For example, <code>e(o1) = e(o2)</code> whenever <code>d(o1,o2) ≤ D</code> is a simple formula to use. But beware of being too fuzzy: when <code>D</code> is too big, you can end up linking observations that are very different. For example, if you allow for a <em>Levenshtein distance</em> of 2 between a pair of words, you will find that
<code>book</code> <code>=</code> <code>back</code> <code>=</code> <code>hack</code> <code>=</code> <code>hacker</code>. I bet you didn't believe <code>book</code> <code>=</code> <code>hacker</code>.</p>
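<p>You can check this chain yourself. Below is a sketch with a textbook dynamic-programming Levenshtein distance (not any particular library); note that while each neighboring pair is within distance 2, <code>book</code> and <code>hacker</code> are at distance 5.</p>
<pre><code class="lang-python">def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

pairs = [("book", "back"), ("back", "hack"), ("hack", "hacker"), ("book", "hacker")]
for x, y in pairs:
    print(x, y, levenshtein(x, y))  # distances: 2, 1, 2, and 5
</code></pre>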
</blockquote>
<p>The three steps to efficient ER are to Normalize, Match, and Merge.</p>
<p>First you <strong>normalize</strong> your data by eliminating typos and alternative spellings to bring the data to a more structured, more comparable format. For example, a name "Dr David George Watts III" may be normalized to "watts, david." Normalization can give you a lot of efficiency because your comparisons in the next step will be much easier. However, this is also where you can lose the most information if you over-normalize.</p>
<p>Normalization (a.k.a. standardization) is a function that maps your observation to a simpler (often text) representation. During a normalization, you only use one observation and do not compare it to any other observation. That comes later. You can compare to (short) <em>white lists</em>, though. For example, if your observations represent cities, it is useful to compare the <code>city_name</code> field to a list of known cities and correct typos. You can also convert text fields to lower case, drop punctuation and <em>stop words</em>, round or bin numerical values.</p>
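<p>A normalization function in this spirit might look like the sketch below. The stop-word list and the "last, first" output format are my illustrative assumptions, not a standard.</p>
<pre><code class="lang-python">import re

STOP_WORDS = {"dr", "mr", "mrs", "jr", "ii", "iii"}  # illustrative list

def normalize_name(raw):
    # Lower-case, strip punctuation, drop titles and suffixes,
    # then emit "last, first" as the normalized form.
    cleaned = re.sub(r"[^\w\s]", " ", raw.lower())
    words = [w for w in cleaned.split() if w not in STOP_WORDS]
    if len(words) in (0, 1):
        return " ".join(words)
    return f"{words[-1]}, {words[0]}"

print(normalize_name("Dr David George Watts III"))  # watts, david
</code></pre>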
<p>If there is a canonical way to represent the information in your observations, use that. For example, the US Postal Services standardizes US addresses (see figure) and <a target="_blank" href="https://www.usps.com/business/web-tools-apis/address-information-api.htm">provides an API</a> to do that. </p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062852268/245c479c-604f-457d-aa4e-22643462ec81.png" alt /></p>
<p>Then you <strong>match</strong> pairs of observations that are close enough according to your metric. The metric can allow for typos, such as a <em>Levenshtein distance</em>. It can rely on multiple fields such as name, address, phone number, date of birth. You can assign weights to each of these fields: matching on phone number may carry a larger weight than matching on name. You can also opt for a <em>decision tree</em>: only check the date of birth and phone number for very common names, for example.</p>
<p>To minimize the number of comparisons, you typically only evaluate <em>potential matches</em>. This is where normalization can be helpful, as you only need to compare observations with normalized names of "watts, david," or those within the same city, for example.</p>
<p>Once you have matched related observations, you have to <strong>merge</strong> the information they provide about the entity they represent. For example, if you are matching "Dr David Watts" and "David Watts," you have to decide whether the person is indeed a "Dr" and whether you are keeping that information. The merge step involves aggregating information from the individual observations with whatever aggregation function you feel appropriate. You can fill in missing fields (if, say, you find the phone number for David Watts in one observation, use it throughout), use the most complete text representation (such as "Dr David George Watts III"), or simply keep all the variants of a field (by creating a <em>set</em> of name variants, for example, {"David Watts", "Dr David Watts", "Dr David George Watts III"}). </p>
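<p>The merge step can be sketched as a small aggregation over matched observations. The field names and the aggregation rules here (longest name wins, first non-missing phone, keep all name variants) are illustrative choices, not the only sensible ones.</p>
<pre><code class="lang-python">def merge_records(records):
    # Aggregate matched observations into one entity record.
    names = [r["name"] for r in records]
    phones = [r["phone"] for r in records if r.get("phone")]
    return {
        "name": max(names, key=len),   # most complete representation
        "name_variants": set(names),   # keep all the variants
        "phone": phones[0] if phones else None,  # fill in missing fields
    }

observations = [
    {"name": "David Watts", "phone": None},
    {"name": "Dr David George Watts III", "phone": "555-0123"},
]
print(merge_records(observations))
</code></pre>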
<p>Follow through with all three steps to avoid mistakes later.</p>
]]></content:encoded></item><item><title><![CDATA[Spatial Relations]]></title><description><![CDATA[
Measurements often have a spatial dimension. If thinking about time intervals feels complicated, welcome to spatial relations. Wher...]]></description><link>https://readme.codedthinking.com/spatial-relations</link><guid isPermaLink="true">https://readme.codedthinking.com/spatial-relations</guid><dc:creator><![CDATA[Miklós Koren]]></dc:creator><pubDate>Wed, 17 Apr 2019 07:54:46 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062859515/ffddfbea-90be-4826-b443-b9889944da47.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<hr />
<p>Measurements often have a spatial dimension. If <a target="_blank" href="https://dev.to/korenmiklos/spells-221a">thinking about time intervals</a> feels complicated, welcome to <a target="_blank" href="https://en.wikipedia.org/wiki/Spatial_relation"><strong>spatial relations</strong></a>. Where in time there are only points and intervals, there are many more different types of objects in space and many more different relations. An observation may be related to a point, such as a sensor, a line, such as a river or a highway, or an area (often called <em>polygon</em> in spatial analysis) such as a city.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062859515/ffddfbea-90be-4826-b443-b9889944da47.jpeg" alt="Photo by Fleur Treurniet on Unsplash" /></p>
<p>These spatial entities may have many relations to one another. A sensor may be inside a city. A highway may intersect a river at a certain point. A highway may intersect the city. A river may serve as the boundary of the city.</p>
<blockquote>
<h3 id="heading-simple-features">Simple Features</h3>
<p>A <strong>point</strong> is given by a pair of coordinates (x,y). (We ignore 3D and only deal with the surface of the Earth.) A <strong>line</strong> is a list of connected points (x1,y1)--(x2,y2)--... An <strong>area</strong> is a polygon surrounded by a closed line, (x1,y1)--(x2,y2)--...--(x1,y1).
You can also have a collection of each of these items. A country, for example, is often a collection of polygons: a mainland plus islands and exclaves.</p>
</blockquote>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062861665/bebf6c80-084e-4aea-8c10-57a7af0f4f2b.png" alt="By Krauss - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=21299138" /></p>
<p>The first business of understanding spatial relations is to understand the type of spatial observations you have. Cities are not points, though they certainly have midpoints or centers which come up when you enter the city name in Google Maps. Cities are areas. Indeed, very few entities are actual points, though some can be reasonably approximated as such. A precise street address including the street number can safely be approximated with its geocoordinates. </p>
<p>Getting from human-readable addresses to machine-readable GPS coordinates is called <strong>geocoding</strong>. We do this every day when we enter addresses in Google Maps. To do this in a scalable way for all the observations in your dataset, you need a geocoding service. Google Maps has an API, but only allows geocoding for the purposes of showing points on their maps. For bulk geocoding you should turn to other providers such as <a target="_blank" href="https://nominatim.openstreetmap.org/">Nominatim</a>, using OpenStreetMap data.</p>
<blockquote>
<h3 id="heading-projections-and-spatial-reference-systems">Projections and Spatial Reference Systems</h3>
<p>Geocoding converts addresses to a pair of coordinates: latitude and longitude. But what do these coordinates mean? Since a pair of numbers describes a point on a plane, the problem is how to map points on the surface of the Earth (which, contrary to some claims, is not flat) to points on a flat plane. This mapping is called a <strong>projection</strong>. There are many projections, depending on what shape they assume for the Earth, which is slightly different from a perfect sphere. Yes, there is a classification of projections, called the <a target="_blank" href="https://en.wikipedia.org/wiki/Spatial_reference_system">Spatial Reference System Identifier</a>. By far the most widely used is the <a target="_blank" href="https://en.wikipedia.org/wiki/World_Geodetic_System#WGS84">World Geodetic System</a>, WGS84, which has an SRID of 4326. This is what you see in Google Maps and in your GPS. (Mercator projection is what you see on old printed maps, where Greenland looks larger than Africa. Don't ever use Mercator in real data.)</p>
</blockquote>
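<p>One thing WGS84 coordinates buy you is distance. Below is a sketch of the great-circle (haversine) distance, assuming a spherical Earth with a mean radius of 6,371 km; the spherical approximation is accurate to a few tenths of a percent.</p>
<pre><code class="lang-python">from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two WGS84 points, spherical approximation.
    R = 6371.0  # mean Earth radius in km
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlmb / 2) ** 2
    return 2 * R * asin(sqrt(a))

# Budapest (47.50, 19.04) to Vienna (48.21, 16.37): roughly 215 km
print(round(haversine_km(47.50, 19.04, 48.21, 16.37)))
</code></pre>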
<p>If you regularly work with spatial data, you should invest in knowing more about <strong>geographic information systems</strong> (GIS). There are specialized GIS packages to map spatial data or do spatial analysis, such as ESRI ArcGIS, MapInfo, or the open source <a target="_blank" href="https://www.qgis.org/en/site/">Quantum GIS</a>. Many database management tools also implement spatial queries, so you can easily select "all gas stations within 10km of this road."</p>
<p>Whereas points in space can easily be represented by just two numbers, richer spatial features require their own file formats. <a target="_blank" href="https://en.wikipedia.org/wiki/Well-known_text">Well-known text</a> provides a simple text representation of spatial features, such as <code>LINESTRING (30 10, 10 30, 40 40)</code>. This is very intuitive, but not very helpful in practice, where lines and polygons have thousands of vertices. <a target="_blank" href="https://en.wikipedia.org/wiki/GeoJSON">GeoJSON</a> is an open standard extension of JSON. If you are used to working with web apps and JSON data, convert your spatial information to the GeoJSON standard. By now all major GIS packages can read and write GeoJSON. There is also the proprietary binary file format of ESRI Shapefiles. These are widely used because of the ubiquity of the ArcGIS software package. The US Bureau of the Census, for example, publishes the <a target="_blank" href="https://www.census.gov/geo/maps-data/data/tiger-line.html">boundaries of Census tracts</a> in ESRI Shapefiles.</p>
]]></content:encoded></item><item><title><![CDATA[Eggs Are Easier To Ship Than Omelettes]]></title><description><![CDATA[I estimated the regression model we discussed last week and it didn’t work. Which regression model? What do you mean it didn’t work?]]></description><link>https://readme.codedthinking.com/eggs-are-easier-to-ship-than-omelettes</link><guid isPermaLink="true">https://readme.codedthinking.com/eggs-are-easier-to-ship-than-omelettes</guid><dc:creator><![CDATA[Miklós Koren]]></dc:creator><pubDate>Mon, 25 Mar 2019 09:17:34 GMT</pubDate><content:encoded><![CDATA[<hr />
<blockquote>
<ul>
<li>I estimated the regression model we discussed last week and it didn’t work.  </li>
<li>Which regression model? What do you mean it didn’t work?</li>
</ul>
</blockquote>
<p>How often have you had this conversation in your research team? We have the tendency to assume that our coworkers’ minds are magically connected to ours. They’re not. In fact, there is a very <strong>hard boundary</strong> between my thoughts and yours. It always takes real effort to transcend this boundary, and this affects how we collaborate.</p>
<p><img src="https://cdn-images-1.medium.com/max/1600/0*Z3lxEHR8vumzwAfV" alt="Photo by [Jakub Kapusnak](https://unsplash.com/@foodiesfeed?utm_source=medium&amp;utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&amp;utm_medium=referral)" />
Photo by <a target="_blank" href="https://unsplash.com/@foodiesfeed?utm_source=medium&amp;utm_medium=referral">Jakub Kapusnak</a> on <a target="_blank" href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></p>
<p>I have recently introduced a simple template when sharing my work with coauthors. I answer the following four questions and I ask them to do the same.</p>
<ol>
<li>What deliverables have I completed?</li>
<li>What did I learn?</li>
<li>What actions do I need from you?</li>
<li>What are my next steps?</li>
</ol>
<p>For example,</p>
<ol>
<li>Estimated a Poisson regression of post office counts on a bridge proximity indicator: see Table 2.  </li>
<li>After bridges are built, post offices become more frequent within 10km. The effect disappears beyond 20km.  </li>
<li>Review Table 2 and tell me what additional controls to include.  </li>
<li>Download data on river width to be used as an instrument for bridge location.</li>
</ol>
<p>It is motivated by <a target="_blank" href="https://en.wikipedia.org/wiki/Scrum_%28software_development%29#Daily_Scrum">daily scrum meetings</a>, but I have adapted it to the explorative nature of research projects.</p>
<p>In the answer to Question 1, you should list <strong>actual deliverables</strong> (Table 2), not just vague concepts (regression model). You should format the tables and figures for publishing, including notes and labels. You will have to do this at some point anyway, you might as well help your coworker understand what precisely you did to generate Figure 3.</p>
<p>Research is an explorative process, and your insights are an essential input. In Question 2, you can share what you learned. What was <strong>most surprising</strong> to you? Do not just repeat what is in the table or the figure. You don’t want to insult your coworker’s intelligence. This is an opportunity to exercise your analytical judgement.</p>
<p>“<em>FYI</em>” and “<em>What do you think?</em>” don’t cut it. What <strong>specific actions</strong> do you need to go on with your work? If you are stuck somewhere, let them know. If you are unsure about some parts and would need more feedback, let them know.</p>
<p>Much as in scrum, sharing what you are planning next helps bring the team to a common understanding. You are the best positioned to decide on <strong>next steps</strong>, because you are the one who best understands the data and the model you are working with. (If not, ask for feedback in Question 3.) So don’t be afraid to map out your work.</p>
<p>I sometimes just say to Question 4: “<em>Next steps: None. I am happy to answer clarification questions by email or Skype Monday afternoon.</em>” It is better for your teammates to know what they can expect from you, even if it is “<em>nothing</em>.” This is especially important if you are not sharing an office. I have had way too many email ping-pongs about who did what, and if people are not in sync, this can easily take a week or more.</p>
<p>I certainly feel the benefits of this approach. I can catch up faster on my coauthors’ work. We need synchronous status meetings less often, and if we do, they are more productive.</p>
<p>This is just one example of how creating an analytics product with hard boundaries can make you more productive. You should also write <a target="_blank" href="https://dev.to/korenmiklos/the-tupperware-approach-to-coding-1g74">modular code</a> that is <a target="_blank" href="https://dev.to/korenmiklos/everything-is-a-function-4171">free of side effects</a>. And assume (next to) nothing about your teammate’s computing environment. But more on this later.</p>
]]></content:encoded></item><item><title><![CDATA[Spells]]></title><description><![CDATA[
I often work with time spells in my data. For example, a fir...]]></description><link>https://readme.codedthinking.com/spells</link><guid isPermaLink="true">https://readme.codedthinking.com/spells</guid><dc:creator><![CDATA[Miklós Koren]]></dc:creator><pubDate>Thu, 21 Mar 2019 15:49:19 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062872696/d8621ce1-b045-475e-a679-493f613e7802.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<hr />
<p>I often work with time spells in my data. For example, a firm <a target="_blank" href="https://github.com/korenmiklos/expat-analysis">may be managed</a> by different managers for different time spells. Gyöngyi leaves the firm on December 31, 1996, and Gábor starts on January 1, 1997.</p>
<pre><code>   firm    manager   valid_from    valid_to  
 -------- --------- ------------ ------------   
  <span class="hljs-number">123456</span>   Gyöngyi   <span class="hljs-number">1992</span><span class="hljs-number">-01</span><span class="hljs-number">-01</span>   <span class="hljs-number">1996</span><span class="hljs-number">-12</span><span class="hljs-number">-31</span>    
  <span class="hljs-number">123456</span>   Gábor     <span class="hljs-number">1997</span><span class="hljs-number">-01</span><span class="hljs-number">-01</span>   <span class="hljs-number">1999</span><span class="hljs-number">-12</span><span class="hljs-number">-31</span>
</code></pre><p>The standard econometrics toolbox is not well suited for time spells. Often, the first thing an economist does is to convert this data to a format they know: an annual panel. (Or monthly, or weekly, same idea.)</p>
<h4 id="heading-you-can-get-rid-of-time-spells-by-temporal-sampling">You can get rid of time spells by <strong>temporal sampling</strong></h4>
<p>Take a number of time <em>instances</em> and select the observations that were valid at that instance. Take all the managers who were at the firm on June 21, 1997, for example. This reduces the time dimension to time stamps, which are easier to study.</p>
<blockquote>
<p><strong>Why June 21?</strong><br />You may be tempted to sample your data at dates like January 1 or December 31. As firms and data entry users prefer to report round dates, this is potentially dangerous. SolidWork and Co. may report all its changes on December 31, Hungover Ltd. may hold their reporting until January 1. If you sample on December 31, you get the correct data for SolidWork and Co., but last year’s data for Hungover Ltd! To avoid such bunching around round dates, our standard operating procedure at CEU MicroData is to pick a day of the year that is in the middle and is not round: June 21. This also happens to be Midsummer.</p>
</blockquote>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062872696/d8621ce1-b045-475e-a679-493f613e7802.jpeg" alt />
Photo by <a target="_blank" href="https://unsplash.com/@robsonhmorgan?utm_source=medium&amp;utm_medium=referral">Robson Hatsukami Morgan</a> on <a target="_blank" href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></p>
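<p>The sampling step above can be sketched in a few lines of pandas. This is an illustrative sketch, not our production code; the column names follow the spell table above:</p>

```python
import pandas as pd

# Manager spells, as in the table above
spells = pd.DataFrame({
    "firm": [123456, 123456],
    "manager": ["Gyöngyi", "Gábor"],
    "valid_from": pd.to_datetime(["1992-01-01", "1997-01-01"]),
    "valid_to": pd.to_datetime(["1996-12-31", "1999-12-31"]),
})

# Sample each year at June 21: keep the spell valid at that instant
frames = []
for year in range(1992, 2000):
    stamp = pd.Timestamp(year, 6, 21)
    valid = spells[(spells["valid_from"] <= stamp) & (stamp <= spells["valid_to"])]
    frames.append(valid.assign(year=year)[["firm", "manager", "year"]])
panel = pd.concat(frames, ignore_index=True)
```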
<p>This will result in the following data.</p>
<pre><code>firm    manager   year    
 -------- --------- ------   
  <span class="hljs-number">123456</span>   Gyöngyi   <span class="hljs-number">1992</span>    
  <span class="hljs-number">123456</span>   Gyöngyi   <span class="hljs-number">1993</span>    
  <span class="hljs-number">123456</span>   Gyöngyi   <span class="hljs-number">1994</span>    
  <span class="hljs-number">123456</span>   Gyöngyi   <span class="hljs-number">1995</span>    
  <span class="hljs-number">123456</span>   Gyöngyi   <span class="hljs-number">1996</span>    
  <span class="hljs-number">123456</span>   Gábor     <span class="hljs-number">1997</span>    
  <span class="hljs-number">123456</span>   Gábor     <span class="hljs-number">1998</span>  
  <span class="hljs-number">123456</span>   Gábor     <span class="hljs-number">1999</span>
</code></pre><h4 id="heading-whats-wrong-with-this">What’s wrong with this?</h4>
<p>For starters, we are repeating observations. What used to be two lines is now eight. This wastes storage and grossly violates the <a target="_blank" href="https://en.wikipedia.org/wiki/Don%27t_repeat_yourself">DRY principle</a>.</p>
<p>Even worse, even though our data set takes up more space, it contains less information. We don’t know precisely when Gyöngyi started in 1992 and when Gábor took over. We don’t even know if they ever spent time together at the firm. Maybe the snowed-in December of 1996? (We know Gábor was not yet there on June 21.)</p>
<blockquote>
<p>If you believe these are silly arguments, you’re wrong. Serious academic blood has been spilled on this. It took us more than a decade to realize that the <a target="_blank" href="https://www.aeaweb.org/articles?id=10.1257/aer.20141070">first year of a firm</a> is only a partial year.</p>
</blockquote>
<p>We put up with all this mess, because intervals can get tricky. Did you know that there are 13 different relations between time intervals? <strong>X</strong> may <em>take place before</em> <strong>Y</strong>, they may <em>overlap</em>, <strong>X</strong> may <em>finish</em> <strong>Y</strong>, and so forth. Allen’s <a target="_blank" href="https://en.wikipedia.org/wiki/Allen%27s_interval_algebra">interval algebra</a> captures these relations formally.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062874778/59ba94ed-dd69-4dd3-bab7-ab671f712d32.png" alt />
CC BY Wikimedia</p>
<p>This is confusing, but you are unlikely to need all these possible relations. You will need to measure which interval is earlier (ranking intervals by their start time, for example), and to measure overlap. For example, did Gyöngyi and Gábor ever serve at the firm at the same time? This is a question of <em>overlap</em>. Can Gyöngyi be responsible for hiring Gábor? Did she arrive before he did? This is a question of <em>precedence</em>.</p>
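<p>Overlap and precedence are easy to compute directly on the spell endpoints. A minimal sketch in Python; the function names are mine, for illustration:</p>

```python
from datetime import date

def overlap_days(a_start, a_end, b_start, b_end):
    """Days the two closed intervals share; 0 if they are disjoint."""
    latest_start = max(a_start, b_start)
    earliest_end = min(a_end, b_end)
    return max((earliest_end - latest_start).days + 1, 0)

def precedes(a_start, b_start):
    """Did spell A start strictly before spell B?"""
    return a_start < b_start

gyongyi = (date(1992, 1, 1), date(1996, 12, 31))
gabor = (date(1997, 1, 1), date(1999, 12, 31))

overlap_days(*gyongyi, *gabor)   # 0: they never served together
precedes(gyongyi[0], gabor[0])   # True: Gyöngyi arrived first
```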
<h4 id="heading-how-do-you-go-about-modeling-your-data-if-you-dont-want-to-lose-information">How do you go about modeling your data if you don’t want to lose information?</h4>
<p>There are statistical models for time spells: they are called <a target="_blank" href="https://en.wikipedia.org/wiki/Survival_analysis">survival or hazard models</a>. You can model the duration of a manager’s spell: what makes some managers stay longer than others? Or you can model a certain event occurring <em>during</em> their spell: are female managers more likely to start exporting than male managers? Here it is important that some spells are longer than others. Gyöngyi has five years to start exporting, Gábor has only three.</p>
<p>To be sure, hazard models are harder than linear panel models, but since when does hard stop you?</p>
<blockquote>
<p>Find a model that fits your data as it is. Don’t torture your data to conform to models you know.</p>
</blockquote>
<p>As a practical consideration, many database management tools implement what is called a <a target="_blank" href="https://en.wikipedia.org/wiki/Temporal_database">temporal database</a>, capturing the time spell for which an entity or a relation is valid. This makes it even easier to conduct temporal queries such as the examples above.</p>
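<p>Even without a dedicated temporal database, such a query is a one-liner in SQL. Here is a sketch with Python’s built-in sqlite3; the table and column names are made up for illustration:</p>

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE manager_spell (firm INTEGER, manager TEXT, valid_from TEXT, valid_to TEXT)"
)
con.executemany(
    "INSERT INTO manager_spell VALUES (?, ?, ?, ?)",
    [
        (123456, "Gyöngyi", "1992-01-01", "1996-12-31"),
        (123456, "Gábor", "1997-01-01", "1999-12-31"),
    ],
)

# Temporal query: who managed firm 123456 on June 21, 1997?
# ISO-8601 date strings sort lexicographically, so BETWEEN works on TEXT.
row = con.execute(
    "SELECT manager FROM manager_spell WHERE firm = ? AND ? BETWEEN valid_from AND valid_to",
    (123456, "1997-06-21"),
).fetchone()
```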
]]></content:encoded></item><item><title><![CDATA[Everything is a Function]]></title><description><![CDATA[title: Everything is a Function
published: true
description:  Procedural programming comes natural to scientists, because it reads like a precise protocol for an experiment. But everything in data analysis is a function.
tags: data science, functiona...]]></description><link>https://readme.codedthinking.com/everything-is-a-function</link><guid isPermaLink="true">https://readme.codedthinking.com/everything-is-a-function</guid><dc:creator><![CDATA[Miklós Koren]]></dc:creator><pubDate>Tue, 12 Mar 2019 17:40:06 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062880998/bc16a531-938d-4b7e-8d9e-53900851a455.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<hr />
<p>Most scientists start programming in a <a target="_blank" href="https://en.wikipedia.org/wiki/Procedural_programming">procedural style</a>. I certainly did. Procedural programming comes naturally to scientists, because it reads like a precise <a target="_blank" href="https://www.protocols.io/">protocol</a> for an experiment. <em>Do this</em>. <em>Then do that</em>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062880998/bc16a531-938d-4b7e-8d9e-53900851a455.jpeg" alt /></p>
<blockquote>
<p>Photo by <a target="_blank" href="https://unsplash.com/photos/lQGJCMY5qcM?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Hans Reniers</a> on <a target="_blank" href="https://unsplash.com/search/photos/lab-test?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
<p>I haven’t seen anyone doing data analysis in <a target="_blank" href="https://clojure.org/">Clojure</a>, <a target="_blank" href="https://www.erlang.org/">Erlang</a>, <a target="_blank" href="https://www.haskell.org/">Haskell</a> or another functional language.</p>
<pre><code>output = <span class="hljs-function"><span class="hljs-keyword">function</span>(<span class="hljs-params">inputs</span>)</span>
</code></pre><p>Strange, because if you think about it, <strong>everything in data analysis is a function</strong>. Data cleaning maps from messy data to tidy data. A statistical estimator maps from a sample to a real number. A visualization maps from data to a colorful bitmap. For data analysis, we almost exclusively write code that does not require user interaction and would be well suited to the functional paradigm.</p>
<p>The conventional definition of functional programming is “no side effects.” You only compute output from inputs. You cannot rely on any other information, and you cannot pass on any other information. This very tight discipline is super useful for science, as it is easier to <a target="_blank" href="https://en.wikipedia.org/wiki/Referential_transparency"><strong>argue about correctness</strong></a>. For example, the ordinary least squares estimator of multivariate regressions,</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062882226/c098f847-428b-4ca2-8516-40ee154060c3.png" alt /></p>
<p>is a mathematical function which you can characterize using pencil and paper. The Julia equivalent,</p>
<pre><code class="lang-julia"><span class="hljs-keyword">function</span> OLS(X, Y)  
    <span class="hljs-keyword">return</span> inv(X' * X) * X' * Y  
<span class="hljs-keyword">end</span>
</code></pre>
<p>works independently of what you have done somewhere else in the code. (By the way, <code>X\Y</code> is a better way to write this in Julia.)</p>
<p>Moreover, it is easier to <strong>automate computations</strong> as a chain of functions. If <code>f(X,Y)</code> is the estimator of multivariate coefficients and <code>g(b,X)</code> is a prediction rule, then <code>g(f(X,Y),X)</code> is your fitted machine learning model. Relying on pure functions makes the data science process more reproducible.</p>
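<p>The chain <code>g(f(X,Y),X)</code> can be written out directly. A sketch in Python with NumPy, solving the normal equations for brevity (a production version would use a least-squares routine):</p>

```python
import numpy as np

def fit(X, Y):
    # f(X, Y): OLS coefficients, a pure function of its inputs
    return np.linalg.solve(X.T @ X, X.T @ Y)

def predict(b, X):
    # g(b, X): the prediction rule
    return X @ b

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
Y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=200)

fitted = predict(fit(X, Y), X)  # g(f(X, Y), X): the fitted model
```

Neither function touches anything outside its arguments, so each is as easy to test as a pencil-and-paper formula.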
<h4 id="heading-what-are-some-existing-implementations-of-the-chain-of-functions-approach">What are some existing implementations of the chain of functions approach?</h4>
<p>You can chain small tools in a Unix-like shell <a target="_blank" href="http://swcarpentry.github.io/shell-novice/04-pipefilter/index.html">via the pipe operator</a>. The tool reads from STDIN and writes to STDOUT and (hopefully) does not touch anything else in between. As a data scientist, you can focus on implementing the function correctly, instead of worrying how you get the data and who does what with it. This is why I am a big fan of “<a target="_blank" href="https://medium.com/wunderlist-engineering/is-yelp-international-an-excuse-to-roll-data-with-the-command-line-415dc04499a3">data science from the command line</a>.”</p>
<p>An even better example is <code>%&gt;%</code> piping in R. (Julia has a similar <a target="_blank" href="https://docs.julialang.org/en/v1.1/base/base/#Base.:|%3E">pipe operator</a>.) As I understand from my R colleagues, most idiomatic code now uses this syntax.</p>
<pre><code class="lang-R">x %&gt;% log() %&gt;% diff() %&gt;% exp() %&gt;% round(<span class="hljs-number">1</span>)
</code></pre>
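<p>Python has no built-in pipe operator, but a minimal one fits in a few lines. A sketch of the same chain as the R example above:</p>

```python
from functools import partial, reduce

import numpy as np

def pipe(value, *funcs):
    # A minimal pipe: thread value through each function in turn
    return reduce(lambda acc, f: f(acc), funcs, value)

x = np.array([1.0, 2.0, 4.0])
pipe(x, np.log, np.diff, np.exp, partial(np.round, decimals=1))
# exp(diff(log(x))) gives the ratios of consecutive elements: array([2., 2.])
```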
<p>At some level, even scripts in languages like Stata (do-files) can be thought of as a chain of functions. A strict limitation of Stata is that you can only carry out computations on a single dataframe at a time. This limitation has huge benefits, though. You can write functional code that maps from one state of your dataframe to the next. For example,</p>
<pre><code class="lang-stata">generate y = log(x)  
replace y = 0 if x &lt; 0
</code></pre>
<p>is a chain of two functions. Easy to read, easy to debug. It does the same as the Pandas code</p>
<pre><code class="lang-python">df[<span class="hljs-string">'y'</span>] = np.log(df[<span class="hljs-string">'x'</span>])  
df.loc[df[<span class="hljs-string">'x'</span>] &lt; <span class="hljs-number">0</span>, <span class="hljs-string">'y'</span>] = <span class="hljs-number">0</span>
</code></pre>
<p>Er, what? This reads more complicated because of a vastly wider state we have to control. What log function do we want to use? Which dataframe are we selecting over? Which dataframe are we changing?</p>
<h4 id="heading-what-is-not-functional">What is not functional?</h4>
<p>Notebooks and other REPLs are not functional, and <a target="_blank" href="https://www.joelonsoftware.com/">Joel Spolsky</a> <a target="_blank" href="https://docs.google.com/presentation/d/1n2RlMdmv1p25Xy5thJUhkKGvjtV-dkAIsUXP-AL4ffI/edit">hates them with a passion</a>. When you move up and down between cells, saving all kinds of variables in your workspace, you confuse yourself about what is an input to your current computation. I sometimes play around in IPython notebooks, but I always feel guilty.</p>
<p><a target="_blank" href="https://jennybryan.org/">Jenny Bryan</a> from RStudio and tidyverse also has something to say about side effects.</p>
<h4 id="heading-a-wish-list-or-new-years-resolution-for-better-data-science">A wish list (or New Year’s resolution) for better data science</h4>
<ol>
<li>Implement a pipe operator in Python. I know it’s hard, but can we just have <em>tidyverse</em> for Python?</li>
<li>Write purely functional Stata code. Separate out input/output and even model estimation, graphing from pure data manipulation code.</li>
<li>Explore <a target="_blank" href="https://www.datahaskell.org/index.html">data science libraries</a> for real functional languages. I know, SQL is functional, but it reads very complicated.</li>
<li>More generally, keep an eye out for side effects. Do I need this global parameter? Do I need to write this to disk? Aim to write functions that are as pure as possible.</li>
</ol>
]]></content:encoded></item><item><title><![CDATA[The Tupperware Approach to Coding]]></title><description><![CDATA[title: The Tupperware Approach to Coding
published: true
description: A tree structure is effective to organize the information you have to keep in your head if you optimize between small and few.
tags: coding, script, software architecture
Coding is...]]></description><link>https://readme.codedthinking.com/the-tupperware-approach-to-coding</link><guid isPermaLink="true">https://readme.codedthinking.com/the-tupperware-approach-to-coding</guid><dc:creator><![CDATA[Miklós Koren]]></dc:creator><pubDate>Tue, 05 Mar 2019 21:29:12 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062887782/4bbf7759-d163-40a6-b0b4-3a69b5813787.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<hr />
<p>Coding is like ultra running. It is a huge, often daunting task. If you don’t want to go crazy, you have to break it into smaller chunks. <em>Before lunch, I will finish this function. At the next aid station, I have to refill my water bottles.</em></p>
<p>Dividing the problem into many small, manageable chunks is one way to deal with complex problems. But if you split the problem into chunks that are too small, you will end up with too many of them. Again you will feel overwhelmed.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062887782/4bbf7759-d163-40a6-b0b4-3a69b5813787.jpeg" alt /></p>
<p>A nested structure with multiple layers is often helpful. When running an ultra, I like to split the race into thirds, the thirds into sections between aid stations, and, indeed, I often just focus on single breaths. For coding, there are libraries, modules, classes, functions and single statements.</p>
<blockquote>
<p>A tree structure is effective to organize the information you have to keep in your head if you optimize between small and few.</p>
</blockquote>
<p>Perhaps the best known example is how we think about time. Time is naturally modular. There are about 30 days in a month and 12 months in a year. (We are lucky with this arrangement. A Saturn year takes about 25,000 Saturn days.) This way, we can have both <em>small</em> and <em>few</em>. I can plan for today. For this week. I can estimate how many weeks a project takes. I can select projects to work on next year.</p>
<p>Notice how I am moving up and down across multiple levels of abstraction. When I make plans for today, I do not pause to think about how these activities affect my goals for the year. (Maybe I should.) When I schedule different projects across the coming weeks, I do not pause to think about whether I will do them in the morning or the afternoon. I just assume that my daily plan will fall in line.</p>
<p>Another well known example is the folder structure on most operating systems. (The earliest mentions of folder hierarchies are from <a target="_blank" href="https://www.computer.org/csdl/proceedings/afips/1958/5053/00/50530059.pdf">1958</a> and <a target="_blank" href="https://multicians.org/fjcc4.html">1965</a>.) I can put a folder inside another folder, down to an arbitrary depth. This way, I can look around in my current folder and have an understanding quickly. If I need more details, I dig deeper into a folder inside.</p>
<blockquote>
<p>Much as a structured calendar and a nice folder structure, a well structured program helps organize your thoughts.</p>
</blockquote>
<p>I have written scripts, especially early in my career, that did everything at once. Thousands of lines of code, executing line by line. Looking through and trying to edit these scripts later is like an ultra runner’s nightmare.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062889931/3cceb60e-3f01-4d22-ae49-b765d4cc6f11.png" alt="Some of the 4569 lines of code in a single script" />
Later on, I erred on the side of too many. In a research project I could easily have 20–30 do files with little organization. Looking back, this makes me nauseous.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062891389/78b3aa97-67eb-4395-aff4-f5e0140f079e.png" alt="Some of 36 scripts." />
So what is the right level of abstraction? What is small enough? How many are few enough?</p>
<blockquote>
<p>Each of your chunks should be small enough to keep in your head.</p>
</blockquote>
<p>You should not look at another piece of code to find out what the current function does. Often, this means only a couple of lines of code per function and a couple of functions per module. Object-oriented languages are modular by design, but you can split up even simple Stata scripts into many smaller pieces.</p>
<blockquote>
<p>And you should not refer to more than 6–8 other chunks in any one layer.</p>
</blockquote>
<p>More than that and you will get lost. Having 10 or more scripts to look at and run is a good indication that you want to introduce additional layers. Can these scripts be differentiated by function? By how often they are called? By what inputs they need? Anything to make you more organized.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062892688/65aafc90-96d5-4ba8-9c5b-e014c923e71f.png" alt="This is much better. But I can still improve the organization of utils." />
Nurture your code with the same love you nurture your calendar.</p>
]]></content:encoded></item><item><title><![CDATA[The Power of Plain Text]]></title><description><![CDATA[title: The Power of Plain Text
published: true
description: I believe portability and ease of exploration beats a tight schema-conforming database any time. 
tags: csv, data, json
I sometimes get excited by binary file formats for storing data. A cou...]]></description><link>https://readme.codedthinking.com/the-power-of-plain-text</link><guid isPermaLink="true">https://readme.codedthinking.com/the-power-of-plain-text</guid><dc:creator><![CDATA[Miklós Koren]]></dc:creator><pubDate>Fri, 01 Mar 2019 20:26:39 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062897822/971c9040-c530-4027-952f-cc535d91c880.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<hr />
<p>I sometimes get excited by binary file formats for storing data. A couple of years ago it was <a target="_blank" href="https://www.hdfgroup.org/solutions/hdf5/">HDF5</a>. Now <a target="_blank" href="https://parquet.apache.org/">Apache Parquet</a> looks pretty promising. But most of my data work, especially if I share it with others, is stored in just simple, plain text.</p>
<blockquote>
<p>I believe portability and ease of exploration beats a tight schema-conforming database any time.</p>
</blockquote>
<p>Be it CSV, JSON or YAML, I love that I can just peek into the data real quick.</p>
<pre><code class="lang-bash">head -n100 data.csv
wc -l data.csv
</code></pre>
<p>are commands I use quite often. And nothing beats the human readability of a nice YAML document.</p>
<p>Sure, performance is sometimes an issue. If you are regularly reading and writing tens of millions of rows, you probably don’t want to use plain text. But in most of our use cases, a data product is read and written maybe a couple of times a day by its developer and then shared with several users who read it once or twice. It is more important to facilitate sharing and discovery than to save some bytes. And you can always zip or gzip. (Never rar or 7z or the like. Do you really expect me to install an app just to read your data?)</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062897822/971c9040-c530-4027-952f-cc535d91c880.jpeg" alt /></p>
<p>Besides size (big) and speed (slow), there are three issues with CSV files:</p>
<ol>
<li>No standard definition. Should all strings be encapsulated in quotes? What happens to quotes inside quotes? Never write your own csv parser. There will be <a target="_blank" href="https://chriswarrick.com/blog/2017/04/07/csv-is-not-a-standard/">special cases</a> you didn’t think of. Use a standard library like <a target="_blank" href="https://docs.python.org/3/library/csv.html">Python3 csv</a> or <a target="_blank" href="https://pandas.pydata.org/">pandas</a>.</li>
<li>Character encoding. As with all plain text files, you have to realize there is no such thing as plain text. Your file is just a sequence of bytes, and you have to tell your computer <a target="_blank" href="https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/">what your bytes mean</a>. In our daily work, conversion to UTF-8 is the first order of business.</li>
<li>No schema. This is a big headache. Is this column a string? A date? I am constantly struggling with leading zeros and weird date formats. (But I would struggle with these in a proprietary data format, too. Date/time functions are impossible to remember in any programming language.) I have played around with schema validation in <a target="_blank" href="http://docs.python-cerberus.org/en/stable/">Cerberus</a> and it looks cool, but we haven’t adopted anything formal yet.</li>
</ol>
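<p>The first two points have a common remedy: decode the bytes explicitly, then hand the text to a standard parser instead of your own. A minimal sketch with Python’s built-in <code>csv</code> module:</p>

```python
import csv
import io

# A file is just bytes; here, two UTF-8 encoded rows
raw = "food,category\nalma,gyümölcs\n".encode("utf-8")

# Decode explicitly, then let the standard library handle quoting rules
with io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", newline="") as handle:
    rows = list(csv.DictReader(handle))
```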
<p>So why am I a big fan of plain text data despite all these problems? I believe portability and ease of exploration beats a tight schema-conforming database any time. (Mind you, I am not working in a bank. Or health care.) See your data for what it is and play with it.</p>
]]></content:encoded></item><item><title><![CDATA[Semantic Versioning for Data Products]]></title><description><![CDATA[title: Semantic Versioning for Data Products
published: true
description: Semantic versioning lets you communicate your promises effectively within your data analysis team.
tags: semantic versioning,data science,data analysis
In one of my research pr...]]></description><link>https://readme.codedthinking.com/semantic-versioning-for-data-products</link><guid isPermaLink="true">https://readme.codedthinking.com/semantic-versioning-for-data-products</guid><dc:creator><![CDATA[Miklós Koren]]></dc:creator><pubDate>Wed, 27 Feb 2019 06:35:40 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062904223/b2c8adf3-5b63-4da6-b6f2-f9b3e1657c19.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<hr />
<p>In one of my research projects, I study how Hungarian firms managed by foreign CEOs perform relative to those managed by domestic CEOs. I need to merge data on firm performance to data on manager nationality. This latter data we have collected ourselves in our research lab, based on manager names.</p>
<p>I recently noticed a data fluke that made us classify some Hungarian names in the early 1990s as foreign. Once found, it was relatively easy to fix. Now I want to make sure that my team uses the newer, better data product for manager nationality as opposed to the old one.</p>
<p>Enter semantic versioning: I have to release and refer to <code>manager-db-1.0.1</code>.</p>
<h2 id="heading-what-is-semantic-versioning">What is semantic versioning?</h2>
<p>As you probably know from software development, Semantic Versioning is a set of rules on how to number your releases:</p>
<blockquote>
<p>Given a version number MAJOR.MINOR.PATCH, increment the:</p>
<ol>
<li>MAJOR version when you make incompatible API changes,</li>
<li>MINOR version when you add functionality in a backwards-compatible manner, and</li>
<li>PATCH version when you make backwards-compatible bug fixes.</li>
</ol>
</blockquote>
<p>I think the exact same rules can be applied to data products. Fixing a bug in the data is a patch, hence I incremented the patch number. Adding a new column to a data table is added functionality, so you should increment the minor version. But when do you increment the major version? What is the API of a data product?</p>
<p>The work of my fellow analysts depends on my data product in two ways:</p>
<ul>
<li>Their code is dependent on the schema of my data files. If I delete or rename a column, or change its meaning, their code may stop running. Schema is like compile-time dependence for code.</li>
<li>Their output is dependent on the content of my data files. If I add new rows, their statistical analysis may yield different numbers. Content is like runtime dependence for code.</li>
</ul>
<p>Because of that, schema should be protected even more than content. Analysts are used to iterative work and changing outcomes. But they hate breaking code. So schema is the public contract I am offering with my data product, and I should increment the major version every time I change the schema in an incompatible way.</p>
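<p>The bookkeeping itself is trivial; the judgement is in classifying the change. A sketch in Python (the change labels are mine, for illustration):</p>

```python
def bump(version, change):
    # change is one of: 'schema' (breaking schema change),
    # 'feature' (backwards-compatible addition), 'fix' (bug fix in the data)
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "schema":
        return f"{major + 1}.0.0"
    if change == "feature":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"

bump("1.0.0", "fix")      # '1.0.1': fixing the Hungarian-name fluke
bump("1.0.1", "feature")  # '1.1.0': adding a column or new rows
bump("1.1.0", "schema")   # '2.0.0': breaking the public contract
```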
<h2 id="heading-how-does-this-work-in-practice">How does this work in practice?</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062904223/b2c8adf3-5b63-4da6-b6f2-f9b3e1657c19.jpeg" alt="Photo by Lars Blankers on Unsplash" /></p>
<p>Take this simple csv table of foods, for example.</p>
<pre><code>---food<span class="hljs-number">-1.0</span><span class="hljs-number">.0</span>.csv
food,category,diet_quality_score
apple,fruit,<span class="hljs-number">2</span>
tomato,vegetable,<span class="hljs-number">2</span>
</code></pre><p>Before I release it as <code>food-1.0.0.csv</code>, I have to explicitly declare its schema, for example, in Cerberus:</p>
<pre><code>{<span class="hljs-string">'food'</span>: {<span class="hljs-string">'type'</span>: <span class="hljs-string">'string'</span>}, <span class="hljs-string">'category'</span>: {<span class="hljs-string">'type'</span>: <span class="hljs-string">'string'</span>, <span class="hljs-string">'allowed'</span>: [<span class="hljs-string">'fruit'</span>, <span class="hljs-string">'vegetable'</span>]}, <span class="hljs-string">'diet_quality_score'</span>: {<span class="hljs-string">'type'</span>: <span class="hljs-string">'integer'</span>}}
</code></pre><p>Then I realize that tomato is, scientifically speaking, a fruit, not a vegetable. This is a patch, released as <code>food-1.0.1.csv</code>:</p>
<pre><code>---food<span class="hljs-number">-1.0</span><span class="hljs-number">.1</span>.csv
food,category,diet_quality_score
apple,fruit,<span class="hljs-number">2</span>
tomato,fruit,<span class="hljs-number">2</span>
</code></pre><p>Reading more about nutrition, I add carrot to my dataset. This is added functionality, so the new name is <code>food-1.1.0.csv</code>. Notice that the patch number was reset to zero.</p>
<p>All the downstream code is still compatible with this new dataset. My team will see the new versions and will expect runtime changes, but they can safely reuse their old code.</p>
<pre><code>---food<span class="hljs-number">-1.1</span><span class="hljs-number">.0</span>.csv
food,category,diet_quality_score
apple,fruit,<span class="hljs-number">2</span>
tomato,fruit,<span class="hljs-number">2</span>
carrot,vegetable,<span class="hljs-number">1</span>
</code></pre><p>Time to break the API. I add deep fried Mars bar, which is neither a fruit nor a vegetable.</p>
<pre><code>---food<span class="hljs-number">-2.0</span><span class="hljs-number">.0</span>.csv
food,category,diet_quality_score
apple,fruit,<span class="hljs-number">2</span>
tomato,fruit,<span class="hljs-number">2</span>
carrot,vegetable,<span class="hljs-number">1</span>
deep fried mars bar,fried stuff,<span class="hljs-number">-10</span>
</code></pre><p>Is this really breaking the API? The dataset has the same columns and the meaning of each column is the same as before. But I promised to only have fruit or vegetable in column 2 and I broke that promise. If your code relies on that promise, it will break too. Hence I bumped the major version.</p>
<p>How much you promise and how hard you fight to keep that promise is a judgement call. But semantic versioning lets you communicate your promises effectively within your data analysis team.</p>
]]></content:encoded></item></channel></rss>