<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[README]]></title><description><![CDATA[README]]></description><link>https://readme.codedthinking.com</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1696062037300/qICXj87l_.png</url><title>README</title><link>https://readme.codedthinking.com</link></image><generator>RSS for Node</generator><lastBuildDate>Sat, 18 Apr 2026 10:32:19 GMT</lastBuildDate><atom:link href="https://readme.codedthinking.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Automate Your Data Work With Make]]></title><description><![CDATA[I like to think that you can remain productive over 40. Make is 43 this year and it is still my tool of choice to automate my data cleaning or data analysis. It is versatile and beautifully simple. (At first.) Yet, in a recent survey, we found that l...]]></description><link>https://readme.codedthinking.com/automate-your-data-work-with-make</link><guid isPermaLink="true">https://readme.codedthinking.com/automate-your-data-work-with-make</guid><category><![CDATA[Makefile]]></category><category><![CDATA[Data Science]]></category><dc:creator><![CDATA[Miklós Koren]]></dc:creator><pubDate>Thu, 25 Nov 2021 15:41:43 GMT</pubDate><content:encoded><![CDATA[<hr />
<p>I like to think that you can remain productive over 40. <a target="_blank" href="https://en.wikipedia.org/wiki/Make_(software)">Make</a> is 43 this year and it is still my tool of choice for automating data cleaning and data analysis. It is versatile and beautifully simple. (At first.) Yet, <a target="_blank" href="https://gist.github.com/csokaimola/219911140de94e01851cc621f50ea794">in a recent survey</a>, we found that less than 5 percent of data-savvy economists use Make regularly.</p>
<h2 id="heading-what-is-make">What is Make?</h2>
<p>Most build systems are meant to, well, build things. Compile code in Java, C, and the like. Make is supposed to do that, too, and most tutorials and StackOverflow questions will feature examples about how to build C code.</p>
<p>But at its most basic, Make is indeed beautifully simple. I create a text file called <code>Makefile</code> in my folder with the following content.</p>
<pre><code class="lang-makefile"><span class="hljs-section">clean_data.csv: raw_data.csv data_cleaner.py</span>
    python data_cleaner.py
</code></pre>
<p>Then I say <code>make</code> in the shell and Make creates <code>clean_data.csv</code> from <code>raw_data.csv</code>.</p>
<p>In other words, I need to specify</p>
<pre><code class="lang-makefile"><span class="hljs-section">target: source</span>
    recipe
</code></pre>
<p>and Make will run the recipe for me.</p>
<p>This information is something I want to note for my documentation anyway. What does my script need and what does it produce? I might as well put it in a Makefile.</p>
<p>This way, I can link up a chain of data work,</p>
<pre><code class="lang-makefile"><span class="hljs-section">visualization.pdf: clean_data.csv visualize.py</span>
    python visualize.py
<span class="hljs-section">clean_data.csv: raw_data.csv data_cleaner.py</span>
    python data_cleaner.py
</code></pre>
<p>When I enter <code>make</code> in the shell, I get my <code>visualization.pdf</code> recreated right from raw data.</p>
<blockquote>
<p>Order matters here. Typing <code>make</code> without any arguments recreates the <em>first</em> target found in the file called <code>Makefile</code>. I can also type <code>make clean_data.csv</code> if I want to recreate a specific target.</p>
</blockquote>
<h2 id="heading-only-do-what-is-needed">Only do what is needed</h2>
<p>Suppose I don't like the color in my graph and decide to edit <code>visualize.py</code>. But data cleaning takes a lot of time! If <code>clean_data.csv</code> is already up to date (relative to the time stamps of <code>raw_data.csv</code> and <code>data_cleaner.py</code>), Make will skip that step and only redo the visualization recipe.</p>
<p>You don't have to rerun everything. Lazy is good. One more reason why you want to write <a target="_blank" href="https://dev.to/korenmiklos/the-tupperware-approach-to-coding-1g74">modular code</a>.</p>
<h2 id="heading-variables-and-functions">Variables and functions</h2>
<p>As soon as you feel the power of your first few simple Makefiles, you will crave more. Can I do this? Can I do that? The answer is <em>yes, you can, but it will take a lot of searching on StackOverflow</em>.</p>
<p>One feature I use regularly is <em>automatic variables</em>. If I don't want to hard code file names into my neat Python script (you'll see shortly why), I can pass the names of target and source as variables.</p>
<pre><code class="lang-makefile"><span class="hljs-section">clean_data.csv: raw_data.csv data_cleaner.py</span>
    python data_cleaner.py &lt; <span class="hljs-variable">$&lt;</span> &gt; <span class="hljs-variable">$@</span>
</code></pre>
<p>This passes <code>raw_data.csv</code> (the variable <code>$&lt;</code> refers to the first source file) to the STDIN of <code>data_cleaner.py</code> and saves the output on STDOUT to <code>clean_data.csv</code> (the variable <code>$@</code> denotes the target).</p>
<p>Why these symbols? Don't ask me. They don't look pretty but they get the job done.</p>
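<p>To make the redirection concrete, here is a minimal sketch of what such a <code>data_cleaner.py</code> could look like. This is my illustration, not code from the Makefile above; the cleaning rule (drop rows with an empty first field) is a stand-in for real cleaning logic.</p>
<pre><code class="lang-python">import csv
import sys

def clean(rows):
    # Stand-in cleaning rule: drop rows whose first field is empty.
    return [row for row in rows if row and row[0].strip()]

if __name__ == "__main__":
    # Make redirects STDIN and STDOUT for us, so the script never
    # needs to know the actual file names.
    writer = csv.writer(sys.stdout)
    writer.writerows(clean(csv.reader(sys.stdin)))
</code></pre>
<p>Because the file names only live in the Makefile, renaming <code>raw_data.csv</code> later means editing one line, not hunting through scripts.</p>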
<p>I can also use <a target="_blank" href="https://www.gnu.org/software/make/manual/html_node/Functions.html#Functions">functions</a> like</p>
<pre><code class="lang-makefile"><span class="hljs-section">clean_data.csv: input/complicated-path/raw_data.csv data_cleaner.py</span>
    python data_cleaner.py <span class="hljs-variable">$(<span class="hljs-built_in">basename</span> $(<span class="hljs-built_in">notdir</span> <span class="hljs-variable">$@</span>)</span>)
</code></pre>
<p>and many more.</p>
<h2 id="heading-parallel-execution">Parallel execution</h2>
<p>And now for the best part. Make can execute my jobs in parallel. On a nicely equipped AWS server, I gladly launch <code>make -j60</code> to do the tasks on 60 threads. Make serves as a job scheduler. Because it knows what depends on what, I will not run into a race condition.</p>
<blockquote>
<ul>
<li><p>Knock, knock.</p>
</li>
<li><p>Race condition.</p>
</li>
<li><p>Who's there?</p>
</li>
</ul>
</blockquote>
<p>Parallel execution doesn't help if I have a linear chain of recipes as above. But if I can split my dependency graph into parallel branches, they will be executed in the correct order.</p>
<p>So suppose my data is split into two (or many more). The following code would allow for parallel execution of the data cleaning recipe.</p>
<pre><code class="lang-makefile"><span class="hljs-section">visualization.pdf: merged_data.csv visualize.py</span>
    python visualize.py
<span class="hljs-section">merged_data.csv: clean_data_1.csv clean_data_2.csv merge_data.py</span>
    python merge_data.py
<span class="hljs-section">clean_data_%.csv: raw_data_%.csv data_cleaner.py</span>
    python data_cleaner.py &lt; <span class="hljs-variable">$&lt;</span> &gt; <span class="hljs-variable">$@</span>
</code></pre>
<p>I have used the <em>pattern matching</em> character <code>%</code> to match both <code>clean_data_1.csv</code> and <code>clean_data_2.csv</code>.</p>
<p>Invoking Make with the <code>-j</code> option, as in <code>make -j2</code>, will start two processes to clean the data. When <em>both</em> have finished, the merge data recipe runs, then the visualization. (These will be single-threaded.)</p>
<p>I regularly use parallel execution to do Monte Carlo simulations or draw bootstrap samples. Even if I have 500 parallel tasks and only 40 processors, <code>make -j40</code> will patiently grind away at those tasks. And if I kill my jobs to let someone run Matlab for the weekend (why would they do that?), I can simply restart on Monday with only 460 tasks to go.</p>
<ul>
<li><p><a target="_blank" href="https://github.com/korenmiklos/per-shipment-costs-replication/blob/master/Makefile">Simple real-world Makefile</a> with variables and for loops.</p>
</li>
<li><p><a target="_blank" href="https://github.com/korenmiklos/imported-inputs-and-productivity-replication/blob/master/code/Makefile">Not-so simple Makefile</a> with variables, for loops, functions and pattern matching.</p>
</li>
</ul>
<p>Those who still don't like Make? <code>$&lt; $@</code> them.</p>
]]></content:encoded></item><item><title><![CDATA[Wish I Could Be Like David Watts]]></title><description><![CDATA[
Which David Watts? Names are not unique and we want to use keys instead....]]></description><link>https://readme.codedthinking.com/wish-i-could-be-like-david-watts</link><guid isPermaLink="true">https://readme.codedthinking.com/wish-i-could-be-like-david-watts</guid><dc:creator><![CDATA[Miklós Koren]]></dc:creator><pubDate>Tue, 23 Apr 2019 19:30:31 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062847324/c5f0e1d0-f5b3-4d23-94a9-5e359b4924ab.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<hr />
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062847324/c5f0e1d0-f5b3-4d23-94a9-5e359b4924ab.png" alt /></p>
<p>Which David Watts? Names are not unique and we want to <a target="_blank" href="https://medium.com/data-architect/choose-great-keys-d9ebe0485ec5">use keys instead</a>. But how does David Watts become <code>P-12345678</code>? More importantly, how do we know that <em>this</em> David Watts is the same as <em>that</em> David Watts?</p>
<p>This problem is known as <strong>entity resolution</strong> (ER), a.k.a. record linkage, deduplication, or fuzzy matching. (It is different from <em>named entity recognition</em>, where you have to recognize entities in flow text.) It is as complicated as it looks. Names and other fields are misspelled, so if you are too strict, you fail to link two related observations. If you are too fuzzy, you mistakenly link unrelated observations.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062849905/9b1c22f2-8eb3-4d36-9767-e0ff527232da.jpeg" alt />
Photo by Steve Harvey on Unsplash</p>
<p>The first guiding principle of entity resolution is to embrace the imperfections. There is no perfect method, you are just balancing two types of error. <em>False positives</em> occur when you link two observations that, in reality, refer to two different entities. <em>False negatives</em> occur when you fail to link two observations that, in reality, represent the same entity. You can always decrease one type of error at the expense of the other by selecting a more or less stringent matching method.</p>
<p>The second guiding principle is to appreciate the computational complexity. If you are unsure about your data, you have to compare every observation with every other, making <code>N(N-1)/2</code> comparisons in a dataset with <code>N</code> observations. (See box on why it is sufficient to make <em>pairwise</em> comparisons.) In a large dataset this becomes prohibitively many comparisons. For example, if you want to deduplicate users from a dataset with 100,000 observations (a small dataset), you have to make about 5 <em>billion</em> comparisons. Throughout the ER process, you should be looking for ways to reduce the number of necessary comparisons.</p>
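<p>The quadratic blow-up is easy to verify with a two-line sketch (plain arithmetic, nothing specific to any ER library):</p>
<pre><code class="lang-python">def pairwise_comparisons(n):
    # Number of unordered pairs among n observations: n(n-1)/2
    return n * (n - 1) // 2

print(pairwise_comparisons(100_000))  # 4,999,950,000: about 5 billion
</code></pre>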
<blockquote>
<h2 id="heading-methods-aside">Methods aside</h2>
<p>An entity resolution defines groups of observations that belong to the same entity: <code>e = {o1,o2,o3,...}</code>. Maybe surprisingly, it is sufficient to define when a <em>pair of observations</em> denote the same entity, when <code>e(o1) = e(o2)</code>. Because equality is <em>transitive</em>, we can propagate the pairwise relation to the entire dataset: if <code>e(o1) = e(o2)</code> and <code>e(o2) = e(o3)</code> then <code>e(o1) = e(o3)</code> and <code>e = {o1,o2,o3}</code>.</p>
<p>With fuzzy matching, we cannot tell precisely whether the entities behind two observations are <em>equal</em>. We can just calculate a <em>distance</em> between the two observations, <code>d(o1,o2) ≥ 0</code>. The problem with this is that distances are not transitive: if <code>o1</code> and <code>o2</code> are "very close" and so are <code>o2</code> and <code>o3</code>, that does not make <code>o1</code> and <code>o3</code> "very close." We have the <em>triangle inequality</em>, <code>d(o1,o2) + d(o2,o3) ≥ d(o1,o3)</code>, but this is much weaker than transitivity. </p>
<p>The goal of fuzzy matching is to transform a distance into an equality relation. For example, <code>e(o1) = e(o2)</code> whenever <code>d(o1,o2) ≤ D</code> is a simple formula to use. But beware of being too fuzzy: when <code>D</code> is too big, you can end up linking observations that are very different. For example, if you allow for a <em>Levenshtein distance</em> of 2 between a pair of words, you will find that
<code>book</code> <code>=</code> <code>back</code> <code>=</code> <code>hack</code> <code>=</code> <code>hacker</code>. I bet you didn't believe <code>book</code> <code>=</code> <code>hacker</code>.</p>
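<p>You can check this chain yourself. Below is a sketch with a textbook dynamic-programming Levenshtein distance (not any particular library); note that while each neighboring pair is within distance 2, <code>book</code> and <code>hacker</code> are at distance 5.</p>
<pre><code class="lang-python">def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

pairs = [("book", "back"), ("back", "hack"), ("hack", "hacker"), ("book", "hacker")]
for x, y in pairs:
    print(x, y, levenshtein(x, y))  # distances: 2, 1, 2, and 5
</code></pre>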
</blockquote>
<p>The three steps to efficient ER are to Normalize, Match, and Merge.</p>
<p>First you <strong>normalize</strong> your data by eliminating typos and alternative spellings to bring the data to a more structured, more comparable format. For example, a name "Dr David George Watts III" may be normalized to "watts, david." Normalization can give you a lot of efficiency because your comparisons in the next step will be much easier. However, this is also where you can lose the most information if you over-normalize.</p>
<p>Normalization (a.k.a. standardization) is a function that maps your observation to a simpler (often text) representation. During a normalization, you only use one observation and do not compare it to any other observation. That comes later. You can compare to (short) <em>white lists</em>, though. For example, if your observations represent cities, it is useful to compare the <code>city_name</code> field to a list of known cities and correct typos. You can also convert text fields to lower case, drop punctuation and <em>stop words</em>, round or bin numerical values.</p>
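<p>A normalization function in this spirit might look like the sketch below. The stop-word list and the "last, first" output format are my illustrative assumptions, not a standard.</p>
<pre><code class="lang-python">import re

STOP_WORDS = {"dr", "mr", "mrs", "jr", "ii", "iii"}  # illustrative list

def normalize_name(raw):
    # Lower-case, strip punctuation, drop titles and suffixes,
    # then emit "last, first" as the normalized form.
    cleaned = re.sub(r"[^\w\s]", " ", raw.lower())
    words = [w for w in cleaned.split() if w not in STOP_WORDS]
    if len(words) in (0, 1):
        return " ".join(words)
    return f"{words[-1]}, {words[0]}"

print(normalize_name("Dr David George Watts III"))  # watts, david
</code></pre>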
<p>If there is a canonical way to represent the information in your observations, use that. For example, the US Postal Services standardizes US addresses (see figure) and <a target="_blank" href="https://www.usps.com/business/web-tools-apis/address-information-api.htm">provides an API</a> to do that. </p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062852268/245c479c-604f-457d-aa4e-22643462ec81.png" alt /></p>
<p>Then you <strong>match</strong> pairs of observations that are close enough according to your metric. The metric can allow for typos, such as a <em>Levenshtein distance</em>. It can rely on multiple fields such as name, address, phone number, date of birth. You can assign weights to each of these fields: matching on phone number may carry a larger weight than matching on name. You can also opt for a <em>decision tree</em>: only check the date of birth and phone number for very common names, for example.</p>
<p>To minimize the number of comparisons, you typically only evaluate <em>potential matches</em>. This is where normalization can be helpful, as you only need to compare observations with normalized names of "watts, david," or those within the same city, for example.</p>
<p>Once you have matched related observations, you have to <strong>merge</strong> the information they provide about the entity they represent. For example, if you are matching "Dr David Watts" and "David Watts," you have to decide whether the person is indeed a "Dr" and whether you are keeping that information. The merge step involves aggregating information from the individual observations with whatever aggregation function you feel appropriate. You can fill in missing fields (if, say, you find the phone number for David Watts in one observation, use it throughout), use the most complete text representation (such as "Dr David George Watts III"), or simply keep all the variants of a field (by creating a <em>set</em> of name variants, for example, {"David Watts", "Dr David Watts", "Dr David George Watts III"}). </p>
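<p>The merge step can be sketched as a small aggregation over matched observations. The field names and the aggregation rules here (longest name wins, first non-missing phone, keep all name variants) are illustrative choices, not the only sensible ones.</p>
<pre><code class="lang-python">def merge_records(records):
    # Aggregate matched observations into one entity record.
    names = [r["name"] for r in records]
    phones = [r["phone"] for r in records if r.get("phone")]
    return {
        "name": max(names, key=len),   # most complete representation
        "name_variants": set(names),   # keep all the variants
        "phone": phones[0] if phones else None,  # fill in missing fields
    }

observations = [
    {"name": "David Watts", "phone": None},
    {"name": "Dr David George Watts III", "phone": "555-0123"},
]
print(merge_records(observations))
</code></pre>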
<p>Follow through with all three steps to avoid mistakes later.</p>
]]></content:encoded></item><item><title><![CDATA[Spatial Relations]]></title><description><![CDATA[
Measurements often have a spatial dimension. If thinking about time intervals feels complicated, welcome to spatial relations. Wher...]]></description><link>https://readme.codedthinking.com/spatial-relations</link><guid isPermaLink="true">https://readme.codedthinking.com/spatial-relations</guid><dc:creator><![CDATA[Miklós Koren]]></dc:creator><pubDate>Wed, 17 Apr 2019 07:54:46 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062859515/ffddfbea-90be-4826-b443-b9889944da47.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<hr />
<p>Measurements often have a spatial dimension. If <a target="_blank" href="https://dev.to/korenmiklos/spells-221a">thinking about time intervals</a> feels complicated, welcome to <a target="_blank" href="https://en.wikipedia.org/wiki/Spatial_relation"><strong>spatial relations</strong></a>. Where in time there are only points and intervals, there are many more different types of objects in space and many more different relations. An observation may be related to a point, such as a sensor, a line, such as a river or a highway, or an area (often called <em>polygon</em> in spatial analysis) such as a city.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062859515/ffddfbea-90be-4826-b443-b9889944da47.jpeg" alt="Photo by Fleur Treurniet on Unsplash" /></p>
<p>These spatial entities may have many relations to one another. A sensor may be inside a city. A highway may intersect a river at a certain point. A highway may intersect the city. A river may serve as the boundary of the city.</p>
<blockquote>
<h3 id="heading-simple-features">Simple Features</h3>
<p>A <strong>point</strong> is given by a pair of coordinates (x,y). (We ignore 3D and only deal with the surface of the Earth.) A <strong>line</strong> is a list of connected points (x1,y1)--(x2,y2)--... An <strong>area</strong> is a polygon surrounded by a closed line, (x1,y1)--(x2,y2)--...--(x1,y1).
You can also have a collection of each of these items. A country, for example, is often a collection of polygons: a mainland plus islands and exclaves.</p>
</blockquote>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062861665/bebf6c80-084e-4aea-8c10-57a7af0f4f2b.png" alt="By Krauss - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=21299138" /></p>
<p>The first business of understanding spatial relations is to understand the type of spatial observations you have. Cities are not points, though they certainly have midpoints or centers which come up when you enter the city name in Google Maps. Cities are areas. Indeed, very few entities are actual points, though some can be reasonably approximated as such. A precise street address including the street number can safely be approximated with its geocoordinates. </p>
<p>Getting from human-readable addresses to machine-readable GPS coordinates is called <strong>geocoding</strong>. We do this every day when we enter addresses in Google Maps. To do this in a scalable way for all the observations in your dataset, you need a geocoding service. Google Maps has an API, but only allows geocoding for the purposes of showing points on their maps. For bulk geocoding you should turn to other providers such as <a target="_blank" href="https://nominatim.openstreetmap.org/">Nominatim</a>, using OpenStreetMap data.</p>
<blockquote>
<h3 id="heading-projections-and-spatial-reference-systems">Projections and Spatial Reference Systems</h3>
<p>Geocoding converts addresses to a pair of coordinates: latitude and longitude. But what do these coordinates mean? Since a pair of numbers describes a point on a plane, the problem is how to map points on the surface of the Earth (which, contrary to some claims, is not flat) to points on a flat plane. This mapping is called a <strong>projection</strong>. There are many projections, depending on what shape they assume for the Earth, which is slightly different from a perfect sphere. Yes, there is a classification of projections, called the <a target="_blank" href="https://en.wikipedia.org/wiki/Spatial_reference_system">Spatial Reference System Identifier</a>. By far the most widely used is the <a target="_blank" href="https://en.wikipedia.org/wiki/World_Geodetic_System#WGS84">World Geodetic System</a>, WGS84, which has an SRID of 4326. This is what you see in Google Maps and in your GPS. (Mercator projection is what you see on old printed maps, where Greenland looks larger than Africa. Don't ever use Mercator in real data.)</p>
</blockquote>
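<p>One thing WGS84 coordinates buy you is distance. Below is a sketch of the great-circle (haversine) distance, assuming a spherical Earth with a mean radius of 6,371 km; the spherical approximation is accurate to a few tenths of a percent.</p>
<pre><code class="lang-python">from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two WGS84 points, spherical approximation.
    R = 6371.0  # mean Earth radius in km
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlmb / 2) ** 2
    return 2 * R * asin(sqrt(a))

# Budapest (47.50, 19.04) to Vienna (48.21, 16.37): roughly 215 km
print(round(haversine_km(47.50, 19.04, 48.21, 16.37)))
</code></pre>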
<p>If you regularly work with spatial data, you should invest in knowing more about <strong>geographic information systems</strong> (GIS). There are specialized GIS packages to map spatial data or do spatial analysis, such as ESRI ArcGIS, MapInfo, or the open source <a target="_blank" href="https://www.qgis.org/en/site/">Quantum GIS</a>. Many database management tools also implement spatial queries, so you can easily select "all gas stations within 10km of this road."</p>
<p>Whereas points in space can easily be represented by just two numbers, richer spatial features require their own file formats. <a target="_blank" href="https://en.wikipedia.org/wiki/Well-known_text">Well-known text</a> provides a simple text representation of spatial features, such as <code>LINESTRING (30 10, 10 30, 40 40)</code>. This is very intuitive, but not very helpful in practice, where lines and polygons have thousands of vertices. <a target="_blank" href="https://en.wikipedia.org/wiki/GeoJSON">GeoJSON</a> is an open standard extension of JSON. If you are used to working with web apps and JSON data, convert your spatial information to the GeoJSON standard. By now all major GIS packages can read and write GeoJSON. There is also the proprietary binary file format of ESRI Shapefiles. These are widely used because of the ubiquity of the ArcGIS software package. The US Bureau of the Census, for example, publishes the <a target="_blank" href="https://www.census.gov/geo/maps-data/data/tiger-line.html">boundaries of Census tracts</a> in ESRI Shapefiles.</p>
]]></content:encoded></item><item><title><![CDATA[Eggs Are Easier To Ship Than Omelettes]]></title><description><![CDATA[I estimated the regression model we discussed last week and it didn’t work. Which regression model? What do you mean it didn’t work?]]></description><link>https://readme.codedthinking.com/eggs-are-easier-to-ship-than-omelettes</link><guid isPermaLink="true">https://readme.codedthinking.com/eggs-are-easier-to-ship-than-omelettes</guid><dc:creator><![CDATA[Miklós Koren]]></dc:creator><pubDate>Mon, 25 Mar 2019 09:17:34 GMT</pubDate><content:encoded><![CDATA[<hr />
<blockquote>
<ul>
<li>I estimated the regression model we discussed last week and it didn’t work.  </li>
<li>Which regression model? What do you mean it didn’t work?</li>
</ul>
</blockquote>
<p>How often have you had this conversation in your research team? We have the tendency to assume that our coworkers’ minds are magically connected to ours. They’re not. In fact, there is a very <strong>hard boundary</strong> between my thoughts and yours. It always takes real effort to transcend this boundary, and this affects how we collaborate.</p>
<p><img src="https://cdn-images-1.medium.com/max/1600/0*Z3lxEHR8vumzwAfV" alt="Photo by [Jakub Kapusnak](https://unsplash.com/@foodiesfeed?utm_source=medium&amp;utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&amp;utm_medium=referral)" />
Photo by <a target="_blank" href="https://unsplash.com/@foodiesfeed?utm_source=medium&amp;utm_medium=referral">Jakub Kapusnak</a> on <a target="_blank" href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></p>
<p>I have recently introduced a simple template when sharing my work with coauthors. I answer the following four questions and I ask them to do the same.</p>
<ol>
<li>What deliverables have I completed?</li>
<li>What did I learn?</li>
<li>What actions do I need from you?</li>
<li>What are my next steps?</li>
</ol>
<p>For example,</p>
<ol>
<li>Estimated a Poisson regression of post office counts on a bridge proximity indicator: see Table 2.  </li>
<li>After bridges are built, post offices become more frequent within 10km. The effect disappears beyond 20km.  </li>
<li>Review Table 2 and tell me what additional controls to include.  </li>
<li>Download data on river width to be used as an instrument for bridge location.</li>
</ol>
<p>It is motivated by <a target="_blank" href="https://en.wikipedia.org/wiki/Scrum_%28software_development%29#Daily_Scrum">daily scrum meetings</a>, but I have adapted it to the explorative nature of research projects.</p>
<p>In the answer to Question 1, you should list <strong>actual deliverables</strong> (Table 2), not just vague concepts (regression model). You should format the tables and figures for publishing, including notes and labels. You will have to do this at some point anyway, you might as well help your coworker understand what precisely you did to generate Figure 3.</p>
<p>Research is an explorative process, and your insights are an essential input. In Question 2, you can share what you learned. What was <strong>most surprising</strong> to you? Do not just repeat what is in the table or the figure. You don’t want to insult your coworker’s intelligence. This is an opportunity to exercise your analytical judgement.</p>
<p>“<em>FYI</em>” and “<em>What do you think?</em>” don’t cut it. What <strong>specific actions</strong> do you need to go on with your work? If you are stuck somewhere, let them know. If you are unsure about some parts and would need more feedback, let them know.</p>
<p>Much as in scrum, sharing what you are planning next helps bring the team to a common understanding. You are the best positioned to decide on <strong>next steps</strong>, because you are the one who best understands the data and the model you are working with. (If not, ask for feedback in Question 3.) So don’t be afraid to map out your work.</p>
<p>I sometimes just say to Question 4: “<em>Next steps: None. I am happy to answer clarification questions by email or Skype Monday afternoon.</em>” It is better for your teammates to know what they can expect from you, even if it is “<em>nothing</em>.” This is especially important if you are not sharing an office. I have had way too many email ping-pongs about who did what, and if people are not in sync, this can easily take a week or more.</p>
<p>I certainly feel the benefits of this approach. I can catch up faster on my coauthors’ work. We need synchronous status meetings less often, and if we do, they are more productive.</p>
<p>This is just one example of how creating an analytics product with hard boundaries can make you more productive. You should also write <a target="_blank" href="https://dev.to/korenmiklos/the-tupperware-approach-to-coding-1g74">modular code</a> that is <a target="_blank" href="https://dev.to/korenmiklos/everything-is-a-function-4171">free of side effects</a>. And assume (next to) nothing about your teammate’s computing environment. But more on this later.</p>
]]></content:encoded></item><item><title><![CDATA[Spells]]></title><description><![CDATA[
I often work with time spells in my data. For example, a fir...]]></description><link>https://readme.codedthinking.com/spells</link><guid isPermaLink="true">https://readme.codedthinking.com/spells</guid><dc:creator><![CDATA[Miklós Koren]]></dc:creator><pubDate>Thu, 21 Mar 2019 15:49:19 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062872696/d8621ce1-b045-475e-a679-493f613e7802.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<hr />
<p>I often work with time spells in my data. For example, a firm <a target="_blank" href="https://github.com/korenmiklos/expat-analysis">may be managed</a> by different managers for different time spells. Gyöngyi leaves the firm on December 31, 1996, and Gábor starts on January 1, 1997.</p>
<pre><code>   firm    manager   valid_from    valid_to  
 -------- --------- ------------ ------------   
  <span class="hljs-number">123456</span>   Gyöngyi   <span class="hljs-number">1992</span><span class="hljs-number">-01</span><span class="hljs-number">-01</span>   <span class="hljs-number">1996</span><span class="hljs-number">-12</span><span class="hljs-number">-31</span>    
  <span class="hljs-number">123456</span>   Gábor     <span class="hljs-number">1997</span><span class="hljs-number">-01</span><span class="hljs-number">-01</span>   <span class="hljs-number">1999</span><span class="hljs-number">-12</span><span class="hljs-number">-31</span>
</code></pre><p>The standard econometrics toolbox is not well suited for time spells. Often, the first thing an economist does is to convert this data to a format they know: an annual panel. (Or monthly, or weekly, same idea.)</p>
<h4 id="heading-you-can-get-rid-of-time-spells-by-temporal-sampling">You can get rid of time spells by <strong>temporal sampling</strong></h4>
<p>Take a number of time <em>instances</em> and select the observations that were valid at that instance. Take all the managers who were at the firm on June 21, 1997, for example. This reduces the time dimension to time stamps, which are easier to study.</p>
<blockquote>
<p><strong>Why June 21?</strong><br />You may be tempted to sample your data at dates like January 1 or December 31. As firms and data entry users prefer to report round dates, this is potentially dangerous. SolidWork and Co. may report all its changes on December 31, Hungover Ltd. may hold their reporting until January 1. If you sample on December 31, you get the correct data for SolidWork and Co., but last year’s data for Hungover Ltd! To avoid such bunching around round dates, our standard operating procedure at CEU MicroData is to pick a day of the year that is in the middle and is not round: June 21. This also happens to be Midsummer.</p>
</blockquote>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062872696/d8621ce1-b045-475e-a679-493f613e7802.jpeg" alt />
Photo by <a target="_blank" href="https://unsplash.com/@robsonhmorgan?utm_source=medium&amp;utm_medium=referral">Robson Hatsukami Morgan</a> on <a target="_blank" href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></p>
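<p>The sampling step above can be sketched in a few lines of pandas. This is an illustrative sketch, not our production code; the column names follow the spell table above:</p>

```python
import pandas as pd

# Manager spells, as in the table above
spells = pd.DataFrame({
    "firm": [123456, 123456],
    "manager": ["Gyöngyi", "Gábor"],
    "valid_from": pd.to_datetime(["1992-01-01", "1997-01-01"]),
    "valid_to": pd.to_datetime(["1996-12-31", "1999-12-31"]),
})

# Sample each year at June 21: keep the spell valid at that instant
frames = []
for year in range(1992, 2000):
    stamp = pd.Timestamp(year, 6, 21)
    valid = spells[(spells["valid_from"] <= stamp) & (stamp <= spells["valid_to"])]
    frames.append(valid.assign(year=year)[["firm", "manager", "year"]])
panel = pd.concat(frames, ignore_index=True)
```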
<p>This will result in the following data.</p>
<pre><code>firm    manager   year    
 -------- --------- ------   
  <span class="hljs-number">123456</span>   Gyöngyi   <span class="hljs-number">1992</span>    
  <span class="hljs-number">123456</span>   Gyöngyi   <span class="hljs-number">1993</span>    
  <span class="hljs-number">123456</span>   Gyöngyi   <span class="hljs-number">1994</span>    
  <span class="hljs-number">123456</span>   Gyöngyi   <span class="hljs-number">1995</span>    
  <span class="hljs-number">123456</span>   Gyöngyi   <span class="hljs-number">1996</span>    
  <span class="hljs-number">123456</span>   Gábor     <span class="hljs-number">1997</span>    
  <span class="hljs-number">123456</span>   Gábor     <span class="hljs-number">1998</span>  
  <span class="hljs-number">123456</span>   Gábor     <span class="hljs-number">1999</span>
</code></pre><h4 id="heading-whats-wrong-with-this">What’s wrong with this?</h4>
<p>For starters, we are repeating observations. What used to be two lines is now eight. This wastes storage and grossly violates the <a target="_blank" href="https://en.wikipedia.org/wiki/Don%27t_repeat_yourself">DRY principle</a>.</p>
<p>Even worse, even though our data set takes up more space, it contains less information. We don’t know precisely when Gyöngyi started in 1992 and when Gábor took over. We don’t even know if they ever spent time together at the firm. Maybe the snowed-in December of 1996? (We know Gábor was not yet there on June 21.)</p>
<blockquote>
<p>If you believe these are silly arguments, you’re wrong. Serious academic blood has been spilled on this. It took us more than a decade to realize that the <a target="_blank" href="https://www.aeaweb.org/articles?id=10.1257/aer.20141070">first year of a firm</a> is only a partial year.</p>
</blockquote>
<p>We put up with all this mess, because intervals can get tricky. Did you know that there are 13 different relations between time intervals? <strong>X</strong> may <em>take place before</em> <strong>Y</strong>, they may <em>overlap</em>, <strong>X</strong> may <em>finish</em> <strong>Y</strong>, and so forth. Allen’s <a target="_blank" href="https://en.wikipedia.org/wiki/Allen%27s_interval_algebra">interval algebra</a> captures these relations formally.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062874778/59ba94ed-dd69-4dd3-bab7-ab671f712d32.png" alt />
CC BY Wikimedia</p>
<p>This is confusing, but you are unlikely to need all these possible relations. You will need to measure which interval is earlier (ranking intervals by their start time, for example), and to measure overlap. For example, did Gyöngyi and Gábor ever serve at the firm at the same time? This is a question of <em>overlap</em>. Can Gyöngyi be responsible for hiring Gábor? Did she arrive before he did? This is a question of <em>precedence</em>.</p>
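<p>Overlap and precedence are easy to compute directly on the spell endpoints. A minimal sketch in Python; the function names are mine, for illustration:</p>

```python
from datetime import date

def overlap_days(a_start, a_end, b_start, b_end):
    """Days the two closed intervals share; 0 if they are disjoint."""
    latest_start = max(a_start, b_start)
    earliest_end = min(a_end, b_end)
    return max((earliest_end - latest_start).days + 1, 0)

def precedes(a_start, b_start):
    """Did spell A start strictly before spell B?"""
    return a_start < b_start

gyongyi = (date(1992, 1, 1), date(1996, 12, 31))
gabor = (date(1997, 1, 1), date(1999, 12, 31))

overlap_days(*gyongyi, *gabor)   # 0: they never served together
precedes(gyongyi[0], gabor[0])   # True: Gyöngyi arrived first
```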
<h4 id="heading-how-do-you-go-about-modeling-your-data-if-you-dont-want-to-lose-information">How do you go about modeling your data if you don’t want to lose information?</h4>
<p>There are statistical models for time spells: they are called <a target="_blank" href="https://en.wikipedia.org/wiki/Survival_analysis">survival or hazard models</a>. You can model the duration of a manager’s spell: what makes some managers stay longer than others? Or you can model a certain event occurring <em>during</em> their spell: are female managers more likely to start exporting than male managers? Here it is important that some spells are longer than others. Gyöngyi has five years to start exporting, Gábor has only three.</p>
<p>To be sure, hazard models are harder than linear panel models, but since when does hard stop you?</p>
<blockquote>
<p>Find a model that fits your data as it is. Don’t torture your data to conform to models you know.</p>
</blockquote>
<p>As a practical consideration, many database management tools implement what is called a <a target="_blank" href="https://en.wikipedia.org/wiki/Temporal_database">temporal database</a>, capturing the time spell for which an entity or a relation is valid. This makes it even easier to conduct temporal queries such as the examples above.</p>
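<p>Even without a dedicated temporal database, such a query is a one-liner in SQL. Here is a sketch with Python’s built-in sqlite3; the table and column names are made up for illustration:</p>

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE manager_spell (firm INTEGER, manager TEXT, valid_from TEXT, valid_to TEXT)"
)
con.executemany(
    "INSERT INTO manager_spell VALUES (?, ?, ?, ?)",
    [
        (123456, "Gyöngyi", "1992-01-01", "1996-12-31"),
        (123456, "Gábor", "1997-01-01", "1999-12-31"),
    ],
)

# Temporal query: who managed firm 123456 on June 21, 1997?
# ISO-8601 date strings sort lexicographically, so BETWEEN works on TEXT.
row = con.execute(
    "SELECT manager FROM manager_spell WHERE firm = ? AND ? BETWEEN valid_from AND valid_to",
    (123456, "1997-06-21"),
).fetchone()
```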
]]></content:encoded></item><item><title><![CDATA[Everything is a Function]]></title><description><![CDATA[title: Everything is a Function
published: true
description:  Procedural programming comes natural to scientists, because it reads like a precise protocol for an experiment. But everything in data analysis is a function.
tags: data science, functiona...]]></description><link>https://readme.codedthinking.com/everything-is-a-function</link><guid isPermaLink="true">https://readme.codedthinking.com/everything-is-a-function</guid><dc:creator><![CDATA[Miklós Koren]]></dc:creator><pubDate>Tue, 12 Mar 2019 17:40:06 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062880998/bc16a531-938d-4b7e-8d9e-53900851a455.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<hr />
<p>Most scientists start programming in a <a target="_blank" href="https://en.wikipedia.org/wiki/Procedural_programming">procedural style</a>. I certainly did. Procedural programming comes naturally to scientists, because it reads like a precise <a target="_blank" href="https://www.protocols.io/">protocol</a> for an experiment. <em>Do this</em>. <em>Then do that</em>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062880998/bc16a531-938d-4b7e-8d9e-53900851a455.jpeg" alt /></p>
<blockquote>
<p>Photo by <a target="_blank" href="https://unsplash.com/photos/lQGJCMY5qcM?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Hans Reniers</a> on <a target="_blank" href="https://unsplash.com/search/photos/lab-test?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
</blockquote>
<p>I haven’t seen anyone doing data analysis in <a target="_blank" href="https://clojure.org/">Clojure</a>, <a target="_blank" href="https://www.erlang.org/">Erlang</a>, <a target="_blank" href="https://www.haskell.org/">Haskell</a> or another functional language.</p>
<pre><code>output = <span class="hljs-function"><span class="hljs-keyword">function</span>(<span class="hljs-params">inputs</span>)</span>
</code></pre><p>Strange, because if you think about it, <strong>everything in data analysis is a function</strong>. Data cleaning maps from messy data to tidy data. A statistical estimator maps from a sample to a real number. A visualization maps from data to a colorful bitmap. For data analysis, we almost exclusively write code that does not require user interaction and would be well suited to the functional paradigm.</p>
<p>The conventional definition of functional programming is “no side effects.” You only compute output from inputs. You cannot rely on any other information, and you cannot pass on any other information. This very tight discipline is super useful for science, as it is easier to <a target="_blank" href="https://en.wikipedia.org/wiki/Referential_transparency"><strong>argue about correctness</strong></a>. For example, the ordinary least squares estimator of multivariate regressions,</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062882226/c098f847-428b-4ca2-8516-40ee154060c3.png" alt /></p>
<p>is a mathematical function which you can characterize using pencil and paper. The Julia equivalent,</p>
<pre><code class="lang-julia"><span class="hljs-keyword">function</span> OLS(X, Y)  
    <span class="hljs-keyword">return</span> inv(X' * X) * X' * Y  
<span class="hljs-keyword">end</span>
</code></pre>
<p>works independently of what you have done somewhere else in the code. (By the way, <code>X\Y</code> is a better way to write this in Julia.)</p>
<p>Moreover, it is easier to <strong>automate computations</strong> as a chain of functions. If <code>f(X,Y)</code> is the estimator of multivariate coefficients and <code>g(b,X)</code> is a prediction rule, then <code>g(f(X,Y),X)</code> is your fitted machine learning model. Relying on pure functions makes the data science process more reproducible.</p>
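<p>The chain <code>g(f(X,Y),X)</code> can be written out directly. A sketch in Python with NumPy, solving the normal equations for brevity (a production version would use a least-squares routine):</p>

```python
import numpy as np

def fit(X, Y):
    # f(X, Y): OLS coefficients, a pure function of its inputs
    return np.linalg.solve(X.T @ X, X.T @ Y)

def predict(b, X):
    # g(b, X): the prediction rule
    return X @ b

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
Y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=200)

fitted = predict(fit(X, Y), X)  # g(f(X, Y), X): the fitted model
```

Neither function touches anything outside its arguments, so each is as easy to test as a pencil-and-paper formula.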
<h4 id="heading-what-are-some-existing-implementations-of-the-chain-of-functions-approach">What are some existing implementations of the chain of functions approach?</h4>
<p>You can chain small tools in a Unix-like shell <a target="_blank" href="http://swcarpentry.github.io/shell-novice/04-pipefilter/index.html">via the pipe operator</a>. The tool reads from STDIN and writes to STDOUT and (hopefully) does not touch anything else in between. As a data scientist, you can focus on implementing the function correctly, instead of worrying how you get the data and who does what with it. This is why I am a big fan of “<a target="_blank" href="https://medium.com/wunderlist-engineering/is-yelp-international-an-excuse-to-roll-data-with-the-command-line-415dc04499a3">data science from the command line</a>.”</p>
<p>An even better example is <code>%&gt;%</code> piping in R. (Julia has a similar <a target="_blank" href="https://docs.julialang.org/en/v1.1/base/base/#Base.:|%3E">pipe operator</a>.) As I understand from my R colleagues, most idiomatic code now uses this syntax.</p>
<pre><code class="lang-R">x %&gt;% log() %&gt;% diff() %&gt;% exp() %&gt;% round(<span class="hljs-number">1</span>)
</code></pre>
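<p>Python has no built-in pipe operator, but a minimal one fits in a few lines. A sketch of the same chain as the R example above:</p>

```python
from functools import partial, reduce

import numpy as np

def pipe(value, *funcs):
    # A minimal pipe: thread value through each function in turn
    return reduce(lambda acc, f: f(acc), funcs, value)

x = np.array([1.0, 2.0, 4.0])
pipe(x, np.log, np.diff, np.exp, partial(np.round, decimals=1))
# exp(diff(log(x))) gives the ratios of consecutive elements: array([2., 2.])
```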
<p>At some level, even scripts in languages like Stata (do-files) can be thought of as a chain of functions. A strict limitation of Stata is that you can only carry out computations on a single dataframe at a time. This limitation has huge benefits, though. You can write functional code that maps from one state of your dataframe to the next. For example,</p>
<pre><code class="lang-stata">generate y = log(x)  
replace y = 0 if x &lt; 0
</code></pre>
<p>is a chain of two functions. Easy to read, easy to debug. It does the same as the Pandas code</p>
<pre><code class="lang-python">df[<span class="hljs-string">'y'</span>] = np.log(df[<span class="hljs-string">'x'</span>])  
df.loc[df[<span class="hljs-string">'x'</span>] &lt; <span class="hljs-number">0</span>, <span class="hljs-string">'y'</span>] = <span class="hljs-number">0</span>
</code></pre>
<p>Er, what? This reads more complicated because of a vastly wider state we have to control. What log function do we want to use? Which dataframe are we selecting over? Which dataframe are we changing?</p>
<h4 id="heading-what-is-not-functional">What is not functional?</h4>
<p>Notebooks and other REPLs are not functional, and <a target="_blank" href="https://www.joelonsoftware.com/">Joel Spolsky</a> <a target="_blank" href="https://docs.google.com/presentation/d/1n2RlMdmv1p25Xy5thJUhkKGvjtV-dkAIsUXP-AL4ffI/edit">hates them with a passion</a>. When you move up and down between cells, saving all kinds of variables in your workspace, you confuse yourself about what is an input to your current computation. I sometimes play around in IPython notebooks, but I always feel guilty.</p>
<p><a target="_blank" href="https://jennybryan.org/">Jenny Bryan</a> from RStudio and tidyverse also has something to say about side effects.</p>
<h4 id="heading-a-wish-list-or-new-years-resolution-for-better-data-science">A wish list (or New Year’s resolution) for better data science</h4>
<ol>
<li>Implement a pipe operator in Python. I know it’s hard, but can we just have <em>tidyverse</em> for Python?</li>
<li>Write purely functional Stata code. Separate out input/output and even model estimation, graphing from pure data manipulation code.</li>
<li>Explore <a target="_blank" href="https://www.datahaskell.org/index.html">data science libraries</a> for real functional languages. I know, SQL is functional, but it reads very complicated.</li>
<li>More generally, keep an eye out for side effects. Do I need this global parameter? Do I need to write this to disk? Aim to write functions that are as pure as possible.</li>
</ol>
]]></content:encoded></item><item><title><![CDATA[The Tupperware Approach to Coding]]></title><description><![CDATA[title: The Tupperware Approach to Coding
published: true
description: A tree structure is effective to organize the information you have to keep in your head if you optimize between small and few.
tags: coding, script, software architecture
Coding is...]]></description><link>https://readme.codedthinking.com/the-tupperware-approach-to-coding</link><guid isPermaLink="true">https://readme.codedthinking.com/the-tupperware-approach-to-coding</guid><dc:creator><![CDATA[Miklós Koren]]></dc:creator><pubDate>Tue, 05 Mar 2019 21:29:12 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062887782/4bbf7759-d163-40a6-b0b4-3a69b5813787.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<hr />
<p>Coding is like ultra running. It is a huge, often daunting task. If you don’t want to go crazy, you have to break it into smaller chunks. <em>Before lunch, I will finish this function. At the next aid station, I have to refill my water bottles.</em></p>
<p>Dividing the problem into many small, manageable chunks is one way to deal with complex problems. But if you split the problem into chunks that are too small, you will end up with too many of them. Again you will feel overwhelmed.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062887782/4bbf7759-d163-40a6-b0b4-3a69b5813787.jpeg" alt /></p>
<p>A nested structure with multiple layers is often helpful. When running an ultra, I like to split the race into thirds, the thirds into sections between aid stations, and, indeed, I often just focus on single breaths. For coding, there are libraries, modules, classes, functions and single statements.</p>
<blockquote>
<p>A tree structure is effective to organize the information you have to keep in your head if you optimize between small and few.</p>
</blockquote>
<p>Perhaps the best known example is how we think about time. Time is naturally modular. There are about 30 days in a month and 12 months in a year. (We are lucky with this arrangement. A Saturn year takes about 25,000 Saturn days.) This way, we can have both <em>small</em> and <em>few</em>. I can plan for today. For this week. I can estimate how many weeks a project takes. I can select projects to work on next year.</p>
<p>Notice how I am moving up and down across multiple levels of abstraction. When I make plans for today, I do not pause to think about how these activities affect my goals for the year. (Maybe I should.) When I schedule different projects across the coming weeks, I do not pause to think about whether I will do them in the morning or the afternoon. I just assume that my daily plan will fall in line.</p>
<p>Another well known example is the folder structure on most operating systems. (The earliest mentions of folder hierarchies are from <a target="_blank" href="https://www.computer.org/csdl/proceedings/afips/1958/5053/00/50530059.pdf">1958</a> and <a target="_blank" href="https://multicians.org/fjcc4.html">1965</a>.) I can put a folder inside another folder, down to an arbitrary depth. This way, I can look around in my current folder and have an understanding quickly. If I need more details, I dig deeper into a folder inside.</p>
<blockquote>
<p>Much as a structured calendar and a nice folder structure, a well structured program helps organize your thoughts.</p>
</blockquote>
<p>I have written scripts, especially early in my career, that did everything at once. Thousands of lines of code, executing line by line. Looking through and trying to edit these scripts later is like an ultra runner’s nightmare.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062889931/3cceb60e-3f01-4d22-ae49-b765d4cc6f11.png" alt="Some of the 4569 lines of code in a single script" />
Later on, I erred on the side of too many. In a research project I could easily have 20–30 do files with little organization. Looking back, this makes me nauseous.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062891389/78b3aa97-67eb-4395-aff4-f5e0140f079e.png" alt="Some of 36 scripts." />
So what is the right level of abstraction? What is small enough? How many are few enough?</p>
<blockquote>
<p>Each of your chunks should be small enough to keep in your head.</p>
</blockquote>
<p>You should not look at another piece of code to find out what the current function does. Often, this means only a couple of lines of code per function and a couple of functions per module. Object-oriented languages are modular by design, but you can split up even simple Stata scripts into many smaller pieces.</p>
<blockquote>
<p>And you should not refer to more than 6–8 other chunks in any one layer.</p>
</blockquote>
<p>More than that and you will get lost. Having 10 or more scripts to look at and run is a good indication that you want to introduce additional layers. Can these scripts be differentiated by function? By how often they are called? By what inputs they need? Anything to make you more organized.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062892688/65aafc90-96d5-4ba8-9c5b-e014c923e71f.png" alt="This is much better. But I can still improve the organization of utils." />
Nurture your code with the same love you nurture your calendar.</p>
]]></content:encoded></item><item><title><![CDATA[The Power of Plain Text]]></title><description><![CDATA[title: The Power of Plain Text
published: true
description: I believe portability and ease of exploration beats a tight schema-conforming database any time. 
tags: csv, data, json
I sometimes get excited by binary file formats for storing data. A cou...]]></description><link>https://readme.codedthinking.com/the-power-of-plain-text</link><guid isPermaLink="true">https://readme.codedthinking.com/the-power-of-plain-text</guid><dc:creator><![CDATA[Miklós Koren]]></dc:creator><pubDate>Fri, 01 Mar 2019 20:26:39 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062897822/971c9040-c530-4027-952f-cc535d91c880.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<hr />
<p>I sometimes get excited by binary file formats for storing data. A couple of years ago it was <a target="_blank" href="https://www.hdfgroup.org/solutions/hdf5/">HDF5</a>. Now <a target="_blank" href="https://parquet.apache.org/">Apache Parquet</a> looks pretty promising. But most of my data work, especially if I share it with others, is stored in just simple, plain text.</p>
<blockquote>
<p>I believe portability and ease of exploration beats a tight schema-conforming database any time.</p>
</blockquote>
<p>Be it CSV, JSON or YAML, I love that I can just peek into the data real quick.</p>
<pre><code class="lang-bash">head -n100 data.csv
wc -l data.csv
</code></pre>
<p>are commands I use quite often. And nothing beats the human readability of a nice YAML document.</p>
<p>Sure, performance is sometimes an issue. If you are regularly reading and writing tens of millions of rows, you probably don’t want to use plain text. But in most of our use cases, a data product is read and written maybe a couple of times a day by its developer and then shared with several users who read it once or twice. It is more important to facilitate sharing and discovery than to save some bytes. And you can always zip or gzip. (Never rar or 7z or the like. Do you really expect me to install an app just to read your data?)</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062897822/971c9040-c530-4027-952f-cc535d91c880.jpeg" alt /></p>
<p>Besides size (big) and speed (slow), there are three issues with CSV files:</p>
<ol>
<li>No standard definition. Should all strings be encapsulated in quotes? What happens to quotes inside quotes? Never write your own csv parser. There will be <a target="_blank" href="https://chriswarrick.com/blog/2017/04/07/csv-is-not-a-standard/">special cases</a> you didn’t think of. Use a standard library like <a target="_blank" href="https://docs.python.org/3/library/csv.html">Python3 csv</a> or <a target="_blank" href="https://pandas.pydata.org/">pandas</a>.</li>
<li>Character encoding. As with all plain text files, you have to realize there is no such thing as plain text. Your file is just a sequence of bytes, and you have to tell your computer <a target="_blank" href="https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/">what your bytes mean</a>. In our daily work, conversion to UTF-8 is the first order of business.</li>
<li>No schema. This is a big headache. Is this column a string? A date? I am constantly struggling with leading zeros and weird date formats. (But I would struggle with these in a proprietary data format, too. Date/time functions are impossible to remember in any programming language.) I have played around with schema validation in <a target="_blank" href="http://docs.python-cerberus.org/en/stable/">Cerberus</a> and it looks cool, but we haven’t adopted anything formal yet.</li>
</ol>
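<p>The first two points have a common remedy: decode the bytes explicitly, then hand the text to a standard parser instead of your own. A minimal sketch with Python’s built-in <code>csv</code> module:</p>

```python
import csv
import io

# A file is just bytes; here, two UTF-8 encoded rows
raw = "food,category\nalma,gyümölcs\n".encode("utf-8")

# Decode explicitly, then let the standard library handle quoting rules
with io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", newline="") as handle:
    rows = list(csv.DictReader(handle))
```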
<p>So why am I a big fan of plain text data despite all these problems? I believe portability and ease of exploration beats a tight schema-conforming database any time. (Mind you, I am not working in a bank. Or health care.) See your data for what it is and play with it.</p>
]]></content:encoded></item><item><title><![CDATA[Semantic Versioning for Data Products]]></title><description><![CDATA[title: Semantic Versioning for Data Products
published: true
description: Semantic versioning lets you communicate your promises effectively within your data analysis team.
tags: semantic versioning,data science,data analysis
In one of my research pr...]]></description><link>https://readme.codedthinking.com/semantic-versioning-for-data-products</link><guid isPermaLink="true">https://readme.codedthinking.com/semantic-versioning-for-data-products</guid><dc:creator><![CDATA[Miklós Koren]]></dc:creator><pubDate>Wed, 27 Feb 2019 06:35:40 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062904223/b2c8adf3-5b63-4da6-b6f2-f9b3e1657c19.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<hr />
<p>In one of my research projects, I study how Hungarian firms managed by foreign CEOs perform relative to those managed by domestic CEOs. I need to merge data on firm performance to data on manager nationality. This latter data we have collected ourselves in our research lab, based on manager names.</p>
<p>I recently noticed a data fluke that made us classify some Hungarian names in the early 1990s as foreign. Once found, it was relatively easy to fix. Now I want to make sure that my team uses the newer, better data product for manager nationality as opposed to the old one.</p>
<p>Enter semantic versioning: I have to release and refer to <code>manager-db-1.0.1</code>.</p>
<h2 id="heading-what-is-semantic-versioning">What is semantic versioning?</h2>
<p>As you probably know from software development, Semantic Versioning is a set of rules on how to number your releases:</p>
<blockquote>
<p>Given a version number MAJOR.MINOR.PATCH, increment the:</p>
<ol>
<li>MAJOR version when you make incompatible API changes,</li>
<li>MINOR version when you add functionality in a backwards-compatible manner, and</li>
<li>PATCH version when you make backwards-compatible bug fixes.</li>
</ol>
</blockquote>
<p>I think the exact same rules can be applied to data products. Fixing a bug in the data is a patch, hence I incremented the patch number. Adding a new column to a data table is added functionality, so you should increment the minor version. But when do you increment the major version? What is the API of a data product?</p>
<p>The work of my fellow analysts depends on my data product in two ways:</p>
<ul>
<li>Their code is dependent on the schema of my data files. If I delete or rename a column, or change its meaning, their code may stop running. Schema is like compile-time dependence for code.</li>
<li>Their output is dependent on the content of my data files. If I add new rows, their statistical analysis may yield different numbers. Content is like runtime dependence for code.</li>
</ul>
<p>Because of that, schema should be protected even more than content. Analysts are used to iterative work and changing outcomes. But they hate breaking code. So schema is the public contract I am offering with my data product, and I should increment the major version every time I change the schema in an incompatible way.</p>
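<p>The bookkeeping itself is trivial; the judgement is in classifying the change. A sketch in Python (the change labels are mine, for illustration):</p>

```python
def bump(version, change):
    # change is one of: 'schema' (breaking schema change),
    # 'feature' (backwards-compatible addition), 'fix' (bug fix in the data)
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "schema":
        return f"{major + 1}.0.0"
    if change == "feature":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"

bump("1.0.0", "fix")      # '1.0.1': fixing the Hungarian-name fluke
bump("1.0.1", "feature")  # '1.1.0': adding a column or new rows
bump("1.1.0", "schema")   # '2.0.0': breaking the public contract
```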
<h2 id="heading-how-does-this-work-in-practice">How does this work in practice?</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1696062904223/b2c8adf3-5b63-4da6-b6f2-f9b3e1657c19.jpeg" alt="Photo by Lars Blankers on Unsplash" /></p>
<p>Take this simple csv table of foods, for example.</p>
<pre><code>---food<span class="hljs-number">-1.0</span><span class="hljs-number">.0</span>.csv
food,category,diet_quality_score
apple,fruit,<span class="hljs-number">2</span>
tomato,vegetable,<span class="hljs-number">2</span>
</code></pre><p>Before I release it as <code>food-1.0.0.csv</code>, I have to explicitly declare its schema, for example, in Cerberus:</p>
<pre><code>{<span class="hljs-string">'food'</span>: {<span class="hljs-string">'type'</span>: <span class="hljs-string">'string'</span>}, <span class="hljs-string">'category'</span>: {<span class="hljs-string">'type'</span>: <span class="hljs-string">'string'</span>, <span class="hljs-string">'allowed'</span>: [<span class="hljs-string">'fruit'</span>, <span class="hljs-string">'vegetable'</span>]}, <span class="hljs-string">'diet_quality_score'</span>: {<span class="hljs-string">'type'</span>: <span class="hljs-string">'integer'</span>}}
</code></pre><p>Then I realize that tomato is, scientifically speaking, a fruit, not a vegetable. This is a patch, released as <code>food-1.0.1.csv</code>:</p>
<pre><code>---food<span class="hljs-number">-1.0</span><span class="hljs-number">.1</span>.csv
food,category,diet_quality_score
apple,fruit,<span class="hljs-number">2</span>
tomato,fruit,<span class="hljs-number">2</span>
</code></pre><p>Reading more about nutrition, I add carrot to my dataset. This is added functionality, so the new name is <code>food-1.1.0.csv</code>. Notice that the patch number was reset to zero.</p>
<p>All the downstream code is still compatible with this new dataset. My team will see the new versions and will expect runtime changes, but they can safely reuse their old code.</p>
<pre><code>---food<span class="hljs-number">-1.1</span><span class="hljs-number">.0</span>.csv
food,category,diet_quality_score
apple,fruit,<span class="hljs-number">2</span>
tomato,fruit,<span class="hljs-number">2</span>
carrot,vegetable,<span class="hljs-number">1</span>
</code></pre><p>Time to break the API. I add deep fried Mars bar, which is neither a fruit nor a vegetable.</p>
<pre><code>---food<span class="hljs-number">-2.0</span><span class="hljs-number">.0</span>.csv
food,category,diet_quality_score
apple,fruit,<span class="hljs-number">2</span>
tomato,fruit,<span class="hljs-number">2</span>
carrot,vegetable,<span class="hljs-number">1</span>
deep fried mars bar,fried stuff,<span class="hljs-number">-10</span>
</code></pre><p>Is this really breaking the API? The dataset has the same columns and the meaning of each column is the same as before. But I promised to only have fruit or vegetable in column 2 and I broke that promise. If your code relies on that promise, it will break too. Hence I bumped the major version.</p>
<p>How much you promise and how hard you fight to keep that promise is a judgement call. But semantic versioning lets you communicate your promises effectively within your data analysis team.</p>
]]></content:encoded></item></channel></rss>