Reproducibility from Day 1

Lars Vilhuber

Marie Connolly

Miklós Koren

2025-01-01

Follow along

larsvilhuber.github.io/day1-tutorial/

Reproducibility from Day 1

Journals require that you share your code and data in a replication package at the end of your research project.

Following some best practices from day 1 can not only help you prepare this package later, but also make you more productive researchers.

What is a replication package?

Example of deposit

AEA policy

Goal

Illustrate principles of reproducible research from the start
Stay reasonably close to an ideal reproducibility standard
Use tools that are widely available and easy to use

Scenario

We start with an empty folder, and an idea.

Does procurement in the EU have a bias towards local providers?

We finish with a mini-project about public procurement across various European countries.

Setup

Tools needed:

Stata. It is feasible to do the whole exercise in R as well, but examples will be in Stata.
text editor (we suggest VS Code)
web browser
file browser (e.g., Windows Explorer, Finder, Nautilus, etc.)

First steps

Create an empty folder, like Desktop/day1
Navigate to this folder in Stata: cd ../Desktop/day1
Open this folder in your text editor, if it can open folders (optional)
Google “CEPII GeoDist”

Keeping on top of provenance

Licenses
Streamlining for reproducibility

Licenses

Where does the file come from?

How can we describe this later to somebody?
- Pointing and clicking takes a long time to describe
- What are the rights we have?

What is a license?

A license (licence) is an official permission or permit to do, use, or own something (as well as the document of that permission or permit).¹ ²

Examples

Creative Commons licenses, used for artistic products and data
Open Source licenses (BSD, GPL, MIT, etc.), used for software (code)

License applying to Geodist data

CEPII GeoDist is under an “Etalab 2.0 license”

Can we re-publish the file?

Various guidance pages provided by data editors and others:

Downloading via code

Easiest:

Stata

use "$URL" , clear

Why not?

will it be there in two months? in 6 years?
what if the internet connection is down?

Easy:

Stata

global URL "https://www.cepii.fr/distance/dist_cepii.dta"
copy "$URL" (outputfile), replace

R

URL <- "https://www.cepii.fr/distance/dist_cepii.dta"
download.file(url = URL, destfile = "(outputfile)", mode = "wb")

We will get to even better methods a bit later

Creating a README

Template README
- Cite both dataset and working paper
- Add data URL and time accessed (can you think of a way to automate this?)
- Add a link to license (also: download and store the license)
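One possible way to automate the access record is sketched below, in Python for illustration (the same idea carries over to Stata or R): every time the download script runs, append the URL and a UTC timestamp to a small text file that the README can point to. The function name `record_access` and the log file name `data/ACCESS_LOG.txt` are assumptions made for this sketch, not part of the tutorial's code.

```python
from datetime import datetime, timezone
from pathlib import Path


def record_access(url, logfile="data/ACCESS_LOG.txt"):
    """Append one line with the UTC time of access and the data URL."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    log = Path(logfile)
    log.parent.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    with log.open("a", encoding="utf-8") as fh:
        fh.write(f"{stamp}\t{url}\n")
    return stamp
```

Calling `record_access("https://www.cepii.fr/distance/dist_cepii.dta")` right after the download keeps the "time accessed" in the README honest without anybody having to remember it.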

Link

Step 1: Stata, R ³

Structure of the project

Folder by function

Let’s start with something easy. Separate folders for each function: code/ and data/

code/
data/

TIER Protocol

The TIER Protocol is a set of guidelines for organizing reproducible research.

TIER Protocol

Scripts by function

We will download -> Create a script download_data.do

Paths in scripts

🛑Do not hard-code paths!

copy "$URL" "C:\Users\lv39\Desktop\day1\data\dist_cepii.dta", replace

Why?

Names in scripts

🛑Do not rename data files!

copy "$URL" "C:\Users\lv39\Desktop\day1\data\that_file_from_cepii.dta", replace

Why?
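A hard-coded absolute path only works on one computer, and a renamed file is harder to trace back to its source. A minimal sketch of the alternative, in Python for illustration: build the destination from a folder relative to the project root and keep the file name exactly as the provider named it. The helper name `destination_for` is hypothetical.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def destination_for(url, datadir="data"):
    """Return datadir/<original file name>, relative to the project root."""
    filename = PurePosixPath(urlparse(url).path).name  # e.g. dist_cepii.dta
    if not filename:
        raise ValueError(f"URL has no file name: {url}")
    return Path(datadir) / filename
```

The result is a relative path, so it resolves correctly wherever the project folder is placed, as long as the code is run from the project root.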

Expanding data downloads

Redo the same thing for other data

Tender data: https://data.europa.eu/euodp/en/data/dataset/ted-csv
- Too big, therefore: https://github.com/codedthinking/tender-home-bias/releases/download/v2.0/ted-sample.csv
Country codes: https://datahub.io/core/country-codes

Link

Step 2: Stata ⁴ R

Data Cleaning

Need to merge the data

Write a merging script. Create the sample.

These two can go in the same script, create_analysis_sample.do

Download from here

Where does the merged data go?

No overwriting of original data!

The raw data should go into a data/raw directory.
The merged data should go in the data/derived directory.

Moving things around

Create directories
Consolidate paths

Link

Step 3: Stata ⁵

Making code more robust

Re-running takes time
What if a file is no longer there?
What if we need packages

Making download code robust

Only download if necessary
- file is missing
- file has changed

Robust download code (Stata)

Download only if we want to download

global redownload 0

if $redownload == 1 {
   // what to do when file does NOT exist
   copy "https://datahub.io/core/country-codes/r/country-codes.csv" "data/raw/country-codes.csv", replace
}

Robust download code (2)

Automatically download the file again if not there.

global redownload 0

capture confirm file  "data/raw/country-codes.csv"
if _rc != 0 {
    global redownload 1
}

if $redownload == 1 {
...

Robust download code (3)

What if the file has changed?

if $redownload == 1 {
   copy "https://datahub.io/core/country-codes/r/country-codes.csv" "data/raw/country-codes.csv", replace
   // create checksum of file
   // Aug 2023 version: 2295658388
   global countrycksum 2295658388
   checksum "data/raw/country-codes.csv", save replace
   assert $countrycksum == r(checksum)
   // This will fail if the files are not identical
   // Provide a verbose message if we get past this point
   disp in green "Country codes file downloaded successfully"
}

Robust download code (4)

Be informative!

...
} 
else {
   // what to do when file does exist
   disp in green "Country codes file already exists"
}
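The same three ideas (download only when needed, verify the file, be informative) can be sketched in Python. This is an illustration under stated assumptions, not a translation of the Stata code: it uses a SHA-256 digest from `hashlib` in place of Stata's `checksum`, and the function names `sha256_of` and `fetch_if_needed` are made up for the example.

```python
import hashlib
import urllib.request
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_if_needed(url, dest, expected_sha256=None, force=False):
    """Download url to dest only if dest is missing or force is set.

    Returns True if a download happened, False if the existing file was kept.
    Raises ValueError when the file does not match expected_sha256.
    """
    dest = Path(dest)
    downloaded = False
    if force or not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(dest))
        downloaded = True
    if expected_sha256 is not None and sha256_of(dest) != expected_sha256:
        raise ValueError(f"Checksum mismatch for {dest}")
    print(f"{dest}: {'downloaded' if downloaded else 'already exists'}")
    return downloaded
```

Unlike the Stata version, this sketch checks the digest on every run, not only right after a download, so a file that was modified locally is also caught.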

Link

Step 3 (with robust download code) Stata ⁶

Use the main file to control larger pieces

Change flags, don’t comment out code

No manual manipulation

“Change the parameter to 0.2, then run the code again”
“Compute the percentages for Table 2 by hand”

Systematic automation

Use functions, ado files, programs, macros, subroutines
Use loops, parameters, parameter files to call those subroutines
Use placeholders (globals, macros, libnames, etc.) for common locations ($CONFDATA, $TABLES, $CODE)
Compute all numbers in package
- No manual calculation of numbers
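As an illustration of replacing "change the parameter, then run the code again" with a loop, here is a small Python sketch. The statistic (`share_above`), the threshold values and the output file are invented for the example; the point is only that every number in the table is computed and written by code.

```python
import csv


def share_above(values, threshold):
    """Share of observations strictly above threshold, as a percentage."""
    return 100.0 * sum(v > threshold for v in values) / len(values)


def write_table(values, thresholds, outfile):
    """Compute one row per threshold and write the table as CSV."""
    rows = [(t, round(share_above(values, t), 1)) for t in thresholds]
    with open(outfile, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["threshold", "percent_above"])
        writer.writerows(rows)
    return rows
```

Adding a fourth threshold is then a one-word change to the parameter list, and the percentages never need to be computed by hand.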

Example (1)

// Header of main.do
// Define which steps should be run
global step1 1
global step2 1
global step3 1

do "code/00_setup.do"
if $step1 == 1  do "code/01_download_data.do"
if $step2 == 1  do "code/02_create_analysis_sample.do"
if $step3 == 1  do "code/03_analysis.do"

Example (2)

Here we always run the 00_setup.do file.

// Header of main.do
// Define which steps should be run
global step1 1
global step2 1
global step3 1

do "code/00_setup.do"
if $step1 == 1  do "code/01_download_data.do"
if $step2 == 1  do "code/02_create_analysis_sample.do"
if $step3 == 1  do "code/03_analysis.do"

Example (3)

Then conditionally run the other pieces:

// Header of main.do
// Define which steps should be run
global step1 1
global step2 1
global step3 1

do "code/00_setup.do"
if $step1 == 1  do "code/01_download_data.do"
if $step2 == 1  do "code/02_create_analysis_sample.do"
if $step3 == 1  do "code/03_analysis.do"
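The flag pattern is not specific to Stata. A minimal Python sketch of the same control flow, assuming each step is available as a function (the name `run_pipeline` and the step names in the usage note are hypothetical):

```python
def run_pipeline(steps, flags):
    """Run each named step whose flag is true; return the names that ran.

    steps: list of (name, function) pairs, in execution order.
    flags: dict mapping step name to True/False; missing names are skipped.
    """
    ran = []
    for name, func in steps:
        if flags.get(name, False):
            print(f"Running {name}")
            func()
            ran.append(name)
        else:
            print(f"Skipping {name}")
    return ran
```

A call such as `run_pipeline(steps, {"download": False, "sample": True, "analysis": True})` skips the download but runs the rest. In this sketch a step without a flag is skipped with a message instead of stopping the run; whether that or a hard error is preferable is a design choice.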

Starting to be complex

Let’s use a separate config.do file to contain configuration parameters

// file locations
// code to set rootdir omitted
global inputdata "$rootdir/data/inputs"
global tempdata  "$rootdir/temporary"
global outputs   "$rootdir/tables-figures"

// ensure they are created
cap mkdir "$tempdata"
cap mkdir "$outputs"

Example (4)

So let’s automate some of this:

include "config.do"

// define steps
global step1 1
global step2 1
global step3 1

// Nothing needs to be changed here
do "$rootdir/code/00_setup.do"
if $step1 == 1  do "$rootdir/code/01_download_data.do"
if $step2 == 1  do "$rootdir/code/02_create_analysis_sample.do"
if $step3 == 1  do "$rootdir/code/03_analysis.do"

Example (5)

include "config.do"

// define steps
global step1 1
global step2 1
global step3 1

// Nothing needs to be changed here
do "$rootdir/code/00_setup.do"
if $step1 == 1  do "$rootdir/code/01_download_data.do"
if $step2 == 1  do "$rootdir/code/02_create_analysis_sample.do"
if $step3 == 1  do "$rootdir/code/03_analysis.do"

Example (6)

Configure the steps on certain conditions:

// define steps
global step1 1
global step2 1
global step3 1

// verify if file has changed
qui checksum "$resultfile1"
// if not, don't run Step 2
if `r(checksum)' == $checksum1 global step2 0 

// Nothing needs to be changed here
do "$rootdir/code/00_setup.do"
if $step1 == 1  do "$rootdir/code/01_download_data.do"
if $step2 == 1  do "$rootdir/code/02_create_analysis_sample.do"
if $step3 == 1  do "$rootdir/code/03_analysis.do"

and config.do contains additional information:

// file locations
// code to set rootdir omitted
global inputdata "$rootdir/data/inputs"
global tempdata  "$rootdir/temporary"
global outputs   "$rootdir/tables-figures"

// ensure they are created
cap mkdir "$tempdata"
cap mkdir "$outputs"

// some key parameters
global resultfile1 "$outputs/table1.tex"
global checksum1   386698503

Why can this be useful?

Consider a final test if everything runs:

delete temporary/ and tables-figures/ folders.
might even delete the downloaded files
then run the main.do file again.

This will test if everything works!
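The clean-up before such a test run can itself be scripted, so that it is done the same way every time. A Python sketch, assuming the folder names from config.do above (`temporary` and `tables-figures`); the function name `clean_generated` is made up for the example:

```python
import shutil
from pathlib import Path


def clean_generated(rootdir, folders=("temporary", "tables-figures")):
    """Delete generated folders under rootdir; return the ones removed."""
    removed = []
    for name in folders:
        target = Path(rootdir) / name
        if target.is_dir():
            shutil.rmtree(target)
            removed.append(name)
    return removed
```

Only the folders named explicitly are touched, so code and raw data survive unless the raw data folder is deliberately added to the list.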

Secrets in the code

What are secrets?

API keys
Login credentials for data access
File paths (FSRDC!)
Variable names (IRS!)

Standard practice

Store secrets in environment variables or files that are not published.

Some services are serious about this

Github secret scanning

Where to store secrets

environment variables
“dot-env” files (Python), “Renviron” files (R)
or some other clearly identified file in the project or home directory

Environment variables

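Typed interactively (here for Linux and Mac)

export MYSECRET="dfad89ald"
export CONFDATALOC="/path/to/irs/files"

(this is not recommended: the values end up in your shell history, and without export they are not passed on to programs started from that shell)

Storing these in files

The same NAME=value syntax, without the export keyword, is used for the contents of “dot-env” or “Renviron” files. Bash or zsh startup files (.bash_profile, .zshrc) use the export form.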

Using in R

Edit .Renviron (note the dot!) files:

# Edit global (personal) Renviron
usethis::edit_r_environ()
# You can also consider creating project-specific settings:
usethis::edit_r_environ(scope = "project")

Use the variables defined in .Renviron:

mysecret <- Sys.getenv('MYSECRET')

Using in Python

Loading regular environment variables:

import os
mysecret = os.getenv("MYSECRET")  # will load environment variables

Loading with dotenv

import os
from dotenv import load_dotenv
load_dotenv()  # take environment variables from project .env.
mysecret = os.getenv("MYSECRET")  # will load environment variables
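Since `os.getenv` silently returns `None` when a variable is missing, it can help to fail early and clearly. A small sketch (the helper name `require_secret` is hypothetical) that reports which variable is missing without ever printing its value:

```python
import os


def require_secret(name):
    """Return the environment variable name, or fail without echoing values."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

Used as `mysecret = require_secret("MYSECRET")`, the script stops at the top with a readable message instead of failing later with an authentication error.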

Using in Stata

Yes, this also works in Stata

// load from environment
global mysecret : env MYSECRET
display "$mysecret"  // don't actually do this in code

and via (what else) a user-written package for loading from files:

net install doenv, from(https://github.com/vikjam/doenv/raw/master/)
doenv using ".env"
global mysecret "`r(MYSECRET)'"
display "$mysecret"

Simplest solution

//============ confidential parameters =============
capture confirm file "confidential/confparms.do"
if _rc == 0 {
    // file exists
    include "confidential/confparms.do"
} else {
    di in red "No confidential parameters found"
}
//============ end confidential parameters =========

//============ non-confidential parameters =========
include "config.do"
//============ end parameters ======================

Confidential code?

What is confidential code, you say?

In the United States, some variables on IRS databases are considered super-top-secret. So you can’t name that-variable-that-you-filled-out-on-your-Form-1040 in your analysis code of the same data. (They are often referred to in jargon as “Title 26 variables”.)
Your code contains the random seed you used to anonymize the sensitive identifiers. This might allow someone to reverse-engineer the anonymization, so it is not a good idea to publish it.

What is confidential code, you say?

You used a look-up table hard-coded in your Stata code to anonymize the sensitive identifiers (replace anoncounty=1 if county=="Tompkins, NY"). A really bad idea, but yes, you probably want to hide that.
Your IT specialist or disclosure officer thinks publishing the exact path to your copy of the confidential 2010 Census data, e.g., “/data/census/2010”, is a security risk and refuses to let that code through.

What is confidential code, you say?

You have adhered to disclosure rules, but for some reason, the precise minimum cell size is a confidential parameter.

What is confidential code, you say?

So whether reasonable or not, this is an issue. How do you do that, without messing up the code, or spending hours redacting your code?

Example

This will serve as an example. None of this is specific to Stata, and the solutions for R, Python, Julia, Matlab, etc. are all quite similar.
Assume that variables q2f and q3e are considered confidential by some rule, and that the minimum cell size 10 is also confidential.

set seed 12345
use q2f q3e county using "/data/economic/cmf2012/extract.dta", clear
gen logprofit = log(q2f)
collapse (count)  n=q3e (mean) logprofit, by(county)
drop if n<10
graph twoway scatter n logprofit

Do not do this

A bad approach, because it makes more work for you and for future replicators, is to manually redact the confidential information with text that is not legitimate code:

set seed NNNNN
use <removed vars> county using "<removed path>", clear
gen logprofit = log(XXXX)
collapse (count)  n=XXXX (mean) logprofit, by(county)
drop if n<XXXX
graph twoway scatter n logprofit

The redacted program above will no longer run, and will be very tedious to un-redact if a subsequent replicator obtains legitimate access to the confidential data.

Better

Simply replacing the confidential values with valid placeholders in the programming language of your choice is already better. Here’s the confidential version of the file:

//============ confidential parameters =============
global confseed    12345
global confpath    "/data/economic/cmf2012"
global confprofit  q2f
global confemploy  q3e
global confmincell 10
//============ end confidential parameters =========
set seed $confseed
use $confprofit $confemploy county using "${confpath}/extract.dta", clear
gen logprofit = log($confprofit)
collapse (count)  n=$confemploy (mean) logprofit, by(county)
drop if n<$confmincell
graph twoway scatter n logprofit

Better

and this would be the released file, part of the replication package:

//============ confidential parameters =============
global confseed    XXXX    // a number
global confpath    "XXXX"  // a path that will be communicated to you
global confprofit  XXX     // Variable name for profit T26
global confemploy  XXX     // Variable name for employment T26
global confmincell XXX     // a number
//============ end confidential parameters =========
set seed $confseed
use $confprofit $confemploy county using "${confpath}/extract.dta", clear
gen logprofit = log($confprofit)
collapse (count)  n=$confemploy (mean) logprofit, by(county)
drop if n<$confmincell
graph twoway scatter n logprofit

While the code won’t run as-is, it is easy to un-redact, regardless of how many times you reference the confidential values, e.g., q2f, anywhere in the code.

Caveats

You have to re-run the entire code to obtain a modified graph
But if the data presented in the graph is non-sensitive (i.e., disclosable), then the data underlying it is as well.

$\rightarrow$ provide code that

automatically detects if the confidential data is there (skips if not)
will always run for the graphing (“analysis”) part of the code.

Best

Main file
Conditional processing
Separate file for confidential parameters which can simply be excluded from disclosure request

Best

Main file main.do:

//============ confidential parameters =============
capture confirm file "include/confparms.do"
if _rc == 0 {
    // file exists
    include "include/confparms.do"
} else {
    di in red "No confidential parameters found"
}
//============ end confidential parameters =========

//============ non-confidential parameters =========
global safepath "releasable"
cap mkdir "$safepath"

//============ end parameters ======================

Best

Main file main.do (continued)

// ::::  Process only if confidential data is present 

capture confirm  file "${confpath}/extract.dta"
if _rc == 0 {
   set seed $confseed
   use $confprofit $confemploy county using "${confpath}/extract.dta", clear
   gen logprofit = log($confprofit)
   collapse (count)  n=$confemploy (mean) logprofit, by(county)
   drop if n<$confmincell
   save "${safepath}/figure1.dta", replace
}
else {
   di in red "Skipping processing of confidential data"
}

//============ at this point, the data is releasable ======
// ::::  Process always 

use "${safepath}/figure1.dta", clear
graph twoway scatter n logprofit
graph export "${safepath}/figure1.pdf", replace

Best

Auxiliary file include/confparms.do (not released)

//============ confidential parameters =============
global confseed    12345
global confpath    "/data/economic/cmf2012"
global confprofit  q2f
global confemploy  q3e
global confmincell 10
//============ end confidential parameters =========

Best

Auxiliary file include/confparms_template.do (this is released)

//============ confidential parameters =============
global confseed    XXXX    // a number
global confpath    "XXXX"  // a path that will be communicated to you
global confprofit  XXX     // Variable name for profit T26
global confemploy  XXX     // Variable name for employment T26
global confmincell XXX     // a number
//============ end confidential parameters =========
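Keeping the template in sync with the real parameter file by hand is error-prone. One possible helper, sketched in Python under the assumption that every confidential parameter sits on its own `global name value` line as above: it replaces each value with a placeholder and keeps quotes where the original value was quoted. The function name `redact_globals` is hypothetical, trailing comments on those lines are dropped and have to be added back by hand, and the output should still be reviewed by a person before release.

```python
import re

# Matches Stata lines of the form: global <name> <value>
GLOBAL_LINE = re.compile(r"^(\s*global\s+\w+\s+)(.*)$")


def redact_globals(text, placeholder="XXXX"):
    """Replace the value of every `global name value` line with a placeholder.

    Quoted values stay quoted; all other lines are passed through unchanged.
    """
    out = []
    for line in text.splitlines():
        match = GLOBAL_LINE.match(line)
        if match:
            prefix, value = match.groups()
            quoted = value.strip().startswith('"')
            line = prefix + (f'"{placeholder}"' if quoted else placeholder)
        out.append(line)
    return "\n".join(out) + "\n"
```

Reading include/confparms.do, passing its contents through `redact_globals`, and writing the result to include/confparms_template.do gives a first draft of the released file.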

Best replication package

Thus, the replication package would have:

main.do
README.md
include/confparms_template.do
releasable/figure1.dta
releasable/figure1.pdf

Maintaining documentation

Wrapping up the replication package

Using templates for reproducibility
Documenting what you did
When to document

Best replication package

We already had this:

main.do
README.md
include/confparms_template.do
releasable/figure1.dta
releasable/figure1.pdf

What’s in the README?

Start with our fabulous template README. Really, it helps! Available at https://social-science-data-editors.github.io/template_README/

Three parts to README

Data availability (and citations)
Computer requirements
Description of processing

Start with the last part

That’s easy: you’ve been keeping clean instructions since the start, right?

Run “main.do”
Describe what parts might be skipped
Describe what the various parts do
Describe which parts use confidential data

You’ve been doing that since day 1!

Computer requirements

In most confidential environments, such as FSRDC/IRE, this part is out of your control. But describe it anyway!

Approximate description of computers/nodes used
- memory size (but interested in actual usage, not max of what the system has!)
- compute time! How long does a clean run, from top to bottom, take?
- number of nodes: any parallel processing?
Software
- Version of software (Stata 17, update level)
- All packages! Ideally, version of package (which estout)

Some of that is captured in your notes (updated, remember?), some of that may change over the life of the project, and may be captured in your logs, or your qsub files.
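Part of this can be captured by the code itself at run time. A Python sketch of the idea, using only the standard library; the function names are made up, and for Stata the analogous information would come from Stata's own version and package commands rather than from this code:

```python
import platform
import time


def describe_environment():
    """Collect basic facts about the machine and the Python in use."""
    return {
        "platform": platform.platform(),
        "machine": platform.machine(),
        "python": platform.python_version(),
    }


def timed(func):
    """Run func() and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start
```

Writing the output of `describe_environment()` and the elapsed time of a clean run to the log gives the README numbers that were measured, not remembered.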

Data availability

This is easy: it’s the data you requested to have included in your FSRDC project!
So you had this info from Day -90 of the project!

Data availability redux

In order to describe data availability, split into two:

how did YOU get access to the data (that’s old)
how can OTHERS get access to the same data (that might be different!)
The two are not always the same, but are both relevant.

Examples

Examples include

this excellent description from a paper by Teresa Fort (ReStud):

All the results in the paper use confidential microdata from the U.S. Census Bureau. To gain access to the Census microdata, follow the directions here on how to write a proposal for access to the data via a Federal Statistical Research Data Center: https://www.census.gov/ces/rdcresearch/howtoapply.html.

You must request the following datasets in your proposal:

Longitudinal Business Database (LBD), 2002 and 2007

Foreign Trade Database – Import (IMP), 2002 and 2007

Annual Survey of Manufactures (ASM), including the Computer Network Use Supplement (CNUS), 1999

[…]

Annual Survey of Magical Inputs (ASMI), 2002 and 2007

Reference “Technology and Production Fragmentation: Domestic versus Foreign Sourcing” by Teresa Fort, project number br1179 in the proposal. This will give you access to the programs and input datasets required to reproduce the results. Requesting a search of archives with the articles DOI (“10.1093/restud/rdw057”) should yield the same results.

NOTE: Project-related files are available for 10 years as of 2015.

Examples

Examples include

this description by Fadlon and Nielsen about Danish data

The information used in the analysis combines several Danish administrative registers (as described in the paper). The data use is subject to the European Union’s General Data Protection Regulation(GDPR) per new Danish regulations from May 2018. The data are physically stored on computers at Statistics Denmark and, due to security considerations, the data may not be transferred to computers outside Statistics Denmark. Researchers interested in obtaining access to the register data employed in this paper are required to submit a written application to gain approval from Statistics Denmark. The application must include a detailed description of the proposed project, its purpose, and its social contribution, as well as a description of the required datasets, variables, and analysis population. Applications can be submitted by researchers who are affiliated with Danish institutions accepted by Statistics Denmark, or by researchers outside of Denmark who collaborate with researchers affiliated with these institutions.

(Example taken from Fadlon and Nielsen, AEJ:Applied 2021).

Three parts to README: timing

- Data availability (and citations):	Start of project, edit at the end
- Computer requirements:	Middle of project
- Description of processing:	Middle of project

with the end really just a last read/edit.

Wrapping it all up

Wrapping up

Public replication package contains intelligible code, omits confidential details (but provides template code), has detailed data provenance statements
Confidential replication package contains all the same, plus the confidential code, is archived in the FSRDC

Things to remember

When doing a disclosure review request, remember to request the code
When outputting statistics, consider the disclosure rules: the fewer changes, the faster the output (in theory), and in particular the fewer surprises
Do not think “nobody will ever read this code” - somebody is very likely to!

End

Now you wait for the replicators to show up!