Reproducibility from Day 1

Lars Vilhuber
Marie Connolly
Miklós Koren

2024-08-01

Follow along

lars.vilhuber.com/p/fsrdc2024/

Reproducibility from Day 1

Journals require that you share your code and data in a replication package at the end of your research project.

Following some best practices from day 1 can not only help you prepare this package later, but also make you more productive researchers.

What is a replication package?

Example of deposit

AEA policy

Goal

Scenario

We start with an empty folder, and an idea.

Does procurement in the EU have a bias towards local providers?

We finish with a mini-project about public procurement across various European countries.

Setup

Tools needed:

  • Stata. It is feasible to do the whole exercise in R as well, but examples will be in Stata.
  • text editor (we suggest VS Code)
  • web browser
  • file browser (e.g., Windows Explorer, Finder, Nautilus, etc.)

First steps

  1. Create an empty folder, like Desktop/day1
  2. Navigate to this folder in Stata: cd ../Desktop/day1
  3. Open this folder in your text editor, if it can open folders (optional)
  4. Google “CEPII GeoDist”

Keeping on top of provenance

  • Licenses
  • Streamlining for reproducibility

Licenses

Where does the file come from?

  • How can we describe this later to somebody?
    • Point and click is long to describe
    • What are the rights we have?

What is a license?

A license (licence) is an official permission or permit to do, use, or own something (as well as the document of that permission or permit).1 2

Examples

License applying to Geodist data

Can we re-publish the file?

Downloading via code

Easiest:

Stata

global URL "https://www.cepii.fr/distance/dist_cepii.dta"
use "$URL" , clear

Why not?

  • will it be there in two months? in 6 years?
  • what if the internet connection is down?

Easy:

Stata

global URL "https://www.cepii.fr/distance/dist_cepii.dta"
copy "$URL" "dist_cepii.dta", replace

R

url <- "https://www.cepii.fr/distance/dist_cepii.dta"
download.file(url = url, destfile = "dist_cepii.dta", mode = "wb")

We will get to even better methods a bit later
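Python

For readers working in Python, the same "download once, keep a local copy" idea might be sketched as follows. The function name `download_file` and the destination path are illustrative assumptions, not part of the original materials; only the standard library is used.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url: str, destfile: str) -> Path:
    """Save the content at `url` to `destfile` and return the local path."""
    dest = Path(destfile)
    # Create the destination folder if it does not exist yet.
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Stream the response to disk so large files are not held in memory.
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest
```

A call such as `download_file("https://www.cepii.fr/distance/dist_cepii.dta", "data/dist_cepii.dta")` would then leave a local copy that later steps can read without internet access.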

Creating a README

  • Template README
    • Cite both dataset and working paper
    • Add data URL and time accessed (can you think of a way to automate this?)
    • Add a link to license (also: download and store the license)
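One possible way to automate the "URL and time accessed" entry is to have the download script write it. The sketch below is only an illustration in Python: the function names and the line format are assumptions, and the log file name is up to you.

```python
from datetime import datetime, timezone


def provenance_line(url: str, local_file: str) -> str:
    """Return one README-ready line recording where and when a file was obtained."""
    accessed = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
    return f"- {local_file}: downloaded from {url} (accessed {accessed})"


def append_provenance(logfile: str, url: str, local_file: str) -> None:
    """Append the provenance line to a plain-text log kept next to the data."""
    with open(logfile, "a", encoding="utf-8") as fh:
        fh.write(provenance_line(url, local_file) + "\n")
```

Calling `append_provenance` right after each download keeps the record in sync with the data, and the lines can be pasted into the README later.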

Structure of the project

Folder by function

Let’s start with something easy. Separate folders for each function: code/ and data/

code/
data/

TIER Protocol

  • The TIER Protocol is a set of guidelines for organizing reproducible research.

Scripts by function

We will download -> Create a script download_data.do

Paths in scripts

🛑Do not hard-code paths!

copy "$URL" "C:\Users\lv39\Desktop\day1\data\dist_cepii.dta", replace

Why?

Names in scripts

🛑Do not rename data files!

copy "$URL" "C:\Users\lv39\Desktop\day1\data\that_file_from_cepii.dta", replace

Why?
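Hard-coded absolute paths only work on the machine where they were written, and a renamed file is harder to trace back to its source. A minimal Python sketch of one way to avoid both, assuming a hypothetical helper `local_path_for` and a relative `data` folder:

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def local_path_for(url: str, data_dir: str = "data") -> Path:
    """Build a relative destination path that keeps the provider's file name."""
    # Take the last component of the URL path, e.g. "dist_cepii.dta".
    original_name = PurePosixPath(urlparse(url).path).name
    # The result is relative to the project folder, not to a user's machine.
    return Path(data_dir) / original_name
```

Because the path is relative and the name comes from the URL itself, the same script runs unchanged on a co-author's or replicator's computer.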

Redo the same thing for other data

Data Cleaning

Need to merge the data

Write a merging script. Create the sample.

These two can go in the same script, create_sample.do

Where does the merged data go?

No overwriting of original data!

The merged data should go in the data/generated directory.

Moving things around

  • Create directories
  • Consolidate paths
  • Time to write a main file!

Purpose of the main file

  • Complete list of all steps!
  • One-touch reproduction
  • Robustness checks all along

Example

do "code/00_setup.do"
do "code/01_download_data.do"
do "code/02_create_analysis_sample.do"
do "code/03_analysis.do"

Making code more robust

  • Re-running takes time
  • What if a file is no longer there?
  • What if we need packages?

Making download code robust

  • Only download if necessary
    • file is missing
    • file has changed

Robust download code (Stata)

Download only if we want to download

global redownload 0

if $redownload == 1 {
   // download only when the redownload flag is set
   copy "https://datahub.io/core/country-codes/r/country-codes.csv" "data/raw/country-codes.csv", replace
}

Robust download code (2)

Automatically download the file again if not there.

global redownload 0

capture confirm file  "data/raw/country-codes.csv"
if _rc != 0 {
    global redownload 1
}

if $redownload == 1 {
...

Robust download code (3)

What if the file has changed?

if $redownload == 1 {
   copy "https://datahub.io/core/country-codes/r/country-codes.csv" "data/raw/country-codes.csv", replace
   // create checksum of file
   // Aug 2023 version: 2295658388
   global countrycksum 2295658388
   checksum "data/raw/country-codes.csv", save
   assert $countrycksum == r(checksum)
   // This will fail if the files are not identical
   // Provide a verbose message if we get past this point
   disp in green "Country codes file downloaded successfully"
} 

Robust download code (4)

Be informative!

...
} 
else {
   // what to do when file does exist
   disp in green "Country codes file already exists"
}
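The pieces above (flag, missing-file check, checksum, informative messages) can be combined in one function. The following is a sketch in Python under stated assumptions: `robust_download` and `sha256_of` are hypothetical names, and the check uses SHA-256 from the standard library, which produces different values than Stata's `checksum` command.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def robust_download(url: str, dest: Path, expected_sha256: str,
                    redownload: bool = False) -> bool:
    """Download only if requested or if the file is missing, then verify it.

    Returns True if a download happened, False if the existing file was kept.
    """
    if not dest.exists():
        redownload = True  # file is missing, so we have to fetch it
    if not redownload:
        print(f"{dest} already exists")
        return False
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    # Fail loudly if the file differs from the version recorded earlier.
    actual = sha256_of(dest)
    if actual != expected_sha256:
        raise ValueError(f"Checksum mismatch for {dest}: got {actual}")
    print(f"{dest} downloaded successfully")
    return True
```

The expected digest would be recorded once, when the file is first obtained, and stored in the code just like the checksum value in the Stata example.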

Use the main file to control larger pieces

Change flags, don’t comment out code

No manual manipulation

  • “Change the parameter to 0.2, then run the code again”
  • “Compute the percentages for Table 2 by hand”

Systematic automation

  • Use functions, ado files, programs, macros, subroutines
  • Use loops, parameters, parameter files to call those subroutines
  • Use placeholders (globals, macros, libnames, etc.) for common locations ($CONFDATA, $TABLES, $CODE)
  • Compute all numbers in package
    • No manual calculation of numbers

Example (1)

// Header of main.do
// Define which steps should be run
global step1 1
global step2 1
global step3 1

do "code/00_setup.do"
if $step1 == 1  do "code/01_download_data.do"
if $step2 == 1  do "code/02_create_analysis_sample.do"
if $step3 == 1  do "code/03_analysis.do"

Example (2)

Here we always run the 00_setup.do file.

// Header of main.do
// Define which steps should be run
global step1 1
global step2 1
global step3 1

do "code/00_setup.do"
if $step1 == 1  do "code/01_download_data.do"
if $step2 == 1  do "code/02_create_analysis_sample.do"
if $step3 == 1  do "code/03_analysis.do"

Example (3)

Then conditionally run the other pieces:

// Header of main.do
// Define which steps should be run
global step1 1
global step2 1
global step3 1

do "code/00_setup.do"
if $step1 == 1  do "code/01_download_data.do"
if $step2 == 1  do "code/02_create_analysis_sample.do"
if $step3 == 1  do "code/03_analysis.do"
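The same flag-driven pattern carries over to other languages. A minimal Python sketch, in which `run_pipeline` and the step names are hypothetical and each step is simply a function:

```python
from typing import Callable, Dict, List


def run_pipeline(steps: Dict[str, Callable[[], None]],
                 flags: Dict[str, bool]) -> List[str]:
    """Run each step whose flag is True, in the order given; return names that ran."""
    executed = []
    for name, step in steps.items():
        # A step without a flag is treated as switched off.
        if flags.get(name, False):
            step()
            executed.append(name)
    return executed
```

Switching a step off then means changing one flag at the top of the main file, for example `{"download": False, "sample": True, "analysis": True}`, rather than commenting out code.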

Starting to be complex

  • Let’s use a separate config.do file to contain configuration parameters

// file locations
// code to set rootdir omitted
global inputdata "$rootdir/data/inputs"
global tempdata  "$rootdir/temporary"
global outputs   "$rootdir/tables-figures"

// ensure they are created
cap mkdir "$tempdata"
cap mkdir "$outputs"
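For comparison, a Python project could keep the same locations in a small configuration function. This is a sketch only: `make_config` is a hypothetical name, and the folder names simply mirror the Stata globals above.

```python
from pathlib import Path


def make_config(rootdir: str) -> dict:
    """Define project locations relative to `rootdir` and create output folders."""
    root = Path(rootdir)
    config = {
        "inputdata": root / "data" / "inputs",
        "tempdata": root / "temporary",
        "outputs": root / "tables-figures",
    }
    # Mirror `cap mkdir`: create the folders, do nothing if they already exist.
    for key in ("tempdata", "outputs"):
        config[key].mkdir(parents=True, exist_ok=True)
    return config
```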

Example (4)

So let’s automate some of this:

include "config.do"

// define steps
global step1 1
global step2 1
global step3 1

// Nothing needs to be changed here
do "$rootdir/code/00_setup.do"
if $step1 == 1  do "$rootdir/code/01_download_data.do"
if $step2 == 1  do "$rootdir/code/02_create_analysis_sample.do"
if $step3 == 1  do "$rootdir/code/03_analysis.do"

Example (5)

include "config.do"

// define steps
global step1 1
global step2 1
global step3 1

// Nothing needs to be changed here
do "$rootdir/code/00_setup.do"
if $step1 == 1  do "$rootdir/code/01_download_data.do"
if $step2 == 1  do "$rootdir/code/02_create_analysis_sample.do"
if $step3 == 1  do "$rootdir/code/03_analysis.do"

Example (6)

Configure the steps on certain conditions:

// define steps
global step1 1
global step2 1
global step3 1

// verify if file has changed
qui checksum "$resultfile1"
// if not, don't run Step 2
if `r(checksum)' == $checksum1 global step2 0 

// Nothing needs to be changed here
do "$rootdir/code/00_setup.do"
if $step1 == 1  do "$rootdir/code/01_download_data.do"
if $step2 == 1  do "$rootdir/code/02_create_analysis_sample.do"
if $step3 == 1  do "$rootdir/code/03_analysis.do"

and config.do contains additional information:

// file locations
// code to set rootdir omitted
global inputdata "$rootdir/data/inputs"
global tempdata  "$rootdir/temporary"
global outputs   "$rootdir/tables-figures"

// ensure they are created
cap mkdir "$tempdata"
cap mkdir "$outputs"

// some key parameters
global resultfile1 "$outputs/table1.tex"
global checksum1   386698503
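A Python analogue of the "skip the step if the result has not changed" check might look like this. It is a sketch under assumptions: `result_is_current` is a hypothetical helper, and it uses a SHA-256 digest where the Stata example uses the value reported by `checksum`.

```python
import hashlib
from pathlib import Path


def result_is_current(resultfile: str, expected_sha256: str) -> bool:
    """True if the result file exists and matches the recorded digest."""
    path = Path(resultfile)
    if not path.is_file():
        return False  # nothing to compare against, so the step must run
    return hashlib.sha256(path.read_bytes()).hexdigest() == expected_sha256
```

A main script could then set a step flag to `not result_is_current(...)`, so that the step is rerun whenever the result is missing or differs from the recorded version.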

Why can this be useful?

Consider a final test if everything runs:

  • delete temporary/ and tables-figures/ folders.
  • might even delete the downloaded files
  • then run the main.do file again.

This will test if everything works!
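The clean-up part of this final test can itself be scripted, so that it is done the same way every time. A minimal Python sketch, assuming a hypothetical helper `clean_generated` and the folder names used in config.do:

```python
import shutil
from pathlib import Path


def clean_generated(rootdir: str, folders=("temporary", "tables-figures")) -> list:
    """Delete generated folders under `rootdir`; return the ones removed."""
    removed = []
    for name in folders:
        target = Path(rootdir) / name
        # Only touch folders that exist; raw data folders are never listed here.
        if target.is_dir():
            shutil.rmtree(target)
            removed.append(name)
    return removed
```

After the clean-up, running the main file from scratch shows whether every output can be regenerated from the code and the input data alone.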

Secrets in the code

What are secrets?

  • API keys
  • Login credentials for data access
  • File paths (FSRDC!)
  • Variable names (IRS!)

Standard practice

Store secrets in environment variables or files that are not published.

Some services are serious about this

Github secret scanning

Where to store secrets

  • environment variables
  • “dot-env” files (Python), “Renviron” files (R)
  • or some other clearly identified file in the project or home directory

Environment variables

Typed interactively (here for Linux and Mac)

MYSECRET="dfad89ald"
CONFDATALOC="/path/to/irs/files"

(this is not recommended)

Storing these in files

Same syntax used for contents of “dot-env” or “Renviron” files, and in fact bash or zsh startup files (.bash_profile, .zshrc)

Using in R

Edit .Renviron (note the dot!) files:

# Edit global (personal) Renviron
usethis::edit_r_environ()
# You can also consider creating project-specific settings:
usethis::edit_r_environ(scope = "project")

Use the variables defined in .Renviron:

mysecret <- Sys.getenv('MYSECRET')

Using in Python

Loading regular environment variables:

import os
mysecret = os.getenv("MYSECRET")  # will load environment variables

Loading with dotenv

import os
from dotenv import load_dotenv
load_dotenv()  # take environment variables from project .env.
mysecret = os.getenv("MYSECRET")  # will load environment variables
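If installing an extra package is not an option, the file format is simple enough to read by hand. The sketch below is a simplified stand-in, not a replacement for a full parser: `read_env_file` is a hypothetical helper that only handles plain `KEY="value"` lines, blank lines, and `#` comments.

```python
from pathlib import Path


def read_env_file(path: str) -> dict:
    """Parse simple KEY="value" lines; skip blanks and # comments."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "=" themselves.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values
```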

Using in Stata

Yes, this also works in Stata

// load from environment
global mysecret : env MYSECRET
display "$mysecret"  // don't actually do this in code

and via (what else) a user-written package for loading from files:

net install doenv, from(https://github.com/vikjam/doenv/raw/master/)
doenv using ".env"
global mysecret "`r(MYSECRET)'"
display "$mysecret"

Simplest solution

//============ confidential parameters =============
capture confirm file "confidential/confparms.do"
if _rc == 0 {
    // file exists
    include "confidential/confparms.do"
} else {
    di in red "No confidential parameters found"
}
//============ end confidential parameters =========

//============ non-confidential parameters =========
include "config.do"
//============ end parameters ======================
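The same "include it if it is there" logic can be sketched in Python. Everything specific here is an assumption for illustration: the helper name `load_parameters`, and the choice of a JSON file to hold the confidential parameters.

```python
import json
from pathlib import Path


def load_parameters(conf_file: str, public: dict) -> dict:
    """Merge public parameters with confidential ones if that file is present."""
    params = dict(public)  # copy, so the caller's dict is left untouched
    path = Path(conf_file)
    if path.is_file():
        params.update(json.loads(path.read_text(encoding="utf-8")))
    else:
        print("No confidential parameters found")
    return params
```

Code that runs without the confidential file can then check whether a key such as `confpath` is present before attempting any step that needs it.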

Confidential code?

What is confidential code, you say?

  • In the United States, some variables in IRS databases are considered super-top-secret. So you can’t name that-variable-that-you-filled-out-on-your-Form-1040 in your analysis code for that same data. (In jargon, these are often referred to as “Title 26 variables”.)
  • Your code contains the random seed you used to anonymize the sensitive identifiers. This might allow someone to reverse-engineer the anonymization, so publishing it is not a good idea.

  • You used a look-up table hard-coded in your Stata code to anonymize the sensitive identifiers (replace anoncounty=1 if county=="Tompkins, NY"). A really bad idea, but yes, you probably want to hide that.
  • Your IT specialist or disclosure officer thinks publishing the exact path to your copy of the confidential 2010 Census data, e.g., “/data/census/2010”, is a security risk and refuses to let that code through.

  • You have adhered to disclosure rules, but for some reason, the precise minimum cell size is a confidential parameter.

So whether reasonable or not, this is an issue. How do you do that, without messing up the code, or spending hours redacting your code?

Example

  • This will serve as an example. None of this is specific to Stata, and the solutions for R, Python, Julia, Matlab, etc. are all quite similar.
  • Assume that variables q2f and q3e are considered confidential by some rule, and that the minimum cell size 10 is also confidential.

set seed 12345
use q2f q3e county using "/data/economic/cmf2012/extract.dta", clear
gen logprofit = log(q2f)
collapse (count)  n=q3e (mean) logprofit, by(county)
drop if n<10
graph twoway scatter n logprofit

Do not do this

A bad approach, because it creates more work for you and for future replicators, is to manually redact the confidential information with text that is not valid code:

set seed NNNNN
use <removed vars> county using "<removed path>", clear
gen logprofit = log(XXXX)
collapse (count)  n=XXXX (mean) logprofit, by(county)
drop if n<XXXX
graph twoway scatter n logprofit

The redacted program above will no longer run, and will be very tedious to un-redact if a subsequent replicator obtains legitimate access to the confidential data.

Better

Simply replacing the confidential values with valid placeholders in the programming language of your choice is already better. Here’s the confidential version of the file:

//============ confidential parameters =============
global confseed    12345
global confpath    "/data/economic/cmf2012"
global confprofit  q2f
global confemploy  q3e
global confmincell 10
//============ end confidential parameters =========
set seed $confseed
use $confprofit $confemploy county using "${confpath}/extract.dta", clear
gen logprofit = log($confprofit)
collapse (count)  n=$confemploy (mean) logprofit, by(county)
drop if n<$confmincell
graph twoway scatter n logprofit

Better

and this would be the released file, part of the replication package:

//============ confidential parameters =============
global confseed    XXXX    // a number
global confpath    "XXXX"  // a path that will be communicated to you
global confprofit  XXX     // Variable name for profit T26
global confemploy  XXX     // Variable name for employment T26
global confmincell XXX     // a number
//============ end confidential parameters =========
set seed $confseed
use $confprofit $confemploy county using "${confpath}/extract.dta", clear
gen logprofit = log($confprofit)
collapse (count)  n=$confemploy (mean) logprofit, by(county)
drop if n<$confmincell
graph twoway scatter n logprofit

While the code won’t run as-is, it is easy to un-redact, regardless of how many times you reference the confidential values, e.g., q2f, anywhere in the code.

Caveats

  • You have to re-run the entire code to obtain a modified graph
  • But if the data presented in the graph is non-sensitive (i.e., disclosable), then the data underlying it is as well.

→ Provide code that

  • automatically detects if the confidential data is there (skips if not)
  • will always run for the graphing (“analysis”) part of the code.

Best

  • Main file
  • Conditional processing
  • Separate file for confidential parameters which can simply be excluded from disclosure request

Best

Main file main.do:

//============ confidential parameters =============
capture confirm file "include/confparms.do"
if _rc == 0 {
    // file exists
    include "include/confparms.do"
} else {
    di in red "No confidential parameters found"
}
//============ end confidential parameters =========

//============ non-confidential parameters =========
global safepath "releasable"
cap mkdir "$safepath"

//============ end parameters ======================

Best

Main file main.do (continued)

// ::::  Process only if confidential data is present 

capture confirm  file "${confpath}/extract.dta"
if _rc == 0 {
   set seed $confseed
   use $confprofit $confemploy county using "${confpath}/extract.dta", clear
   gen logprofit = log($confprofit)
   collapse (count)  n=$confemploy (mean) logprofit, by(county)
   drop if n<$confmincell
   save "${safepath}/figure1.dta", replace
}
else {
   di in red "Skipping processing of confidential data"
}

//============ at this point, the data is releasable ======
// ::::  Process always 

use "${safepath}/figure1.dta", clear
graph twoway scatter n logprofit
graph export "${safepath}/figure1.pdf", replace

Best

Auxiliary file include/confparms.do (not released)

//============ confidential parameters =============
global confseed    12345
global confpath    "/data/economic/cmf2012"
global confprofit  q2f
global confemploy  q3e
global confmincell 10
//============ end confidential parameters =========

Best

Auxiliary file include/confparms_template.do (this is released)

//============ confidential parameters =============
global confseed    XXXX    // a number
global confpath    "XXXX"  // a path that will be communicated to you
global confprofit  XXX     // Variable name for profit T26
global confemploy  XXX     // Variable name for employment T26
global confmincell XXX     // a number
//============ end confidential parameters =========
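Keeping the template in sync with the real parameter file by hand is error-prone, so one option is to generate it. The sketch below is an illustration in Python, not part of the original workflow: `redact_globals` is a hypothetical helper that blanks the value of every `global` line. It does not inspect comments, so the output should still be reviewed before release.

```python
import re

# Matches Stata lines of the form: global <name> <value>
GLOBAL_LINE = re.compile(r"^(\s*global\s+\w+\s+)(.*)$")


def redact_globals(text: str, placeholder: str = "XXXX") -> str:
    """Replace the value of every `global` definition with a placeholder."""
    redacted = []
    for line in text.splitlines():
        match = GLOBAL_LINE.match(line)
        if match:
            value = match.group(2).strip()
            # Keep the quotes so string-valued globals stay visibly strings.
            new_value = f'"{placeholder}"' if value.startswith('"') else placeholder
            line = match.group(1) + new_value
        redacted.append(line)
    return "\n".join(redacted)
```

Applied to the contents of include/confparms.do, this yields a file with the same structure and no confidential values, which can then be annotated with hints such as "a number" or "a path that will be communicated to you".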

Best replication package

Thus, the replication package would have:

main.do
README.md
include/confparms_template.do
releasable/figure1.dta
releasable/figure1.pdf

Part 2

Step 1: Stata R

Step 2: Stata R

Step 3: General

Step 3 (with robust download code) Stata

Alternatively: https://github.com/datasets/country-codes/blob/master/data/country-codes.csv

Step 4: Stata R

Step 5: Stata R

Footnotes

  1. Cambridge Dictionary

  2. Wikipedia

  3. 🔒Tag: stage1

  4. 🔒Tag: stage3

  5. 🔒Tag: stage5

  6. 🔒Tag: stage3-alt