2024-10-14
This document describes a few possible steps to self-check replication packages before submitting them to a journal. It is not meant to be exhaustive, and it is not meant to be prescriptive. There are many ways to construct a replication package, and many more to check that it works.
The key ingredient is what I call “computational empathy” - thinking about what an unknown person attempting to reproduce the results in your paper might face, what they might know and assume, and more importantly, what they might not know or know to assume. While the replication package might very well run on your computer, that is by no means evidence that it will run on someone else’s computer.
In what follows, we will assume that the replicator satisfies the following conditions:
TL;DR is techy lingo for “too long, didn’t read”. A summary of the most important takeaways will be at the top of each section.
We want to check that
Unpredictable things happening to your computing environment:
The very first test is that your code must run, beginning to end, top to bottom, without error, and ideally without any user intervention. This should in principle (re)create all figures, tables, and numbers you include in your paper.
This is pretty much the most basic test of reproducibility. If you cannot run your code, you cannot reproduce your results, nor can anybody else. So just re-run the code.
What happens when some of these re-runs are very long? See later in this chapter for how to handle this.
While the code, once set to run, can do so on its own, you might need to spend a lot of time getting all the various pieces to run.
This should be a warning sign:
If it takes you a long time to get it to run, or to manually reproduce the results, it might take others even longer.[1]
Furthermore, it may suggest that you haven’t been able to re-run your own code very often, which can indicate fragility or even lack of reproducibility.
Automation and robustness checks, as well as efficiency.
Generating a log file means that you can inspect it, and you can share it with others. It also helps in debugging, for you and others.
Running it again does not help:
Did it take you a long time to run everything again?
Your code must run, beginning to end, top to bottom, without error, and without any user intervention.
This should in principle (re)create all figures, tables, and in-text numbers you include in your paper.
Out of 8280 replication packages in ~20 top econ journals, only 2594 (31.33%) had a main/controller script.[2]
To enable “hands-off running”, the main (controller) script is key. I will show here a few simple examples for single-software replication packages. We will discuss more complex examples in one of the next chapters.
In Stata:
Set the root directory (dynamically).
Call the various files that constitute your complete analysis.
The use of do (instead of run or even capture run) is best, as it will show the code that is being run, and is thus more transparent to you and the future replicator.
In R:
Set the root directory (using here() or rprojroot).
# main.R
## Set the root directory
# If you are using Rproj files or git
rootdir <- here::here()
# or if not
# rootdir <- getwd()
## Run the data preparation file
source(file.path(rootdir, "01_data_prep.R"),
echo = TRUE)
## Run the analysis file
source(file.path(rootdir, "02_analysis.R"),
echo = TRUE)
## Run the table file
source(file.path(rootdir, "03_tables.R"), echo = TRUE)
## Run the figure file
source(file.path(rootdir, "04_figures.R"), echo = TRUE)
## Run the appendix file
source(file.path(rootdir, "05_appendix.R"), echo = TRUE)
Call each of the component programs, using source().
The use of echo=TRUE is best, as it will show the code that is being run, and is thus more transparent to you and the future replicator.
Even if you are using RStudio, run this using the terminal method in RStudio for any platform, or from the terminal (macOS, Linux):
R CMD BATCH main.R
Do not use Rscript, as it will not generate enough output!
For examples for Julia, Python, MATLAB, and multi-software scripts, see the full text.
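As a rough Python counterpart to the R controller script above, the following is a minimal sketch, not the version from the full text. The file names simply mirror the R example, and run_all() is a hypothetical helper name. It runs each component script in order from a given root directory and stops at the first error.

```python
# main.py: hypothetical controller script (illustrative sketch only)
import runpy
from pathlib import Path

# Component scripts, in the order they must run.
# These names are assumptions that mirror the R example.
SCRIPTS = [
    "01_data_prep.py",
    "02_analysis.py",
    "03_tables.py",
    "04_figures.py",
    "05_appendix.py",
]


def run_all(rootdir, scripts=SCRIPTS):
    """Run each script found under rootdir, in order; return the paths run."""
    rootdir = Path(rootdir).resolve()
    completed = []
    for name in scripts:
        path = rootdir / name
        # Announce each step so that the log shows what was run.
        print(f"=== Running {path} ===", flush=True)
        # run_path executes the file as if it had been called directly;
        # any exception propagates and stops the whole run.
        runpy.run_path(str(path), run_name="__main__")
        completed.append(path)
    return completed


# To set the root directory dynamically, end main.py with:
#     run_all(Path(__file__).resolve().parent)
```

Deriving the root directory from the location of main.py plays the same role as here::here() in R: the package runs no matter where the replicator has unpacked it.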
Do you usually right-click and save the figures?
Or copy-paste them into a Word document?
Say you have 53 figures and 23 tables, the latter created from 161 different specifications. That makes for a lot of work when re-running the code, if you haven’t automated the saving of said figures and tables.
To enable “hands-off running”, saving figures (and tables) automatically is important. I will show here a few simple examples for various statistical programming languages.
In Stata, after having created the graph (“graph twoway”, “graph bar”, etc.), simply add “graph export "name_of_file.graph-extension", replace”. Many formats are available, to match journal requirements.
For more information, see https://www.stata.com/manuals/g-2graphexport.pdf.
In R, methods vary, but the two main ones are redefining the graphics device for the desired format, or using the ggsave wrapper.
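library(ggplot2)
library(here)
# Folder for figures, relative to the project root; create it if missing
figures <- here::here("figures")
dir.create(figures, showWarnings = FALSE, recursive = TRUE)
ggp <- ggplot(mtcars, aes(mpg, disp)) +
  geom_point() +
  stat_smooth(method = "lm", geom = "smooth")
# ggsave() takes the file name first, so pass the plot explicitly
ggsave(file.path(figures, "figure1.png"), plot = ggp)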
(for more information, see https://ggplot2.tidyverse.org/reference/ggsave.html)
Python, MATLAB, other R methods - see the full text.
In every programming language, this is simple!
Learn how to save tables in robust, reproducible ways. Do not try to copy-paste from console!
In Stata: esttab or outreg2, also putexcel. For fancier stuff, treat tables as data, and use regsave or export excel to manipulate them.
In R: xtable, stargazer, and others.
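The idea of treating tables as data carries over to any language. The following Python sketch uses only the standard library; save_table() is a hypothetical helper, and the numbers are placeholders for illustration, not results from any paper.

```python
import csv
from pathlib import Path


def save_table(rows, columns, outfile):
    """Write a list of dict rows to a CSV file, creating the folder if needed."""
    outfile = Path(outfile)
    outfile.parent.mkdir(parents=True, exist_ok=True)
    with outfile.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=columns)
        writer.writeheader()
        writer.writerows(rows)
    return outfile


# Placeholder values for illustration only. In practice, these rows are
# filled directly from the estimation results, never retyped by hand.
results = [
    {"specification": "baseline", "coef": 0.123, "se": 0.045},
    {"specification": "with_controls", "coef": 0.101, "se": 0.047},
]
```

Calling save_table(results, ["specification", "coef", "se"], "tables/table1.csv") then writes the table to disk on every run, so that no number needs to be copied from the console.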
I want to get ahead of the game…
Somebody will ask “but I have confidential data…”
How can you show that you actually ran the code?
The question of how to provide access to confidential data is a separate issue, with no simple solution.
In order to document that you have actually run your code, a log file, a transcript, or some other evidence, may be useful. It may even be required by certain journals.
In almost all cases, the generated log files are simple text files, without any formatting, and can be read by any text editor (e.g., Visual Studio Code, Notepad++, etc.).
If not, ensure that they are (avoid Stata SMCL files, for example).
We start by describing how to explicitly generate log files as part of the statistical processing code.
Start by creating a directory for the log files. Then add code to capture date, time, and who ran the code. Finally, create a logfile, giving it a name so it does not get closed.
* Create a directory for the log files
global logdir "${rootdir}/logs"
cap mkdir "$logdir"
* Capture date, time, and who ran the code
local c_date = c(current_date)
local cdate = subinstr("`c_date'", " ", "_", .)
local c_time = c(current_time)
local ctime = subinstr("`c_time'", ":", "_", .)
local logname = "`cdate'-`ctime'-`c(username)'"
* Create a named logfile, so that it does not get closed
local globallog = "$logdir/logfile_`logname'.log"
log using "`globallog'", name(global) replace text
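The same three steps can be sketched in Python with the standard logging module. This is an illustration under assumptions, not part of the original material: start_log() and current_user() are hypothetical helper names, and the file name pattern only imitates the Stata example.

```python
import getpass
import logging
from datetime import datetime
from pathlib import Path


def current_user():
    """Return the name of the user running the code, or 'unknown'."""
    try:
        return getpass.getuser()
    except Exception:
        # Some restricted environments expose no user name.
        return "unknown"


def start_log(rootdir):
    """Create logs/ under rootdir and start a time-stamped plain-text log."""
    # Step 1: create a directory for the log files.
    logdir = Path(rootdir) / "logs"
    logdir.mkdir(parents=True, exist_ok=True)
    # Step 2: capture date, time, and who ran the code.
    stamp = datetime.now().strftime("%Y-%m-%d-%H_%M_%S")
    user = current_user()
    logfile = logdir / f"logfile_{stamp}-{user}.log"
    # Step 3: attach a file handler to a named logger.
    handler = logging.FileHandler(logfile, mode="w", encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logger = logging.getLogger("replication")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.info("Log started by %s", user)
    return logger, logfile
```

Any later call such as logger.info("Finished data preparation") is then appended to the same plain-text file, which can be shared alongside the replication package.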
An alternative (or complement) to creating log files explicitly is to use native functionality of the software to create them. This usually is triggered when using the command line to run the software, and thus may be considered an advanced topic. The examples below are for Linux/macOS, but similar functionality exists for Windows.
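To automatically create a log file, run Stata from the command line with the -b option:
stata -b do main.do
which will create a file main.log in the same directory as main.do. Depending on your installation, the executable may be called stata-se or stata-mp instead.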
To automatically create a log file, run R from the command line using the BATCH functionality, as follows:
R CMD BATCH main.R
This will create a file main.Rout in the same directory as main.R.
On Windows, you may need to include the full path of R (in quotes, because the path contains a space):
"C:\Program Files\R\R-4.1.0\bin\R.exe" CMD BATCH main.R
If you prefer a different name for the output file, you can specify it as a second argument, for example:
R CMD BATCH main.R main.$(date +%F-%H:%M:%S).Rout
which will create a second-precise date-time stamped log file.
If you want to prevent R from saving or restoring its environment (by default, R CMD BATCH does both), you can specify the --no-save and --no-restore options.
If there are other commands, such as sink(), active in the R code, the main.Rout file will not contain some output.
To see more information, check the manual documentation by typing ?BATCH (or help(BATCH)) from within an R interactive session, or by typing R CMD BATCH --help from the command line.
To automatically create a log file, run MATLAB from the command line as follows:
A similar command on Windows would be:
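In order to capture screen output in Julia and Python, on Unix-like systems (Linux, macOS), the program output can be piped through the tee command, for example:
julia main.jl 2>&1 | tee main.log
python main.py 2>&1 | tee main.log
This will create a log file with everything that would normally appear on the console (the 2>&1 part ensures that error messages are included as well).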
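Where the tee command is not available (for instance, in a plain Windows command prompt), similar behaviour can be approximated from within Python. The following is a minimal sketch under that assumption; run_and_log() is a hypothetical helper, not a standard function.

```python
import subprocess
import sys
from pathlib import Path


def run_and_log(command, logfile):
    """Run command, echoing its output to the console and to logfile."""
    logfile = Path(logfile)
    with logfile.open("w", encoding="utf-8") as log:
        # Merge stderr into stdout so that error messages also reach the log.
        process = subprocess.Popen(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )
        # Copy each line to the console and to the log as it arrives.
        for line in process.stdout:
            sys.stdout.write(line)
            log.write(line)
        returncode = process.wait()
    return returncode
```

For example, run_and_log([sys.executable, "main.py"], "main.log") would run a controller script named main.py and keep a transcript in main.log. A non-zero return value signals that the run did not complete cleanly.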
From the renv documentation:
R: renv package
Python: venv or virtualenv module
Julia: Pkg module
Generically, all “environments” simply modify where the specific software searches (the “search path”) for its components, and in particular any supplementary components (packages, libraries, etc.).[3]
In R, the search path is given by .libPaths(); in Python, by sys.path:
>>> import sys
>>> from pprint import pprint
>>> pprint(sys.path)
['',
'C:\\Users\\lv39\\AppData\\Local\\Programs\\Python\\Python312\\python312.zip',
'C:\\Users\\lv39\\AppData\\Local\\Programs\\Python\\Python312\\DLLs',
'C:\\Users\\lv39\\AppData\\Local\\Programs\\Python\\Python312\\Lib',
'C:\\Users\\lv39\\AppData\\Local\\Programs\\Python\\Python312',
'C:\\Users\\lv39\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages']
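To make the role of the search path concrete, the following self-contained sketch creates a throwaway "project library", shows that a module stored there cannot be imported at first, and then makes it importable by putting the folder first on sys.path. The module name demo_env_module is hypothetical and exists only for this illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Create a throwaway "project library" holding one tiny module.
project_lib = Path(tempfile.mkdtemp()) / "lib"
project_lib.mkdir()
(project_lib / "demo_env_module.py").write_text("WHERE = 'project library'\n")

# Before the folder is on the search path, the module cannot be found.
try:
    importlib.import_module("demo_env_module")
    found_before = True
except ModuleNotFoundError:
    found_before = False

# Put the project folder first on the search path, then import again.
sys.path.insert(0, str(project_lib))
importlib.invalidate_caches()
module = importlib.import_module("demo_env_module")

print("Found before changing sys.path:", found_before)
print("Loaded from:", module.WHERE)
```

Tools such as venv automate exactly this kind of redirection, so that a project sees its own copies of packages rather than whatever happens to be installed system-wide.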
In Julia, the search path is given by DEPOT_PATH (see the Julia docs).
We now have the ingredients to understand (project) environments in Stata.
In Stata, we typically do not talk about environments, but the same basic structure applies: Stata searches along a set order for its commands.
Some commands are built into the executable (the software that is opened when you click on the Stata icon), but most other internal, and all external commands, are found in a search path.
sysdir directories
The default set of directories which can be searched, from a freshly installed Stata, can be queried with the sysdir command, and will look something like this:
adopath search order
The search paths where Stata looks for commands are queried by adopath, and look similar, but now have an order assigned to each entry:
When we install a package (net install, ssc install)[4], only one of the (sysdir) paths is relevant: PLUS.
But the (PLUS) directory can be manipulated:
* Set the root directory
global rootdir : pwd
* Define a location where we will hold all packages in THIS project (the "environment")
global adodir "$rootdir/ado"
* make sure it exists, if not create it.
cap mkdir "$adodir"
* Now let's simplify the adopath
* - remove the OLDPLACE and PERSONAL paths
* - NEVER REMOVE THE SYSTEM-WIDE PATHS - bad things will happen!
adopath - OLDPLACE
adopath - PERSONAL
* modify the PLUS path to point to our new location, and move it up in the order
sysdir set PLUS "$adodir"
adopath ++ PLUS
* verify the path
adopath
After this change, a package that was installed earlier, such as reghdfe, is no longer found. Why? Because we have removed the previous location (the old PLUS path) from the search sequence. It’s as if it didn’t exist.
When we now install reghdfe again:
. ssc install reghdfe
checking reghdfe consistency and verifying not already installed...
installing into C:\Users\lv39\Documents\PROJECT123\ado\plus\...
installation complete.
. which reghdfe
C:\Users\lv39\Documents\PROJECT123\ado\plus\r\reghdfe.ado
*! version 6.12.3 08aug2023
We now see it in the project-specific directory, which we can distribute with the whole project.
Let’s imagine we need an older version of reghdfe.
Most package repositories are versioned. Stata's are not (as of 2024). But see the full site for one approach.
From the earlier desiderata of environments:
Do you know your rights?
These will be noted in the data use agreement (DUA), license, or non-disclosure agreement (NDA) that you signed or clicked through to obtain access to the data from the data provider.
Careful: scraped or downloaded data that did not have an explicit license!
Just because
Whatever restrictions are imposed on the data typically convey to other replicators as well.
When the replication package relies on confidential data that cannot be shared, or is shared under different conditions, you should
Prepare a confidential (partial) replication package, project-confidential.zip, which contains the contents of data/confidential and possibly data/conf_analysis.
Prepare a non-confidential replication package that contains all code, and any data that is not subject to publication controls.
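The split into two packages can itself be automated. The following Python sketch assumes the folder layout named above (data/confidential and data/conf_analysis); the function name build_packages() and the file name project-public.zip are hypothetical choices for this illustration.

```python
import zipfile
from pathlib import Path

# Folders holding material that must not be published (layout from the text).
CONFIDENTIAL_DIRS = ("data/confidential", "data/conf_analysis")


def build_packages(project_dir, out_dir):
    """Write a public and a confidential zip file for project_dir.

    out_dir should lie outside project_dir, so that the archives
    do not end up inside themselves.
    """
    project_dir = Path(project_dir)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    public_zip = out_dir / "project-public.zip"  # hypothetical name
    conf_zip = out_dir / "project-confidential.zip"
    with zipfile.ZipFile(public_zip, "w", zipfile.ZIP_DEFLATED) as pub, \
         zipfile.ZipFile(conf_zip, "w", zipfile.ZIP_DEFLATED) as conf:
        for path in sorted(project_dir.rglob("*")):
            if not path.is_file():
                continue
            rel = path.relative_to(project_dir).as_posix()
            # Files under a confidential folder go only into the confidential zip.
            is_conf = any(rel.startswith(d + "/") for d in CONFIDENTIAL_DIRS)
            (conf if is_conf else pub).write(path, arcname=rel)
    return public_zip, conf_zip
```

Building both archives from the same directory tree in one pass reduces the risk that a confidential file slips into the public package by hand.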
[1] Source: Red Warning PNG Clipart, CC-BY.
[2] Results computed on Nov 26, 2023 based on a scan of replication packages conducted by Sebastian Kranz. 2023. “Economic Articles with Data”. https://ejd.econ.mathematik.uni-ulm.de/, searching for the words main, master, makefile, dockerfile, apptainer, singularity in any of the program files in those replication packages. Code not yet integrated into this presentation.
[3] Formally, this is true for operating systems as well, and in some cases, the operating system and the programming language interact (for instance, in Python).
[4] See the net install reference. Strictly speaking, the location where ado packages are installed can be changed via the net set ado command, but this is rarely done in practice, and we won’t do it here.