2024-08-01
Journals require that you share your code and data in a replication package at the end of your research project.
Following some best practices from day 1 can not only help you prepare this package later, but also make you more productive researchers.
We start with an empty folder, and an idea.
Does procurement in the EU have a bias towards local providers?
We finish with a mini-project about public procurement across various European countries.
Desktop/day1
cd ../Desktop/day1
A license (licence) is an official permission or permit to do, use, or own something (as well as the document of that permission or permit).
Let’s start with something easy. Separate folders for each function: code/ and data/
We will download -> Create a script download_data.do
🛑Do not hard-code paths!
copy "$URL" "C:\Users\lv39\Desktop\day1\data\dist_cepii.dta", replace
Why?
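One way to avoid the hard-coded path is to define the project root once and build every other path from it. This is a minimal sketch, assuming Stata is started from the project folder so that `c(pwd)` is the project root. The global name rootdir matches the one used later in main.do, and $URL is the same placeholder as in the line above.

```stata
// Assumption: Stata's working directory is the project root (e.g., Desktop/day1)
global rootdir "`c(pwd)'"

// Create the data folder relative to the root (no error if it already exists)
cap mkdir "$rootdir/data"

// Same download as above, but the path no longer depends on one user's computer
copy "$URL" "$rootdir/data/dist_cepii.dta", replace
```

Anyone who unpacks the project elsewhere only needs to start Stata in that folder; no line of code has to be edited.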
🛑Do not rename data files!
copy "$URL" "C:\Users\lv39\Desktop\day1\data\that_file_from_cepii.dta", replace
Why?
Write a merging script. Create the sample.
These two can go in the same script, create_sample.do
No overwriting of original data!
The merged data should go in the data/generated directory.
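A minimal sketch of that rule, using toy data typed in by hand rather than the real downloaded files. The variable names, the values, and the output file name analysis_sample.dta are made up for illustration only.

```stata
// Toy stand-in for a raw distance file (values are illustrative only)
clear
input str3 iso3 distance
"AAA" 100
"BBB" 250
end
tempfile dist
save `dist'

// Toy stand-in for a raw country list
clear
input str3 iso3 str10 country
"AAA" "Country A"
"BBB" "Country B"
end

// Merge in memory; the raw inputs on disk are never modified
merge 1:1 iso3 using `dist', assert(match) nogenerate

// Write the result to data/generated, never back into the raw folder
cap mkdir "data"
cap mkdir "data/generated"
save "data/generated/analysis_sample.dta", replace
```

Keeping generated files in their own folder means the folder can be deleted at any time and rebuilt by re-running the scripts.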
do "code/00_setup.do"
do "code/01_download_data.do"
do "code/02_create_analysis_sample.do"
do "code/03_analysis.do"
Step 5:
Download only if we want to download
Automatically download the file again if not there.
What if the file has changed?
if $redownload == 1 {
copy "https://datahub.io/core/country-codes/r/country-codes.csv" "data/raw/country-codes.csv", replace
// create checksum of file
// Aug 2023 version: 2295658388
global countrycksum 2295658388
checksum "data/raw/country-codes.csv", save
assert $countrycksum == r(checksum)
// This will fail if the files are not identical
// Provide a verbose message if we get past this point
disp in green "Country codes file downloaded successfully"
}
Be informative!
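The three points above (download only on request, download again if the file is missing, detect a changed file) can be combined in one place. The following is a sketch rather than the original script: it assumes the global redownload is set elsewhere (for example in main.do), and it reuses the URL, path, and August 2023 checksum from the code above.

```stata
// Sketch: download when asked to, or when the raw file is missing
cap mkdir "data"
cap mkdir "data/raw"
capture confirm file "data/raw/country-codes.csv"
if _rc != 0 | "$redownload" == "1" {
    copy "https://datahub.io/core/country-codes/r/country-codes.csv" "data/raw/country-codes.csv", replace
}

// Always verify the file on disk, whether or not it was just downloaded
global countrycksum 2295658388
quietly checksum "data/raw/country-codes.csv"
if r(checksum) != $countrycksum {
    display as error "country-codes.csv differs from the Aug 2023 version"
    exit 9
}
display as text "Country codes file present and unchanged"
```

Checking the checksum outside the download block means a silently replaced or corrupted file is caught even when no download takes place.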
Here we always run the 00_setup.do file.
Then conditionally run the other pieces, using a config.do file to contain configuration parameters.
So let’s automate some of this:
Configure the steps on certain conditions:
// define steps
global step1 1
global step2 1
global step3 1
// verify if file has changed
qui checksum "$resultfile1"
// if not, don't run Step 2
if `r(checksum)' == $checksum1 global step2 0
// Nothing needs to be changed here
do "$rootdir/code/00_setup.do"
if $step1 == 1 do "$rootdir/code/01_download_data.do"
if $step2 == 1 do "$rootdir/code/02_create_analysis_sample.do"
if $step3 == 1 do "$rootdir/code/03_analysis.do"
and config.do contains additional information:
// file locations
// code to set rootdir omitted
global inputdata "$rootdir/data/inputs"
global tempdata "$rootdir/temporary"
global outputs "$rootdir/tables-figures"
// ensure they are created
cap mkdir "$tempdata"
cap mkdir "$outputs"
// some key parameters
global resultfile1 "$outputs/table1.tex"
global checksum1 386698503
Consider a final test if everything runs:
Delete the temporary/ and tables-figures/ folders.
Run the main.do file again.
Store secrets in environment variables or files that are not published.
Typed interactively (here for Linux and Mac; this is not recommended)
Same syntax used for contents of “dot-env” or “Renviron” files, and in fact bash or zsh startup files (.bash_profile, .zshrc)
Edit .Renviron (note the dot!) files:
# Edit global (personal) Renviron
usethis::edit_r_environ()
# You can also consider creating project-specific settings:
usethis::edit_r_environ(scope = "project")
Use the variables defined in .Renviron:
Loading regular environment variables:
Loading with dotenv
Yes, this also works in Stata
// load from environment
global mysecret : env MYSECRET
display "$mysecret" // don't actually do this in code
and via (what else) a user-written package for loading from files:
net install doenv, from(https://github.com/vikjam/doenv/raw/master/)
doenv using ".env"
global mysecret "`r(MYSECRET)'"
display "$mysecret"
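With the environment-variable route, an unset variable simply yields an empty macro, so it can help to check for that before the secret is used. A minimal sketch, assuming the same MYSECRET variable as above; the messages are illustrative.

```stata
// Read the secret from the environment (empty if the variable is not set)
global mysecret : env MYSECRET

// Warn early, without ever printing the value itself
if "$mysecret" == "" {
    display as error "MYSECRET is not set; define it in your environment first"
}
else {
    display as text "MYSECRET found (value not shown)"
}
```

Reporting only whether the secret exists keeps it out of log files that may later be published with the replication package.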
//============ confidential parameters =============
capture confirm file "confidential/confparms.do"
if _rc == 0 {
// file exists
include "confidential/confparms.do"
} else {
di in red "No confidential parameters found"
}
//============ end confidential parameters =========
//============ non-confidential parameters =========
include "config.do"
//============ end parameters ======================
replace anoncounty=1 if county=="Tompkins, NY"
Hard-coding a rule like this is a really bad idea, but yes, you probably want to hide that.
So whether reasonable or not, this is an issue. How do you do that, without messing up the code, or spending hours redacting your code?
Assume that the variables q2f and q3e are considered confidential by some rule, and that the minimum cell size 10 is also confidential.
A bad example, because it literally makes more work for you and for future replicators, is to manually redact the confidential information with text that is not legitimate code:
set seed NNNNN
use <removed vars> county using "<removed path>", clear
gen logprofit = log(XXXX)
collapse (count) n=XXXX (mean) logprofit, by(county)
drop if n<XXXX
graph twoway scatter n logprofit
The redacted program above will no longer run, and will be very tedious to un-redact if a subsequent replicator obtains legitimate access to the confidential data.
Simply replacing the confidential values with placeholders that are valid in the programming language of your choice is already better. Here’s the confidential version of the file:
//============ confidential parameters =============
global confseed 12345
global confpath "/data/economic/cmf2012"
global confprofit q2f
global confemploy q3e
global confmincell 10
//============ end confidential parameters =========
set seed $confseed
use $confprofit $confemploy county using "${confpath}/extract.dta", clear
gen logprofit = log($confprofit)
collapse (count) n=$confemploy (mean) logprofit, by(county)
drop if n<$confmincell
graph twoway scatter n logprofit
and this would be the released file, part of the replication package:
//============ confidential parameters =============
global confseed XXXX // a number
global confpath "XXXX" // a path that will be communicated to you
global confprofit XXX // Variable name for profit T26
global confemploy XXX // Variable name for employment T26
global confmincell XXX // a number
//============ end confidential parameters =========
set seed $confseed
use $confprofit $confemploy county using "${confpath}/extract.dta", clear
gen logprofit = log($confprofit)
collapse (count) n=$confemploy (mean) logprofit, by(county)
drop if n<$confmincell
graph twoway scatter n logprofit
While the code won’t run as-is, it is easy to un-redact, regardless of how many times you reference the confidential values, e.g., q2f, anywhere in the code.
\(\rightarrow\) provide code that
Main file main.do:
//============ confidential parameters =============
capture confirm file "include/confparms.do"
if _rc == 0 {
// file exists
include "include/confparms.do"
} else {
di in red "No confidential parameters found"
}
//============ end confidential parameters =========
//============ non-confidential parameters =========
global safepath "releasable"
cap mkdir "$safepath"
//============ end parameters ======================
Main file main.do (continued)
// :::: Process only if confidential data is present
capture confirm file "${confpath}/extract.dta"
if _rc == 0 {
set seed $confseed
use $confprofit $confemploy county using "${confpath}/extract.dta", clear
gen logprofit = log($confprofit)
collapse (count) n=$confemploy (mean) logprofit, by(county)
drop if n<$confmincell
save "${safepath}/figure1.dta", replace
} else {
di in red "Skipping processing of confidential data"
}
//============ at this point, the data is releasable ======
// :::: Process always
use "${safepath}/figure1.dta", clear
graph twoway scatter n logprofit
graph export "${safepath}/figure1.pdf", replace
Auxiliary file include/confparms.do (not released)
Auxiliary file include/confparms_template.do (this is released)
//============ confidential parameters =============
global confseed XXXX // a number
global confpath "XXXX" // a path that will be communicated to you
global confprofit XXX // Variable name for profit T26
global confemploy XXX // Variable name for employment T26
global confmincell XXX // a number
//============ end confidential parameters =========
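Because the template holds placeholders rather than values, it can be useful for main.do to notice when they have not been filled in yet. A sketch, assuming the global names from the template above; the checks and messages are illustrative, not part of the original package.

```stata
// confseed and confmincell must be numbers; the template's XXXX/XXX are not
foreach parm in confseed confmincell {
    capture confirm number ${`parm'}
    if _rc != 0 {
        display as error "`parm' is not a number: fill in include/confparms.do"
    }
}

// confpath must point to a folder that contains the data extract
capture confirm file "${confpath}/extract.dta"
if _rc != 0 {
    display as error "extract.dta not found under confpath"
}
```

A replicator who has obtained access then sees exactly which parameter is still missing, instead of an opaque syntax error further down.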
Thus, the replication package would have:
Step 3: General
Step 3 (with robust download code) Stata
Alternatively: https://github.com/datasets/country-codes/blob/master/data/country-codes.csv