2025-01-01
Journals require that you share your code and data in a replication package at the end of your research project.
Following some best practices from day 1 can not only help you prepare this package later, but also make you more productive researchers.
We start with an empty folder, and an idea.
Does procurement in the EU have a bias towards local providers?
We finish with a mini-project about public procurement across various European countries.
Desktop/day1
cd ../Desktop/day1
A license (licence) is an official permission or permit to do, use, or own something (as well as the document of that permission or permit).
Various guidance pages provided by data editors and others:
Stata
R
Let’s start with something easy. Separate folders for each function: code/ and data/
TIER Protocol
We will download the data -> create a script, download_data.do
🛑Do not hard-code paths!
copy "$URL" "C:\Users\lv39\Desktop\day1\data\dist_cepii.dta", replace
Why?
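One reason: the path above exists on exactly one computer. A minimal sketch of the alternative, assuming Stata is started from the project's root folder (the globals rootdir and datadir are illustrative names, and $URL is assumed to be defined earlier, as in the line above):

```stata
// Assumption: the working directory is the project root (e.g. Desktop/day1)
global rootdir : pwd
global datadir "$rootdir/data"
// Create the data folder if it does not exist yet
cap mkdir "$datadir"
// $URL is assumed to be set earlier; skip the download if it is not
if "$URL" != "" copy "$URL" "$datadir/dist_cepii.dta", replace
```

Every later script can then refer to "$datadir/dist_cepii.dta", and the project runs unchanged on a co-author's or replicator's machine.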
🛑Do not rename data files!
copy "$URL" "C:\Users\lv39\Desktop\day1\data\that_file_from_cepii.dta", replace
Why?
Write a merging script. Create the sample.
These two can go in the same script, create_analysis_sample.do
No overwriting of original data!
- Keep original files in the data/raw directory.
- Save derived files in the data/derived directory.
Download only if we want to download
Automatically download the file again if not there.
What if the file has changed?
if $redownload == 1 {
copy "https://datahub.io/core/country-codes/r/country-codes.csv" "data/raw/country-codes.csv", replace
// create checksum of file
// Aug 2023 version: 2295658388
global countrycksum 2295658388
checksum "data/raw/country-codes.csv", save
assert $countrycksum == r(checksum)
// This will fail if the files are not identical
// Provide a verbose message if we get past this point
disp in green "Country codes file downloaded successfully"
}
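The "download again if not there" idea can be sketched as a small guard placed before the block above. This is only an illustration: the default value of 0 for redownload is an assumption, not something the example above prescribes.

```stata
// Assumed default: do not download unless asked to
global redownload 0
// Force a download if the raw file is missing
capture confirm file "data/raw/country-codes.csv"
if _rc != 0 global redownload 1
display "redownload = $redownload"
```

With this guard, a fresh clone of the project fetches the file once, and later runs leave the raw data alone unless redownload is set to 1 by hand.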
Be informative!
Here we always run the 00_setup.do file.
Then conditionally run the other pieces:
config.do file to contain configuration parameters
So let’s automate some of this:
Configure the steps on certain conditions:
// define steps
global step1 1
global step2 1
global step3 1
// verify if file has changed
qui checksum "$resultfile1"
// if not, don't run Step 2
if `r(checksum)' == $checksum1 global step2 0
// Nothing needs to be changed here
do "$rootdir/code/00_setup.do"
if $step1 == 1 do "$rootdir/code/01_download_data.do"
if $step2 == 1 do "$rootdir/code/02_create_analysis_sample.do"
if $step3 == 1 do "$rootdir/code/03_analysis.do"
and config.do contains additional information:
// file locations
// code to set rootdir omitted
global inputdata "$rootdir/data/inputs"
global tempdata "$rootdir/temporary"
global outputs "$rootdir/tables-figures"
// ensure they are created
cap mkdir "$tempdata"
cap mkdir "$outputs"
// some key parameters
global resultfile1 "$outputs/table1.tex"
global checksum1 386698503
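The number stored in checksum1 has to come from somewhere. A self-contained sketch of how such a value could be obtained, using a throwaway temporary file in place of the real table1.tex (the file contents here are made up for illustration):

```stata
// Write a small stand-in file; in the project this would be $resultfile1
tempfile demo
file open fh using "`demo'", write text replace
file write fh "placeholder table contents" _n
file close fh
// checksum stores its result in r(checksum)
checksum "`demo'"
display "Paste into config.do: global checksum1 " %12.0f r(checksum)
```

Run once against the real result file, the displayed value is what goes into config.do; any later change to the file changes the checksum.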
Consider a final test if everything runs:
- Delete the temporary/ and tables-figures/ folders.
- Run the main.do file again.

Store secrets in environment variables or files that are not published.
GitHub secret scanning
Typed interactively (here for Linux and Mac)
(this is not recommended)
The same syntax is used for the contents of “dot-env” or “Renviron” files, and in fact for bash or zsh startup files (.bash_profile, .zshrc).
Edit .Renviron (note the dot!) files:
# Edit global (personal) Renviron
usethis::edit_r_environ()
# You can also consider creating project-specific settings:
usethis::edit_r_environ(scope = "project")
Use the variables defined in .Renviron:
Loading regular environment variables:
Loading with dotenv
Yes, this also works in Stata
// load from environment
global mysecret : env MYSECRET
display "$mysecret" // don't actually do this in code
and via (what else) a user-written package for loading from files:
net install doenv, from(https://github.com/vikjam/doenv/raw/master/)
doenv using ".env"
global mysecret "`r(MYSECRET)'"
display "$mysecret"
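Rather than displaying the secret, a script can simply check that it is present. A minimal sketch, assuming the variable is called MYSECRET as in the example above:

```stata
// Read the secret from the environment without ever printing it
global mysecret : env MYSECRET
if "$mysecret" == "" {
    display as error "MYSECRET is not set in the environment"
}
else {
    display "MYSECRET found (value not shown)"
}
```

This keeps the value out of log files, which often end up inside the replication package.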
//============ confidential parameters =============
capture confirm file "confidential/confparms.do"
if _rc == 0 {
// file exists
include "confidential/confparms.do"
} else {
di in red "No confidential parameters found"
}
//============ end confidential parameters =========
//============ non-confidential parameters =========
include "config.do"
//============ end parameters ======================
replace anoncounty=1 if county=="Tompkins, NY"
A really bad idea, but yes, you probably want to hide that.
So whether reasonable or not, this is an issue. How do you do that, without messing up the code, or spending hours redacting your code?
Assume that the variables q2f and q3e are considered confidential by some rule, and that the minimum cell size 10 is also confidential.
A bad example, because it literally makes more work for you and for future replicators, is to manually redact the confidential information with text that is not legitimate code:
set seed NNNNN
use <removed vars> county using "<removed path>", clear
gen logprofit = log(XXXX)
collapse (count) n=XXXX (mean) logprofit, by(county)
drop if n<XXXX
graph twoway scatter n logprofit
The redacted program above will no longer run, and will be very tedious to un-redact if a subsequent replicator obtains legitimate access to the confidential data.
Simply replacing the confidential values with placeholders that are valid in the programming language of your choice is already better. Here’s the confidential version of the file:
//============ confidential parameters =============
global confseed 12345
global confpath "/data/economic/cmf2012"
global confprofit q2f
global confemploy q3e
global confmincell 10
//============ end confidential parameters =========
set seed $confseed
use $confprofit $confemploy county using "${confpath}/extract.dta", clear
gen logprofit = log($confprofit)
collapse (count) n=$confemploy (mean) logprofit, by(county)
drop if n<$confmincell
graph twoway scatter n logprofit
and this would be the released file, part of the replication package:
//============ confidential parameters =============
global confseed XXXX // a number
global confpath "XXXX" // a path that will be communicated to you
global confprofit XXX // Variable name for profit T26
global confemploy XXX // Variable name for employment T26
global confmincell XXX // a number
//============ end confidential parameters =========
set seed $confseed
use $confprofit $confemploy county using "${confpath}/extract.dta", clear
gen logprofit = log($confprofit)
collapse (count) n=$confemploy (mean) logprofit, by(county)
drop if n<$confmincell
graph twoway scatter n logprofit
While the code won’t run as-is, it is easy to un-redact, regardless of how many times you reference the confidential values, e.g., q2f, anywhere in the code.
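A replicator who forgets to fill in a placeholder will get a cryptic error somewhere downstream. A hypothetical pre-flight check, assuming the global names and the XXX/XXXX placeholder strings used in the released file above, could flag this early:

```stata
// Count confidential parameters that are empty or still placeholders
local missing 0
foreach g in confseed confpath confprofit confemploy confmincell {
    if inlist("${`g'}", "", "XXX", "XXXX") {
        display as error "Confidential parameter `g' is not set"
        local ++missing
    }
}
if `missing' > 0 {
    display as error "`missing' confidential parameter(s) must be filled in first"
}
```

The check only reports; whether to stop the run at that point is a design choice left to the author.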
\(\rightarrow\) provide code that
Main file main.do:
//============ confidential parameters =============
capture confirm file "include/confparms.do"
if _rc == 0 {
// file exists
include "include/confparms.do"
} else {
di in red "No confidential parameters found"
}
//============ end confidential parameters =========
//============ non-confidential parameters =========
global safepath "releasable"
cap mkdir "$safepath"
//============ end parameters ======================
Main file main.do (continued)
// :::: Process only if confidential data is present
capture confirm file "${confpath}/extract.dta"
if _rc == 0 {
    set seed $confseed
    use $confprofit $confemploy county using "${confpath}/extract.dta", clear
    gen logprofit = log($confprofit)
    collapse (count) n=$confemploy (mean) logprofit, by(county)
    drop if n<$confmincell
    save "${safepath}/figure1.dta", replace
}
else {
    di in red "Skipping processing of confidential data"
}
//============ at this point, the data is releasable ======
// :::: Process always
use "${safepath}/figure1.dta", clear
graph twoway scatter n logprofit
graph export "${safepath}/figure1.pdf", replace
Auxiliary file include/confparms.do (not released)
Auxiliary file include/confparms_template.do (this is released)
//============ confidential parameters =============
global confseed XXXX // a number
global confpath "XXXX" // a path that will be communicated to you
global confprofit XXX // Variable name for profit T26
global confemploy XXX // Variable name for employment T26
global confmincell XXX // a number
//============ end confidential parameters =========
Thus, the replication package would have:
We already had this:
main.do
README.md
include/confparms_template.do
releasable/figure1.dta
releasable/figure1.pdf
Start with our fabulous template README. Really, it helps! Available at https://social-science-data-editors.github.io/template_README/
That’s easy: you’ve been keeping clean instructions since the start, right?
main.do
You’ve been doing that since day 1!
In most confidential environments, such as FSRDC/IRE, this part is out of your control. But describe it anyway!
which estout
Some of that is captured in your notes (updated, remember?), some of that may change over the life of the project, and may be captured in your logs, or your qsub files.
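Much of this can be written to the log automatically at the start of every run. A minimal sketch using Stata's built-in c() system values; estout appears only because it is the package named above:

```stata
// Record the computing environment at the top of the log
display "Stata version: `c(stata_version)'"
display "OS:            `c(os)' `c(osdtl)' (`c(machine_type)')"
display "Processors:    `c(processors)'"
display "Run on:        `c(current_date)' `c(current_time)'"
// Report the installed version of a user-written package, if present
capture noisily which estout
```

Placed in a setup script, these lines keep the computer requirements section of the README in step with what was actually used.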
In order to describe data availability, split into two:
Examples include
- All the results in the paper use confidential microdata from the U.S. Census Bureau. To gain access to the Census microdata, follow the directions here on how to write a proposal for access to the data via a Federal Statistical Research Data Center: https://www.census.gov/ces/rdcresearch/howtoapply.html.
- You must request the following datasets in your proposal:
- Longitudinal Business Database (LBD), 2002 and 2007
- Foreign Trade Database – Import (IMP), 2002 and 2007
- Annual Survey of Manufactures (ASM), including the Computer Network Use Supplement (CNUS), 1999
- […]
- Annual Survey of Magical Inputs (ASMI), 2002 and 2007
- Reference “Technology and Production Fragmentation: Domestic versus Foreign Sourcing” by Teresa Fort, project number br1179 in the proposal. This will give you access to the programs and input datasets required to reproduce the results. Requesting a search of archives with the article’s DOI (“10.1093/restud/rdw057”) should yield the same results.
NOTE: Project-related files are available for 10 years as of 2015.
Examples include
The information used in the analysis combines several Danish administrative registers (as described in the paper). The data use is subject to the European Union’s General Data Protection Regulation (GDPR) per new Danish regulations from May 2018. The data are physically stored on computers at Statistics Denmark and, due to security considerations, the data may not be transferred to computers outside Statistics Denmark. Researchers interested in obtaining access to the register data employed in this paper are required to submit a written application to gain approval from Statistics Denmark. The application must include a detailed description of the proposed project, its purpose, and its social contribution, as well as a description of the required datasets, variables, and analysis population. Applications can be submitted by researchers who are affiliated with Danish institutions accepted by Statistics Denmark, or by researchers outside of Denmark who collaborate with researchers affiliated with these institutions.
(Example taken from Fadlon and Nielsen, AEJ:Applied 2021).
- Data availability (and citations): start of project, edit at the end
- Computer requirements: middle of project
- Description of processing: middle of project
with the end really just a last read/edit.
Now you wait for the replicators to show up!
Step 3: General
Step 3 (with robust download code) Stata
Alternatively: https://github.com/datasets/country-codes/blob/master/data/country-codes.csv