Based on an earlier presentation and tutorial at the Cornell Day of Data 2021 with David Wasser and a presentation at MONT^2.
2022-10-17
Part 1: When to start
Part 2: Ideal directory and data structure; adapting to confidential / big data
Part 3: Ideal programming practices; secure coding techniques
Part 4: Documenting what you did; when to document
git
It might be “Future You!”
git branches!
(main.do, run.sh)
(make, checksums, etc.)

    // Header of main.do
    global step1 0
    global step2 1
    global step3 1
    if $step1 == 1 do "$code/01_dataclean.do"
    if $step2 == 1 do "$code/02_merge_stuff.do"
    if $step3 == 1 do "$code/03_figures.do"
    // Header of main.do
    global step1 1
    global step2 1
    global step3 1
    // verify if file has changed
    global checksum1 386698503
    qui checksum "$resultfile1"
    // skip step 1 if its result file is unchanged
    if `r(checksum)' == $checksum1 global step1 0
    if $step1 == 1 do "$code/01_dataclean.do"
    if $step2 == 1 do "$code/02_merge_stuff.do"
    if $step3 == 1 do "$code/03_figures.do"
TIER protocol
Structure your project
    /inputs
    /outputs
    /code
    /paper

    /datos/
        /brutos
        /limpiados
        /finales
    /codigo
    /articulo
It doesn’t really matter, as long as it is logical.
TIER Protocol again
TIER Protocol Confidential
This may no longer work:

    /datos/
        /brutos
        /limpiados
        /finales
    /codigo
    /articulo

This may be what it looks like:

    /proyecto/
        /datos/
            /brutos
            /limpiados
            /finales
        /codigo
        /articulo
    /secretos (read-only)
        /impuestos (read-only)
        /salarios (read-only)
How do we handle directories outside of the project space?
(main.do, run.sh) preferred
main.do
    // Header of main.do
    global step1 1
    global step2 1
    global step3 1
    // verify if file has changed
    global checksum1 386698503
    qui checksum "$resultfile1"
    // skip step 1 if its result file is unchanged
    if `r(checksum)' == $checksum1 global step1 0
    if $step1 == 1 do "$code/01_dataclean.do"
    if $step2 == 1 do "$code/02_merge_stuff.do"
    if $step3 == 1 do "$code/03_figures.do"
As the file structure becomes more complex, configure shortcuts (globals, variables, etc.)

    // config.do
    // this is where you would write the data you create in this project
    global outputdata "/proyecto/datos/limpiados"
    // All tables for inclusion in your paper go here
    global results "/proyecto/articulo"
    // All programs (which you might "include") are to be found here
    global programs "/proyecto/codigo"
Expanded to include non-project space directories:
    // config.do
    global taxdata "/secretos/impuestos"
    global salarydata "/secretos/salarios"
    // this is where you would write the data you create in this project
    global outputdata "/proyecto/datos/limpiados"
    // All tables for inclusion in your paper go here
    global results "/proyecto/articulo"
    // All programs (which you might "include") are to be found here
    global programs "/proyecto/codigo"
Or something like this:
    // config.do
    global taxdata "/data/irs1040"
    global salarydata "/data/lehd"
    // this is where you would write the data you create in this project
    global outputdata "/project/data/outputs"
    // All tables for inclusion in your paper go here
    global results "/project/article/tables"
    // All programs (which you might "include") are to be found here
    global programs "/project/code"
Let’s expand it to contain another parameter:
    // config.do
    global taxdata "/data/irs1040"
    global salarydata "/data/lehd"
    // this is where you would write the data you create in this project
    global outputdata "/project/data/outputs"
    // All tables for inclusion in your paper go here
    global results "/project/article/tables"
    // All programs (which you might "include") are to be found here
    global programs "/project/code"
    global checksum1 386698503
    // Header of main.do
    include "config.do"
    global step1 1
    global step2 1
    global step3 1
    // verify if file has changed ($checksum1 now comes from config.do)
    qui checksum "$resultfile1"
    if `r(checksum)' == $checksum1 global step1 0
    // config.do defines $programs as the code directory
    if $step1 == 1 do "$programs/01_dataclean.do"
    if $step2 == 1 do "$programs/02_merge_stuff.do"
    if $step3 == 1 do "$programs/03_figures.do"
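The checksum call in the header stops with an error if the result file does not exist yet, for example on a first run. A minimal sketch of a guarded version, assuming config.do has already been included (so that $outputdata and $checksum1 are defined). The file name 01_cleaned.dta is hypothetical and stands in for whatever step 1 writes:

```stata
// Assumption: config.do has been included and defines $outputdata and $checksum1.
// Hypothetical file name; point it at the output of step 1.
global resultfile1 "$outputdata/01_cleaned.dta"

// Default: run step 1
global step1 1

// Only compute the checksum if the file is actually there
capture confirm file "$resultfile1"
if _rc == 0 {
    quietly checksum "$resultfile1"
    // Skip step 1 only when the stored checksum matches the file on disk
    if r(checksum) == $checksum1 {
        global step1 0
    }
}
display "step1 = $step1"
```

With this guard, a missing or changed file simply leaves step1 at 1, so the cleaning step is re-run rather than the whole main.do aborting.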
But let’s extend it to confidential code.
(replace anoncounty=1 if county=="Tompkins, NY"). A really bad idea, but yes, you probably want to hide that.
So whether reasonable or not, this is an issue. How do you do that without messing up the code, or spending hours redacting it?
Assume that the variables q2f and q3e are considered confidential by some rule, and that the minimum cell size 10 is also confidential.

    set seed 12345
    use q2f q3e county using "/data/economic/cmf2012/extract.dta", clear
    gen logprofit = log(q2f)
    collapse (count) n=q3e (mean) logprofit, by(county)
    drop if n<10
    graph twoway scatter n logprofit
A bad approach, because it creates more work for you and for future replicators, is to manually redact the confidential information with text that is not legitimate code:

    set seed NNNNN
    use <removed vars> county using "<removed path>", clear
    gen logprofit = log(XXXX)
    collapse (count) n=XXXX (mean) logprofit, by(county)
    drop if n<XXXX
    graph twoway scatter n logprofit
The redacted program above will no longer run, and will be very tedious to un-redact if a subsequent replicator obtains legitimate access to the confidential data.
Simply replacing the confidential values with valid placeholders in the programming language of your choice is already better. Here’s the confidential version of the file:

    //============ confidential parameters =============
    global confseed 12345
    global confpath "/data/economic/cmf2012"
    global confprofit q2f
    global confemploy q3e
    global confmincell 10
    //============ end confidential parameters =========
    set seed $confseed
    use $confprofit $confemploy county using "${confpath}/extract.dta", clear
    gen logprofit = log($confprofit)
    collapse (count) n=$confemploy (mean) logprofit, by(county)
    drop if n<$confmincell
    graph twoway scatter n logprofit
and this would be the released file, part of the replication package:
    //============ confidential parameters =============
    global confseed XXXX     // a number
    global confpath "XXXX"   // a path that will be communicated to you
    global confprofit XXX    // Variable name for profit T26
    global confemploy XXX    // Variable name for employment T26
    global confmincell XXX   // a number
    //============ end confidential parameters =========
    set seed $confseed
    use $confprofit $confemploy county using "${confpath}/extract.dta", clear
    gen logprofit = log($confprofit)
    collapse (count) n=$confemploy (mean) logprofit, by(county)
    drop if n<$confmincell
    graph twoway scatter n logprofit

While the code won’t run as-is, it is easy to un-redact, regardless of how many times you reference the confidential values (e.g., q2f) anywhere in the code.
Note that you have to re-run the entire code to obtain a modified graph, e.g., if you want to add some reference line, or change colors. But if the data presented in the graph is non-sensitive (i.e., disclosable), then the data underlying it is as well. Thus, and this is a more general approach, we can provide code that automatically detects if the confidential data is there, and only then will it run the data preparation part, but it will always run for the graphing (“analysis”) part of the code.
We also introduce the use of a separate file for all the confidential parameters, which may be more convenient, since now, no redaction is needed - the confidential file is simply dropped (but should be documented).
Main file main.do:

    //============ confidential parameters =============
    capture confirm file "include/confparms.do"
    if _rc == 0 {
        // file exists
        include "include/confparms.do"
    }
    else {
        di in red "No confidential parameters found"
    }
    //============ end confidential parameters =========
    //============ non-confidential parameters =========
    global safepath "releasable"
    cap mkdir "$safepath"
    //============ end parameters ======================
Main file main.do (continued)

    // :::: Process only if confidential data is present
    capture confirm file "${confpath}/extract.dta"
    if _rc == 0 {
        set seed $confseed
        use $confprofit $confemploy county using "${confpath}/extract.dta", clear
        gen logprofit = log($confprofit)
        collapse (count) n=$confemploy (mean) logprofit, by(county)
        drop if n<$confmincell
        save "${safepath}/figure1.dta", replace
    }
    else {
        di in red "Skipping processing of confidential data"
    }
    //============ at this point, the data is releasable ======
    // :::: Process always
    use "${safepath}/figure1.dta", clear
    graph twoway scatter n logprofit
    graph export "${safepath}/figure1.pdf", replace
Auxiliary file include/confparms.do (not released)

    //============ confidential parameters =============
    global confseed 12345
    global confpath "/data/economic/cmf2012"
    global confprofit q2f
    global confemploy q3e
    global confmincell 10
    //============ end confidential parameters =========
Auxiliary file include/confparms_template.do (this is released)

    //============ confidential parameters =============
    global confseed XXXX     // a number
    global confpath "XXXX"   // a path that will be communicated to you
    global confprofit XXX    // Variable name for profit T26
    global confemploy XXX    // Variable name for employment T26
    global confmincell XXX   // a number
    //============ end confidential parameters =========
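A replicator who copies the template to include/confparms.do may overlook one of the placeholders. A minimal sketch of a check that main.do could run right after including the parameter file; the global names come from the template, while the flag name confready is an assumption introduced here for illustration:

```stata
// Hypothetical check, to be run after the include of confparms.do.
// "confready" is an assumed flag name, not part of the original files.
global confready 1
foreach g in confseed confpath confprofit confemploy confmincell {
    // `g' is expanded first, so ${`g'} is the content of the global with that name
    if inlist("${`g'}", "", "XXX", "XXXX") {
        display as error "Confidential parameter `g' is not set"
        global confready 0
    }
}
display "confready = $confready"
```

The sketch only reports and flags unset parameters rather than aborting, since the releasable part of main.do is meant to run even without the confidential values.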
Thus, the replication package would have:
    main.do
    README.md
    include/confparms_template.do
    releasable/figure1.dta
    releasable/figure1.pdf
We already had this:
    main.do
    README.md
    include/confparms_template.do
    releasable/figure1.dta
    releasable/figure1.pdf
Start with our fabulous template README. Really, it helps! Available at https://social-science-data-editors.github.io/template_README/
That’s easy: you’ve been keeping clean instructions since the start, right?
“main.do”: You’ve been doing that since day 1!
In most confidential environments, such as FSRDC/IRE, this part is out of your control. But describe it anyway!
(which estout)
Some of that is captured in your notes (updated, remember?); some of it may change over the life of the project, and may be captured in your logs or your qsub files.
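One way to get this information into your logs without extra effort is to print it at the top of main.do. A minimal sketch, assuming estout is the only user-written package the project needs (extend the list as required); it does not attempt to install anything, since confidential environments often have no internet access:

```stata
// Record the computing environment at the top of the log
display "Stata version: `c(stata_version)'"
display "OS:            `c(os)' `c(machine_type)'"
display "Processors:    `c(processors)'"
display "Run date:      `c(current_date)'"

// Record where each required user-written package is installed.
// The package list is an assumption; add your own packages here.
foreach pkg in estout {
    capture noisily which `pkg'
    if _rc != 0 {
        display as error "Package `pkg' is not installed"
    }
}
```

The output of which includes the package's version line, which can be copied into the README when describing software requirements.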
In order to describe data availability, split into two:
Examples include
NOTE: Project-related files are available for 10 years as of 2015.
Examples include
The information used in the analysis combines several Danish administrative registers (as described in the paper). The data use is subject to the European Union’s General Data Protection Regulation (GDPR) per new Danish regulations from May 2018. The data are physically stored on computers at Statistics Denmark and, due to security considerations, the data may not be transferred to computers outside Statistics Denmark. Researchers interested in obtaining access to the register data employed in this paper are required to submit a written application to gain approval from Statistics Denmark. The application must include a detailed description of the proposed project, its purpose, and its social contribution, as well as a description of the required datasets, variables, and analysis population. Applications can be submitted by researchers who are affiliated with Danish institutions accepted by Statistics Denmark, or by researchers outside of Denmark who collaborate with researchers affiliated with these institutions.
(Example taken from Fadlon and Nielsen, AEJ:Applied 2021).
- Data availability (and citations): start of project, edit at the end
- Computer requirements: middle of project
- Description of processing: middle of project
with the end really just a last read/edit.
Now you wait for the replicators to show up!
Find this presentation at larsvilhuber.github.io/reproducibility-confidential-fsrdc