2025-10-09

larsvilhuber.github.io/transparency-statistical-agencies/ (HTML zipped, PDF)
It is the policy of the American Economic Association to publish papers only if the data used in the analysis are clearly and precisely documented and access to the data and code is clearly and precisely documented and is non-exclusive to the authors.
Authors … must provide, prior to acceptance, the data, programs, and other details of the computations sufficient to permit replication
AER 2011 thanks to Stefano Dellavigna
None
… there is a grey zone:
“‘Reproducibility’ refers to the ability of a researcher to duplicate the results of a prior study using the same materials and procedures as were used by the original investigator.” 1
through reproducibility
Student replicators
Over the past 6 years, more than 170 undergraduate students have been involved in verifying these articles.


https://social-science-data-editors.github.io/




Fundamental Principles of Official Statistics, Principle 3:
Accountability and Transparency: To facilitate a correct interpretation of the data, the statistical agencies are to present information according to scientific standards on the sources, methods and procedures of the statistics. 5
Principles and Practices for a Federal Statistical Agency, Principle 2:
Credibility among Data Users: A federal statistical agency must have credibility with those who use its data and information. 6
“… flow of objective, credible statistics to support the decisions of individuals, households, governments, businesses, and other organizations.”
Statistical Policy Directive No. 1, 4:
“Any loss of trust in the integrity of the Federal statistical system and its products could lessen respondent cooperation with Federal statistical surveys, decrease the quality of statistical system products, and foster uncertainty about the validity of measures our Nation uses to monitor and assess its performance and progress.”



Joint Statement on Commitment to Scientific Integrity and Transparency



Agencies do provide detailed information on sources

But: Availability of “computing instructions”?

But: Availability of reliable, trusted data archives

Which are starting to be infused into the federal system
FAIR:


Website
High-level description of sources
Sources are cited!
High-level description of methods, but no (obvious) code
Some methods (R code) are cited
Own citation does not include a URL



Not even close.
URL is
https://ers.usda.gov/webdocs/DataFiles/107356/RuggednessScale2010tracts.xlsx?v=6316.8
Francesca Gino











Toronto researcher loses Ph.D.

MIT student makes up firm data



The emerging consensus is fully in line with the decades-strong principles of statistical agencies:
if name="Lars" then confid=2
Don’t do that.
Store secrets in environment variables or files that are not published.
Github secret scanning
Typed interactively (here for Linux and Mac)
(this is not recommended)
The same syntax is used for the contents of “dot-env” or “Renviron” files, and in fact for bash or zsh startup files (.bash_profile, .zshrc)
Edit .Renviron (note the dot!) files:
Use the variables defined in .Renviron:
Loading regular environment variables:
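A minimal Python sketch of reading a secret from the environment. The variable name `MY_SECRET` and the helper `get_secret` are hypothetical, chosen only for illustration; the dummy default is there so the sketch runs on its own, whereas in practice the value would be set outside the code.

```python
import os

# Hypothetical variable name. In real use the value is set outside the code
# (shell, startup file, or scheduler); the dummy default only makes this sketch runnable.
os.environ.setdefault("MY_SECRET", "dummy-value-for-illustration")


def get_secret(name):
    """Return the value of an environment variable, or fail loudly if it is missing."""
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


secret = get_secret("MY_SECRET")
# Report only that the secret was found, never its value.
print("Secret loaded, length:", len(secret))
```

Failing loudly when the variable is missing tells a replicator what to supply without revealing the value itself.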
Loading with dotenv
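The third-party `python-dotenv` package provides `load_dotenv()` for this purpose. The sketch below uses only the standard library to illustrate what such a loader does, under simplifying assumptions (one `KEY=value` per line, simple quote stripping, no variable expansion). The function `load_env_file` and the `DEMO_*` names are hypothetical.

```python
import os
import tempfile


def load_env_file(path, override=False):
    """Read KEY=value lines from a file into os.environ; return the parsed pairs."""
    loaded = {}
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            # Skip blank lines, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            key = key.strip()
            value = value.strip().strip('"').strip("'")
            loaded[key] = value
            # Existing environment values win unless override is requested.
            if override or key not in os.environ:
                os.environ[key] = value
    return loaded


# Demonstration with a throwaway file standing in for an unpublished .env file.
with tempfile.TemporaryDirectory() as tmp:
    env_path = os.path.join(tmp, ".env")
    with open(env_path, "w", encoding="utf-8") as handle:
        handle.write("# hypothetical secrets\nDEMO_API_KEY=abc123\nDEMO_USER=\"lars\"\n")
    values = load_env_file(env_path, override=True)

# Print only the names that were loaded, not the values.
print(sorted(values))
```

The real `.env` file stays out of the published replication package, for example by listing it in `.gitignore`.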
Yes, this also works in Stata
and via (what else) a user-written package for loading from files:
//============ non-confidential parameters =========
include "config.do"
//============ confidential parameters =============
capture confirm file "$code/confidential/confparms.do"
if _rc == 0 {
    // file exists
    include "$code/confidential/confparms.do"
} else {
    di in red "No confidential parameters found"
}
//============ end confidential parameters =========
replace anoncounty=1 if county=="Tompkins, NY"
A really bad idea, but yes, you probably want to hide that.
So whether reasonable or not, this is an issue. How do you do that, without messing up the code, or spending hours redacting your code?
Assume that q2f and q3e are considered confidential by some rule, and that the minimum cell size 10 is also confidential.
Only one line does not contain “confidential” information.
A bad approach, because it creates more work for you and for future replicators, is to manually redact the confidential information with text that is not legitimate code:
The redacted program above will no longer run, and will be very tedious to un-redact if a subsequent replicator obtains legitimate access to the confidential data.
Simply replacing the confidential values with valid placeholders in the programming language of your choice is already better. Here’s the confidential version of the file:
//============ confidential parameters =============
global confseed 12345
global confpath "/data/economic/cmf2012"
global confprofit q2f
global confemploy q3e
global confmincell 10
//============ end confidential parameters =========
set seed $confseed
// load both confidential variables, since both are used below
use $confprofit $confemploy county using "${confpath}/extract.dta", clear
gen logprofit = log($confprofit)
// one row per county: cell size n and mean log profit
collapse (count) n=$confemploy (mean) logprofit, by(county)
drop if n<$confmincell
graph twoway scatter n logprofit // plot type assumed; twoway requires one
and this could be the released file, part of the replication package:
//============ confidential parameters =============
global confseed XXXX // a number
global confpath "XXXX" // a path that will be communicated to you
global confprofit XXX // Variable name for profit T26
global confemploy XXX // Variable name for employment T26
global confmincell XXX // a number
//============ end confidential parameters =========
set seed $confseed
// load both confidential variables, since both are used below
use $confprofit $confemploy county using "${confpath}/extract.dta", clear
gen logprofit = log($confprofit)
// one row per county: cell size n and mean log profit
collapse (count) n=$confemploy (mean) logprofit, by(county)
drop if n<$confmincell
graph twoway scatter n logprofit // plot type assumed; twoway requires one
While the code won’t run as-is, it is easy to un-redact, regardless of how many times you reference the confidential values, e.g., q2f, anywhere in the code.
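Because every confidential value sits on its own `global` line, producing the released version from the confidential one can be automated. A minimal Python sketch, assuming the one-global-per-line layout shown above; `redact_globals` is a hypothetical helper and not part of the original materials.

```python
import re

# Matches Stata lines of the form: global <name> <value>
GLOBAL_LINE = re.compile(r'^(\s*global\s+\w+\s+)(.*)$')


def redact_globals(text, placeholder="XXXX"):
    """Replace the value on every `global` line with a placeholder, keeping the name."""
    redacted = []
    for line in text.splitlines():
        match = GLOBAL_LINE.match(line)
        if match:
            value = match.group(2).strip()
            # Keep surrounding double quotes so a quoted path stays a quoted string.
            quoted = len(value) >= 2 and value.startswith('"') and value.endswith('"')
            line = match.group(1) + (f'"{placeholder}"' if quoted else placeholder)
        redacted.append(line)
    return "\n".join(redacted)


confidential = '''global confseed 12345
global confpath "/data/economic/cmf2012"
set seed $confseed'''

print(redact_globals(confidential))
```

Lines that only reference the globals, such as `set seed $confseed`, pass through unchanged, which is what makes this layout easy to redact and to un-redact.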
Main file main.do:
//============ confidential parameters =============
capture confirm file "$code/confidential/confparms.do"
if _rc == 0 {
    // file exists
    include "$code/confidential/confparms.do"
} else {
    di in red "No confidential parameters found"
}
//============ end confidential parameters =========
//============ non-confidential parameters =========
global safepath "$rootdir/releasable"
cap mkdir "$safepath"
//============ end parameters ======================
Main file main.do (continued)
// :::: Process only if confidential data is present
capture confirm file "${confpath}/extract.dta"
if _rc == 0 {
    set seed $confseed
    // load both confidential variables, since both are used below
    use $confprofit $confemploy county using "${confpath}/extract.dta", clear
    gen logprofit = log($confprofit)
    collapse (count) n=$confemploy (mean) logprofit, by(county)
    drop if n<$confmincell
    save "${safepath}/figure1.dta", replace
} else {
    di in red "Skipping processing of confidential data"
}
//============ at this point, the data is releasable ======
// :::: Process always
use "${safepath}/figure1.dta", clear
graph twoway scatter n logprofit // plot type assumed; twoway requires one
graph export "${safepath}/figure1.pdf", replace
Auxiliary file $code/confidential/confparms.do (not released)
Auxiliary file $code/include/confparms_template.do (this is released)
//============ confidential parameters =============
// Copy this file to $code/confidential/confparms.do and edit
global confseed XXXX // a number
global confpath "XXXX" // a path that will be communicated to you
global confprofit XXX // Variable name for profit T26
global confemploy XXX // Variable name for employment T26
global confmincell XXX // a number
//============ end confidential parameters =========
Thus, the replication package would have:
A license (licence) is an official permission or permit to do, use, or own something (as well as the document of that permission or permit). 11 12
Stata
Stata
R
(… estout, graph export, regsave)
(… stata -b do file.do not fine-grained enough)
Run it all again, top to bottom!
Now you wait for the replicators to show up!
1. Bollen et al. 2015. “Social, Behavioral, and Economic Sciences Perspectives on Robust and Reliable Science.” National Science Foundation. https://www.nsf.gov/sbe/AC_Materials/SBE_Robust_and_Reliable_Research_Report.pdf.
2. Ambrus, Attila, Erica Field, and Robert Gonzalez. 2020. “Loss in the Time of Cholera: Long-Run Impact of a Disease Epidemic on the Urban Landscape.” American Economic Review, 110 (2): 475–525. https://doi.org/10.1257/aer.20190759
3. Ambrus, Attila, Field, Erica, and Gonzalez, Robert. Data and Code for: Loss in the Time of Cholera: Long-run Impact of a Disease Epidemic on the Urban Landscape. Nashville, TN: American Economic Association [publisher], 2020. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2020-01-31. https://doi.org/10.3886/E111523V2
4. Weeden, K. A. (2023). Crisis? What Crisis? Sociology’s Slow Progress Toward Scientific Transparency. Harvard Data Science Review, 5(4). https://doi.org/10.1162/99608f92.151c41e3
5. United Nations Fundamental Principles of Official Statistics
6. National Academies of Sciences, Engineering, and Medicine. 2017. Principles and Practices for a Federal Statistical Agency: Sixth Edition. Washington, DC: The National Academies Press. https://doi.org/10.17226/24810.
7. Data Citation Synthesis Group: Joint Declaration of Data Citation Principles. Martone M. (ed.) San Diego CA: FORCE11; 2014 https://www.force11.org/group/joint-declaration-data-citation-principles-final
8. https://ers.usda.gov/data-products/area-and-road-ruggedness-scales/
9. https://datacolada.org/109, https://datacolada.org/110, https://datacolada.org/111, https://datacolada.org/112, https://datacolada.org/114, https://datacolada.org/118
10. Jones, M. (2024). Introducing Reproducible Research Standards at the World Bank. Harvard Data Science Review, 6(4). https://doi.org/10.1162/99608f92.21328ce3
