Concerns about confidentiality in statistical products have increased in the past several years:
But what if users don't trust the data?
Alternative: validate analyses run on synthetic data against the confidential data
These pilot projects were not set up to scale, and yet they demonstrated that there is a need for such a process.
From conversations/informal surveys:
SDS validation typically required substantial human debugging
Failure to maintain strong links
In a sample of over 8,000 replication packages associated with high-profile economics articles, only 30% had some sort of master script.
Statistical agencies and research institutes have explored various ways to scale up access to confidential data without granting full (remote) access.
Most such processes have limitations, including in their utility for general purpose analysis
Most still have some strong access limitations
Many systems strongly limit the type of analysis that is feasible
The comparison researchers and analysts make is (rightly or wrongly) to the unfettered use of public-use data that they trust …
Remote-access or local secure access in the form of physical or virtual secure data enclaves is still the dominant - but expensive - way to access confidential data.
The dominant method of access thus forces researchers to choose between lower quality data in an environment that corresponds to their preferred computing method (public-use data), and higher quality confidential data in environments that are expensive for researchers, data providers, or both.
Containers are lightweight, standalone, executable packages that contain everything needed to run an application, including the code, a runtime, libraries, environment variables, and config files.1
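As a minimal illustration of "everything needed to run an application" (the base image, package list, and paths here are hypothetical, not the actual recipe used in any agency's system), a Dockerfile for an R analysis might look like:

```dockerfile
# Hypothetical example: an R analysis container built on a rocker image.
FROM rocker/r-ver:4.3.1

# Libraries the analysis code needs, installed at build time.
RUN R -e "install.packages(c('dplyr', 'fixest'), repos = 'https://cran.r-project.org')"

# The researcher's code; note that no data is baked into the image.
COPY code/ /code/

# Environment variables and the default command to run the full analysis.
ENV DATA_DIR=/data
CMD ["Rscript", "/code/02_run_analysis.R"]
```

Anyone with a container runtime can rebuild the identical environment from this one text file.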
One of the first mentions of containers for scientific research was Boettiger (2015).
Apptainer: more popular than Docker
In a sample of over 8,000 replication packages associated with high-profile economics articles, only 11 had a Dockerfile (the key build script for containers). (That's n=11, not 11% - in fact, it's 0.13% of replication packages.)
The use of containers in this way is novel as a systematic way to provide scalable, potentially high-throughput validation, and differs in usage from previous methods, such as the Cornell Synthetic Data Server.
I believe that it is promising as a modern way of implementing validation when data are confidential.
Possibilities:
Pre-provisioned data does not need to be “analytically valid” - need only be “plausible”!
Containers are a general-purpose technology
Cost: $0 to low $
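The "plausible, not analytically valid" idea can be sketched in a few lines of Python (the schema, variable names, and value ranges below are invented for illustration): the pre-provisioned data only needs the right structure and plausible ranges so that code runs end to end, with no attempt to preserve the confidential joint distribution.

```python
import csv
import random

def make_plausible_data(path, n=1000, seed=42):
    """Write a fake person-level file with a plausible schema and value
    ranges, but no attempt to match the confidential correlations."""
    rng = random.Random(seed)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["person_id", "age", "earnings", "state"])
        for i in range(n):
            writer.writerow([
                i,
                rng.randint(18, 80),                  # plausible age range
                round(rng.lognormvariate(10, 1), 2),  # plausible skewed earnings
                rng.choice(["NY", "CA", "TX"]),       # plausible categories
            ])
```

The researcher debugs their code against a file like this; the validation run later swaps in the confidential data.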
(e.g., rocker for R containers, datascience containers.) While not strictly necessary, containers might contain the pre-provisioned data.
Prepared containers and recipes can be posted on public registries:
But if validation and verification are a key part of it, then data quality can be lower (plausible, not analytically valid)
The user submits the container recipe (Dockerfile) and code for validation to StatAgency.gov:

./Dockerfile
./code/01_prepare_data.R
./code/02_run_analysis.R
./code/03_create_figures.R
No binary code is transmitted
Any external data may need to be vetted.
The (automated) system receives and processes:

./Dockerfile
./code/01_prepare_data.R
./code/02_run_analysis.R
./code/03_create_figures.R
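One way the automated intake step might enforce "no binary code is transmitted" is a simple file-level check. This is a sketch under assumed rules (the allowed file types and the null-byte sniff are illustrative, not any agency's actual policy):

```python
from pathlib import Path

# Assumed policy: a Dockerfile plus plain-text analysis scripts only.
ALLOWED_SUFFIXES = {".R", ".py", ".do", ".sh", ".txt", ".md"}

def check_submission(root):
    """Return (ok, problems) for a submitted directory: require a
    Dockerfile, allow only plain-text code files, flag everything else."""
    root = Path(root)
    problems = []
    if not (root / "Dockerfile").is_file():
        problems.append("missing ./Dockerfile")
    for f in root.rglob("*"):
        if f.is_file() and f.name != "Dockerfile":
            if f.suffix not in ALLOWED_SUFFIXES:
                problems.append(f"disallowed file type: {f.relative_to(root)}")
            elif b"\x00" in f.read_bytes():  # crude binary sniff
                problems.append(f"binary content in: {f.relative_to(root)}")
    return (not problems, problems)
```

A submission that fails this check can be bounced back automatically, before any human looks at it.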
Just to check that the user actually did test…
If rejected, the automated system returns the submission to the user without further processing.
If accepted, the system proceeds to the validation step.
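The validation step itself reduces to: run the same submitted code twice, once on the plausible data and once on the confidential data, and release only vetted results. A minimal Python sketch (the analysis function and the vetting rule here are invented stand-ins):

```python
import statistics

def validate(run_analysis, plausible_data, confidential_data, vet_output):
    """Run the submitted analysis on both datasets; release only the
    confidential-data result that passes disclosure review (vet_output)."""
    # Debugging run: confirms the code executes end to end.
    _ = run_analysis(plausible_data)
    # Validation run: the numbers the researcher actually wants.
    result = run_analysis(confidential_data)
    return vet_output(result)

# Illustrative use: the "analysis" is a mean; "vetting" rounds the output
# (a toy disclosure-avoidance rule, assumed for this sketch).
result = validate(
    run_analysis=statistics.mean,
    plausible_data=[1.0, 2.0, 3.0],        # plausible, not analytically valid
    confidential_data=[10.1, 10.2, 10.3],  # stand-in for the real data
    vet_output=lambda x: round(x, 1),
)
```

In a real system, `run_analysis` would be the user's container executed against a mounted data directory, and `vet_output` the agency's disclosure-avoidance procedure.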
Dockerfile recipe. While useful in the public space, when running internally and for pre-vetting, …
Scalability of a system hinges critically on streamlined output vetting.
However, the challenge of creating automated and reliable disclosure avoidance procedures is not unique to the validation process described here.
In general, it is a bad idea to blindly run untrusted containers. However, this is a solved problem in the industry, facilitated by the (expected) sparsity of the build process.
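Pre-vetting the build script before running it is feasible precisely because Dockerfiles are sparse: a handful of instructions, easily whitelisted. A sketch of such a linter (the allowed instructions and trusted base images are illustrative assumptions):

```python
# Assumed policy lists; a real system would maintain these centrally.
ALLOWED_INSTRUCTIONS = {"FROM", "RUN", "COPY", "ENV", "CMD", "WORKDIR"}
TRUSTED_BASES = ("rocker/", "python:")

def lint_dockerfile(text):
    """Return a list of policy violations found in a Dockerfile."""
    violations = []
    logical, buf = [], ""
    # Merge "\"-continued physical lines into logical instructions.
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.endswith("\\"):
            buf += line[:-1] + " "
            continue
        logical.append(buf + line)
        buf = ""
    for inst in logical:
        word = inst.split()[0].upper()
        if word not in ALLOWED_INSTRUCTIONS:
            violations.append(f"disallowed instruction: {word}")
        elif word == "FROM" and not inst.split()[1].startswith(TRUSTED_BASES):
            violations.append(f"untrusted base image: {inst.split()[1]}")
    return violations
```

A clean lint result does not make a container safe by itself, but combined with sandboxed execution it keeps the attack surface small.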
As a reminder, most social scientists are not familiar with containers.
Consider the new disclosure avoidance method for ACS 2035
https://www.datacamp.com/tutorial/docker-for-data-science-introduction↩︎
An earlier version of this presentation mentioned Gigantum. As is not unusual in this space, Gigantum no longer functions as a company.↩︎
Image credit Christopher Scholz, under CC-By-SA 2.0↩︎