Reproducibility with Impact

Lars Vilhuber

2026-01-23

Follow along

larsvilhuber.github.io/presentation-2026-01/ (PDF)

Reproducibility with Impact

How Open Science Can Change the World

Lars Vilhuber
Cornell University

What is this?

If I tell you that…

If I give you a name…

...

SP500.csv

...

If I show you its contents…

# read in the data
sp500 <- read.csv(here::here("presentation","SP500.csv"))
head(sp500)
  observation_date   SP500
1       2021-01-20 3851.85
2       2021-01-21 3853.07
3       2021-01-22 3841.47
4       2021-01-25 3855.36
5       2021-01-26 3849.62
6       2021-01-27 3750.77

If I tell you where I got it from…

But I cannot give it to you!

Copyright © 2016, S&P Dow Jones Indices LLC. All rights reserved. Reproduction of S&P 500 in any form is prohibited except with the prior written permission of S&P Dow Jones Indices LLC (“S&P”).

So how can you verify my results?

  • By obtaining the file again
  • By running my code again
  • By verifying that the results are the same!
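
Verification can be as simple as re-running the computation and comparing numbers. A minimal sketch in R, with illustrative data and a hypothetical "published" value (neither comes from an actual paper):

```r
# Minimal sketch: re-run an analysis and check it matches the claimed result.
# The two rows of data and the "published" value are illustrative only.
sp500 <- data.frame(
  observation_date = c("2021-01-20", "2021-01-21"),
  SP500            = c(3851.85, 3853.07)
)

my_mean        <- mean(sp500$SP500)  # re-run the computation
published_mean <- 3852.46            # value claimed in the hypothetical write-up

# The result "reproduces" if the recomputed value matches the claim
isTRUE(all.equal(my_mean, published_mean, tolerance = 1e-6))
```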

Trust but verify!

  • Reproducibility is key!
  • But there are weaknesses…

The issue of provenance

SP500

  • Created by S&P Dow Jones Indices LLC
  • Widely used as a market benchmark
  • Commercial product, cannot be redistributed
  • S&P is considered reliable
  • My obtaining the file is the weak link

What if I am the collector of data?

Who is this person?


Dirk Smeesters

Full article

Retraction

“a scientific integrity committee found that the results in two of Smeesters’ papers were statistically highly unlikely. Smeesters could not produce the raw data behind the findings, and told the committee that he cherry-picked the data to produce a statistically significant result. Those two papers are being retracted, and the university accepted Smeesters’ resignation on June 21.”

Who is this person? (2)


Aidan Toner-Rodgers


Maybe you’ve heard about him

Now

“MIT now declares “no confidence in the provenance, reliability or validity of the data and…in the veracity of the research”. Mr Toner-Rodgers’s paper has been withdrawn from the pre-print repository on which it first appeared [arXiv]; … The lab at the heart of his findings remains unknown.”

How can we know that a data source is reliably obtained?

Consider the case of Gino

Francesca Gino

The case of Gino

  • Francesca Gino was a tenured professor at Harvard Business School, writing on honesty (!)

The case of Gino

  • Several articles were investigated by third parties (Data Colada, in particular 1), and found to be problematic

Data manipulated

The case of Gino

  • At least one of them had manipulated data AFTER it had been collected, BEFORE it had been analyzed.

Data manipulation

Results of manipulation

What can YOU do?

What is this?

Training

  • Biology students learn key lab techniques
    • Pipetting
    • Capture-recapture of wild animals


That’s my daughter’s hands in there

YOU!

(BTW…)

When I prompted Gemini to correct the spelling in “Reproducibility”, it only made it worse!

Back to the topic

Generic survey processing

Requiring transparency in academia

Verifying transparency in academia

Verification by journals

  • Provision (publication of materials) provides transparency
  • Verification (running the analysis again - computational reproducibility) compensates for mistrust or the absence of trust

Which journals

Verification by others

cascad

I4R

Verification by institutions

World Bank RRR

Verification by you

  • Toner-Rodgers was caught because others scrutinized his work
    • Investigated the statistics (not too fishy)
    • Investigated the sources (huh?)
    • Investigated the coherence of the paper (much harder)

Taking it a step further

Survey flow

Taking it a step further

  • Has been discussed by authors behind Data Colada
  • Survey tool provider (Qualtrics, etc.) exports data, posts checksum
  • Survey tool provider exports data directly to a trusted institutional repository; researchers obtain the data from there (with privacy protections)
  • Researcher can verify checksum
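
The researcher-side check could then be a one-liner in R. A sketch under the assumptions above (the exported file here is a stand-in created on the fly; no survey provider currently posts such a checksum):

```r
# Sketch of the proposed workflow: the provider computes and posts an MD5
# checksum at export time; the researcher recomputes it and compares.
exported <- tempfile(fileext = ".csv")            # stand-in for the exported survey file
writeLines(c("id,response", "1,agree", "2,disagree"), exported)

posted_md5     <- unname(tools::md5sum(exported)) # what the provider would post
recomputed_md5 <- unname(tools::md5sum(exported)) # what the researcher computes

# Any modification of the file after export breaks this equality
identical(recomputed_md5, posted_md5)
```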

Does not work yet

What if you could verify the file?

# compute checksum of the file
tools::md5sum(here::here("presentation","SP500.csv"))
/home/runner/work/presentation-2026-01/presentation-2026-01/presentation/SP500.csv 
                                                "d9aed4cf23b0f0e3f0c9e254ccf00208" 

Certified survey processing

Not yet available!

Does not prevent all fraud

How to document the full process?

Survey flow

How to document the full process?

  • Identify all sources - very precisely!
  • Document all processing steps - using tools you are learning here!
  • Transparent about your process - show your code!
  • Expect critique (if not criticism) - embrace it!
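
One concrete way to handle the first two steps is a machine-readable provenance record for every input file. A minimal sketch (the field names are my own choice, not a standard; the MD5 is the SP500.csv checksum shown earlier, and the record is written to a temporary location here):

```r
# Minimal provenance record for one data input. Field names are an
# illustrative choice, not an established standard.
provenance <- data.frame(
  file     = "SP500.csv",
  source   = "S&P Dow Jones Indices LLC (licensed download)",
  accessed = "2026-01-23",
  md5      = "d9aed4cf23b0f0e3f0c9e254ccf00208",
  stringsAsFactors = FALSE
)

# Keep the record next to the code so others can re-check every input
out <- file.path(tempdir(), "data_provenance.csv")
write.csv(provenance, out, row.names = FALSE)
```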

Churchill said…

Do not trust any statistics you did not fake yourself.

Or did he…

“No, of course, the British Prime Minister … never claimed such nonsense. But putting his name in front of a quote gives it a more solemn, more imposing, more definitive appearance.” [Source]

A sketch: Transparency Certified

https://transparency-certified.github.io/

Transparency Certified

Work in progress

  • Working with cascad, several INEXDA members, and others
  • Relying on external certification of data inputs (data catalogs with metadata, checksums)

Who is this person? (3)


Alp Simsek, one of …

Transparency is the norm

nearly 5,000 authors who have had their work verified by the AEA Data Editor and team.

You can do it, too!


Bonus: You can help, too!

  • Apply reproducible tools not just to research: policy analysis, program evaluation, business analytics, etc.
  • Push data providers to provide better access tools (API access, checksums, DOIs)
  • Be able to explain why this benefits them!

The end!

Source

Footnotes

  1. https://datacolada.org/109, https://datacolada.org/110, https://datacolada.org/111, https://datacolada.org/112, https://datacolada.org/114, https://datacolada.org/118

  2. Jones, M. (2024). Introducing Reproducible Research Standards at the World Bank. Harvard Data Science Review, 6(4). https://doi.org/10.1162/99608f92.21328ce3