Writing a Clear README file

Lars Vilhuber

Marie Connolly

2025-03-31

Goals

Why a README?

Transparency
Completeness
Typically not enough information just in article or code

Topics here

Elements of a README

Overview

The role of the template README

Entry point to understand contents of package
Especially crucial when data are confidential and thus not provided with the replication package
There are requirements and good practices, but your ReadMe will reflect your package—it’s not a “one size fits all” document

Best practices

The role of the template README
Best practices

Start writing your ReadMe from DAY 1
- It will take a minute to add some notes today, but probably hours to retrieve all the information years from now
- Pay special attention to your rights: what are the terms of use of your data, what is the license? Save that information

Template README

V1.1 Published in November 2022

Archive: > Lars Vilhuber, Connolly, M., Koren, M., Llull, J., & Morrow, P. (2022). A template README for social science replication packages (v1.1). Social Science Data Editors. https://doi.org/10.5281/zenodo.7293838

Browsable: https://social-science-data-editors.github.io/template_README/

Using the Template README

You can download the Word, LaTeX, or Markdown version of the README with lots of examples.

Online Generator

You can also try Miklós Koren’s online generator https://www.templatereadme.org/.

An Example

Lavecchia, Adam, 2023, “Replication Data and Code for: Family-Level Responses to the Introduction of Tax-Free Savings Accounts”, https://doi.org/10.5683/SP3/M6HLUF, Borealis, V1.

Overview

INSTRUCTIONS: The typical README in social science journals serves the purpose of guiding a reader through the available material and a route to replicating the results in the research paper. Start by providing a brief overview of the available material and a brief guide as to how to proceed from beginning to end

Overview

guides a reader through the available material and a route to replicating the results in the research paper, including

the description of the origins of data and/or description of programs.
provides a brief overview of the available material and
provides a brief guide as to how to proceed from beginning to end
then dives into the specifics.

Overview (Example)

Data and Code Availability Statement (DCAS)

Data and Code Availability Statement (DCAS)

It contains information about the sources of data used in the replication package, in addition to or instead of such detailed description in the manuscript.

If information is provided as part of an appendix in the paper with the same level of detail described here, then the README can refer to that appendix.
In most economics journals, the README is part of a package together with the code, so a separate statement of where the code is located is not necessary.

Data Availability Statement

The information should describe ALL data used, regardless of whether they are provided as part of the replication archive or not, and regardless of size or scope!

Should describe the conditions under which (a) the current authors (b) any future users might access the data (which may differ)

Data Availability Statement

These may include

required registrations,
memberships,
application procedures,
monetary cost, or
other qualifications.

Data Availability Statement (when primary)

When the data were generated (by the authors) in the course of conducting (lab or field) experiments, or were collected as part of surveys run by the authors, then the description of the provenance should describe the data generating process, i.e., survey or experimental procedures

Caution about inclusion or exclusion of data

Authors sometimes believe that because the data were “publicly available” they could include them in the replication package. This is not always true

If providing the data as part of the replication package, authors should be clear about whether they have the rights to distribute the data

Rights to use data, and rights to distribute data

Statement about rights

Rights to use data, and rights to distribute data

Example: if data are on a restricted server, and you have access:

You have the rights to use the data
You do NOT have the rights to distribute the data

Rights to use data, and rights to distribute data

Less obvious: You were able to download the data from a website that did not require a login.¹

You have the rights to use the data
You may have the rights to distribute the data, depending on terms of use or license, and whether you had to agree to those terms of use.

Data Availability Statement (Example: restricted)

Data Availability Statement (Example: public)

Listing of data sets

Data sources translate into datasets. Ideally, the README lists them:

list of datasets

Listing of data sets: Example

Computational Requirements

For simple replication packages, may appear to be trivial (a laptop and some common software).²

What if requirement is expensive commercial software and a super computer cluster?

Computational Requirements

In order to assess the complexity of the task of replicating, authors should specify each of the following elements:

Software used, including version number as used. If the code is expected to run with a lower version number, that should be added.
Any additional packages, including their version number or similar, as used.
The computer hardware specification as used by the author, in terms of OS, CPU generation and quantity, memory and necessary disk space. If multiple computers were used, the specification for each should be identified.
The wall-clock time given the provided computer hardware, expressed in appropriate units (minutes, days).

Software

INSTRUCTIONS: List all the software requirements, up to and including any operating system requirements, for the entire set of code

List each package, and the version you used, including dependencies
Suggested: distribute most dependencies together with the replication package if allowed, in particular if sourced from unversioned code repositories, Github repos, and personal webpages.
Suggested: if possible, pre-install link
Suggested: use environments (Python, R, Julia, also Stata)

Software: Example

PseudoRNG seed

INSTRUCTIONS: Some estimation code uses random numbers, almost always provided by pseudorandom number generators (PRNGs). For reproducibility purposes, these should be provided with a deterministic seed, so that the sequence of numbers provided is the same for the original author and any replicators

What is a computer?

INSTRUCTIONS: Give information on the machine you used, especially if specialized hardware is needed (e.g. high-power servers)

What is a computer? Example

The code

This should provide some details, but ideally:

explain summarily what the code does
do so for ALL code, even when data are not included in replication package

Important: Remove any redundant code.

The code: Example

Instructions

Provide human-readable instructions to conduct the replication
May be simple or may involve many complicated steps.
List format, no excess prose (or poems, except Haiku). Strict linear sequence.
If more than 4-5 manual steps, please wrap a main program/Makefile around them, in logical sequences

Instructions: Example

List of tables and programs

INSTRUCTIONS: Your programs should clearly identify the tables and figures as they appear in the manuscript, by number

If using named programs (table1.do) this may not be necessary, but should be stated.
If mnemonics are used (table_ols_rds_55.do), a mapping is necessary.
If not produced by code (e.g. sourced elsewhere), a mention is useful

List of tables and programs: Example

References

INSTRUCTIONS: As in any scientific manuscript, you should have proper references. Cite your data, the packages that you use, everything you refer to in the ReadMe, and generally everything you rely upon

References should also be included in manuscript (subject to journal style considerations)
Guidance for citations: https://social-science-data-editors.github.io/guidance/addtl-data-citation-guidance.html

References: Admin data

Some admin data are very well referenced, others less so…
Ideally, each data set comes with its own DOI and clear reference
If not, try to provide as much information as you can, especially what needs to be requested for someone to reproduce what you did

References: Admin data - Example

Good example 1, German data: https://fdz.iab.de/en/our-data-products/integrated-establishment-and-individual-data/liab/

References: Admin data - Example

Good example 2, French data: https://www.casd.eu/en/source/all-employees-databases-establishment-data/

References: Example

Summary

The ReadMe is a central part of a data and code replication package
It’s the instructions manual to your package
Whether or not you can provide the data is almost irrelevant
You still need to explain their availability and provenance
The earlier you start collecting the information necessary for the ReadMe, the easier the task becomes
There are resources available to help you

Resources

README

Lars Vilhuber, Connolly, M., Koren, M., Llull, J., & Morrow, P. (2022). A template README for social science replication packages (v1.1). Social Science Data Editors. https://doi.org/10.5281/zenodo.7293838

You can download the Word, LaTeX, or Markdown version of the README with lots of examples.

Other guidance

Presentation on “Self Checking Reproducibility” and its associated website
Guidance when (some) data are confidential: https://labordynamicsinstitute.github.io/reproducibility-confidential/
Guidance for citations: https://social-science-data-editors.github.io/guidance/addtl-data-citation-guidance.html

Thank you

Footnotes

World Values Survey data are available for download without a login, but you have to agree to terms of use that prohibit redistribution.
https://social-science-data-editors.github.io/template_README/#computational-requirements
Image by JoePhin, under Creative Commons Attribution-Share Alike 4.0 International license. https://commons.wikimedia.org/wiki/File:Numbers.gif