Tag Archives: reproducible research

Philip Stark on “Preproducibility”

A lot of sensible advice here: https://www.youtube.com/watch?v=wHryMtEBkB4

A combination of Dr. Stark’s name and the recent time I’ve spend absorbing the Marvel Comic Universe through Luke Cage [link] led to a culture insight, though. Some of my favorite superheros’ origin would have me believe that irreproducible research is shortest path to greatness. Maybe not for the (usually evil) scientist, but still.

Comments Off on Philip Stark on “Preproducibility”

Filed under science policy

My “Training Modules” on reproducible computational research for secondary data analysis

Here is something I want to make one day:

A7) Overall goal and specific core competencies for training module
The overall goal of this module is to improve the quality, reproducibility, and usability of results from secondary analysis through the development, implementation, and refinement of training modules on best practices and reproducibility in secondary data analysis. This will be achieved through two specific aims:
• Aim 1: Leverage existing experience and resources available at the UW and beyond to create a framework for online training that is of broad utility for development of training modules. We will create a broadly applicable framework that includes a workshop to establish best practices for secondary analysis; conduct preliminary vetting and evaluation of module content; creation, implementation and evaluation of online modules; and ongoing programmatic evaluation and refinement. While we will use this framework on data reproducibility in secondary analysis, we will seek to develop a format that is amenable to developing modules on a range of future topics of interest to biomedical researchers and population health researchers.

• Aim 2: Enrich the educational and research training of early stage scientists within the University and beyond. A module on data reproducibility will make research involving secondary analysis easier and of higher quality when investigators have a more complete understanding of secondary data approaches and their limitations, and have access to standardized methods for data seeking, tracking, cleaning, and reporting. We seek to produce a training module whose format is engaging, self-directed, and is broadly applicable and understandable by a range of scientific experts and levels.

Specific to education and research training, we seek to grow participant capacity in three major areas: search and data management; code and analysis; dissemination of necessary methodological factors. These three areas are broken into nine potential core competencies that are critical to high quality secondary analysis. These competencies are based off the requirements articulated by the eScience Reproducible Research Working Group for a range of types of data reproducibility. Together with this Working Group, we have also identified a number of useful tools and skills for meeting these requirements, which can be translated into learning objectives and modular training videos during the course of this project. Potential additional competencies for future inclusion include identifying and dealing with missing data and weighting data sources. The nine current proposed competencies are:

1. Designing and documenting search strategy. Essential to conducting secondary data analysis is the choice of how to identify data and which data to use (or not use) for secondary analysis. Equally important, from the perspective of reproducibility, is documenting these choices and the reasoning behind them, including the search terms, the inclusion and exclusion criteria, and the decisions made based on these criteria. This can be done through systematic review, where it is essential to document the approach correctly to make it empirically reproducible. The PubMed database is the mainstay of biomedical and public health research, and the module will include a detailed example of documenting a PubMed search strategy. Because there are a range of other relevant databases, the module will include a comparison of some alternatives to PubMed, such as Elsevier’s Scopus and the Thompson Reuters Web of Science, with the goal of indicating how things might differ between databases. In addition, the module will include an example using survey and other administrative data (e.g. US Census, Macro DHS), with a comparison of how to document search of health databases such as the Global Health Data Exchange (GHDx), the Integrated Public Use Microdata Series (IPUMS), or the Harvard DataVerse Network. The GHDx, maintained and run by IHME, is the world’s largest catalogue of health-related data around the world and contained records for 41,914 sources as of Nov. 10, 2014. IPUMS, housed at the Minnesota Population Center, is the world’s largest individual-level population database, and includes 159 samples from 55 countries. The Harvard DataVerse Network is a container for data from all scientific disciplines developed by the Harvard Institute for Quantitative Social Sciences, and includes data from 55,106 studies. Although it is impossible to anticipate all of the possible future options for data search strategies, the goal will be to prepare students for a range of situations that they might face.

2. Capturing essential information on data provenance as “meta-data”. In addition to a clear record of the search strategy and inclusion/exclusion decision for each secondary data source, it is a great aid in reproducibility to record key information on the secondary data sources themselves. Including key fields, such as a digital object identifier (DOI), in a database of sources can be a huge time saver, although not all secondary data sources have a DOI. Recording information about how to gain access to data sources that are not publicly available is also of great value to making secondary data analysis reproducible. There are currently existing, freely available tools that can aid in this process, such as Zotero28. The process of creating and refining the GHDx database at IHME has developed a wealth of operational experience on this topic, which to date has not been formalized for wider dissemination, although has been used in internal trainings at IHME29.

3. Processes and tools for data cleaning, formatting and merging. Secondary analysis and meta-analysis includes the extraction and integration of data across a number of sources, and requires a minimum level of work to ensure information is properly cleaned and formatted prior to analysis. Modern free/libre open source software like Tabula can greatly increase the speed and accuracy of extracting information from tables and graphs as part of systematic review30,31. Shareware like Data Thief32,33 and other proprietary software (or software-as-service solutions like Captricity34) can also be useful, and will be introduced to the degree appropriate for a federally-funded training module. We will also discuss how to use automatic scripts for merging and formatting information from existing databases to enhance reproducibility and accuracy.

4. Double entry, resolving conflicts, and spot checking. Whether the data cleaning, formatting, and merging is accomplished by hand or automatically, some form of quality assurance is necessary. The time-consuming, but most established, approach to data quality assurance35 for by-hand data collection is “double entry”, where the work is done independently by two different coders, and any discrepancies are then adjudicated36. This is a time intensive method, however, and it is important to be realistic about the corners that will be cut when allocating scarce resources, such as researcher time. A quality assurance approach from statistical process control is a cost-saving alternative: by spot checking a representative sample of the secondary data, it is possible to measure the error rate in the cleaned data. Certain types of spot checking can even be automated, so that a set of scripts can automatically validate any additions or updates to the secondary database.

5. Literate programming. To make secondary data analysis reproducible, it is just as important to have a clear record of the analysis is it is of the data. This certainly includes a lucid methods section in the write-up, but is greatly enhanced by a well-documented body of computer code for performing the analysis. Literate programming was developed by star computer scientist Donald Knuth and in simplest terms is a practice of writing programs for humans first and computers second. Although it may be a challenge for large software engineering projects not led by Knuth to make use of literate programming, with modern tools, it is quite easy to incorporate this innovation into small- and moderate-scale data analysis. For example, the IPython Notebook (soon to be rebranded Project Jupyter, and extended to non-Python analysis tools, such as R) provides a mechanism to intersperse analysis code, formatted text (typset with Markdown), and mathematical equations (typeset with LaTeX). Sweave and knitr are alternative approaches that offer similar functionality for an R-centric analyst.

6. Test-driven development. The comprehensive testing of software is essential for ensuring that it performs as intended, and this should be included in any methodological innovations developed for research purposes. Test-driven development (TDD) is an approach to software testing that makes testing an integral part of writing code. It relies on the development of a suite of automatic tests for functionality that are developed simultaneously to the code for each function itself. A short summary of the TDD process is: “write a test, check that it fails, write code, and check that the test passes.” It can be simple and engaging to incorporate this process into the workflow of secondary data analysis, where this takes the form of both creating functional tests to automatically verify the format and contents of the reformatted secondary data, and creating unit tests to verify that any data analysis codes are functioning as expected on simple test cases. We have used this approach successfully in the development and annual updating of long-lasting insecticidal net (LLIN) coverage in global malaria control efforts37,38, but it has not been widely adopted to date, due to lack of training and dissemination. Testing has been a challenge in Software Carpentry, and I hope that the TDD approach will provide the missing piece in translating this practice from software engineering to science39.

7. Version control. Data analysis is an iterative process, and it is important to keep track of the changes to the data analysis as the iterations occur. Version control is an important solution for tracking changes in software development, which makes it possible to track changes and include branches and annotations as part of the methodological development process. This makes it easier to try out changes to analysis and when writing up results. It can also solve the problem of having many copies of a file with similar names, modified to show the date or initials of the researcher who edited it. The importance of version control in reproducible computational research was identified at a recent workshop on reproducibility sponsored by the Society for Industrial and Applied Mathematics40,41.

8. Code review. Even with comprehensive testing, errors happen. Code review is a way to catch these errors, and also to improve the speed and readability of analysis code. There are many approaches to code review, but all rely on having at least one other person read portions of your source code and provide feedback on it. This can improve code quality directly, to the degree that the feedback is incorporated, but it can also improve code quality through something analogous to a Hawthorne effect, where simply knowing that this code will be reviewed inspires better work on the part of the code writer. There have recently been a handful of blog posts and academic articles about incorporating code review into regular laboratory practice42–49, and the lightweight approach advocated by Phillip Guo48 seems particularly suited to routine use in secondary data analysis. The mechanics of code review, as described by Guo (who attributes the practice to Rob Miller), have four researchers work together with a single facilitator and take a 30-minute session to review one page of code per researcher.

9. Replication archive. The previous eight practices will capture all of the essential information for making a secondary data analysis reproducible, but they are of no use if they are not communicated in some way. The replication archive is a way to ensure that regardless of what can fit into the publication and what can be included as an online appendix, the precise details necessary for replication are released to complement the published findings, including those of the code development. There are several approaches available, including archiving the work on a private website, working with an academic library affiliated with the researcher’s institution, or working with the DataVerse Network. We will advocate replication archives in the spirit of “publish your code, it is good enough”50, and together with the four practices for reproducible analysis (literate programming, test-driven development, version control, and code review), it really will be good enough.

1 Comment

Filed under education

agedays column in SmartVA data

Here is another little detail for the detail-oriented mapper who is trying to get data into SmartVA-Analyze: agedays.

In the tutorial, I mentioned that there are handful of additional columns that are not in the Guide, because they are created automatically by the ODK form.

The one called agedays is a bit important, because it gets used to determine if the age of the deceased is above or below a threshold. So set it! The important things currently are if it is more or less than 300 and if it is more or less than 3.

Comments Off on agedays column in SmartVA data

Filed under software engineering

Cool example of reproducible research

—–Original Message—–
From: Ben Marwick
Sent: Monday, January 11, 2016 4:06 PM
To: escience_reproducibility
Subject: [Reproducible] New paper on reproducible research in archaeology

Hi everyone,

You might be interested to know of a peer-reviewed paper on reproducible research in archaeology that I’ve just had published.

The paper owes a big debt to the UW eScience Institute, especially the Reproducibility and Open Science Working Group. So this paper is a kind of tribute to all of you in that group who have helped me make sense of reproducibility, thanks!

Here’s the citation and a link to the PDF:

Marwick, B. (2016). Computational reproducibility in archaeological
research: Basic principles and a case study of their implementation.
Journal of Archaeological Method and Theory, 1-27. doi:

All the gory details of the case study paper are here:
https://github.com/benmarwick/1989-excavation-report-Madjebebe so you can give it a try. It works on my machine 😉

Happy reading!


Comments Off on Cool example of reproducible research

Filed under science policy

Mapping Data for SmartVA-Analyze 1.1

I have just released an updated version of the SmartVA app that predicts the underlying cause of death from the results of verbal autopsy interviews (VAIs). It was a lot of hard work and I hope that people find it useful. You can find the details here: http://www.healthdata.org/verbal-autopsy/tools

There is a major challenge in using this tool (now called SmartVA-Analyze 1.1), however, which is getting the necessary data to feed into it. If you use the ODK form to collect data in just the right format, it is easy. But electronic data collection is not always possible. And there is a fair amount of data out there that has already been collected, but not yet analyzed (which is some of the motivation for creating this tool in the first place).

This blog describes the process of mapping existing VAI data into a format that can be used as input to SmartVA-Analyze 1.1. It is a challenging process that requires careful attention to detail. I will demonstrate the basics here, and I hope to provide fuller examples in multiple scripting languages as researchers complete this exercise for themselves.

A short version of the following, with example code is available on GitHub: https://github.com/aflaxman/SmartVA-Analyze-Mapping-Example

The ODK output of electronic version of the PHMRC Shortened Questionnaire is a .csv file, such as the following: https://github.com/aflaxman/SmartVA-Analyze-Mapping-Example/blob/master/example_1.csv

But if you have data that was collected with pencil-and-paper and then laboriously digitized, you will need to map it into that format. This Guide for data entry spreadsheet is your Rosetta Stone. SmartVA-Analyze 1.1 expects the input csv file to have a column for every row in that spreadsheet, with column heading matching the entry in the “field name” column.
Mapping Process

I like to use Python with Pandas for doing this kind of work, but I recommend you use whatever scripting language you are most comfortable with. But I strongly recommend that you use a script to do this mapping. It will be much easier to debug and reproduce your work than if you do the mapping by hand! (I also recommend that you work incrementally and use a revision control system…) To learn more about the Python/Pandas approach, I recommend the book Python for Data Analysis.

Here is a block of Python code that will create a DataFrame with columns for every field named in the Guide:

import numpy as np, pandas as pd

# load codebook
fname = 'https://github.com/aflaxman/SmartVA-Analyze-Mapping-Example/raw/master/Guide%20for%20data%20entry.xlsx'
cb = pd.read_excel(fname, index_col=2)

df = pd.DataFrame(index=[0], columns=cb.index.unique())

(you can also see this in context in an Jupyter Notebook on GitHub here.)

SmartVA-Analyze 1.1 requires a handful of additional columns that are not in the Guide (they are created automatically by the ODK form): child_3_10, agedays, child_5_7e, child_5_6e, adult_2_9a. Here is a block of Python code that will add these columns to the DataFrame created above:

df['child_3_10'] = np.nan
df['agedays']    = np.nan # see notes though http://wp.me/pk40B-Mm
df['child_5_7e'] = np.nan
df['child_5_6e'] = np.nan
df['adult_2_9a'] = np.nan

If you save this DataFrame as a csv file, it will constitute a minimal example of what is necessary to make SmartVA-Analyze 1.1 run:

fname = 'example_1.csv'
df.to_csv(fname, index=False)

Here is what it looks like when SmartVA-Analyze 1.1 is running:

The results are rather minimal, and can be found in the “neonate-predictions.csv” file (because without an age or age group specified, this is the default):
Mapping a more substantial dataset, even a the following hypothetical example is an idiosyncratic and time-consuming procedure.

Example (hypothetical) dataset:

Python code to map the id, sex, and age:

# set id
df['sid'] = hypothetical_data.index

# set sex
df['gen_5_2'] = hypothetical_data['sex'].map({'M': '1', 'F': '2'})

# set age
df['gen_5_4'] = 1  # units are years
df['gen_5_4a'] = hypothetical_data['age'].astype(int)

This is the simple stuff… to map the injury data you will need to dig into the paper questionnaire to see how the responses are coded (the Guide spreadsheet includes some codings, but will refer you to the paper questionnaire when necessary):

# map injuries to appropriate codes
# suffered injury?
df['adult_5_1'] = hypothetical_data['injury'].map({'rti':'1', 'fall':'1', '':'0'})
# injury type
df['adult_5_2'] = hypothetical_data['injury'].map({'rti':'1', 'fall':'2'})

Mapping more columns proceeds analogously, but I recommend working incrementally, so at this point you should save the partially mapped data and make sure it runs through the SmartVA-Analyze app, and make sure that the results make some sense. For example, in this case the mapped hypothetical data from the first 2 rows are correctly identified as traffic and fall injury deaths, but the final 3 rows are undetermined (because non-injury signs and symptoms have not yet been mapped).

Mapping the additional columns proceeds analogously:

# map heart disease (to column adult_1_1i, see Guide)
df['adult_1_1i'] = hypothetical_data['heart_disease'].map({'Y':'1', 'N':'0'})

# map chest pain (to column adult_2_43, see Guide)
df['adult_2_43'] = hypothetical_data['chest_pain'].map({'Y':'1', 'N':'0', '':'9'})

I hope that this helps… if you’ve read this far, you probably have a hard job ahead of you! Please see the Jupyter Notebook version of this example here, and good luck!


Filed under software engineering

Replicability and reproducibility

Lots of material on Reproducible Research in my backlog… I’m going to get it out there for you (or at least for future-me).

—–Original Message—–
From: Reproducible On Behalf Of Ben Marwick
Sent: Monday, November 2, 2015 9:53 PM
Subject: [Reproducible] Language Log: Replicability vs. reproducibility — or is it the other way around?

A popular academic blog on linguistics just put up a post with a nice discussion of definitions of reproducibility in science:


Comments Off on Replicability and reproducibility

Filed under science policy

Ben Marwick on ‘The Conversation’: How computers broke science – and what we can do to fix it

—–Original Message—–
From: Reproducible On Behalf Of Ben Marwick
Sent: Monday, November 9, 2015 5:58 AM
Subject: [Reproducible] My article on ‘The Conversation’: How computers broke science – and what we can do to fix it

I wrote a short essay on reproducible research and how researchers use computers for a popular media outlet (citing UW eScience):


Please leave a comment at the bottom to help demonstrate to other readers that there really is a movement toward this way of working, and I’m not making it up!


Comments Off on Ben Marwick on ‘The Conversation’: How computers broke science – and what we can do to fix it

Filed under software engineering

Catalog of Reproducible Products

The Reproducibility and Open Science (ROS) Working Group recently finished up a form to begin to gather information on Reproducible Products from the community.

Please take a few minutes to submit information on any product (peer-reviewed manuscript, preprint, or other product). It is only about 20 questions with many multiple choice question.

The google form can be accessed @

Please feel free to let us know if you have any questions or comments.

Steven Roberts

Comments Off on Catalog of Reproducible Products

Filed under science policy

Jupyter Notebooks in GitHub

So cool:



I wonder what diffs look like?
Currently, not shown: https://github.com/fonnesbeck/statistical-analysis-python-tutorial/commit/17ca0cd15c1379f9adc4561042c4a31621baeef6

Is that next GitHub? It will be huge.

Comments Off on Jupyter Notebooks in GitHub

Filed under software engineering

Irreproducibile science as a communication failure

From: Abraham D. Flaxman
Sent: Thursday, May 7, 2015 4:40 PM
To: reproducible@u.washington.edu
Subject: [Reproducible] licenses and reproducibility: the scholarly communication lens

The recent discussion on reproducibility and licensing inspired me to read something historical about UW and software licensing that has been on my desk for a while. I think others on the list might find it interesting as well, so I scanned a copy for you: https://www.dropbox.com/s/79k92iwm20159of/williams_barnett_digital_ventures_2009.pdf?dl=0

I particularly like the idea that software is communication, and the university is an institute that is good at scholarly communication and at teaching. I think there is some framing here that could be valuable for reproducible research as well. Irreproducible results are, in a sense, a communication failure, and a lot of what we are talking about on this list are different ways to improve our scholarly communication.


Comments Off on Irreproducibile science as a communication failure

Filed under science policy