My “Training Modules” on reproducible computational research for secondary data analysis

Here is something I want to make one day:

A7) Overall goal and specific core competencies for training module
The overall goal of this module is to improve the quality, reproducibility, and usability of results from secondary analysis through the development, implementation, and refinement of training modules on best practices and reproducibility in secondary data analysis. This will be achieved through two specific aims:
• Aim 1: Leverage existing experience and resources available at the UW and beyond to create a framework for online training that is of broad utility for the development of training modules. We will create a broadly applicable framework that includes a workshop to establish best practices for secondary analysis; preliminary vetting and evaluation of module content; creation, implementation, and evaluation of online modules; and ongoing programmatic evaluation and refinement. While we will apply this framework to data reproducibility in secondary analysis, we will seek to develop a format that is amenable to developing modules on a range of future topics of interest to biomedical and population health researchers.

• Aim 2: Enrich the educational and research training of early-stage scientists within the University and beyond. A module on data reproducibility will make research involving secondary analysis easier and of higher quality when investigators have a more complete understanding of secondary data approaches and their limitations, and have access to standardized methods for data seeking, tracking, cleaning, and reporting. We seek to produce a training module that is engaging and self-directed in format, and broadly applicable and understandable across a range of scientific disciplines and levels of expertise.

Specific to education and research training, we seek to grow participant capacity in three major areas: search and data management; code and analysis; and dissemination of necessary methodological details. These three areas are divided into nine potential core competencies that are critical to high-quality secondary analysis. These competencies are based on the requirements articulated by the eScience Reproducible Research Working Group for a range of types of data reproducibility. Together with this Working Group, we have also identified a number of useful tools and skills for meeting these requirements, which can be translated into learning objectives and modular training videos during the course of this project. Potential additional competencies for future inclusion include identifying and dealing with missing data and weighting data sources. The nine currently proposed competencies are:

1. Designing and documenting a search strategy. Essential to conducting secondary data analysis is the choice of how to identify data and which data to use (or not use) for secondary analysis. Equally important, from the perspective of reproducibility, is documenting these choices and the reasoning behind them, including the search terms, the inclusion and exclusion criteria, and the decisions made based on these criteria. This can be done through systematic review, where it is essential to document the approach correctly to make it empirically reproducible. The PubMed database is the mainstay of biomedical and public health research, and the module will include a detailed example of documenting a PubMed search strategy. Because there are a range of other relevant databases, the module will include a comparison of some alternatives to PubMed, such as Elsevier’s Scopus and the Thomson Reuters Web of Science, with the goal of indicating how things might differ between databases. In addition, the module will include an example using survey and other administrative data (e.g. US Census, Macro DHS), with a comparison of how to document searches of health databases such as the Global Health Data Exchange (GHDx), the Integrated Public Use Microdata Series (IPUMS), or the Harvard DataVerse Network. The GHDx, maintained and run by IHME, is the world’s largest catalogue of health-related data and contained records for 41,914 sources as of Nov. 10, 2014. IPUMS, housed at the Minnesota Population Center, is the world’s largest individual-level population database, and includes 159 samples from 55 countries. The Harvard DataVerse Network, developed by the Harvard Institute for Quantitative Social Sciences, is a repository for data from all scientific disciplines and includes data from 55,106 studies. Although it is impossible to anticipate all of the possible future options for data search strategies, the goal will be to prepare students for a range of situations that they might face.
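
As a concrete illustration of what documenting a search strategy might look like in practice, here is a minimal sketch (not part of any existing module) that runs a PubMed query through NCBI’s public E-utilities interface and logs the exact search term, the date the search was run, and the number of records returned; the query string and log file name are placeholders.

```python
# Minimal sketch: record a PubMed search so it can be re-run and audited later.
# Assumes only the Python standard library and NCBI's public E-utilities endpoint.
import json
import urllib.parse
import urllib.request
from datetime import date

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def log_pubmed_search(term, log_path="search_log.json"):
    """Run a PubMed esearch query and append the query, date, and hit count to a log file."""
    params = urllib.parse.urlencode({"db": "pubmed", "term": term, "retmode": "json"})
    with urllib.request.urlopen(f"{ESEARCH_URL}?{params}") as response:
        result = json.load(response)["esearchresult"]

    entry = {
        "database": "PubMed",
        "search_term": term,                      # the exact query string, including filters
        "date_run": date.today().isoformat(),
        "records_found": int(result["count"]),
    }

    # Append to a running log so every search (kept or discarded) leaves a record.
    try:
        with open(log_path) as f:
            log = json.load(f)
    except FileNotFoundError:
        log = []
    log.append(entry)
    with open(log_path, "w") as f:
        json.dump(log, f, indent=2)
    return entry

# Example (hypothetical query):
# log_pubmed_search('("insecticide-treated nets"[Title/Abstract]) AND ("2010"[PDAT] : "2014"[PDAT])')
```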

2. Capturing essential information on data provenance as “meta-data”. In addition to a clear record of the search strategy and the inclusion/exclusion decision for each secondary data source, it is a great aid to reproducibility to record key information on the secondary data sources themselves. Including key fields, such as a digital object identifier (DOI), in a database of sources can be a huge time saver, although not all secondary data sources have a DOI. Recording information about how to gain access to data sources that are not publicly available is also of great value in making secondary data analysis reproducible. There are existing, freely available tools that can aid in this process, such as Zotero [28]. The process of creating and refining the GHDx database at IHME has generated a wealth of operational experience on this topic, which to date has not been formalized for wider dissemination, although it has been used in internal trainings at IHME [29].
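
As a sketch of what such a provenance record might look like (the field names are illustrative only, not the GHDx schema or any existing standard), each secondary source could be appended as one row of a shared catalogue file:

```python
# Sketch of a minimal "meta-data" record per secondary data source.
# Field names are illustrative, not taken from the GHDx or any existing catalogue.
import csv
from pathlib import Path

FIELDS = [
    "source_id",    # short identifier used in analysis code and file names
    "title",
    "doi",          # may be blank: not every secondary source has a DOI
    "access_url",   # where the data were obtained
    "access_date",  # when they were downloaded or received
    "access_notes", # e.g. data use agreement, contact for restricted data
]

def record_source(catalogue="sources.csv", **fields):
    """Append one provenance record to a CSV catalogue, writing a header if the file is new."""
    path = Path(catalogue)
    write_header = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow({k: fields.get(k, "") for k in FIELDS})

# Example (hypothetical entry):
# record_source(source_id="dhs_gh_2008", title="Ghana DHS 2008", doi="",
#               access_url="http://dhsprogram.com/data/", access_date="2014-11-10",
#               access_notes="Registration and project approval required")
```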

3. Processes and tools for data cleaning, formatting and merging. Secondary analysis and meta-analysis involve the extraction and integration of data across a number of sources, and require a minimum level of work to ensure information is properly cleaned and formatted prior to analysis. Modern free/libre open source software like Tabula can greatly increase the speed and accuracy of extracting information from tables and graphs as part of systematic review [30,31]. Shareware like Data Thief [32,33] and other proprietary software (or software-as-a-service solutions like Captricity [34]) can also be useful, and will be introduced to the degree appropriate for a federally funded training module. We will also discuss how to use automatic scripts for merging and formatting information from existing databases to enhance reproducibility and accuracy.
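
A minimal sketch of such a merging script, assuming the pandas library and two hypothetical extracted files, might look like the following; the point is that every reformatting decision lives in re-runnable code rather than in manual spreadsheet edits.

```python
# Sketch of a scripted clean-and-merge step using pandas (file and column names are hypothetical).
# Because every step is code, re-running the script reproduces the merged dataset exactly.
import pandas as pd

def load_and_clean(path, country_col, value_col):
    """Load one extracted table and standardize its column names and types."""
    df = pd.read_csv(path)
    df = df.rename(columns={country_col: "country", value_col: "value"})
    df["country"] = df["country"].str.strip().str.title()      # normalize spelling and case
    df["value"] = pd.to_numeric(df["value"], errors="coerce")  # flag non-numeric entries as NaN
    return df[["country", "value"]]

def build_merged_dataset():
    survey = load_and_clean("extracted_survey_estimates.csv", "Country Name", "Coverage")
    admin = load_and_clean("extracted_admin_records.csv", "country", "reported_coverage")
    merged = survey.merge(admin, on="country", how="outer", suffixes=("_survey", "_admin"))
    merged.to_csv("merged_coverage.csv", index=False)
    return merged

if __name__ == "__main__":
    build_merged_dataset()
```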

4. Double entry, resolving conflicts, and spot checking. Whether the data cleaning, formatting, and merging are accomplished by hand or automatically, some form of quality assurance is necessary. The most established, but also most time-consuming, approach to data quality assurance [35] for by-hand data collection is “double entry”, where the work is done independently by two different coders and any discrepancies are then adjudicated [36]. This is a time-intensive method, however, and it is important to be realistic about the corners that will be cut when allocating scarce resources, such as researcher time. A quality assurance approach from statistical process control is a cost-saving alternative: by spot checking a representative sample of the secondary data, it is possible to measure the error rate in the cleaned data. Certain types of spot checking can even be automated, so that a set of scripts can automatically validate any additions or updates to the secondary database.
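
As a sketch of how automated spot checking might be wired up (file names, column names, and validation rules here are hypothetical, and pandas is assumed), a script can apply simple rule-based checks to every row and draw a fixed-seed random sample for human verification against the original sources:

```python
# Sketch of automated spot checking for a cleaned secondary dataset.
# The file name and validation rules are illustrative; real rules come from the codebook.
import pandas as pd

def validate(df):
    """Apply simple automatic checks and return a boolean 'row passes' series."""
    return (
        df["country"].notna()
        & df["year"].between(1990, 2015)   # plausible study period
        & df["value"].between(0.0, 1.0)    # coverage recorded as a proportion
    )

def spot_check(path="analysis_dataset.csv", n=20, seed=12345):
    df = pd.read_csv(path)
    passes = validate(df)
    print(f"{(~passes).sum()} of {len(df)} rows fail automatic checks")

    # Draw a fixed-seed random sample for a human to verify against the original sources;
    # the share of errors found in this sample estimates the error rate of the whole file.
    sample = df.sample(n=min(n, len(df)), random_state=seed)
    sample.to_csv("spot_check_sample.csv", index=False)
    return sample

if __name__ == "__main__":
    spot_check()
```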

5. Literate programming. To make secondary data analysis reproducible, it is just as important to have a clear record of the analysis as it is of the data. This certainly includes a lucid methods section in the write-up, but it is greatly enhanced by a well-documented body of computer code for performing the analysis. Literate programming was developed by star computer scientist Donald Knuth, and in simplest terms it is the practice of writing programs for humans first and computers second. Although it may be a challenge for large software engineering projects not led by Knuth to make use of literate programming, with modern tools it is quite easy to incorporate this innovation into small- and moderate-scale data analysis. For example, the IPython Notebook (soon to be rebranded Project Jupyter, and extended to non-Python analysis tools, such as R) provides a mechanism to intersperse analysis code, formatted text (typeset with Markdown), and mathematical equations (typeset with LaTeX). Sweave and knitr are alternative approaches that offer similar functionality for an R-centric analyst.
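
For a flavor of how narrative, code, and equations interleave, here is a hedged sketch in the plain-text “percent” cell format, which tools such as Jupytext can convert to and from a notebook; the analysis shown is a placeholder, not a piece of any real module.

```python
# Sketch of a literate analysis script in the "percent" cell format; tools such as
# Jupytext can round-trip this file to an IPython/Jupyter notebook.
# The analysis itself is a placeholder.

# %% [markdown]
# # Coverage trend, 2000-2014
# We fit a simple linear trend to the cleaned coverage estimates.
# The model is $y_t = \alpha + \beta t + \epsilon_t$, and $\beta$ is the
# quantity of interest (change in coverage per year).

# %%
import numpy as np
import pandas as pd

df = pd.read_csv("analysis_dataset.csv")              # hypothetical cleaned file
beta, alpha = np.polyfit(df["year"], df["value"], 1)  # slope and intercept

# %% [markdown]
# The estimated trend is reported below and discussed in the write-up.

# %%
print(f"Estimated change in coverage per year: {beta:.3f}")
```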

6. Test-driven development. The comprehensive testing of software is essential for ensuring that it performs as intended, and this should be included in any methodological innovations developed for research purposes. Test-driven development (TDD) is an approach to software testing that makes testing an integral part of writing code. It relies on the development of a suite of automatic tests of functionality, written simultaneously with the code for each function itself. A short summary of the TDD process is: “write a test, check that it fails, write code, and check that the test passes.” It can be simple and engaging to incorporate this process into the workflow of secondary data analysis, where it takes the form of both creating functional tests to automatically verify the format and contents of the reformatted secondary data, and creating unit tests to verify that any data analysis code is functioning as expected on simple test cases. We have used this approach successfully in the development and annual updating of estimates of long-lasting insecticidal net (LLIN) coverage in global malaria control efforts [37,38], but it has not been widely adopted to date, due to lack of training and dissemination. Testing has been a challenge in Software Carpentry, and I hope that the TDD approach will provide the missing piece in translating this practice from software engineering to science [39].
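
A minimal sketch of this loop for a hypothetical data-cleaning function, using pytest conventions, is below; in practice the tests are written first, run to confirm they fail, and only then is the function implemented.

```python
# Sketch of test-driven development for a small data-cleaning step (pytest style).
# The function and tests are illustrative; in the TDD loop the tests come first.

def clean_coverage_value(raw):
    """Convert a raw coverage entry like '45%' or '0.45' to a proportion in [0, 1]."""
    text = str(raw).strip()
    if text.endswith("%"):
        return float(text[:-1]) / 100.0
    value = float(text)
    return value / 100.0 if value > 1.0 else value

# --- unit tests (run with `pytest`) ---

def test_percent_string():
    assert clean_coverage_value("45%") == 0.45

def test_already_a_proportion():
    assert clean_coverage_value(0.45) == 0.45

def test_whole_number_percent():
    assert clean_coverage_value("45") == 0.45
```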

7. Version control. Data analysis is an iterative process, and it is important to keep track of changes to the analysis as the iterations occur. Version control is an important solution for tracking changes in software development, making it possible to record each change, annotate it, and maintain branches as part of the methodological development process. This makes it easier to try out changes, both during analysis and when writing up results. It can also solve the problem of having many copies of a file with similar names, modified to show the date or initials of the researcher who last edited it. The importance of version control in reproducible computational research was identified at a recent workshop on reproducibility sponsored by the Society for Industrial and Applied Mathematics [40,41].

8. Code review. Even with comprehensive testing, errors happen. Code review is a way to catch these errors, and also to improve the speed and readability of analysis code. There are many approaches to code review, but all rely on having at least one other person read portions of your source code and provide feedback on it. This can improve code quality directly, to the degree that the feedback is incorporated, but it can also improve code quality through something analogous to a Hawthorne effect, where simply knowing that the code will be reviewed inspires better work on the part of the code writer. There have recently been a handful of blog posts and academic articles about incorporating code review into regular laboratory practice [42–49], and the lightweight approach advocated by Philip Guo [48] seems particularly suited to routine use in secondary data analysis. The mechanics of code review, as described by Guo (who attributes the practice to Rob Miller), have four researchers work together with a single facilitator in a 30-minute session to review one page of code per researcher.

9. Replication archive. The previous eight practices will capture all of the essential information for making a secondary data analysis reproducible, but they are of no use if that information is not communicated in some way. A replication archive is a way to ensure that, regardless of what can fit into the publication and what can be included as an online appendix, the precise details necessary for replication, including the details of code development, are released to complement the published findings. There are several approaches available, including archiving the work on a private website, working with an academic library affiliated with the researcher’s institution, or depositing the materials in the DataVerse Network. We will advocate replication archives in the spirit of “publish your code, it is good enough” [50], and together with the four practices for reproducible analysis (literate programming, test-driven development, version control, and code review), it really will be good enough.
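
As one small, optional illustration of the mechanics (a sketch only, not part of any archive standard), a script can write a checksum manifest for the archive directory so that anyone who downloads the materials can verify they have the exact files used in the published analysis:

```python
# Sketch: write a SHA-256 manifest for a replication archive directory, so the archived
# code and data can be verified bit-for-bit by anyone who downloads them.
# The directory layout and file names are hypothetical.
import hashlib
from pathlib import Path

def write_manifest(archive_dir="replication_archive", manifest_name="MANIFEST.txt"):
    root = Path(archive_dir)
    lines = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.name != manifest_name:
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            lines.append(f"{digest}  {path.relative_to(root)}")
    (root / manifest_name).write_text("\n".join(lines) + "\n")
    return lines

if __name__ == "__main__":
    write_manifest()
```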



One response to “My “Training Modules” on reproducible computational research for secondary data analysis”

  1. I completely agree, much more focus on these things is needed. #5 and #6 are the topic of some of my recent work for math modeling, which absolutely extends to data analysis.

    My blog details some python tools I put together for doing this: http://robclewley.github.io/2015/05/06/logging-and-diagnostic-tools-for-numeric-python-algorithms
    http://robclewley.github.io/2015/04/09/capturing-the-provenance-of-model-prototyping-and-development—part-2

    Also, version control over iPython notebooks is still in its infancy, but I got Nick Bollweg interested in the idea and he has put some effort into getting an implementation started.