A colleague forwarded me this RFP specifically for replication of controversial impact evaluations. Its candidate study list includes a recent journal club article (by other IHME colleagues). Cool!
I’ve also made it really easy for someone to replicate the results in one recent paper I was involved in, on hepatitis C virus seroprevalence. Well, easy if you manage to get dismod installed… making that really easy is still on my to-do list.
I had a fun time on Monday talking to area high school students at the UW Math Department’s annual Math Day event. My slides and some others are now on the web.
Very cool new visualizations of the GBD2010 results are now on-line: http://viz.healthmetricsandevaluation.org/gbd-compare/
The massive project I’ve been working on since moving from math to global health has been published!
The Global Burden of Disease Study 2010 (GBD 2010) is the largest ever systematic effort to describe the global distribution and causes of a wide array of major diseases, injuries, and health risk factors. The results show that infectious diseases, maternal and child illness, and malnutrition now cause fewer deaths and less illness than they did twenty years ago. As a result, fewer children are dying every year, but more young and middle-aged adults are dying and suffering from disease and injury, as non-communicable diseases, such as cancer and heart disease, become the dominant causes of death and disability worldwide. Since 1970, men and women worldwide have gained slightly more than ten years of life expectancy overall, but they spend more years living with injury and illness.
GBD 2010 consists of seven Articles, each containing a wealth of data on different aspects of the study (including data for different countries and world regions, men and women, and different age groups), while accompanying Comments include reactions to the study’s publication from WHO Director-General Margaret Chan and World Bank President Jim Yong Kim. The study is described by Lancet Editor-in-Chief Dr Richard Horton as “a critical contribution to our understanding of present and future health priorities for countries and the global community.”
Now I have to get my book about the methods out the door as well…
I’m excited to call your attention to a paper that my co-author Ben Birnbaum is presenting next week at the ACM DEV conference:
This research is about… well, the title says it pretty clearly. I’m interested in using our approach to detect surprises in data quality in all kinds of settings. Ben did the heavy lifting for this paper, so he deserves a lot of the congratulations for the best paper award it received from the DEV 2012 program committee.
I’m afraid that Healthy Algorithms will be pretty quiet over the next month; I’ve got some other major writing commitments to attend to, and I need to ration my keystrokes if I’m going to make the deadline.
But here is something I’m happy to leave at the top of the page while I’m busy: the special issue of Population Health Metrics devoted to the Verbal Autopsy is provisionally available.
This includes the paper on using random forests for computer coding verbal autopsies that I’ve mentioned before, a paper describing the massive efforts that went into collecting a verbal autopsy validation dataset, and a paper on our take on the metrics of prediction quality that we recommend for any approach to verbal autopsy.
As a bonus, there is a commentary that quotes Foucault to put random forests in context.
I just got back from a very fun conference, which was the culmination of some very hard work, all on the Verbal Autopsy (which I’ve mentioned often here in the past).
In the end, we managed to produce machine learning methods that rival the ability of physicians. Forget Jeopardy; this is a meaningful victory for computers. Now Verbal Autopsy can scale up without pulling human doctors away from their work.
Oh, and the conference was in Bali, Indonesia. Yay global health!
I do have a Machine Learning question that has come out of this work; maybe one of you can help me. The thing that makes VA most different from the machine learning applications I have seen in the past is the large set of values the labels can take. For neonatal deaths, for which the set is smallest, we were hoping to make predictions out of 11 different causes, and we ended up thinking that maybe 5 causes is the most we could do. For adult deaths, we had 55 causes on our initial list. There are two standard approaches that I know of for converting binary classifiers to multiclass classifiers, and I tried both. Random Forest can produce multiclass predictions directly, and I tried this, too. But the biggest single improvement to all of the methods I tried came from a post-processing step that I have not seen in the literature, and I hope someone can tell me what it is called, or at least what it reminds them of.
For any method that produces a score for each cause, what we ended up doing is generating a big table of scores for a collection of deaths (one row for each death) and all the causes on our cause list (one column for each cause). Then we calculated the rank of the scores down each column, i.e. was this the largest score seen for this cause in the dataset, the second largest, etc.? To predict the cause of a particular death, we then looked across the row corresponding to that death and found the column with the best rank. This can be interpreted as a non-parametric transformation from scores into probabilities, but saying it that way doesn’t make it any clearer why it is a good idea. It is a good idea, though! I have verified that empirically.
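Here is a minimal sketch of that post-processing step in numpy; the function name and toy scores below are mine, just for illustration, not from the actual VA code:

```python
import numpy as np

def rank_postprocess(scores):
    """Predict a cause for each death from a (deaths x causes) score matrix.

    Within each column (cause), rank the scores across all deaths, with
    rank 1 being the largest score seen for that cause.  Then, for each
    row (death), predict the cause whose column-wise rank is best.
    """
    n_deaths, n_causes = scores.shape
    ranks = np.empty_like(scores, dtype=float)
    for j in range(n_causes):
        order = np.argsort(-scores[:, j])          # descending within this column
        ranks[order, j] = np.arange(1, n_deaths + 1)
    return ranks.argmin(axis=1)                    # best (smallest) rank across each row

# toy example: 3 deaths, 2 causes
scores = np.array([[0.9, 0.2],
                   [0.1, 0.3],
                   [0.5, 0.8]])
print(rank_postprocess(scores))                    # [0 1 1]
```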
So what have we been doing here?
OMG, I have gotten busy. I went to NIPS and the weekend disappeared and now it’s post-doc interview season again, already! So much to say, but I plan to pace myself. For this short post, an exciting announcement: my model of the insecticide-treated mosquito net distribution supply chain was used in the WHO 2010 World Malaria Report, which just came out. Since it is a Bayesian statistical model that draws samples from a posterior distribution with MCMC, it’s really nice that the report includes some of the uncertainty intervals around the coverage estimates. Guess what? There is a lot of uncertainty. But nets are getting to households and getting used. Pages 19 and 20 in Chapter 4 have the results of our hard work.
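This is not the bednet model itself, but here is a minimal sketch of the general pattern in PyMC (version 2.x), with made-up household survey numbers: draw posterior samples with MCMC and report a 95% uncertainty interval around a coverage estimate.

```python
import numpy as np
import pymc as pm

# hypothetical survey: nets observed in 120 of 200 households
n_households, n_with_net = 200, 120

coverage = pm.Uniform('coverage', lower=0., upper=1.)
obs = pm.Binomial('obs', n=n_households, p=coverage,
                  value=n_with_net, observed=True)

mcmc = pm.MCMC([coverage, obs])
mcmc.sample(iter=20000, burn=5000, thin=5)

samples = coverage.trace()
print('coverage: %.2f (95%% UI %.2f-%.2f)' %
      (samples.mean(), np.percentile(samples, 2.5), np.percentile(samples, 97.5)))
```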
My first first-authored global health paper came out today (I consider it my first “first-authored” paper ever, since the mathematicians I’ve worked with deviantly list authorship in alphabetical order regardless of seniority and contribution). It’s a bit of a mouthful by title: Rapid Scaling Up of Insecticide-Treated Bed Net Coverage in Africa and Its Relationship with Development Assistance for Health: A Systematic Synthesis of Supply, Distribution, and Household Survey Data.
What I find really pleasing about this research paper is the way it continues research I worked on in graduate school, but in a completely different and unexpected direction. Approximate counting is something that my advisor specialized in, and he won a big award for the random polynomial time algorithm for approximating the volume of convex bodies. I followed in his footsteps when I was a student, and I’m still doing approximate counting, it’s just that now, instead of approximating the amount of high-dimensional sand that will fit in an oddly shaped high-dimensional box, I’ve been approximating the number of insecticide-treated bednets that have made it from manufacturers through the distribution supply-chain and into the households of malaria-endemic regions of the world. I’m even using the same technique, Markov-chain Monte Carlo.
I’ve been itching to write about the computational details of this research for a while, and now that the paper’s out, I will have my chance. But for today, I refer you to the PLoS Med paper, and the technical appendix, and the PyMC code on github.
Check it out, my first published research in global health: Neonatal, postneonatal, childhood, and under-5 mortality for 187 countries, 1970–2010: a systematic analysis of progress towards Millennium Development Goal 4. I’m the ‘t’ in et al, and my contribution was talking them into using the really fun Gaussian Process in their model (and helping do it).
I’ve long wanted to write a how-to style tutorial about using Gaussian Processes in PyMC, but time continues to be on someone else’s side. Instead of waiting for that day, you can enjoy the GP Users Guide now.
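In the meantime, here is a tiny sketch of the pattern that guide covers, written from memory against the PyMC (2.x) gp module; the observation mesh and hyperparameters below are made up:

```python
import numpy as np
from pymc.gp import Mean, Covariance, Realization, observe
from pymc.gp.cov_funs import matern

# prior: zero mean, Matern covariance (hyperparameter values are illustrative)
M = Mean(lambda x: np.zeros(len(x)))
C = Covariance(eval_fun=matern.euclidean, diff_degree=2., amp=1., scale=1.)

# condition on a few made-up observations, with observation variance 0.01
x_obs = np.array([0., 1., 2.5])
y_obs = np.array([0.1, 0.8, 0.2])
observe(M, C, obs_mesh=x_obs, obs_vals=y_obs, obs_V=0.01 * np.ones(3))

# posterior mean and one random draw on a prediction mesh
x_pred = np.linspace(0., 3., 31)
print(M(x_pred))       # posterior mean
f = Realization(M, C)
print(f(x_pred))       # one draw from the posterior GP
```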