Just because I missed posting for the last year doesn’t mean I have not been writing. Perhaps I have been writing more. Here is something that I just wrote for a perspective piece on opportunities for machine learning in population health.
Machine learning (ML) is emerging as a technology, climbing the “peak of inflated expectations” or perhaps even starting to slip into the “trough of disillusionment”, in the terms of the technology hype cycle,[ref] and offers both opportunities and threats to population health. ML is a technique for constructing computer algorithms, and what distinguishes ML methods from other computer solutions is that, while the structure of the computer program may be fixed, the details are learned from data. This data-driven approach is now dominant in Artificial Intelligence (AI), especially through deep neural networks, and stands in contrast to the old way, an expert-algorithms approach in which rules summarizing expert knowledge were painstakingly constructed by engineers and domain specialists. ML has succeeded by trading experts and programmers for data and nonparametric statistical models. However, the applications where ML has been successfully deployed remain limited. AI luminary Andrew Ng provides this concise heuristic: “[i]f a typical person can do a mental task with less than one second of thought, we can probably automate it using AI either now or in the near future.”[ref]
The editor only wants 1,000 words, so this is getting cut.
Cool paper, cool idea, ICYMI:
From: Mabry, Patricia L
Sent: Thursday, January 14, 2016 5:51 AM
Subject: [iuni_systems_sci-l] Article of interest: reusable holdout method
Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., & Roth, A. (2015). The reusable holdout: Preserving validity in adaptive data analysis. Science, 349(6248), 636-638.
Misapplication of statistical data analysis is a common cause of spurious discoveries in
scientific research. Existing approaches to ensuring the validity of inferences drawn from data
assume a fixed procedure to be performed, selected before the data are examined. In common
practice, however, data analysis is an intrinsically adaptive process, with new analyses
generated on the basis of data exploration, as well as the results of previous analyses on the
same data. We demonstrate a new approach for addressing the challenges of adaptivity based
on insights from privacy-preserving data analysis. As an application, we show how to safely
reuse a holdout data set many times to validate the results of adaptively chosen analyses.
It was great speaking with you. This is the paper I was talking about.
Looking forward to knowing more about each other’s work.
I just got back from a very fun conference, which was the culmination of some very hard work, all on the Verbal Autopsy (which I’ve mentioned often here in the past).
In the end, we managed to produce machine learning methods that rival the ability of physicians. Forget Jeopardy, this is a meaningful victory for computers. Now Verbal Autopsy can scale up without pulling human doctors away from their work.
Oh, and the conference was in Bali, Indonesia. Yay global health!
I do have a machine learning question that has come out of this work; maybe one of you can help me. The thing that makes VA most different from the machine learning applications I have seen in the past is the large set of values the labels can take. For neonatal deaths, for which the set is smallest, we were hoping to make predictions out of 11 different causes, and we ended up thinking that maybe 5 causes is the most we could do. For adult deaths, we had 55 causes on our initial list. There are two standard approaches that I know for converting binary classifiers to multiclass classifiers, and I tried both. Random Forest can produce multiclass predictions directly, and I tried this, too. But the biggest single improvement to all of the methods I tried came from a post-processing step that I have not seen in the literature, and I hope someone can tell me what it is called, or at least what it reminds them of.
For any method that produces a score for each cause, what we ended up doing is generating a big table with scores for a collection of deaths (one row for each death) for all the causes on our cause list (one column for each cause). Then we calculated the rank of the scores down each column, i.e. was it the largest score seen for this cause in the dataset, second largest, etc., and then to predict the cause of a particular death, we looked across the row corresponding to that death and found the column with the best rank. This can be interpreted as a non-parametric transformation from scores into probabilities, but saying it that way doesn’t make it any clearer why it is a good idea. It is a good idea, though! I have verified that empirically.
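To make the step above concrete, here is a minimal sketch in Python with NumPy. It is an illustration under assumptions, not the code we actually ran: the function name `rank_based_prediction` and the toy score table are made up, and the tie-breaking rule (the first column wins when two causes have the same rank in a row) is my own choice for the example.

```python
import numpy as np

def rank_based_prediction(scores):
    """Predict a cause for each death from a (deaths x causes) score table.

    Hypothetical helper for illustration only.
    Step 1: rank the scores down each column (1 = largest score for that cause).
    Step 2: for each row, pick the column whose rank is best (smallest).
    """
    scores = np.asarray(scores, dtype=float)
    # argsort of argsort turns values into 0-based ranks; negate so that the
    # largest score gets rank 1. Ties within a column are broken by row order.
    order = np.argsort(-scores, axis=0, kind="stable")
    ranks = np.argsort(order, axis=0, kind="stable") + 1
    # Assumption: if two causes tie for best rank in a row, the first one wins.
    predicted = ranks.argmin(axis=1)
    return ranks, predicted

# Toy table: 4 deaths (rows), 2 causes (columns). The second cause gets
# uniformly high scores, so taking the raw maximum across each row would
# always predict cause 1.
scores = np.array([
    [0.90, 0.95],
    [0.20, 0.99],
    [0.50, 0.96],
    [0.10, 0.97],
])

raw_prediction = scores.argmax(axis=1)
ranks, predicted = rank_based_prediction(scores)

print(raw_prediction)  # [1 1 1 1]
print(ranks)
print(predicted)       # [0 1 0 1]
```

In the toy table the raw scores always favor the second cause, while the column ranks pick out the deaths that score unusually high for the first cause relative to the other deaths in the dataset.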
So what have we been doing here?