Random Forest Verbal Autopsy Debut

I just got back from a very fun conference, the culmination of some very hard work on the Verbal Autopsy (which I’ve mentioned often here in the past).

In the end, we managed to produce machine learning methods that rival the ability of physicians. Forget Jeopardy; this is a meaningful victory for computers. Now Verbal Autopsy can scale up without pulling human doctors away from their work.

Oh, and the conference was in Bali, Indonesia. Yay global health!

I do have a machine learning question that has come out of this work; maybe one of you can help me. The thing that makes VA most different from the machine learning applications I have seen in the past is the large set of values the labels can take. For neonatal deaths, where the set is smallest, we were hoping to make predictions among 11 different causes, and we ended up thinking that maybe 5 causes is the most we could do. For adult deaths, we had 55 causes on our initial list.

There are two standard approaches that I know of for converting binary classifiers to multiclass classifiers, and I tried both. Random Forest can produce multiclass predictions directly, and I tried this, too. But the biggest single improvement to all of the methods I tried came from a post-processing step that I have not seen in the literature, and I hope someone can tell me what it is called, or at least what it reminds them of.
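
For concreteness, the two standard reductions are presumably one-vs-rest and one-vs-one; here is a minimal scikit-learn sketch of those, plus Random Forest’s direct multiclass mode, with synthetic data standing in for the real VA features and cause labels:

```python
# Minimal sketch: two standard binary-to-multiclass reductions, plus
# Random Forest's native multiclass mode. The data is synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=10, n_classes=5, random_state=0)

ovr = OneVsRestClassifier(RandomForestClassifier()).fit(X, y)  # one-vs-rest
ovo = OneVsOneClassifier(RandomForestClassifier()).fit(X, y)   # one-vs-one
rf = RandomForestClassifier().fit(X, y)                        # direct multiclass
```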

For any method that produces a score for each cause, what we ended up doing is generating a big table with scores for a collection of deaths (one row for each death) for all the causes on our cause list (one column for each cause). Then we calculated the rank of the scores down each column: was it the largest score seen for this cause in the dataset, second largest, etc. To predict the cause of a particular death, we then looked across the row corresponding to that death and found the column with the best rank. This can be interpreted as a non-parametric transformation from scores into probabilities, but saying it that way doesn’t make it any clearer why it is a good idea. It is a good idea, though! I have verified that empirically.
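
In code, the whole post-processing step is just a couple of lines; here is a minimal NumPy sketch, with a made-up score table standing in for the real classifier outputs:

```python
import numpy as np

# Toy score table: one row per death, one column per cause. In practice
# these would be the per-cause scores from Random Forest (or any other
# method); the numbers here are made up for illustration.
scores = np.array([[0.30, 0.60, 0.10],
                   [0.20, 0.90, 0.40],
                   [0.10, 0.80, 0.70]])

# Rank down each column: rank 1 = largest score seen for that cause
# across all deaths in the dataset. The double argsort converts scores
# to ranks; the negation makes the ranking descending.
ranks = np.argsort(np.argsort(-scores, axis=0), axis=0) + 1

# Predict: for each death (row), choose the cause (column) with the
# best (smallest) rank.
predicted_cause = ranks.argmin(axis=1)
print(predicted_cause)  # array([0, 1, 2]) for this toy table
```

Note that for the first row the raw scores favor the middle cause (0.60), but 0.30 is the largest score seen anywhere in the first column, so the ranking picks the first cause instead; that reshuffling relative to raw argmax is exactly what this step changes.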

So what have we been doing here?

Filed under TCS

7 responses to “Random Forest Verbal Autopsy Debut”

  1. Possibly related: Kai-Bo Duan and S. Sathiya Keerthi, Which Is the Best Multiclass SVM Method? An Empirical Study — http://www.springerlink.com/content/r9deb9rv9x5qdxjc/

  2. rif

    I’m a bit confused.

    Did you compute this big matrix of ranked scores across just your test set, or your test+training data? Either way, you’re no longer making independent predictions on independent data points. [Unless maybe you just ranked the training points and added each test point individually into these rankings, but I bet that’s not what you did?]

    Were your class memberships about equal? It seems like this method would tend to produce [at least roughly] equal numbers of outputs per class, regardless of the true class membership of the data. It feels like a system where each of the classes takes turns “drafting” a test point.

  3. @rif, you’ve got it just right. I’ve tried this with just the test set and with test+train, and if you like, you can think of the latter as a “semi-supervised” method. This isn’t the important feature, though, because even if I use just the training data and a single test point, it is still helpful.

    When we started using this approach, the class sizes were pretty well balanced, but since predicting the class sizes themselves in the test set is one of the important applications of VA, we switched to a validation setup where the class sizes in the test set were selected randomly, and then the test data points were resampled to have these sizes. The ranking approach still gives a big improvement there.

    As you say, this can be viewed as a “drafting” system, where each class takes turns picking its team. Is it starting to sound like something you’ve seen before?

  4. rif

    Actually, I was wrong. It’s not really like a drafting system per se. For example, if we add a single new test point, it could conceivably be ranked “1” by all the classifiers. And a second new point could conceivably be ranked “2” by all the classifiers.

    What’s really going on is that your ranking system is a form of whitening the output of each classifier: you’re roughly constructing an empirical CDF, an answer to “what percentage of the outputs from this classifier is this one likely to be larger than?”

  5. @rif: Whitening, huh? Maybe that is the keyword I was missing. Do you have a good reference for me? And should I call this “non-parametric whitening”? (For anyone following along, I’ve put a small numerical sketch of this rank-to-empirical-CDF equivalence after the comments.)

  6. Pingback: A Slide I Like | Healthy Algorithms
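
Update: to make rif’s empirical-CDF observation concrete, here is a minimal numerical sketch. The scores are made up; the point is only that the descending column rank and the empirical CDF carry the same information, up to an affine rescaling:

```python
import numpy as np

# Made-up scores for a single cause (one column of the big score table).
col = np.array([0.9, 0.1, 0.4, 0.7])
n = len(col)

# Descending rank: 1 = largest score seen for this cause in the dataset.
desc_rank = np.argsort(np.argsort(-col)) + 1   # -> [1, 4, 3, 2]

# Empirical CDF at each score: the fraction of this classifier's outputs
# that the score is at least as large as.
ecdf = np.array([(col <= c).mean() for c in col])  # -> [1.0, 0.25, 0.5, 0.75]

# Same information: the best (smallest) rank is exactly the largest ECDF value.
assert np.allclose(ecdf, (n - desc_rank + 1) / n)
```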