Verbal Autopsy Challenge from AI-D

I was down in Palo Alto last week to attend the AAAI session on Artificial Intelligence for Development. The proceedings should be available online soon.

I was there to connect with other theoretical computer scientists and find out how they have been applying machine learning to “development”. It turned out that in this crowd, development mostly means applications to health, education, and agriculture.

I was also there to share a very concrete challenge problem that I’ve been dabbling in here at IHME, which my colleague Sean Green presented in our short paper: the Verbal Autopsy.

Instead of recapping the problem in detail here, I’ll point you to our paper, and try to say just enough to get you interested. In many parts of the world, there are no death certificates, so it’s hard to know which diseases should be public health priorities. To get some idea, you can conduct interviews with relatives of recently deceased people, asking them about the signs and symptoms of illness that they observed shortly before death. These interviews are verbal autopsies. What to do with these interview results? Well, the standard practice is to hire local physicians to read them and diagnose the cause of death, and then use aggregate statistics of their findings in priority setting. But there are some problems: physicians are not very accurate in their diagnoses, and, especially in places where there aren’t enough doctors, these physicians could be spending their time on people who are not yet dead.

I think it’s a great place for robots! There has previously been a stumbling block in validating machine learning techniques, however, which is the lack of “labeled examples”. But, just before heading off to AI-D, I got some good news. Sean and I were able to convince the IHME top brass to release some appropriately anonymized verbal autopsy data, together with gold-standard cause-of-death diagnoses. I put it in a github repository, verbal-autopsy-challenge. Maybe when I have some time, I’ll put some sample code in there, too.

I hope the format is self-explanatory, and if it’s not, leave a comment and we can figure out how to describe it better. It looks like this:
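In the meantime, here is a minimal Python sketch of loading data in the general shape described above: one row per death, symptom columns, and a gold-standard cause-of-death column. The column names and values in the snippet are made up for illustration, not taken from the posted file.

```python
import csv
import io

# Toy snippet in the assumed layout of the challenge CSV; every name
# and value here is illustrative, not from the real data.
sample = """symptom1,symptom2,symptom3,cause_of_death
1,45,0,3
0,62,99,17
1,-1,1,3
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# Split features from the label, the way a classifier would want them.
X = [{k: v for k, v in r.items() if k != "cause_of_death"} for r in rows]
y = [int(r["cause_of_death"]) for r in rows]
```

With the real file, you would replace the `io.StringIO(sample)` with an open file handle on the repository’s CSV.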


The comment section is also a great place to discuss machine learning approaches to this task. If you use the data in a paper, please cite our AI-D paper, S. T. Green and A. D. Flaxman, Machine Learning Methods for Verbal Autopsy in Developing Countries, 2010.


Filed under global health

7 responses to “Verbal Autopsy Challenge from AI-D”

  1. Hi Abraham,
    Nice to meet you at AID, and really excited that you’re making this data available.

  2. Hi Kuang, it was nice to meet you, too. Let me know how it goes if you get a chance to work on this data! 🙂

  3. Hi Abraham, I’m still working on my verbal autopsy project and still looking to use the dataset that you used in your paper. I know you have explained that you cannot give out all the details on the symptoms, and I fully understand and accept this. However, to allow me to interpret the csv file, it would be very helpful to understand which columns are the actual symptoms of the diseases. In my project I am trying to take the disease symptoms and run them through various classifiers to see how accurately they predict probable cause of death. At present, when I upload the file into WEKA, I am getting some very strange results.

     In your paper you say the file has 928 rows and 1528 attributes, of which 200 actually correspond to VA survey questions, and that there are 140 causes of death. So please could you advise which columns are disease symptoms; it would help me enormously to make sense of the data. Finally, on the 140 causes of death: in column “EM”, annotated “cause of death”, there are numbers 1–32, so I interpreted this as meaning that 32 causes of death were categorized. Please could you explain; I must be missing something.

     I apologise for all the questions, but this is the first sample that I have come across that looks very promising indeed and is of a suitable size. I have been many places to get VA data and have struggled enormously; the best I have been able to get is 5 VAs from Ghana. So as you can see I have a real problem with sample size! Thank you for reading and hoping you can help. Rebecca

  4. Hi Rebecca,
    I worked on the verbal autopsy paper with Abie and I think I can answer some of your questions. The symptoms are a mixture of categorical, continuous, and binary data. If it helps I can let you know the following:
    1) symptom2 is an age variable and should be treated as continuous
    2) symptoms 27, 40, 45, 73, 77, 81, 83, 90, and 138 all describe the duration of symptoms listed elsewhere in the survey and should also be treated as continuous.
    3) symptom 140 is a location variable and should be treated as categorical.
    4) All other symptoms should be treated as categorical. If the symptom values are binary, then it is a yes/no question. If the values are integers and include several different values then it is a symptom question with many categories.
    5) For any of the symptoms, there are two special values you should take note of:
    a) A value of “99” indicates “did not know”
    b) A value of “-1” indicates “no response”

    In the paper we state that the sample Bangladesh data set has 928 rows, 1528 attributes, and 140 causes of death; however, the Bangladesh data set is not the one we posted. The data we posted contains anonymized data from another country. It has only 142 symptom questions (if you consider age, location, and duration to be symptoms) and has only 32 unique causes of death.
    So you were correct when you determined that there are 32 causes of death.
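The typing rules in the list above can be sketched in Python. The helper names and the use of `None` for missing values below are my own assumptions, not part of the posted data; only the symptom indices and special codes come from the comment itself.

```python
# Column-typing rules from the list above.  Symptom indices follow the
# posted CSV; the function names are hypothetical helpers.
CONTINUOUS_SYMPTOMS = {2, 27, 40, 45, 73, 77, 81, 83, 90, 138}  # age + durations

def symptom_kind(index):
    """How a symptom column should be treated by a classifier."""
    if index in CONTINUOUS_SYMPTOMS:
        return "continuous"
    # Everything else, including the location variable (symptom 140)
    # and the yes/no questions, is categorical.
    return "categorical"

def clean_value(raw):
    """Map the two special codes to None ("missing") before modeling."""
    v = int(raw)
    if v in (99, -1):  # 99 = "did not know", -1 = "no response"
        return None
    return v
```

Treating 99 and -1 as missing rather than as ordinary category values is one plausible choice; WEKA users could instead recode them as `?` in the ARFF file.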

    I hope this helps!


  5. Pingback: Global Congress on Verbal Autopsy in 2011 open for abstract submission « Healthy Algorithms

  6. Pingback: Random Forest Verbal Autopsy Debut | Healthy Algorithms