Monthly Archives: January 2016

Using the sklearn text.CountVectorizer

I have been having some great success with the scikit-learn CountVectorizer transformations. Here are some notes on how I like to use it:

import sklearn.feature_extraction

ngram_range = (1,2)

clf = sklearn.feature_extraction.text.CountVectorizer(
        ngram_range=ngram_range,
        min_df=10,  # minimum number of docs that must contain n-gram to include as a column
        #tokenizer=lambda x: [x_i.strip() for x_i in x.split()]  # keep '*' characters as tokens
    )

There is a stop_words parameter that is also sometimes useful.
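For context, here is a minimal usage sketch; the toy corpus and the min_df value are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

# toy corpus; min_df=2 keeps only n-grams that appear in at least 2 documents
docs = [
    "the cat sat",
    "the cat sat on the mat",
    "dogs chase cats",
]
vec = CountVectorizer(ngram_range=(1, 2), min_df=2)
X = vec.fit_transform(docs)

# vocabulary_ maps each retained n-gram to its column index
print(sorted(vec.vocabulary_))  # ['cat', 'cat sat', 'sat', 'the', 'the cat']
print(X.shape)                  # (3, 5)
```

Note how the bigrams that occur in only one document ("sat on", "dogs chase", …) are dropped by min_df=2.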

Comments Off on Using the sklearn text.CountVectorizer

Filed under machine learning

Injuries Query on Mapping Data for SmartVA-Analyze 1.1

In [previous post] …

Follow-up tip:
From: Abraham D. Flaxman
Sent: Wednesday, December 30, 2015 5:16 PM
Subject: RE: VA Data Update

Cool, it looks like you are making progress. I again encourage you to do this work incrementally. So make a mapping that just gets the age and sex into the right columns and run that through, and then add in a few questions at a time to make sure things keep changing in a way that makes sense (e.g., when you add the column on chest pain to your mapping, the number of heart attack deaths should increase…).

You had a question about injury coding in your spreadsheet. Here is the coding:
1. Road traffic crash/ injury
2. Fall
3. Drowning
4. Poisoning
5. Bite or sting by venomous animal
6. Burn/Fire
7. Violence (suicide, homicide, abuse)
11. Other injury, specify (__________)
8. Refused to answer
9. Don’t know

If multiple injury causes were endorsed, you may record them as a space-separated list, e.g., “2 3” for a fall that resulted in drowning.

–Abie
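As a sketch, the space-separated coding described in the email could be produced like this (the dictionary keys here are illustrative names, not the questionnaire’s exact wording):

```python
# hypothetical helper: map endorsed injury causes to the questionnaire codes
# listed above; the cause names are illustrative, the codes are from the list
INJURY_CODES = {
    'road traffic': '1', 'fall': '2', 'drowning': '3', 'poisoning': '4',
    'venomous bite': '5', 'burn': '6', 'violence': '7', 'other': '11',
    'refused': '8', "don't know": '9',
}

def encode_injuries(endorsed):
    """Return a space-separated code string for the endorsed causes."""
    return ' '.join(INJURY_CODES[name] for name in endorsed)

print(encode_injuries(['fall', 'drowning']))  # '2 3'
```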

Comments Off on Injuries Query on Mapping Data for SmartVA-Analyze 1.1

Filed under software engineering

Mapping Data for SmartVA-Analyze 1.1

I have just released an updated version of the SmartVA app that predicts the underlying cause of death from the results of verbal autopsy interviews (VAIs). It was a lot of hard work and I hope that people find it useful. You can find the details here: http://www.healthdata.org/verbal-autopsy/tools

There is a major challenge in using this tool (now called SmartVA-Analyze 1.1), however, which is getting the necessary data to feed into it. If you use the ODK form to collect data in just the right format, it is easy. But electronic data collection is not always possible. And there is a fair amount of data out there that has already been collected, but not yet analyzed (which is some of the motivation for creating this tool in the first place).

This post describes the process of mapping existing VAI data into a format that can be used as input to SmartVA-Analyze 1.1. It is a challenging process that requires careful attention to detail. I will demonstrate the basics here, and I hope to provide fuller examples in multiple scripting languages as researchers complete this exercise for themselves.

A short version of the following, with example code, is available on GitHub: https://github.com/aflaxman/SmartVA-Analyze-Mapping-Example

The ODK output of the electronic version of the PHMRC Shortened Questionnaire is a .csv file, such as the following: https://github.com/aflaxman/SmartVA-Analyze-Mapping-Example/blob/master/example_1.csv

But if you have data that was collected with pencil-and-paper and then laboriously digitized, you will need to map it into that format. This Guide for data entry spreadsheet is your Rosetta Stone. SmartVA-Analyze 1.1 expects the input csv file to have a column for every row in that spreadsheet, with column headings matching the entries in the “field name” column.
Mapping Process

I like to use Python with Pandas for this kind of work, but I recommend you use whatever scripting language you are most comfortable with. Whatever you choose, I strongly recommend that you use a script to do this mapping. It will be much easier to debug and reproduce your work than if you do the mapping by hand! (I also recommend that you work incrementally and use a revision control system…) To learn more about the Python/Pandas approach, I recommend the book Python for Data Analysis.

Here is a block of Python code that will create a DataFrame with columns for every field named in the Guide:

import numpy as np, pandas as pd

# load codebook
fname = 'https://github.com/aflaxman/SmartVA-Analyze-Mapping-Example/raw/master/Guide%20for%20data%20entry.xlsx'
cb = pd.read_excel(fname, index_col=2)

df = pd.DataFrame(index=[0], columns=cb.index.unique())

(You can also see this in context in a Jupyter Notebook on GitHub here.)

SmartVA-Analyze 1.1 requires a handful of additional columns that are not in the Guide (they are created automatically by the ODK form): child_3_10, agedays, child_5_7e, child_5_6e, adult_2_9a. Here is a block of Python code that will add these columns to the DataFrame created above:

df['child_3_10'] = np.nan
df['agedays']    = np.nan # see notes though http://wp.me/pk40B-Mm
df['child_5_7e'] = np.nan
df['child_5_6e'] = np.nan
df['adult_2_9a'] = np.nan

If you save this DataFrame as a csv file, it will constitute a minimal example of what is necessary to make SmartVA-Analyze 1.1 run:

fname = 'example_1.csv'
df.to_csv(fname, index=False)

Here is what it looks like when SmartVA-Analyze 1.1 is running:
[screenshot: SmartVA-Analyze 1.1 running on example_1.csv]

The results are rather minimal, and can be found in the “neonate-predictions.csv” file (neonate is the default when no age or age group is specified):
[screenshot: minimal output in neonate-predictions.csv]
Mapping a more substantial dataset, even the following hypothetical example, is an idiosyncratic and time-consuming procedure.

Example (hypothetical) dataset:
[screenshot: hypothetical dataset]
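Since the dataset itself lives in a screenshot, here is an illustrative stand-in; the column names and values are my assumptions, inferred from the mapping code in this post (two injury deaths, three others):

```python
import pandas as pd

# illustrative stand-in for the hypothetical dataset shown in the screenshot;
# column names and values are assumptions, not the actual example data
hypothetical_data = pd.DataFrame({
    'sex':           ['M', 'F', 'M', 'F', 'M'],
    'age':           [35, 62, 48, 70, 19],
    'injury':        ['rti', 'fall', '', '', ''],   # road traffic, fall, none
    'heart_disease': ['Y', 'N', 'Y', 'N', 'N'],
    'chest_pain':    ['Y', 'N', 'Y', '', 'N'],      # '' = no response recorded
})
```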

Python code to map the id, sex, and age:

# recreate the empty DataFrame with one row per death in the hypothetical data
# (the minimal example above had only a single blank row)
df = pd.DataFrame(index=hypothetical_data.index, columns=cb.index.unique())

# set id
df['sid'] = hypothetical_data.index

# set sex
df['gen_5_2'] = hypothetical_data['sex'].map({'M': '1', 'F': '2'})

# set age
df['gen_5_4'] = 1  # units are years
df['gen_5_4a'] = hypothetical_data['age'].astype(int)

This is the simple stuff… to map the injury data, you will need to dig into the paper questionnaire to see how the responses are coded (the Guide spreadsheet includes some codings, but refers you to the paper questionnaire when necessary):

# map injuries to appropriate codes
# suffered injury?
df['adult_5_1'] = hypothetical_data['injury'].map({'rti':'1', 'fall':'1', '':'0'})
# injury type
df['adult_5_2'] = hypothetical_data['injury'].map({'rti':'1', 'fall':'2'})

Mapping more columns proceeds analogously, but I recommend working incrementally: at this point, save the partially mapped data, make sure it runs through the SmartVA-Analyze app, and check that the results make sense. For example, in this case the mapped hypothetical data from the first 2 rows are correctly identified as traffic and fall injury deaths, but the final 3 rows are undetermined (because non-injury signs and symptoms have not yet been mapped).
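That incremental check might look something like this, sketched on toy stand-in values (assumed to match the hypothetical example: two injury deaths, three others):

```python
import pandas as pd

# toy stand-in for the partially mapped DataFrame; values are assumed
df = pd.DataFrame({'adult_5_1': ['1', '1', '0', '0', '0']})

# incremental check: do the mapped counts match what you expect from the raw data?
counts = df['adult_5_1'].value_counts()
print(counts)
assert counts['1'] == 2  # two injury deaths in the hypothetical data

# then save and run it through the app before mapping more columns
df.to_csv('example_2.csv', index=False)
```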

Mapping the additional columns proceeds analogously:

# map heart disease (to column adult_1_1i, see Guide)
df['adult_1_1i'] = hypothetical_data['heart_disease'].map({'Y':'1', 'N':'0'})

# map chest pain (to column adult_2_43, see Guide)
df['adult_2_43'] = hypothetical_data['chest_pain'].map({'Y':'1', 'N':'0', '':'9'})

I hope that this helps… if you’ve read this far, you probably have a hard job ahead of you! Please see the Jupyter Notebook version of this example here, and good luck!

2 Comments

Filed under software engineering

DisMod-MR Book Talk

I gave a talk on the DisMod book and it is online: http://www.healthdata.org/events/seminar/dismod-mr-gbd-study-integrative-systems-modeling-approach-meta-regression-descriptive

Comments Off on DisMod-MR Book Talk

Filed under global health

To read in JAMA:

Flavored Tobacco Product Use Among US Youth Aged 12-17 Years, 2013-2014

http://jama.jamanetwork.com/article.aspx?articleID=2464690

and here is the survey instrument: [oops, didn’t get the link…]

Comments Off on To read in JAMA:

Filed under global health

Fake git manual that could be real

http://git-man-page-generator.lokaltog.net/

Comments Off on Fake git manual that could be real

Filed under software engineering

Reddit asks about IBM Watson

Is IBM Watson just (mostly) marketing? (self.MachineLearning)

https://www.reddit.com/r/MachineLearning/comments/3qmfbz/is_ibm_watson_just_mostly_marketing/

Comments Off on Reddit asks about IBM Watson

Filed under TCS

NIH, ScienceMag, and BoD

This article in ScienceMag caught my attention and then got forwarded to everyone: http://www.sciencemag.org/news/funding/2015/12/nih-drops-special-10-set-aside-aids-research

It looks like GBD stuff, but they never said IHME or GBD. But digging deeper… it is! http://report.nih.gov/info_disease_burden.aspx

Now we can say it is definitely our data even though the Science article doesn’t mention us.

Comments Off on NIH, ScienceMag, and BoD

Filed under global health, science policy

Replicability and reproducibility

Lots of material on Reproducible Research in my backlog… I’m going to get it out there for you (or at least for future-me).

—–Original Message—–
From: Reproducible On Behalf Of Ben Marwick
Sent: Monday, November 2, 2015 9:53 PM
Subject: [Reproducible] Language Log: Replicability vs. reproducibility — or is it the other way around?

A popular academic blog on linguistics just put up a post with a nice discussion of definitions of reproducibility in science:

http://languagelog.ldc.upenn.edu/nll/?p=21956

Comments Off on Replicability and reproducibility

Filed under science policy

Ben Marwick on ‘The Conversation’: How computers broke science – and what we can do to fix it

—–Original Message—–
From: Reproducible On Behalf Of Ben Marwick
Sent: Monday, November 9, 2015 5:58 AM
Subject: [Reproducible] My article on ‘The Conversation’: How computers broke science – and what we can do to fix it

I wrote a short essay on reproducible research and how researchers use computers for a popular media outlet (citing UW eScience):

https://theconversation.com/how-computers-broke-science-and-what-we-can-do-to-fix-it-49938

Please leave a comment at the bottom to help demonstrate to other readers that there really is a movement toward this way of working, and I’m not making it up!

Ben

Comments Off on Ben Marwick on ‘The Conversation’: How computers broke science – and what we can do to fix it

Filed under software engineering