Tag Archives: ipython

Just what I’ve been needing: runipy

https://github.com/paulgb/runipy
It is a perfect fit with my workflow and the current IHME cluster configuration.


Filed under software engineering

Mapping Data for SmartVA-Analyze 1.1

I have just released an updated version of the SmartVA app that predicts the underlying cause of death from the results of verbal autopsy interviews (VAIs). It was a lot of hard work and I hope that people find it useful. You can find the details here: http://www.healthdata.org/verbal-autopsy/tools

There is a major challenge in using this tool (now called SmartVA-Analyze 1.1), however, which is getting the necessary data to feed into it. If you use the ODK form to collect data in just the right format, it is easy. But electronic data collection is not always possible. And there is a fair amount of data out there that has already been collected, but not yet analyzed (which is some of the motivation for creating this tool in the first place).

This blog post describes the process of mapping existing VAI data into a format that can be used as input to SmartVA-Analyze 1.1. It is a challenging process that requires careful attention to detail. I will demonstrate the basics here, and I hope to provide fuller examples in multiple scripting languages as researchers complete this exercise for themselves.

A short version of the following, with example code, is available on GitHub: https://github.com/aflaxman/SmartVA-Analyze-Mapping-Example

The ODK output of the electronic version of the PHMRC Shortened Questionnaire is a .csv file, such as the following: https://github.com/aflaxman/SmartVA-Analyze-Mapping-Example/blob/master/example_1.csv

But if you have data that was collected with pencil-and-paper and then laboriously digitized, you will need to map it into that format. This Guide for data entry spreadsheet is your Rosetta Stone. SmartVA-Analyze 1.1 expects the input csv file to have a column for every row in that spreadsheet, with a column heading matching the entry in the “field name” column.

Mapping Process

I like to use Python with Pandas for doing this kind of work, but I recommend you use whatever scripting language you are most comfortable with. But I strongly recommend that you use a script to do this mapping. It will be much easier to debug and reproduce your work than if you do the mapping by hand! (I also recommend that you work incrementally and use a revision control system…) To learn more about the Python/Pandas approach, I recommend the book Python for Data Analysis.

Here is a block of Python code that will create a DataFrame with columns for every field named in the Guide:

import numpy as np, pandas as pd

# load codebook
fname = 'https://github.com/aflaxman/SmartVA-Analyze-Mapping-Example/raw/master/Guide%20for%20data%20entry.xlsx'
cb = pd.read_excel(fname, index_col=2)

df = pd.DataFrame(index=[0], columns=cb.index.unique())

(You can also see this in context in a Jupyter Notebook on GitHub here.)

SmartVA-Analyze 1.1 requires a handful of additional columns that are not in the Guide (they are created automatically by the ODK form): child_3_10, agedays, child_5_7e, child_5_6e, adult_2_9a. Here is a block of Python code that will add these columns to the DataFrame created above:

df['child_3_10'] = np.nan
df['agedays']    = np.nan # see notes though http://wp.me/pk40B-Mm
df['child_5_7e'] = np.nan
df['child_5_6e'] = np.nan
df['adult_2_9a'] = np.nan

If you save this DataFrame as a csv file, it will constitute a minimal example of what is necessary to make SmartVA-Analyze 1.1 run:

fname = 'example_1.csv'
df.to_csv(fname, index=False)
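
Before loading the file into SmartVA-Analyze 1.1, it may help to confirm that the five extra columns actually made it into the csv. Here is a minimal sketch of such a check. It assumes a stand-in DataFrame with only a handful of Guide fields, and the helper name `missing_columns` is hypothetical (it is not part of SmartVA or Pandas):

```python
import io

import numpy as np
import pandas as pd

# Extra columns named in the text above (created automatically by the ODK form).
EXTRA_COLUMNS = ['child_3_10', 'agedays', 'child_5_7e', 'child_5_6e', 'adult_2_9a']


def missing_columns(frame, required):
    """Return the required column names that are absent from frame."""
    return [col for col in required if col not in frame.columns]


# Stand-in for the DataFrame built from the Guide; a real one has many more fields.
df = pd.DataFrame(index=[0], columns=['sid', 'gen_5_2', 'gen_5_4', 'gen_5_4a'])
before = missing_columns(df, EXTRA_COLUMNS)

for col in EXTRA_COLUMNS:
    df[col] = np.nan

# Round-trip through csv text to check what would actually be written to disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
after = missing_columns(reloaded, EXTRA_COLUMNS)
```

Here `before` lists all five names and `after` is empty, so a non-empty result after the round trip would point at a column that was dropped or misspelled.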

Here is what it looks like when SmartVA-Analyze 1.1 is running:
[Image: running_example_1]

The results are rather minimal, and can be found in the “neonate-predictions.csv” file (because without an age or age group specified, this is the default):
[Image: minimal_output]

Mapping a more substantial dataset, even the following hypothetical example, is an idiosyncratic and time-consuming procedure.

Example (hypothetical) dataset:
[Image: hypothetical_data]
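
The table itself was shown as an image. A stand-in with the shape that the mapping code expects might look like the sketch below. All of the values are invented for illustration. Only the column names (`sex`, `age`, `injury`, `heart_disease`, `chest_pain`) and the pattern of two injury rows followed by three non-injury rows come from the text of this post:

```python
import pandas as pd

# Invented records for illustration only; not the values from the original image.
hypothetical_data = pd.DataFrame({
    'sex': ['M', 'F', 'M', 'F', 'M'],
    'age': [35, 72, 58, 64, 49],
    'injury': ['rti', 'fall', '', '', ''],
    'heart_disease': ['N', 'N', 'Y', 'N', 'Y'],
    'chest_pain': ['N', 'N', 'Y', '', 'Y'],
})
```

Missing answers are stored as empty strings here because the mappings in this post use `''` as a key. If you load your own data with `pd.read_csv`, empty cells become NaN by default; passing `keep_default_na=False` is one way to keep them as empty strings.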

Python code to map the id, sex, and age:

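# give df one row per record (assumption: the one-row df above
# was only needed for the minimal example)
df = df.reindex(hypothetical_data.index)

# set id
df['sid'] = hypothetical_data.index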

# set sex
df['gen_5_2'] = hypothetical_data['sex'].map({'M': '1', 'F': '2'})

# set age
df['gen_5_4'] = 1  # units are years
df['gen_5_4a'] = hypothetical_data['age'].astype(int)

This is the simple stuff… to map the injury data you will need to dig into the paper questionnaire to see how the responses are coded (the Guide spreadsheet includes some codings, but will refer you to the paper questionnaire when necessary):

# map injuries to appropriate codes
# suffered injury?
df['adult_5_1'] = hypothetical_data['injury'].map({'rti':'1', 'fall':'1', '':'0'})
# injury type
df['adult_5_2'] = hypothetical_data['injury'].map({'rti':'1', 'fall':'2'})

Mapping more columns proceeds analogously, but I recommend working incrementally: at this point you should save the partially mapped data, make sure it runs through the SmartVA-Analyze app, and check that the results make some sense. For example, in this case the mapped hypothetical data from the first 2 rows are correctly identified as traffic and fall injury deaths, but the final 3 rows are undetermined (because non-injury signs and symptoms have not yet been mapped).
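
One check that fits this incremental style is to look for responses that a mapping dictionary does not cover, since `Series.map` silently turns them into missing values. Here is a minimal sketch, with a hypothetical helper name (`map_checked`) and an invented column in which `'burn'` is deliberately left out of the mapping:

```python
import pandas as pd


def map_checked(series, mapping):
    """Map series through mapping and also return the values it did not cover."""
    mapped = series.map(mapping)
    # Values that were present in the input but produced NaN in the output.
    unmapped = series[mapped.isna() & series.notna()].unique().tolist()
    return mapped, unmapped


# Invented example column; 'burn' has no code in the mapping below.
injury = pd.Series(['rti', 'fall', '', 'burn', ''])
codes, unmapped = map_checked(injury, {'rti': '1', 'fall': '1', '': '0'})
print(unmapped)  # values that still need a code
```

Anything that shows up in `unmapped` is a response that needs a code from the paper questionnaire before the column is ready.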

Mapping the additional columns proceeds analogously:

# map heart disease (to column adult_1_1i, see Guide)
df['adult_1_1i'] = hypothetical_data['heart_disease'].map({'Y':'1', 'N':'0'})

# map chest pain (to column adult_2_43, see Guide)
df['adult_2_43'] = hypothetical_data['chest_pain'].map({'Y':'1', 'N':'0', '':'9'})

I hope that this helps… if you’ve read this far, you probably have a hard job ahead of you! Please see the Jupyter Notebook version of this example here, and good luck!


Filed under software engineering

Using the sklearn grid_search tools

Scikit-learn has a really nice grid search module. It will soon be called model_selection, because it has grown beyond simple grid search. But here is the spirit of it:

import sklearn.svm, sklearn.grid_search, sklearn.datasets.samples_generator
parameters = {'kernel':('poly', 'rbf'), 'C':[.01, .1, 1, 10, 100]}
clf = sklearn.grid_search.GridSearchCV(
    sklearn.svm.SVC(probability=True),
    parameters,
    n_jobs=64)
X, y = sklearn.datasets.samples_generator.make_classification(n_samples=200, n_features=5, random_state=12345)
clf.fit(X, y)
clf.best_params_

And say you want to take a careful look at the results? They are all in there, too. http://nbviewer.ipython.org/gist/aflaxman/cb0660e602d361d06599
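
In scikit-learn releases where the rename has taken effect, the same search lives under `sklearn.model_selection`, and the per-combination results are exposed through the `cv_results_` attribute. The sketch below assumes a recent scikit-learn. It uses `n_jobs=1` for portability rather than the `n_jobs=64` above:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

parameters = {'kernel': ('poly', 'rbf'), 'C': [.01, .1, 1, 10, 100]}
X, y = make_classification(n_samples=200, n_features=5, random_state=12345)

# n_jobs=1 keeps the sketch portable; raise it on a machine with many cores.
clf = GridSearchCV(SVC(probability=True), parameters, n_jobs=1)
clf.fit(X, y)

# One row per parameter combination, with the cross-validated scores.
results = pd.DataFrame(clf.cv_results_)
summary = results[['param_kernel', 'param_C', 'mean_test_score', 'rank_test_score']]
print(summary.sort_values('rank_test_score'))
```

Putting `cv_results_` into a DataFrame makes it easy to sort, plot, or pivot the scores by kernel and `C`.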


Filed under machine learning, software engineering

Earth Engine in IPython Notebook

This is cool: http://nbviewer.ipython.org/github/tylere/g4g14-ee-python-api/tree/master/

[Image: g-ee-in-ipy-nb]


Filed under global health

Talks in Python: Interactive Instruction with RISE

I had a chance to give a lecture on using Python/Pandas in scientific research this week, and it was __________ (fill this in after it happens…). Since I was talking about Python, I decided to make my talk in Python, too. I did this for a few classes in Winter and Summer quarters of 2013, but the technology has come a long way since then. For this time around, I used RISE aka the live_reveal extension, and I found it very promising, although _very_ “bleeding edge” (which is what happens when the cutting edge is too cutting).

To make it really work as a PowerPoint killer, I think it needs a little more friendliness on the slide layout side of things. I don’t need much, but I would like:
* a talk title slide that has title, name, and date;
* a full-screen image slide;
* a way to put slide titles in a consistent place.

Am I totally power-pointed in my desires? I should file some issues on GitHub.

Other wishes, while it’s on my mind: it would be helpful to start the slideshow from the highlighted cell, it would be convenient if the cell toolbar toggled automatically between Slideshow and None when starting and stopping the presentation display, and it should all be easy, easy, easy to use.


Filed under education

IPython Notebook Clipboard Extension

I was so excited when I got image pasting to work in my IPython Notebook (although now I can’t find any mention of it on Healthy Algorithms…), but then things changed and I didn’t keep up, and it stopped working for me for a while. But then I _needed_ it, and so I figured out how to make it work again:

* upgrade IPython to the latest development version from GitHub – https://github.com/ipython/ipython
* install the chrome_clipboard IPython Notebook extension – https://github.com/ipython-contrib/IPython-notebook-extensions/wiki/chrome_clipboard
* make it work each time, by adding a line to `~/.ipython/profile_[name]/static/custom/custom.js`:

$([IPython.events]).on('app_initialized.NotebookApp', function(){
    require(['nbextensions/chrome_clipboard'], function(module){
        module.load_ipython_extension();
    });
});

So nice to have it back.


Filed under software engineering

I used the IPython Notebook for my lab book for a year. How did it go?

It was exactly a year ago when I firmed up a workflow wherein the IPython Notebook was the center of my daily scientific research. All notes end up in a .ipynb file, and my code, plots, and equations all live together there. Looking back on 2013, how did it go and what should I change for 2014?

I am very happy with it overall. I have 641 .ipynb files, with names like 2013_01_01_EM_4_1_2.ipynb and 2013_12_22a_dm_pde_for_pop_prediction.ipynb. This includes notes for two courses I taught and plan to teach again, for several papers that we published, and for a large number of projects that didn’t pan out. I’ll definitely use the course notes again the next time I teach; I’ve already had to look up the calculations from some of those papers for responses to reviewers and clarifications after publication; and maybe I can come back to the projects that didn’t pan out in the future with some new insight.

What could go better? I couldn’t decide if my lab book should capture everything, like I was taught in science class, or be a curated collection of my work including only the parts I would need in the future. Probably some blend is best, and since it is hard to know the right balance ahead of time, I tried to keep everything in a git repo, so that I could curate and edit, but recover anything that I realized I still wanted after cutting. I only ended up with 59 git commits, though. If that approach were working, I would expect more commits than notebooks.

I sometimes lost things in my stack of notebooks. The .ipynb format is not easy to search, so I kept a .py copy of everything and grepped through them looking for the notebooks about a specific technique or project. Since I organized my notebooks chronologically, I ended up doing this a lot more than if I had organized them thematically, but even if I already had all of my congenital heart disease notes in one place, I would still find myself saying, “I know I did some data munging like this for a different project recently, how does the pandas.melt function work again?”, or whatever.
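
An alternative to keeping a .py copy is to search the notebook files directly, since an .ipynb file is JSON underneath. Here is a minimal sketch using only the standard library. It assumes the nbformat 4 layout (a top-level `cells` list), so notebooks saved in older formats lay out their cells differently and would not be matched. The function name `notebooks_mentioning` and the demo file names are hypothetical:

```python
import json
import pathlib
import tempfile


def notebooks_mentioning(directory, needle):
    """Return names of .ipynb files in directory whose cell sources contain needle."""
    hits = []
    for path in sorted(pathlib.Path(directory).glob('*.ipynb')):
        nb = json.loads(path.read_text(encoding='utf-8'))
        for cell in nb.get('cells', []):
            source = cell.get('source', '')
            # nbformat 4 stores source as a string or a list of lines.
            text = source if isinstance(source, str) else ''.join(source)
            if needle in text:
                hits.append(path.name)
                break
    return hits


# Tiny demonstration with two made-up notebooks in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name, line in [('2013_01_01_a.ipynb', 'pd.melt(df)'), ('2013_01_02_b.ipynb', 'x = 1')]:
        nb = {'cells': [{'cell_type': 'code', 'source': [line]}], 'nbformat': 4}
        (pathlib.Path(tmp) / name).write_text(json.dumps(nb), encoding='utf-8')
    found = notebooks_mentioning(tmp, 'melt')
```

Because it searches markdown cells as well as code cells, this also finds notes written in prose, which a grep over .py exports may or may not cover depending on how the export was made.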

The feature I would like the most is a way to paste images into my notebook. I wrote some notes about it in a github issue page about IPython Notebook feature requests. I want the digital equivalent of stapling a copy into my lab book, and I want it to be easy.

Collaboration worked pretty well. I have a lot of colleagues who don’t want to see Python code, no matter how much easier it would make their lives. I’ve had good success sending them PDF versions of notebooks, or sticking my research notes in a GitHub gist and sending them a link to nbviewer. I think there is room for improvement in this, too, though.


Filed under global health