Tag Archives: python

August 24, 2017 · 12:21 pm

Introducing Vivarium

I’ve had a new line of research developing for the last 18 months or so—*microsimulation*. It started when I stepped in to help with the “Cost Effectiveness Analysis with Microsimulation” (or CEAM) project at IHME. Now it is growing and growing to take over all of my research and recreation time. Is that bad or good?

Some of this work has now seen daylight in our presentations at SummerSim and iHEA in July, and today I am pleased to introduce a Python package that you can use, too.

The programmers I’ve been working with on this convinced me that it is not just for cost effectiveness analysis and we need a more expansive name for it. So I present to you: vivarium. https://github.com/ihmeuw/vivarium

Filed under Uncategorized

Tagged as microsimulation, python

February 10, 2017 · 8:00 am

Infographics in Python: Plot a Noun Project Icon on a Matplotlib Chart

I had to put an icon on a chart in Python last week, and I couldn’t find a good brief blog about how to do it. Here is what I cobbled together:

1. Find a free, appropriate image from The Noun Project.
2. Load it into Python with plt.imread
3. Draw it in the proper place on a figure with plt.imshow and some cryptic, hacky options.

Looks good, right?

See this all in action here: https://gist.github.com/aflaxman/c171050384471636e8f23f322ba7e9c5
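For a rough idea of what those three steps look like in code, here is a minimal sketch. It is not the code from the gist: the icon is a synthetic RGBA array standing in for the result of `plt.imread` on a downloaded PNG, and the line data, extent values, and output file name are made up for illustration.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt

# Stand-in for `icon = plt.imread("icon.png")`: a 32x32 RGBA image of a black disc
yy, xx = np.mgrid[-1:1:32j, -1:1:32j]
icon = np.zeros((32, 32, 4))
icon[..., 3] = (xx**2 + yy**2 < 1).astype(float)  # opaque inside the disc, transparent outside

fig, ax = plt.subplots()
ax.plot([0, 1, 2, 3], [1, 3, 2, 4], marker="o")

# extent is (left, right, bottom, top) in data coordinates;
# aspect="auto" keeps imshow from forcing equal axis scaling on the chart
img = ax.imshow(icon, extent=(2.5, 3.0, 1.0, 1.5), aspect="auto", zorder=3)

# imshow can change the axis limits to fit the image, so set them explicitly afterwards
ax.set_xlim(-0.2, 3.2)
ax.set_ylim(0.5, 4.5)

fig.savefig("icon_chart.png")
```

The `extent` and `aspect` arguments are probably the "cryptic" part: together they pin the image to a rectangle in data coordinates instead of letting it take over the whole axes.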

Filed under dataviz

Tagged as matplotlib, python

January 27, 2017 · 8:00 am

So cool–nbtutor

The first release of nbtutor (“Visualize Python code execution (line-by-line) in Jupyter Notebook cells.”) is available on PyPI:

pip install nbtutor
jupyter nbextension install --sys-prefix --overwrite --py nbtutor
jupyter nbextension enable --sys-prefix --py nbtutor

https://github.com/lgpage/nbtutor

Filed under education

Tagged as python

January 6, 2017 · 8:00 am

dfply package

Potentially of interest, although I’ve done enough d3js to think that .select .head is fine notation:

dfply Version: 0.2.4

GitHub – kieferk from November 28, 2016
“The dfply package makes it possible to do R’s dplyr-style data manipulation with pipes in python on pandas DataFrames.”
https://github.com/kieferk/dfply

from dfply import *

diamonds >> select(X.carat, X.cut) >> head(3)

   carat      cut
0   0.23    Ideal
1   0.21  Premium
2   0.23     Good
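For comparison, the same selection in plain pandas method notation might look like the sketch below. The tiny `diamonds` frame is a stand-in built from the three rows printed above, not the real dataset, and the `price` values are made up just so there is a column to leave out.

```python
import pandas as pd

# stand-in for the diamonds dataset: only the three rows shown in the output above
diamonds = pd.DataFrame({
    'carat': [0.23, 0.21, 0.23],
    'cut': ['Ideal', 'Premium', 'Good'],
    'price': [100, 200, 300],  # made-up values, only here to be dropped by the selection
})

# plain pandas equivalent of: diamonds >> select(X.carat, X.cut) >> head(3)
result = diamonds[['carat', 'cut']].head(3)
print(result)
```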

Filed under software engineering

Tagged as pandas, python

December 19, 2016 · 8:00 am

py.test recipes for slowness

Useful material on how to deal with slow tests in py.test, a bit buried in the docs:

From http://doc.pytest.org/en/latest/usage.html, to get a list of the slowest 10 test durations:

pytest --durations=10

From http://doc.pytest.org/en/latest/example/simple.html, to skip slow tests unless they are requested:

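# content of conftest.py
# (pytest.config was removed in pytest 5.0, so the skip is applied at collection time)

import pytest


def pytest_addoption(parser):
    parser.addoption("--runslow", action="store_true", default=False,
        help="run slow tests")


def pytest_configure(config):
    # register the marker so pytest does not warn about an unknown mark
    config.addinivalue_line("markers", "slow: mark test as slow to run")


def pytest_collection_modifyitems(config, items):
    if config.getoption("--runslow"):
        return  # --runslow given on the command line: do not skip slow tests
    skip_slow = pytest.mark.skip(reason="need --runslow option to run")
    for item in items:
        if "slow" in item.keywords:
            item.add_marker(skip_slow)

# content of test_module.py
import pytest


def test_func_fast():
    pass


@pytest.mark.slow
def test_func_slow():
    pass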

Very convenient to know.

Filed under software engineering

Tagged as python

March 16, 2016 · 8:00 am

Delta Time in Python: Simple calendar times with Pandas

Here is something that Google did not help with as quickly as I would have expected: how do I convert start and stop times into the time between events in seconds (or minutes)?

Or for the busy searcher “how do I convert Pandas Timedelta to seconds”?

The classy answer is:

start_time = df.interviewstarttime.map(pd.Timestamp)
end_time = df.interviewendtime.map(pd.Timestamp)

((end_time-start_time) / pd.Timedelta(minutes=1)).describe()

I found it hidden away here: http://www.datasciencebytes.com/bytes/2015/05/16/pandas-timedelta-histograms-unit-conversion-and-overflow-danger/
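The snippet above reports minutes. For the "seconds" version of the question, here is a small self-contained sketch. The two-row DataFrame is made up, and it uses `pd.to_datetime` rather than mapping `pd.Timestamp`, which should give the same result on well-formed timestamps.

```python
import pandas as pd

# made-up interview times, reusing the column names from the snippet above
df = pd.DataFrame({
    'interviewstarttime': ['2016-03-01 09:00:00', '2016-03-01 10:15:00'],
    'interviewendtime': ['2016-03-01 09:42:30', '2016-03-01 11:00:00'],
})

start_time = pd.to_datetime(df['interviewstarttime'])
end_time = pd.to_datetime(df['interviewendtime'])
duration = end_time - start_time  # a Series of Timedelta values

# dividing by a unit Timedelta gives plain floats in that unit
minutes = duration / pd.Timedelta(minutes=1)

# the .dt accessor offers the seconds conversion directly
seconds = duration.dt.total_seconds()

print(minutes.tolist())
print(seconds.tolist())
```

Dividing by `pd.Timedelta(seconds=1)` would work just as well as `.dt.total_seconds()`; which one reads better is a matter of taste.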

Filed under statistics

Tagged as pandas, python

March 15, 2016 · 8:00 am

I wish I had this Python video sooner

Video recommendation: Stop Writing Classes

Filed under software engineering

Tagged as oop, python

February 1, 2016 · 8:00 am

Git says this is binary and it is not

I had an annoying little issue, where git was saying my file was binary. What do I care what git thinks? Well, I care if it refuses to show me my diff:

[abie@cluster-dev TICS]$ git diff
diff --git a/etl.py b/etl.py
index 3b5b4ca..2cb591e 100644
Binary files a/etl.py and b/etl.py differ

Google and Stack Overflow usually solve any problem I have like this, but today they under-delivered. They did give me a good hint: there must be some funny character in my .py file. That can happen when a 1.5-year-old is helping with the typing.

Here is a quick fix, in case I (or you) ever find ourselves in this situation again:

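import unidecode

# Python 3 version (file() only exists in Python 2); assumes etl.py decodes as UTF-8
with open('etl.py', encoding='utf-8') as f:
    text = f.read()

# overwrite the file with an ASCII-only transliteration
with open('etl.py', 'w', encoding='utf-8') as f2:
    f2.write(unidecode.unidecode(text))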

All better. Thanks again unidecode.
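If you would rather see the funny characters before rewriting the file, a sketch like the following can point to them. It uses only the standard library. `find_suspect_bytes` is a hypothetical helper, and "suspect" here just means a NUL byte or anything outside 7-bit ASCII, which is an assumption about what trips git up rather than a statement of its exact rule.

```python
def find_suspect_bytes(data: bytes):
    """Return (line, column, byte value) for every NUL or non-ASCII byte in data."""
    hits = []
    for line_no, line in enumerate(data.split(b"\n"), start=1):
        for col, byte in enumerate(line, start=1):
            if byte == 0 or byte > 127:
                hits.append((line_no, col, byte))
    return hits


# made-up file contents: a stray NUL on line 2 and a UTF-8 encoded e-acute on line 3
sample = b"import os\nx = 1\x00\nname = 'caf\xc3\xa9'\n"
hits = find_suspect_bytes(sample)
print(hits)
```

For a real file, read it in binary mode first, for example `find_suspect_bytes(open('etl.py', 'rb').read())`.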

Filed under software engineering

Tagged as git, python

January 27, 2016 · 8:00 am

Mapping Data for SmartVA-Analyze 1.1

I have just released an updated version of the SmartVA app that predicts the underlying cause of death from the results of verbal autopsy interviews (VAIs). It was a lot of hard work and I hope that people find it useful. You can find the details here: http://www.healthdata.org/verbal-autopsy/tools

There is a major challenge in using this tool (now called SmartVA-Analyze 1.1), however, which is getting the necessary data to feed into it. If you use the ODK form to collect data in just the right format, it is easy. But electronic data collection is not always possible. And there is a fair amount of data out there that has already been collected, but not yet analyzed (which is some of the motivation for creating this tool in the first place).

This post describes the process of mapping existing VAI data into a format that can be used as input to SmartVA-Analyze 1.1. It is a challenging process that requires careful attention to detail. I will demonstrate the basics here, and I hope to provide fuller examples in multiple scripting languages as researchers complete this exercise for themselves.

A short version of the following, with example code, is available on GitHub: https://github.com/aflaxman/SmartVA-Analyze-Mapping-Example

The ODK output of the electronic version of the PHMRC Shortened Questionnaire is a .csv file, such as the following: https://github.com/aflaxman/SmartVA-Analyze-Mapping-Example/blob/master/example_1.csv

But if you have data that was collected with pencil and paper and then laboriously digitized, you will need to map it into that format. This Guide for data entry spreadsheet is your Rosetta Stone. SmartVA-Analyze 1.1 expects the input csv file to have a column for every row in that spreadsheet, with each column heading matching the entry in the “field name” column.

I like to use Python with Pandas for this kind of work, but you should use whatever scripting language you are most comfortable with. Whatever you choose, I strongly recommend that you use a script to do this mapping. It will be much easier to debug and reproduce your work than if you do the mapping by hand! (I also recommend that you work incrementally and use a revision control system…) To learn more about the Python/Pandas approach, I recommend the book Python for Data Analysis.

Here is a block of Python code that will create a DataFrame with columns for every field named in the Guide:

import numpy as np, pandas as pd

# load codebook
fname = 'https://github.com/aflaxman/SmartVA-Analyze-Mapping-Example/raw/master/Guide%20for%20data%20entry.xlsx'
cb = pd.read_excel(fname, index_col=2)

df = pd.DataFrame(index=[0], columns=cb.index.unique())

(You can also see this in context in a Jupyter Notebook on GitHub here.)

SmartVA-Analyze 1.1 requires a handful of additional columns that are not in the Guide (they are created automatically by the ODK form): child_3_10, agedays, child_5_7e, child_5_6e, adult_2_9a. Here is a block of Python code that will add these columns to the DataFrame created above:

df['child_3_10'] = np.nan
df['agedays']    = np.nan # see notes though http://wp.me/pk40B-Mm
df['child_5_7e'] = np.nan
df['child_5_6e'] = np.nan
df['adult_2_9a'] = np.nan

If you save this DataFrame as a csv file, it will constitute a minimal example of what is necessary to make SmartVA-Analyze 1.1 run:

fname = 'example_1.csv'
df.to_csv(fname, index=False)
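Because the app is picky about column headings, a quick header check on the saved file can catch omissions before you launch it. This is only a sketch using the standard csv module: `missing_columns` is a hypothetical helper, and the short in-memory CSV stands in for a real file such as example_1.csv.

```python
import csv
import io

# the five extra columns named above, which are not in the Guide
EXTRA_COLUMNS = ['child_3_10', 'agedays', 'child_5_7e', 'child_5_6e', 'adult_2_9a']


def missing_columns(csv_file, required):
    """Return the required column names that are absent from the CSV header row."""
    header = next(csv.reader(csv_file))
    return [name for name in required if name not in header]


# stand-in for an open csv file: this header lacks two of the extra columns
sample = io.StringIO('sid,gen_5_2,child_3_10,agedays,child_5_7e\n,,,,\n')
missing = missing_columns(sample, EXTRA_COLUMNS)
print(missing)
```

With a real file, something like `with open('example_1.csv', newline='') as f: print(missing_columns(f, EXTRA_COLUMNS))` should print an empty list once every required column is present. The same helper could be given the list of field names from the Guide.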

Here is what it looks like when SmartVA-Analyze 1.1 is running:

The results are rather minimal, and can be found in the “neonate-predictions.csv” file (because without an age or age group specified, this is the default):

Mapping a more substantial dataset, even the following hypothetical example, is an idiosyncratic and time-consuming procedure.

Example (hypothetical) dataset:

Python code to map the id, sex, and age:

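# give df one row per record (the template built above has only a single row)
df = df.reindex(hypothetical_data.index)

# set id
df['sid'] = hypothetical_data.index

# set sex
df['gen_5_2'] = hypothetical_data['sex'].map({'M': '1', 'F': '2'})

# set age
df['gen_5_4'] = 1  # units are years
df['gen_5_4a'] = hypothetical_data['age'].astype(int)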

This is the simple stuff… to map the injury data you will need to dig into the paper questionnaire to see how the responses are coded (the Guide spreadsheet includes some codings, but will refer you to the paper questionnaire when necessary):

# map injuries to appropriate codes
# suffered injury?
df['adult_5_1'] = hypothetical_data['injury'].map({'rti':'1', 'fall':'1', '':'0'})
# injury type
df['adult_5_2'] = hypothetical_data['injury'].map({'rti':'1', 'fall':'2'})

Mapping more columns proceeds analogously, but I recommend working incrementally. At this point you should save the partially mapped data, make sure it runs through the SmartVA-Analyze app, and make sure that the results make some sense. For example, in this case the mapped hypothetical data from the first 2 rows are correctly identified as traffic and fall injury deaths, but the final 3 rows are undetermined (because non-injury signs and symptoms have not yet been mapped).
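One cheap check to run before each trip through the app: when `Series.map` is given a dictionary, any value that is not one of its keys silently becomes NaN. The sketch below lists the source values that a mapping did not cover. The `injury` values here are made up for illustration, including a `burn` entry that the dictionary deliberately omits.

```python
import pandas as pd

# made-up source data; 'burn' and the empty strings are not keys in the mapping below
hypothetical_data = pd.DataFrame({'injury': ['rti', 'fall', '', '', 'burn']})

codes = {'rti': '1', 'fall': '2'}
mapped = hypothetical_data['injury'].map(codes)

# source values that the dictionary did not cover, with how often each occurs
unmapped = hypothetical_data.loc[mapped.isna(), 'injury'].value_counts()
print(unmapped)
```

An empty result means every value was mapped. Anything listed is either a coding you still need to look up in the paper questionnaire or a value you have decided to leave blank on purpose.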

Mapping the additional columns proceeds analogously:

# map heart disease (to column adult_1_1i, see Guide)
df['adult_1_1i'] = hypothetical_data['heart_disease'].map({'Y':'1', 'N':'0'})

# map chest pain (to column adult_2_43, see Guide)
df['adult_2_43'] = hypothetical_data['chest_pain'].map({'Y':'1', 'N':'0', '':'9'})

I hope that this helps… if you’ve read this far, you probably have a hard job ahead of you! Please see the Jupyter Notebook version of this example here, and good luck!

Filed under software engineering

Tagged as ipython, python, reproducible research, va

January 6, 2016 · 8:00 am

Using the sklearn grid_search tools

Scikit-learn has a really nice grid search module. Newer releases call it model_selection, because it has grown beyond simple grid search (the old grid_search module has since been removed). But here is the spirit of it:

import sklearn.svm, sklearn.model_selection, sklearn.datasets
parameters = {'kernel':('poly', 'rbf'), 'C':[.01, .1, 1, 10, 100]}
clf = sklearn.model_selection.GridSearchCV(
    sklearn.svm.SVC(probability=True),
    parameters,
    n_jobs=64)  # number of parallel workers; n_jobs=-1 uses every available core
X, y = sklearn.datasets.make_classification(n_samples=200, n_features=5, random_state=12345)
clf.fit(X, y)
clf.best_params_

And say you want to take a careful look at the results? They are all in there, too. http://nbviewer.ipython.org/gist/aflaxman/cb0660e602d361d06599
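As a rough sketch of what that careful look can be (not the code from the linked notebook), the fitted search object keeps per-setting scores in its `cv_results_` attribute, which loads straight into a pandas DataFrame. This assumes a scikit-learn version that has `sklearn.model_selection`. The smaller grid and `cv=3` are choices made here only to keep the example quick.

```python
import pandas as pd
import sklearn.datasets, sklearn.model_selection, sklearn.svm

X, y = sklearn.datasets.make_classification(
    n_samples=200, n_features=5, random_state=12345)

# a smaller grid than above, to keep the example fast
parameters = {'kernel': ('poly', 'rbf'), 'C': [.1, 1, 10]}
clf = sklearn.model_selection.GridSearchCV(sklearn.svm.SVC(), parameters, cv=3)
clf.fit(X, y)

# cv_results_ is a dict of arrays, one entry per parameter combination
results = pd.DataFrame(clf.cv_results_)
columns = ['param_kernel', 'param_C', 'mean_test_score',
           'std_test_score', 'rank_test_score']
summary = results[columns].sort_values('rank_test_score')
print(summary)
```

Each row is one combination of kernel and C, so sorting by `rank_test_score` puts the setting reported in `clf.best_params_` at the top.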

Filed under machine learning, software engineering

Tagged as ipython, python, sklearn
