Tag Archives: pymc

Applied Approximate Counting: Malaria

My first first-authored global health paper came out today (I consider it my first “first-authored” paper ever, since the mathematicians I’ve worked with deviantly list authorship in alphabetical order regardless of seniority and contribution). It’s a bit of a mouthful by title: Rapid Scaling Up of Insecticide-Treated Bed Net Coverage in Africa and Its Relationship with Development Assistance for Health: A Systematic Synthesis of Supply, Distribution, and Household Survey Data.

What I find really pleasing about this research paper is the way it continues research I worked on in graduate school, but in a completely different and unexpected direction. Approximate counting is something that my advisor specialized in, and he won a big award for the random polynomial time algorithm for approximating the volume of convex bodies. I followed in his footsteps when I was a student, and I’m still doing approximate counting, it’s just that now, instead of approximating the amount of high-dimensional sand that will fit in an oddly shaped high-dimensional box, I’ve been approximating the number of insecticide-treated bednets that have made it from manufacturers through the distribution supply-chain and into the households of malaria-endemic regions of the world. I’m even using the same technique, Markov-chain Monte Carlo.

I’ve been itching to write about the computational details of this research for a while, and now that the paper’s out, I will have my chance. But for today, I refer you to the PLoS Med paper, and the technical appendix, and the PyMC code on github.

4 Comments

Filed under global health, MCMC

la.larry for classy number crunching in Python

I’ve got some of my colleagues interested in using Python to cruch their numbers, and that’s great, because I’m sure it will make their lives much better in the long run. It’s also great because they are pushing me to figure out the easy way to do things, since they are used to certain conveniences from their past work with R and STATA.

For example, Kyle wanted to know does Python make it easy to deal with the tables of data he is faced with every day?

It was by showing my ignorance in a blog post a year ago that I learned about handing csv files with the csv.DictReader class. And this has been convenient enough for me, but it doesn’t let Kyle do the things the way he is used to, for example using data[[‘iso3’, ‘gdp’], :] to select these two important columns from a large table.

Enter larry, the labeled array. Maybe someone will tell me an even cleaner way to deal with these tables, but after a quick easy_install la, I was able to load up some data from the National Cancer Institute (NCI) as follows:

import csv
import la
from pylab import *

csv_file = csv.reader(open('accstate5.txt'))
columns = csv_file.next()[1:]
rows = []
data = []

for d in csv_file:
    rows.append(d[0])
    data.append(d[1:])

T = la.larry(data, [rows, columns])

That’s pretty good, but the data is not properly coded as floating point…

T = la.larry(data, [rows, columns]).astype(float)

Now, T has the whole table, indexed by state and by strange NCI code

T.lix[['WA', 'OR', 'ID', 'MO'], ['RBM9094']]

Perfect, besides being slightly ugly. But is there an easier way? the la documentation mentions a fromcsv method. Unfortunately la.larry.fromcsv doesn’t like these short-format tables.

la.larry.fromcsv('accstate5.txt')  # produces a strange error

Now, if you read the NCI data download info page, where I retrieved this data from, you will learn what each cryptic column heading means:

Column heading format:   [V] [RG] [T] [A], where

V = variable: R, C, LB, or UB
RG = race / gender: BM, BF, WM, WF (B = black, W = white, F = female, M = male)
T = calendar time: 5094, 5069, 7094, and the 9 5-year periods 5054 through 9094
A = age group: blank (all ages), 019, 2049, 5074, 75+

Example:
RBM5054 = rate for black males for the time period 1950-1954

That means I’m not really dealing with a 2-dimensional table after all, and maybe I’d be better off organizing things the way they really
are:

data_dict_list = [d for d in csv.DictReader(open('accstate5.txt'))]
data_dict = {}
races = ['B', 'W']
sexes = ['M', 'F']
times = range(50,95,5)
for d in data_dict_list:
    for race in races:
        for sex in sexes:
            for time in times:
                key = 'R%s%s%d%d' % (race, sex, time, time+4)
                val = float(d.get(key, nan)) # larry uses nan to denote missingdata
                data_dict[(d['STATE'], race, sex, '19%s'%time)] = val

R = la.larry.fromdict(data_dict)

In the original 2-d labeled array we can compare cancer rates thusly:

states = ['OR', 'WA', 'ID', 'MT']
times = arange(50,94,5)
selected_cols = ['RWM%d%d' % (t, t+4) for t in times]
T_selected = T.lix[states, selected_cols]
plot(1900+times, T_selected.getx().T, 's-',
     linewidth=3, markersize=15, alpha=.75)
legend(T_selected.getlabel(0), loc='lower right', numpoints=3, markerscale=.5)
ylabel('Mortality rate (per 100,000 person-years)')
xlabel('Time (years)')
title('Mortality rate for all cancers over time')
axis([1945, 1995, 0, 230])

It is slightly cooler to do the same plot with the multi-dimensional array

clf()
subplot(2,1,1)

times = [str(t) for t in range(1950, 1994, 5)]
states = ['OR', 'WA', 'ID', 'MT']
colors = ['c', 'b', 'r', 'g']
for state, color in zip(states, colors):
    plot(R.getlabel(3), R.lix[[state], ['W'], ['M'], :].getx(), 's-', color=color,
         linewidth=3, markersize=15, alpha=.75)
legend(T_selected.getlabel(0), loc='lower right', numpoints=3, markerscale=.5)
ylabel('Mortality rate (per 100,000 person-years)')
xlabel('Time (years)')
title('Mortality rate for all cancers in white males over time')
axis([1945, 1995, 0, 230])

That should be enough for me to remember how to use larry when I come back to it in the future. For my final trick, let’s explore how to use larrys to add a column to a 2-dimensional table. (There is some important extension of this to multidimensional tables, but it is confusing me a lot right now.)

states = T.getlabel(0)
X = la.larry(randn(len(states), 1), [states, ['new covariate']])
T = T.merge(X)

states = ['OR', 'WA', 'ID', 'MT']
T.lix[states, ['RWM9094', 'new covariate']]

Ok, encore… how well does larry play with PyMC? Let’s use a Gaussian Process model to find out.

from pymc import *

M = gp.Mean(lambda x, mu=R.mean(): mu * ones(len(x)))
C = gp.Covariance(gp.matern.euclidean, diff_degree=2., amp=1., scale=1.)
mesh = array([[i,j,k,.5*l] for i in range(52) for j in range(2) for k in range(2) for l in arange(9)])

obs_mesh = array([[i,j,k,.5*l] for i in range(52) for j in range(2) for k in range(2) for l in range(9) if not isnan(R[i,j,k,l])])
obs_vals = array([R[i,j,k,l] for i in range(52) for j in range(2) for k in range(2) for l in range(9) if not isnan(R[i,j,k,l])])
obs_V = array([1. for i in range(52) for j in range(2) for k in range(2) for l in range(9) if not isnan(R[i,j,k,l])])

gp.observe(M, C, obs_mesh, obs_vals, obs_V)
imputed_vals = M(mesh)
imputed_std = sqrt(abs(C(mesh)))

Ri = la.larry(imputed_vals.reshape(R.shape), R.label)

subplot(2,1,2)
for state, color in zip(states, colors):
    t = [float(l) for l in Ri.getlabel(3)]
    plot(t, R.lix[[state], ['B'], ['M'], :].getx(), '^', color=color, linewidth=3, markersize=15, alpha=.75)
    plot(t, Ri.lix[[state], ['B'], ['M'], :].getx(), '-', color=color, linewidth=2, alpha=.75)

ylabel('Mortality rate (per 100,000 person-years)')
xlabel('Time (years)')
title('Mortality rate for all cancers in black males over time')
axis([1945, 1995, 0, 630])

5 Comments

Filed under software engineering

Multilevel (hierarchical) modeling: what it can and cannot do in Python


I re-read a short paper of Andrew Gelman’s yesterday about multilevel modeling, and thought “That would make a nice example for PyMC”.  The paper is “Multilevel (hierarchical) modeling: what it can and cannot do, and R code for it is on his website.

To make things even easier for a casual blogger like myself, the example from the paper is extended in the “ARM book”, and Whit Armstrong has already implemented several variants from this book in PyMC. Continue reading

17 Comments

Filed under MCMC, statistics

MCMC in Python: PyMC for Bayesian Model Selection

(Updated 9/2/2009, but still unfinished; see other’s work on this that I’ve collected)

I never took a statistics class, so I only know the kind of statistics you learn on the street. But now that I’m in global health research, I’ve been doing a lot of on-the-job learning. This post is about something I’ve been reading about recently, how to decide if a simple statistical model is sufficient or if the data demands a more complicated one. To keep the matter concrete (and controversial) I’ll focus on a claim from a recent paper in Nature that my colleague, Haidong Wang, choose for our IHME journal club last week: Advances in development reverse fertility declines. The title of this short letter boldly claims a causal link between total fertility rate (an instantaneous measure of how many babies a population is making) and the human development index (a composite measure of how “developed” a country is, on a scale of 0 to 1). Exhibit A in their case is the following figure:

An astute observer of this chart might ask, “what’s up with the scales on those axes?” But this post is not about the visual display of quantitative information. It is about deciding if the data has a piecewise linear relationship that Myrskyla et al claim, and doing it in a Bayesian framework with Python and PyMC. But let’s start with a figure where the axes have a familiar linear scale! Continue reading

30 Comments

Filed under MCMC, statistics

Numbers from World TB Day + Maps of Malaria

Twice as many people were diagnosed with both HIV and tuberculosis in 2007 than were in 2006. (Science Mag, BMJ)

Quote about global health data in the article that is quite consistent with what I’ve seen comes from Richard Chaisson:

“They’re working with the best stuff they have, and the best stuff they have is not good.”

PLoS Med today has an article with some beautiful maps, co-authored by PyMC super-hacker Anand Patil. A World Malaria Map: Plasmodium falciparum Endemicity in 2007.

And, to make it a triple-crown news day for infectious disease, the Pope claims that condoms exacerbate HIV and AIDS problem. (I guess this was the big news a week ago, but it just crossed my desk today.)

2 Comments

Filed under global health

MCMC in Python: PyMC for Bayesian Probability

I’ve got an urge to write another introductory tutorial for the Python MCMC package PyMC.  This time, I say enough to the comfortable realm of Markov Chains for their own sake.  In this tutorial, I’ll test the waters of Bayesian probability.

Darwin's other Bulldog

Tom Bayes

Now, what better problem to stick my toe in than the one that inspired Reverend Thomas in the first place? Let’s talk about sex ratio. This is also convenient, because I can crib from Bayesian Data Analysis, that book Rif recommended me a month ago.

Bayes started this enterprise off with a question that has inspired many an evolutionary biologist: are girl children as likely as boy children? Or are they more likely or less likely? Laplace wondered this also, and in his time and place (from 1745 to 1770 in Paris) there were birth records of 241,945 girls and 251,527 boys. In the USA in 2005, the vital registration system recorded 2,118,982 male and 2,019,367 female live births [1]. I’ll set up a Bayesian model of this, and ask PyMC if the sex ratio could really be 1.0.

Continue reading

8 Comments

Filed under MCMC, probability

MCMC in Python: PyMC to sample uniformly from a convex body

This post is a little tutorial on how to use PyMC to sample points uniformly at random from a convex body.  This computational challenge says: if you have a magic box which will tell you yes/no when you ask, “Is this point (in n-dimensions) in the convex set S”, can you come up with a random point which is nearly uniformly distributed over S?

MCMC has been the main approach to solving this problem, and it has been a great success for the polynomial-time dogma, starting with the work of Dyer, Frieze, and Kannan which established the running-time upper bound of \mathcal{O}\left(n^{23}(\log n)^5\right).  The idea is this: you start with some point in S and try to move to a new, nearby point randomly.  “Randomly how?”, you wonder.  That is the art. Continue reading

9 Comments

Filed under MCMC, probability

MCMC: Running a chain, making it look easy

As I was saying in my last post, I’ve been getting interested in actually running Markov Chain Monte Carlo algorithms, instead of trying to prove things about their asymptotic performance. It seems like the “stats” way to do this is to use R and WinBUGS, but I’ve always thought that R programming looks messy. Python is easier on my eyes.

So, it’s my good fortune that PyMC exists. This means I don’t need to do any hard work, like coding a Gibbs sampler or learning R. Let me show you.

Continue reading

4 Comments

Filed under MCMC