Monthly Archives: June 2010

la.larry for classy number crunching in Python

I’ve got some of my colleagues interested in using Python to crunch their numbers, and that’s great, because I’m sure it will make their lives much better in the long run. It’s also great because they are pushing me to figure out the easy way to do things, since they are used to certain conveniences from their past work with R and Stata.

For example, Kyle wanted to know: does Python make it easy to deal with the tables of data he is faced with every day?

It was by showing my ignorance in a blog post a year ago that I learned about handling csv files with the csv.DictReader class. That has been convenient enough for me, but it doesn’t let Kyle do things the way he is used to, for example using data[['iso3', 'gdp'], :] to select two important columns from a large table.
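For comparison, here is roughly what that looks like with csv.DictReader; the file name and the iso3/gdp columns are made-up stand-ins borrowed from the example above:

import csv

# a hypothetical table with 'iso3' and 'gdp' columns, just for illustration
data = [d for d in csv.DictReader(open('gdp_table.csv'))]

# selecting two columns means looping over rows, not slicing by label
iso3_gdp = [(d['iso3'], d['gdp']) for d in data]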

Enter larry, the labeled array. Maybe someone will tell me an even cleaner way to deal with these tables, but after a quick easy_install la, I was able to load up some data from the National Cancer Institute (NCI) as follows:

import csv
import la
from pylab import *

# the first row holds the column headings; the first entry ('STATE') labels the rows
csv_file = csv.reader(open('accstate5.txt'))
columns = csv_file.next()[1:]
rows = []
data = []

# each remaining row: first entry is the state abbreviation, the rest are data values
for d in csv_file:
    rows.append(d[0])
    data.append(d[1:])

T = la.larry(data, [rows, columns])

That’s pretty good, but the data is not properly coded as floating point…

T = la.larry(data, [rows, columns]).astype(float)

Now, T has the whole table, indexed by state and by strange NCI code:

T.lix[['WA', 'OR', 'ID', 'MO'], ['RBM9094']]
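For my own future reference, the same lix pattern handles a few other selections too (these are my own made-up examples, not anything from the NCI data):

T.lix[['WA'], :]                    # every NCI code for Washington
T.lix[['WA', 'OR'], ['RWM9094']]    # white-male 1990-94 rate for two states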

Perfect, besides being slightly ugly. But is there an easier way? The la documentation mentions a fromcsv method. Unfortunately, la.larry.fromcsv doesn’t like these short-format tables:

la.larry.fromcsv('accstate5.txt')  # produces a strange error

Now, if you read the NCI data download info page, where I retrieved this data from, you will learn what each cryptic column heading means:

Column heading format:   [V] [RG] [T] [A], where

V = variable: R, C, LB, or UB
RG = race / gender: BM, BF, WM, WF (B = black, W = white, F = female, M = male)
T = calendar time: 5094, 5069, 7094, and the 9 5-year periods 5054 through 9094
A = age group: blank (all ages), 019, 2049, 5074, 75+

Example:
RBM5054 = rate for black males for the time period 1950-1954
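Just to keep that encoding straight in my head, here is a little sketch of how a heading like RBM5054 might break apart; the regular expression is my own guess at the format above, nothing official from NCI:

import re

# hypothetical parser for the column headings: variable, race, gender, time, age
heading_re = re.compile(r'^(R|C|LB|UB)(B|W)(M|F)(\d{4})(019|2049|5074|75\+)?$')

def parse_heading(h):
    return heading_re.match(h).groups()

print(parse_heading('RBM5054'))  # ('R', 'B', 'M', '5054', None)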

That means I’m not really dealing with a 2-dimensional table after all, and maybe I’d be better off organizing things the way they really are:

data_dict_list = [d for d in csv.DictReader(open('accstate5.txt'))]
data_dict = {}
races = ['B', 'W']
sexes = ['M', 'F']
times = range(50, 95, 5)
for d in data_dict_list:
    for race in races:
        for sex in sexes:
            for time in times:
                # e.g. key 'RBM5054' = rate, black males, 1950-1954
                key = 'R%s%s%d%d' % (race, sex, time, time+4)
                val = float(d.get(key, nan))  # larry uses nan to denote missing data
                data_dict[(d['STATE'], race, sex, '19%s' % time)] = val

R = la.larry.fromdict(data_dict)
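If that worked, R should be a 4-dimensional larry, with one axis each for state, race, sex, and time; a quick sanity check (just my habit, nothing required) might look like:

print(R.shape)        # (number of states, 2, 2, 9)
print(R.getlabel(1))  # ['B', 'W']
print(R.getlabel(3))  # ['1950', '1955', ..., '1990']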

In the original 2-d labeled array we can compare cancer rates thusly:

states = ['OR', 'WA', 'ID', 'MT']
times = arange(50,94,5)
selected_cols = ['RWM%d%d' % (t, t+4) for t in times]
T_selected = T.lix[states, selected_cols]
plot(1900+times, T_selected.getx().T, 's-',
     linewidth=3, markersize=15, alpha=.75)
legend(T_selected.getlabel(0), loc='lower right', numpoints=3, markerscale=.5)
ylabel('Mortality rate (per 100,000 person-years)')
xlabel('Time (years)')
title('Mortality rate for all cancers over time')
axis([1945, 1995, 0, 230])

It is slightly cooler to do the same plot with the multi-dimensional array:

clf()
subplot(2,1,1)

times = [float(t) for t in R.getlabel(3)]  # time labels like '1950' as numbers
states = ['OR', 'WA', 'ID', 'MT']
colors = ['c', 'b', 'r', 'g']
for state, color in zip(states, colors):
    plot(times, R.lix[[state], ['W'], ['M'], :].getx(), 's-', color=color,
         linewidth=3, markersize=15, alpha=.75)
legend(states, loc='lower right', numpoints=3, markerscale=.5)
ylabel('Mortality rate (per 100,000 person-years)')
xlabel('Time (years)')
title('Mortality rate for all cancers in white males over time')
axis([1945, 1995, 0, 230])

That should be enough for me to remember how to use larry when I come back to it in the future. For my final trick, let’s explore how to use larrys to add a column to a 2-dimensional table. (There is some important extension of this to multidimensional tables, but it is confusing me a lot right now.)

states = T.getlabel(0)
X = la.larry(randn(len(states), 1), [states, ['new covariate']])
T = T.merge(X)

states = ['OR', 'WA', 'ID', 'MT']
T.lix[states, ['RWM9094', 'new covariate']]
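As a quick check that the merge did what I wanted (again, just my own habit), the new column should now appear among T’s column labels:

print(T.shape)                           # one more column than before
print('new covariate' in T.getlabel(1))  # True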

Ok, encore… how well does larry play with PyMC? Let’s use a Gaussian Process model to find out.

from pymc import *

# GP prior: constant mean at the overall average rate, Matern covariance
M = gp.Mean(lambda x, mu=R.mean(): mu * ones(len(x)))
C = gp.Covariance(gp.matern.euclidean, diff_degree=2., amp=1., scale=1.)

# prediction mesh over all (state, race, sex, time) coordinates
mesh = array([[i,j,k,.5*l] for i in range(52) for j in range(2) for k in range(2) for l in arange(9)])

# observations are the cells of R that are not missing, with observation variance 1
obs_mesh = array([[i,j,k,.5*l] for i in range(52) for j in range(2) for k in range(2) for l in range(9) if not isnan(R[i,j,k,l])])
obs_vals = array([R[i,j,k,l] for i in range(52) for j in range(2) for k in range(2) for l in range(9) if not isnan(R[i,j,k,l])])
obs_V = array([1. for i in range(52) for j in range(2) for k in range(2) for l in range(9) if not isnan(R[i,j,k,l])])

gp.observe(M, C, obs_mesh, obs_vals, obs_V)
imputed_vals = M(mesh)
imputed_std = sqrt(abs(C(mesh)))

# pack the posterior mean back into a larry with the same labels as R
Ri = la.larry(imputed_vals.reshape(R.shape), R.label)
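The same reshape trick presumably works for the pointwise predictive standard deviations, which could drive error bars someday; Ri_std is my own name, and I haven’t used it in the plot below:

# pack the standard deviations into a larry with the same labels (a sketch)
Ri_std = la.larry(imputed_std.reshape(R.shape), R.label)

# e.g. a rough 95% band for black males in Washington
lower = Ri.lix[['WA'], ['B'], ['M'], :] - 1.96 * Ri_std.lix[['WA'], ['B'], ['M'], :]
upper = Ri.lix[['WA'], ['B'], ['M'], :] + 1.96 * Ri_std.lix[['WA'], ['B'], ['M'], :]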

subplot(2,1,2)
for state, color in zip(states, colors):
    t = [float(l) for l in Ri.getlabel(3)]
    plot(t, R.lix[[state], ['B'], ['M'], :].getx(), '^', color=color, linewidth=3, markersize=15, alpha=.75)
    plot(t, Ri.lix[[state], ['B'], ['M'], :].getx(), '-', color=color, linewidth=2, alpha=.75)

ylabel('Mortality rate (per 100,000 person-years)')
xlabel('Time (years)')
title('Mortality rate for all cancers in black males over time')
axis([1945, 1995, 0, 630])


Filed under software engineering

Gaussian Processes in Theory

James Lee has a new post on his tcsmath blog about Gaussian Processes, a topic I’ve been enamored with for a while now. I love the graphics he includes in his posts… they look like they take a lot of work to produce.

James (and Talagrand) are interested in finding the supremum of a GP, which I can imagine being a very useful tool for studying random graphs and average-case analysis of algorithms. I’m interested in finding rapidly mixing Markov chains over GPs, which seems to be useful for disease modeling. These are seemingly very different directions of research, but I’ll be watching tcsmath for the next installment of majorizing measures.


Filed under probability

Book Reviews in SIGACT News

I’m off Twitter, and that means more short posts here. Do fun book reviews from SIGACT News deserve more than 140 characters? I don’t know, but my productivity has gone up since I stopped getting tweeted at. Feel free to read the following in a shrill chirp, however.

SIGACT News has a book review column that I scan when I have time. This month they reviewed two books I’ve had my eye on: Combinatorics: The Rota Way and Logicomix. I’m sure I’ll read both of these when I have time, and pretty sure I want to own both. I’m a bit less excited about the Rota book after reading the review, though, since I was hoping it was a textbook version of his intro probability course, which I loved. It sounds like it is a reference-book version of his grad combinatorics course, which I was too busy to finish. So maybe I shouldn’t be disappointed; it’s actually a chance for me to learn new things. It does sound hard, though:

The book works best as a second read of the topics covered. If you already know of a combinatorial method, like Polya’s Enumeration Theory, this book is a good place to find the starting point for an alternate and powerful treatment of the topic. The book admits to not being self contained, and has a few forward-reference problems. However, this is forgivable when you realize the goal of this book is not to teach some easy discrete mathematics before you move on to analysis, but to extract the important combinatorial methods and themes from all of mathematics.

The Logicomix review is by Bill Gasarch and it is very strange. It apologizes for not being in the form of a comic.


Filed under education

I ♥ my new commute

I’ve been away from the keyboard, and that’s because I’ve been moving. But it’s the last time for a while, because Jessi and I are not renters anymore. Yay!

And bonus, I love my new commute. The Seattle waterfront is a nice thing to see every day, even if it’s sometimes overcast:


Filed under general