Big Week for Healthy Algorithms Posts

I’ve been in meetings literally all day, but I’ve got so much to say that I’m still here… this is a newsletter that IHME put out yesterday, and it’s got me in the front cover photo. How can I pass up announcing that? It’s mostly for my mom, but PyMC fans might also appreciate the shoutout I managed to give Anand Patil, who authored the PyMC Gaussian Process package that I’ve been urging people to use lately.

4 Comments

Filed under general

The GBD 2010 Health Measurement Survey is here

Have you got 15 minutes for science? Take this strange survey that I’ve been excited about for the last two years. I’ve been calling it the Disability Weights Survey, but now that it’s all professionally implemented and communications-department approved, it is officially the GBD 2010 Health Measurement Survey.

The survey is part of the Global Burden of Diseases, Injuries, and Risk Factors Study 2010 led by the Institute for Health Metrics and Evaluation (IHME) at the University of Washington, in collaboration with four other leading institutions: Harvard University, Johns Hopkins University, the University of Queensland, and the World Health Organization.

Our goal is to collect responses from at least 50,000 people worldwide. Please consider sharing this and encouraging participation within your organization. In addition, we would ask you to consider forwarding information about this survey to colleagues and contacts outside your organization who might be interested in participating.

The survey takes about 15 minutes to complete. Participation is completely voluntary and anonymous. In the near future, we hope to translate the survey into additional languages.

2 Comments

Filed under global health

Salt: Bad Dialogue and Worse

I had a break yesterday to see one of those “summer blockbusters”, a spy flick starring Angelina Jolie called Salt. It had some good explosions and good action, but overall it was so outrageously terrible that I will reveal the entire cloak-and-dagger twist to complain. (spoiler ahead) Continue reading

4 Comments

Filed under education, videos

What is this “effects” business?

I’ve got to figure out what people mean when they say “fixed effect” and “random effect”, because I’ve been confused about it for a year and I’ve been hearing it all the time lately.

Bayesian Data Analysis is my starting guide, which includes a footnote on page 391:

The terms ‘fixed’ and ‘random’ come from the non-Bayesian statistical tradition and are somewhat confusing in a Bayesian context where all unknown parameters are treated as ‘random’. The non-Bayesian view considers fixed effects to be fixed unknown quantities, but the standard procedures proposed to estimate these parameters, based on specified repeated-sampling properties, happen to be equivalent to the Bayesian posterior inferences under a noninformative (uniform) prior distribution.

That doesn’t totally resolve my confusion, though, because my doctor-economist colleagues are often asking for the posterior mean of the random effects, or similarly non-non-Bayesian sounding quantities.

I was about to formulate a working definition and see how long I could stick to it, but then I was volunteered to teach a seminar on this very topic! So instead of doing the work now, I turn to you, wise internet, to tell me how I can understand this thing.
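For what it’s worth, the working definition I was about to commit to goes something like this: a “fixed effect” is a single coefficient shared by every group, while a “random effect” is a group-level coefficient drawn from a common distribution, so its estimate gets shrunk toward the grand mean. Here is a minimal numpy sketch of that shrinkage, assuming a normal hierarchical model with known variances (the data, sigma2, and tau2 below are made up for illustration):

```python
import numpy as np

# made-up data: three groups of unequal size
y = {'a': np.array([3.1, 2.9, 3.4]),
     'b': np.array([1.2]),
     'c': np.array([2.0, 2.2, 1.8, 2.1])}

sigma2 = 0.2 ** 2  # assumed within-group variance
tau2 = 0.5 ** 2    # assumed between-group variance
mu = np.mean([v.mean() for v in y.values()])  # crude grand mean

# posterior mean of each group intercept: a precision-weighted
# average of the group sample mean and the grand mean
shrunk = {}
for g, vals in y.items():
    n = len(vals)
    w = (n / sigma2) / (n / sigma2 + 1 / tau2)
    shrunk[g] = w * vals.mean() + (1 - w) * mu
```

Group b, with a single observation, gets pulled much harder toward the grand mean than the well-sampled groups, and the “posterior mean of the random effects” my doctor-economist colleagues ask about is (I think) exactly this shrunken quantity.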

6 Comments

Filed under statistics

MCMC in Python: Sudoku is a strange game

I was on a “vacation” for the last week, which included a real vacation to visit my parents, and also a scare-quotes vacation to Detroit that is a totally different subject. But the side effect of this vacation that is to be the topic of this post is the strange game of Sudoku, which is popular with travelers.

Jessi was with me and she seems to like this Sudoku business (although she denies it), so I thought about playing, too. But I always get interested in the wrong parts of these games. Solving Sudoku puzzles seems more like a game for computers than humans, and writing a program to let computers have their fun is the way I would prefer to waste time while en route. There really isn’t much to it:

def solve(T):
    # if all cells are filled in, we win
    if (T > 0).all():
        return 'success'

    # solve T recursively, trying all values for the most constrained cell
    # (possibilities and most_constrained are helper functions, not shown here)
    pos = possibilities(T)
    i,j = most_constrained(T, pos)

    for val in pos[(i,j)]:
        T[i,j] = val
        if solve(T) == 'success':
            return 'success'

    # if this point is reached, this branch is unsatisfiable
    T[i, j] = -1
    return 'failure'

Is this ruining anyone’s homework problems? Just in case it is, everything I said at the beginning of my post on minimum spanning trees still applies.

With plenty of time left in my flight, this was just the opening move, though. The real question is: what makes this fun for the humans who find this kind of thing fun? One hypothesis, the kind that comes naturally when you have watched the statistical physicists in action, is that fun is some sort of phase transition, and if you take random Sudoku puzzles with a certain number of entries you will maximize fun. Half-baked, I know, but at least as interesting as the movies they show on planes. And it raises the extremely important, and possibly unsolved, challenge: how do you sample uniformly at random from Sudoku puzzles with n cells filled in?

In my travel-addled state, I thought maybe a good way to do it would be to start with some complete configuration, knock out a large fraction of the cells at random, permute the remaining cells, and then solve the resulting configuration. Repeating this has got to be an ergodic Markov chain, right? I’m not sure… and even then, do you think the stationary distribution is uniform?

def rand(n, T=None):
    # start with an empty board
    if T is None:
        T = -1*ones([9,9])

    # solve it to generate an initial solution
    solve(T)

    # do random shuffles to approximate uniformly random solution
    for k in range(5):
        select_random_cells(T, 20)
        randomly_permute_labels(T)
        solve(T)

    # remove appropriate amount of labels
    select_random_cells(T, n)

    return T

Now, when Jessi found out that I was helping computers take her Sudoku-solving job away, she thought I was a geek, but when she found out I was generating puzzles with more than one solution, she was outraged. Sudoku puzzles have a unique solution! So maybe what is really fun is a random puzzle with a unique solution and the right number of cells filled in, where fewer filled-in cells means a harder puzzle. Fortunately/unfortunately, my travel ended before I finished investigating this important issue.
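If I had had one more flight, the next step would have been a solution counter to screen out the puzzles that outraged Jessi. Here is a sketch, using a plain list-of-lists board with -1 for empty cells to match the convention above, along with one possible implementation of the possibilities helper I waved my hands at earlier (the most-constrained-cell choice is folded into the counter); count_solutions gives up as soon as it finds two completions, which is all you need to test uniqueness:

```python
def possibilities(T):
    # map each empty cell (marked -1) to the values not already used
    # in its row, column, or 3x3 box
    pos = {}
    for i in range(9):
        for j in range(9):
            if T[i][j] <= 0:
                used = set(T[i]) | {T[r][j] for r in range(9)}
                bi, bj = 3 * (i // 3), 3 * (j // 3)
                used |= {T[r][c] for r in range(bi, bi + 3)
                                 for c in range(bj, bj + 3)}
                pos[(i, j)] = [v for v in range(1, 10) if v not in used]
    return pos

def count_solutions(T, limit=2):
    # count completions of T by backtracking, giving up after `limit`
    pos = possibilities(T)
    if not pos:
        return 1  # board is full: exactly one completion (itself)
    # branch on the most constrained empty cell
    i, j = min(pos, key=lambda cell: len(pos[cell]))
    n = 0
    for val in pos[(i, j)]:
        T[i][j] = val
        n += count_solutions(T, limit - n)
        if n >= limit:
            break
    T[i][j] = -1  # undo before returning
    return min(n, limit)
```

A puzzle is Jessi-approved exactly when count_solutions returns 1.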

I doubt it is related, but I got pretty sick the next day.

8 Comments

Filed under MCMC

la.larry for classy number crunching in Python

I’ve got some of my colleagues interested in using Python to crunch their numbers, and that’s great, because I’m sure it will make their lives much better in the long run. It’s also great because they are pushing me to figure out the easy way to do things, since they are used to certain conveniences from their past work with R and Stata.

For example, Kyle wanted to know: does Python make it easy to deal with the tables of data he is faced with every day?

It was by showing my ignorance in a blog post a year ago that I learned about handling csv files with the csv.DictReader class. This has been convenient enough for me, but it doesn’t let Kyle do things the way he is used to, for example using data[['iso3', 'gdp'], :] to select these two important columns from a large table.

Enter larry, the labeled array. Maybe someone will tell me an even cleaner way to deal with these tables, but after a quick easy_install la, I was able to load up some data from the National Cancer Institute (NCI) as follows:

import csv
import la
from pylab import *

csv_file = csv.reader(open('accstate5.txt'))
columns = csv_file.next()[1:]
rows = []
data = []

for d in csv_file:
    rows.append(d[0])
    data.append(d[1:])

T = la.larry(data, [rows, columns])

That’s pretty good, but the data is not properly coded as floating point…

T = la.larry(data, [rows, columns]).astype(float)

Now T has the whole table, indexed by state and by strange NCI code:

T.lix[['WA', 'OR', 'ID', 'MO'], ['RBM9094']]

Perfect, besides being slightly ugly. But is there an easier way? The la documentation mentions a fromcsv method. Unfortunately, la.larry.fromcsv doesn’t like these short-format tables:

la.larry.fromcsv('accstate5.txt')  # produces a strange error

Now, if you read the NCI data download info page, where I retrieved this data from, you will learn what each cryptic column heading means:

Column heading format:   [V] [RG] [T] [A], where

V = variable: R, C, LB, or UB
RG = race / gender: BM, BF, WM, WF (B = black, W = white, F = female, M = male)
T = calendar time: 5094, 5069, 7094, and the 9 5-year periods 5054 through 9094
A = age group: blank (all ages), 019, 2049, 5074, 75+

Example:
RBM5054 = rate for black males for the time period 1950-1954

That means I’m not really dealing with a 2-dimensional table after all, and maybe I’d be better off organizing things the way they really are:

data_dict_list = [d for d in csv.DictReader(open('accstate5.txt'))]
data_dict = {}
races = ['B', 'W']
sexes = ['M', 'F']
times = range(50,95,5)
for d in data_dict_list:
    for race in races:
        for sex in sexes:
            for time in times:
                key = 'R%s%s%d%d' % (race, sex, time, time+4)
                val = float(d.get(key, nan))  # larry uses nan to denote missing data
                data_dict[(d['STATE'], race, sex, '19%s'%time)] = val

R = la.larry.fromdict(data_dict)

In the original 2-d labeled array we can compare cancer rates thusly:

states = ['OR', 'WA', 'ID', 'MT']
times = arange(50,94,5)
selected_cols = ['RWM%d%d' % (t, t+4) for t in times]
T_selected = T.lix[states, selected_cols]
plot(1900+times, T_selected.getx().T, 's-',
     linewidth=3, markersize=15, alpha=.75)
legend(T_selected.getlabel(0), loc='lower right', numpoints=3, markerscale=.5)
ylabel('Mortality rate (per 100,000 person-years)')
xlabel('Time (years)')
title('Mortality rate for all cancers over time')
axis([1945, 1995, 0, 230])

It is slightly cooler to do the same plot with the multi-dimensional array:

clf()
subplot(2,1,1)

times = [str(t) for t in range(1950, 1994, 5)]
states = ['OR', 'WA', 'ID', 'MT']
colors = ['c', 'b', 'r', 'g']
for state, color in zip(states, colors):
    plot(R.getlabel(3), R.lix[[state], ['W'], ['M'], :].getx(), 's-', color=color,
         linewidth=3, markersize=15, alpha=.75)
legend(states, loc='lower right', numpoints=3, markerscale=.5)
ylabel('Mortality rate (per 100,000 person-years)')
xlabel('Time (years)')
title('Mortality rate for all cancers in white males over time')
axis([1945, 1995, 0, 230])

That should be enough for me to remember how to use larry when I come back to it in the future. For my final trick, let’s explore how to use larrys to add a column to a 2-dimensional table. (There is some important extension of this to multidimensional tables, but it is confusing me a lot right now.)

states = T.getlabel(0)
X = la.larry(randn(len(states), 1), [states, ['new covariate']])
T = T.merge(X)

states = ['OR', 'WA', 'ID', 'MT']
T.lix[states, ['RWM9094', 'new covariate']]

Ok, encore… how well does larry play with PyMC? Let’s use a Gaussian Process model to find out.

from pymc import *

M = gp.Mean(lambda x, mu=R.mean(): mu * ones(len(x)))
C = gp.Covariance(gp.matern.euclidean, diff_degree=2., amp=1., scale=1.)
mesh = array([[i,j,k,.5*l] for i in range(52) for j in range(2) for k in range(2) for l in arange(9)])

obs_mesh = array([[i,j,k,.5*l] for i in range(52) for j in range(2) for k in range(2) for l in range(9) if not isnan(R[i,j,k,l])])
obs_vals = array([R[i,j,k,l] for i in range(52) for j in range(2) for k in range(2) for l in range(9) if not isnan(R[i,j,k,l])])
obs_V = array([1. for i in range(52) for j in range(2) for k in range(2) for l in range(9) if not isnan(R[i,j,k,l])])

gp.observe(M, C, obs_mesh, obs_vals, obs_V)
imputed_vals = M(mesh)
imputed_std = sqrt(abs(C(mesh)))

Ri = la.larry(imputed_vals.reshape(R.shape), R.label)

subplot(2,1,2)
for state, color in zip(states, colors):
    t = [float(l) for l in Ri.getlabel(3)]
    plot(t, R.lix[[state], ['B'], ['M'], :].getx(), '^', color=color, linewidth=3, markersize=15, alpha=.75)
    plot(t, Ri.lix[[state], ['B'], ['M'], :].getx(), '-', color=color, linewidth=2, alpha=.75)

ylabel('Mortality rate (per 100,000 person-years)')
xlabel('Time (years)')
title('Mortality rate for all cancers in black males over time')
axis([1945, 1995, 0, 630])

5 Comments

Filed under software engineering

Gaussian Processes in Theory

James Lee has a new post on his tcsmath blog about Gaussian Processes, a topic I’ve been enamored with for the last while. I love the graphics he includes in his posts… they look like they take a lot of work to produce.

James (and Talagrand) are interested in finding the supremum of a GP, which I can imagine being a very useful tool for studying random graphs and average-case analysis of algorithms. I’m interested in finding rapidly mixing Markov chains over GPs, which seems to be useful for disease modeling. Seemingly very different directions of research, but I’ll be watching tcsmath for the next installment of majorizing measures.

1 Comment

Filed under probability

Book Reviews in SIGACT News

I’m off twitter, and that means more short posts here.  Do fun book reviews from the SIGACT News deserve more than 140 characters?  I don’t know, but my productivity has gone up since I stopped getting tweeted at.  Feel free to read the following in a shrill chirp, however.

SIGACT News has a book review column that I scan when I have time. This month they reviewed two books I’ve had my eye on: Combinatorics: The Rota Way and Logicomix. I’m sure I’ll read both of these when I have time, and I’m pretty sure I want to own both. I’m a bit less excited about the Rota book after reading the review, though, since I was hoping it was a textbook version of his intro probability course, which I loved. It sounds like it is a reference-book version of his grad combinatorics course, which I was too busy to finish. So maybe I shouldn’t be disappointed; it’s actually a chance for me to learn new things. It does sound hard, though:

The book works best as a second read of the topics covered. If you already know of a combinatorial method, like Polya’s Enumeration Theory, this book is a good place to find the starting point for an alternate and powerful treatment of the topic. The book admits to not being self contained, and has a few forward-reference problems. However, this is forgivable when you realize the goal of this book is not to teach some easy discrete mathematics before you move on to analysis, but to extract the important combinatorial methods and themes from all of mathematics.

The Logicomix review is by Bill Gasarch and it is very strange. It apologizes for not being in the form of a comic.

2 Comments

Filed under education

I ♥ my new commute

I’ve been away from the keyboard, and that’s because I’ve been moving. But it’s the last time for a while, because Jessi and I are not renters anymore. Yay!

And bonus, I love my new commute. The Seattle waterfront is a nice thing to see every day, even if it’s sometimes overcast:

5 Comments

Filed under general

Child Mortality Paper

Check it out, my first published research in global health: Neonatal, postneonatal, childhood, and under-5 mortality for 187 countries, 1970–2010: a systematic analysis of progress towards Millennium Development Goal 4. I’m the ‘t’ in et al., and my contribution was talking them into using the really fun Gaussian Process in their model (and helping do it).

I’ve long wanted to write a how-to style tutorial about using Gaussian Processes in PyMC, but time continues to be on someone else’s side. Instead of waiting for that day, you can enjoy the GP Users Guide now.

3 Comments

Filed under global health