Category Archives: Uncategorized

IHME Seminar: Universal health coverage, equity, and health outcomes

Last week we had a talk from Rodrigo Moreno-Serra on Universal health coverage, equity, and health outcomes. This research used instrumental variables to show that universal health coverage is good for health. One day I will understand instrumental variables—I think there should be a simple way to explain it to combinatorialists who know some epidemiology.
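In the meantime, here is the toy simulation I use to re-explain it to myself (the numbers and variable names are made up, not from the talk): an unobserved confounder u biases a plain regression of the outcome on the exposure, while an instrument z that moves the exposure but touches the outcome only through it recovers the true effect via two-stage least squares.

import numpy as np

rng = np.random.RandomState(0)
n = 100000
beta_true = 2.0                                   # the causal effect we want back

u = rng.normal(size=n)                            # unobserved confounder
z = rng.normal(size=n)                            # instrument
x = z + u + rng.normal(size=n)                    # exposure, pushed around by both
y = beta_true * x + 3 * u + rng.normal(size=n)    # outcome, also confounded by u

ols = np.cov(x, y)[0, 1] / np.var(x)              # biased: picks up the u pathway
x_hat = z * (np.cov(z, x)[0, 1] / np.var(z))      # stage 1: part of x explained by z
tsls = np.cov(x_hat, y)[0, 1] / np.var(x_hat)     # stage 2: close to beta_true
print((ols, tsls))                                # roughly (3.0, 2.0)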


BigCLAM talk

My grad school colleague Jure Leskovec was at UW recently, talking about a cool new model for clusters in random graphs. It was part generative model and part statistical model, and it made me miss my grad school days, when I spent long hours proving theorems about such things. The statistical part of his model has a very forced acronym, BigCLAM, and also a software implementation: https://github.com/snap-stanford/snap/tree/master/examples/bigclam
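For my own notes, the generative core of the model (as I understand it from the talk, not the SNAP implementation): every node carries a nonnegative community-affiliation vector F, and an edge between nodes u and v appears with probability 1 - \exp(-F_u \cdot F_v). A minimal sketch with made-up affiliations:

import numpy as np

rng = np.random.RandomState(0)
n_nodes, n_communities = 50, 3

# hypothetical affiliations: mostly small, with a few strong memberships
F = rng.gamma(shape=0.5, scale=1.0, size=(n_nodes, n_communities))

prob = 1 - np.exp(-F.dot(F.T))        # edge probabilities from shared affiliations
np.fill_diagonal(prob, 0)             # no self-loops
adj = np.triu(rng.rand(n_nodes, n_nodes) < prob, 1)
adj = adj | adj.T                     # undirected adjacency matrix
print(adj.sum() // 2)                 # number of sampled edges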

Maybe one day I’ll get to do some applied network research for global health, and have a chance to use this.


IHME Seminar: Ambient Air Pollution and Cardiovascular Disease

The semester is starting up again, and that means that weekly IHME seminars are starting up again. This week, we heard from Joel Kaufman, a doctor and UW professor who knows quite a lot about how air pollution is bad for the heart. He had some great historical photos of air pollution from the Great Smog of London, which I had not heard of before. Searching later led me to this collection in the Guardian. Dr. Kaufman also had some recent photos of extreme air pollution, which looked sort of like this one.

I remember when I started this blog, I had a goal to draw connections between the issues in global health and the methods and results of theoretical computer science. What does the air-pollution/cardio-health talk inspire along these lines? Well, there are two “big data” sources going on here: continuously updated measurements of air quality from a number of geographically dispersed sensors, and regularly conducted CT scans of participants in a large longitudinal study. It was only an hour-long talk, so I’m not sure what problems arise when you put these things together, but I bet you can’t store it all in memory, even on a pretty large computer. And that’s one definition of big…
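For a totally made-up back-of-the-envelope check (neither number is from the talk), even modest guesses put the combined data well past typical RAM:

sensor_bytes = 1000 * 24 * 365 * 10 * 8   # say 1,000 monitors, hourly readings, 10 years, 8-byte floats
ct_bytes = 5000 * 300e6                   # say 5,000 participants at ~300 MB per CT scan
print((sensor_bytes + ct_bytes) / 1e9)    # roughly 1,500 GB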


Thoughtful blog about using IPython Notebook in scientific workflow

Here: http://www.pgbovine.net/ipython-notebook-first-impressions.htm

I’ve been doing this for about a year now, and it is working super-well. What I wish for is a way to paste images directly into the notebook. I think it would be pretty easy to add but I haven’t figured out how to do it yet.
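The closest thing I know of today is the save-then-display route, which is not really pasting (screenshot.png is a hypothetical file sitting next to the notebook):

from IPython.display import Image
Image(filename='screenshot.png')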


DSP in Python: Active Noise Reduction with PyAudio

I had a fun little project a while back, to deal with some night noise that was getting in the way of my sleep: active noise reduction, hacked together in Python. It really works (for me)! There’s tons of room for improvement, and at least one interested party. I’m finally pushing it out into the world, so maybe someone will improve it.
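The project itself is not reproduced here, but the core trick is just phase inversion; here is a minimal sketch, assuming PyAudio, numpy, and a full-duplex sound card (latency and calibration are exactly the hard parts this glosses over):

import numpy as np
import pyaudio

CHUNK = 1024
RATE = 44100

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                input=True, output=True, frames_per_buffer=CHUNK)

try:
    while True:
        data = stream.read(CHUNK)                      # raw bytes from the mic
        samples = np.frombuffer(data, dtype=np.int16)  # decode 16-bit samples
        inverted = (-samples).astype(np.int16)         # flip the phase
        stream.write(inverted.tobytes())               # play the inverted signal back
finally:
    stream.stop_stream()
    stream.close()
    p.terminate()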


Graph Analytics Challenge

I was playing around with SPARQL queries and the semantic web earlier this year, inspired in part by a contest I entered. Well, the good news came out that my project was second runner-up! Of course, I would have liked to take first place, but the projects that beat mine were both really cool. Information Week did a nice story on all of us.
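The contest entry itself is not shown here, but for flavor, this is roughly what poking at the semantic web from Python looks like: a throwaway query against the public DBpedia endpoint using the SPARQLWrapper package (the query is purely illustrative, not from the project):

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?thing ?label WHERE {
      ?thing dbo:location dbr:Seattle ;
             rdfs:label ?label .
      FILTER (lang(?label) = "en")
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["label"]["value"])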


Math and Dance

I almost didn’t share these HarleMCMC videos, but how long could I resist, really?

We’ll see how this holds up to repeated viewing…

Here is a math/dance video for the ages:


Better typography for IPython notebooks, now

I recently came across a long blog post about how to make IPython notebooks more aesthetically pleasing, and I think there is a lot to be learned there.

The good news is that you can try this out on a notebook-by-notebook basis with a little trick. Just drop this into a cell in any IPython notebook:

import IPython.core.display

IPython.core.display.HTML("""
<style>
div.input {
    width: 105ex; /* about 80 chars + buffer */
}

div.text_cell {
    width: 105ex; /* instead of 100% */
}

div.text_cell_render {
    /* font-family: "Helvetica Neue", Arial, Helvetica, Geneva, sans-serif; */
    font-family: "Charis SIL", serif; /* make non-code text serif */
    line-height: 145%; /* added for some line spacing of text */
    width: 105ex; /* instead of 'inherit' for shorter lines */
}

/* Set the size of the headers */
div.text_cell_render h1 {
    font-size: 18pt;
}

div.text_cell_render h2 {
    font-size: 14pt;
}

.CodeMirror {
    font-family: Consolas, monospace;
}
</style>
""")


IPython autoreloading

This will be useful.
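For future reference, the recipe is IPython's built-in autoreload extension; dropping these two magics at the top of a notebook makes edited modules reimport automatically before each cell runs:

%load_ext autoreload
%autoreload 2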


Classic EM in Python: Multinomial sampling

In the classic paper on the EM algorithm, the extensive example section begins with a multinomial modeling example that is theoretically very similar to the warm-up problem on 197 animals:

We can think of the complete data as an n \times p matrix x whose (i,j) element is unity if the i-th unit belongs in the j-th of p possible cells, and is zero otherwise. The i-th row of x contains p-1 zeros and one unity, but if the i-th unit has incomplete data, some of the indicators in the i-th row of x are observed to be zero, while the others are missing and we know only that one of them must be unity. The E-step then assigns to the missing indicators fractions that sum to unity within each unit, the assigned values being expectations given the current estimate of \phi. The M-step then becomes the usual estimation of \phi from the observed and assigned values of the indicators summed over the units.

In practice, it is convenient to collect together those units with the same pattern of missing indicators, since the filled in fractional counts will be the same for each; hence one may think of the procedure as filling in estimated counts for each of the missing cells within each group of units having the same pattern of missing data.

When I first made some data to try this out, it looked like this:

import pymc as mc, numpy as np, pandas as pd, random

n = 100000
p = 5

pi_true = mc.rdirichlet(np.ones(p))
pi_true = np.hstack([pi_true, 1-pi_true.sum()])
x_true = mc.rmultinomial(1, pi_true, size=n)

x_obs = np.array(x_true, dtype=float)
for i in range(n):
    # hide 3 of the p indicator cells in each row
    for j in random.sample(range(p), 3):
        x_obs[i,j] = np.nan

At first, I was pretty pleased with myself when I managed to make a PyMC model and an E-step and M-step that converged to something like the true value of \pi. The model is not super slick:

# parameter to estimate: the cell probabilities
pi = mc.Uninformative('pi', value=np.ones(p)/p)

# mask of missing indicator cells, and a valid starting value for the
# complete-data indicators (each row must contain exactly one unity)
x_missing = np.isnan(x_obs)
x_initial = x_obs.copy()
x_initial[x_missing] = 0.
for i in range(n):
    if x_initial[i].sum() == 0:
        j = np.where(x_missing[i])[0][0]
        x_initial[i,j] = 1.

# complete-data indicators, multinomial given pi
@mc.stochastic
def x(pi=pi, value=x_initial):
    return mc.multinomial_like(value, 1, pi)

# observed data: x must agree with x_obs wherever x_obs is not missing
# (log-likelihood 0 if they match, -inf otherwise)
@mc.observed
def y(x=x, value=x_obs):
    if np.allclose(x[~x_missing], value[~x_missing]):
        return 0
    else:
        return -np.inf

And the E-step/M-step parts are pretty simple:

def E_step():
    # fill each missing cell with its expected value given pi and the
    # observed cells in the same row
    x_new = np.array(x_obs, dtype=float)
    for i in range(n):
        if x_new[i, ~x_missing[i]].sum() == 0:
            # the unity is among the missing cells; split it proportionally to pi
            conditional_pi_sum = pi.value[x_missing[i]].sum()
            for j in np.where(x_missing[i])[0]:
                x_new[i,j] = pi.value[j] / conditional_pi_sum
        else:
            # the unity was observed, so every missing cell must be zero
            x_new[i, x_missing[i]] = 0.
    x.value = x_new

def M_step():
    # maximum-likelihood update of pi from the filled-in indicator counts
    counts = x.value.sum(axis=0)
    pi.value = (counts / counts.sum())

But the way the values converge does look nice:
[figure: the EM estimates of \pi converging over successive iterations]

The thing that made me feel silly was comparing this fancy-pants approach to the result of averaging all of the non-empty cells of x_obs:

ests = pd.DataFrame(dict(pr=pi_true, true=x_true.mean(0),
                    naive=pd.DataFrame(x_obs).mean(), em=pi.value),
                    columns=['pr', 'true', 'naive', 'em']).sort('true')
print np.round_(ests, 3)
      pr   true  naive     em
2  0.101  0.101  0.100  0.101
0  0.106  0.106  0.108  0.108
3  0.211  0.208  0.209  0.208
1  0.269  0.271  0.272  0.271
4  0.313  0.313  0.314  0.313

Simple averages are just as good as EM, at least for the simplest setup I could think of based on the example: with cells missing completely at random, the naive average of the observed indicators is already an unbiased estimate of \pi.

To see why this EM business is worth the effort requires a more elaborate model of missingness. I made one, but it is a little bit messy. Can you make one that is nice and neat?
