Category Archives: Uncategorized
ipython autoreloading
Comments Off on ipython autoreloading
Filed under Uncategorized
Classic EM in Python: Multinomial sampling
In the classic paper on the EM algorithm (Dempster, Laird, and Rubin, 1977), the extensive example section begins with a multinomial modeling example that is theoretically very similar to the warm-up problem on the 197 animals:
We can think of the complete data as an $n \times p$ matrix $x$ whose $(i,j)$ element is unity if the $i$-th unit belongs in the $j$-th of $p$ possible cells, and is zero otherwise. The $i$-th row of $x$ contains $p-1$ zeros and one unity, but if the $i$-th unit has incomplete data, some of the indicators in the $i$-th row of $x$ are observed to be zero, while the others are missing and we know only that one of them must be unity. The E-step then assigns to the missing indicators fractions that sum to unity within each unit, the assigned values being expectations given the current estimate of $\phi$. The M-step then becomes the usual estimation of $\phi$ from the observed and assigned values of the indicators summed over the units.
In practice, it is convenient to collect together those units with the same pattern of missing indicators, since the filled in fractional counts will be the same for each; hence one may think of the procedure as filling in estimated counts for each of the missing cells within each group of units having the same pattern of missing data.
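As an aside, that grouping trick is easy to express with pandas. Here is a minimal sketch of my own (the function name and layout are made up, not from the paper) that buckets the rows of an incomplete indicator matrix by their pattern of missing entries:

import numpy as np, pandas as pd

def group_by_missingness(x_obs):
    # x_obs is an (n, p) 0/1 indicator matrix with np.nan marking missing
    # entries; rows sharing a missingness pattern get identical filled-in
    # fractions, so the E-step needs only one computation per group
    miss = np.isnan(x_obs)
    keys = [tuple(row) for row in miss]
    return pd.Series(np.arange(len(keys))).groupby(keys).groups

Each value of the resulting dict is the set of row indices sharing one missingness pattern.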
When I first made some data to try this out, it looked like this:
import pymc as mc, numpy as np, pandas as pd, random

n = 100000
p = 5

# draw a random true probability vector (mc.rdirichlet returns the first
# p-1 coordinates, so append the remaining mass)
pi_true = mc.rdirichlet(np.ones(p))
pi_true = np.hstack([pi_true, 1. - pi_true.sum()])

# complete data: one-hot indicator rows
x_true = mc.rmultinomial(1, pi_true, size=n)

# observed data: blank out 3 of the 5 indicators in each row at random
x_obs = np.array(x_true, dtype=float)
for i in range(n):
    for j in random.sample(range(p), 3):
        x_obs[i, j] = np.nan
At first, I was pretty pleased with myself when I managed to make a PyMC model and an E-step and M-step that converged to something like the true value of $\pi$. The model is not super slick:
# the parameter of interest, started at the uniform distribution
pi = mc.Uninformative('pi', value=np.ones(p)/p)

# initialize the complete-data matrix: zero out the missing cells, and for
# rows whose observed cells are all zero, put the unit in the first missing cell
x_missing = np.isnan(x_obs)
x_initial = x_obs.copy()
x_initial[x_missing] = 0.
for i in range(n):
    if x_initial[i].sum() == 0:
        j = np.where(x_missing[i])[0][0]
        x_initial[i, j] = 1.

@mc.stochastic
def x(pi=pi, value=x_initial):
    return mc.multinomial_like(value, 1, pi)

@mc.observed
def y(x=x, value=x_obs):
    # the observed cells of x must match the data exactly
    if np.allclose(x[~x_missing], value[~x_missing]):
        return 0
    else:
        return -np.inf
And the E-step/M-step parts are pretty simple:
def E_step():
    # fill in expected values for the missing indicators, given the current pi
    x_new = np.array(x_obs, dtype=float)
    for i in range(n):
        if x_new[i, ~x_missing[i]].sum() == 0:
            # the unit's cell is among the missing ones: spread pi's mass
            # over them, normalized to sum to unity
            conditional_pi_sum = pi.value[x_missing[i]].sum()
            for j in np.where(x_missing[i])[0]:
                x_new[i, j] = pi.value[j] / conditional_pi_sum
        else:
            # the unit's cell was observed, so all missing indicators are zero
            x_new[i, x_missing[i]] = 0.
    x.value = x_new

def M_step():
    # the usual multinomial MLE from the (fractional) completed counts
    counts = x.value.sum(axis=0)
    pi.value = counts / counts.sum()
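Tying these together is just a matter of alternating the two steps until $\pi$ stops moving; a minimal sketch of the driver loop:

# alternate E- and M-steps until the estimate of pi stabilizes
for t in range(25):
    E_step()
    M_step()
    print t, np.round_(pi.value, 3)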
But the way the values converge does look nice:
[figure: the EM estimates of $\pi$ converging to the true values over successive iterations]
The thing that made me feel silly was comparing this fancy-pants approach to the result of averaging all of the non-empty cells of x_obs:
ests = pd.DataFrame(dict(pr=pi_true, true=x_true.mean(0),
naive=pd.DataFrame(x_obs).mean(), em=pi.value),
columns=['pr', 'true', 'naive', 'em']).sort('true')
print np.round_(ests, 3)
pr true naive em
2 0.101 0.101 0.100 0.101
0 0.106 0.106 0.108 0.108
3 0.211 0.208 0.209 0.208
1 0.269 0.271 0.272 0.271
4 0.313 0.313 0.314 0.313
Simple averages are just as good as EM, at least for the simplest distribution I could think of based on the example. In hindsight, that makes sense: the cells were blanked out completely at random, so the naive column means are unbiased anyway.
To see why this EM business is worth the effort requires a more elaborate model of missingness. I made one, but it is a little bit messy. Can you make one that is nice and neat?
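For the record, here is one simple possibility, a sketch of my own construction (not the messy one mentioned above): make the data missing at random but not completely at random, by always revealing column 1 and letting its observed value decide which other columns get hidden. The naive column means are then biased (for instance, column 0 is only ever observed when column 1 is zero), while the E-step above still fills in the correct fractions, because the missingness depends only on observed values:

# MAR-but-not-MCAR mechanism: column 1 is never hidden, and the set of
# hidden columns depends on its observed value
x_obs = np.array(x_true, dtype=float)
for i in range(n):
    if x_true[i, 1] == 1:
        hidden = [0, 2, 4]  # column 0 is hidden whenever column 1 holds the unit
    else:
        hidden = [2, 3, 4]
    for j in hidden:
        x_obs[i, j] = np.nan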
Comments Off on Classic EM in Python: Multinomial sampling
Filed under statistics, Uncategorized
R in ipython
This is going to be great. Now I’ve got to figure out how to make Rpy work, though!
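If I have the mechanics right, the rmagic extension in the new ipython is built on rpy2, so a first test (assuming rpy2 is already installed) might look like:

# in an ipython session:
%load_ext rmagic

%R x <- rnorm(100); mean(x)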
There is a funny little discussion on Twitter, which is where I found out about it.
Comments Off on R in ipython
Filed under Uncategorized
Turing Centennial Series from In Theory blog
I have greatly enjoyed the series of posts that Luca Trevisan brought together in honor of Alan Turing’s 100th birthday. He introduced the series as follows:
Within the Turing festivities, I think it would be interesting to talk about how things have changed (or not) since Turing’s time for people who do academic work in cryptography and in the theory of computing and who are gay or lesbian.
So I have invited a number of gay and lesbian colleagues to write guest posts talking about how things have been for them, and so far half a dozen have tentatively accepted. Their posts will appear next month which, besides being Turing’s centennial month, also happens to be the anniversary of the Stonewall riots.
- Turing Centennial Post 0: Oded Goldreich
- Turing Centennial Post 1: Irit Dinur
- Turing Centennial Post 2: Günter Ziegler
- Turing Centennial Post 3: Sampath Kannan
- Turing Centennial Post 4: Luca Trevisan
- Turing Centennial Post 5: Martin Farach-Colton
- Turing Centennial Post 6: Rosario Gennaro
- Turing Centennial Post 7: Ashwin Nayak
Comments Off on Turing Centennial Series from In Theory blog
Filed under Uncategorized
The road to knowledge is asking many questions
I got a little peeved when reading too many documents on the computer this morning, and I asked the internet “Why does acrobat have hyperlinks but no back button???”
And it told me: just use Alt+Left Arrow, or click through a few menus and you can add a back button to the toolbar yourself.
Thanks internet.
Comments Off on The road to knowledge is asking many questions
Filed under Uncategorized
Powell’s Method for Maximization in PyMC
I have been using “Powell’s Method” to find maximum likelihood (ahem, maximum a posteriori) estimates in my PyMC models for years now. It is something that I arrived at by trying all the options once, and it seemed to work most reliably. But what does it do? I never bothered to figure it out until today.
It does something very reasonable: it optimizes a multidimensional function along one direction at a time. And it does something very tricky: it periodically updates the basis of search directions, so that it quickly finds a “good” set of directions to optimize over. Now that sounds familiar, doesn’t it? It is definitely the same kind of trick that makes the Adaptive Metropolis step method a winner in MCMC.
The 48-year-old paper introducing the approach, M. J. D. Powell, An efficient method for finding the minimum of a function of several variables without calculating derivatives, is quite readable today. If you want to see it in action, I added an ipython notebook to my pymc examples repository on github: [ipynb] [py] [pdf].
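In PyMC2 this is one keyword away from the default. Here is a minimal sketch with a made-up normal model (the model is my toy example, not one from the notebook):

import pymc as mc, numpy as np

# toy model: estimate the mean of some normally distributed data
y_obs = np.random.normal(2., 1., size=100)
mu = mc.Uninformative('mu', value=0.)
y = mc.Normal('y', mu=mu, tau=1., value=y_obs, observed=True)

# fit the MAP estimate with scipy's Powell optimizer
m = mc.MAP([mu, y])
m.fit(method='fmin_powell')
print mu.value  # should land near 2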
Comments Off on Powell’s Method for Maximization in PyMC
Filed under Uncategorized
NSF Program Solicitation for Smart Health
This NSF Program Solicitation crossed my desk recently. It is for “Smart Health and Wellbeing”, which includes a lot of healthy algorithms topics among the “new tools and methods” it lists. For example:
From Data to Knowledge to Decisions:
- Investigate methods and algorithms for aggregation of multi-scale clinical, biomedical, contextual, and environmental data about each patient (EHR, personal health records – PHR, etc.), and unified and extensible metadata standards, and decision support tools to facilitate optimized patient-centered evidence-based decisions.
- Integrate patient information with delivery systems performance and economic models to support operations management decisions.
- Develop robust knowledge representations and reasoning algorithms to support inferences based on individual or population health data, multiple sources of potentially conflicting information while complying with applicable policies and preferences.
- Develop innovative technology for the secondary use of health data to support assisted and automated discovery of reliable knowledge from aggregated population health records and predictive modeling and simulation of health and disease at multiple levels from cellular to individuals/patients to populations, along with robust validation and integration of empirical data into the models.
- Develop understanding of how families, communities, informal caregivers, professional medical teams and patients interpret care and treatment.
- Increase understanding of issues (technological, behavioral, socio-economic, value-driven actions, ethical, systemic) that interfere with patients’ collaboration in care team and adherence to treatment and wellness regimens.
Comments Off on NSF Program Solicitation for Smart Health
Filed under Uncategorized
Resuscitating Healthy Algorithms
Wow, 2.5 months can just fly by! I’m crawling out from under a crushing workload, and ready to rejoin the world of applied theoretical computer science and python hacking. Did I miss anything?
Filed under Uncategorized
Coolest Demo at SciPy 2011
Speaking of SciPy 2011 (as I was in my last post), the coolest, most jaw-dropping demo I saw there was, hands down, the one for the new ipython. The most cutting-edge stuff is available on the web. I want it.
Filed under software engineering, Uncategorized
What is life expectancy?
Hint: It’s not what you think.
(This is a post that I never finished/barely started almost two years ago.)
Filed under Uncategorized
