# Category Archives: statistics

## Classic EM in Python: Multinomial sampling

In the classic paper on the EM algorithm, the extensive example section begins with a multinomial modeling example that is theoretically very similar to the warm-up problem on 197 animals:

We can think of the complete data as an $n \times p$ matrix $x$ whose $(i,j)$ element is unity if the $i$-th unit belongs in the $j$-th of $p$ possible cells, and is zero otherwise. The $i$-th row of $x$ contains $p-1$ zeros and one unity, but if the $i$-th unit has incomplete data, some of the indicators in the $i$-th row of $x$ are observed to be zero, while the others are missing and we know only that one of them must be unity. The E-step then assigns to the missing indicators fractions that sum to unity within each unit, the assigned values being expectations given the current estimate of $\phi$. The M-step then becomes the usual estimation of $\phi$ from the observed and assigned values of the indicators summed over the units.

In practice, it is convenient to collect together those units with the same pattern of missing indicators, since the filled in fractional counts will be the same for each; hence one may think of the procedure as filling in estimated counts for each of the missing cells within each group of units having the same pattern of missing data.

When I first made some data to try this out, it looked like this:

import pymc as mc, numpy as np, pandas as pd, random

n = 100000
p = 5

pi_true = mc.rdirichlet(np.ones(p))
pi_true = np.hstack([pi_true, 1-pi_true.sum()])
x_true = mc.rmultinomial(1, pi_true, size=n)

x_obs = array(x_true, dtype=float)
for i in range(n):
for j in random.sample(range(p), 3):
x_obs[i,j] = np.nan


At first, I was pretty pleased with myself when I managed to make a PyMC model and an E-step and M-step that converged to something like the true value of $\pi$. The model is not super slick:

pi = mc.Uninformative('pi', value=np.ones(p)/p)

x_missing = np.isnan(x_obs)
x_initial = x_obs.copy()
x_initial[x_missing] = 0.
for i in range(n):
if x_initial[i].sum() == 0:
j = np.where(x_missing[i])[0][0]
x_initial[i,j] = 1.
@mc.stochastic
def x(pi=pi, value=x_initial):
return mc.multinomial_like(value, 1, pi)

@mc.observed
def y(x=x, value=x_obs):
if np.allclose(x[~x_missing], value[~x_missing]):
return 0
else:
return -np.inf


And the E-step/M-step parts are pretty simple:

def E_step():
x_new = array(x_obs, dtype=float)
for i in range(n):
if x_new[i, ~x_missing[i]].sum() == 0:
conditional_pi_sum = pi.value[x_missing[i]].sum()
for j in np.where(x_missing[i])[0]:
x_new[i,j] = pi.value[j] / conditional_pi_sum
else:
x_new[i, x_missing[i]] = 0.
x.value = x_new

def M_step():
counts = x.value.sum(axis=0)
pi.value = (counts / counts.sum())


But the way the values converge does look nice:

The thing that made me feel silly was comparing this fancy-pants approach to the result of averaging all of the non-empty cells of x_obs:

ests = pd.DataFrame(dict(pr=pi_true, true=x_true.mean(0),
naive=pd.DataFrame(x_obs).mean(), em=pi.value),
columns=['pr', 'true', 'naive', 'em']).sort('true')
print np.round_(ests, 3)
pr   true  naive     em
2  0.101  0.101  0.100  0.101
0  0.106  0.106  0.108  0.108
3  0.211  0.208  0.209  0.208
1  0.269  0.271  0.272  0.271
4  0.313  0.313  0.314  0.313


Simple averages are just as good as EM, for the simplest distribution I could think of based on the example, anyways.

To see why this EM business is worth the effort requires a more elaborate model of missingness. I made one, but it is a little bit messy. Can you make one that is nice and neat?

Filed under statistics, Uncategorized

## Classic EM in Python: Warm-up problem of 197 animals in PyMC

The classic paper on the EM algorithm begins with a little application in multinomial modeling:

Rao (1965, pp. 368-369) presents data in which 197 animals are distributed multinomially into four categories, so that the observed data consist of $y = (y_1, y_2, y_3, y_4) = (125, 18, 20, 34).$

A genetic model for the population specifies cell probabilities $\left(\frac12 + \frac14\pi, \frac14(1-\pi), \frac14(1 -\pi), \frac14\pi\right)$ for some $\pi$ with $0\leq \pi \leq 1$.

Solving this analytically sounds very much like a 1960′s pastime to me (answer: $\pi = \frac{15 + \sqrt{53809}}{394})$, and a modern computer with PyMC can make short work of numerically approximating this maximum likelihood estimation. It takes no more than this:

import pymc as mc, numpy as np

pi = mc.Uniform('pi', 0, 1)
y = mc.Multinomial('y', n=197,
p=[.5 + .25*pi, .25*(1-pi), .25*(1-pi), .25*pi],
value=[125, 18, 20, 34], observed=True)

mc.MAP([pi,y]).fit()
print pi.value


But the point is to warm-up, not to innovate. The EM way is to introduce a latent variable $x = (x_1, x_2, x_3, x_4, x_5)$ and set $y_1 = x_1+x_2, y_2=x_3, y_3=x_4, y_4=x_5$, and work with a multinomial distribution for $x$ that induces the multinomial distribution above on $y$. The art is determining a good latent variable, mapping, and distribution, and “good” is meant in terms of the computation it yields:

import pymc as mc, numpy as np

pi = mc.Uniform('pi', 0, 1)

@mc.stochastic
def x(n=197, p=[.5, .25*pi, .25*(1-pi), .25*(1-pi), .25*pi], value=[125.,0.,18.,20.,34.]):
return mc.multinomial_like(np.round_(value), n, p)

@mc.observed
def y(x=x, value=[125, 18, 20, 34]):
if np.allclose([x[0]+x[1], x[2], x[3], x[4]], value):
return 0
else:
return -np.inf


It is no longer possible to get a good fit to an mc.MAP object for this model (why?), but EM does not need to. The EM approach is to alternate between two steps:

• Update $x$ (E-step, because it is set to its conditional expectation based on the current value of $\pi$)
• Update $\pi$ (M-step, because it is chosen to maximize the likelihood for the current value of $x$)

This is simply implemented in PyMC and quick to get the goods:

def E_step():
x.value = [125 * .5 / (.5 + .25*pi.value), 125 * .25*pi.value / (.5 + .25*pi.value), 18, 20, 34]
def M_step():
pi.value = (x.value[1] + 34.) / (x.value[1] + 34. + 18. + 20.)
for i in range(5):
print 'iter %2d: pi=%0.4f, X=%s' %(i, pi.value, x.value)
E_step()
M_step()


It is not fair to compare speeds without making both run to the same tolerance, but when that is added EM is 2.5 times faster. That could be important sometimes, I suppose.

1 Comment

Filed under statistics

## Rewind Data Augmentation to EM

The original paper on Data Augmentation (DA) got me thinking it was time to have a careful look at the original paper on EM. These are both highly cited papers, and Google scholar says the DA paper has been cited 2538 times (a lot!) and the EM paper has been cited 31328 times (is that possible?).

EM (which stands of Expectation-Maximization) is something I have heard about a lot, and, as an algorithm, it is even something I understood when I read a blog post relating it to a clustering algorithm that I now cannot find (maybe on the Geomblog). The arguments from the EM to DA paper, which I read recently, makes me think that there is something more than the algorithm going on here. In this later paper, Tanner and Wong identify the importance of Section 4 of the EM paper, which provides a collection of example applications of their EM algorithm. Is there something about this way of thinking that is even more important than the algorithm itself?

Filed under statistics

## What is data augmentation?

I’ve started summarizing papers for MathSciNet again, a strangely arcane practice coordinated by the American Mathematical Society. MathSciNet is a vast database of short reviews of math publications, and I loved it when I was writing the background sections of my graduate thesis. It was so useful then that I’m compelled to write summaries for it now, if I can make the time. Like reviewing, it makes me read a range of recent papers, but unlike reviewing, I don’t need to make a fuss about things that seem wrong. Fun.

I used to get to read some of the latest and greatest results in random graphs this way, but since I started up again, I’ve adjusted my preferences to receive some latest and greatest results in statistical computation. Hence my new appreciation for “data augmentation”, a term that doesn’t seem like good branding to me, but might be the key to MCMC coming into vogue for Bayesian computation.

Here is a little research list for my future self to investigate further:

Filed under statistics

## Tukey quote I half-remembered

I was trying to remember some quote by the exploratory data analysis master John Tukey yesterday, and I think this is it:

No catalog of techniques can convey a willingness to look for what can be seen, whether or not anticipated. Yet this is at the heart of exploratory data analysis. The graph paper—and transparencies—are there, not as a technique, but rather as a recognition that the picture-examining eye is the best finder we have of the wholly unanticipated.

It is from John W. Tukey, We Need Both Exploratory and Confirmatory, The American Statistician, Vol. 34, No. 1 (Feb., 1980), pp. 23-25.

I remembered a version about the visual cortex as a the most advance signal processing device, so maybe there is another version of this out there.

Filed under statistics

## Parameterizing Negative Binomial distributions

The negative binomial distribution is cool. Sometimes I think that.

Sometimes I think it is more trouble than it’s worth, a complicated mess.

Today, both.

Wikipedia and PyMC parameterize it differently, and it is a source of continuing confusion for me, so I’m just going to write it out here and have my own reference. (Which will match with PyMC, I hope!)

The important thing about the negative binomial, as far as I’m concerned, is that it is like a Poisson distribution, but “over-dispersed”. That is to say that the standard deviation is not always the square root of the mean. So I’d like to parameterize it with a parameter $\mu$ for the mean and $\delta$ for the dispersion. This is almost what PyMC does, except it calls the dispersion parameter $\alpha$ instead of $\delta$.

The slightly less important, but still informative, thing about the negative binomial, as far as I’m concerned, is that the way it is like a Poisson distribution is very direct. A negative binomial is a Poisson that has a Gamma-distributed random variable for its rate. In other words (symbols?), $Y \sim \text{NegativeBinomial}(\mu, \delta)$ is just shorthand for

$Y \sim \text{Poisson}(\lambda),$
$\lambda \sim \text{Gamma}(\mu, \delta).$

Unfortunately, nobody parameterizes the Gamma distribution this way. And so things get really confusing.

The way to get unconfused is to write out the distributions, although after they’re written, you might doubt me:

The negative binomial distribution is
$f(k \mid \mu, \delta) = \frac{\Gamma(k+\delta)}{k! \Gamma(\delta)} (\delta/(\mu+\delta))^\delta (\mu/(\mu+\delta))^k$
and the Poisson distribution is
$f(k \mid \lambda) = \frac{e^{-\lambda}\lambda^k}{k!}$
and the Gamma distribution is
$f(x \mid \alpha, \beta) = \frac{\beta^{\alpha}x^{\alpha-1}e^{-\beta x}}{\Gamma(\alpha)}$

Hmm, does that help yet? If $\alpha = \delta$ and $\beta = \delta/\mu$, it all works out:
$\frac{\Gamma(k+\delta)}{\Gamma(\delta)k!} \left(\frac{\delta}{\pi+\delta}\right)^\delta \left(\frac{\pi}{\pi+\delta}\right)^k = \int_0^\infty \frac{e^{-\lambda}\lambda^k}{k!} \lambda^{\delta-1} e^{-\lambda \delta/\mu} \frac{(\delta/\mu)^{\delta}}{\Gamma(\delta)}d \lambda.$

But instead of integrating it analytically (or in addition to), I am extra re-assured by seeing the results of a little PyMC model for this:

I put a notebook for making this plot in my pymc-examples repository. Love those notebooks. [pdf] [ipynb]

1 Comment

Filed under statistics

## PyMC at SciPy 2011

I just returned from the SciPy 2011 conference in Austin. Definitely a different experience than a theory conference, and definitely different than the mega-conferences I’ve found myself at lately. I think I like it. My goal was to evangelize for PyMC a little bit, and I think that went successfully. I even got to meet PyMC founder Chris Fonnesbeck in person (about 30 seconds before we presented a 4 hour tutorial together).

For the tutorial, I put together a set of PyMC-by-Example slides and code to dig into that silly relationship between Human Development Index and Total Fertility Rate that foiled my best attempts at Bayesian model selection so long ago.

I’m not sure the slides stand on their own, but together with the code samples they should reproduce my portion of the talk pretty well. I even started writing it up for people who want to read it in paper form, but then I ran out of momentum. Patches welcome.

Filed under education, global health, MCMC, statistics

## Computation time and model development

20 seconds, 20 minutes, or 20 hours.  These are all amounts of time that a computational method I’ve been working at some time has taken to complete processing.  They each lead to a very different experience for the model developer, and probably in the end for the model, too. Twenty seconds is definitely what I prefer.

Filed under statistics, TCS

## New Stats Books

Here is a new book on Bayesian stats that Kyle forwarded on to me: Principles of Uncertainty.  Chapter 11 looks unique, on “multiparty problems”, and a pdf of the whole thing is available from the book website for download.