# Category Archives: Uncategorized

## Graph Analytics Challenge

I was playing around with SPARQL queries and the semantic web earlier this year, inspired in part by a contest I entered. Well, the results are in: my project was second runner-up! Of course, I would have liked to take first place, but the projects that beat mine were both really cool. Information Week did a nice story on all of us.

Filed under Uncategorized

## Math and Dance

I almost didn’t share these HarleMCMC videos, but how long could I resist, really?

We’ll see how this holds up to repeated viewing…

Here is a math/dance video for the ages.

Filed under MCMC, Uncategorized

## Better typography for IPython notebooks, now

I recently came across a long blog post about how to make IPython notebooks more aesthetically pleasing, and I think there is a lot to be learned there.

The good news is that you can try this out on a notebook-by-notebook basis with a little trick. Just drop this into a cell in any IPython notebook:

    import IPython.core.display

    IPython.core.display.HTML("""
    <style>

    div.input {
        width: 105ex; /* about 80 chars + buffer */
    }

    div.text_cell {
        width: 105ex; /* instead of 100% */
    }

    div.text_cell_render {
        /* font-family: "Helvetica Neue", Arial, Helvetica, Geneva, sans-serif; */
        font-family: "Charis SIL", serif; /* make non-code text serif */
        line-height: 145%; /* added for some line spacing of text */
        width: 105ex; /* instead of 'inherit', for shorter lines */
    }

    /* Set the size of the headers */
    div.text_cell_render h1 {
        font-size: 18pt;
    }

    div.text_cell_render h2 {
        font-size: 14pt;
    }

    .CodeMirror {
        font-family: Consolas, monospace;
    }

    </style>
    """)


Filed under Uncategorized


## Classic EM in Python: Multinomial sampling

In the classic paper on the EM algorithm (Dempster, Laird, and Rubin, 1977), the extensive example section begins with a multinomial modeling example that is theoretically very similar to the warm-up problem on 197 animals:

We can think of the complete data as an $n \times p$ matrix $x$ whose $(i,j)$ element is unity if the $i$-th unit belongs in the $j$-th of $p$ possible cells, and is zero otherwise. The $i$-th row of $x$ contains $p-1$ zeros and one unity, but if the $i$-th unit has incomplete data, some of the indicators in the $i$-th row of $x$ are observed to be zero, while the others are missing and we know only that one of them must be unity. The E-step then assigns to the missing indicators fractions that sum to unity within each unit, the assigned values being expectations given the current estimate of $\phi$. The M-step then becomes the usual estimation of $\phi$ from the observed and assigned values of the indicators summed over the units.

In practice, it is convenient to collect together those units with the same pattern of missing indicators, since the filled in fractional counts will be the same for each; hence one may think of the procedure as filling in estimated counts for each of the missing cells within each group of units having the same pattern of missing data.
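In symbols (my paraphrase of the recipe above): let $M_i$ be the set of missing cells for unit $i$. If every observed indicator in row $i$ is zero, the E-step fills each missing cell $j \in M_i$ with its conditional probability under the current estimate, $\hat{x}_{ij} = \pi_j / \sum_{k \in M_i} \pi_k$; rows with an observed one get zeros in all their missing cells. The M-step then re-estimates each cell probability as $\pi_j \leftarrow \frac{1}{n} \sum_{i=1}^n \hat{x}_{ij}$.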

When I first made some data to try this out, it looked like this:

    import pymc as mc, numpy as np, pandas as pd, random

    n = 100000
    p = 5

    # pymc's rdirichlet returns the first p-1 cell probabilities;
    # append the last one so pi_true sums to 1
    pi_true = mc.rdirichlet(np.ones(p))
    pi_true = np.hstack([pi_true, 1 - pi_true.sum()])
    x_true = mc.rmultinomial(1, pi_true, size=n)

    # hide 3 of the 5 indicators in each row, completely at random
    x_obs = np.array(x_true, dtype=float)
    for i in range(n):
        for j in random.sample(range(p), 3):
            x_obs[i, j] = np.nan

At first, I was pretty pleased with myself when I managed to make a PyMC model and an E-step and M-step that converged to something like the true value of $\pi$. The model is not super slick:

    pi = mc.Uninformative('pi', value=np.ones(p)/p)

    x_missing = np.isnan(x_obs)

    # initial value for x: zero out the missing cells, but make sure every
    # row still contains exactly one 1, so it is a valid multinomial draw
    x_initial = x_obs.copy()
    x_initial[x_missing] = 0.
    for i in range(n):
        if x_initial[i].sum() == 0:
            j = np.where(x_missing[i])[0][0]
            x_initial[i, j] = 1.

    @mc.stochastic
    def x(pi=pi, value=x_initial):
        return mc.multinomial_like(value, 1, pi)

    @mc.observed
    def y(x=x, value=x_obs):
        # the observed cells of x must match the data exactly
        if np.allclose(x[~x_missing], value[~x_missing]):
            return 0
        else:
            return -np.inf


And the E-step/M-step parts are pretty simple:

    def E_step():
        # fill in expected values for the missing indicators, given current pi
        x_new = np.array(x_obs, dtype=float)
        for i in range(n):
            if x_new[i, ~x_missing[i]].sum() == 0:
                # no observed 1 in this row: spread pi across the missing cells
                conditional_pi_sum = pi.value[x_missing[i]].sum()
                for j in np.where(x_missing[i])[0]:
                    x_new[i, j] = pi.value[j] / conditional_pi_sum
            else:
                # an observed 1 pins the row down; missing cells must be 0
                x_new[i, x_missing[i]] = 0.
        x.value = x_new

    def M_step():
        # re-estimate pi from the (fractionally) filled-in indicators
        counts = x.value.sum(axis=0)
        pi.value = counts / counts.sum()
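
To run it, alternate the two steps until $\pi$ stops moving. Here is a minimal driver loop (a sketch; the iteration cap and tolerance are arbitrary choices):

    for em_iter in range(100):
        pi_last = pi.value.copy()
        E_step()
        M_step()
        print em_iter, np.round_(pi.value, 4)
        if np.allclose(pi.value, pi_last, atol=1e-6):
            break  # converged: pi barely moved this iteration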


But the way the values converge does look nice.

The thing that made me feel silly was comparing this fancy-pants approach to the result of averaging all of the non-empty cells of x_obs:

    ests = pd.DataFrame(dict(pr=pi_true, true=x_true.mean(0),
                             naive=pd.DataFrame(x_obs).mean(), em=pi.value),
                        columns=['pr', 'true', 'naive', 'em']).sort('true')
    print np.round_(ests, 3)

          pr   true  naive     em
    2  0.101  0.101  0.100  0.101
    0  0.106  0.106  0.108  0.108
    3  0.211  0.208  0.209  0.208
    1  0.269  0.271  0.272  0.271
    4  0.313  0.313  0.314  0.313


Simple averages are just as good as EM, at least for the simplest missingness pattern I could think of based on the example. In hindsight that makes sense: the indicators are missing completely at random, so the naive column means are already unbiased estimates of $\pi$, and EM has nothing to correct.

To see why this EM business is worth the effort, you need a more elaborate model of missingness. I made one, but it is a little bit messy. Can you make one that is nice and neat? A sketch of the kind of data that defeats the naive averages is below, to get you started.
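
For example, here is one toy way to make the missingness informative (my own construction for illustration, not the messy model I mentioned): hide more indicators for units that land in cell 0, so the observed indicators are no longer a random sample.

    # informative missingness: units in cell 0 lose 4 indicators, others lose 2
    x_obs_mnar = np.array(x_true, dtype=float)
    for i in range(n):
        n_missing = 4 if x_true[i, 0] == 1 else 2
        for j in random.sample(range(p), n_missing):
            x_obs_mnar[i, j] = np.nan

    # now the naive column means are biased (compare to pi_true)
    print np.round_(pd.DataFrame(x_obs_mnar).mean().values, 3)
    print np.round_(pi_true, 3)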

Filed under statistics, Uncategorized

## R in IPython

This is going to be great. Now I’ve got to figure out how to make Rpy work, though!
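
Here is the kind of thing I am hoping for, via the rmagic extension (a sketch; it assumes rpy2 is installed and working, which is exactly the part I still have to figure out):

    # in an IPython notebook cell:
    %load_ext rmagic

    # line magic: run a bit of R and show its output inline
    %R x <- rnorm(100); summary(x)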

There is a funny little discussion on Twitter, which is where I found out about it.

Filed under Uncategorized

## Turing Centennial Series from In Theory blog

I have greatly enjoyed the series of posts that Luca Trevisan brought together in honor of Alan Turing’s 100th birthday.  He introduced the series as follows:

Within the Turing festivities, I think it would be interesting to talk about how things have changed (or not) since Turing’s time for people who do academic work in cryptography and in the theory of computing and who are gay or lesbian.

So I have invited a number of gay and lesbian colleagues to write guest posts talking about how things have been for them, and so far half a dozen have tentatively accepted. Their posts will appear next month which, besides being Turing’s centennial month, also happens to be the anniversary of the Stonewall riots.

The 8 contributed posts are collected here; I recommend that you start from Post 0 and work your way up.

Filed under Uncategorized

## Acrobat’s missing back button

I got a little peeved when reading too many documents on the computer this morning, and I asked the internet, “Why does Acrobat have hyperlinks but no back button???”

And it told me: just use Alt+left-arrow (the “Previous View” shortcut), or click through the toolbar options a few times and you can add a back button.

Thanks internet.

Filed under Uncategorized

## Powell’s Method for Maximization in PyMC

I have been using “Powell’s Method” to find maximum likelihood (ahem, maximum a posteriori) estimates in my PyMC models for years now. It is something that I arrived at by trying all the options once, and it seemed to work most reliably. But what does it do? I never bothered to figure that out, until today.

It does something very reasonable: it optimizes a multidimensional function along one direction at a time. And it does something very tricky: it periodically updates the set of search directions, so that it quickly finds a “good” set of directions to optimize along. Now that sounds familiar, doesn’t it? It is definitely the same kind of trick that makes the Adaptive Metropolis step method a winner in MCMC.

The 48-year-old paper introducing the approach, M. J. D. Powell, “An efficient method for finding the minimum of a function of several variables without calculating derivatives,” is quite readable today. If you want to see it in action, I added an IPython notebook to my pymc examples repository on GitHub: [ipynb] [py] [pdf].
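
In PyMC, the way to invoke it is through mc.MAP, which hands the optimization to scipy.optimize.fmin_powell when you pass method='fmin_powell'. A minimal sketch with a throwaway model (the model itself is made up for illustration):

    import pymc as mc, numpy as np

    # toy model: normal data with an unknown mean
    mu = mc.Uninformative('mu', value=0.)
    y = mc.Normal('y', mu=mu, tau=1.,
                  value=np.random.normal(3., 1., size=100), observed=True)

    m = mc.MAP([mu, y])
    m.fit(method='fmin_powell')  # Powell's derivative-free direction-set search
    print mu.value  # MAP estimate; should be near the sample mean of y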