Category Archives: Uncategorized
ipython autoreloading
Comments Off on ipython autoreloading
Filed under Uncategorized
Classic EM in Python: Multinomial sampling
In the classic paper on the EM algorithm (Dempster, Laird, and Rubin, 1977), the extensive example section begins with a multinomial modeling example that is theoretically very similar to the warm-up problem on the 197 animals:
We can think of the complete data as an $n \times p$ matrix $x$ whose $(i,j)$ element is unity if the $i$-th unit belongs in the $j$-th of $p$ possible cells, and is zero otherwise. The $i$-th row of $x$ contains $p-1$ zeros and one unity, but if the $i$-th unit has incomplete data, some of the indicators in the $i$-th row of $x$ are observed to be zero, while the others are missing and we know only that one of them must be unity. The E-step then assigns to the missing indicators fractions that sum to unity within each unit, the assigned values being expectations given the current estimate of $\phi$. The M-step then becomes the usual estimation of $\phi$ from the observed and assigned values of the indicators summed over the units.
In practice, it is convenient to collect together those units with the same pattern of missing indicators, since the filled in fractional counts will be the same for each; hence one may think of the procedure as filling in estimated counts for each of the missing cells within each group of units having the same pattern of missing data.
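As an aside, that grouping trick is easy to express with pandas. Here is a minimal sketch of my own (the function name and layout are made up, not from the paper) that buckets the rows of an incomplete indicator matrix by their pattern of missing entries:

import numpy as np, pandas as pd

def group_by_missingness(x_obs):
    # x_obs is an (n, p) 0/1 indicator matrix with np.nan marking missing
    # entries; rows sharing a missingness pattern get identical filled-in
    # fractions, so the E-step needs only one computation per group
    miss = np.isnan(x_obs)
    keys = [tuple(row) for row in miss]
    return pd.Series(np.arange(len(keys))).groupby(keys).groups

Each value of the resulting dict is the set of row indices sharing one missingness pattern.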
When I first made some data to try this out, it looked like this:
import pymc as mc, numpy as np, pandas as pd, random

n = 100000
p = 5

# draw a random true probability vector (mc.rdirichlet returns the first
# p-1 coordinates, so append the remaining mass)
pi_true = mc.rdirichlet(np.ones(p))
pi_true = np.hstack([pi_true, 1. - pi_true.sum()])

# complete data: one-hot indicator rows
x_true = mc.rmultinomial(1, pi_true, size=n)

# observed data: blank out 3 of the 5 indicators in each row at random
x_obs = np.array(x_true, dtype=float)
for i in range(n):
    for j in random.sample(range(p), 3):
        x_obs[i, j] = np.nan
At first, I was pretty pleased with myself when I managed to make a PyMC model and an E-step and M-step that converged to something like the true value of $\pi$. The model is not super slick:
# the parameter of interest, started at the uniform distribution
pi = mc.Uninformative('pi', value=np.ones(p)/p)

# initialize the complete-data matrix: zero out the missing cells, and for
# rows whose observed cells are all zero, put the unit in the first missing cell
x_missing = np.isnan(x_obs)
x_initial = x_obs.copy()
x_initial[x_missing] = 0.
for i in range(n):
    if x_initial[i].sum() == 0:
        j = np.where(x_missing[i])[0][0]
        x_initial[i, j] = 1.

@mc.stochastic
def x(pi=pi, value=x_initial):
    return mc.multinomial_like(value, 1, pi)

@mc.observed
def y(x=x, value=x_obs):
    # the observed cells of x must match the data exactly
    if np.allclose(x[~x_missing], value[~x_missing]):
        return 0
    else:
        return -np.inf
And the E-step/M-step parts are pretty simple:
def E_step():
    # fill in expected values for the missing indicators, given the current pi
    x_new = np.array(x_obs, dtype=float)
    for i in range(n):
        if x_new[i, ~x_missing[i]].sum() == 0:
            # the unit's cell is among the missing ones: spread pi's mass
            # over them, normalized to sum to unity
            conditional_pi_sum = pi.value[x_missing[i]].sum()
            for j in np.where(x_missing[i])[0]:
                x_new[i, j] = pi.value[j] / conditional_pi_sum
        else:
            # the unit's cell was observed, so all missing indicators are zero
            x_new[i, x_missing[i]] = 0.
    x.value = x_new

def M_step():
    # the usual multinomial MLE from the (fractional) completed counts
    counts = x.value.sum(axis=0)
    pi.value = counts / counts.sum()
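Tying these together is just a matter of alternating the two steps until $\pi$ stops moving; a minimal sketch of the driver loop:

# alternate E- and M-steps until the estimate of pi stabilizes
for t in range(25):
    E_step()
    M_step()
    print t, np.round_(pi.value, 3)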
But the way the values converge does look nice:
[figure: the EM estimates of $\pi$ converging to the true values over successive iterations]
The thing that made me feel silly was comparing this fancy-pants approach to the result of averaging all of the non-empty cells of x_obs:
ests = pd.DataFrame(dict(pr=pi_true, true=x_true.mean(0),
naive=pd.DataFrame(x_obs).mean(), em=pi.value),
columns=['pr', 'true', 'naive', 'em']).sort('true')
print np.round_(ests, 3)
pr true naive em
2 0.101 0.101 0.100 0.101
0 0.106 0.106 0.108 0.108
3 0.211 0.208 0.209 0.208
1 0.269 0.271 0.272 0.271
4 0.313 0.313 0.314 0.313
Simple averages are just as good as EM, at least for the simplest distribution I could think of based on the example. In hindsight, that makes sense: the cells were blanked out completely at random, so the naive column means are unbiased anyway.
To see why this EM business is worth the effort requires a more elaborate model of missingness. I made one, but it is a little bit messy. Can you make one that is nice and neat?
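For the record, here is one simple possibility, a sketch of my own construction (not the messy one mentioned above): make the data missing at random but not completely at random, by always revealing column 1 and letting its observed value decide which other columns get hidden. The naive column means are then biased (for instance, column 0 is only ever observed when column 1 is zero), while the E-step above still fills in the correct fractions, because the missingness depends only on observed values:

# MAR-but-not-MCAR mechanism: column 1 is never hidden, and the set of
# hidden columns depends on its observed value
x_obs = np.array(x_true, dtype=float)
for i in range(n):
    if x_true[i, 1] == 1:
        hidden = [0, 2, 4]  # column 0 is hidden whenever column 1 holds the unit
    else:
        hidden = [2, 3, 4]
    for j in hidden:
        x_obs[i, j] = np.nan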
Comments Off on Classic EM in Python: Multinomial sampling
Filed under statistics, Uncategorized
R in ipython
This is going to be great. Now I’ve got to figure out how to make Rpy work, though!
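If I have the mechanics right, the rmagic extension in the new ipython is built on rpy2, so a first test (assuming rpy2 is already installed) might look like:

# in an ipython session:
%load_ext rmagic

%R x <- rnorm(100); mean(x)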
There is a funny little discussion on Twitter, which is where I found out about it.
Comments Off on R in ipython
Filed under Uncategorized
Turing Centennial Series from In Theory blog
I have greatly enjoyed the series of posts that Luca Trevisan brought together in honor of Alan Turing’s 100th birthday. He introduced the series as follows:
Within the Turing festivities, I think it would be interesting to talk about how things have changed (or not) since Turing’s time for people who do academic work in cryptography and in the theory of computing and who are gay or lesbian.
So I have invited a number of gay and lesbian colleagues to write guest posts talking about how things have been for them, and so far half a dozen have tentatively accepted. Their posts will appear next month which, besides being Turing’s centennial month, also happens to be the anniversary of the Stonewall riots.
- Turing Centennial Post 0: Oded Goldreich
- Turing Centennial Post 1: Irit Dinur
- Turing Centennial Post 2: Günter Ziegler
- Turing Centennial Post 3: Sampath Kannan
- Turing Centennial Post 4: Luca Trevisan
- Turing Centennial Post 5: Martin Farach-Colton
- Turing Centennial Post 6: Rosario Gennaro
- Turing Centennial Post 7: Ashwin Nayak
Comments Off on Turing Centennial Series from In Theory blog
Filed under Uncategorized
The road to knowledge is asking many questions
I got a little peeved when reading too many documents on the computer this morning, and I asked the internet “Why does acrobat have hyperlinks but no back button???”
And it told me: just use Alt+Left Arrow, or click through a few menus and you can add a back button to the toolbar yourself.
Thanks internet.
Comments Off on The road to knowledge is asking many questions
Filed under Uncategorized
Powell’s Method for Maximization in PyMC
I have been using “Powell’s Method” to find maximum likelihood (ahem, maximum a posteriori) estimates in my PyMC models for years now. It is something that I arrived at by trying all the options once, and it seemed to work most reliably. But what does it do? I never bothered to figure it out until today.
It does something very reasonable: it optimizes a multidimensional function along one direction at a time. And it does something very tricky: it periodically updates the basis of search directions, so that it quickly finds a “good” set of directions to optimize over. Now that sounds familiar, doesn’t it? It is definitely the same kind of trick that makes the Adaptive Metropolis step method a winner in MCMC.
The 48-year-old paper introducing the approach, M. J. D. Powell, An efficient method for finding the minimum of a function of several variables without calculating derivatives, is quite readable today. If you want to see it in action, I added an ipython notebook to my pymc examples repository on github: [ipynb] [py] [pdf].
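In PyMC2 this is one keyword away from the default. Here is a minimal sketch with a made-up normal model (the model is my toy example, not one from the notebook):

import pymc as mc, numpy as np

# toy model: estimate the mean of some normally distributed data
y_obs = np.random.normal(2., 1., size=100)
mu = mc.Uninformative('mu', value=0.)
y = mc.Normal('y', mu=mu, tau=1., value=y_obs, observed=True)

# fit the MAP estimate with scipy's Powell optimizer
m = mc.MAP([mu, y])
m.fit(method='fmin_powell')
print mu.value  # should land near 2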
Comments Off on Powell’s Method for Maximization in PyMC
Filed under Uncategorized
NSF Program Solicitation for Smart Health
This NSF Program Solicitation crossed my desk recently. It is for “Smart Health and Wellbeing”, which includes a lot of healthy algorithms topics among the “new tools and methods” it lists. For example:
From Data to Knowledge to Decisions:
- Investigate methods and algorithms for aggregation of multi-scale clinical, biomedical, contextual, and environmental data about each patient (EHR, personal health records – PHR, etc.), and unified and extensible metadata standards, and decision support tools to facilitate optimized patient-centered evidence-based decisions.
- Integrate patient information with delivery systems performance and economic models to support operations management decisions.
- Develop robust knowledge representations and reasoning algorithms to support inferences based on individual or population health data, multiple sources of potentially conflicting information while complying with applicable policies and preferences.
- Develop innovative technology for the secondary use of health data to support assisted and automated discovery of reliable knowledge from aggregated population health records and predictive modeling and simulation of health and disease at multiple levels from cellular to individuals/patients to populations, along with robust validation and integration of empirical data into the models.
- Develop understanding of how families, communities, informal caregivers, professional medical teams and patients interpret care and treatment.
- Increase understanding of issues (technological, behavioral, socio-economic, value-driven actions, ethical, systemic) that interfere with patients’ collaboration in care team and adherence to treatment and wellness regimens.
Comments Off on NSF Program Solicitation for Smart Health
Filed under Uncategorized
Resuscitating Healthy Algorithms
Wow, 2.5 months can just fly by! I’m crawling out from under a crushing workload, and ready to rejoin the world of applied theoretical computer science and python hacking. Did I miss anything?
Filed under Uncategorized
Coolest Demo at SciPy 2011
Speaking of SciPy 2011 (as I was in my last post), the coolest, most jaw-dropping demo I saw there was, hands down, the one for the new ipython. The most cutting-edge stuff is available on the web. I want it.
Filed under software engineering, Uncategorized
What is life expectancy?
Hint: It’s not what you think.
(This is a post that I never finished/barely started almost two years ago.)
Filed under Uncategorized
