I’ve been reading about Sequential Monte Carlo recently, and I think it will fit well into the PyMC3 framework. I will give it a try when I have a free minute, but maybe someone else will be inspired to try it first. This paper includes some pseudocode.
Tag Archives: pymc
I have been watching the development of PyMC3 from a distance for some time now, and finally have had a chance to play around with it myself. It is coming along quite nicely! Here is a notebook Kyle posted to the mailing list recently which has a clean demonstration of using Normal and Laplace likelihoods in linear regression: http://nbviewer.ipython.org/c212194ecbd2ee050192/variable_selection.ipynb
I’ve been watching the next generation of PyMC come together over the last months, and there is some very exciting stuff happening. The part on GLM regression led me to a different project which is also of interest, a regression modeling minilanguage, called Patsy which “brings the convenience of R ‘formulas’ to Python.”
This package recently introduced a method for spline regression, and avoided all puns in naming. Impressive.
A recent question on the PyMC mailing list inspired me to make a really inefficient version of the Naive Bayes classifier. Enjoy.
In the classic paper on the EM algorithm, the extensive example section begins with a multinomial modeling example that is theoretically very similar to the warm-up problem on 197 animals:
We can think of the complete data as an matrix whose element is unity if the -th unit belongs in the -th of possible cells, and is zero otherwise. The -th row of contains zeros and one unity, but if the -th unit has incomplete data, some of the indicators in the -th row of are observed to be zero, while the others are missing and we know only that one of them must be unity. The E-step then assigns to the missing indicators fractions that sum to unity within each unit, the assigned values being expectations given the current estimate of . The M-step then becomes the usual estimation of from the observed and assigned values of the indicators summed over the units.
In practice, it is convenient to collect together those units with the same pattern of missing indicators, since the filled in fractional counts will be the same for each; hence one may think of the procedure as filling in estimated counts for each of the missing cells within each group of units having the same pattern of missing data.
When I first made some data to try this out, it looked like this:
import pymc as mc, numpy as np, pandas as pd, random n = 100000 p = 5 pi_true = mc.rdirichlet(np.ones(p)) pi_true = np.hstack([pi_true, 1-pi_true.sum()]) x_true = mc.rmultinomial(1, pi_true, size=n) x_obs = array(x_true, dtype=float) for i in range(n): for j in random.sample(range(p), 3): x_obs[i,j] = np.nan
At first, I was pretty pleased with myself when I managed to make a PyMC model and an E-step and M-step that converged to something like the true value of . The model is not super slick:
pi = mc.Uninformative('pi', value=np.ones(p)/p) x_missing = np.isnan(x_obs) x_initial = x_obs.copy() x_initial[x_missing] = 0. for i in range(n): if x_initial[i].sum() == 0: j = np.where(x_missing[i]) x_initial[i,j] = 1. @mc.stochastic def x(pi=pi, value=x_initial): return mc.multinomial_like(value, 1, pi) @mc.observed def y(x=x, value=x_obs): if np.allclose(x[~x_missing], value[~x_missing]): return 0 else: return -np.inf
And the E-step/M-step parts are pretty simple:
def E_step(): x_new = array(x_obs, dtype=float) for i in range(n): if x_new[i, ~x_missing[i]].sum() == 0: conditional_pi_sum = pi.value[x_missing[i]].sum() for j in np.where(x_missing[i]): x_new[i,j] = pi.value[j] / conditional_pi_sum else: x_new[i, x_missing[i]] = 0. x.value = x_new def M_step(): counts = x.value.sum(axis=0) pi.value = (counts / counts.sum())
The thing that made me feel silly was comparing this fancy-pants approach to the result of averaging all of the non-empty cells of
ests = pd.DataFrame(dict(pr=pi_true, true=x_true.mean(0), naive=pd.DataFrame(x_obs).mean(), em=pi.value), columns=['pr', 'true', 'naive', 'em']).sort('true') print np.round_(ests, 3) pr true naive em 2 0.101 0.101 0.100 0.101 0 0.106 0.106 0.108 0.108 3 0.211 0.208 0.209 0.208 1 0.269 0.271 0.272 0.271 4 0.313 0.313 0.314 0.313
Simple averages are just as good as EM, for the simplest distribution I could think of based on the example, anyways.
To see why this EM business is worth the effort requires a more elaborate model of missingness. I made one, but it is a little bit messy. Can you make one that is nice and neat?
The classic paper on the EM algorithm begins with a little application in multinomial modeling:
Rao (1965, pp. 368-369) presents data in which 197 animals are distributed multinomially into four categories, so that the observed data consist of
A genetic model for the population specifies cell probabilities for some with .
Solving this analytically sounds very much like a 1960’s pastime to me (answer: , and a modern computer with PyMC can make short work of numerically approximating this maximum likelihood estimation. It takes no more than this:
import pymc as mc, numpy as np pi = mc.Uniform('pi', 0, 1) y = mc.Multinomial('y', n=197, p=[.5 + .25*pi, .25*(1-pi), .25*(1-pi), .25*pi], value=[125, 18, 20, 34], observed=True) mc.MAP([pi,y]).fit() print pi.value
But the point is to warm-up, not to innovate. The EM way is to introduce a latent variable and set , and work with a multinomial distribution for that induces the multinomial distribution above on . The art is determining a good latent variable, mapping, and distribution, and “good” is meant in terms of the computation it yields:
import pymc as mc, numpy as np pi = mc.Uniform('pi', 0, 1) @mc.stochastic def x(n=197, p=[.5, .25*pi, .25*(1-pi), .25*(1-pi), .25*pi], value=[125.,0.,18.,20.,34.]): return mc.multinomial_like(np.round_(value), n, p) @mc.observed def y(x=x, value=[125, 18, 20, 34]): if np.allclose([x+x, x, x, x], value): return 0 else: return -np.inf
It is no longer possible to get a good fit to an mc.MAP object for this model (why?), but EM does not need to. The EM approach is to alternate between two steps:
- Update (E-step, because it is set to its conditional expectation based on the current value of )
- Update (M-step, because it is chosen to maximize the likelihood for the current value of )
This is simply implemented in PyMC and quick to get the goods:
def E_step(): x.value = [125 * .5 / (.5 + .25*pi.value), 125 * .25*pi.value / (.5 + .25*pi.value), 18, 20, 34] def M_step(): pi.value = (x.value + 34.) / (x.value + 34. + 18. + 20.) for i in range(5): print 'iter %2d: pi=%0.4f, X=%s' %(i, pi.value, x.value) E_step() M_step()
It is not fair to compare speeds without making both run to the same tolerance, but when that is added EM is 2.5 times faster. That could be important sometimes, I suppose.
I have had this idea for a while, to go through the examples from the OpenBUGS webpage and port them to PyMC, so that I can be sure I’m not going much slower than I could be, and so that people can compare MCMC samplers “apples-to-apples”. But its easy to have ideas. Acting on them takes more time.
So I’m happy that I finally found a little time to sit with Kyle Foreman and get started. We ported one example over, the “seeds” random effects logistic regression. It is a nice little example, and it also gave me a chance to put something in the ipython notebook, which I continue to think is a great way to share code.
I have been using “Powell’s Method” to find maximum likelihood (ahem, maximum a posteriori) estimates in my PyMC models for years now. It is something that I arrived at by trying all the options once, and it seemed to work most reliably. But what does it do? I never bothered to figure out, until today.
It does something very reasonable. That is to optimize a multidimensional function along one dimension at a time. And it does something very tricky, which is to update the basis for one-dimensional optimization periodically, so that it quickly finds a “good” set of dimensions to optimize over. Now that sounds familiar, doesn’t it? It is definitely the same kind of trick that makes the Adaptive Metropolis step method a winner in MCMC.
The 48-year-old paper introducing the approach, M. J. D. Powell, An efficient method for finding the minimum of a function of several variables without calculating derivatives, is quite readable today. If you want to see it in action, I added an ipython notebook to my pymc examples repository on github: [ipynb] [py] [pdf].
The negative binomial distribution is cool. Sometimes I think that.
Sometimes I think it is more trouble than it’s worth, a complicated mess.
Wikipedia and PyMC parameterize it differently, and it is a source of continuing confusion for me, so I’m just going to write it out here and have my own reference. (Which will match with PyMC, I hope!)
The important thing about the negative binomial, as far as I’m concerned, is that it is like a Poisson distribution, but “over-dispersed”. That is to say that the standard deviation is not always the square root of the mean. So I’d like to parameterize it with a parameter for the mean and for the dispersion. This is almost what PyMC does, except it calls the dispersion parameter instead of .
The slightly less important, but still informative, thing about the negative binomial, as far as I’m concerned, is that the way it is like a Poisson distribution is very direct. A negative binomial is a Poisson that has a Gamma-distributed random variable for its rate. In other words (symbols?), is just shorthand for
Unfortunately, nobody parameterizes the Gamma distribution this way. And so things get really confusing.
The way to get unconfused is to write out the distributions, although after they’re written, you might doubt me:
The negative binomial distribution is
and the Poisson distribution is
and the Gamma distribution is
Hmm, does that help yet? If and , it all works out:
But instead of integrating it analytically (or in addition to), I am extra re-assured by seeing the results of a little PyMC model for this: