As I was saying in my last post, I’ve been getting interested in actually running Markov Chain Monte Carlo algorithms, instead of trying to prove things about their asymptotic performance. It seems like the “stats” way to do this is to use R and WinBUGS, but I’ve always thought that R programming looks messy. Python is easier on my eyes.
So, it’s my good fortune that PyMC exists. This means I don’t need to do any hard work, like coding a Gibbs sampler or learning R. Let me show you.
Say we want to sample independent sets uniformly from a graph. This might not be what the creators of PyMC were imagining when they designed it, but it turns out to be pretty easy. To follow along at home, you need to have Python, NetworkX, and PyMC installed. Fortunately, these are all free (in both senses). If you’re using Windows, you can make life easy by using the Enthought Python Distribution.
Now we’re ready to sample independent sets in 20 lines or less.
from pymc import MCMC, Bernoulli, Potential
from pylab import Inf
from networkx import cycle_graph

class IndepSetMC(MCMC):
    def __init__(self, G=cycle_graph(9), beta=0.0):
        self.G, self.beta = G, beta
        # one binary variable per vertex: is v in the independent set?
        self.x = [Bernoulli(str(v), 0.5, value=0) for v in G.nodes_iter()]
        # one potential per vertex, forbidding edges inside the set
        self.psi = [self.IndepSetPotential(v, G[v]) for v in G.nodes_iter()]
        MCMC.__init__(self, [self.x, self.psi])

    def IndepSetPotential(self, v, N_v):
        def potential_logp(v, N_v):
            if v + max(N_v) > 1:
                return -Inf  # v and one of its neighbors are both in the set
            else:
                return self.beta * v
        return Potential(logp=potential_logp,
                         name='N_%d' % v,
                         parents={'v': self.x[v],
                                  'N_v': [self.x[w] for w in N_v]},
                         doc='vertex potential term')

M = IndepSetMC()
M.sample(1000)
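Once the sampler finishes, the sampled sets live in the traces of the vertex variables; here is one way to peek at them (a sketch, assuming PyMC’s trace-by-name lookup and the str(v) variable names from the code above):

traces = [M.trace(str(v))[:] for v in M.G.nodes_iter()]
# print the first few sampled independent sets as vertex lists
for i in range(5):
    print [v for v, tr in enumerate(traces) if tr[i]]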
Sidenote: NetworkX is another gem for theorists who feel the urge to do some computer experiments. If you want to sample independent sets from the random graph $G_{n,1/2}$ with $n = 128$, to check that none are larger than about $2\log_2 n = 14$, all you need to change in the above is the second-to-last line:
import networkx
M = IndepSetMC(networkx.fast_gnp_random_graph(128, 0.5))
M.sample(10)
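And to actually do the size check, you can tally the sampled set sizes from the traces (again a sketch, reusing the trace idiom from above):

from numpy import array
# one row per vertex, one column per sample; column sums are set sizes
sizes = array([M.trace(str(v))[:] for v in M.G.nodes_iter()]).sum(axis=0)
print sizes.max()   # whp this stays below about 2*log2(128) = 14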
This chain is supposed to have a stationary distribution where an independent set $S$ appears with probability proportional to $e^{\beta|S|}$. And it works wonderfully when $\beta = 0$, where all independent sets are equally likely. For other $\beta$, it is not working quite as I expected, so maybe I’ve misunderstood something. But as $\beta \to \infty$, I was expecting all the mass to concentrate on the largest independent sets, meaning under 20 lines of code to solve an NP-hard optimization problem; not too shabby. (It seems like the mass concentrates on the large sets as $\beta \to -\infty$, which is what I was saying I’m confused about.)
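One cheap sanity check, which I’ll sketch here (the $\beta$ values are arbitrary), is to rerun the chain at a few values of $\beta$ and watch the average sampled set size, which should grow with $\beta$:

from numpy import array
for beta in [-3.0, 0.0, 3.0]:
    M = IndepSetMC(beta=beta)
    M.sample(5000)
    sizes = array([M.trace(str(v))[:] for v in M.G.nodes_iter()]).sum(axis=0)
    print beta, sizes.mean()   # mean |S| should increase with beta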
How can we tell if PyMC is doing a good job?
Does any of the above seem suspicious to you? This chain should not mix well when $\beta$ is large; I don’t expect to find an efficient algorithm for an NP-hard optimization problem floating around on the web. But does the algorithm know when it fails?
From a theory perspective, this is an open question, even for many specific (simpler) chains. And it is something that practitioners (like me?) actually need in order to trust our results. Apparently practitioners who know their stuff (unlike me!) have some heuristic techniques for doing this. I’ll report back when I investigate further, but it might mean reading code in R.
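For example, PyMC ships some of these heuristics as convergence diagnostics; here is a sketch using the Geweke z-score test (I’m assuming the call looks like this, so check the docs before trusting it):

from pymc import geweke
# Geweke z-scores compare the mean of early and late chunks of a trace;
# values far outside [-2, 2] suggest the chain has not converged
scores = geweke(M.trace('0')[:])
print scores[:5]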
While we’re talking about research problems, if we take $G = G_{n,1/2}$ and plant a clique $K_k$ on $k$ of its vertices, we have come very close to the hidden clique problem. With high probability, $G_{n,1/2}$ contains no clique of size exceeding $2\log_2 n$. That means that the largest clique in the planted graph is the $K_k$ we stuffed in (for $k > 2\log_2 n$). But, for any $k = o(\sqrt{n})$, finding the hidden clique efficiently is an unsolved algorithmic challenge, and a fine candidate for a harder-than-poly-on-average distribution. Since a maximum independent set in $\overline{G}$ is a maximum clique in $G$, it would be a fun little project to show how MCMC fares on finding the hidden clique.
p.s. What does PyMC actually do when I ask it for a step in the chain? I think it tries each variable in a particular order using Gibbs sampling. In other words, it orders the vertices somehow, and for each vertex, in order, it proposes a new state at random and accepts or rejects it with the probability that makes the joint distribution correct.
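To make that concrete, here is the kind of single-site update I have in mind (my own sketch, not PyMC’s actual code; logp stands for the joint log-probability of the whole configuration):

from math import exp
import random

def sweep(x, logp):
    # one pass of single-site Metropolis over the binary vector x,
    # targeting the distribution proportional to exp(logp(x)) -- a sketch
    for v in range(len(x)):
        old = logp(x)
        x[v] = 1 - x[v]                  # propose flipping vertex v
        new = logp(x)
        if not (new >= old or random.random() < exp(new - old)):
            x[v] = 1 - x[v]              # reject: flip back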
To answer your question in ‘ps’: A stripped-down version of M.sample is roughly this loop (paraphrasing, not the verbatim source):
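# the heart of MCMC.sample, paraphrased from memory (not the verbatim source)
for i in range(iter):
    for step_method in self.step_methods:
        step_method.step()   # update the variables this step method manages
    self.tally()             # record the current state in the traces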
M.step_methods is a set of ‘step methods’, objects that are responsible for updating stochastic variables or groups thereof. You can see the step method handling variable x by first calling M.assign_step_methods(), then using the step_method_dict attribute:
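# a sketch, with M.x[0] standing in for the variable x
M.assign_step_methods()
# step_method_dict maps each stochastic variable to its list of step methods
s = M.step_method_dict[M.x[0]][0]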
If you call s.step(), x will be updated once as it would be in an MCMC loop.
The step method class that gets tapped to handle x in your example is called BinaryMetropolis. The Gibbs updating algorithm in BinaryMetropolis.step() is short enough to be pretty readable.
PyMC can make use of user-defined step methods in two ways. First, you can just tell the MCMC object to use your step method class, call it ‘my_sm’, on variable ‘x’:
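# a sketch: register my_sm as the updater for x
M.use_step_method(my_sm, x, *args, **kwargs)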
where args and kwargs will be passed to my_sm.__init__. Second, you can implement a static method called ‘competence’ in my_sm:
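import pymc

class my_sm(pymc.StepMethod):
    @staticmethod
    def competence(x):
        # a sketch: claim binary variables, pass on everything else
        if isinstance(x, pymc.Bernoulli):
            return 3
        return 0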
The body of the method should examine any stochastic variable ‘x’ and return an integer from 0 to 3. 0 means ‘I do not know how to update this variable’ and 3 means ‘I will do an amazing job updating this variable’. The competence method may need to examine the sets x.children and x.extended_children.
Now my_sm will automagically be considered by the MCMC object when its assign_step_methods method is called.
Oops, looks like the indents didn’t show up in the code… hopefully it’s readable anyway!
Thanks again, Anand! To make WordPress typeset Python nicely, you have to start the code with [sourcecode language='python'], and end it similarly. Now, how I write the details of _that_ in WordPress markup is a self-referential mystery to me still.