MCMC in Python: PyMC for Bayesian Probability

I’ve got an urge to write another introductory tutorial for the Python MCMC package PyMC. This time, I say “enough” to the comfortable realm of Markov chains for their own sake. In this tutorial, I’ll test the waters of Bayesian probability.

[Image: “Darwin's other Bulldog” / Tom Bayes]

Now, what better problem to stick my toe in than the one that inspired Reverend Thomas in the first place? Let’s talk about the sex ratio. This is also convenient, because I can crib from Bayesian Data Analysis, that book Rif recommended to me a month ago.

Bayes started this enterprise off with a question that has inspired many an evolutionary biologist: are girl children as likely as boy children? Or are they more likely or less likely? Laplace wondered this also, and in his time and place (from 1745 to 1770 in Paris) there were birth records of 241,945 girls and 251,527 boys. In the USA in 2005, the vital registration system recorded 2,118,982 male and 2,019,367 female live births [1]. I’ll set up a Bayesian model of this, and ask PyMC if the sex ratio could really be 1.0.

The simple model of the situation is that there is some number p_b \in [0,1], and every baby flips a biased coin, which says “be a boy” with probability p_b and “be a girl” with probability 1-p_b. This model ignores correlations between twins, the biophysics of sperm, the existence of intersex babies, as well as many other things. All models are wrong, but some are simple examples.

Mathematically, if N denotes the number of live births, and N_b denotes the number of boys, the model says

\Pr[N_b = k | p_b = p, N = n] = \binom{n}{k} p^k(1-p)^{n-k}
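
To make the likelihood concrete, here is a quick check of this formula on the 2005 US counts, using PyMC’s binomial_like function (the same binomial log-likelihood that the model below relies on); the particular values of p_b tried here are just illustrative:

from pymc import binomial_like

k, n = 2118982, 4138349  # boys and total live births, USA 2005

# log Pr[N_b = k | p_b = p, N = n] for a few candidate values of p_b
for p in [0.5, 0.512, 0.55]:
    print p, binomial_like(k, n, p)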

You can think of this as a one parameter model. If I tell the model p_b, then for any N and N_b, the model can tell me the probability. Now, to answer a question like “is it possible that p_b = 0.5?”, the Bayesians need some sort of prior distribution on p_b. Where does this prior come from? Maybe I should invoke the “principle of insufficient reason”, and say that, since I have no idea, I expect any value between 0 and 1 for p_b, which leads to a uniform distribution on [0,1]. But maybe that’s not an accurate representation of truth; even before I looked up the numbers quoted above, I would have been shocked to learn that p_b \geq 0.9, much more shocked than if you told me that 0.45 \leq p_b \leq 0.55. But I’ll hold on to that thought, and go with the uniform prior for now.

In PyMC, it’s 4 lines to set up this model:

from pymc import *

# uniform prior on p_b, the probability that a baby is a boy
p_b = Uniform('p_b', 0.0, 1.0)

# binomial likelihood: 2,118,982 boys observed out of 4,138,349 live births (USA, 2005)
N_b = Binomial('N_b', n=4138349, p=p_b, value=2118982, observed=True)

m = Model([p_b, N_b])

To see the posterior log-probability (up to an additive constant) for a particular value of p_b, do this:

p_b.value = 0.5
print m.logp

Here is the plot of the log-probability as a function of p_b:

[Figure: First Bayesian Example]

This example is so simple that you don’t need any MCMC at all. But it would be easy to do some if you did.

mc = MCMC(m)
mc.sample(iter=50000, burn=10000)   # 50,000 iterations, discarding the first 10,000 as burn-in

from pylab import hist
hist(p_b.trace())                   # histogram of the posterior samples of p_b

[Figure: histogram of samples from the posterior distribution of p_b]
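
With the trace in hand, it is also easy to summarize the posterior and revisit the question of whether p_b could really be 0.5; here is a minimal sketch (the exact numbers will vary a little from run to run):

from numpy import percentile

samples = p_b.trace()[:]    # array of posterior draws of p_b
print 'posterior mean:', samples.mean()
print '95% interval:', (percentile(samples, 2.5), percentile(samples, 97.5))
print 'fraction of draws at or below 0.5:', (samples <= 0.5).mean()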

To return to the thought that I’ve been holding, I know that p_b \not\geq 0.9. How can I encode this? I could replace my definition of p_b with something like p_b = Uniform('p_b', 0.0, 0.95), but I’d also be surprised to learn that p_b \leq 0.1, so maybe something like p_b = Normal('p_b', 0.5, 0.2) is a good way to go (keeping in mind that PyMC’s Normal takes a precision, not a standard deviation, as its second argument). I don’t know what the pros would think of that; they seem to favor an exotic prior, like p_b = Beta('p_b', 10, 10), which is “conjugate” in this case.
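
To get a feel for how different these prior opinions really are, it only takes a few lines to plot them side by side; this is a sketch using scipy.stats densities in a pylab session, and it reads the normal as having mean 0.5 and standard deviation 0.2:

from scipy import stats
from pylab import arange, plot, legend

p = arange(0.001, 1.0, 0.001)
plot(p, stats.uniform.pdf(p, 0.0, 1.0), label='Uniform(0, 1)')
plot(p, stats.norm.pdf(p, 0.5, 0.2), label='normal, mean 0.5, sd 0.2')
plot(p, stats.beta.pdf(p, 10, 10), label='Beta(10, 10)')
legend()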

To turn the problem around (and put it in a context I’m familiar with from work), how concentrated does the prior have to be to model a doctor who believes that p_b = 0.5 despite the evidence?
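
One way to explore that question without any MCMC at all is to use the conjugacy mentioned above: with a Beta(a, a) prior, the posterior for p_b is Beta(a + N_b, a + N - N_b). Here is a sketch (using scipy rather than PyMC, and the 2005 US counts) that dials up a and watches where the posterior mass goes:

from scipy import stats

k, n = 2118982, 4138349   # boys and total live births, USA 2005

# with a Beta(a, a) prior, the posterior for p_b is Beta(a + k, a + n - k)
for a in [1, 1e3, 1e6, 1e8, 1e10]:
    post = stats.beta(a + k, a + n - k)
    print 'a = %g: posterior mean %.4f, Pr[p_b <= 0.5] = %.3g' % (a, post.mean(), post.cdf(0.5))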

For bonus points, with PyMC we can go beyond the work of Bayes and model the uncertainty in the vital registration system. These numbers for the USA are probably very close to perfect, but I would be more skeptical of the accuracy of the 241,945 girls and 251,527 boys recorded in Laplace’s day. How can I represent my skepticism? Maybe I’ll represent it as a normally distributed error with standard deviation 10,000.

from pymc import *

p_b = Uniform('p_b', 0.0, 1.0)

# normally distributed errors in the recorded counts (parameterized by precision = 1/sd**2)
e_N = Normal('e_N', 0.0, 1.0/(10000.0)**2)
e_N_b = Normal('e_N_b', 0.0, 1.0/(10000.0)**2)

@potential
def likelihood(N_b=251527, e_N_b=e_N_b, N=241945+251527, e_N=e_N, p=p_b):
    return binomial_like(N_b + e_N_b, N + e_N, p)

mc = MCMC([p_b, e_N, e_N_b, likelihood])

Easy, right? Well, easy to write. Can you get PyMC to draw believable samples from it? I can, but it takes a little longer. Remember to thin your samples!
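
For the record, here is the kind of sampling call I have in mind; the iteration counts are just a starting point (check the traces before believing anything):

mc.sample(iter=200000, burn=50000, thin=50)   # long run, discard burn-in, keep every 50th sample
hist(p_b.trace())                             # does the histogram of p_b look believable?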


8 responses to “MCMC in Python: PyMC for Bayesian Probability”

  1. Erik Tollerud

    How did you make the plot of the posterior log-probability? Is there a function in PyMC to do that, or did you just plot the distribution by writing a separate function?

  2. Erik: There is not a PyMC function to make that plot, and it took me a line or two more Python than it should have, so I didn’t include it in the example. Here is what it takes to get started:

    p_vals = arange(0.01,1.0,.01)
    logpr_vals = []
    for p in p_vals:
        p_b.value = p
        logpr_vals.append(m.logp)
    plot(p_vals, logpr_vals)
    

    To get a zoomed-in view of the piece where the action is, use p_vals = arange(0.51,0.515,0.00001) as line 1. Then it will even look nice with plot(p_vals,exp(array(logpr_vals))).
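
    Putting those two changes together (in the same pylab session as above), the zoomed-in version looks something like this:

    p_vals = arange(0.51, 0.515, 0.00001)
    logpr_vals = []
    for p in p_vals:
        p_b.value = p
        logpr_vals.append(m.logp)
    plot(p_vals, exp(array(logpr_vals)))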

  3. I found the cool way to make that plot.

    map = MAP(m)
    p_vals = arange(0.01,1.0,.01)
    plot(p_vals, [-map.func(p) for p in p_vals])  
    

    🙂

  4. > so maybe something like p_b = Normal(‘p_b’, 0.5, 0.2) is a good way
    > to go

    The problem with allowing the parameter p_b to be distributed Normally is that a Normal distribution would assign positive probability to p_b > 1.0 and p_b < 0. Though you can make this probability vanishingly small by choosing an appropriately small variance, it will always be positive. This is why more "exotic" distributions are preferred for distributions over parameters that have a constrained range.

  5. @Josh, there is also the TruncatedNormal stochastic in PyMC, which combines the familiarity of the normal distribution with the appropriate support of the beta distribution. Here is some PyMC documentation on it, together with an example of a custom step method that doesn’t waste time stepping over the edge: example-an-asymmetric-metropolis-step.
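
    To make both points concrete, here is a quick sketch (quoting the TruncatedNormal parameterization from memory, so check the docs): the first line quantifies how much probability a normal with mean 0.5 and standard deviation 0.2 puts outside [0, 1], and the second builds a prior with the same shape but the right support.

    from scipy import stats
    from pymc import TruncatedNormal

    print 2 * stats.norm.cdf(0.0, 0.5, 0.2)   # about 1.2% of the mass falls outside [0, 1]
    p_b = TruncatedNormal('p_b', mu=0.5, tau=1.0/0.2**2, a=0.0, b=1.0)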

  6. Pingback: 2010 in review | Healthy Algorithms

  7. Jonas

    Thank you, that was a useful post. Going to use this tutorial for a completely different project.

  8. Great! If there is something to show from your project, please post a link here; I’d love to see how this lives on.