A recent question on the PyMC mailing list inspired me. How can you estimate transition parameters in a compartmental model? I did a lit search for just this when I started up my generic disease modeling project two years ago. Much information, I did not find. I turned up one paper which said basically that using a Bayesian approach was a great idea and someone should try it (and I can’t even find that now!).
Part of the problem was language. I’ve since learned that micro-simulators call it “calibration” when you estimate parameter values, and there is a whole community of researchers working on “
black-box modeling plug-and-play inference” that do something similar as well. These magic phrases are incantations to the search engines that help find some relevant prior work.
But I started blazing my own path before I learned any of the right words; with PyMC, it is relatively simple. Consider the classic SIR model from mathematical epidemology. It’s a great place to start, and it’s what Jason Andrews started with on the PyMC list. I’ll show you how to formulate it for Bayesian parameter estimation in PyMC, and how to make sure your MCMC has run for long enough.
My impression of what a typical mathematical epidemiological analysis world do with this model is the following: start with a few parameters, see what the results are from fiddling around with them, make to model more complicated to incorporate a new feature of interest, fiddle around with the parameters some more, and then write a paper about some insight gained during all this fiddling. I prefer to connect the model directly to some data. Here is how:
## SI model from pymc import * from numpy import * #observed data T = 10 susceptible_data = array([999,997,996,994,993,992,990,989,986,984]) infected_data = array([1,2,5,6,7,8,9,11,13,15]) # stochastic priors beta = Uniform('beta', 0., 40., value=1.) gamma = Uniform('gamma', 0., 1., value=.001) SI_0 = Uninformative('SI_0', value=[999., 1.]) # deterministic compartmental model @deterministic def SI(SI_0=SI_0, beta=beta, gamma=gamma): S = zeros(T) I = zeros(T) S = SI_0 I = SI_0 for i in range(1,T): S[i] = S[i-1] - 0.05*beta*S[i-1]*I[i-1]/(S[i-1]+I[i-1]) I[i] = max(0., I[i-1] + 0.05*beta*S[i-1]*I[i-1]/(S[i-1]+I[i-1]) - gamma*I[i-1]) return S, I S = Lambda('S', lambda SI=SI: SI) I = Lambda('I', lambda SI=SI: SI) # data likelihood A = Poisson('A', mu=S, value=susceptible_data, observed=True) B = Poisson('B', mu=I, value=infected_data, observed=True)
The model takes 32 short lines; not bad. Lines 7-9 are the data, observed counts of susceptible and infected population sizes by day for 10 days. Jason says that he just made this data up for the example, but he has good leads on some real data that I’ll describe below in the “exercises for the reader” section. Lines 12-14 are the “priors”, which I have chosen to be uninformative. There is some discussion to have about what is truly uninformative, so go rile up your neighborhood statistician by asking them about it.
Lines 17-26 are where all the action is. This is where I solve the (deterministic) difference equations that constitute the SIR model. By decorating the SI function with the
@deterministic tag, I’ve told PyMC that this function is actually part of the model, a deterministic function of the stochastic variables I’ve passed in. Lines 27 and 28 are really just for convenience, a way to talk about
I separately instead of having them stuck together in
SI the way that they were calculated.
Lines 31 and 32 set up the data likelihood, the novel part of this approach. This is all it takes to stick a statistical model on a system dynamics model, once you have the latter set up in PyMC. And getting the latter set up in PyMC isn’t much of an ordeal to begin with, if you’ve got it coded up in Python. Just get your priors figured out and preface the whole thing in a
@deterministic tag, as above.
What do you get for these minor labors? Well, I like to test out a model like this in ipython, for which I would save the above code as “si_model.py” and then say things like:
from pylab import * from pymc import * # load the model import si_model as mod reload(mod) # this reload streamlines interactive debugging # fit the model with mcmc mc = MCMC(mod) mc.sample(iter=200000, burn=100000, thin=20, verbose=1) mod.beta.stats()
After a couple of minutes you should have got something, but what? To see exactly is an exploratory process for me, and I messed around for a while to get some graphics generating code I like for it. You can see the full code, try it out, change it, whatever.
For the time-series of the epidemic, here is my picture:
It shows that the model fits the data reasonably well, at least to my eye.
For my beta and gamma distributions I got this:
This is quite a lovely joint posterior distribution, I think, which shows that the parameter values that fit the data are highly dependent. You can have any
gamma value you like, but your
beta must be chosen to correspond. That is something you might miss with a more fiddly, less systematic exploration of parameter space.
There is a big problem with MCMC, though, and that is how do you know that you’ve run the chain for long enough? I’ve read up on a number of methods, and I’ll tell you two. First, you can look at the
'mc error' value that PyMC provides in each stochastic’s stats:
This breaks up the samples into 100 batches, and reports the standard deviation of the means of the batches. You want that to be small. How small? Well, small relative to the mean overall, I guess. This approach is simple, but I don’t trust it as much as my next approach. Which is the one that generates the following figure:
This approach, as alluded in the figure caption, is to plot the autocorrelation of the MCMC samples. In theory-world, where I come from, you just thin your samples by a constant factor, and reduce the dependence between samples magnificently. Of course, if you are not living in theory-world, you might not want to wait for a constant factor of thinning, especially if the constant is measured in days.
Fortunately there is a one-line solution. Before running the chain, but after creating the
MCMC object, I’ll just ask for a step method which uses the state-of-the-art Adaptive Metropolis updates. Here is the code:
mc = MCMC(mod) mc.use_step_method(AdaptiveMetropolis, [mod.beta, mod.gamma, mod.SI_0]) mc.sample(iter=200000, burn=100000, thin=20, verbose=1)
and here is how it changes the figure:
This is what I really like to see. The autocorrelation of the sample trace at one lag is small, but visable, and the autocorrelation at two lags is basically zero. So I’m not waiting too long for mixing, but I’m not drawing very correlated samples that lead to incorrect estimates. It also makes a lot of sense. The joint posterior distribution of beta and gamma shows that they are highly correlated, so the MCMC mixes more rapidly if it can take advantage of this fact. And AdaptiveMetropolis does just this, by learning the best directions for randomly wandering around the space as it goes.
Now, the exercise for the reader: extend all of this to use real data and an additional compartment. Here is Jason’s suggestion:
I found the smallpox example that might be a good historical example to use. It describes an outbreak of 32 cases of smallpox that occurred in Nigeria in 1967 among a religious group that refused vaccination. The original WHO/CDC report.
A couple of authors have used this to estimate transmission parameters. One example is here.
All the data is also in tables there, and they’ve pulled in some estimates from other outbreaks. Many possibilities for how to specify a model here for an example, but you could do a SEIR model:
Where the population in E(xposed) aren’t infectious, as the authors of that paper concluded that there was negligible infectiousness prior to rash (or to test this, use beta1 as transmission parameter for E*S and a beta2 for I*S). They found Ro to be 0.16 prior to rash and >6 after rash.
So one possibility:
S(t+1) = S(t) - beta*S(t)*I(t)/N(t)
E(t+1) = E(t) + beta*S(t)*I(t)/N(t) - mu*E(t)
I(t+1) = I(t) + mu*E(t) - lamda*I(t)
R(t+1) = R(t) + lamda*I(t)
where 1/mu is duration of infection prior to rash = (11.6+2.49) from table 3
1/lamda is the duration of infectiousness = 16 from table 3
If you wanted to put in two betas, would just subtract beta2*S(t)*E(t) to line one and add it to line two.
Going to take a crack at a solution? Let me know what you find, or maybe you can have a guest post about it!
Finally, let me show you what the fit looked like before I made
SI_0 a stochastic variable. When I just started it at the “right” initial value, I got this:
Which (in language of philosophy statistics that I just learned) I would say fails my “pure significance test” by looking like a crappy fit to the data. I’ll have more to say about that in the future, but until then, here is the applied philosophy treatise on the matter.