Category Archives: statistics

MCMC in Python: Global Temperature Reconstruction with PyMC

A short note on the PyMC mailing list alerted me that Apeescape, the author of the mind of a Markov chain blog, was thinking of using PyMC to replicate some controversial climate data analysis, but was having problems with it. Since I’m a sucker for controversial data, I decided to see if I could do the replication exercise in PyMC myself.

I didn’t dig into what the climate-hockey-stick fuss is about; that’s something I’ll leave for my copious spare time. What I did do is find the data, which is pretty easily available on the original author’s website, and make a translation of the R/BUGS model into PyMC/Python. My work is all in a github repository, here, if you want to try it yourself.

Based on Apeescape’s BUGS model, I want to have \textnormal{temp}_t \sim N(\mu_t, \sigma^2) where \mu_t = \beta_0 + \beta_1\textnormal{temp}_{t-1} + \beta_2\textnormal{temp}_{t-2} + \sum_{i=3}^{12} \beta_i\textnormal{PC}_{t,i-2}, with priors \vec{\beta} \sim N(\vec{0}, 1000 I) and \sigma \sim \textnormal{Uniform}(0,100).

I implemented this in a satisfyingly concise 21 lines of code, which also generate posterior predictive values for model validation:

# imports needed to run this snippet: numpy, matplotlib's csv2rec, and PyMC 2
from numpy import zeros, array, dot
from matplotlib.mlab import csv2rec
from pymc import Normal, Uniform, deterministic, observed, rnormal, normal_like

# load data
data = csv2rec('BUGS_data.txt', delimiter='\t')

# define priors
beta = Normal('beta', mu=zeros(13), tau=.001, value=zeros(13))  # tau = 1/variance, so this is N(0, 1000*I)
sigma = Uniform('sigma', lower=0., upper=100., value=1.)

# define predictions
pc = array([data['pc%d'%(ii+1)] for ii in range(10)])  # copy pc data into an array for speed & convenience
@deterministic
def mu(beta=beta, temp1=data.lagy1, temp2=data.lagy2, pc=pc):
    return beta[0] + beta[1]*temp1 + beta[2]*temp2 + dot(beta[3:], pc)

@deterministic
def predicted(mu=mu, sigma=sigma):
    return rnormal(mu, sigma**-2.)  # PyMC's rnormal takes a precision, tau = 1/sigma^2

# define likelihood
@observed
def y(value=data.y, mu=mu, sigma=sigma):
    return normal_like(value, mu, sigma**-2.)

Making an image out of this to match the R version got me stuck for a little bit, because the author snuck in a call to “Friedman’s SuperSmoother” in the plot generation code. That seems unnecessarily sneaky to me, especially after going through all the work of setting up a model with fully Bayesian priors. Don’t you want to see the model output before running it through some highly complicated smoothing function? (The super-smoother supsmu is a “running lines smoother which chooses between three spans for the lines”, whatever that is.) In case you do, here it is, together with an alternative smoother I hacked together, since Python has no super-smoother that I know of.
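For the curious, here is a minimal running-lines smoother in NumPy, in the spirit of the supsmu description quoted above but much cruder. This is an illustrative sketch, not the smoother from my repository; the function name and the single fixed `span` parameter are my own invention (supsmu chooses between three spans automatically).

```python
import numpy as np

def running_lines_smoother(x, y, span=0.3):
    """Crude stand-in for supsmu: at each point, fit a least-squares line
    to the nearest span*len(x) neighbors and evaluate it there."""
    n = len(x)
    k = max(2, int(span * n))               # window size, at least 2 points
    order = np.argsort(x)
    xs, ys = np.asarray(x)[order], np.asarray(y)[order]
    smooth = np.empty(n)
    for i in range(n):
        lo = max(0, i - k // 2)
        hi = min(n, lo + k)
        b, a = np.polyfit(xs[lo:hi], ys[lo:hi], 1)  # local line: slope b, intercept a
        smooth[i] = a + b * xs[i]
    return xs, smooth
```

Sanity check: smoothing data that already lie on a line should return the same line, since every local fit recovers it exactly.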

Since I have the posterior predictions handy, I plotted the median residuals against the median predicted temperature values. I think this shows that the error model is fitting the data pretty well:
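The residual check itself is just medians over the posterior-predictive trace. Here is a self-contained sketch with stand-in data; the names `predicted_trace` and `y` are illustrative (in the real analysis, `predicted_trace` is the (n_samples, n_years) trace of the `predicted` node saved by the sampler, and `y` is the observed temperature series).

```python
import numpy as np

# stand-in data so the snippet runs on its own
rng = np.random.default_rng(0)
y = rng.normal(size=50)                                    # "observed" series
predicted_trace = y + rng.normal(scale=0.1, size=(1000, 50))  # "posterior-predictive" draws

median_pred = np.median(predicted_trace, axis=0)           # median prediction per year
median_resid = np.median(predicted_trace - y, axis=0)      # median residual per year
# plot(median_pred, median_resid, '.') gives the scatter described above
```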

5 Comments

Filed under MCMC, statistics

What is this “effects” business?

I’ve got to figure out what people mean when they say “fixed effect” and “random effect”, because I’ve been confused about it for a year and I’ve been hearing it all the time lately.

Bayesian Data Analysis is my starting guide, which includes a footnote on page 391:

The terms ‘fixed’ and ‘random’ come from the non-Bayesian statistical tradition and are somewhat confusing in a Bayesian context where all unknown parameters are treated as ‘random’. The non-Bayesian view considers fixed effects to be fixed unknown quantities, but the standard procedures proposed to estimate these parameters, based on specified repeated-sampling properties, happen to be equivalent to the Bayesian posterior inferences under a noninformative (uniform) prior distribution.

That doesn’t totally resolve my confusion, though, because my doctor-economist colleagues are often asking for the posterior mean of the random effects, or similarly non-non-Bayesian sounding quantities.

I was about to formulate my working definition, and see how long I can stick to it, but then I was volunteered to teach a seminar on this very topic! So instead of doing the work now, I turn to you, wise internet, to tell me how I can understand this thing.

6 Comments

Filed under statistics

Network Theory in Health Metrics

The heavy-tailed/small-worlds crowd had a big impact in health research recently, and now it’s drawing criticism from the theorists. Slate covered the story well a couple weeks ago, and interviewed Russ Lyons at length about the methodological shortcomings of the evidence that obesity, smoking, and loneliness are socially contagious. The Slate article even links to Russ’s preprint on the matter, which is some pretty technical stuff to point a general audience towards. Go Slate!

When I passed on some of Russ’s concerns to my experienced health metrics colleagues, one replied that the idea of social contagion is important enough that it doesn’t matter if the methods are wrong. Interesting perspective. It reminds me of Gian Carlo Rota’s ordering of mathematical results: most important are definitions, less important than that are theorems, and much, much less important than that are the proofs.

I’ve been in meetings for almost 3 weeks now, and meanwhile more good papers on networks for health are pouring out. Christakis and Fowler have posted a preprint to arxiv, showing how network thinking improved flu surveillance of Harvard undergrads. So maybe the idea was the important part. Meanwhile, Cosma Shalizi and Andrew Thomas have an additional critique preprint, to be put in the same category as Russ’s. I asked Russ what he would accept as evidence of social contagion, and I didn’t find out, but the paper by Shalizi and Thomas says maybe nothing can be convincing: Homophily and Contagion Are Generically Confounded in Observational Social Network Studies.

For me, it’s time to get back to that meeting!

2 Comments

Filed under global health, statistics

Teleportation Measurements

I’m not attending WWW this week, but I am promoting a paper that I helped with, Tracking the random surfer: Empirically measured teleportation parameters in PageRank. My main contribution was connecting the people with the idea to the people with the data, but I’m happy with the results.
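For readers who haven’t met the teleportation parameter before: in the standard PageRank formulation, a random surfer follows an out-link of the current page with probability α and “teleports” to a uniformly random page with probability 1 − α. A textbook power-iteration sketch (this is the generic algorithm, not code from the paper; `alpha` and the dangling-page handling are the usual conventions):

```python
import numpy as np

def pagerank(A, alpha=0.85, tol=1e-10):
    """PageRank by power iteration on an adjacency matrix A (A[i,j]=1 means
    a link from page i to page j). alpha is the probability of following a
    link; 1 - alpha is the teleportation probability."""
    n = A.shape[0]
    out = A.sum(axis=1, keepdims=True)
    # row-stochastic transition matrix; pages with no out-links teleport uniformly
    P = np.where(out > 0, A / np.where(out == 0, 1, out), 1.0 / n)
    r = np.full(n, 1.0 / n)
    while True:
        r_new = alpha * r @ P + (1 - alpha) / n
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new
```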

Incidentally, this sort of measurement has a great application in health metrics. But I’m going to keep it secret for a little while, to get my thoughts in order.

Comments Off on Teleportation Measurements

Filed under statistics

Practical MCMC Advice: When to Stop

I read some good practical advice about when enough is enough in Markov Chain Monte Carlo sampling this morning. In their “Inference from simulations and monitoring convergence” chapter of Handbook of Markov Chain Monte Carlo, Andrew Gelman and Kenneth Shirley say many useful things in a quickly digested format. Continue reading
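One of the standard tools from that literature is the potential scale reduction factor, R-hat, which compares between-chain and within-chain variance; values near 1 suggest it may be safe to stop. A bare-bones NumPy sketch of the basic Gelman-Rubin formula (without the split-chain refinement the chapter recommends):

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor R-hat for an (m, n) array of m
    parallel chains of length n; values near 1 suggest the chains mixed."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)   # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()     # within-chain variance
    var_plus = (n - 1) / n * W + B / n        # pooled posterior variance estimate
    return np.sqrt(var_plus / W)
```

Chains drawn from the same distribution give R-hat very close to 1, while chains stuck in different places give values well above 1.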

Comments Off on Practical MCMC Advice: When to Stop

Filed under MCMC, statistics, TCS

Multilevel (hierarchical) modeling: what it can and cannot do in Python


I re-read a short paper of Andrew Gelman’s yesterday about multilevel modeling, and thought “That would make a nice example for PyMC”.  The paper is “Multilevel (hierarchical) modeling: what it can and cannot do”, and R code for it is on his website.

To make things even easier for a casual blogger like myself, the example from the paper is extended in the “ARM book”, and Whit Armstrong has already implemented several variants from this book in PyMC. Continue reading

17 Comments

Filed under MCMC, statistics

Top

I don’t feel like having that post about how big things are brewing in US health care reform on the top of my blog anymore, so here is a quick replacement: a ranking paper that caught my eye recently on arxiv, where computer science is applied to politics: On Ranking Senators By Their Votes, by my fellow CMU alum, Mugizi Rwebangira (@rweba on twitter).

Comments Off on Top

Filed under statistics

MCMC in Python: PyMC for Bayesian Model Selection

(Updated 9/2/2009, but still unfinished; see others’ work on this that I’ve collected)

I never took a statistics class, so I only know the kind of statistics you learn on the street. But now that I’m in global health research, I’ve been doing a lot of on-the-job learning. This post is about something I’ve been reading about recently, how to decide if a simple statistical model is sufficient or if the data demands a more complicated one. To keep the matter concrete (and controversial) I’ll focus on a claim from a recent paper in Nature that my colleague, Haidong Wang, chose for our IHME journal club last week: Advances in development reverse fertility declines. The title of this short letter boldly claims a causal link between total fertility rate (an instantaneous measure of how many babies a population is making) and the human development index (a composite measure of how “developed” a country is, on a scale of 0 to 1). Exhibit A in their case is the following figure:

An astute observer of this chart might ask, “what’s up with the scales on those axes?” But this post is not about the visual display of quantitative information. It is about deciding if the data have the piecewise linear relationship that Myrskyla et al. claim, and doing it in a Bayesian framework with Python and PyMC. But let’s start with a figure where the axes have a familiar linear scale! Continue reading
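To see why “piecewise linear vs. plain linear” is a model selection question at all, here is a non-Bayesian warm-up using ordinary least squares on synthetic data (the breakpoint at 0.85, the slopes, the noise level, and all variable names are made up for illustration; the post itself does the comparison with PyMC):

```python
import numpy as np

rng = np.random.default_rng(1)
hdi = np.linspace(0.5, 0.95, 60)  # synthetic development-index values
# fertility falls with development, then reverses past a breakpoint at 0.85
tfr = np.where(hdi < 0.85, 7 - 6 * hdi, 1.9 + 8 * (hdi - 0.85))
tfr = tfr + rng.normal(scale=0.1, size=hdi.shape)

# plain linear fit
X_line = np.column_stack([np.ones_like(hdi), hdi])
b_line, *_ = np.linalg.lstsq(X_line, tfr, rcond=None)
rss_line = ((tfr - X_line @ b_line) ** 2).sum()

# broken-stick fit: extra basis function (hdi - 0.85)_+ allows a slope change
X_broken = np.column_stack([np.ones_like(hdi), hdi, np.maximum(hdi - 0.85, 0)])
b_broken, *_ = np.linalg.lstsq(X_broken, tfr, rcond=None)
rss_broken = ((tfr - X_broken @ b_broken) ** 2).sum()

print(rss_line, rss_broken)  # the broken stick fits this data far better
```

Of course the richer model always fits at least as well, which is exactly why the comparison needs a principled criterion rather than raw residuals, and that is where the Bayesian machinery comes in.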

30 Comments

Filed under MCMC, statistics