Category Archives: statistics

In the media

Jake called my attention to a recent public interest story in the NYTimes, about using the results of an on-going telephone survey to predict the demographic profile of the happiest man in America. It seems like they’re making fun of simple models, a pursuit I approve of:

The New York Times asked Gallup to come up with a statistical composite for the happiest person in America, based on the characteristics that most closely correlated with happiness in 2010. Men, for example, tend to be happier than women, older people are happier than middle-aged people, and so on.

Gallup’s answer: he’s a tall, Asian-American, observant Jew who is at least 65 and married, has children, lives in Hawaii, runs his own business and has a household income of more than $120,000 a year.

But the joke is that they made a few phone calls and found the guy… he looks pretty happy.

MCMC in Python: Part IIb of PyMC Step Methods and their pitfalls

Two weeks ago, I slapped together a post about the banana distribution and some of my new movies sampling from it. Today I’m going to explore the performance of MCMC with the Adaptive Metropolis step method when the banana dimension grows and it becomes more curvy.

To start things off, have a look at the banana model, as implemented for PyMC:

import pylab as pl
import pymc as mc

dim = 4    # dimension of the banana; the movies below use 2, 4, and 8
b = 0.03   # "bananicity": larger values make the banana curvier (this value is just illustrative)

C_1 = pl.ones(dim)
C_1[0] = 100.
X = mc.Uninformative('X', value=pl.zeros(dim))

def banana_like(X, tau, b):
    phi_X = pl.copy(X)
    phi_X *= 30. # rescale X to match scale of other models
    phi_X[1] = phi_X[1] + b*phi_X[0]**2 - 100*b

    return mc.normal_like(phi_X, 0., tau)

@mc.potential
def banana(X=X, tau=C_1**-1, b=b):
    return banana_like(X, tau, b)

Pretty simple stuff, right? This is a pattern that I find myself using a lot: an uninformative stochastic variable with the intended distribution implemented as a potential. If I understand correctly, it is something that you can’t do easily in BUGS.

Adaptive Metropolis is (or was) the default step method that PyMC uses to fit a model like this, and you can ensure that it is used (and fiddle with the AM parameters) with the use_step_method() function:

mod.use_step_method(mc.AdaptiveMetropolis, mod.X)

Important Tip: If you want to experiment with different step methods, you need to call mod.use_step_method() before doing any sampling. When you call mod.sample(), any stochastics without step methods assigned are given whatever method PyMC deems best, and once they have been assigned, use_step_method() adds additional step methods but doesn’t remove the automatically assigned ones. (See this discussion from the PyMC mailing list.)
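To make the ordering concrete, here is a minimal sketch of the whole setup, assuming the banana variables defined above (the iteration counts are arbitrary placeholders):

import pymc as mc

# build the sampler from the banana variables defined above
mod = mc.MCMC([X, banana])

# assign Adaptive Metropolis to X *before* sampling; calling this after
# mod.sample() would only add a second step method alongside the default one
mod.use_step_method(mc.AdaptiveMetropolis, mod.X)

mod.sample(iter=200000, burn=100000, thin=100)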

So let’s see how this Adaptive Metropolis performs on a flattened banana as the dimension increases:

You can tell what dimension the banana lives in by counting the panels in the autocorrelation plots in the upper right of each video: 2, 4, or 8. AM is definitely up to the challenge of sampling from an 8-dimensional “flattened banana”, but this is nothing more than an uncorrelated normal distribution that has been stretched along one axis, so the results shouldn’t impress anyone.

Here is what happens in a more curvy banana:

Now you can see the performance start to degrade. AM does just fine in a 2-d banana, but it’s going to need more time to sample appropriately from a 4-d one, and even more for the 8-d. This is similar to the way a cross-diagonal region breaks AM, but a lot more like something that could come up in a routine statistical application.

For a serious stress test, I can curve the banana even more:

Now AM has trouble even with the two dimensional version.

I’ve had an idea to take this qualitative exploration and quantify it, to do a real “experimental analysis of algorithms”-style study. Have you seen something like this already? I have enough to do without making additional work for myself.

MCMC in Python: Part II of PyMC Step Methods and their pitfalls

I had a good time with the first round of my Step Method Pitfalls: besides making some great movies, I got a tip on how to combine Hit-and-Run with Adaptive Metropolis (together they are “the H-RAM“, fitting since the approach was suggested by Aram). And even more important than getting the tip, I did enough of a proof-of-concept to inspire Anand to rewrite it in the correct PyMC style. H-RAM lives.

Enter the GHME, where I had a lot of good discussions about methods for metrics, and where Mariel Finucane told me that her stress test for new step methods is always the banana (from Haario, Saksman, Tamminen, Adaptive proposal distribution for random walk Metropolis algorithm, Computational Statistics 1999):

The non-linear banana-shaped distributions are constructed from the Gaussian ones by ‘twisting’ them as follows. Let f be the density of the multivariate normal distribution N(0, C_1) with the covariance again given by C_1 = {\rm diag}(100, 1, ..., 1). The density function of the ‘twisted’ Gaussian with the nonlinearity parameter b > 0 is given by f_b = f \circ \phi_b, where the function \phi_b(x) = (x_1, x_2 + b x_1^2 - 100b, x_3, ..., x_n).

It’s a good distribution, and it makes for a good movie.
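For reference, here is a minimal numpy sketch of that twisted density on the log scale (the dimension and the value of b below are just illustrative):

import numpy as np

def banana_logpdf(x, b=0.03, n=2):
    """log f_b(x) = log f(phi_b(x)); phi_b has unit Jacobian, so no correction term is needed"""
    var = np.ones(n)
    var[0] = 100.                           # C_1 = diag(100, 1, ..., 1)
    phi_x = np.array(x, dtype=float)        # copy, so x is not modified
    phi_x[1] += b * phi_x[0]**2 - 100.*b    # the 'twist' phi_b
    return -0.5 * np.sum(np.log(2*np.pi*var) + phi_x**2 / var)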

More detailed explorations to follow. What do you want to see?

Gaussian Processes and Jigsaw Puzzles with PyMC.gp

I was thinking about cutting something up into little pieces the other day; let’s not get into the details. The point is, I turned my destructive urge into creative energy when I started thinking about jigsaw puzzles. You might remember that a few months ago my hobby was maze making with randomly generated bounded-depth spanning trees. It turns out that jigsaw puzzles are just as fun.

The secret ingredient to my jigsaw puzzle design is the Gaussian process with a Matern covariance function. (Maybe you knew that was coming.) GPs are an elegant way to make the little nubs that hold the puzzle together. It’s best to use two of them together to make the nub, like this:

Doing this is not hard at all, once you sort out the intricacies of the PyMC.gp package, and takes only a few lines of Python code:

import pylab as pl
from pymc import gp

# uninformative_prior_gp() and the data module come from the full code linked
# below; data.puzzle_t, data.puzzle_x/y, and data.puzzle_V are (presumably) the
# observation mesh, the anchor coordinates of the nub, and the observation variance
def gp_puzzle_nub(diff_degree=2., amp=1., scale=1.5, steps=100):
    """ Generate a puzzle nub connecting point a to point b"""

    # one GP realization for the horizontal coordinate of the nub...
    M, C = uninformative_prior_gp(0., diff_degree, amp, scale)
    gp.observe(M, C, data.puzzle_t, data.puzzle_x, data.puzzle_V)
    GPx = gp.GPSubmodel('GPx', M, C, pl.arange(1))
    X = GPx.value.f(pl.arange(0., 1.0001, 1. / steps))

    # ...and a second, independent realization for the vertical coordinate
    # (distinct submodel names keep the two from colliding)
    M, C = uninformative_prior_gp(0., diff_degree, amp, scale)
    gp.observe(M, C, data.puzzle_t, data.puzzle_y, data.puzzle_V)
    GPy = gp.GPSubmodel('GPy', M, C, pl.arange(1))
    Y = GPy.value.f(pl.arange(0., 1.0001, 1. / steps))

    return X, Y

(full code here)
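In case it helps, here is a guess at what the uninformative_prior_gp() helper might look like; the constant mean and the exact signature are assumptions on my part, so check the linked full code for the real thing:

import pylab as pl
from pymc import gp
from pymc.gp.cov_funs import matern

def uninformative_prior_gp(c=0., diff_degree=2., amp=1., scale=1.5):
    """A sketch: a constant mean function at c and a Matern covariance."""
    M = gp.Mean(lambda x: c * pl.ones(len(x)))
    C = gp.Covariance(matern.euclidean, diff_degree=diff_degree,
                      amp=amp, scale=scale)
    return M, C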

I was inspired by something Andrew Gelman blogged about the utility of writing a paper and a blog post about this or that. So I tried it out. It didn’t work for me, though: there isn’t a paper’s worth of ideas here, and now I’ve depleted my energy before finishing the blog post. Here it is anyway: an attempted paper to accompany this post. Patches welcome.

In addition to an aesthetically pleasing diversion, I also got something potentially useful out of this: a diagram of how misspecifying any one of the parameters of the Matern covariance function can lead to similarly strange-looking results. This is my evidence that you can’t tell from a single bad fit whether your amplitude is too small or your scale is too large:

Black History #Math

If I were still tweeting, I’d tweet this to you: a nice post about David Blackwell and his math on Higher Cohomology is Inevitable.

That Bayes Factor

A year or more ago, when I was trying to learn about model selection, I started writing a tutorial about doing it with PyMC. It ended up being more difficult than I expected, though, and I left it for later. And later has become later and later. Yet people continue landing on that page, and I’m sure they leave disappointed by its unfinished state. Sad. But fortunately another tutorial exists, Estimating Bayes factors using PyMC by PyMC developer Chris Fonnesbeck. So try that for now, and I’ll work on mine more later (and later and later).

There is also an extensive discussion on the PyMC mailing list that you can turn to here. It produced this Monte Carlo calculation.
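For the impatient, here is a minimal sketch (mine, not from either of those sources) of the model-indicator approach to a Bayes factor in PyMC. The toy data and the pair of models are purely illustrative, and in real problems this simple version can mix poorly without pseudo-priors:

import pymc as mc
import numpy as np

y = np.array([.3, -.2, 1.1, .8, .4])   # toy data, purely illustrative

# model indicator with prior probability 1/2 on each model
M = mc.Bernoulli('M', p=.5)

# model 0: mean fixed at zero; model 1: unknown mean mu0
mu0 = mc.Uniform('mu0', -10., 10.)
@mc.deterministic
def mu(M=M, mu0=mu0):
    return mu0 if M else 0.

obs = mc.Normal('obs', mu=mu, tau=1., value=y, observed=True)

mod = mc.MCMC([M, mu0, mu, obs])
mod.sample(iter=50000, burn=10000)

p1 = mod.trace('M')[:].mean()           # posterior probability of model 1
bayes_factor = p1 / (1. - p1)           # prior odds are 1, so BF = posterior odds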

MCMC in Python: Statistical model stuck on a stochastic system dynamics model in PyMC

My recent tutorial on how to stick a statistical model on a systems dynamics model in PyMC generated a good amount of reader interest, as well as an interesting comment from Anand Patil, who writes:

Something that might interest you is that, in continuous-time stochastic differential equation models, handling the unobserved sample path between observations is really tricky. Roberts and Stramer’s On inference for partially observed nonlinear diffusion models using the Metropolis-Hastings algorithm explains why. This difficulty can apply to discrete-time models with loads of missing data as well. Alexandros Beskos has produced several really cool solutions.

This body of research is quite far from the vocabulary I am familiar with, so I’m not sure how serious a problem this could be for me. It did get me interested in sticking my statistical model to a systems model with stochastic dynamics, though, something which took only a few additional lines… thanks PyMC!

## Stochastic SI model

from pymc import *
from numpy import *

#observed data
T = 10
susceptible_data = array([999,997,996,994,993,992,990,989,986,984])
infected_data = array([1,2,5,6,7,18,19,21,23,25])

# stochastic priors
beta = Uniform('beta', 0., 1., value=.05)
gamma = Uniform('gamma', 0., 1., value=.001)
tau = Normal('tau', mu=.01, tau=100., value=.01)

# stochastic compartmental model
S = {}
I = {}

## uninformative initial conditions
S[0] = Uninformative('S_0', value=999.)
I[0] = Uninformative('I_0', value=1.)

## stochastic difference equations for later times
for i in range(1,T):
    @deterministic(name='E[S_%d]'%i)
    def E_S_i(S=S[i-1], I=I[i-1], beta=beta):
        return S - beta * S * I / (S + I)
    S[i] = Normal('S_%d'%i, mu=E_S_i, tau=tau, value=E_S_i.value)

    @deterministic(name='E[I_%d]'%i)
    def E_I_i(S=S[i-1], I=I[i-1], beta=beta, gamma=gamma):
        return I + beta * S * I / (S + I) - gamma * I
    I[i] = Normal('I_%d'%i, mu=E_I_i, tau=tau, value=E_I_i.value)

# data likelihood
A = Poisson('A', mu=[S[i] for i in range(T)], value=susceptible_data, observed=True)
B = Poisson('B', mu=[I[i] for i in range(T)], value=infected_data, observed=True)

This ends up taking a total of 6 lines more than the deterministic version, and all the substantial changes are in lines 24-34, the stochastic difference equations for later times. So, question one is for Anand: do I have to worry about unobserved sample paths here? If I’ve understood Roberts and Stramer’s introduction, I should be ok. Question two returns to a blog topic from one year ago that I’ve continued to try to educate myself about: how do I decide if and when this more complicated model should be used?
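For completeness, fitting this stochastic version looks the same as fitting the deterministic one; here is a minimal sketch (the iteration counts are arbitrary placeholders):

## fit the model above with MCMC
mod = MCMC([beta, gamma, tau] + list(S.values()) + list(I.values()) + [A, B])
mod.sample(iter=20000, burn=10000, thin=10)

beta_samples = mod.trace('beta')[:]    # posterior draws for the transmission parameter
gamma_samples = mod.trace('gamma')[:]  # posterior draws for the recovery parameter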

Election Season Infographics

I’ve seen a lot of visual display of quantitative information in the news lately, and I like that. But I’ve seen a lot of ink used for more style than substance, and that bugs me, especially when the point is stronger with more substance.

In Exhibit A, I draw your attention to the graphic from last week’s NYTimes front-page article Top Corporations Aid U.S. Chamber of Commerce Campaign. The web graphic differs slightly from the print version, but both obscure the point: conservative groups are drastically outspending their rivals in the current election cycle.

A casual observer might miss this though, because of the stylish way the data is represented as red and blue squares, each standing on edge. The artfully arbitrary spacing between the overlapping squares makes it even harder to interpret.

Here’s my remix:

With a pro designer to work this over, the NYTimes could have a sexy front-page infographic that’s meaningful, too. Look at that: among the top ten organizations, conservative spending is two times liberal spending. And if you pull out the “party spenders”, i.e. the NRCC, NRSC, DCCC, and DSCC, then conservatives are spending five times more. A picture is worth a large number of words, but we should still make them mean something.

I have another remix to share… actually, this is the one that got me to make some graphs of my own. Seeing misleading areas in print once a week I can stand, but when I was reading up on Washington State’s “Tax the Ultra-Rich” ballot initiative this morning and saw it again… well, you’re reading the results.

Behold Exhibit B:

In this case, there is no pretense that the pyramid slabs mean something about the number of returns they represent. They’re not even separate slabs; take a look at the top. This pyramid is metaphorical, and it does have a nice color scheme.

But why not make an actual plot? Again, if you get a professional designer to work it over, it can have nice fonts and margins and all, but doesn’t my remix below get the point across better?

Here’s some code if you want to remix my remix.

MCMC in Python: How to stick a statistical model on a system dynamics model in PyMC

A recent question on the PyMC mailing list inspired me.  How can you estimate transition parameters in a compartmental model?  I did a lit search for just this when I started up my generic disease modeling project two years ago.  Much information, I did not find.  I turned up one paper which said basically that using a Bayesian approach was a great idea and someone should try it (and I can’t even find that now!).

Part of the problem was language.  I’ve since learned that micro-simulators call it “calibration” when you estimate parameter values, and there is a whole community of researchers working on “black-box modeling plug-and-play inference” who do something similar as well.  These magic phrases are incantations to the search engines that help find some relevant prior work.

But I started blazing my own path before I learned any of the right words; with PyMC, it is relatively simple.  Consider the classic SIR model from mathematical epidemiology.  It’s a great place to start, and it’s what Jason Andrews started with on the PyMC list.  I’ll show you how to formulate it for Bayesian parameter estimation in PyMC, and how to make sure your MCMC has run for long enough.
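The full post works through the details; as a rough orientation, here is a sketch (not the post’s actual code) of how the deterministic SI dynamics can be written as PyMC deterministics, mirroring the stochastic example earlier on this page:

## deterministic SI model -- a sketch, not the code from the full post
from pymc import *
from numpy import *

# observed data (the same toy series as in the stochastic example above)
T = 10
susceptible_data = array([999,997,996,994,993,992,990,989,986,984])
infected_data = array([1,2,5,6,7,18,19,21,23,25])

# priors on the transition parameters
beta = Uniform('beta', 0., 1., value=.05)
gamma = Uniform('gamma', 0., 1., value=.001)

# deterministic difference equations for the compartments
S = {0: 999.}
I = {0: 1.}
for i in range(1, T):
    @deterministic(name='S_%d' % i)
    def S_i(S=S[i-1], I=I[i-1], beta=beta):
        return S - beta * S * I / (S + I)
    S[i] = S_i

    @deterministic(name='I_%d' % i)
    def I_i(S=S[i-1], I=I[i-1], beta=beta, gamma=gamma):
        return I + beta * S * I / (S + I) - gamma * I
    I[i] = I_i

# data likelihood
A = Poisson('A', mu=[S[i] for i in range(T)], value=susceptible_data, observed=True)
B = Poisson('B', mu=[I[i] for i in range(T)], value=infected_data, observed=True)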

Network Stats Continue

A couple of new papers on networks crossed my desk this week.   Well, more than a couple, since I’m PC-ing for the Web Algorithms Workshop (WAW) right now.  But a couple crossed my desk that I’m not reviewing, which means I can write about them.

Brendan Nyhan writes:

Just came across your blog post on Christakis/Fowler and the various critiques – thought this paper I just posted to arXiv with Hans Noel might also be of interest: The “Unfriending” Problem: The Consequences of Homophily in Friendship Retention for Causal Estimates of Social Influence

Unfriending is interesting, and an area that seems understudied.  In online social networks, there is often no cost to keeping a tie in place.  The XBox Live friend network is not such a case: an XBox gamer sees frequent updates about their friends’ activities. That’s why I thought it made sense when I learned that the XBox Live social network does not exhibit the heavy-tailed degree-distribution phenomenon that has been widely reported in real-world networks. Someone should talk Microsoft into releasing an anonymized edition of this graph (if such an anonymization is possible…).

Meanwhile, Anton Westveld and Peter Hoff’s paper on modeling longitudinal network data caught my eye on arxiv: A Mixed Effects Model for Longitudinal Relational and Network Data, with Applications to International Trade and Conflict.

All the things I’d like to read… I could write a book about it. Before I even had time to finish writing this post, I saw another one: On the Existence of the MLE for a Directed Random Graph Network Model with Reciprocation.
