Here is a web tool that recently crossed my desk, an interactive map on the human cost of mountaintop removal. It is a mashup of a bunch of different data sources, all on a Google map, that all say it is not healthy to live near a mountaintop removal coal mine.
Powell’s Method for Maximization in PyMC
I have been using “Powell’s Method” to find maximum likelihood (ahem, maximum a posteriori) estimates in my PyMC models for years now. It is something that I arrived at by trying all the options once, and it seemed to work most reliably. But what does it do? I never bothered to figure out, until today.
It does something very reasonable. That is to optimize a multidimensional function along one dimension at a time. And it does something very tricky, which is to update the basis for one-dimensional optimization periodically, so that it quickly finds a “good” set of dimensions to optimize over. Now that sounds familiar, doesn’t it? It is definitely the same kind of trick that makes the Adaptive Metropolis step method a winner in MCMC.
The 48-year-old paper introducing the approach, M. J. D. Powell, An efficient method for finding the minimum of a function of several variables without calculating derivatives, is quite readable today. If you want to see it in action, I added an ipython notebook to my pymc examples repository on github: [ipynb] [py] [pdf].
Filed under Uncategorized
Parameterizing Negative Binomial distributions
The negative binomial distribution is cool. Sometimes I think that.
Sometimes I think it is more trouble than it’s worth, a complicated mess.
Today, both.
Wikipedia and PyMC parameterize it differently, and it is a source of continuing confusion for me, so I’m just going to write it out here and have my own reference. (Which will match with PyMC, I hope!)
The important thing about the negative binomial, as far as I’m concerned, is that it is like a Poisson distribution, but “over-dispersed”. That is to say that the standard deviation is not always the square root of the mean. So I’d like to parameterize it with a parameter for the mean and
for the dispersion. This is almost what PyMC does, except it calls the dispersion parameter
instead of
.
The slightly less important, but still informative, thing about the negative binomial, as far as I’m concerned, is that the way it is like a Poisson distribution is very direct. A negative binomial is a Poisson that has a Gamma-distributed random variable for its rate. In other words (symbols?), is just shorthand for
Unfortunately, nobody parameterizes the Gamma distribution this way. And so things get really confusing.
The way to get unconfused is to write out the distributions, although after they’re written, you might doubt me:
The negative binomial distribution is
and the Poisson distribution is
and the Gamma distribution is
Hmm, does that help yet? If and
, it all works out:
But instead of integrating it analytically (or in addition to), I am extra re-assured by seeing the results of a little PyMC model for this:
I put a notebook for making this plot in my pymc-examples repository. Love those notebooks. [pdf] [ipynb]
Filed under statistics
PyMC+Pandas: Poisson Regression Example
When I was gushing about the python data package pandas, commenter Rafael S. Calsaverini asked about combining it with PyMC, the python MCMC package that I usually gush about. I had a few minutes free and gave it a try. And just for fun I gave it a try in the new ipython notebook. It works, but it could work even better. See attached:
Filed under MCMC, software engineering
My new favorite for pythonic data wrangling
I’ve written before about my search for the way to deal with data in python. It’s time to write again, though because I have a new favorite: pandas, the panel data package.
There is copious, and growing documentation for pandas, but it assumes a level of familiarity with python and numpy. I thought I’d write some little examples calculations that I’ve done with pandas recently to complement the real docs with some “recipes”. You don’t really need to know python to use these, let alone numpy.
To begin, here are the creation and subset routines in pandas that do the same work that my last foray into this subject accomplished with the rec_array:
import pandas
a = ['USA','USA','CAN']
b = [1,6,4]
c = [1990.1,2005.,1995.]
d = ['x','y','z']
df = pandas.DataFrame({'country': a, 'age': b, 'year': c, 'data': d})
This is cooler than a rec_array because you don’t have to dig in the docs for the constructor, and you can use a dictionary to name each column.
You can select the subset of data relevant to a particular country-year-age thusly:
df[(df['country']=='USA') & (df['age']==6) & (df['year']==2005)]
This is not as cool as a rec_array, because writing It’s good that I complained about my uncool df['age'] has more characters than df.age, but I feel churlish to complain about it.
df['age'] business, because I learned that df.age works, too, as long as you are using an up-to-date pandas.
More substantial recipe to come. Is there already a cookbook out there?
Filed under software engineering
Code as Play
Cool project for teaching programming through web games: Play My Code
How to embed the game in the blog?
Filed under education
PyMC and PyMCMC
I learned last week about a Python Package for doing MCMC estimation, called PyMCMC. It sounds sort of like something I’m always writing about, doesn’t it?
From my quick look, it appears that pyMCMC has some advanced sampling methods (like Slice sampling) that are not yet implemented for PyMC. On the other hand, it seems like PyMC has a more flexible modeling language, which permits formulation of complex models without writing out likelihood functions explicitly.
Has anyone used PyMCMC? How did it go for you?
Filed under MCMC
Bitcoin and Anonymity
Bitcoin is intriguing, a digital currency where the entire transaction history of economy is held in common by all participants. I think that this will be a great observatory for research for someone. I read a recent paper that has some of the elements of this, An Analysis of Anonymity in the Bitcoin System by Fergal Reid and Martin Harrigan recently. As the name implies, it is mostly about the anonymity of the system. But it also includes a description of “the alleged theft of Bitcoins, which, at the time of the theft, had a market value of approximately half a million U.S. dollars”. That could be the plot of a good heist movie.
The paper led me to the bitcoin tools repository, which I’ll have to look into in more detail in the future.
Filed under Complex Networks
Complex Networks in the Kitchen
Two papers in the Arxiv caught my eye recently, (I have time to keep up on papers again!) both about networks and cooking. Both came out around Thanksgiving, too, but maybe that is just a coincidence.
Chun-Yuen Teng, Yu-Ru Lin, Lada A. Adamic, Recipe recommendation using ingredient networks
Yong-Yeol Ahn, Sebastian E. Ahnert, James P. Bagrow, Albert-László Barabási, Flavor network and the principles of food pairing
They both have wonderfully complex network graphics, although the lack of information in these beautiful figures is acknowledged:


I’d love to combine this sort of analysis with the work on nutritional risk factors that has been going on around here recently. Did either of these papers come with a dataset I can explore?
Filed under Complex Networks
Teaching in Video Blips
Claire Mathieu has been blogging about intro math and CS videos from Khan Academy and from others:
I’ve heard about this Khan Academy, and it seems like more and more course material is appearing as tiny web videos.
Also I recently found out that there is a free, online version of the Stanford Intro AI Class taught by Peter Norvig and Sebastian Thrun, for which 56,000 students signed up. I think I accidentally did their homework.
Filed under education




