Category Archives: software engineering

hello, world of statistical graphics in IPython notebook

A few months ago, I had great success invoking the internet to come up with the “hello, world” of statistical graphics.

There are some exciting new developments in javascript-based plotting, and this graphic is just the thing to compare them. D3js has conquered the world in recent years, and is something that my colleagues are starting to think they need to know. Meanwhile, one of the d3js instigators has unveiled the next in his series of revolutions in data visualization, Vega. This is still in development, but may be more appropriate than d3js for routine plots. And it was very soon after the Vega specification and runtime appeared that a python package for it was also released.

Here is an IPython notebook comparing all of these options. The notebook doesn’t save javascript in a way that redisplays, but if you put it in your own notebook server and execute all the cells you should see something like this:

[figure: vincent_vega]

p.s. google vincent vega to learn the pop culture joke behind this strangely named python package.

Filed under software engineering

Happy Pi Day

I recently came across a stack overflow post just perfect for Pi day. The path to knowledge is asking many questions, and it is a strange feature of the days in which we live how steep this path can be: a question that starts “How to determine whether my calculation of pi is accurate? I was trying various methods to implement a program that gives the digits of pi sequentially…” eventually receives an answer that starts “Since I’m the current world record holder for the most digits of pi, I’ll add my two cents…”

All here.

Filed under software engineering

Stan in IPython: reproducing 8 schools

Continuing my experiment using Stan in IPython, here is a notebook to do a bit of the eight schools example from the RStan Getting Started Guide.

Filed under software engineering

Stan in IPython: getting started

There has been a low murmur about a new MCMC package bouncing through my email inbox for a while now. Stan, it is. The project has reached the point where the developers are soliciting Python integration volunteers, so I decided it is time to check it out.

Good news, it installed and ran the example without frustration! I don’t take that for granted with research software.

IPython Notebook here.

Filed under software engineering

MCMC in Python: Bayesian meta-analysis example

In slow progress on my plan to go through the examples from the OpenBUGS webpage and port them to PyMC, I offer you now Blockers, a random effects meta-analysis of clinical trials.



[py] [pdf]

Filed under MCMC, software engineering

MCMC in Python: A random effects logistic regression example

I have had this idea for a while, to go through the examples from the OpenBUGS webpage and port them to PyMC, so that I can be sure I’m not going much slower than I could be, and so that people can compare MCMC samplers “apples-to-apples”. But it’s easy to have ideas. Acting on them takes more time.

So I’m happy that I finally found a little time to sit with Kyle Foreman and get started. We ported one example over, the “seeds” random effects logistic regression. It is a nice little example, and it also gave me a chance to put something in the ipython notebook, which I continue to think is a great way to share code.





[py] [pdf]

Filed under MCMC, software engineering

PyMC+Pandas: Poisson Regression Example

When I was gushing about the python data package pandas, commenter Rafael S. Calsaverini asked about combining it with PyMC, the python MCMC package that I usually gush about. I had a few minutes free and gave it a try. And just for fun I gave it a try in the new ipython notebook. It works, but it could work even better. See attached:

[pdf] [ipynb]

Filed under MCMC, software engineering

My new favorite for pythonic data wrangling

I’ve written before about my search for the way to deal with data in python. It’s time to write again, though, because I have a new favorite: pandas, the panel data package.

There is copious and growing documentation for pandas, but it assumes a level of familiarity with python and numpy. I thought I’d write up some little example calculations that I’ve done with pandas recently, to complement the real docs with some “recipes”. You don’t really need to know python to use these, let alone numpy.

To begin, here are the creation and subset routines in pandas that do the same work that my last foray into this subject accomplished with the rec_array:

import pandas
a = ['USA','USA','CAN']
b = [1,6,4]
c = [1990.1,2005.,1995.]
d = ['x','y','z']
df = pandas.DataFrame({'country': a, 'age': b, 'year': c, 'data': d})

This is cooler than a rec_array because you don’t have to dig in the docs for the constructor, and you can use a dictionary to name each column.

You can select the subset of data relevant to a particular country-year-age thusly:

df[(df['country']=='USA') & (df['age']==6) & (df['year']==2005)]

This is not as cool as a rec_array, because writing df['age'] has more characters than df.age, but I feel churlish to complain about it.
It’s good that I complained about my uncool df['age'] business, because I learned that df.age works, too, as long as you are using an up-to-date pandas.
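
As a quick check of the two spellings, here is a minimal sketch that rebuilds the same small DataFrame and confirms that attribute access and bracket access select the same rows. The variable names subset and subset_q are just for illustration, and the last line uses DataFrame.query, which is not mentioned above and is only available in later pandas releases.

```python
import pandas

# Same toy data as above, built from a dictionary of columns.
df = pandas.DataFrame({
    'country': ['USA', 'USA', 'CAN'],
    'age': [1, 6, 4],
    'year': [1990.1, 2005., 1995.],
    'data': ['x', 'y', 'z'],
})

# Attribute access and bracket access return the same column.
same_column = df.age.equals(df['age'])

# The country-year-age subset, written with attribute access.
subset = df[(df.country == 'USA') & (df.age == 6) & (df.year == 2005)]

# The same filter expressed as a string (assumes a pandas with DataFrame.query).
subset_q = df.query("country == 'USA' and age == 6 and year == 2005")
```

One caveat: attribute access only works when the column name is a valid python identifier and does not collide with an existing DataFrame attribute or method, so df['count'] is safer than df.count.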

More substantial recipe to come. Is there already a cookbook out there?

Filed under software engineering

Validating Statistical Models

I’ve been thinking a lot about validating statistical models. My disease models are complicated, and there are many places to make a little mistake. And people care about the numbers, so they will care if I make mistakes. My concern is grounded in experience; when I was re-implementing my disease modeling system, I realized that I had mis-parameterized a bit of the model, giving undue influence to observations with small sample size. Good thing I caught it before I published anything based on the results!

How do I avoid this trouble going forwards? A well-timed blog post from Statistical Modeling, Causal Inference, and Social Science highlights one way, described in a paper linked there. I like this approach, and I partially replicated it in PyMC. But I’m concerned about something, which the authors mention in their conclusion:

To help ensure that errors, when present, are apparent from the simulation results, we caution against using “nice” numbers for fixed inputs or “balanced” dimensions in these simulations. For example, consider a generic hyperprior scale parameter s. If software were incorrectly written to use s^2 instead of s, the software could still appear to work correctly if tested with the fixed value of s set to 1 (or very close to 1), but would not work correctly for other values of s.
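
To make the quoted warning concrete, here is a minimal sketch in plain numpy (not the paper’s validation procedure, and not the PyMC replication). The function names draw_correct, draw_buggy, and sd_ratio are hypothetical. It assumes s is meant to be the standard deviation of a normal distribution, and that the bug passes s**2 where s was intended.

```python
import numpy as np

def draw_correct(s, n, rng):
    """Draw n values from Normal(0, s), with s as the standard deviation."""
    return rng.normal(0.0, s, size=n)

def draw_buggy(s, n, rng):
    """Same draw, but with the bug from the quote: s**2 used in place of s."""
    return rng.normal(0.0, s ** 2, size=n)

def sd_ratio(s, n=200_000, seed=12345):
    """Ratio of the sample standard deviations, buggy over correct."""
    rng = np.random.default_rng(seed)
    return draw_buggy(s, n, rng).std() / draw_correct(s, n, rng).std()

# With the "nice" value s = 1 the bug is invisible: the ratio is close to 1.
ratio_nice = sd_ratio(1.0)

# With an awkward value the bug shows up: the ratio is close to s itself.
ratio_awkward = sd_ratio(2.7)
```

Any check that compares simulated output against its expected spread would pass at s = 1 and fail at s = 2.7, which is the point of the authors’ caution.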

How do I avoid nice numbers in practice? I have an idea, but I’m not sure I like it. Does anyone else have ideas?

Also, my replication only works part of the time for my simple example, I guess because one of my errors is not enough of an error:


Filed under MCMC, software engineering

Other Way Cool Demos from SciPy 2011

Besides the marvelous upgrade to ipython, there were some other things I saw at SciPy 2011 that I want to remember to remember.

I think I’ll have a lot more to say about Dexy soon, because I really need something like that. A tool to make documentation sexy. If only the tool itself had more documentation!

Filed under software engineering