Journal Club: Wireless Substitution: State-level Estimates From the National Health Interview Survey, January 2007–June 2010

This National Health Statistics Report, which we read toward the end of last quarter’s journal club, has one of the driest names we’ve seen. But the topic is a fascinating glimpse into the limits of our knowledge about society. How many households in the USA have given up their landline phone entirely and have only a cell phone? Well, we answer most questions like that with a telephone survey. Uh-oh. Fortunately, the National Health Interview Survey (in my experience, pronounced most commonly as “en-hiss”) is a health survey in which enumerators visit households in person. Even though it is about population health, it can also answer this pressing question about technology use (and about the potential invalidity of all the surveys that do not visit households in person, but just call on the phone).
It was exactly a year ago when I firmed up a workflow wherein the IPython Notebook was the center of my daily scientific research. All notes end up in a .ipynb file, and my code, plots, and equations all live together there. Looking back on 2013, how did it go and what should I change for 2014?
I am very happy with it overall. I have 641 .ipynb files, with names like
2013_12_22a_dm_pde_for_pop_prediction.ipynb. This includes notes for two courses I taught and plan to teach again, for several papers that we published, and for a large number of projects that didn’t pan out. I’ll definitely use the course notes again the next time I teach; I’ve already had to look up the calculations from some of those papers for responses to reviewers and clarifications after publication; and maybe in the future I can come back to the projects that didn’t pan out with some new insight.
What could go better? I couldn’t decide if my lab book should capture everything, like I was taught in science class, or be a curated collection of my work, including only the parts I would need in the future. Probably some blend is best, and since it is hard to know the right balance ahead of time, I tried to keep everything in a git repo, so that I could curate and edit, but recover anything that I realized I still wanted after cutting. I only ended up with 59 git commits, though. If that approach were working, I would expect more commits than notebooks.
I sometimes lost things in my stack of notebooks. The .ipynb format is not easy to search, so I kept a .py copy of everything and grepped through them looking for the notebooks about a specific technique or project. Since I organized my notebooks chronologically, I ended up doing this a lot more than if I had organized them thematically, but even if I already had all of my congenital heart disease notes in one place, I would still find myself saying, “I know I did some data munging like this for a different project recently, how does the pandas.melt function work again?”, or whatever.
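Since the .ipynb format is just JSON under the hood, another option besides keeping .py copies is to search the notebooks’ code cells directly. Here is a hypothetical sketch (`search_notebooks` is my name, not part of any library); it reads the modern layout with top-level `cells` and falls back to the older `worksheets` layout used by early IPython Notebook versions:

```python
import glob
import json

def search_notebooks(pattern, keyword):
    """Return paths of notebooks whose code cells mention `keyword`.

    Handles both nbformat layouts: top-level "cells" (newer) and
    cells nested under "worksheets" (older IPython Notebook files).
    """
    hits = []
    for path in glob.glob(pattern):
        with open(path) as f:
            nb = json.load(f)
        cells = nb.get("cells") or [
            c for ws in nb.get("worksheets", []) for c in ws.get("cells", [])
        ]
        for cell in cells:
            # Older files call the code field "input"; newer ones "source".
            # Either may be a string or a list of lines; join handles both.
            source = "".join(cell.get("source") or cell.get("input") or [])
            if cell.get("cell_type") == "code" and keyword in source:
                hits.append(path)
                break  # one match is enough to report this notebook
    return hits
```

For example, `search_notebooks("*.ipynb", "pandas.melt")` would list the notebooks where I had already worked out that bit of data munging.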
The feature I would like the most is a way to paste images into my notebook. I wrote some notes about it in a GitHub issue about IPython Notebook feature requests. I want the digital equivalent of stapling a copy into my lab book, and I want it to be easy.
Collaboration worked pretty well. I have a lot of colleagues who don’t want to see Python code, no matter how much easier it would make their lives. I’ve had good success sending them PDF versions of notebooks, or sticking my research notes in a GitHub gist and sending them an nbviewer link. I think there is room for improvement in this, too, though.
I’ve been organizing my thoughts on probabilistic programming and Bayesian computation. Try this out: there are four things a probabilistic programming system needs, depending on who is using it for what:
- Expressive language for formulating models
- Efficient computation of objective functions
- Flexible inference algorithms
- Appropriate data analysis workflow
Maybe I can come up with better names for these pieces, and maybe they are not all different. And maybe I am missing something. This is sort of preliminary. But let me elaborate on how it works in the case of PyMC.
Expressive language for formulating models: this is what drew me to PyMC when I started doing applied work five-ish years ago. Just write Python. For simple things, it reads as easily as equations in a stats paper, and for complex things it can have subroutines, data structures, and all of the nice things I expect from a modern programming language.
Efficient computation of objective functions: PyMC2 has a strange confection of Python and Fortran under the hood, which works well enough for the stuff I’ve been doing. But (if I understand correctly) PyMC3 pushes everything off into Theano, which does a more sophisticated translation/compilation of the code.
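To make “objective function” concrete, here is a hypothetical toy example of the quantity such a system must evaluate over and over during inference: the log-posterior of a beta-binomial model, written out by hand in plain Python. (This is the sort of function PyMC constructs for you, whether through PyMC2’s Python/Fortran internals or PyMC3’s Theano compilation; the model and numbers here are mine, purely for illustration.)

```python
import math

def log_posterior(theta, y=7, n=10):
    """Log-posterior (up to a constant) for a toy model:
    theta ~ Beta(1, 1); y ~ Binomial(n, theta), with y=7 successes in n=10 trials."""
    if not 0.0 < theta < 1.0:
        return float("-inf")  # outside the support of the prior
    log_prior = 0.0  # Beta(1, 1) is uniform on (0, 1)
    log_lik = (
        math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
        + y * math.log(theta) + (n - y) * math.log(1.0 - theta)
    )
    return log_prior + log_lik
```

An MCMC run may call a function like this (or its gradient) millions of times, which is why pushing the computation into compiled code matters so much.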
Flexible inference algorithms: I think that a lot of the inspiration for PyMC3 is the possibility of using Hamiltonian Monte Carlo methods for generating MCMC steps, which requires quickly computing the derivative of the objective function. PyMC2 has relied heavily on the Adaptive Metropolis step method. In the past, I’ve had a lot of fun experimenting with alternative approaches.
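For readers who haven’t seen one of these step methods written down, here is a minimal sketch of vanilla random-walk Metropolis, the method that PyMC2’s Adaptive Metropolis refines by tuning the proposal distribution as it runs. This is my own illustration, not PyMC code, and the toy target (a standard normal log-density) is chosen just so the sketch is self-contained:

```python
import math
import random

def metropolis(log_post, x0, n_steps=20000, step=1.0, seed=0):
    """Random-walk Metropolis sketch (PyMC2's AdaptiveMetropolis additionally
    adapts the proposal scale/covariance during sampling)."""
    rng = random.Random(seed)
    x, lp = x0, log_post(x0)
    samples = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step)
        lp_new = log_post(proposal)
        # Accept with probability min(1, exp(lp_new - lp)).
        if math.log(rng.random()) < lp_new - lp:
            x, lp = proposal, lp_new
        samples.append(x)
    return samples

# Toy target: standard normal log-density, up to an additive constant.
samples = metropolis(lambda x: -0.5 * x * x, x0=0.0)
```

Hamiltonian Monte Carlo replaces the blind Gaussian proposal here with trajectories guided by the gradient of `log_post`, which is exactly why fast derivative computation becomes a requirement.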
Appropriate data analysis workflow: I’ve had a few long discussions with other researchers who are using these methods about the barriers for their work and their colleagues, and this is the part that seems most important. How do you get the data all in place to evaluate objective functions and run flexible inference algorithms? This is not really a core part of PyMC, but rather something to be done with general Python, which suits me just fine.
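As a hypothetical example of what I mean by “getting the data all in place”: before any objective function can be evaluated, a wide table often has to be reshaped into one-observation-per-row form, and general Python (here, pandas) handles that outside the probabilistic programming system itself. The column names and numbers below are made up for illustration:

```python
import pandas as pd

# Wide-format input: one row per state, one column per year (made-up data).
wide = pd.DataFrame({
    "state": ["WA", "CA"],
    "deaths_2007": [10, 50],
    "deaths_2008": [12, 55],
})

# Melt to long format: one row per (state, year) observation.
long = pd.melt(wide, id_vars="state", var_name="year", value_name="deaths")
long["year"] = long["year"].str.replace("deaths_", "").astype(int)
```

The long table is what gets iterated over (or vectorized against) when evaluating the likelihood, and none of this needs anything from PyMC itself.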
I’d love to workshop this a little bit with you, dear reader, so I’m going to try turning on comments again. I hope I don’t get spammed into oblivion.