Monthly Archives: March 2011

Data Wrangling in R, Stata and Python

It was nearly a year ago when I was accosted by students I had urged to try Python and their complaints that the data manipulation capabilities they found so convenient in R and Stata were nowhere to be found. At the time, I did some digging and told them to try la.larry (or pandas, as mentioned by readers of that post). With some more experience, these recommendations have come up again, and in hindsight it seems like la.larry is too heavy a hammer for our most common tasks.

I’m hoping to put together a translation guide for R, Stata, and Python (there is already an extensive one… ours will be much more specialized, to just a few data wrangling commands), and until then, here are Kyle’s top two:

The easiest way to build record arrays (aside from csv2rec) IMO:

import numpy as np
a = ['USA','USA','CAN']
b = [1,6,4]
c = [1990.1,2005.,1995.]
d = ['x','y','z']
some_recarray = np.core.records.fromarrays([a,b,c,d], names=['country','age','year','data'])

The fromarrays method is especially nice because it automatically deals with datatypes.

To subset a particular country-year-age:

some_recarray[(some_recarray.country=='USA')
                & (some_recarray.age==6)
                & (some_recarray.year==2005)]

I’ve also found that caching each of the indices independently vastly speeds things up if you’re going to be iterating.

Love the recarray, hate the documentation.

6 Comments

Filed under software engineering

Black History #Math

If I was still tweeting, I’d tweet this to you: a nice post about David Blackwell and his math on Higher Cohomology is Inevitable.

2 Comments

Filed under statistics

That Bayes Factor

A year or more ago, when I was trying to learn about model selection, I started writing a tutorial about doing it with PyMC. It ended up being more difficult than I expected though, and I left it for later. And later has become later and later. Yet people continue landing on this page and I’m sure that they are leaving disappointed because of its unfinishedness. Sad. But fortunately another tutorial exists, Estimating Bayes factors using PyMC by PyMC developer Chris Fonnesbeck. So try that for now, and I’ll work on mine more later (and later and later).

There is also an extensive discussion on the PyMC mailing list that you can turn to here. It produced this monte carlo calculation.

1 Comment

Filed under MCMC, statistics

Ancient Probability

I had a need to look in the first-ever probability textbook this weekend. It isn’t really ancient, but it is definitely old. And it’s all on the Internet Archive as a pdf. Good times.

Abraham de Moivre, The doctrine of chances: or, A method of calculating the probabilities of events in play (1756)

2 Comments

Filed under education