Daily Archives: March 31, 2011

Data Wrangling in R, Stata and Python

It was nearly a year ago when I was accosted by students I had urged to try Python and their complaints that the data manipulation capabilities they found so convenient in R and Stata were nowhere to be found. At the time, I did some digging and told them to try la.larry (or pandas, as mentioned by readers of that post). With some more experience, these recommendations have come up again, and in hindsight it seems like la.larry is too heavy a hammer for our most common tasks.

I’m hoping to put together a translation guide for R, Stata, and Python (there is already an extensive one… ours will be much more specialized, to just a few data wrangling commands), and until then, here are Kyle’s top two:

The easiest way to build record arrays (aside from csv2rec) IMO:

import numpy as np
a = ['USA','USA','CAN']
b = [1,6,4]
c = [1990.1,2005.,1995.]
d = ['x','y','z']
some_recarray = np.core.records.fromarrays([a,b,c,d], names=['country','age','year','data'])

The fromarrays method is especially nice because it automatically deals with datatypes.

To subset a particular country-year-age:

some_recarray[(some_recarray.country=='USA')
                & (some_recarray.age==6)
                & (some_recarray.year==2005)]

I’ve also found that caching each of the indices independently vastly speeds things up if you’re going to be iterating.

Love the recarray, hate the documentation.

6 Comments

Filed under software engineering