Data Wrangling in R, Stata and Python

Nearly a year ago, students I had urged to try Python accosted me with complaints that the data manipulation capabilities they found so convenient in R and Stata were nowhere to be found. At the time, I did some digging and told them to try la.larry (or pandas, as readers of that post suggested). Now that we have more experience, these recommendations have come up again, and in hindsight la.larry seems too heavy a hammer for our most common tasks.

I’m hoping to put together a translation guide for R, Stata, and Python (there is already an extensive one… ours will be much more specialized, covering just a few data wrangling commands). Until then, here are Kyle’s top two:

The easiest way to build record arrays (aside from csv2rec) IMO:

import numpy as np
a = ['USA','USA','CAN']
b = [1,6,4]
c = [1990.1,2005.,1995.]
d = ['x','y','z']
some_recarray = np.rec.fromarrays([a,b,c,d], names=['country','age','year','data'])

The fromarrays function is especially nice because it infers the datatype of each column automatically.
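To see what that inference produces, here is a minimal sketch that rebuilds the array above and inspects the dtype of each field. The variable names `rec` and `kinds` are mine, and the exact integer and string widths can differ by platform, so the sketch only looks at the dtype kind of each column.

```python
import numpy as np

a = ['USA', 'USA', 'CAN']
b = [1, 6, 4]
c = [1990.1, 2005., 1995.]
d = ['x', 'y', 'z']

# Same call as above; each list is converted to an array and its dtype inferred
rec = np.rec.fromarrays([a, b, c, d], names=['country', 'age', 'year', 'data'])

# One dtype kind per field: 'U' = unicode string, 'i' = signed integer, 'f' = float
kinds = [rec.dtype[name].kind for name in rec.dtype.names]
print(rec.dtype.names)
print(kinds)
```

If a column comes out as a string when you expected a number, the usual cause is a stray non-numeric entry in the input list.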

To subset a particular country-year-age:

some_recarray[(some_recarray.country=='USA')
                & (some_recarray.age==6)
                & (some_recarray.year==2005)]

I’ve also found that caching each of the indices independently vastly speeds things up if you’re going to be iterating.
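A rough sketch of that caching idea, assuming the same toy data as above: each boolean mask is computed once per distinct value and stored in a dictionary, so a loop over many country-year-age combinations only has to combine precomputed masks. The names `country_mask`, `age_mask`, `year_mask` and `subset` are hypothetical helpers for illustration, not part of NumPy.

```python
import numpy as np

rec = np.rec.fromarrays(
    [['USA', 'USA', 'CAN'], [1, 6, 4], [1990.1, 2005., 1995.], ['x', 'y', 'z']],
    names=['country', 'age', 'year', 'data'])

# Build each mask once, keyed by plain Python values, outside any loop
country_mask = {v: rec.country == v for v in np.unique(rec.country).tolist()}
age_mask = {v: rec.age == v for v in np.unique(rec.age).tolist()}
year_mask = {v: rec.year == v for v in np.unique(rec.year).tolist()}

def subset(country, age, year):
    """Select rows by combining the cached masks with elementwise AND."""
    return rec[country_mask[country] & age_mask[age] & year_mask[year]]

rows = subset('USA', 6, 2005.)
# Index the 'data' field by name: attribute access would return ndarray.data
# (the raw memory buffer) instead of the field
print(rows['data'])
```

Whether this pays off depends on how many times each value is reused; with only a handful of lookups, building the dictionaries may cost more than it saves.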

Love the recarray, hate the documentation.



6 responses to “Data Wrangling in R, Stata and Python”

  1. I would love to learn more about data manipulation in Python. Do you have any resources that you recommend for this R user?

  2. All I’ve got so far is what’s in this post and the one I mentioned from 1 year ago… more to come soon(ish).

  3. Pingback: My new favorite for pythonic data wrangling | Healthy Algorithms

  4. Kat

    Hi. Do you know of a book that translates from R/Stata to Python/pandas/NumPy/SciPy?

  5. I don’t know of a translation book, but it seems like now is the time for someone to write one. What do you want translated?