My new favorite for pythonic data wrangling

I’ve written before about my search for the way to deal with data in python. It’s time to write again, though, because I have a new favorite: pandas, the panel data package.

There is copious and growing documentation for pandas, but it assumes a level of familiarity with python and numpy. I thought I’d write up some little example calculations that I’ve done with pandas recently, to complement the real docs with some “recipes”. You don’t really need to know python to use these, let alone numpy.

To begin, here are the creation and subset routines in pandas that do the same work that my last foray into this subject accomplished with the rec_array:

import pandas
a = ['USA','USA','CAN']
b = [1,6,4]
c = [1990.1,2005.,1995.]
d = ['x','y','z']
df = pandas.DataFrame({'country': a, 'age': b, 'year': c, 'data': d})

This is cooler than a rec_array because you don’t have to dig in the docs for the constructor, and you can use a dictionary to name each column.
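To check what the dictionary constructor produced, something like the following sketch could be used. It rebuilds the same example frame so it runs on its own; the names n_rows, n_cols, and column_names are just mine for illustration. I sort the column names because the column order may depend on the pandas version.

```python
import pandas

# Same example data as above, passed as a dictionary of column name -> values.
df = pandas.DataFrame({
    'country': ['USA', 'USA', 'CAN'],
    'age': [1, 6, 4],
    'year': [1990.1, 2005., 1995.],
    'data': ['x', 'y', 'z'],
})

# shape is (number of rows, number of columns).
n_rows, n_cols = df.shape

# The dictionary keys become the column names.
column_names = sorted(df.columns)

# pandas infers a type per column: integers for age, floats for year.
print(df.dtypes)
print(df)
```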

You can select the subset of data relevant to a particular country-year-age thusly:

df[(df['country']=='USA') & (df['age']==6) & (df['year']==2005)]
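If that one-liner looks dense, here is a sketch that breaks it into steps. The intermediate names (is_usa, is_age_6, is_2005, mask, subset) are mine, not anything pandas requires. Each comparison produces a column of True/False values, and & combines them row by row.

```python
import pandas

df = pandas.DataFrame({
    'country': ['USA', 'USA', 'CAN'],
    'age': [1, 6, 4],
    'year': [1990.1, 2005., 1995.],
    'data': ['x', 'y', 'z'],
})

# Each comparison yields a boolean Series with one entry per row.
is_usa = df['country'] == 'USA'
is_age_6 = df['age'] == 6
is_2005 = df['year'] == 2005

# & is an elementwise "and". When the comparisons are written inline,
# the parentheses are needed because & binds more tightly than ==.
mask = is_usa & is_age_6 & is_2005

# Indexing with a boolean Series keeps only the rows where it is True.
subset = df[mask]
print(subset)
```

Only the second row (USA, age 6, year 2005) satisfies all three conditions, so subset has a single row.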

This is not as cool as a rec_array, because writing df['age'] takes more characters than df.age, but I feel churlish complaining about it.
Update: it’s good that I complained about my uncool df['age'] business, because I learned that df.age works, too, as long as you are using an up-to-date pandas.
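A minimal sketch of the attribute style, assuming a reasonably recent pandas. One caveat I am adding here: the dotted form only works when the column name is a valid python identifier and does not clash with an existing DataFrame attribute, so the bracket form remains the safe fallback. The second frame, df2, with a column named 'count', is a made-up example to show the clash.

```python
import pandas

df = pandas.DataFrame({
    'country': ['USA', 'USA', 'CAN'],
    'age': [1, 6, 4],
    'year': [1990.1, 2005., 1995.],
    'data': ['x', 'y', 'z'],
})

# df.age and df['age'] refer to the same column.
same = df.age.equals(df['age'])

# The country-year-age selection, written with attribute access.
subset = df[(df.country == 'USA') & (df.age == 6) & (df.year == 2005)]
print(subset)

# Hypothetical clash: 'count' is also a DataFrame method, so df2.count
# is the method, and the column is only reachable as df2['count'].
df2 = pandas.DataFrame({'count': [1, 2]})
count_is_method = callable(df2.count)
count_values = df2['count'].tolist()
```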

More substantial recipes to come. Is there already a cookbook out there?


Filed under software engineering

5 responses to “My new favorite for pythonic data wrangling”

  1. Ben

    Pandas can do data.age! When you add columns via a dictionary, their names get added to the data frame as attributes.

  2. Awesome, you’re right! I thought it didn’t because I was using an old version.

  3. Hi!
    Have you tried using pandas with pymc? Do these two packages talk well to one another?
    Can I have a pymc stochastic that is a DataFrame or a Series?

  4. Good question… I expect that you can, so give it a try. Maybe I will too, in a future blog post.

  5. Pingback: PyMC+Pandas: Poisson Regression Example | Healthy Algorithms