Tag Archives: pandas

dfply package

Potentially of interest, although I’ve done enough d3js to think that .select .head is fine notation:

dfply Version: 0.2.4

GitHub – kieferk from November 28, 2016
“The dfply package makes it possible to do R’s dplyr-style data manipulation with pipes in python on pandas DataFrames.”
https://github.com/kieferk/dfply

from dfply import *

diamonds >> select(X.carat, X.cut) >> head(3)

   carat      cut
0   0.23    Ideal
1   0.21  Premium
2   0.23     Good
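
For comparison, here is a minimal sketch of the same selection in plain pandas notation. The small `diamonds` frame below is a hypothetical stand-in for the dataset that dfply ships, built only so that the snippet runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for the first rows of the diamonds data shown above.
diamonds = pd.DataFrame({'carat': [0.23, 0.21, 0.23, 0.29],
                         'cut': ['Ideal', 'Premium', 'Good', 'Premium'],
                         'price': [326, 326, 327, 334]})

# Select two columns, then keep the first three rows.
result = diamonds[['carat', 'cut']].head(3)
print(result)
```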

Filed under software engineering

Delta Time in Python: Simple calendar times with Pandas

Here is something that Google did not help with as quickly as I would have expected: how do I convert start and stop times into the time between events in seconds (or minutes)?

Or, for the busy searcher: “how do I convert Pandas Timedelta to seconds?”

The classy answer is:

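import pandas as pd

# df is assumed to hold the start and stop times in these two columns
start_time = df.interviewstarttime.map(pd.Timestamp)
end_time = df.interviewendtime.map(pd.Timestamp)

# dividing by a one-minute Timedelta yields minutes; use seconds=1 for seconds
((end_time-start_time) / pd.Timedelta(minutes=1)).describe()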

I found it hidden away here: http://www.datasciencebytes.com/bytes/2015/05/16/pandas-timedelta-histograms-unit-conversion-and-overflow-danger/
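
Since the question asks for seconds, here is a small self-contained sketch of both conversions. The two-row table is made-up sample data; only the column names follow the snippet above.

```python
import pandas as pd

# Hypothetical sample data; the column names mirror the snippet above.
df = pd.DataFrame({
    'interviewstarttime': ['2015-05-16 09:00:00', '2015-05-16 10:30:00'],
    'interviewendtime': ['2015-05-16 09:45:30', '2015-05-16 11:00:00'],
})

start_time = pd.to_datetime(df.interviewstarttime)
end_time = pd.to_datetime(df.interviewendtime)
duration = end_time - start_time

# Dividing by a one-unit Timedelta gives plain floats in that unit.
seconds = duration / pd.Timedelta(seconds=1)
minutes = duration / pd.Timedelta(minutes=1)

# The .dt accessor offers the same conversion to seconds.
seconds_alt = duration.dt.total_seconds()
print(seconds.tolist(), minutes.tolist())
```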

Filed under statistics

Dates and Times in Python: average of two dates with Pandas

I spent a little longer than expected figuring out how to find the midpoint of two dates for a little table of data recently. Here is a code snippet in case I (or you) have to do this again:

import pandas as pd

# midpoint of two date columns
df = pd.DataFrame({'a': ['5/1/2012 0:00', '4/1/2014 0:00'],
                   'b': ['4/1/2014 0:00', 'unknown']})

# make time data into Timestamp format; unparseable values become NaT
def try_totime(t):
    try:
        return pd.Timestamp(t)
    except ValueError:
        return pd.NaT

df['start'] = df.a.map(try_totime)
df['end'] = df.b.map(try_totime)

# generate midpoint time
# harder than it would seem...
df['time'] = df.start + (df.end - df.start)/2

df
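
The same midpoint can probably be had without the helper function. This is a sketch rather than the original recipe: it swaps in `pd.to_datetime` with `errors='coerce'`, and the format string is my assumption about how the sample strings are laid out.

```python
import pandas as pd

df = pd.DataFrame({'a': ['5/1/2012 0:00', '4/1/2014 0:00'],
                   'b': ['4/1/2014 0:00', 'unknown']})

# Assumed format for these strings; anything that does not match becomes NaT.
fmt = '%m/%d/%Y %H:%M'
start = pd.to_datetime(df['a'], format=fmt, errors='coerce')
end = pd.to_datetime(df['b'], format=fmt, errors='coerce')

# Timestamps cannot be added together, so start + end fails;
# adding half of the Timedelta between them to the start works.
midpoint = start + (end - start) / 2
print(midpoint)
```

The second row has no usable end date, so its midpoint comes out as NaT instead of raising an error.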

Filed under software engineering

Styling Excel with Pandas

I had a bunch of stylish tables to make once long ago, and I thought, “why don’t I do that automatically?” It would take longer the first time, but it would be faster in future iterations. Unfortunately, there never were any future iterations, but fortunately, it was more fun to research automatic generation of stylish tables than do what I needed to get done.

The seeds I planted have started to sprout a little bit, though, and the latest pandas now supports openpyxl2, which supports a lot of styling. So here is a start on the stylish table writing feature.

Filed under software engineering

Python Pandas Intros

I’m going to give a Python Pandas guest lecture in the Python Science class next week, and I thought I’d take a look at the Pandas intros that are out there. There are a lot now! Here are some that I flipped through:

http://pandas.pydata.org/pandas-docs/stable/10min.html
http://nbviewer.ipython.org/gist/fonnesbeck/5850375
http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/
http://www.gregreda.com/2013/10/26/working-with-pandas-dataframes/
http://www.gregreda.com/2013/10/26/using-pandas-on-the-movielens-dataset/
http://synesthesiam.com/posts/an-introduction-to-pandas.html

http://www.datarobot.com/blog/introduction-to-python-for-statistical-learning/
http://www.kevinsheppard.com/images/0/09/Python_introduction.pdf
http://blog.kaggle.com/2013/01/17/getting-started-with-pandas-predicting-sat-scores-for-new-york-city-schools/

It’s fun being a teacher in the age of information.

Filed under education

Tabular Data in Python: Getting just the columns I want from pandas.DataFrame.describe

The Python Pandas DataFrame object has become the mainstay of my data manipulation work over the last two years. One thing that I like about it is the `.describe()` method, which computes lots of interesting things about the columns of a table. I often want those results stratified, and `.groupby(col)` + `.describe()` is a powerful combination for doing that.

*But* today, and many days, I don’t want all of the things that `.describe()` describes. And the ones that I do want, I want as columns. Here is the recipe for that:

import pandas as pd

df = pd.DataFrame({'A': [0,0,0,0,1,1],
                   'B': [1,2,3,4,5,6],
                   'C': [8,9,10,11,12,13]})

df.groupby('A').describe().unstack()\
    .loc[:, (slice(None), ['count', 'mean'])]

and out comes just what I wanted:

       B            C
   count  mean  count  mean
A
0      4   2.5      4   9.5
1      2   5.5      2  12.5

It took me a while to figure this out, and these docs helped:
http://pandas.pydata.org/pandas-docs/stable/reshaping.html#reshaping-by-stacking-and-unstacking
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-xs

Here it is as an IPython notebook.

(Note: this requires Pandas version at least 0.14.)
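
As far as I know, newer pandas releases already put the `.describe()` statistics in the columns after a groupby, so the `.unstack()` step above may not behave the same way there. A sketch of an alternative that does not depend on that layout is to request just the wanted statistics with `.agg`; this is a swapped-in technique, not the recipe above.

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 0, 0, 0, 1, 1],
                   'B': [1, 2, 3, 4, 5, 6],
                   'C': [8, 9, 10, 11, 12, 13]})

# Compute only count and mean for every non-grouping column;
# the result has a two-level column index (column name, statistic).
summary = df.groupby('A').agg(['count', 'mean'])
print(summary)
```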

Filed under software engineering

Statistics in Python: Bootstrap resampling with numpy and, optionally, pandas

I’m almost ready to do all my writing in the IPython notebook. If only there were a drag-and-drop solution to move it into a WordPress blog. The next closest thing: an IPython Notebook on GitHub’s Gist, linked from here. This one is about bootstrap resampling with numpy and, optionally, pandas.
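
For readers who cannot reach the notebook, here is a minimal sketch of bootstrap resampling with numpy. It is not the notebook's code; the sample values, the seed, and the choice of a percentile interval for the mean are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(12345)  # fixed seed so the sketch is repeatable

# Hypothetical sample; any 1-D array of observations would do.
data = np.array([2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 4.9])

n_boot = 1000
# Draw n_boot resamples of the same size as the data, with replacement.
idx = rng.integers(0, len(data), size=(n_boot, len(data)))
boot_means = data[idx].mean(axis=1)

# Percentile interval for the mean.
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(data.mean(), ci_low, ci_high)
```

With pandas, `df.sample(frac=1, replace=True)` draws one such resample of a DataFrame's rows.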

Filed under statistics