Category Archives: software engineering

Close Enough for Scientific Work

This Software Carpentry project to find out how people are testing their scientific code looks great: http://software-carpentry.org/blog/2014/11/close-enough-for-scientific-work.html

I’ll have to keep my eye on the associated GitHub page https://github.com/swcarpentry/close-enough-for-scientific-work

Filed under software engineering

Software Carpentry on software testing

Greg Wilson recently sparked an interesting discussion about writing automatic tests for scientific code. Here is his blog post about it, which ends with a request for input on how you would unit test this physics simulation benchmark.

I’ve been thinking about testing recently myself, so this discussion was well timed. For me, the answer is that it is too late… you need to think about and maybe even write your tests _before_ you write your n-body simulation, or whatever. And it is too removed from context. The point of automatic tests is that you can run them again and again. But why would you run them again? It all depends on what you are going to change. If I’m reading this right, the reason Debian developers are interested in reference implementations of the n-body problem is to compare the speed of this algorithm when implemented in different programming languages. So the most important test is really a “regression test”: does the output generated match the output expected?
In fact, precisely this test is recommended:

ndiff -abserr 1.0e-8 program output N = 1000 with this output file to check your program is correct before contributing.
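
A rough Python analogue of that check might look like the sketch below. It is not a reimplementation of ndiff: the helper name `outputs_match` and the token-by-token comparison are my own assumptions, and only the absolute tolerance of 1.0e-8 comes from the recommendation quoted above.

```python
def _to_float(token):
    """Return the token as a float, or None if it is not numeric."""
    try:
        return float(token)
    except ValueError:
        return None


def outputs_match(actual, expected, abserr=1.0e-8):
    """Compare two program outputs token by token.

    Numeric tokens may differ by at most `abserr`; all other tokens
    must match exactly. (Hypothetical helper, loosely modeled on
    `ndiff -abserr`.)
    """
    actual_lines = actual.strip().splitlines()
    expected_lines = expected.strip().splitlines()
    if len(actual_lines) != len(expected_lines):
        return False
    for a_line, e_line in zip(actual_lines, expected_lines):
        a_tokens, e_tokens = a_line.split(), e_line.split()
        if len(a_tokens) != len(e_tokens):
            return False
        for a_tok, e_tok in zip(a_tokens, e_tokens):
            a_num, e_num = _to_float(a_tok), _to_float(e_tok)
            if a_num is not None and e_num is not None:
                if abs(a_num - e_num) > abserr:
                    return False
            elif a_tok != e_tok:
                return False
    return True


# Made-up outputs for illustration; a real test would read the
# program's output and the reference file from disk.
expected = "-0.5\n-0.25\n"
print(outputs_match("-0.500000001\n-0.25\n", expected))  # within tolerance
print(outputs_match("-0.51\n-0.25\n", expected))         # outside tolerance
```

Wrapping the comparison in a function like this makes it easy to call from a test runner every time the simulation code changes.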

Some of the things I want to test over and over and over again are: Is the input data formatted correctly? Does it look reasonable? Did I convert dates correctly? Did I make a change that breaks something which I will not see for hours (or days) when running on my full dataset?
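
Checks like these can be written as a small function that fails fast. This is only a sketch under stated assumptions: the column names `date` and `value`, the plausible date range, and the non-negativity rule are all hypothetical stand-ins for whatever a real dataset requires.

```python
import pandas as pd


def check_input(df):
    """Fail fast if the input table looks wrong.

    The column names ('date', 'value') and the plausible ranges are
    hypothetical; substitute whatever your own data requires.
    """
    # Is the input data formatted correctly?
    missing = {'date', 'value'} - set(df.columns)
    assert not missing, f'missing columns: {sorted(missing)}'

    # Did I convert dates correctly?
    assert pd.api.types.is_datetime64_any_dtype(df['date']), \
        'date column was not converted to datetimes'
    assert df['date'].notna().all(), 'some dates failed to parse'

    # Does it look reasonable?
    assert df['date'].between(pd.Timestamp('1990-01-01'),
                              pd.Timestamp('2100-01-01')).all(), \
        'dates outside the plausible range'
    assert (df['value'] >= 0).all(), 'negative values found'


# Run the checks on a small sample before the full, slow run.
sample = pd.DataFrame({'date': pd.to_datetime(['2014-11-01', '2014-11-02']),
                       'value': [1.5, 2.0]})
check_input(sample)
```

Running the same function on a few rows first is one way to catch a breaking change in seconds instead of hours.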

Filed under software engineering

Dates and Times in Python: average of two dates with Pandas

I spent a little longer than expected figuring out how to find the midpoint of two dates for a little table of data recently. Here is a code snippet in case I (or you) have to do this again:

# midpoint of two date columns
import pandas as pd

df = pd.DataFrame({'a': ['5/1/2012 0:00', '4/1/2014 0:00'],
                   'b': ['4/1/2014 0:00', 'unknown']})

# make time data into Timestamp format
def try_totime(t):
    # unparseable entries such as 'unknown' become NaT (not-a-time)
    try:
        return pd.Timestamp(t)
    except ValueError:
        return pd.NaT

df['start'] = df.a.map(try_totime)
df['end'] = df.b.map(try_totime)

# generate midpoint time
# harder than it would seem...
df['time'] = df.start + (df.end - df.start)/2

df
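
A vectorized variant of the same idea is sketched below. It uses `pd.to_datetime` with `errors='coerce'` in place of the helper function; the explicit format string is my assumption about how the dates are laid out (month first), so adjust it for other data.

```python
import pandas as pd

df = pd.DataFrame({'a': ['5/1/2012 0:00', '4/1/2014 0:00'],
                   'b': ['4/1/2014 0:00', 'unknown']})

# Assumed format: month/day/year hour:minute. Unparseable entries
# such as 'unknown' become NaT instead of raising.
fmt = '%m/%d/%Y %H:%M'
df['start'] = pd.to_datetime(df['a'], format=fmt, errors='coerce')
df['end'] = pd.to_datetime(df['b'], format=fmt, errors='coerce')

# Midpoint: start plus half of the elapsed time; NaT propagates.
df['time'] = df['start'] + (df['end'] - df['start']) / 2
```

Adding half of the difference to the start works because subtracting two datetime columns gives a Timedelta column, which can be divided by a number, whereas two Timestamps cannot be added together directly.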

Filed under software engineering

IDV in Python: Retrieve Data From Dynamic mpld3 plot in python

Mpld3 questions show up on Stack Overflow from time to time, too, and they can get really informative answers if they pull in the JavaScript experts. This one got a comprehensive answer that was perhaps too expert, and so this follow-up was an opportunity to show off my interactive plot call-out plugin yet again.

Filed under software engineering

Styling Excel with Pandas

I had a bunch of stylish tables to make once long ago, and I thought, “why don’t I do that automatically?” It would take longer the first time, but it would be faster in future iterations. Unfortunately, there never were any future iterations, but fortunately, it was more fun to research automatic generation of stylish tables than do what I needed to get done.

The seeds I planted have started to sprout a little bit, though, and the latest pandas now supports openpyxl2, which supports a lot of styling. So here is a start on the stylish table writing feature.

Filed under software engineering

IPython Notebook Clipboard Extension

I was so excited when I got image pasting to work in my IPython Notebook (although now I can’t find any mention of it on Healthy Algorithms…), but then things changed, I didn’t keep up, and it stopped working for me for a while. Then I _needed_ it, so I figured out how to make it work again:

* upgrade IPython to the latest development version from GitHub: https://github.com/ipython/ipython
* install the chrome_clipboard IPython Notebook extension: https://github.com/ipython-contrib/IPython-notebook-extensions/wiki/chrome_clipboard
* make it work each time, by adding a line to `~/.ipython/profile_[name]/static/custom/custom.js`:

$([IPython.events]).on('app_initialized.NotebookApp', function(){
    require(['nbextensions/chrome_clipboard'], function(module){
        module.load_ipython_extension();
    });
});

So nice to have it back.

Filed under software engineering

Anyone want to fix things in mpld3?

People are actually using mpld3. It would be great if there were more progress on the many issues that this use has uncovered. Interested?

Thanks for your interest in this project. I think that all of these points can be addressed, but it would be helpful to have a minimal example of Python code that reliably reproduces the issue in point (1). The GitHub issue tracker has discussions related to points (3) and (4), and something that might be related to point (1). For point (2), it would also be great to have a specific example in mind, so that we can be sure any solution reduces file size substantially without compromising graphical accuracy.

  1. https://github.com/jakevdp/mpld3/issues/226 https://github.com/jakevdp/mpld3/issues/250
  2. Would be good to add an issue: https://github.com/jakevdp/mpld3/issues/new
  3. https://github.com/jakevdp/mpld3/issues/247
  4. https://github.com/jakevdp/mpld3/issues/198

As for when all of these issues will be addressed, that is a pitfall of certain open-source projects that you might already be familiar with from your work with [related project]… I suspect that each fix will require a few hours of debugging at least, with (2) being easiest and (4) being hardest. I have a long list of issues to address, and although I’m happy to put these on it, I never seem to make progress on any of them.

Pull requests are certainly welcome, and if you or your collaborators want to make these improvements, the mpld3 project will be happy to incorporate them into the codebase.

–Abie

Filed under software engineering

Tabular Data in Python: Getting just the columns I want from pandas.DataFrame.describe

The Python Pandas DataFrame object has become the mainstay of my data manipulation work over the last two years. One thing that I like about it is the `.describe()` method, which computes lots of interesting things about the columns of a table. I often want those results stratified, and `.groupby(col)` + `.describe()` is a powerful combination for doing that.

*But* today, and many days, I don’t want all of the things that `.describe()` describes. And the ones that I do want, I want as columns. Here is the recipe for that:

import pandas as pd

df = pd.DataFrame({'A': [0,0,0,0,1,1],
                   'B': [1,2,3,4,5,6],
                   'C': [8,9,10,11,12,13]})

df.groupby('A').describe().unstack()\
    .loc[:,(slice(None),['count','mean']),]

and out comes just what I wanted:

       B            C
   count  mean  count  mean
A
0      4   2.5      4   9.5
1      2   5.5      2  12.5

It took me a while to figure this out, and these docs helped:
http://pandas.pydata.org/pandas-docs/stable/reshaping.html#reshaping-by-stacking-and-unstacking
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-xs

Here it is as an IPython notebook.

(Note: this requires Pandas version at least 0.14.)
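
The layout of `groupby().describe()` output has changed in later pandas releases, so the recipe above may need adjusting there. One alternative that avoids the reshaping step, offered as a sketch and not as the recipe from this post, is to request only the wanted statistics with `.agg()`; the variable name `summary` is my own.

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 0, 0, 0, 1, 1],
                   'B': [1, 2, 3, 4, 5, 6],
                   'C': [8, 9, 10, 11, 12, 13]})

# Ask only for the statistics of interest; the result has one row per
# group and (column, statistic) pairs as hierarchical columns.
summary = df.groupby('A').agg(['count', 'mean'])
print(summary)
```

A single value can then be looked up with a (column, statistic) pair, for example `summary.loc[0, ('B', 'mean')]`.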

Filed under software engineering

MCMC in Python: sampling in parallel with PyMC

Question and answer on Stack Overflow.

Filed under software engineering

IDV in Python: Interactive heatmap with Pandas and mpld3

I’ve been having a good time following the development of the mpld3 package, and I think it has a lot of potential for making interactive data visualization part of my regular workflow instead of that special something extra. A few weeks ago, an mpld3 user showed up with an interesting challenge, and solved their own problem quite well.

I finally got a chance to look at it today, and with a little spit-and-polish this could be something really useful for me.

Filed under dataviz, software engineering