Category Archives: software engineering

Schedule of a workshop I could use

“Effective use of programming in scientific research”: it’s soon, far away, and already full.

Maybe we should do our own in Seattle.

Comments Off on Schedule of a workshop I could use

Filed under software engineering

Interesting Book about Open-Source Software

I learned about an interesting book today, The Architecture of Open Source Applications, edited by Amy Brown and Greg Wilson. The introduction caught my attention:

Architects look at thousands of buildings during their training, and study critiques of those buildings written by masters. In contrast, most software developers only ever get to know a handful of large programs well—usually programs they wrote themselves—and never study the great programs of history. As a result, they repeat one another’s mistakes rather than building on one another’s successes.

This book’s goal is to change that. In it, the authors of twenty-five open source applications explain how their software is structured, and why. What are each program’s major components? How do they interact? And what did their builders learn during their development? In answering these questions, the contributors to this book provide unique insights into how they think.

If you are a junior developer, and want to learn how your more experienced colleagues think, this book is the place to start. If you are an intermediate or senior developer, and want to see how your peers have solved hard design problems, this book can help you too.

There are chapters on several software packages that I’ve enjoyed using, and chapters on several scientific/data analysis tools, but nothing on the tools I’m using day to day. Still interesting. Let me know if there is a great chapter you come across, since I’m going to be too busy to read the whole thing in the near future.

Comments Off on Interesting Book about Open-Source Software

Filed under software engineering

Data Wrangling in R, Stata and Python

Nearly a year ago, students I had urged to try Python accosted me with complaints that the data manipulation capabilities they found so convenient in R and Stata were nowhere to be found. At the time, I did some digging and told them to try la.larry (or pandas, as readers of that post suggested). With some more experience, these recommendations have come up again, and in hindsight la.larry seems like too heavy a hammer for our most common tasks.

I’m hoping to put together a translation guide for R, Stata, and Python (there is already an extensive one… ours will be much more specialized, covering just a few data wrangling commands), but until then, here are Kyle’s top two:

The easiest way to build record arrays (aside from csv2rec) IMO:

import numpy as np

# one plain Python list per column
a = ['USA','USA','CAN']
b = [1,6,4]
c = [1990.1,2005.,1995.]
d = ['x','y','z']

# fromarrays zips the columns into a single record array
some_recarray = np.core.records.fromarrays([a,b,c,d], names=['country','age','year','data'])

The fromarrays method is especially nice because it automatically infers the datatypes.
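
For instance, a quick check of what it inferred (the exact byte widths will depend on your inputs and platform):

print some_recarray.dtype
# e.g. [('country', '|S3'), ('age', '<i8'), ('year', '<f8'), ('data', '|S1')]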

To subset a particular country-year-age:

some_recarray[(some_recarray.country=='USA')
                & (some_recarray.age==6)
                & (some_recarray.year==2005)]

I’ve also found that caching each of the indices independently vastly speeds things up if you’re going to be iterating.
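
Here is a minimal sketch of what I mean, with names of my own invention:

# compute the boolean masks once, outside the loop
is_usa = some_recarray.country == 'USA'
is_age_6 = some_recarray.age == 6

for year in set(some_recarray.year):
    subset = some_recarray[is_usa & is_age_6 & (some_recarray.year == year)]
    print year, len(subset)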

Love the recarray, hate the documentation.

6 Comments

Filed under software engineering

Nice post on Cython

Python/Sage guru William Stein has a nice article about Cython and how it can speed up routine calculations at the expense of mathematical elegance (and why that is a good thing). Could Sage be my interactive shell for PyMC modeling?
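
To give the flavor (a minimal sketch of the Cython idiom, not an excerpt from Stein's article): you take a plain Python loop, add C type declarations, and Cython compiles it down to C.

# fast_sum.pyx -- the typed version of: def sum_sq(n): return sum(i*i for i in range(n))
cpdef double sum_sq(int n):
    cdef double total = 0.
    cdef int i
    for i in range(n):
        total += i * i
    return total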

Comments Off on Nice post on Cython

Filed under software engineering

Coding tips for grad students, by grad students

Kyle writes from Sri Lanka with his stats programming tips for the new PBFs. It’s all advice that old PBFs, and even old young professors, can benefit from:

• It’s taken me two years to jump on the version control bandwagon (about 18 months after your PToW on git…), so I certainly can’t claim to be an exemplar myself. But I think the main themes would be:

• Location, location, location! Can you find your code? Can others find your code? Do both the directory and the filename make sense?

• Replicability, even of the mistakes. If you do something right, you want to be able to do it again.

    ◦ But often, even if it’s wrong, you want to do it again. Chris will say, let’s go back to the broken version from two months ago, I liked that better. So if you change your code, keep a record of the old parts (and maybe even why you ditched them); see the git sketch after this list.

• If others can’t look at your code and figure out quickly what each “chunk” of code does, it’s not well documented enough. If you can’t even tell within 30 seconds what a particular piece does (and you wrote it!), that’s a problem.

• On the other hand: Yes, a few PBFs were lit majors, but that doesn’t mean your code should be in novella format. Concise, readable code is often more understandable than a few sentences of explanation.

• Whitespace! Headers! Tab and Enter are your close and personal friends: “Without negative space how would we appreciate the positive in our art and in our lives?” – Dyan Law (some artist I’ve never heard of)

• Good exercises: Take someone else’s raw code, figure out what it does, and comment it. Read through a program you wrote and haven’t used in months – how long does it take you to figure out? Have someone else comment your own raw code; did they explicate things you left implied? did they misinterpret anything?
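
On the “go back to the broken version” point, the git version of that maneuver is short (the commit hash and filename here are hypothetical):

# see what was committed around two months ago
git log --since="10 weeks ago" --until="8 weeks ago" --oneline

# restore one file from an old commit, without rewriting any history
git checkout abc1234 -- analysis.py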

All good advice, and I often regret it when I don’t follow it. Anything else that should be on this list?

Comments Off on Coding tips for grad students, by grad students

Filed under software engineering

la.larry for classy number crunching in Python

I’ve got some of my colleagues interested in using Python to crunch their numbers, and that’s great, because I’m sure it will make their lives much better in the long run. It’s also great because they are pushing me to figure out the easy way to do things, since they are used to certain conveniences from their past work with R and Stata.

For example, Kyle wanted to know: does Python make it easy to deal with the tables of data he is faced with every day?

It was by showing my ignorance in a blog post a year ago that I learned about handling csv files with the csv.DictReader class. This has been convenient enough for me, but it doesn’t let Kyle do things the way he is used to, for example using data[['iso3', 'gdp'], :] to select two important columns from a large table.

Enter larry, the labeled array. Maybe someone will tell me an even cleaner way to deal with these tables, but after a quick easy_install la, I was able to load up some data from the National Cancer Institute (NCI) as follows:

import csv
import la
from pylab import *

csv_file = csv.reader(open('accstate5.txt'))
columns = csv_file.next()[1:]  # header row; the first entry labels the rows
rows = []
data = []

for d in csv_file:
    rows.append(d[0])    # state abbreviation
    data.append(d[1:])   # the values for that state

T = la.larry(data, [rows, columns])

That’s pretty good, but the data is not properly coded as floating point…

T = la.larry(data, [rows, columns]).astype(float)

Now T has the whole table, indexed by state and by strange NCI code:

T.lix[['WA', 'OR', 'ID', 'MO'], ['RBM9094']]

Perfect, besides being slightly ugly. But is there an easier way? The la documentation mentions a fromcsv method. Unfortunately, la.larry.fromcsv doesn’t like these short-format tables:

la.larry.fromcsv('accstate5.txt')  # produces a strange error

Now, if you read the NCI data download info page, where I retrieved this data from, you will learn what each cryptic column heading means:

Column heading format:   [V] [RG] [T] [A], where

V = variable: R, C, LB, or UB
RG = race / gender: BM, BF, WM, WF (B = black, W = white, F = female, M = male)
T = calendar time: 5094, 5069, 7094, and the 9 5-year periods 5054 through 9094
A = age group: blank (all ages), 019, 2049, 5074, 75+

Example:
RBM5054 = rate for black males for the time period 1950-1954
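
In code, that format is easy to pick apart with a regular expression (a quick sketch of mine, not something from the NCI docs):

import re

# [V][RG][T][A]: variable, race/gender, time period, optional age group
heading_re = re.compile(r'^(R|C|LB|UB)(BM|BF|WM|WF)(\d{4})(019|2049|5074|75\+)?$')
print heading_re.match('RBM5054').groups()  # ('R', 'BM', '5054', None)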

That means I’m not really dealing with a 2-dimensional table after all, and maybe I’d be better off organizing things the way they really are:

data_dict_list = [d for d in csv.DictReader(open('accstate5.txt'))]
data_dict = {}
races = ['B', 'W']
sexes = ['M', 'F']
times = range(50,95,5)  # the nine 5-year periods, 1950-54 through 1990-94
for d in data_dict_list:
    for race in races:
        for sex in sexes:
            for time in times:
                key = 'R%s%s%d%d' % (race, sex, time, time+4)
                val = float(d.get(key, nan))  # larry uses nan to denote missing data
                data_dict[(d['STATE'], race, sex, '19%s'%time)] = val

R = la.larry.fromdict(data_dict)
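
A quick sanity check on the result (the size of the first axis depends on how many areas are in the file):

print R.shape  # something like (52, 2, 2, 9): area x race x sex x period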

In the original 2-d labeled array we can compare cancer rates thusly:

states = ['OR', 'WA', 'ID', 'MT']
times = arange(50,94,5)
selected_cols = ['RWM%d%d' % (t, t+4) for t in times]
T_selected = T.lix[states, selected_cols]
plot(1900+times, T_selected.getx().T, 's-',
     linewidth=3, markersize=15, alpha=.75)
legend(T_selected.getlabel(0), loc='lower right', numpoints=3, markerscale=.5)
ylabel('Mortality rate (per 100,000 person-years)')
xlabel('Time (years)')
title('Mortality rate for all cancers over time')
axis([1945, 1995, 0, 230])

It is slightly cooler to do the same plot with the multi-dimensional array:

clf()
subplot(2,1,1)

times = [float(t) for t in R.getlabel(3)]  # the year labels are strings, so convert
states = ['OR', 'WA', 'ID', 'MT']
colors = ['c', 'b', 'r', 'g']
for state, color in zip(states, colors):
    plot(times, R.lix[[state], ['W'], ['M'], :].getx(), 's-', color=color,
         linewidth=3, markersize=15, alpha=.75)
legend(states, loc='lower right', numpoints=3, markerscale=.5)
ylabel('Mortality rate (per 100,000 person-years)')
xlabel('Time (years)')
title('Mortality rate for all cancers in white males over time')
axis([1945, 1995, 0, 230])

That should be enough for me to remember how to use larry when I come back to it in the future. For my final trick, let’s explore how to use larrys to add a column to a 2-dimensional table. (There is some important extension of this to multidimensional tables, but it is confusing me a lot right now.)

states = T.getlabel(0)
X = la.larry(randn(len(states), 1), [states, ['new covariate']])
T = T.merge(X)

states = ['OR', 'WA', 'ID', 'MT']
T.lix[states, ['RWM9094', 'new covariate']]

Ok, encore… how well does larry play with PyMC? Let’s use a Gaussian Process model to find out.

from pymc import *

# constant prior mean, Matern covariance over the (state, race, sex, time) indices
M = gp.Mean(lambda x, mu=R.mean(): mu * ones(len(x)))
C = gp.Covariance(gp.matern.euclidean, diff_degree=2., amp=1., scale=1.)

# the full mesh of cells; the time index is scaled by .5, which shrinks distances
# along the time axis and so smooths more across neighboring periods
mesh = array([[i,j,k,.5*l] for i in range(52) for j in range(2) for k in range(2) for l in arange(9)])

# the observed cells are the ones that are not missing
obs_mesh = array([[i,j,k,.5*l] for i in range(52) for j in range(2) for k in range(2) for l in range(9) if not isnan(R[i,j,k,l])])
obs_vals = array([R[i,j,k,l] for i in range(52) for j in range(2) for k in range(2) for l in range(9) if not isnan(R[i,j,k,l])])
obs_V = array([1. for i in range(52) for j in range(2) for k in range(2) for l in range(9) if not isnan(R[i,j,k,l])])

gp.observe(M, C, obs_mesh, obs_vals, obs_V)
imputed_vals = M(mesh)            # posterior mean at every cell
imputed_std = sqrt(abs(C(mesh)))  # posterior standard deviation at every cell

Ri = la.larry(imputed_vals.reshape(R.shape), R.label)

subplot(2,1,2)
for state, color in zip(states, colors):
    t = [float(l) for l in Ri.getlabel(3)]
    plot(t, R.lix[[state], ['B'], ['M'], :].getx(), '^', color=color, linewidth=3, markersize=15, alpha=.75)
    plot(t, Ri.lix[[state], ['B'], ['M'], :].getx(), '-', color=color, linewidth=2, alpha=.75)

ylabel('Mortality rate (per 100,000 person-years)')
xlabel('Time (years)')
title('Mortality rate for all cancers in black males over time')
axis([1945, 1995, 0, 630])

5 Comments

Filed under software engineering

The Two Rules of Program Optimization

Wow, where does the day go? I spent all my non-meeting time debugging something. At least I fixed it before 5 PM.

The details of the problem are boring, but the whole ordeal could have been avoided if I had just followed the two rules of optimizing software in my Generic Disease Modeling System. What are they?

      First Rule of Program Optimization: Don’t do it
      Second Rule of Program Optimization (for experts only!): Don’t do it yet

Maybe next week I’ll get a second to write about the good kind of optimization: my statistical physics friends and I have posted an article on the arXiv about an application of bounded-depth minimum spanning trees, Clustering with Shallow Trees.

Comments Off on The Two Rules of Program Optimization

Filed under software engineering

August is Too-Many-Projects Month

(Tap… tap… tap… is this thing on? Good.)

July was vacation month, where I went on a glorious bike tour of the Oregon/California coast, and learned definitively that I don’t like biking on the side of a highway all day. Don’t worry, I escaped in Coos Bay and took trains and buses between Eugene, Santa Cruz, Berkeley, and SF for a vacation more my speed.

But now that I’m back, August is turning out to be project month. I have 3 great TCS applications to global health in the pipeline, and I have big plans to tell you about them soon. But one mixed blessing about these applications is that people actually want to see the results, like, yesterday! So first I have to deal with the results, and then I can write papers and blogs about the techniques.

Since Project Month is a little over-booked with projects, I’m going to have to triage one today. You’ve heard of the Netflix Prize, right? Well, github.com is running a smaller-scale recommendation contest, and I was messing around with personalized PageRank, which seems like a fine approach for recommending code repositories to hackers. I haven’t got it working very well (best result: 15% of the holdout set recovered), but I was having fun with it. Maybe someone else will take it up; let me know if you get it to work. networkx + data = good times.

    import networkx as nx

    # user() and repo() tag the ids so user and repo nodes stay distinct (my guess at these helpers)
    user, repo = lambda u: 'u' + u, lambda r: 'r' + r

    G = nx.Graph()
    for l in open('download/data.txt'):
        u_id, r_id = l.strip().split(':')
        G.add_edge(user(u_id), repo(r_id))

[get the code]
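
For the curious, the personalized PageRank step looks roughly like this in networkx (a sketch under my naming assumptions above, not the actual contest code):

    def recommend(G, u, k=10):
        # bias the random walk toward user u
        p = dict.fromkeys(G, 0.)
        p[u] = 1.
        pr = nx.pagerank(G, alpha=.85, personalization=p)
        # rank the repos that u hasn't already touched
        repos = [n for n in G if n.startswith('r') and not G.has_edge(u, n)]
        return sorted(repos, key=pr.get, reverse=True)[:k]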

2 Comments

Filed under combinatorial optimization, software engineering, TCS

Anatomy of a Django-driven Data Server

I haven’t had time to write anything this week because I am up to my neck in this Seven-Samurai-style software engineering project. You know, where a bunch of untrained villagers (that’s me) need to defend themselves against marauding bandits (that’s the Global Burden of Disease 2005 Study), so they have to learn everything about being a samurai (that’s writing an actual application that people other than this one villager can use) as quickly as possible.

I guess this analogy is stretched so thin that you could chop it with Toshirō Mifune’s wooden sword. But if anyone knows how a mild-mannered theoretical computer scientist can get a web app built in two weeks, holler. If you prefer to explain in terms of wild-west gunslingers, that is fine.

Here’s my game plan so far: I’m going to make the lightest of light-weight Python/Django apps to hold all the Global Disease Data, and then try to get my epidemiologist doctors to interact with it on the command line via an interactive Python session.
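
To make that concrete, here is a minimal sketch of the kind of Django model I have in mind (the class and field names are hypothetical, not from the actual app):

# models.py (sketch; class and field names are hypothetical)
from django.db import models

class Rate(models.Model):
    region = models.CharField(max_length=3)    # e.g. an iso3 country code
    year = models.IntegerField()
    sex = models.CharField(max_length=1)
    age_group = models.CharField(max_length=10)
    value = models.FloatField()

Then an interactive session can do things like Rate.objects.filter(region='USA', year=2005).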

The rest of this post is basically a repeat of the Django tutorial, but specialized for building a data server for global population data. As for interesting theoretical math, hidden somewhere towards the end I’ll do some interpolation with PyMC’s Gaussian Processes, using the (exotic, to me) Matérn covariance function. Continue reading

2 Comments

Filed under global health, software engineering