I haven’t had time to write anything this week because I am up to my neck in this Seven-Samurai-style software engineering project. You know, where a bunch of untrained villagers (that’s me) need to defend themselves against marauding bandits (that’s the Global Burden of Disease 2005 Study), so they have to learn everything about being a samurai (that’s writing an actual application that people other than this one villager can use) as quickly as possible.
I guess this analogy is stretching so thin that you could chop it with Toshirō Mifune’s wooden sword. But, if anyone knows how a mild-mannered theoretical computer scientist can get a web-app built in two weeks, holler. If you prefer to explain in terms of wild-west gunslingers, that is fine.
Here’s my game plan so far: I’m going to make the lightest of lightweight Python/Django apps to hold all the Global Disease Data, and then try to get my epidemiologist doctors to interact with it on the command line via an interactive Python session.
The rest of this post is basically a repeat of the Django tutorial, but specialized for building a data server for global population data. As for interesting theoretical math stuff: hidden somewhere towards the end, I’ll do some interpolation with PyMC’s Gaussian Processes using the exotic (to me) Matérn covariance function.
Here is everything important for making Django smart in the ways of global population data:
gbd/population_data_server
Django calls the objects that correspond to things it stores in a database “models”, which is a little confusing, because I also have “statistical models”, which are entirely different. But in models.py, you’ll find not too much. Only a little more than this:
    class Population(models.Model):
        region = models.CharField(max_length=200)
        year = models.IntegerField()
        sex = SexField()
        params_json = models.TextField(default=json.dumps({}))
This defines the contents of the population data table; each unit of population data knows a region, a year, a sex, and some less structured params, stored as a JSON string.
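To make the “less structured params” idea concrete, here is a minimal sketch in plain Python, with no Django involved. The class PopulationRecord and the keys 'age' and 'population' are hypothetical stand-ins, not names taken from the actual models.py; the only point is the round trip through a JSON string.

```python
import json


class PopulationRecord:
    """Hypothetical stand-in for the Django model: params live in a JSON string."""

    def __init__(self, region, year, sex, params_json=None):
        self.region = region
        self.year = year
        self.sex = sex
        # Same default as the model field: an empty JSON object.
        self.params_json = params_json if params_json is not None else json.dumps({})

    def get_params(self):
        """Decode the stored string back into a dict."""
        return json.loads(self.params_json)

    def set_params(self, params):
        """Encode a dict into the stored string."""
        self.params_json = json.dumps(params)


rec = PopulationRecord('Australia', 2005, 'male')
empty = rec.get_params()

# Made-up numbers, for illustration only.
rec.set_params({'age': [0, 10, 20], 'population': [1.0, 0.9, 0.8]})
params = rec.get_params()
```

The trade-off is the usual one: the database cannot query inside the string, but the table layout does not have to change every time the params do.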
To see how to interact with these data objects, take a peek at load_population_csv.py, which is about 100 lines long, but 20% comments. It is a little bit of a mess, but basically an elaboration of the following:
    for x in csv_file:
        opts = {}
        opts['region'] = smart_unicode(x[0].strip(), errors='ignore')
        opts['year'] = int(x[3])
        opts['sex'] = x[4].strip().lower()
        pop, is_new = Population.objects.get_or_create(**opts)
        pop_counter += is_new
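The loop leans on two things: get_or_create hands back an (object, created) pair, and adding a boolean to an integer counts True as 1. Here is a sketch of that logic with a plain dict standing in for the database table. The get_or_create function below is a hypothetical imitation, not Django's, and the rows are made up.

```python
records = {}  # hypothetical stand-in for the database table


def get_or_create(**opts):
    """Mimic the (object, created) return shape of Django's get_or_create."""
    key = (opts['region'], opts['year'], opts['sex'])
    if key in records:
        return records[key], False
    records[key] = dict(opts)
    return records[key], True


# Made-up rows in the column layout the loop above assumes:
# region in column 0, year in column 3, sex in column 4.
rows = [
    [' Australia ', '', '', '2005', ' Male '],
    ['Australia', '', '', '2005', 'male'],    # duplicate once cleaned up
    ['Australia', '', '', '2005', 'Female'],
]

pop_counter = 0
for x in rows:
    opts = {
        'region': x[0].strip(),
        'year': int(x[3]),
        'sex': x[4].strip().lower(),
    }
    pop, is_new = get_or_create(**opts)
    pop_counter += is_new  # True counts as 1, False as 0
```

Because the second row cleans up to the same region, year, and sex as the first, it is found rather than created, so rerunning a load does not pile up duplicates.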
views.py is the file where Django looks for the things everyone else calls “controllers” in the model-view-controller framework, which is at least as confusing as the many meanings of “model” in this business. But the contents of that file are pretty simple. They run about 150 lines, but most of them are for setting axes of matplotlib graphs and what-have-you. Stuff that takes lots of typing and lots of time, but not lots of deep thought. Besides setting the axis ticks, there really is nothing more than an elaboration of the following:
    def population_show(request, id, format='png'):
        pop = get_object_or_404(Population, pk=id)

        M,C = pop.gaussian_process()
        x = np.arange(0.,100.,1.)
        p = np.maximum(0., M(x))

        if format == 'json':
            response = {'age': list(x), 'population': list(p)}
            response = json.dumps(response)

        return HttpResponse(response, view_utils.MIMETYPE[format])
Oh, that’s the one mathy bit, in there. Did you catch it? On line 4, above, it says M,C = pop.gaussian_process(); I’d like to write about it in more detail someday soon. For now, it is all just 40 lines at the end of models.py. And at least half of those are comments.
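Until that longer write-up exists, here is a rough sketch of what Gaussian process interpolation with a Matérn covariance boils down to. It is plain Python, not PyMC's API and not the code in models.py. It assumes a zero prior mean, noise-free observations, and the Matérn smoothness 3/2 (picked only because it has a simple closed form); the amplitude, scale, and data values are all made up.

```python
import math


def matern32(r, amp=1.0, scale=1.0):
    """Matern covariance with smoothness 3/2, in closed form."""
    z = math.sqrt(3.0) * abs(r) / scale
    return amp ** 2 * (1.0 + z) * math.exp(-z)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def gp_mean(ages, values, amp=1.0, scale=20.0):
    """Posterior mean of a zero-prior-mean GP conditioned on (ages, values)."""
    K = [[matern32(a - b, amp, scale) for b in ages] for a in ages]
    weights = solve(K, values)

    def M(x):
        return sum(w * matern32(x - a, amp, scale) for w, a in zip(weights, ages))

    return M


# Made-up data: relative population at a handful of ages.
ages = [0.0, 20.0, 40.0, 60.0, 80.0]
values = [1.0, 0.9, 0.7, 0.4, 0.1]

M = gp_mean(ages, values)
# Same shape as the view above: evaluate at ages 0..99 and clip at zero.
curve = [max(0.0, M(float(a))) for a in range(100)]
```

With no observation noise the mean curve passes through every data point, and in between it is a weighted sum of covariance bumps centered on the observed ages.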
Finally, tests.py and fixtures.json are the very important business that seems like an afterthought, because they are last on the list. This is where the test-driven development happens. tests.py says simple things like:
    def test_gp(self):
        """ Test Gaussian Process interpolation"""
        M, C = self.pop.gaussian_process()
        self.assertEqual(M(0), 1.)
and
    def test_population_show(self):
        """ Test plotting population curve"""
        c = Client()
        url = self.pop.get_absolute_url()
        response = c.get(url)
        self.assertPng(response)
This way I’ll know if I change something somewhere to fix one bug and break something else. fixtures.json is a very annoying-to-get-right file that fills the test database with something resembling the data that the actual app will be dealing with. It is not the true population of Australia, however.
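For a sense of what goes in such a file, here is a sketch that builds one fixture record in Python and dumps it as JSON. Django's JSON fixtures are a list of objects with "model", "pk", and "fields" keys; the app label population_data_server and every value below are assumptions for illustration, not the contents of the real fixtures.json.

```python
import json

# One made-up record in Django's JSON fixture layout.
# The app label and all field values are assumptions.
fixture = [
    {
        "model": "population_data_server.population",
        "pk": 1,
        "fields": {
            "region": "Australia",
            "year": 2005,
            "sex": "male",
            # The params are themselves a JSON string, so they end up
            # as a string nested inside the fixture's JSON.
            "params_json": json.dumps({}),
        },
    }
]

fixture_text = json.dumps(fixture, indent=2)
loaded = json.loads(fixture_text)
```

Generating the file this way sidesteps some of the annoyance: the quoting of the nested JSON string is handled by json.dumps instead of by hand.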
Now it’s back to (metaphorically) learning how to defend my village from bandits. I hope to be back to (metaphorically) farming in two weeks.
Not being a samurai, I prefer to fight small enemies rather than big ones: I would split the functions longer than 10 lines into small ones, to make them easier to understand and maintain (esp. your load csv module).
For the csv, I like csv.DictReader, which accesses the values by field name instead of by index. I find it more readable when I come back from defending the village to farming and need to take care of the code.
A farmer’s 2 cents
Thanks! csv.DictReader sounds like something I should be using.
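A minimal sketch of how that might look, with made-up header names and data (the real file's layout surely differs):

```python
import csv
import io

# Made-up CSV text; the header names are assumptions.
csv_text = """region,year,sex
 Australia ,2005, Male
Australia,2005,Female
"""

rows = []
for row in csv.DictReader(io.StringIO(csv_text)):
    # Fields are looked up by column name instead of by position.
    rows.append({
        'region': row['region'].strip(),
        'year': int(row['year']),
        'sex': row['sex'].strip().lower(),
    })
```

If the columns in the file get reordered, this version keeps working, which is not true of x[0], x[3], and x[4].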