Monthly Archives: November 2012

Machine Learning in Python: an exercise for the reader in Milk

I have a few extensions and variations on the previous example that you can try, to test your understanding:

  1. (From same questioning reader as before, now revealed to be Health Metrics PhD student David Phillips)

    How does it work with more than two dimensions? I presume model.tree just gets really crazy with more if-else’s?

    The great thing about computer research is that you can experiment with this at no additional cost. Give it a try in 3-d (and let me know if you make a nice visualization of it!), and give it a try in 10-d or something even more imposing.

  2. Note that the file where the tree_learner class is housed also contains a class called stump_learner. Give that a try, and figure out what it is doing. It is pretty simple, but we can make it into something complicated soon.
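If you want a head start on that second exercise, here is my own minimal sketch of what a decision stump does (this is my simplification, not Milk's stump_learner): it searches for the single (feature, threshold) split that misclassifies the fewest training points.

```python
import numpy as np

class StumpLearner(object):
    """A decision stump: the single split (feature j, threshold t)
    that minimizes training errors. A sketch of the idea only."""
    def train(self, features, labels):
        features = np.asarray(features)
        labels = np.asarray(labels)
        best_err, best = len(labels) + 1, None
        for j in range(features.shape[1]):
            for t in np.unique(features[:, j]):
                left = features[:, j] < t
                # try both ways of labeling the two sides
                for lo, hi in [(0, 1), (1, 0)]:
                    pred = np.where(left, lo, hi)
                    err = int((pred != labels).sum())
                    if err < best_err:
                        best_err, best = err, (j, t, lo, hi)
        self.j, self.t, self.lo, self.hi = best
        return self

    def apply(self, x):
        # predict the label of a single example
        return self.lo if x[self.j] < self.t else self.hi

# toy data: the label flips when the single feature reaches 2
stump = StumpLearner().train([[0], [1], [2], [3]], [0, 0, 1, 1])
```

A stump like this looks trivially weak on its own, which is exactly why it becomes interesting when many of them are combined.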

Comments Off on Machine Learning in Python: an exercise for the reader in Milk

Filed under machine learning

Machine Learning in Python: decision trees in Milk

I started my exposition of the random forest in the last post with the first simple idea necessary to make it work, the decision tree. The video clip I linked there has a nice description at a high level, but leaves a lot of details as an exercise for the reader.

One such reader, possibly to be mentioned by name, writes:

Question (I hope you knew you were getting yourself into my rabbit hole of questions): Does order matter? The way everyone’s description of a decision tree goes, they draw the thing sequentially (we start from the first branch/leaf, then narrow it down with the second branch/leaf etc). But it seems to me that as long as every branch is considered, it doesn’t matter which question came first or last; it’s basically just a huge string of conditional statements. Yes?

You can pretty much convert a decision tree to a huge string of conditional statements, an “or of ands” if the labels happen to be true and false! This brings it dangerously close to the disjunctive normal form for boolean formulas that used to preoccupy so much of my time as a graduate student. But the order does matter, and matters quite a lot, because (the hope is) representing the decision tree as a tree is going to be much more efficient than representing it as an “or of ands”.
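To make the “or of ands” concrete: take a tiny (made-up) tree on two features, and flatten each root-to-leaf path that ends in true into one conjunction. The two representations agree everywhere, but the tree answers a query by walking one path, while the flat formula may have to evaluate every conjunction.

```python
def tree_predict(x):
    # a tiny made-up decision tree on a 2-d point x = (x0, x1)
    if x[0] < 0.5:
        if x[1] < 0.3:
            return False
        else:
            return True
    else:
        return x[1] < 0.7

def dnf_predict(x):
    # the same classifier flattened into an "or of ands":
    # one conjunction per root-to-leaf path that ends in True
    return ((x[0] < 0.5 and x[1] >= 0.3) or
            (x[0] >= 0.5 and x[1] < 0.7))

# the two representations agree on a grid of test points
for x0 in [0.0, 0.4, 0.5, 0.6, 1.0]:
    for x1 in [0.0, 0.3, 0.5, 0.7, 1.0]:
        assert tree_predict((x0, x1)) == dnf_predict((x0, x1))
```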

This order-matters part becomes clear when you start really trying to build a decision tree, which I will now do twice. First, I will do it using Milk, a Machine Learning Toolkit for Python. Then I will do it again, looking at the Milk source code.

Version one is in this notebook, which you can clone for yourself and play around with. It includes the huge list of conditional statements that correspond to a simple example.

In version one, all of the machine learning action is in the code block that says

learner = milk.supervised.tree_learner()
model = learner.train(features, labels)

What is this actually doing? You can dig into it by browsing the Milk source (thanks open source movement!). The code for the tree_learner is pretty minimal, see it here. It pushes off all the work into a function called build_tree, which uses recursion, but is not overwhelming either. The precise choice of how to split at each node of the tree is handled by a function called _split, which is also relatively readable.
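Milk's actual build_tree and _split come with their own criteria and options; the sketch below is my own stripped-down version of the same recursive shape (Gini impurity as the splitting criterion, which is one common choice, not necessarily Milk's), just to show where the recursion and the splitting live.

```python
import numpy as np

def gini(labels):
    # impurity of an integer label array
    p = np.bincount(labels) / len(labels)
    return 1.0 - np.sum(p ** 2)

def best_split(features, labels):
    # same role as a _split function: scan every (feature, threshold)
    # pair and keep the one that most reduces weighted impurity
    best, best_score = None, gini(labels)
    for j in range(features.shape[1]):
        for t in np.unique(features[:, j]):
            left = features[:, j] < t
            if left.all() or not left.any():
                continue  # this split puts everything on one side
            score = (left.mean() * gini(labels[left]) +
                     (~left).mean() * gini(labels[~left]))
            if score < best_score:
                best_score, best = score, (j, t)
    return best

def build_tree(features, labels):
    # recurse until no split improves impurity,
    # then return a majority-label leaf
    split = best_split(features, labels)
    if split is None:
        return int(np.bincount(labels).argmax())
    j, t = split
    left = features[:, j] < t
    return (j, t,
            build_tree(features[left], labels[left]),
            build_tree(features[~left], labels[~left]))

def predict(tree, x):
    # walk the if-else decisions from the root down to a leaf
    while isinstance(tree, tuple):
        j, t, low, high = tree
        tree = low if x[j] < t else high
    return tree

# tiny demo: the label is determined by the first feature
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0, 0, 1, 1])
tree = build_tree(X, y)
```

Note how the order of splits is baked into the nested tuples: a query only ever touches one root-to-leaf path.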

Should that paragraph count as version two? What I really want is a narrated movie of stepping through this code in a debugger.

1 Comment

Filed under machine learning

Random Forests as in the Verbal Autopsy

Here is an interesting method that spans my old life in theory world and my new one in global health metrics: the Random Forest.

This is a technique that grows (get it?) out of research on decision trees, and it is a great example of how combining a few simple ideas can get complicated very quickly.

The task is the following: learn from labeled examples. (Is this yet another baby-related research topic? Not as directly as the last few.) To be specific about the task at hand in global health, I start with a training data set, which may be the results of verbal autopsy interviews, all digitized and encoded as numeric data, together with the true underlying cause of death (as identified by gold-standard clinical diagnostic criteria) as the labels.

To “learn” in this case means to build a predictor that can take new, unlabeled examples and assign a cause of death to them.
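To fix the shape of this task in code: a features matrix (one row per interview), a labels vector (one cause per death), a train step, and a predict step. The sketch below uses scikit-learn's random forest on synthetic data purely as a stand-in; the actual verbal autopsy work uses real interview data and its own methods.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)

# synthetic stand-in for digitized interview responses:
# 200 "interviews" with 5 encoded features, and 2 "causes of death"
X = rng.rand(200, 5)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# train on the first 150 deaths, where gold-standard labels exist
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X[:150], y[:150])

# predict causes for new, unlabeled "interviews"
predicted_causes = clf.predict(X[150:])
accuracy = clf.score(X[150:], y[150:])
```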

The first simple idea needed for the random forest is the decision tree, and I found a nice youtube video that explains it, so I don’t need to write it up myself:

Well, this video is not perfect; if you have not seen this before, you may be left with a few questions.

1 Comment

Filed under machine learning

System Dynamics Modeling in Python: Diaper Distribution System with Pandas

The figure from the last post was quite slickly generated with a little Python pattern that I want to share. It is all based on my new Python knowledge of the “with” statement, and my new mainstay of pythonic data manipulation: pandas.

For easy reference, the compartmental model from that last post looks like this:

The “with” statement was a little hard for me to read up on, but I figured out a simple way to use it: the context manager:

from contextlib import contextmanager

@contextmanager
def columns_in(df):
    col_names = df.columns
    col_list = [df[col] for col in col_names]
    yield tuple(col_list)   # hand the columns to the with-block
    # on exit, write the (possibly modified) columns back
    for i, col in enumerate(col_names):
        df[col] = col_list[i]

With “with”, the python code for the simulation is almost exactly the mathematical description of the simulation:

with columns_in(state) as (C,D,q,r,s):
    for t in range(T-1):
        C[t+1] = C[t] + q[t] - r[t]
        if C[t+1] < 0:
            C[t+1] = 0

        D[t+1] = D[t] + r[t] - s[t]
        q[t+1] = s[t+1-7] if t % 7 == 6 else 0
        # special case, first pickup, need to deliver something
        if t == 6:
            q[t+1] = 70.
        r[t+1] = 8.
        s[t+1] = D[t] if t % 7 == 6 else 0
        # special case, forget to take the diapers out
        if t == 34:
            s[t+1] = 0

Pretty cool, huh? Here is a notebook full of it (code).
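For anyone who wants to run it outside the notebook, here is a self-contained version; the initial conditions (T, the zero-filled state, the column order) are my reconstruction, not necessarily what the notebook uses.

```python
from contextlib import contextmanager
import pandas as pd

@contextmanager
def columns_in(df):
    col_names = df.columns
    col_list = [df[col] for col in col_names]
    yield tuple(col_list)
    for i, col in enumerate(col_names):
        df[col] = col_list[i]

# reconstructed setup: 50 days, everything starting at zero
T = 50
state = pd.DataFrame(0., index=range(T), columns=['C', 'D', 'q', 'r', 's'])

with columns_in(state) as (C, D, q, r, s):
    for t in range(T - 1):
        C[t+1] = C[t] + q[t] - r[t]   # clean diapers on hand
        if C[t+1] < 0:
            C[t+1] = 0

        D[t+1] = D[t] + r[t] - s[t]   # dirty diapers accumulating
        q[t+1] = s[t+1-7] if t % 7 == 6 else 0   # weekly delivery
        if t == 6:
            q[t+1] = 70.   # special case, first delivery
        r[t+1] = 8.        # daily diaper use
        s[t+1] = D[t] if t % 7 == 6 else 0       # weekly pickup
        if t == 34:
            s[t+1] = 0     # special case, forgot the pickup
```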

Comments Off on System Dynamics Modeling in Python: Diaper Distribution System with Pandas

Filed under disease modeling

System Dynamics of Diaper Delivery

The impetus for my first personal, practical application of pure system dynamics modeling came from my sleep-deprived thoughts about Baby Diaper Service, which works as follows:

On your first and second deliveries you receive the full number of diapers you order. Thereafter, you will be on a rotation system. The number of diapers you turn in one week will be returned to you the next, up to your maximum order.

Simple? Sure, but complex enough to have two hallmarks of SDM: feedback and delay.

  • Feedback: The number of diapers delivered depends on the number I turn in.
  • Delay: The number of diapers I get this week is the number I turned in last week.

SDM, as developed in the early days of computer research, focused on “what if?” analysis. Not predicting the future that will be, but generating some range of ideas of the futures that could be.

This led to a very different sort of science than the contemporaneously developing field of econometrics. This divide is captured in a dialog I enjoyed in The Electronic Oracle (if I recall correctly; I don’t have it nearby).

However, the futures that could be are just what I was concerned about in those sleepless early months. I kept worrying about a dirty diaper catastrophe: what if we forget to put the dirty diapers out this week?

SDM can answer:

If I can make it through one tough week, I’ll earn a free pass for when I forget again.

Needless to say, I was greatly relieved to have the results of this simulation in my back pocket. Thank you, System Dynamics Modeling!

Epilogue: eventually I did miss one week and failed to turn in diapers to the delivery service. I was all ready to tough it out for my reward. But it turns out that you just call up the service and get them to come by for a special pickup for a small fee. Thank you, Baby Diaper Service!

Comments Off on System Dynamics of Diaper Delivery

Filed under disease modeling

I meet system dynamics modeling

There is an area of management science that I didn’t see much of in my days of ACO at CMU. It is yet another thing that I didn’t realize I’d be interested in when I was a student (like statistics). System dynamics modeling is its name.

I started to understand the importance of this research area when I was building my model of the Insecticide Treated mosquito Net distribution system a few years ago.

After we drew up a wonderful mess of boxes and arrows representing a compartmental model of nets flowing from manufacturers to distributors to households, I started thinking that this must have been done before. And indeed, in the early days of computer research, a bustling group centered at the Sloan School at MIT did do it. I learned my history on this from an interesting book, Feedback Thought in Social Science and Systems Theory by George Richardson. The practical conventions, as far as I know them, I put together from a primer and a textbook, both of which are loaded with demographic and epidemiological examples. In my research, I’ve always focused on something a little different from the mainstream of system dynamics modeling, using the compartmental models to justify mechanistically one or another particular nonlinear regression.

It was in my sleep-deprived state of parenthood to a one-month-old that the Sloan-School brand of system dynamics modeling really proved its utility for me.

1 Comment

Filed under disease modeling

Same Chernoff

I’ve been reading up on the Chernoff face for data visualization, which, as I’ve mentioned, is so cute. This helped me demonstrate that I haven’t forgotten everything from my grad school days, like a bound with a name similar to the face’s. Back in my olden days, I thought Chernoff bounds were the cutest, and Chernoff faces were quite far from what I spent my time on.

So what a nice continuity that I learned it was the same Chernoff who lent his name to both the bound and the face. He seems to have a good sense of humor about both, and says that a different Herman deserves credit for the bound:

My result, involving the infimum of a moment generating function, was less elegant and less general than the Cramer result, but did not require a special condition that Cramer required. Also, my proof could be described as crudely beating the problem to death. Herman claimed that he could get a lower bound much easier. I challenged him, and he produced a short Chebyshev Inequality type proof, which was so trivial that I did not trouble to cite his contribution.

What a mistake! It seems that Shannon had incorrectly applied the Central Limit theorem to the far tails of the distribution in one of his papers on Information theory. When his error was pointed out, he discovered the lower bound of Rubin in my paper and rescued his results. As a result I have gained great fame in electrical engineering circles for the Chernoff bound which was really due to Herman. One consequence of the simplicity of the proof was that no one ever bothered to read the original paper of which I was very proud. For years they referred to Rubin’s bound as the Chernov bound, not even spelling my name correctly. … Fortunately for me, my lasting fame, if any, will depend, not on Rubin’s bound, but on Chernoff faces.
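For readers who only know the face: one common modern form of the bound (the Chernoff-Hoeffding version for fair coin flips) says that P(X ≥ (1/2 + δ)n) ≤ exp(−2δ²n), and it comes from exactly that infimum over the moment generating function. A quick numeric sanity check:

```python
from math import comb, exp

def binomial_upper_tail(n, k):
    # exact P(X >= k) for X ~ Binomial(n, 1/2)
    return sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n

n, delta = 100, 0.1
k = int((0.5 + delta) * n)          # threshold: 60 heads out of 100
exact = binomial_upper_tail(n, k)
bound = exp(-2 * delta ** 2 * n)    # Chernoff-Hoeffding bound, exp(-2)
```

The exact tail probability indeed sits below the bound, and the gap shrinks as the threshold moves further into the tail.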

Comments Off on Same Chernoff

Filed under dataviz

Baby Faces II

I knew I read something about baby faces recently, and now I found it. It was in a Malcolm Gladwell book. This guy should be my inspiration.

Comments Off on Baby Faces II

Filed under dataviz

DataViz in Python: Chernoff Faces with Matplotlib

As promised previously, here is my effort towards drawing Chernoff faces as cute as they were in the original paper.

Isn’t that just the cutest? Code here.
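My actual code is linked above; for a self-contained flavor of the technique, here is a toy face-drawer with made-up feature mappings (my own simplification, nowhere near Chernoff's original parameter scheme):

```python
import matplotlib
matplotlib.use('Agg')   # draw without a display
import matplotlib.pyplot as plt
from matplotlib.patches import Arc, Ellipse

def toy_face(ax, x):
    """Map four values in [0, 1] to facial features;
    the mappings are invented for illustration."""
    # head: width and height from x[0], x[1]
    ax.add_patch(Ellipse((0, 0), .6 + .4 * x[0], .8 + .2 * x[1], fill=False))
    # eyes: size from x[2]
    for side in (-1, 1):
        ax.add_patch(Ellipse((side * .15, .15), .05 + .1 * x[2], .05,
                             fill=False))
    # mouth: smile or frown from x[3]
    theta1, theta2 = (200, 340) if x[3] > .5 else (20, 160)
    ax.add_patch(Arc((0, -.2), .3, .2, theta1=theta1, theta2=theta2))
    ax.set_xlim(-1, 1)
    ax.set_ylim(-1, 1)
    ax.set_aspect('equal')
    ax.axis('off')

fig, ax = plt.subplots()
toy_face(ax, [.5, .5, .5, .9])
```

Each data dimension drives one feature of the face, which is the whole trick; the hard part, as Chernoff found, is making the result cute.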

My replication was greatly aided by a graphic that I call “the missing figure”, which I found on a website that has since disappeared. It helped me understand what Chernoff’s paper was talking about when it says “the corner of the face”:

Here are a few other ways to look at it:

Varying one parameter at a time:

Lots and lots of random faces:

Making that figure reminds me of a wonderfully vintage quote from Chernoff’s 1972 paper:

“At this time the cost of drawing these faces is about 20 to 25 cents per face on the IBM 360-67 at Stanford University using the Calcomp Plotter. Most of this cost is in the computing, and I believe that it should be possible to reduce it considerably.”

I’d pay $250 for a beautiful plot of that last figure. But it is nice that the price has come down.


Filed under dataviz

Baby faces and Chernoff faces

I’ve been spending a lot of time staring at faces lately. Baby faces, to be specific. Baby faces are just wonderful, especially my baby’s, if I do say so myself. In fact, that adorable baby, said face, and said staring are making me forget my “blogging” voice. I think I can recover it with a little effort.

There is a face-related visualization technique that I’ve been planning to write about, and this seems like just the moment: recall Chernoff Faces. This 1970s-era method for multidimensional data visualization is so cute that it has been a recurring example in visualization education for 40 years, but it is so hard to use that it is almost never used in practice. Have you seen it before?

There are two major problems with this method: it is almost always a mistake to use it for public communication (Eugene Turner’s map of LA above is a rare exception), and it is almost always a mistake to use it in a dashboard. It has lived on from its original paper because it is so cute, but it made it to publication in its original paper because it was used appropriately: as a human computation aid in a clustering task.

It looks to me like there are three different shaped faces in the above figure. Agree?

This hasn’t always gone so well, though, as an example from a subsequent application by different authors shows below. In the following, do you see anything worth grouping together? I guess they show only a single exemplar from each cluster here, so they’re all supposed to look different.

But I’m still excited about it for the same reason as everyone else: faces are so cute! I have come up with a related human computation task where it may also be useful: outlier detection.

I’m not the first to want to do this, and it has even been attempted in health service research before:

(from here)

(from here)

A quick search turns up lots of code for generating faces from multidimensional data, but nothing as cute as Chernoff’s original work. I’ll remedy that in a near future post. Unless you know of something already out there that I missed.


Filed under dataviz