Monthly Archives: November 2012

Machine Learning in Python: an exercise for the reader in Milk

I have a few extensions and variations on the previous example that you can try, to test your understanding:

  1. (From the same questioning reader as before, now revealed to be Health Metrics PhD student David Phillips.)

    How does it work with more than two dimensions? I presume model.tree just gets really crazy with more if-else’s?

    The great thing about computer research is that you can experiment with this at no additional cost. Give it a try in 3-d (and let me know if you make a nice visualization of it!), and give it a try in 10-d or something even more imposing.

  2. Note that the file where the tree_learner class is housed also contains a class called stump_learner. Give that a try, and figure out what it is doing. It is pretty simple, but we can make it into something complicated soon.
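If you want to check your guess about what stump_learner does, here is a minimal sketch of the decision-stump idea in plain numpy. This is my own illustrative code, not Milk's (whose split criterion and internals differ): a stump is a one-level tree, a single threshold cut on a single feature, with a constant label on each side.

```python
import numpy as np

def train_stump(features, labels):
    """Exhaustively search for the single (feature, threshold, side) cut
    that minimizes training misclassification.  A sketch of the idea only;
    Milk's stump_learner differs in its details."""
    best, best_err = None, np.inf
    for j in range(features.shape[1]):
        for t in np.unique(features[:, j])[1:]:  # skip the min: empty left side
            left = features[:, j] < t
            for left_label in (0, 1):
                # predict left_label where x[j] < t, the other label elsewhere
                pred = np.where(left, left_label, 1 - left_label)
                err = np.mean(pred != labels)
                if err < best_err:
                    best_err, best = err, (j, t, left_label)
    return best

def apply_stump(stump, x):
    j, t, left_label = stump
    return left_label if x[j] < t else 1 - left_label

features = np.array([[0.], [1.], [2.], [3.]])
labels = np.array([0, 0, 1, 1])
stump = train_stump(features, labels)   # cuts at x[0] < 2
apply_stump(stump, [3.0])               # -> 1
```

On its own a stump is a weak classifier, which is exactly why it becomes interesting later, when we make it into something complicated.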


Filed under machine learning

Machine Learning in Python: decision trees in Milk

I started my exposition of the random forest in the last post with the first simple idea necessary to making it work, the decision tree. The video clip I linked there has a nice description at a high level, but leaves a lot of details as exercises for the reader.

One such reader, possibly to be mentioned by name, writes:

Question (I hope you knew you were getting yourself into my rabbit hole of questions): Does order matter? The way everyone’s description of a decision tree goes, they draw the thing sequentially (we start from the first branch/leaf, then narrow it down with the second branch/leaf etc). But it seems to me that as long as every branch is considered, it doesn’t matter which question came first or last; it’s basically just a huge string of conditional statements. Yes?

You can pretty much convert a decision tree to a huge string of conditional statements, an “or of ands” if the labels happen to be true and false! This brings it dangerously close to disjunctive normal form, the dual of the conjunctive normal form for boolean formulas that used to preoccupy so much of my time as a graduate student. But the order does matter, and matters quite a lot, because (the hope is) representing the decision tree as a tree is going to be much more efficient than representing it as an “or of ands”.
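To make the “or of ands” concrete, here is a little sketch that flattens a two-label tree into exactly that form: one “and” clause per root-to-leaf path ending in True, with the clauses “or”ed together. The tiny hand-written tree and its tuple representation are hypothetical, just for illustration, not Milk's internal format.

```python
# Hypothetical mini-tree: an internal node is (feature, threshold, left, right),
# where the left branch is taken when x[feature] < threshold; leaves are True/False.
tree = (0, 2.5,
        (1, 1.0, True, False),   # reached when x[0] < 2.5
        False)                   # reached when x[0] >= 2.5

def tree_to_dnf(node, conditions=()):
    """Collect one 'and' clause per root-to-leaf path that ends in True."""
    if node is True:
        return [conditions]
    if node is False:
        return []
    feature, threshold, left, right = node
    return (tree_to_dnf(left, conditions + ('x[%d] < %s' % (feature, threshold),))
            + tree_to_dnf(right, conditions + ('x[%d] >= %s' % (feature, threshold),)))

clauses = tree_to_dnf(tree)
# -> [('x[0] < 2.5', 'x[1] < 1.0')]  -- 'and' within a clause, 'or' across clauses
```

Note that flattening a deep tree this way can repeat the same condition in many clauses, which is the efficiency argument for keeping the tree a tree.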

This order-matters part becomes clear when you start really trying to build a decision tree, which I will now do twice. First, I will do it using Milk, a Machine Learning Toolkit for python. Then I will do it again, looking at the Milk source code.

Version one is in this notebook, which you can clone for yourself and play around with. It includes the huge list of conditional statements that correspond to a simple example.

In version one, all of the machine learning action is in the code block that says

learner = milk.supervised.tree_learner()
model = learner.train(features, labels)

What is this actually doing? You can dig into it by browsing the Milk source (thanks, open source movement!). The code for the tree_learner is pretty minimal; see it here. It pushes off all the work into a function called build_tree, which uses recursion, but is not overwhelming either. The precise choice of how to split at each node of the tree is handled by a function called _split, which is also relatively readable.
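To give a feel for what those functions do, here is a stripped-down sketch in the same spirit. The names mirror Milk's build_tree and _split, but the code is mine, and the split criterion (minimum weighted entropy over all feature/threshold pairs) is an assumption about what a reasonable _split computes, not a transcription of Milk's.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a 0/1 label array."""
    p = np.mean(labels)
    if p == 0 or p == 1:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def _split(features, labels):
    """Try every (feature, threshold) pair; keep the one whose two sides
    have the lowest weighted entropy.  The role of Milk's _split, not its code."""
    best_score, best_j, best_t = np.inf, None, None
    for j in range(features.shape[1]):
        for t in np.unique(features[:, j])[1:]:   # skip the min: empty left side
            left = features[:, j] < t
            score = (left.mean() * entropy(labels[left])
                     + (~left).mean() * entropy(labels[~left]))
            if score < best_score:
                best_score, best_j, best_t = score, j, t
    return best_j, best_t

def build_tree(features, labels):
    """Recursive tree builder: split, then recurse on each side
    until a node's labels are pure."""
    if len(set(labels)) == 1:
        return int(labels[0])    # leaf: the one remaining label
    j, t = _split(features, labels)
    left = features[:, j] < t
    return (j, t,
            build_tree(features[left], labels[left]),
            build_tree(features[~left], labels[~left]))

def apply_tree(node, x):
    """Walk from the root to a leaf, answering one threshold question per level."""
    while isinstance(node, tuple):
        j, t, left, right = node
        node = left if x[j] < t else right
    return node

features = np.array([[0.], [1.], [2.], [3.]])
labels = np.array([0, 0, 1, 1])
tree = build_tree(features, labels)   # (feature, threshold, left_subtree, right_subtree)
```

The recursion is the whole trick: each call handles one node, and the order in which questions get asked falls out of which split looks best on the data that reaches that node.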

Should that paragraph count as version two? What I really want is a narrated movie of stepping through this code in a debugger.


Filed under machine learning

Random Forests as in the Verbal Autopsy

Here is an interesting method that spans my old life in theory world and my new one in global health metrics: the Random Forest.

This is a technique that grows (get it?) out of research on decision trees, and it is a great example of how combining a few simple ideas can get complicated very quickly.

The task is the following: learn from labeled examples. (Is this yet another baby-related research topic? Not as directly as the last few.) To be specific, I start with a training data set; for the task at hand in global health, this may be the results of verbal autopsy interviews, all digitized and encoded as numeric data, together with the true underlying cause of death (as identified by gold-standard clinical diagnostic criteria) as the labels.

To “learn” in this case means to build a predictor that can take new, unlabeled examples and assign a cause of death to them.
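In code, “build a predictor” means something with the learner.train / model.apply shape that Milk uses. To fix the interface in your mind, here is a deliberately dumb stand-in of my own invention (a hypothetical baseline, certainly not a real verbal autopsy method): it ignores the features entirely and always predicts the most common training label.

```python
from collections import Counter

class MajorityLearner:
    """Toy learner: learns only the most frequent label in the training set.
    Purely illustrative -- an assumed baseline, not a VA method."""
    def train(self, features, labels):
        most_common_label = Counter(labels).most_common(1)[0][0]
        return MajorityModel(most_common_label)

class MajorityModel:
    def __init__(self, label):
        self.label = label
    def apply(self, x):
        # the features of the new example are ignored entirely
        return self.label

learner = MajorityLearner()
model = learner.train([[1, 0], [0, 1], [1, 1]],
                      ['pneumonia', 'injury', 'pneumonia'])
model.apply([0, 0])   # -> 'pneumonia'
```

Everything that follows is about doing better than this baseline by actually looking at the features.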

The first simple idea needed for the random forest is the decision tree, and I found a nice youtube video that explains it, so I don’t need to write it up myself:

Well, this video is not perfect; if you have not seen this before, you may be left with a few questions.


Filed under machine learning

System Dynamics Modeling in Python: Diaper Distribution System with Pandas

The figure from the last post was generated quite slickly with a little Python pattern that I want to share. It is all based on my new knowledge of Python’s “with” statement, and my new mainstay of pythonic data manipulation: pandas.

For easy reference, the compartmental model from that last post looks like this:

The “with” statement was a little hard for me to read up on, but I figured out a simple way to use it: the context manager.

from contextlib import contextmanager

@contextmanager
def columns_in(df):
    col_names = df.columns
    col_list = [df[col] for col in col_names]
    yield tuple(col_list)
    for i, col in enumerate(col_names):
        df[col] = col_list[i]

With “with”, the python code for the simulation is almost exactly the mathematical description of the simulation:

with columns_in(state) as (C,D,q,r,s):
    for t in range(T-1):
        C[t+1] = C[t] + q[t] - r[t]
        if C[t+1] < 0:
            C[t+1] = 0

        D[t+1] = D[t] + r[t] - s[t]
        q[t+1] = s[t+1-7] if t % 7 == 6 else 0
        # special case, first pickup, need to deliver something
        if t == 6:
            q[t+1] = 70.
        r[t+1] = 8.
        s[t+1] = D[t] if t % 7 == 6 else 0
        # special case, forget to take the diapers out
        if t == 34:
            s[t+1] = 0

Pretty cool, huh? Here is a notebook full of it (code).
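If you don’t want to open the notebook, here is a self-contained version you can paste into a Python session. The initial conditions (a maximum order of 70 diapers, 8 diapers used per day, an 8-week horizon, columns C = clean on hand, D = dirty waiting, q = quantity delivered, r = rate of use, s = diapers sent out) are my assumptions for illustration, not necessarily the notebook’s exact numbers.

```python
from contextlib import contextmanager
import pandas as pd

@contextmanager
def columns_in(df):
    """Yield the columns of df as a tuple of Series, and write any
    changes back into df when the with-block exits."""
    col_names = df.columns
    col_list = [df[col] for col in col_names]
    yield tuple(col_list)
    for i, col in enumerate(col_names):
        df[col] = col_list[i]

T = 56  # assumed horizon: 8 weeks, one time step per day
state = pd.DataFrame(index=range(T), columns=['C', 'D', 'q', 'r', 's'], data=0.)
state.loc[0, 'C'] = 70.  # assumed: start with a full order of clean diapers
state.loc[0, 'r'] = 8.   # assumed: 8 diapers used per day

with columns_in(state) as (C, D, q, r, s):
    for t in range(T-1):
        C[t+1] = C[t] + q[t] - r[t]              # clean diapers on hand
        if C[t+1] < 0:
            C[t+1] = 0

        D[t+1] = D[t] + r[t] - s[t]              # dirty diapers waiting
        q[t+1] = s[t+1-7] if t % 7 == 6 else 0   # delivery = last week's pickup
        # special case, first pickup, need to deliver something
        if t == 6:
            q[t+1] = 70.
        r[t+1] = 8.
        s[t+1] = D[t] if t % 7 == 6 else 0       # weekly pickup of the dirty pile
        # special case, forget to take the diapers out
        if t == 34:
            s[t+1] = 0
```

After it runs, state holds the whole trajectory, including the dip in clean diapers after the forgotten week-5 pickup, ready for plotting with state.plot().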


Filed under disease modeling

System Dynamics of Diaper Delivery

The impetus for my first personal, practical application of pure system dynamics modeling came from my sleep-deprived thoughts about Baby Diaper Service, which works as follows:

On your first and second deliveries you receive the full number of diapers you order. Thereafter, you will be on a rotation system. The number of diapers you turn in one week will be returned to you the next, up to your maximum order.

Simple? Sure, but complex enough to have two hallmarks of SDM: feedback and delay.

  • Feedback: The number of diapers delivered depends on the number I turn in.
  • Delay: The number of diapers I get this week is the number I turned in last week.

SDM, as developed in the early days of computer research, focused on “what if?” analysis. Not predicting the future that will be, but generating some range of ideas of the futures that could be.

This led to a very different sort of science than the contemporaneously developing field of econometrics. This divide is captured in a dialog I enjoyed in The Electronic Oracle (if I recall correctly; I don’t have it nearby).

However, the futures that could be are just what I was concerned about in those sleepless early months. I kept worrying about a dirty diaper catastrophe: what if we forgot to put the dirty diapers out one week?

SDM can answer:

If I can make it through one tough week, I’ll earn a free pass for when I forget again.

Needless to say, I was greatly relieved to have the results of this simulation in my back pocket. Thank you, System Dynamics Modeling!

Epilogue: eventually I did miss one week and failed to turn in diapers to the delivery service. I was all ready to tough it out for my reward. But it turns out that you just call up the service and get them to come by for a special pickup for a small fee. Thank you, Baby Diaper Service!


Filed under disease modeling

I meet system dynamics modeling

There is an area of management sciences that I didn’t see much of in my days of ACO at CMU. It is yet another thing that I didn’t realize I’d be interested in when I was a student (like statistics). System dynamics modeling is its name.

I started to understand the importance of this research area when I was building my model of the Insecticide Treated mosquito Net distribution system a few years ago.

After we drew up a wonderful mess of boxes and arrows representing a compartmental model of nets flowing from manufacturers to distributors to households, I started thinking that this must have been done before. And indeed, in the early days of computer research, a bustling group centered at the Sloan School at MIT did do it. I learned my history on this from an interesting book Feedback thought in social science and systems theory by George Richardson. The practical conventions, as far as I know them, I put together from a primer and a textbook, both of which are loaded with demographic and epidemiological examples. In my research, I’ve always focused on something a little different from the mainstream of system dynamics modeling, using the compartmental models to justify mechanistically one or another particular nonlinear regression.

It was in my sleep-deprived state of parenthood to a one-month-old that the Sloan-School brand of system dynamics modeling method really proved its utility for me.


Filed under disease modeling

Same Chernoff

I’ve been reading up on the Chernoff face for data visualization, which, as I’ve mentioned, is so cute. This helped me demonstrate that I haven’t forgotten everything from my grad school days, like a bound with a name similar to the face’s. Back in those olden days, I thought Chernoff bounds were the cutest, and Chernoff faces were quite far from what I spent my time on.

So what a nice continuity that I learned it was the same Chernoff who lent his name to both the bound and the face. He seems to have a good sense of humor about both, and says that a different Herman deserves credit for the bound:

My result, involving the infimum of a moment generating function, was less elegant and less general than the Cramer result, but did not require a special condition that Cramer required. Also, my proof could be described as crudely beating the problem to death. Herman claimed that he could get a lower bound much easier. I challenged him, and he produced a short Chebyshev Inequality type proof, which was so trivial that I did not trouble to cite his contribution.

What a mistake! It seems that Shannon had incorrectly applied the Central Limit theorem to the far tails of the distribution in one of his papers on Information theory. When his error was pointed out, he discovered the lower bound of Rubin in my paper and rescued his results. As a result I have gained great fame in electrical engineering circles for the Chernoff bound which was really due to Herman. One consequence of the simplicity of the proof was that no one ever bothered to read the original paper of which I was very proud. For years they referred to Rubin’s bound as the Chernov bound, not even spelling my name correctly. … Fortunately for me, my lasting fame, if any, will depend, not on Rubin’s bound, but on Chernoff faces.
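For reference (my gloss, not part of the quote), the bound in question controls tail probabilities through the moment generating function:

```latex
\Pr[X \ge a] \;\le\; \inf_{t > 0} \, e^{-ta} \, \mathbf{E}\left[e^{tX}\right]
```

Applied to a sum of independent random variables, this gives exponentially small tail bounds precisely in the far-tail regime where, as the quote recounts, a naive Central Limit Theorem approximation can fail.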


Filed under dataviz