I started my exposition of the random forest in the last post with the first simple idea necessary to making it work, the decision tree. The video clip I linked there has a nice description at a high level, but leaves a lot of details as exercise for the reader.
One such reader, possibly to be mentioned by name, writes:
Question (I hope you knew you were getting yourself into my rabbit hole of questions): Does order matter? The way everyone’s description of a decision tree goes, they draw the thing sequentially (we start from the first branch/leaf, then narrow it down with the second branch/leaf etc). But it seems to me that as long as every branch is considered, it doesn’t matter which question came first or last; it’s basically just a huge string of conditional statements. Yes?
You can pretty much convert a decision tree to a huge string of conditional statements, an “or of ands” if the labels happen to be true and false! This brings it dangerously close to conjugate normal form for boolean formulas that used to preoccupy so much of my time as a graduate student. But, the order does matter, and matters quite a lot, because (the hope is) representing the decision tree as a tree is going to be much more efficient than representing it as an “or of ands”.
This order-matters part becomes clear when you start really trying to build a decision tree, which I will now do twice. First, I will do it using Milk, a Machine Learning Toolkit for python. Then I will do it again, looking at the Milk sourcecode.
Version one is in this notebook, which you can clone for yourself and play around with. It includes the huge list of conditional statements that correspond to a simple example.
In version one, all of the machine learning action is in the code block that says
learner = milk.supervised.tree_learner()
model = learner.train(features, labels)
What is this actually doing? You can dig into it by browsing the Milk source (thanks open source movement!). The code for the tree_learner is pretty minimal, see it here. It pushes off all the work into a function called build_tree, which uses recursion, but is not overwhelming either. The precise choice of how to split at each node of the tree is handled by a function called _split, which is also relatively readable.
Should that paragraph count as version two? What I really want is a narrated movie of stepping through this code in a debugger.