Category Archives: machine learning

Before getting started with the Semantic Web

I mentioned recently the difficulty I had websearching my way into the Semantic Web, but one good lead did turn up: an O’Reilly book called Learning SPARQL and an associated blog by its author, Bob DuCharme. I was particularly interested in an essay on the culture gap between the Semantic Web and Big Data.

I can’t believe I just said I’m particularly interested in an essay about databases!


Getting started with the Semantic Web

I’ve been getting started with a new project, for which I need to get up to speed on this whole semantic web/linked data business. I was as let down by the results of my websearching as I was elevated by the tagged-and-up-voted material on Stack Overflow. Here is a little link library:

Why am I doing this? Because the supercomputer company Cray, Inc. has built a new type of supercomputer optimized for graph searching, and searching RDF with SPARQL is a low-overhead way to use it. They are running a contest for scientists to do something interesting with their new machine, and I am a contestant.
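
To make “searching RDF with SPARQL” concrete before any contest material arrives, here is a minimal sketch in Python using rdflib. The library choice, the example.org names, and the three-triple graph are all mine, not anything Cray requires; the point is just that a SPARQL query is a graph pattern, and the engine’s job is to find every match.

from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")

# a tiny RDF graph: three "knows" facts
g = Graph()
g.add((EX.alice, EX.knows, EX.bob))
g.add((EX.bob, EX.knows, EX.carol))
g.add((EX.bob, EX.knows, EX.dan))

# SPARQL query: who is two "knows" hops away from alice?
q = """
PREFIX ex: <http://example.org/>
SELECT ?fof WHERE {
  ex:alice ex:knows ?friend .
  ?friend ex:knows ?fof .
}
"""
for row in g.query(q):
    print(row.fof)   # carol and dan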


Machine Learning in Python: an exercise for the reader in Milk

I have a few extensions and variations on the previous example that you can try, to test your understanding:

  1. (From the same questioning reader as before, now revealed to be Health Metrics PhD student David Phillips)

    How does it work with more than two dimensions? I presume model.tree just gets really crazy with more if-else’s?

    The great thing about computer research is that you can experiment with this at no additional cost. Give it a try in 3-d (and let me know if you make a nice visualization of it!), and then in 10-d or something even more imposing. There is a sketch after this list to get you started.

  2. Note that the tree.py file where the tree_learner class is housed also contains a class called stump_learner. Give that a try, and figure out what it is doing. It is pretty simple, but we can make it into something complicated soon.
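
Here is one way to set up the 3-d version of exercise 1, with the stump of exercise 2 thrown in for comparison. Treat it as a sketch under two assumptions you should verify against Milk itself: that a trained model classifies a single point via model.apply, and that stump_learner is importable from milk.supervised.tree, where the post above says it lives. The synthetic labeling rule is entirely made up.

import numpy as np
import milk.supervised
import milk.supervised.tree

# 1000 synthetic points in 3-d, labeled by an arbitrary made-up rule
rng = np.random.RandomState(0)
features = rng.normal(size=(1000, 3))
labels = (features[:, 0] + features[:, 1] > features[:, 2]).astype(int)

# exercise 1: the same tree_learner as before, now in three dimensions
model = milk.supervised.tree_learner().train(features, labels)
print(model.apply(np.zeros(3)))   # classify one new 3-d point

# exercise 2: the one-split stump, for comparison
stump_model = milk.supervised.tree.stump_learner().train(features, labels)
print(stump_model.apply(np.zeros(3)))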


Machine Learning in Python: decision trees in Milk

I started my exposition of the random forest in the last post with the first simple idea necessary to make it work, the decision tree. The video clip I linked there has a nice high-level description, but leaves a lot of details as an exercise for the reader.

One such reader, possibly to be mentioned by name, writes:

Question (I hope you knew you were getting yourself into my rabbit hole of questions): Does order matter? The way everyone’s description of a decision tree goes, they draw the thing sequentially (we start from the first branch/leaf, then narrow it down with the second branch/leaf etc). But it seems to me that as long as every branch is considered, it doesn’t matter which question came first or last; it’s basically just a huge string of conditional statements. Yes?

You can pretty much convert a decision tree to a huge string of conditional statements, an “or of ands” if the labels happen to be true and false! This brings it dangerously close to the disjunctive normal form for boolean formulas that used to preoccupy so much of my time as a graduate student. But the order does matter, and matters quite a lot, because (the hope is) representing the decision tree as a tree is going to be much more efficient than representing it as an “or of ands”.
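
To see both halves of that answer in miniature, here is a toy two-split tree and its flattened “or of ands” form (my own toy example, not from the notebook):

def tree(x, y):
    # evaluates at most two tests, one per level of the tree
    if x < 2:
        return y < 3     # left branch
    else:
        return y >= 5    # right branch

def or_of_ands(x, y):
    # the same classifier, flattened into disjunctive normal form
    return (x < 2 and y < 3) or (x >= 2 and y >= 5)

The two functions compute the same labels, but the flattened form re-tests x in every clause; in a deep tree those redundant clauses multiply, which is why the tree representation, and hence the order of the tests, earns its keep.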

This order-matters part becomes clear when you start really trying to build a decision tree, which I will now do twice. First, I will do it using Milk, a Machine Learning Toolkit for Python. Then I will do it again, looking at the Milk source code.

Version one is in this notebook, which you can clone for yourself and play around with. It includes the huge list of conditional statements that correspond to a simple example.

In version one, all of the machine learning action is in the code block that says

import milk.supervised  # needed if you run the snippet outside the notebook

learner = milk.supervised.tree_learner()
model = learner.train(features, labels)  # fit a decision tree to the labeled data

What is this actually doing? You can dig into it by browsing the Milk source (thanks, open source movement!). The code for the tree_learner is pretty minimal; see it here. It pushes all the work off to a function called build_tree, which uses recursion, but is not overwhelming either. The precise choice of how to split at each node of the tree is handled by a function called _split, which is also relatively readable.

Should that paragraph count as version two? What I really want is a narrated movie of stepping through this code in a debugger.
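
Failing the movie, here is a simplified paraphrase of the recursive idea in plain Python and numpy. To be clear, this is my sketch, not Milk’s actual code: the real _split is more careful about its splitting criterion and options, but the shape of build_tree is the same, split the data, recurse on each side, stop at pure or tiny nodes.

import numpy as np
from collections import Counter, namedtuple

Node = namedtuple("Node", ["idx", "threshold", "left", "right"])

def entropy(labels):
    # how mixed-up the labels at a node are, in bits
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def split(features, labels):
    # try every (feature, threshold) pair; keep the one whose two
    # children have the lowest size-weighted entropy
    best_score, best_pair = np.inf, None
    for idx in range(features.shape[1]):
        for threshold in np.unique(features[:, idx]):
            left = features[:, idx] < threshold
            if left.all() or not left.any():
                continue
            score = (left.sum() * entropy(labels[left])
                     + (~left).sum() * entropy(labels[~left])) / len(labels)
            if score < best_score:
                best_score, best_pair = score, (idx, threshold)
    return best_pair

def build_tree(features, labels, min_size=2):
    # base case: a pure or tiny node becomes a leaf with the majority label
    if len(set(labels)) == 1 or len(labels) <= min_size:
        return Counter(labels).most_common(1)[0][0]
    pair = split(features, labels)
    if pair is None:   # no threshold separates the points
        return Counter(labels).most_common(1)[0][0]
    idx, threshold = pair
    go_left = features[:, idx] < threshold
    return Node(idx, threshold,
                build_tree(features[go_left], labels[go_left], min_size),
                build_tree(features[~go_left], labels[~go_left], min_size))

# try it on a toy problem
X = np.random.RandomState(0).normal(size=(40, 2))
y = (X[:, 0] > X[:, 1]).astype(int)
print(build_tree(X, y))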


Random Forests as in the Verbal Autopsy

Here is an interesting method that spans my old life in theory world and my new one in global health metrics: the Random Forest.

This is a technique that grows (get it?) out of research on decision trees, and it is a great example of how combining a few simple ideas can get complicated very quickly.

The task is the following: learn from labeled examples. (Is this yet another baby-related research topic? Not as directly as the last few.) To be specific, I start with a training data set; for the task at hand in global health, this may be the results of verbal autopsy interviews, all digitized and encoded as numeric data, together with the true underlying cause of death (as identified by gold-standard clinical diagnostic criteria) as the labels.

To “learn” in this case means to build a predictor that can take new, unlabeled examples and assign a cause of death to them.

The first simple idea needed for the random forest is the decision tree, and I found a nice youtube video that explains it, so I don’t need to write it up myself:

Well, this video is not perfect; if you have not seen this before, you may be left with a few questions.
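
And since the title promises a forest, here is the smallest version of the recipe the coming posts build toward: train many trees on bootstrap resamples of the training data and let them vote. This is my own sketch, not the verbal autopsy code; it assumes a Milk-style learner with .train and models with .apply, and it leaves out the per-split random feature subsampling that a full random forest adds.

import numpy as np
from collections import Counter

def train_forest(make_learner, features, labels, n_trees=100, seed=0):
    # fit one tree per bootstrap resample of the training data
    rng = np.random.RandomState(seed)
    models = []
    for _ in range(n_trees):
        rows = rng.randint(0, len(labels), size=len(labels))
        models.append(make_learner().train(features[rows], labels[rows]))
    return models

def forest_apply(models, x):
    # each tree votes on the new example; the forest reports the winner
    votes = [m.apply(x) for m in models]
    return Counter(votes).most_common(1)[0][0]

# for example, with Milk's tree as the base learner:
#   import milk.supervised
#   models = train_forest(milk.supervised.tree_learner, features, labels)
#   forest_apply(models, new_example)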
