Category Archives: machine learning

Using the sklearn grid_search tools

Scikit-learn has a really nice grid search module. It will soon be called model_selection, because it has grown beyond simple grid search. But here is the spirit of it:

import sklearn.svm, sklearn.grid_search, sklearn.datasets.samples_generator

# search over two kernels and five values of the regularization parameter C
parameters = {'kernel': ('poly', 'rbf'), 'C': [.01, .1, 1, 10, 100]}
clf = sklearn.grid_search.GridSearchCV(
    sklearn.svm.SVC(probability=True),
    parameters,
    n_jobs=64)  # fit the candidate models in parallel

# some synthetic data to demonstrate on
X, y = sklearn.datasets.samples_generator.make_classification(n_samples=200, n_features=5, random_state=12345)
clf.fit(X, y)
clf.best_params_

And say you want to take a careful look at the results? They are all in there, too: http://nbviewer.ipython.org/gist/aflaxman/cb0660e602d361d06599
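If you just want a quick look from the interpreter, the fitted search object carries the scores itself. A minimal sketch, assuming the old sklearn.grid_search API from the snippet above (the newer model_selection version exposes a cv_results_ dict instead of grid_scores_):

# each entry pairs a parameter setting with its cross-validated score
for score in clf.grid_scores_:
    print(score)

clf.best_score_      # best cross-validated score found
clf.best_estimator_  # the refit model, ready to predict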


What was I up to two years ago?

One fun thing about keeping my lab notebook in digital form with IPython Notebooks is that I can flip through my old work so easily. Did I say fun? I meant scary, and sometimes depressing. But yes, also fun.

For example, two years ago, I was working on some projects that are still not wrapped up today, and I was doing a lot of prep for the first edition of my now re-titled “machine learning for health metricians” class.

Hey, that includes the answer to a question someone just asked on stats.stackexchange: http://stats.stackexchange.com/q/149801/18291


Why do we call it “ridge” regression?

Asked and answered: http://stats.stackexchange.com/a/151351/18291

With a link to more detail: http://www.itl.nist.gov/div898/handbook/pri/section3/pri336.htm
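For reference, here is the estimator in question in standard notation (my own summary, not taken from either link): least squares plus an L2 penalty on the coefficients,

\hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2 = (X^\top X + \lambda I)^{-1} X^\top y

where \lambda \ge 0 controls how hard the coefficients are shrunk toward zero.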


I like the term OneHotEncoder

“Dummy variable” just sounds demeaning to me. http://stats.stackexchange.com/questions/149122/treating-missing-data-in-voting-pattern-analysis/149572#149572
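For anyone who hasn’t met the encoder in question, a minimal sketch of what it does (toy data of my own, not from the linked answer):

import numpy as np
import sklearn.preprocessing

# one categorical feature, coded as integers 0, 1, 2
X = np.array([[0], [1], [2], [1]])

enc = sklearn.preprocessing.OneHotEncoder()
enc.fit(X)
enc.transform(X).toarray()
# array([[ 1.,  0.,  0.],
#        [ 0.,  1.,  0.],
#        [ 0.,  0.,  1.],
#        [ 0.,  1.,  0.]])

Each category gets its own indicator column, which is exactly what the “dummy variables” of the regression world are.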


Interesting Q/A: some good questions about data transformation

I’m continuing my class-prep practice of searching through Cross-Validated questions with tags corresponding to upcoming class topics, and here are some interesting ones I found about data transformations:

http://stats.stackexchange.com/questions/46418/why-is-the-square-root-transformation-recommended-for-count-data
http://stats.stackexchange.com/questions/1444/how-should-i-transform-non-negative-data-including-zeros
http://stats.stackexchange.com/questions/27951/when-are-log-scales-appropriate
http://stats.stackexchange.com/questions/90149/pitfalls-to-avoid-when-transforming-data
http://stats.stackexchange.com/questions/60777/what-are-the-assumptions-of-negative-binomial-regression

The last one isn’t really about data transformations, but is still interesting.
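If you want to try the transformations these questions discuss, a minimal sketch with numpy (toy counts of my own, not from any of the answers):

import numpy as np

counts = np.array([0, 1, 3, 10, 50, 200])  # non-negative data, zeros included

np.sqrt(counts)   # the square-root transform often recommended for count data
np.log1p(counts)  # log(1 + x), one common way to log-transform data with zeros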


ML in Python: Getting the Decision Tree out of sklearn

I helped my students understand the decision tree classifier in sklearn recently. Maybe they think I helped too much. But I think it was good for them. We did an interesting little exercise, too, writing a program that writes a program that represents a decision tree. Maybe it will be useful to someone else as well:

import sklearn.tree

def print_tree(t, root=0, depth=1):
    # t is the low-level tree object, e.g. clf.tree_ from a fitted DecisionTreeClassifier
    if depth == 1:
        print('def predict(X_i):')
    indent = '    '*depth
    print(indent + '# node %s: impurity = %.2f' % (str(root), t.impurity[root]))
    left_child = t.children_left[root]
    right_child = t.children_right[root]

    if left_child == sklearn.tree._tree.TREE_LEAF:
        print(indent + 'return %s # (node %d)' % (str(t.value[root]), root))
    else:
        # sklearn sends samples with feature value <= threshold to the left child
        print(indent + 'if X_i[%d] <= %.2f: # (node %d)' % (t.feature[root], t.threshold[root], root))
        print_tree(t, root=left_child, depth=depth+1)

        print(indent + 'else:')
        print_tree(t, root=right_child, depth=depth+1)
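
To see the generated program, here is a minimal usage sketch (toy data of my own choosing):

import sklearn.datasets, sklearn.tree

X, y = sklearn.datasets.make_classification(n_samples=100, n_features=4, random_state=12345)
clf = sklearn.tree.DecisionTreeClassifier(max_depth=2).fit(X, y)

print_tree(clf.tree_)  # emits a predict() function mirroring the fitted tree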

See it in action here.

Did I do this for MILK a few years ago? I’m becoming an absent-minded professor ahead of my time.


ML in Python: Decision Trees with Pandas

Doctors love decision trees, computer scientists love recursion, so maybe that’s why decision trees have been coming up so much in the Artificial Intelligence for Health Metricians class I’m teaching this quarter. We’ve been very sklearn-focused in our labs so far, but I thought my students might like to see how to build their own decision tree learner from scratch. So I put together this little notebook for them. Unfortunately, it is a little too complicated to make them do it themselves in a quarter-long class with no programming prerequisites.
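The notebook has the whole learner; for a taste of the core step, here is a minimal sketch of choosing a split by Gini impurity with pandas (my own simplification, not the notebook’s code):

import pandas as pd

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions
    p = labels.value_counts(normalize=True)
    return 1 - (p**2).sum()

def best_split(df, features, target):
    # try every (feature, threshold) pair, keep the one that minimizes
    # the weighted impurity of the two resulting child nodes
    best = None
    for f in features:
        for threshold in df[f].unique():
            left, right = df[df[f] <= threshold], df[df[f] > threshold]
            if len(left) == 0 or len(right) == 0:
                continue
            score = (len(left)*gini(left[target]) + len(right)*gini(right[target])) / len(df)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best

df = pd.DataFrame({'x1': [1, 2, 3, 4, 5, 6],
                   'x2': [5, 1, 4, 2, 6, 3],
                   'y':  [0, 0, 0, 1, 1, 1]})
print(best_split(df, ['x1', 'x2'], 'y'))  # (0.0, 'x1', 3): splitting on x1 <= 3 is perfect here

Recursing on the two halves until the nodes are pure (or some stopping rule kicks in) gives the full tree learner.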


DBER

In a recent post, I confessed my interest in a National Academy Press report on teaching methods. The tough thing for me about using this discipline-based education research (DBER) approach is not the name or the acronym, but coming up with the misunderstood concepts from the discipline that students benefit from learning actively. In the report’s examples, these concepts seem to have been articulated by geniuses dedicated to teaching, after years of student observation. I don’t know if I’ll get there one day, but I’m certainly not there now.

But I had a great idea, or at least one that I think is great: see what people are confused by online. I tried this out for my lecture last week on cross-validation, using the stats.stackexchange site: http://stats.stackexchange.com/questions/tagged/cross-validation?sort=votes&pageSize=50

After reading a ton of these, I decided that if my students know when they need test/train/validation splits and when they can get away with test/train splits, then they’ve really figured things out. Now I can’t find the question that I thought distilled this best, though.
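For concreteness, a minimal sketch of the three-way version (my own toy example, not from the question I lost): the validation set chooses among models, and the test set is touched exactly once at the end.

import sklearn.datasets, sklearn.svm
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older versions

X, y = sklearn.datasets.make_classification(n_samples=300, random_state=12345)

# carve off a test set first, then split the rest into train and validation
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# the validation set picks the model...
scores = {C: sklearn.svm.SVC(C=C).fit(X_train, y_train).score(X_valid, y_valid)
          for C in [.1, 1, 10]}
best_C = max(scores, key=scores.get)

# ...and the test set is used once, for the final honest estimate
sklearn.svm.SVC(C=best_C).fit(X_train, y_train).score(X_test, y_test)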


Quinlan stuff

To complement that ASA address about what statistics is that I read last week, here is the abstract of the KDD address about what data mining is: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=754923

Does the talk exist somewhere?


Verbal Autopsy Analysis Slides

Last August I gave a talk on automatic methods to map from verbal autopsy interview results to underlying causes of death, and I like the slides so much that I’m going to put them online here. Cheers to the new IHME-themed templates for PowerPoint!
