Interesting Q/A: autocorrelation for categorical var in MCMC

A question I run into from time to time:
> Are there any measures of auto-correlation for a sequence of observations of an (unordered) categorical variable?

An (accepted) answer that got me thinking:
> [L]ook directly at the convergence rate for the Markov chain.

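My interpretation, in PyMC2 terms: run the chain, calculate the empirical transition probabilities for the categorical variable, and examine the spectral gap of the resulting transition matrix. A gap near zero (second-largest eigenvalue modulus near 1) signals slow mixing.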
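
Until the notebook exists, here is a minimal sketch of that recipe in plain NumPy. The trace is simulated from a sticky 3-state chain as a stand-in for real MCMC samples of a categorical variable, and the helper names `empirical_transition_matrix` and `spectral_gap` are made up for illustration. It also assumes the sampled variable behaves roughly like a first-order Markov chain on its own, which need not hold when it is one component of a larger model.

```python
import numpy as np

def empirical_transition_matrix(trace, n_states):
    """Row-normalized counts of i -> j moves between consecutive samples."""
    counts = np.zeros((n_states, n_states))
    for a, b in zip(trace[:-1], trace[1:]):
        counts[a, b] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # states that were never left get a uniform row (arbitrary choice for this sketch)
    return np.where(row_sums > 0, counts / np.maximum(row_sums, 1), 1.0 / n_states)

def spectral_gap(P):
    """One minus the second-largest eigenvalue modulus of a transition matrix."""
    moduli = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
    return 1.0 - moduli[1]

# Stand-in for an MCMC trace: simulate a chain that stays put 90% of the time.
rng = np.random.RandomState(0)
P_true = np.array([[0.90, 0.05, 0.05],
                   [0.05, 0.90, 0.05],
                   [0.05, 0.05, 0.90]])
trace = [0]
for _ in range(5000):
    trace.append(rng.choice(3, p=P_true[trace[-1]]))
trace = np.array(trace)

P_hat = empirical_transition_matrix(trace, 3)
print(P_hat.round(2))
print('spectral gap: %.3f' % spectral_gap(P_hat))
```

For this toy chain the true eigenvalues are 1, 0.85, and 0.85, so the estimated gap should land near 0.15.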

Experimental notebook tk.

Filed under MCMC

Interesting Q/A: some good questions about data transformation

I’m continuing my class-prep practice of searching through Cross-Validated questions with tags corresponding to upcoming class topics, and here are some interesting ones I found about data transformations:

The last one isn’t really about data transformations, but is still interesting.

Filed under machine learning

Tables of Stacked Bars in mpl (but not mpld3)

Here is a little feature in Matplotlib that I never saw before: stacked bar plots with tables attached. Perhaps too ugly for my Iraq Mortality stacked bar charts, but definitely handy for exploratory work.
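
For reference, a rough sketch of the idea using `Axes.table`. The numbers and labels are invented for illustration, and the layout tweaks are guesses that will probably need adjusting for real data.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Made-up counts: each row is one stacked series, each column is one bar.
data = np.array([[3, 5, 2, 4],
                 [2, 1, 4, 3],
                 [4, 2, 1, 2]])
row_labels = ['series A', 'series B', 'series C']
col_labels = ['Q1', 'Q2', 'Q3', 'Q4']

fig, ax = plt.subplots()
x = np.arange(data.shape[1])
bottom = np.zeros(data.shape[1])
for row, label in zip(data, row_labels):
    # each series sits on top of the running total of the ones before it
    ax.bar(x, row, bottom=bottom, label=label)
    bottom += row

# The table takes the place of the x tick labels, one column per bar.
ax.set_xticks([])
table = ax.table(cellText=[[str(v) for v in row] for row in data],
                 rowLabels=row_labels, colLabels=col_labels, loc='bottom')
fig.subplots_adjust(left=0.2, bottom=0.25)  # leave room for the table
fig.canvas.draw()  # render; call fig.savefig(...) with a filename to write an image
```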

I learned about it because it doesn’t work in `mpld3`… just one more benefit of being part of an open-source project. It would be so cool to have an `mpld3` version with some interactivity included, since interactivity can address one of the pitfalls of the stacked bar chart: the challenge of comparing lengths with different baselines.

Filed under dataviz

ML in Python: Getting the Decision Tree out of sklearn

I helped my students understand the decision tree classifier in sklearn recently. Maybe they think I helped too much. But I think it was good for them. We did an interesting little exercise, too, writing a program that writes a program that represents a decision tree. Maybe it will be useful to someone else as well:

import sklearn.tree

def print_tree(t, root=0, depth=1):
    # t is the low-level tree of a fitted classifier, e.g. clf.tree_
    if depth == 1:
        print('def predict(X_i):')
    indent = '    '*depth
    print(indent + '# node %s: impurity = %.2f' % (str(root), t.impurity[root]))
    left_child = t.children_left[root]
    right_child = t.children_right[root]
    if left_child == sklearn.tree._tree.TREE_LEAF:
        # leaf node: emit the value array stored at this node
        print(indent + 'return %s # (node %d)' % (str(t.value[root]), root))
    else:
        # internal node: sklearn sends samples with feature <= threshold to the left child
        print(indent + 'if X_i[%d] <= %.2f: # (node %d)' % (t.feature[root], t.threshold[root], root))
        print_tree(t, root=left_child, depth=depth+1)
        print(indent + 'else:')
        print_tree(t, root=right_child, depth=depth+1)

See it in action here.
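
If you want to check that the generated program really agrees with the fitted tree, here is a sketch of one way to do it. It is a variation on the exercise, not the version from class: the helper `tree_to_source` is a hypothetical name, it collects the source lines instead of printing them, it writes thresholds at full precision, and each leaf returns the index of the majority class so the output can be run with `exec`. The iris data and `max_depth=3` are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree._tree import TREE_LEAF

def tree_to_source(t, node=0, depth=1):
    """Return Python source lines for a predict(X_i) function mirroring tree t."""
    lines = ['def predict(X_i):'] if depth == 1 else []
    indent = '    ' * depth
    if t.children_left[node] == TREE_LEAF:
        # leaf: return the index of the class with the largest stored value
        lines.append(indent + 'return %d  # (node %d)' % (np.argmax(t.value[node]), node))
    else:
        # sklearn routes samples with feature <= threshold to the left child
        lines.append(indent + 'if X_i[%d] <= %r:  # (node %d)'
                     % (t.feature[node], float(t.threshold[node]), node))
        lines += tree_to_source(t, t.children_left[node], depth + 1)
        lines.append(indent + 'else:')
        lines += tree_to_source(t, t.children_right[node], depth + 1)
    return lines

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

source = '\n'.join(tree_to_source(clf.tree_))
print(source)

namespace = {}
exec(source, namespace)  # defines namespace['predict']
generated = [namespace['predict'](row) for row in iris.data]
print('matches clf.predict:', generated == list(clf.predict(iris.data)))
```

Returning the class index only lines up with `clf.predict` here because the iris labels are already 0, 1, 2. With other labels you would map the index through `clf.classes_`.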

Did I do this for MILK a few years ago? I’m becoming an absent-minded professor ahead of my time.

Filed under machine learning

Data Science Seminars

These seminars that eScience and company are putting on are great. I have to go to the IHME seminars scheduled at a competing time once in a while, so someone else attend and tell me about this one:

Filed under Uncategorized

Stephen Few on Missing Values

A new edition of the Visual Business Intelligence Newsletter crossed my inbox recently, on how to display timeseries with missing and incomplete values:

Good, simple ideas are our most precious intellectual commodity.

Filed under Uncategorized

That Docker thing sounds promising

I missed this presentation, but I am going to figure out how to use Docker for reproducible research soon!


Filed under software engineering