Interesting Q/A: autocorrelation for categorical var in MCMC

A question I run into from time to time:
> Are there any measures of auto-correlation for a sequence of observations of an (unordered) categorical variable?

An (accepted) answer that got me thinking:
> [L]ook directly at the convergence rate for the Markov chain.

My interpretation, in PyMC2 terms: run chain, calculate empirical transition probabilities for categorical variable, examine spectral gap.
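The recipe above might look something like this (a sketch of my own, not from the answer; `spectral_gap` and the toy traces are invented names): count the observed transitions, row-normalize them into a stochastic matrix, and report one minus the modulus of the second-largest eigenvalue.

```python
import numpy as np

def spectral_gap(trace, k):
    """Empirical spectral gap for a sequence of states in {0, ..., k-1}.

    Builds the empirical transition matrix P from observed transitions
    and returns 1 - |lambda_2|: a gap near 1 suggests fast mixing, a gap
    near 0 a sticky (highly autocorrelated) chain.
    """
    P = np.zeros((k, k))
    for a, b in zip(trace[:-1], trace[1:]):
        P[a, b] += 1
    P /= P.sum(axis=1, keepdims=True)  # assumes every state was visited
    moduli = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
    return 1.0 - moduli[1]

rng = np.random.default_rng(0)
iid_trace = rng.integers(0, 3, size=10_000)  # no autocorrelation
sticky_trace = np.repeat([0, 1, 2], 100)     # very sticky chain
print(spectral_gap(iid_trace, 3))     # close to 1
print(spectral_gap(sticky_trace, 3))  # close to 0
```

For an i.i.d. trace the rows of P all approach the marginal distribution, so the second eigenvalue shrinks toward zero and the gap toward one; a chain that rarely leaves its current state does the opposite.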

Experimental notebook tk.

Comments Off on Interesting Q/A: autocorrelation for categorical var in MCMC

Filed under MCMC

Interesting Q/A: some good questions about data transformation

I’m continuing my class-prep practice of searching through Cross-Validated questions with tags corresponding to upcoming class topics, and here are some interesting ones I found about data transformations:

The last one isn’t really about data transformations, but is still interesting.


Filed under machine learning

Tables of Stacked Bars in mpl (but not mpld3)

Here is a little feature in Matplotlib that I never saw before: stacked bar plots with tables attached. Perhaps too ugly for my Iraq Mortality stacked bar charts, but definitely handy for exploratory work.

I learned about it because it doesn’t work in `mpld3`… just one more benefit of being part of an open-source project. It would be so cool to have an `mpld3` version with some interactivity included, since interactivity can address one of the pitfalls of the stacked bar chart: the challenge of comparing lengths with different baselines.
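In case the link rots, here is a minimal sketch of the feature (modeled on Matplotlib’s table demo; the data is invented): `Axes.bar` with a running `bottom` builds the stack, and `Axes.table(..., loc='bottom')` attaches the numbers underneath.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so this runs in a script
import matplotlib.pyplot as plt
import numpy as np

data = np.array([[10., 30., 20.],
                 [20., 25., 15.]])   # two series stacked over three columns
columns = ('A', 'B', 'C')
rows = ('series 1', 'series 2')

fig, ax = plt.subplots()
x = np.arange(len(columns))
bottom = np.zeros(len(columns))
for row in data:
    ax.bar(x, row, width=0.5, bottom=bottom)  # each series sits on the last
    bottom += row

# the table rides along the bottom edge of the axes
ax.table(cellText=[['%d' % v for v in row] for row in data],
         rowLabels=rows, colLabels=columns, loc='bottom')
ax.set_xticks([])  # the table replaces the x tick labels
fig.savefig('stacked_with_table.png')
```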


Filed under dataviz

ML in Python: Getting the Decision Tree out of sklearn

I helped my students understand the decision tree classifier in sklearn recently. Maybe they think I helped too much. But I think it was good for them. We did an interesting little exercise, too, writing a program that writes a program that represents a decision tree. Maybe it will be useful to someone else as well:

import sklearn.tree

def print_tree(t, root=0, depth=1):
    if depth == 1:
        print('def predict(X_i):')
    indent = '    ' * depth
    print(indent + '# node %s: impurity = %.2f' % (str(root), t.impurity[root]))
    left_child = t.children_left[root]
    right_child = t.children_right[root]
    if left_child == sklearn.tree._tree.TREE_LEAF:
        print(indent + 'return %s # (node %d)' % (str(t.value[root]), root))
    else:
        print(indent + 'if X_i[%d] < %.2f: # (node %d)' % (t.feature[root], t.threshold[root], root))
        print_tree(t, root=left_child, depth=depth + 1)
        print(indent + 'else:')
        print_tree(t, root=right_child, depth=depth + 1)

See it in action here.
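For anyone trying this at home, a quick look at the low-level `Tree` object the function walks (my example, not from class): `clf.tree_` stores the fitted tree as parallel per-node arrays, which is exactly what `t.impurity`, `t.children_left`, `t.feature`, and `t.threshold` index into above.

```python
# Fit a tiny tree on the iris data and poke at clf.tree_.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree._tree import TREE_LEAF

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

t = clf.tree_
print(t.node_count)            # number of nodes in the fitted tree
print(t.feature[0])            # feature index the root splits on
print(t.children_left[0] == TREE_LEAF)  # False: the root is not a leaf
```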

Did I do this for MILK a few years ago? I’m becoming an absent-minded professor ahead of my time.


Filed under machine learning

Data Science Seminars

These seminars that eScience and company are putting on are great. I have to go to the IHME seminars scheduled at a competing time once in a while, so someone else should attend and tell me about this one:


Filed under Uncategorized

Stephen Few on Missing Values

A new edition of the Visual Business Intelligence Newsletter crossed my inbox recently, on how to display timeseries with missing and incomplete values:

Good, simple ideas are our most precious intellectual commodity.
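I won’t reproduce the newsletter, but one simple idea in this spirit (my sketch, not necessarily Few’s recommendation): leave missing observations as NaN and Matplotlib breaks the line there, so the gap stays visible instead of being interpolated over.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so this runs in a script
import matplotlib.pyplot as plt
import numpy as np

t = np.arange(12)
y = np.sin(t / 2.0)
y[[4, 5, 8]] = np.nan  # simulate three missing observations

fig, ax = plt.subplots()
line, = ax.plot(t, y, marker='o')  # the line breaks at each NaN
ax.set_title('Gaps stay visible where data is missing')
fig.savefig('missing.png')
```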


Filed under Uncategorized

That Docker thing sounds promising

I missed this presentation, but I am going to figure out how to use Docker for reproducible research soon!


Filed under software engineering