Category Archives: TCS

New book on Probabilistic Programming

This crossed my inbox recently:

The e-book is available now from https://www.manning.com/books/practical-probabilistic-programming. The print version will be available on Friday.

From the cover:

Practical Probabilistic Programming introduces the working programmer to probabilistic programming. In it, you’ll learn how to use the PP paradigm to model application domains and then express those probabilistic models in code. Although PP can seem abstract, in this book you’ll immediately work on practical examples, like using the Figaro language to build a spam filter and applying Bayesian and Markov networks to diagnose computer system data problems and recover digital images.

Comments Off on New book on Probabilistic Programming

Filed under TCS

Reddit asks about IBM Watson

Is IBM Watson just (mostly) marketing? (self.MachineLearning)

https://www.reddit.com/r/MachineLearning/comments/3qmfbz/is_ibm_watson_just_mostly_marketing/

Comments Off on Reddit asks about IBM Watson

Filed under TCS

Combinatorial Optimization for GIS

A real, applied problem in spatial epidemiology crossed my desk last week, and it turns out that it is a super-fun combinatorial optimization challenge, too.

Details here: http://gis.stackexchange.com/questions/126280/group-polygon-features-to-match-a-set-of-specifications

I don’t have time to play around with it a lot now, but I did try a little stochastic search, which makes me think that this will not be trivial to solve:

[Figure: bad_clustering, the not-so-great grouping found by a first stochastic search]
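
To give a flavor of what “a little stochastic search” means here, below is a toy version in Python. The objective and the data are made-up placeholders (the real specifications, and any spatial constraints like contiguity, live on the stackexchange question); it just moves a random polygon to a random group and keeps the move if it doesn’t hurt.

    import numpy as np

    # Toy stand-in for the real problem: assign each polygon to one of k groups
    # so that each group's total "population" is close to a common target.
    rng = np.random.default_rng(12345)
    n_polygons, n_groups = 200, 10
    population = rng.integers(100, 1000, size=n_polygons)
    target = population.sum() / n_groups

    def mismatch(assignment):
        """Sum of squared deviations of the group totals from the target."""
        totals = np.bincount(assignment, weights=population, minlength=n_groups)
        return np.sum((totals - target) ** 2)

    # the stochastic search: perturb, keep the change if it doesn't make things worse
    assignment = rng.integers(0, n_groups, size=n_polygons)
    best = mismatch(assignment)
    for _ in range(10000):
        i, g = rng.integers(n_polygons), rng.integers(n_groups)
        old = assignment[i]
        assignment[i] = g
        new = mismatch(assignment)
        if new <= best:
            best = new
        else:
            assignment[i] = old

    print(best)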

Comments Off on Combinatorial Optimization for GIS

Filed under TCS

How do you argue a query is impossible in SPARQL?

I’ve come to believe that I’m stumped in querying the existence of (s,t)-paths with SPARQL 1.0 or the k nearest neighbors with SPARQL 1.1 because of limitations of the query language, not limitations in my query-writing ability. But how to prove it? Or even provide some evidence? Tell me on cstheory.stackexchange.

Maybe this paper is relevant: Semantics and Complexity of SPARQL

Comments Off on How do you argue a query is impossible in SPARQL?

Filed under TCS

Computation time and model development

20 seconds, 20 minutes, or 20 hours. These are all amounts of time that computational methods I’ve worked on have taken, at one time or another, to complete processing. They each lead to a very different experience for the model developer, and probably, in the end, for the model, too. Twenty seconds is definitely what I prefer.

Comments Off on Computation time and model development

Filed under statistics, TCS

Age-heaping and Hedgehogs

I heard an interesting talk a few weeks ago about “age-heaping” in survey responses, the phenomenon where people remember ages imprecisely and say that their siblings are ages that are divisible by 5 much more often than expected.  There are some nice theory challenges here, with a big dose of stats modeling, but I’ll have to share some more thoughts on that later.

In the talk, the age-heaping was also referred to as a hedgehog or porcupine plot, because of the spiky histogram that the data produces. I was looking for a nice picture of one, or some additional background reading, and when I searched for “hedgehog statistical plots”, all Google would give me was a bunch of pages about stats on actual hedgehogs. Cute!
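
If you want to see the spikes for yourself, here is a quick simulation sketch (completely made-up data, not anything from the talk): generate smooth “true” ages, round a fraction of the reports to the nearest multiple of 5, and plot the histogram.

    import numpy as np
    import matplotlib.pyplot as plt

    # Toy illustration of age-heaping: true ages are smooth, but 40% of
    # respondents round their report to the nearest multiple of 5.
    rng = np.random.default_rng(0)
    true_age = rng.gamma(shape=8., scale=4., size=20000).astype(int)
    heaped = rng.random(true_age.size) < .4
    reported = np.where(heaped, 5 * np.round(true_age / 5.), true_age).astype(int)

    plt.hist(reported, bins=np.arange(90))
    plt.xlabel('Reported age')
    plt.ylabel('Count')
    plt.title('Hedgehog plot from simulated age-heaping')
    plt.show()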

3 Comments

Filed under TCS

Algorithms in the Field Workshop

Last week I attended the workshop Algorithms in the Field or “8F”, as its puzzling acronym turns out to be. David Eppstein published comprehensive notes on the talks on his blog.

I like to think that I’m doing “8F” in my global health job, and also some algorithms in the forest, algorithms in the desert, etc. But this workshop gave me a chance to think about the connection to Algorithms with a capital “A”, the kind going on in the theory group of the ivory-towered computer science department. I’ve been pretty successful at coming up with little challenges in global health where algorithmic thinking is useful (e.g. Doctors = Noise Machines), and I’m going to try to use the blog to throw more of these puzzles over the fence in the next few weeks. My barrier to doing this in the past has been the amount of background research I need to do to avoid sounding foolish. But my writing style was never suited to caution. Let’s see how it goes. I hope that even half-explained connections between Algorithms and global health will inspire some algorithmitician to fill in the details.

What I challenge myself to do more of is to go beyond the little puzzles, and synthesize something bigger to ask from algorithmists. What algorithmic innovation would really change how we’re doing things in Global Health? This is a domain where avoiding foolishness is even more of a recipe for silence. But I will try.

On my mind now: the computation time necessary to fit a model. 20 seconds, 20 minutes, or 20 hours is a really big difference. More on that thought to come.

Question for readers from Global Health departments: what do you think capital-A algorithms researchers can offer us?

Comments Off on Algorithms in the Field Workshop

Filed under global health, TCS

MCMC in Python: Part IIb of PyMC Step Methods and their pitfalls

Two weeks ago, I slapped together a post about the banana distribution and some of my new movies sampling from it. Today I’m going to explore the performance of MCMC with the Adaptive Metropolis step method when the banana dimension grows and it becomes more curvy.

To start things off, have a look at the banana model, as implemented for PyMC:

    import pylab as pl
    import pymc as mc

    dim, b = 2, .03  # example values; the videos below vary the dimension and the curvature

    C_1 = pl.ones(dim)
    C_1[0] = 100.
    X = mc.Uninformative('X', value=pl.zeros(dim))

    def banana_like(X, tau, b):
        phi_X = pl.copy(X)
        phi_X *= 30. # rescale X to match scale of other models
        phi_X[1] = phi_X[1] + b*phi_X[0]**2 - 100*b

        return mc.normal_like(phi_X, 0., tau)

    @mc.potential
    def banana(X=X, tau=C_1**-1, b=b):
        return banana_like(X, tau, b)

Pretty simple stuff, right? This is a pattern that I find myself using a lot: an uninformative stochastic variable with the intended distribution implemented as a potential. If I understand correctly, it is something that you can’t do easily with BUGS.

Adaptive Metropolis is (or was) the default step method that PyMC uses to fit a model like this, and you can ensure that it is used (and fiddle with the AM parameters) with the use_step_method() function:

mod.use_step_method(mc.AdaptiveMetropolis, mod.X)

Important Tip: If you want to experiment with different step methods, you need to call mod.use_step_method() before doing any sampling. When you call mod.sample(), any stochastics without an assigned step method are given whichever method PyMC deems best, and once they have been assigned, use_step_method() adds additional methods but doesn’t remove the automatically assigned ones. (See this discussion from the PyMC mailing list.)
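
Here is a minimal sketch of that order of operations, using the X and banana variables from the model above (the chain lengths and the AM parameters are just placeholders to show where they go):

    import pymc as mc

    mod = mc.MCMC([X, banana])

    # assign the step method *before* sampling, and fiddle with AM parameters here
    mod.use_step_method(mc.AdaptiveMetropolis, X, delay=1000, interval=200)

    mod.sample(iter=200000, burn=100000, thin=100)

    mc.Matplot.plot(mod)  # trace, histogram, and autocorrelation plots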

So let’s see how this Adaptive Metropolis performs on a flattened banana as the dimension increases:

You can tell what dimension the banana lives in by counting up the panels in the autocorrelation plots in the upper right of each video: 2, 4, or 8. AM is definitely up to the challenge of sampling from an 8-dimensional “flattened banana”, but this is nothing more than an uncorrelated normal distribution that has been stretched along one axis, so the results shouldn’t impress anyone.

Here is what happens in a more curvy banana:

Now you can see the performance start to degrade. AM does just fine in a 2-d banana, but it’s going to need more time to sample appropriately from a 4-d one, and even more for the 8-d. This is similar to the way a cross-diagonal region breaks AM, but a lot more like something that could come up in a routine statistical application.

For a serious stress test, I can curve the banana even more:

Now AM has trouble even with the two dimensional version.

I’ve had an idea to take this qualitative exploration and quantify it to do a real “experimental analysis of algorithms”-style analysis. Have you seen something like this already? I do have enough to do without making additional work for myself.
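
In case someone wants to beat me to it, here is a rough sketch of the kind of experiment I have in mind (the banana_model wrapper just repackages the snippet above, the chain lengths are placeholders, and the autocorrelation-time estimator is deliberately crude):

    import numpy as np
    import pymc as mc

    def banana_model(dim, b):
        """Repackage the banana snippet above as a function of dim and b."""
        C_1 = np.ones(dim)
        C_1[0] = 100.
        X = mc.Uninformative('X', value=np.zeros(dim))

        @mc.potential
        def banana(X=X, tau=C_1**-1, b=b):
            phi_X = np.copy(X) * 30.  # same rescaling as above
            phi_X[1] = phi_X[1] + b*phi_X[0]**2 - 100*b
            return mc.normal_like(phi_X, 0., tau)

        return X, banana

    def integrated_autocorr_time(x, max_lag=500):
        """Crude integrated autocorrelation time of a 1-d trace."""
        x = x - x.mean()
        acf = [np.mean(x[:-k] * x[k:]) / x.var() for k in range(1, max_lag)]
        return 1. + 2. * np.sum(acf)

    for dim in (2, 4, 8):
        for b in (.03, .1):
            X, banana = banana_model(dim, b)
            mod = mc.MCMC([X, banana])
            mod.use_step_method(mc.AdaptiveMetropolis, X)
            mod.sample(iter=60000, burn=10000)
            print(dim, b, integrated_autocorr_time(X.trace()[:, 1]))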

Comments Off on MCMC in Python: Part IIb of PyMC Step Methods and their pitfalls

Filed under statistics, TCS

MCMC in Python: Part II of PyMC Step Methods and their pitfalls

I had a good time with the first round of my Step Method Pitfalls: besides making some great movies, I got a tip on how to combine Hit-and-Run with Adaptive Metropolis (together they are “the H-RAM“, fitting since the approach was suggested by Aram). And even more important than getting the tip, I did enough of a proof-of-concept to inspire Anand to rewrite it in the correct PyMC style. H-RAM lives.

Enter the GHME, where I had a lot of good discussions about methods for metrics, and where Mariel Finucane told me that her stress test for new step methods is always the banana (from Haario, Saksman, Tamminen, Adaptive proposal distribution for random walk Metropolis algorithm, Computational Statistics 1999):

The non-linear banana-shaped distributions are constructed from the Gaussian ones by ‘twisting’ them as follows. Let $f$ be the density of the multivariate normal distribution $N(0, C_1)$ with the covariance again given by $C_1 = \mathrm{diag}(100, 1, \ldots, 1)$. The density function of the ‘twisted’ Gaussian with the nonlinearity parameter $b > 0$ is given by $f_b = f \circ \phi_b$, where $\phi_b(x) = (x_1, x_2 + b x_1^2 - 100b, x_3, \ldots, x_n)$.

It’s a good distribution, and it makes for a good movie.
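
For concreteness, here is that definition translated directly into a few lines of NumPy (just a sketch of the density itself, not the code behind the movie):

    import numpy as np

    def phi_b(x, b):
        """Haario et al.'s twisting map: bends the second coordinate."""
        y = np.array(x, dtype=float)
        y[1] += b * y[0]**2 - 100. * b
        return y

    def banana_logpdf(x, b, dim=2):
        """log f_b(x) = log f(phi_b(x)), with f = N(0, diag(100, 1, ..., 1))."""
        C_1 = np.ones(dim)
        C_1[0] = 100.
        y = phi_b(x, b)
        return -.5 * np.sum(y**2 / C_1 + np.log(2 * np.pi * C_1))

    # phi_b is a shear, so no Jacobian correction is needed; the density peaks
    # where phi_b(x) = 0, e.g. at (0, 100*b) in two dimensions
    print(banana_logpdf([0., 100 * .03], b=.03))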

More detailed explorations to follow. What do you want to see?

1 Comment

Filed under statistics, TCS

Random Forest Verbal Autopsy Debut

I just got back from a very fun conference, which was the culmination of some very hard work, all on the Verbal Autopsy (which I’ve mentioned often here in the past).

In the end, we managed to produce machine learning methods that rival the ability of physicians. Forget Jeopardy, this is a meaningful victory for computers. Now Verbal Autopsy can scale up without pulling human doctors away from their work.

Oh, and the conference was in Bali, Indonesia. Yay global health!

I do have a Machine Learning question that has come out of this work, maybe one of you can help me. The thing that makes VA most different from the machine learning applications I have seen in the past is the large set of values the labels can take. For neonatal deaths, for which the set is smallest, we were hoping to make predictions out of 11 different causes, and we ended up thinking that maybe 5 causes is the most we could do. For adult deaths, we had 55 causes on our initial list. There are two standard approaches that I know for converting binary classifiers to multiclass classifiers, and I tried both. Random Forest can produce multiclass predictions directly, and I tried this, too. But the biggest single improvement to all of the methods I tried came from a post-processing step that I have not seen in the literature, and I hope someone can tell me what it is called, or at least what it reminds them of.
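
(The two standard conversions are usually called one-vs-rest and one-vs-one.) Here is what the three approaches look like sketched in scikit-learn, which is not necessarily what we actually ran, and the symptom/cause data below are random placeholders:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

    # made-up stand-in for VA data: binary symptom indicators, one of 55 causes
    rng = np.random.default_rng(0)
    symptoms = rng.integers(0, 2, size=(1000, 40))
    cause = rng.integers(0, 55, size=1000)

    one_vs_rest = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(symptoms, cause)
    one_vs_one = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(symptoms, cause)
    forest = RandomForestClassifier(n_estimators=500).fit(symptoms, cause)  # multiclass directly

    # one score per (death, cause) pair, which is what the rank trick described below needs
    scores = forest.predict_proba(symptoms)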

For any method that produces a score for each cause, what we ended up doing is generating a big table with scores for a collection of deaths (one row for each death) for all the causes on our cause list (one column for each cause). Then we calculated the rank of the scores down each column, i.e. whether it was the largest score seen for that cause in the dataset, the second largest, etc. To predict the cause of a particular death, we then looked across the row corresponding to that death and found the column with the best rank. This can be interpreted as a non-parametric transformation from scores into probabilities, but saying it that way doesn’t make it any clearer why it is a good idea. It is a good idea, though! I have verified that empirically.
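
In code, the trick looks something like this (a sketch, with a random placeholder score matrix standing in for the real classifier output):

    import numpy as np

    # scores: one row per death, one column per cause, higher = more likely
    rng = np.random.default_rng(0)
    scores = rng.random(size=(1000, 55))

    # rank each score within its column: 0 means the highest score seen for
    # that cause anywhere in the dataset, 1 the second highest, and so on
    ranks = (-scores).argsort(axis=0).argsort(axis=0)

    # predict, for each death, the cause whose column holds its best (smallest) rank
    predicted_cause = ranks.argmin(axis=1)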

So what have we been doing here?

7 Comments

Filed under TCS