Category Archives: TCS

February 25, 2011 · 8:38 pm

Random Forest Verbal Autopsy Debut

I just got back from a very fun conference, which was the culmination of some very hard work, all on the Verbal Autopsy (which I’ve mentioned often here in the past).

In the end, we managed to produce machine learning methods that rival the ability of physicians. Forget Jeopardy, this is a meaningful victory for computers. Now Verbal Autopsy can scale up without pulling human doctors away from their work.

Oh, and the conference was in Bali, Indonesia. Yay global health!

I do have a Machine Learning question that has come out of this work, maybe one of you can help me. The thing that makes VA most different from the machine learning applications I have seen in the past is the large set of values the labels can take. For neonatal deaths, for which the set is smallest, we were hoping to make predictions out of 11 different causes, and we ended up thinking that maybe 5 causes is the most we could do. For adult deaths, we had 55 causes on our initial list. There are two standard approaches that I know for converting binary classifiers to multiclass classifiers, and I tried both. Random Forest can produce multiclass predictions directly, and I tried this, too. But the biggest single improvement to all of the methods I tried came from a post-processing step that I have not seen in the literature, and I hope someone can tell me what it is called, or at least what it reminds them of.

For any method that produces a score for each cause, what we ended up doing is generating a big table with scores for a collection of deaths (one row for each death) for all the causes on our cause list (one column for each cause). Then we calculated the rank of the scores down each column, i.e. was it the largest score seen for this cause in the dataset, second largest, etc., and then to predict the cause of a particular death, we looked across the row corresponding to that death and found the column with the best rank. This can be interpreted as a non-parametric transformation from scores into probabilities, but saying it that way doesn’t make it any clearer why it is a good idea. It is a good idea, though! I have verified that empirically.

So what have we been doing here?

7 Comments

Filed under TCS

Tagged as conference, machine learning, my research, verbal autopsy

December 23, 2010 · 7:38 pm

MCMC in Python: Custom StepMethods and bounded-depth spanning tree distraction

I was looking for a distraction earlier this week, which led me to the world of stackexchange sites. The stack overflow has been on my radar for a while now, because web-search for coding questions often leads there and the answers are often good. And I knew that math overflow, tcs overflow, and even stats overflow existed, but I’d never really explored these things.

Well, diversion found! I got enamored with an MCMC question on the tcs site, about how to find random bounded-depth spanning trees. Bounded-depth spanning trees are something that I worked on with David Wilson and Riccardo Zechinna in my waning days of my MSR post-doc, and we came up with some nice results, but the theoretical ones are just theory, and the practical ones are based on message passing algorithms that still seem magical to me, even after hours of patient explanation from my collaborators.

So let’s do it in PyMC… this amounts to an exercise in writing custom step methods, something that’s been on my mind for a while anyways. And, as a bonus, I got to make an animation of the chain in action which I find incredibly soothing to watch on repeat:

Continue reading →

3 Comments

Filed under MCMC, TCS

Tagged as MCMC, pymc, python, spanning trees

September 24, 2010 · 5:07 am

Losing touch with theory?

I’ve been flipping through the titles of SODA acceptances listed on the blogs, and wondering if I’m losing touch with TCS research. It’s a good chance for me to think about what algorithms (discrete or otherwise) have been really big in the health metrics work I’ve been doing recently.

Markov Chain Monte Carlo (MCMC): This is the workhorse algorithm for me when fitting statistical models. There are a few MCMC-sounding titles in the SODA list; does anything have an exciting new step method to speed up my life?
Mixed Integer Programming (MIP): This classic formulation of operations research must make an appearance in some of the approximation algorithms or other analysis in SODA. Is there any work there that’s taking it on directly?
Stochastic Programming: There was a lot of excitement about two-stage stochastic programming a few years ago, but the fad seems to have died down in theory land. Too bad for me, because two-stage formulations are not really what I need, and my StoPro needs are growing.
Random Forests: I really didn’t get enough education on machine learning in grad school. What I do know is very much on the theoretical side of the spectrum. But this Random Forests business has been pretty promising so far, and I just made a bet 10-to-1 that it will out-perform an ad-hoc method for verbal autopsy. I believe the odds, but I wasn’t paying enough attention to the stakes…
Nonlinear optimization: I love MCMC, but I don’t love waiting around for my chains to mix. The alternative approach to statistical modeling, where you make due with a maximum likelihood estimate is starting to look pretty appealing. This is pretty outside of the SODA realm. I tried to convince Steven Rudich to include Newton’s Method in his course “Great Theoretical Ideas in Computer Science” some years ago, but I didn’t succeed.
Automatic Differentiation: If I’m getting into nonlinear optimization, I will at least be a user of automatic differentiation, since the nonlinear optimizer wants to know the gradient, and I’m sure not going to be computing it if I don’t have to be.

So I guess my research needs are not squarely within the SODA realm. But they are not disjoint from it either. I’m still touching theory, if not totally in touch. Maybe one day soon I’ll even have time to prove something.

Comments Off on Losing touch with theory?

Filed under TCS

Tagged as algorithms, SODA, TCS

September 11, 2010 · 6:54 pm

Experimental Analysis of Algorithms

It’s been a busy two weeks since I got back in town. The PBFs who went to “the field” for their summer abroad have returned with lots of fun and interesting stories. A new batch of PBFs and PGFs has arrived, bringing IHME to it’s planned capacity of around 100 heads. And I’ve been getting deeply into experimental analysis of a gaussian process regression technique, much like the one we used for estimating child mortality rates.

Maybe I’ll work on it publicly here on healthy algorithms. I’ll see if that seems too boring as I proceed.

For the moment, I’m just looking for reading suggestions. I was very inspired by David Johnson’s papaer A Theoretician’s Guide to the Experimental Analysis of Algorithms when I read it, but that was years ago. I’m going to have to read it again. What else do you recommend like this?

7 Comments

Filed under TCS

Tagged as paper recommendations, tcs reading

August 11, 2010 · 10:28 pm

Aaronson’s non-bet of non-confidence in P≠NP

As you have undoubtedly heard by now, there is a paper that claims to prove P!=NP, and there is a serious effort to understand the proof.

It has been fun to watch the experts set to work on this, and it has brought a lot of attention to random k-SAT, a problem that was near and dear to me when I was a grad student. And I get to learn interesting things from them without having to struggle through Deolalikar’s opus myself.

One interesting thing is the way Scott Aaronson reacted, saying:

If Vinay Deolalikar is awarded the $1,000,000 Clay Millennium Prize for his proof of P≠NP, then I, Scott Aaronson, will personally supplement his prize by the amount of $200,000.

When I first read about Aaronson’s offer to add $200K to the prize money, reported 2nd hand in a roundup of what the #pnp blogs were saying, it came off like the young professor is really hoping to have people work on this thing. But once my trusty rss feeder fed me his post, I realized his offer is not about how profs at private universities have disposable income that public schools don’t provide. It’s his way of quantifing his confidence in the accuracy of the proof.

If Aaronson had framed this in terms of a bet, it would be a textbook example of his level of certainty that the proof will have a flaw (a textbook in decision theory, anyway). But offering the sum without any possibility of receiving a return in the alternative scenario breaks expected utility theory. How certain is Scott? It all depends on what amount of money means nothing to him.

1 Comment

Filed under probability, TCS

Tagged as pnp

August 10, 2010 · 12:30 am

P, NP, and more

There was a lot of buzz this weekend about a paper claiming to resolve the “P vs NP” conjecture. I’ve seen plenty of papers claiming to do this over the years, so I didn’t rush to track it down. But as the tweeting continued, I decided to have a quick look for myself, if only to produce a crotchety blog post on the matter.

Part of the drama around this paper comes from the way it appeared. Author Vinay Deolalikar‘s web page explains:

Manuscript sent on 6th August to several leading researchers in various areas. Confirmations began arriving 8th August early morning. The preliminary version made it to the web without my knowledge. I have made minor updates, here. Please note that the final version of the paper is under preparation, and is to be posted here very shortly.

I’m a global health researcher now, so I’m not going to be the one who tries to verify this 104 page paper, but I would love to learn from a careful review that it is true. The result seems to go further than separating P and NP, since it is a statement about a natural distribution of instances being hard (from p. 78):

If LFP were able to compute solutions to the d1RSB phase of random k-SAT, then the distribution of the entire space of solutions would have a substantially simpler parametrization than we know it does.

Ryan Williams thinks this is suspicious, and expressed his concern in a series of tweets:

This P vs NP claim is getting tons of press. I am really doubtful that it works. Hard to convey my skepticism in a tweet, but here goes…

The author tries to use the fact that for certain distributions of random k-SAT, the solution space has a “hard structure”. Two problems:

(1) Polytime solvable problems (such as perfect matching on random graphs) can also have complicated solution distributions.

(2) There is a randomized reduction from SAT to formulas with at most ONE satisfying assignment (Valiant-Vazirani). A simple solution space

So either Valiant-Vazirani can’t be derandomized or RP=NP (seems very unlikely!) or the proof must break. That’s my intuition.

Time will tell. I find proof of the existence of hard-on-average distributions much more exciting than plain old P vs NP, and Deolalikar’s paper might have something for everybody.

1 Comment

Filed under TCS

Tagged as pnp, random k-sat

April 19, 2010 · 7:10 pm

Practical MCMC Advice: When to Stop

I read some good practical advice about when enough is enough in Markov Chain Monte Carlo sampling this morning. In their “Inference from simulations and monitoring convergence” chapter of Handbook of Markov Chain Monte Carlo, Andrew Gelman and Kenneth Shirley say many useful things in a quickly digested format. Continue reading →

Comments Off on Practical MCMC Advice: When to Stop

Filed under MCMC, statistics, TCS

Tagged as MCMC, statistics

October 16, 2009 · 7:18 pm

Dense-Subset Break-the-Bank Challenge

I’m preparing for my first global travel for global health, but the net is paying attention to a paper that I think I’ll like, and I want to mention it briefly before I fly.

Computational Complexity and Information Asymmetry in Financial Products is 27 pages of serious TCS, but it is so obviously applicable that people outside of our particular ivory tower, and even outside of academia entirely are blogging and twittering about it, and even reading it!

Freedom to Tinker has a nice summary of this paper, if you want to know what it’s about in a hurry.

Mike Trick makes the salient observation that NP-hard doesn’t mean computers can’t do it. But the assumption that this paper is based on is not about worst-case complexity; it is, as it should be, based on an assumption about the average-case complexity of a particular optimization problem over a particular distribution.

As it turns out, this is an average-case combinatorial optimization problem that I know and love, the densest subgraph problem. My plan is to repeat the problem here, and share some Python code for generating instances of it. Then, you, me, and everyone, can have a handy instance to try optimizing. I think that this problem is pretty hard, on average, but there is a lot more chance of making progress on an algorithm for it than for cracking the P versus NP nut. Continue reading →

4 Comments

Filed under combinatorial optimization, cryptography, TCS

Tagged as aco, challenge, dense subset problem, finance, networkx, python, TCS

October 5, 2009 · 6:30 pm

Conference you should know about

This weekend marks the submission of my first “Global Health” paper. Congratulations to me! And many, many thanks to all the people who have worked with me to make it happen. I’ll go into details sometime in the future, first let me see how things go in the refereeing process.

While I was over-working on that business, I got an interesting Call-for-Papers forwarded from global health/AI researcher Emma Brunskill. The AAAI Spring Symposium on Artificial Intelligence for Development (AI-D) is an effort to build a community of people applying computer science and artificial intelligence in less-developed settings.

TCS people, don’t let the “AI” in their title turn you off. Eric Horvitz says that this is for all of us. Continue reading →

Comments Off on Conference you should know about

Filed under global health, TCS

Tagged as ai4d, conference, ict4d, TCS

August 14, 2009 · 6:20 pm

August is Too-Many-Projects Month

(Tap… tap… tap… is this thing on? Good.)

July was vacation month, where I went on a glorious bike tour of the Oregon/California coast, and learned definitively that I don’t like biking on the side of a highway all day. Don’t worry, I escaped in Coos Bay and took trains and buses between Eugene, Santa Cruz, Berkeley, and SF for a vacation more my speed.

But now that I’m back, August is turning out to be project month. I have 3 great TCS applications to global health in the pipeline, and I have big plans to tell you about them soon. But one mixed blessing about these applications is that people actually want to see the results, like, yesterday! So first I have to deal with the results, and then I can write papers and blogs about the techniques.

Since Project Month is a little over-booked with projects, I’m going to have to triage one today. You’ve heard of the NetFlix Challenge, right? Well, github.com is running a smaller scale recommendation contest, and I was messing around with personal page rank, which seems like a fine approach for recommending code repositories to hackers. I haven’t got it working very well (best results, 15% of holdout set recovered), but I was having fun with it. Maybe someone else will take it up, let me know if you get it to work; networkx + data = good times.

    f = open('download/data.txt')
    for l in f:
        u_id, r_id = l.strip().split(':')
        G.add_edge(user(u_id), repo(r_id))

[get the code]

2 Comments

Filed under combinatorial optimization, software engineering, TCS

Tagged as github, networkx, personalized page rank, python

Category Archives: TCS

Random Forest Verbal Autopsy Debut

MCMC in Python: Custom StepMethods and bounded-depth spanning tree distraction

Losing touch with theory?

Experimental Analysis of Algorithms

Aaronson’s non-bet of non-confidence in P≠NP

P, NP, and more

Practical MCMC Advice: When to Stop

Dense-Subset Break-the-Bank Challenge

Conference you should know about

August is Too-Many-Projects Month

Posts

Theory Blogs

some rights reserved

Pages

Archives

Meta