# Monthly Archives: June 2011

## Wikipedia Editing for Scientists

David Eppstein has written up a guide for scientists who want to get started contributing to Wikipedia.

Here is why you might want to write for Wikipedia, from Eppstein’s writeup:

## Why?

You already have other avenues for publishing your writing professionally, and plenty of demands on your time. Why should you take the extra time to write for Wikipedia as well?

• Public service. Part of being a scientist is communicating to the public, and Wikipedia is a great way of writing about research in a way that can be found and read by the public.
• Give and take. As a research scientist you are benefiting from a vast collection of survey articles written by the Wikipedia community. Why not reciprocate and help improve the existing articles by sharing your knowledge?
• Righting wrongs. You’ve probably already found some important topics that you know about from your research that are missing from Wikipedia, or worse, described incorrectly. Who better than someone who knows about these topics professionally to repair the damage?
• Practice. To write well on Wikipedia, you have to pay more attention to matters of readability than you might when writing for your peers. Practicing your writing ability in this way is likely to cause your professional writing to improve.
• Broaden your knowledge. When you write about a topic, you learn about it yourself; you may well find the topics you write about useful later in your own research. Also, when you carefully survey a topic, you are likely to find out about what is not known as well as what is known, and this could help you find future research projects.
• It looks good on your vita. Actually, I don’t think any tenure committee is going to care about your Wikipedia contributions. And in most cases the fact that you’ve contributed to an article is invisible to most readers, so it’s also not going to do much for making you more famous. But recently the NSF has started to take “broader impacts” more seriously on grant applications, and if you can make a convincing case that your Wikipedia editing activity is significant enough to count as a broader impact then that will probably improve your chances of getting funding. And getting more funding really does look good on your vita.

I agree.

Filed under education

## Life Expectancy by County in US

This recent study by my colleagues made a lot of headlines last week, but I’m just getting to write about it now. While I was busy, stories about it appeared in high-profile outlets like NPR and the Statistical Modeling, Causal Inference, and Social Science blog.

As I’ve been thinking for two years (according to the ancient post I pushed out the door yesterday), life expectancy is a weird statistic. Life expectancy at birth is not, as the name might imply, a prediction of the average length of life of a baby born this year. It is something more complicated to describe, but easier to predict. I like to think of it as the length of life if you froze the world exactly the way it is right now, so that a baby born today was exposed to the mortality risk of today’s one-year-olds next year, today’s two-year-olds in two years, etc. Although, as a friend pointed out two weeks ago, this is not really a good way to look at things either, if you push the analogy too hard. Currently Wikipedia isn’t really helpful on this matter, but maybe it will be better in the future.
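The “freeze the world” picture can be made concrete with a tiny period life table calculation. This is only a minimal sketch under my own assumptions: single-year age groups, a hypothetical list `qx` of today’s probabilities of dying at each age, and people who die in a given year living half of it on average. The function name and inputs are made up for illustration, not taken from the study.

```python
def period_life_expectancy(qx):
    """Expected years lived by a newborn exposed to today's age-specific
    death probabilities, where qx[a] is the probability of dying between
    age a and age a + 1.

    Assumptions (illustrative only): deaths happen on average halfway
    through the year, and the last entry of qx should be 1.0 so that
    nobody survives past the final age.
    """
    surviving = 1.0    # fraction of the hypothetical cohort still alive
    years_lived = 0.0  # person-years accumulated per newborn
    for q in qx:
        deaths = surviving * q
        # Survivors live the whole year; those who die live half of it.
        years_lived += (surviving - deaths) + 0.5 * deaths
        surviving -= deaths
    return years_lived
```

For example, `period_life_expectancy([0.5, 1.0])` describes a world where half of newborns die in their first year and everyone else dies in their second. Nothing in the calculation forecasts how mortality will change, which is why it is easier to compute than a true prediction.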

There is another interesting thing in this paper, which is the validation approach the authors used. Unfortunately, its full development is in a paper still in press. Here is what they have to say about it so far:

We validated the performance of the model by creating small counties whose “true” underlying death rates were known. We did this by treating counties with large populations (> 750,000) as those where death rates have little sampling uncertainty. We then repeatedly sampled residents and deaths from these counties (by year and sex) to construct simulated small-county populations. We used the above model to predict mortality for these small, sampled-down counties, which were then compared with the mortality of the original large county.

I believe that this is fully developed in the paper which they cite at the beginning of the modeling section, Srebotnjak T, Mokdad AH, Murray CJL: A novel framework for validating and applying standardized small area measurement strategies, submitted. From what I’ve heard about it, I like it.
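Here is a rough sketch of how I read the sampling-down idea, not the authors’ actual procedure. It ignores the year and sex stratification, and `raw_rate` is a placeholder estimator standing in for their model; all function names and numbers are hypothetical.

```python
import random


def sample_down(population, deaths, small_size, rng):
    """Draw a simulated small county of `small_size` residents, without
    replacement, from a large county with `population` residents of whom
    `deaths` died. Returns the number of deaths in the sample."""
    remaining_pop, remaining_deaths, sampled_deaths = population, deaths, 0
    for _ in range(small_size):
        # Chance that the next sampled resident is one of the remaining deaths.
        if rng.random() < remaining_deaths / remaining_pop:
            sampled_deaths += 1
            remaining_deaths -= 1
        remaining_pop -= 1
    return sampled_deaths


def validate(population, deaths, small_size, estimator, replicates=200, seed=0):
    """Root-mean-squared error of `estimator` against the large county's
    "true" death rate, over repeated sampled-down counties."""
    rng = random.Random(seed)
    true_rate = deaths / population
    squared_errors = []
    for _ in range(replicates):
        d = sample_down(population, deaths, small_size, rng)
        squared_errors.append((estimator(d, small_size) - true_rate) ** 2)
    return (sum(squared_errors) / replicates) ** 0.5


def raw_rate(sampled_deaths, sampled_pop):
    # Placeholder estimator (NOT the paper's model): the crude death rate.
    return sampled_deaths / sampled_pop
```

A call like `validate(1_000_000, 8_000, 5_000, raw_rate)` gives a baseline error for the crude rate. A small-area model is worth the trouble only if it beats that baseline on the same simulated counties.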

Filed under global health

## What is life expectancy?

Hint:  It’s not what you think.

(This is a post that I never finished/barely started almost two years ago.)

Filed under Uncategorized

## Age-heaping and Hedgehogs

I heard an interesting talk a few weeks ago about “age-heaping” in survey responses, the phenomenon where people remember ages imprecisely and say that their siblings are ages that are divisible by 5 much more often than expected.  There are some nice theory challenges here, with a big dose of stats modeling, but I’ll have to share some more thoughts on that later.

In the talk, the age-heaping histogram was also referred to as a hedgehog or porcupine plot, because of the spiky shape that the data produces. I was looking for a nice picture of one, or some additional background reading, and when I searched for “hedgehog statistical plots”, all Google would give me was a bunch of pages about stats on actual hedgehogs. Cute!
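Since I couldn’t find a picture, here is a small simulation that produces a hedgehog of its own. It is a toy sketch under made-up assumptions: true ages are uniform on 0 to `max_age`, and each respondent rounds to the nearest multiple of 5 with probability `heap_prob`. Real age distributions and real misreporting are surely messier.

```python
import random
from collections import Counter


def heaped_ages(n, heap_prob, rng, max_age=79):
    """Simulate `n` reported ages. True ages are uniform on 0..max_age; with
    probability `heap_prob` the respondent reports the nearest multiple of 5."""
    reported = []
    for _ in range(n):
        age = rng.randint(0, max_age)
        if rng.random() < heap_prob:
            age = 5 * round(age / 5)
        reported.append(age)
    return reported


def text_histogram(ages, width=50):
    """Return a crude text histogram, one line per age; heaping shows up
    as spikes at the multiples of 5."""
    counts = Counter(ages)
    tallest = max(counts.values())
    lines = []
    for age in range(min(counts), max(counts) + 1):
        bar = "#" * round(width * counts[age] / tallest)
        lines.append(f"{age:3d} {bar}")
    return "\n".join(lines)
```

Printing `text_histogram(heaped_ages(10_000, 0.4, random.Random(1)))` shows the spikes; setting `heap_prob` to 0 flattens them out.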

Filed under TCS

## New Stats Books

Here is a new book on Bayesian stats that Kyle forwarded on to me: Principles of Uncertainty.  Chapter 11 looks unique, on “multiparty problems”, and a pdf of the whole thing is available from the book website for download.

Filed under statistics

## A Slide I Like

I’m updating a talk about machine learning for verbal autopsy analysis, and I thought I’d share a slide I like. I wonder what statisticians think about this view of the world:

Filed under statistics

## Schedule of a workshop I could use

Effective use of programming in scientific research, soon, far away, and already full.

Maybe we should do our own in Seattle.

Filed under software engineering

## A simple optimization problem I don’t know how to solve (from DCP)

Inspired by the recent 8F workshop, I’m trying to write up theory challenges arising from global health. And I’m trying to do it with less background research, because avoiding foolishness is a recipe for silence.

This is what I called the “simplest open problem in DCP optimization” in a recent post about DCP (Disease Control Priorities), but with more reflection, I should temper that claim. I’m not sure it is the simplest. I’m not sure it is an open problem. And I’m pretty sure that if we solve it, the DCP optimizers will come back with something more complicated.

But it is a nice, clean problem to start with. I’m calling it “Fully Stochastic Knapsack”. It looks just like the plain, old knapsack problem:
$\max \bigg\{ \sum_{i=1}^n v_ix_i \qquad \mathrm{s.t.} \quad \sum_{i=1}^n w_ix_i \leq W, \quad x_i \in \{0,1\} \bigg\}$
The fully stochastic part is that everything that usually would be input data is now a probability distribution, and the parameters of the distribution are the input data.

This makes even deciding what to maximize a challenge. I was visiting the UW Industrial Engineering Dept yesterday, and Zelda Zabinsky pointed me to this nice INFORMS tutorial by Terry Rockafellar on “coherent approaches” to this.
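To see why the objective is ambiguous, here is a minimal Monte Carlo sketch that scores one fixed selection $x$ when $v_i$, $w_i$, and $W$ are all random. The independent normal distributions, the (mean, sd) parametrization, and the function name are my assumptions for illustration; the problem statement does not fix any of them.

```python
import random


def evaluate_selection(x, values, weights, capacity, draws=10_000, seed=0):
    """Monte Carlo summary of a fixed 0/1 selection `x` when every input
    is random.

    `values` and `weights` are lists of (mean, sd) pairs and `capacity` is
    a single (mean, sd) pair. All are ASSUMED to be independent normals,
    purely for illustration. Returns a pair: (estimated expected total
    value, estimated probability that the total weight fits within the
    capacity).
    """
    rng = random.Random(seed)
    total_value = 0.0
    feasible = 0
    for _ in range(draws):
        # Draw the value and weight of each selected item, and the capacity.
        v = sum(rng.gauss(m, s) for (m, s), xi in zip(values, x) if xi)
        w = sum(rng.gauss(m, s) for (m, s), xi in zip(weights, x) if xi)
        W = rng.gauss(*capacity)
        total_value += v
        feasible += (w <= W)
    return total_value / draws, feasible / draws
```

Even with this in hand, there are several reasonable things to optimize: maximize expected value subject to fitting with probability at least 0.95, penalize the expected overflow, or something else. Different choices can rank the same selections differently.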