False information

Ben Birnbaum stood for his general exam last week, on a topic that I’m very interested in:

ABSTRACT:

Surveys are one of the principal means of gathering critical data from low-income regions. However, interviewer fabrication, or curbstoning, can threaten data quality. The existing literature lacks a set of general-purpose techniques to detect curbstoning; it does not capitalize on the potential of mobile data collection tools to help detect the phenomenon; and it provides few rigorous validations of the techniques that are developed. In this talk, I propose an anomaly detection framework to develop several general-purpose algorithms that identify curbstoning.

These algorithms can take advantage of the information in user traces from mobile data collection, a potential that I will evaluate rigorously. I also propose two studies to obtain high-quality labeled data sets with which I will validate my algorithms, thus partially filling the need for more rigorous evaluations.

Good job, Ben!  Also in attendance was Aram Harrow, who was reminded of this great story of the lying professor.  I wonder, could I pull that off?

Filed under education

Unscientific America

I read a short book about science and society last weekend, Unscientific America by Chris Mooney and Sheril Kirshenbaum. It’s a quick read, and the context is very much the 2008 elections, so you should browse it sooner rather than later. There are some good ideas, but the focus on the web campaigns of 2008 is going to make them sound even more dated in a year.

The book argues strongly for the meaningful popularization of scientific ideas. I love the popularizers of science, and was very influenced by books like Surely You’re Joking, Mr. Feynman and Gödel, Escher, Bach when I was a youth. The modern history sections in Unscientific America trace these popularizations to Carl Sagan’s book/television series Cosmos. I should check that out.

Filed under science policy

Validating Statistical Models

I’ve been thinking a lot about validating statistical models. My disease models are complicated, and there are many places to make a little mistake. People care about the numbers, so they will care if I make mistakes. My concern is grounded in experience: when I was re-implementing my disease modeling system, I realized that I had mis-parameterized a bit of the model, giving undue influence to observations with small sample size. Good thing I caught it before I published anything based on the results!

How do I avoid this trouble going forward? A well-timed blog post from Statistical Modeling, Causal Inference, and Social Science highlights one way, described in a paper linked there. I like this approach, and I partially replicated it in PyMC. But I’m concerned about something, which the authors mention in their conclusion:

To help ensure that errors, when present, are apparent from the simulation results, we caution against using “nice” numbers for fixed inputs or “balanced” dimensions in these simulations. For example, consider a generic hyperprior scale parameter s. If software were incorrectly written to use s^2 instead of s, the software could still appear to work correctly if tested with the fixed value of s set to 1 (or very close to 1), but would not work correctly for other values of s.
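To make the quoted warning concrete, here is a toy sketch. It is not the replication mentioned in this post and it does not use PyMC. It assumes a conjugate normal model, theta ~ N(0, s^2) and y ~ N(theta, sigma^2), so the posterior is available in closed form, and it assumes a simulation check of the general kind described above: draw theta from the prior, simulate data, and record where the true theta falls in the computed posterior. If the software is right, those quantiles should look uniform. The function names, the planted bug, and the value s = 0.37 are all made up for illustration.

```python
import math
import numpy as np

def posterior_correct(y, s, sigma):
    """Posterior mean and sd of theta for theta ~ N(0, s^2), y_i ~ N(theta, sigma^2)."""
    prec = 1.0 / s**2 + len(y) / sigma**2
    return (y.sum() / sigma**2) / prec, math.sqrt(1.0 / prec)

def posterior_buggy(y, s, sigma):
    """Same calculation with a planted bug: s^2 is used where s belongs."""
    prec = 1.0 / (s**2) ** 2 + len(y) / sigma**2
    return (y.sum() / sigma**2) / prec, math.sqrt(1.0 / prec)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ks_distance(posterior, s, sigma=1.0, n_obs=1, n_rep=2000, seed=12345):
    """Largest gap between the empirical CDF of posterior quantiles and Uniform(0, 1)."""
    rng = np.random.default_rng(seed)
    q = np.empty(n_rep)
    for i in range(n_rep):
        theta = rng.normal(0.0, s)                # "true" parameter drawn from the prior
        y = rng.normal(theta, sigma, size=n_obs)  # data simulated from the model
        mean, sd = posterior(y, s, sigma)         # what the software under test reports
        q[i] = normal_cdf((theta - mean) / sd)    # posterior quantile of the true value
    q.sort()
    grid = np.arange(1, n_rep + 1) / n_rep
    return max(np.max(grid - q), np.max(q - (grid - 1.0 / n_rep)))

if __name__ == "__main__":
    for s in (1.0, 0.37):
        print(s, ks_distance(posterior_correct, s), ks_distance(posterior_buggy, s))
```

With s = 1 the buggy and correct functions compute exactly the same thing, so the check cannot tell them apart. With a less convenient value such as s = 0.37, the buggy posterior is far too narrow and the quantiles pile up near 0 and 1.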

How do I avoid nice numbers in practice? I have an idea, but I’m not sure I like it. Does anyone else have ideas?

Also, my replication only works part of the time for my simple example, I guess because one of my errors is not enough of an error:


Filed under MCMC, software engineering

NSF Program Solicitation for Smart Health

This NSF Program Solicitation crossed my desk recently. It is for “Smart Health and Wellbeing”, which includes a lot of healthy algorithms topics in the “new tools and methods” it lists. For example:

From Data to Knowledge to Decisions: Investigate methods and algorithms for aggregation of multi-scale clinical, biomedical, contextual, and environmental data about each patient (EHR, personal health records – PHR, etc.), and unified and extensible metadata standards, and decision support tools to facilitate optimized patient-centered evidence-based decisions. Integrate patient information with delivery systems performance and economic models to support operations management decisions. Develop robust knowledge representations and reasoning algorithms to support inferences based on individual or population health data, multiple sources of potentially conflicting information while complying with applicable policies and preferences. Develop innovative technology for the secondary use of health data to support assisted and automated discovery of reliable knowledge from aggregated population health records and predictive modeling and simulation of health and disease at multiple levels from cellular to individuals/patients to populations, along with robust validation and integration of empirical data into the models. Develop understanding of how families, communities, informal caregivers, professional medical teams and patients interpret care and treatment. Increase understanding of issues (technological, behavioral, socio-economic, value-driven actions, ethical, systemic) that interfere with patients’ collaboration in care team and adherence to treatment and wellness regimens.

Filed under Uncategorized

Cool optimization and disease modeling from INFORMS

I didn’t go to INFORMS, because I get lost at mega-conferences. Give me 200 people attending one track of talks, preferably with lots of coffee breaks. But I did talk to people who went. One exciting thing that I heard about is Pyomo, the Python Optimization Modeling Objects package. There was a talk that applied it to optimize the disease parameters of an infectious disease model, which is sort of like the business I’ve been getting into lately. Fortunately, the slides from the talk are online, here.

Now I must see if I can run the examples.

Filed under disease modeling

I-TECH’s Everydayleadership.org

There are many sides to global health, and the quantitative, metrics-y part that I write about here is but one. My colleagues from the UW Global Health Department at I-TECH work on another, which intrigues me, and might be called “leadership development”. They have just released a large collection of short videos about leadership, on a slick new website.

Here is a video that I liked, 2 minutes of the country director in South Africa for the US Centers for Disease Control talking about the complexity of transparency:


More videos by Thurma Goldman on Everydayleadership.org

There are a lot of videos there, so if you see one you recommend, post a link in the comment section for me.

Filed under global health

Put it in a figure+

An old adage when writing research papers is “put it in a figure”. If there is one thing that I want the reader to know when they put my paper down, then I try to put it in a beautiful figure, with a complete explanation in the caption. I saw the extension of this rule to talks recently, and I’m going to try it out myself: if there is one thing you want your audience to remember when they leave your talk, put it in a movie.

Here is the movie that taught me this lesson:

And here is a blog post by one of the video creators, to tell you more about what you’re seeing.

Filed under disease modeling, videos

Causal Modeling in Python: Bayesian Networks in PyMC

While I was off being really busy, an interesting project to learn PyMC was discussed on their mailing list, beginning thusly:

I am trying to learn PyMC and I decided to start from the very simple discrete Sprinkler model.
I have found plenty of examples for continuous models, but I am not sure how should I proceed with conditional tables, especially when the condition is over more than a variable. How would you model a CPT with PyMC, and in particular the Sprinkler model?

I spend all my time on continuous models these days, and I miss my old topics from back in grad school. Not that this is one of them, exactly. But when I found myself with a long wait while running some validation code, I thought I’d give it a try. The model turned out to be simple, although using MCMC to fit it is probably not the best idea.
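For a network this small, the conditional probability tables can be written as plain dictionaries and queried by exact enumeration, with no MCMC at all. The sketch below is not the PyMC model from the mailing-list thread. The probabilities are the commonly used textbook numbers for the Sprinkler example, assumed here for illustration, and every name in the code is hypothetical.

```python
from itertools import product

# Assumed CPT values (common textbook numbers, not taken from the thread).
P_RAIN = 0.2
P_SPRINKLER_GIVEN_RAIN = {True: 0.01, False: 0.4}
P_WET_GIVEN = {  # keyed by (sprinkler, rain): a condition over two variables
    (False, False): 0.0,
    (False, True): 0.8,
    (True, False): 0.9,
    (True, True): 0.99,
}

def bernoulli(p, value):
    """Probability that a binary variable with P(True) = p takes the given value."""
    return p if value else 1.0 - p

def joint(rain, sprinkler, wet):
    """Joint probability, factored as P(rain) P(sprinkler | rain) P(wet | sprinkler, rain)."""
    return (bernoulli(P_RAIN, rain)
            * bernoulli(P_SPRINKLER_GIVEN_RAIN[rain], sprinkler)
            * bernoulli(P_WET_GIVEN[(sprinkler, rain)], wet))

def prob_rain_given_wet():
    """P(rain | grass wet), summing the joint over the unobserved sprinkler."""
    numerator = sum(joint(True, s, True) for s in (True, False))
    denominator = sum(joint(r, s, True) for r, s in product((True, False), repeat=2))
    return numerator / denominator

if __name__ == "__main__":
    print(round(prob_rain_given_wet(), 4))  # 0.3577
```

The two-parent table is just a dictionary keyed by a tuple of parent values, which is one simple way to handle a condition over more than one variable.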

Filed under MCMC

OWS in Theory

Luca Trevisan sparks a CS Theory discussion about the police repression of students supporting Occupy Wall St on his blog “in theory”.

Filed under Mysteries

How I spent my fall vacation

As mentioned, HA took a brief vacation while I worked hard on my disease modeling system for the Global Burden of Disease 2010 study. First, I thought I was writing a book about the methods, but as I wrote I realized there were more and more things that I would like to do differently in implementing the methods. So then I switched to re-writing all of the implementation, which seemed like an ambitious one-week project. Two months later, I’m very happy with the results, and they’re online just in time for my users to crunch many numbers.

Of course, there are still plenty of issues that are coming up, and I still have to get back to writing the book about this approach. But I miss the variety of the blogging, and I’m starting it back up, even if I have plenty of other writing responsibilities.

Also, I love the wisdom of the internet. Dear readers, here is the description of this integrative disease modeling book I’ve been working on. Does it sound like something you’ve seen before? I’ve found that the mathematics I’m using have been rediscovered many times in many fields, all of which I would like to know more about.

Integrative systems modeling of disease in populations is the first book-length treatment of model-based meta-analytic methods for descriptive epidemiology. It develops, from first principles, the system dynamics model which constitutes the theoretical foundation of Years Lived with Disability (YLD) estimation in burden of disease studies. This compartmental model of the progression of disease through a population has been used for over ten years in global health epidemiology in the popular generic disease modeling system DisMod II, distributed by the World Health Organization. However, until now, the description of the model and the methods behind the software have been scattered through the scientific literature in a loose collection of journal articles and operations manuals.

In addition to collecting the prior work on compartmental modeling of disease together in one place, this book significantly extends the model, by formally connecting the system dynamics model of disease progression to a statistical model of epidemiological rates, the kind that are calculated in descriptive epidemiological research and collected through systematic review. This combination of system dynamics modeling and statistical modeling, which the author calls integrative systems modeling, allows the model to integrate all available relevant data. Because advanced numerical algorithms are needed to fit these complex models, a section of the book provides the necessary background on Markov chain Monte Carlo (MCMC) computation.

Experience with the results of systematic review indicates that when all available relevant data is collected, it is often very sparse and very noisy. The integrative systems models developed in this book focus particularly on techniques for handling sparse, noisy data. The book explores statistical models for over-dispersed count data, covariate modeling to both explain systematic variation in epidemiological rate data and increase predictive accuracy for estimates for subpopulations where no data is available, and age-pattern modeling to systematically incorporate expert knowledge about how quickly epidemiological rates can vary as a function of age. It also develops a novel theory of age group modeling to address heterogeneity in age groups commonly found during systematic review.

The theoretical foundations of integrative systems modeling of disease in populations are complemented with a series of applications of the model to meta-analysis of more than a dozen different diseases. These practical applications provide a unique opportunity to see how the model performs in a variety of scenarios, and also demonstrate how the model performs when the model assumptions are violated, and how to work around model assumption violations.

The book concludes with a detailed description of the future directions for research in model-based meta-analysis of descriptive epidemiological data and integrative systems modeling for global health.

Filed under disease modeling