Where does the term “covariate” come from?

I’ve been hard at work revising my DisMod-MR book, and one thing that has been fun is recognizing the jargon that is so embedded in my working language that I’ve never thought about it before. What is a “covariate”? It is any data column in my tabular array that is not the most important one. But how do I know that? This word is not actually English.

I asked where the term comes from on CrossValidated, and got a good answer, as well as a link to a whole website of earliest known uses of some of the words of mathematics.

The first use of this word in a JSTOR-archived article is in Proceedings of a Meeting of the Royal Statistical Society held on July 16th, 1946, and it captures all the basics of how I am using it today (although my setting is observational, not experimental):

A simple example occurs in crop-cutting experiments. In the Indian Statistical Institute the weight of green plants of jute or the weight of paddy immediately after harvesting are recorded on an extensive scale. In only a small fraction of cases (of the order of 10 %) the jute plant is steeped in water, retted and the dry fibre extracted and its weight determined directly, or the paddy is dried, husked and the weight of rice measured separately. These auxiliary measurements serve to supply the regression relation between the weight of green plants of jute or the weight of paddy immediately after harvesting and the yield of dry fibre of jute or of husked rice, respectively, which can then be used to estimate the corresponding final yields from the more extensive weights taken immediately after harvesting. Such a procedure simplifies the field work enormously without any appreciable loss of precision in the final results.

Such methods, in which the estimates made in later surveys are based on correlations determined in earlier surveys, may perhaps be called “covariate sampling”.
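The idea in the quote can be sketched in a few lines of numpy. This is only an illustration with simulated numbers: the plot count, the 10% subsample, the linear relationship, and every variable name here are assumptions made for the example, not values from the 1946 paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical survey: green weight is recorded for every plot (the covariate).
n_plots = 1000
green = rng.uniform(50, 150, size=n_plots)

# Simulated "true" dry-fibre yield; in practice this is unknown for most plots.
true_yield = 0.06 * green + rng.normal(0, 0.3, size=n_plots)

# Only about 10% of plots get the expensive direct measurement.
sub = rng.choice(n_plots, size=n_plots // 10, replace=False)

# Fit the regression relation on the subsample...
slope, intercept = np.polyfit(green[sub], true_yield[sub], deg=1)

# ...then use it to estimate yield from the cheap measurement on all plots.
predicted = intercept + slope * green
estimate = predicted.mean()

print("regression estimate of mean yield:", estimate)
print("mean of the simulated true yields:", true_yield.mean())
```

In this simulation the regression estimate lands close to the mean of all the simulated yields, even though only a tenth of them were "measured".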


Filed under statistics

Statistics in Python: Bootstrap resampling with numpy and, optionally, pandas

I’m almost ready to do all my writing in the IPython notebook. If only there were a drag-and-drop solution to move it into a WordPress blog. The next closest thing: an IPython Notebook on GitHub’s Gist, linked from here. This one is about bootstrap resampling with numpy and, optionally, pandas.
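For readers who do not follow the link, here is a minimal sketch of percentile bootstrap resampling with numpy. It is not the code from the linked notebook; the function name `bootstrap_ci`, its parameters, and the simulated sample are all assumptions made for illustration.

```python
import numpy as np


def bootstrap_ci(data, stat=np.mean, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(data)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    # Each row holds the indices of one resample, drawn with replacement.
    idx = rng.integers(0, n, size=(n_boot, n))
    stats = np.array([stat(data[row]) for row in idx])
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi


# Simulated data, purely for demonstration.
rng = np.random.default_rng(12345)
sample = rng.normal(loc=10.0, scale=2.0, size=200)

lo, hi = bootstrap_ci(sample)
print("sample mean:", sample.mean())
print("95% bootstrap interval:", (lo, hi))
```

With pandas, one resample of a whole table can be drawn with `df.sample(frac=1, replace=True)`, which keeps the rows intact so that every column is resampled together.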


Filed under statistics

IHME Seminar: Captricity

We had a live-streamed seminar at IHME this week! I’m very excited to hear that it worked. The talk was good, too.

We heard from Kuang Chen, the founder and CEO of Captricity (http://captricity.com/), about how this “beyond OCR” approach to data entry went from academic research to a Silicon Valley startup. To show off what they can do, Captricity has a number of cool datasets that they have transformed from unsearchable PDFs into well-groomed csv files. For example, the USA 1940 census: https://shreddr.captricity.com/opendata/1940-census/

I could see using this service in my future, maybe for a VA survey or something similar. But what really grabbed me was the datasets they are making available. Their data gallery is a place I’ll be watching: https://shreddr.captricity.com/opendata/


Filed under global health

Journal Club: Socioeconomic development as an intervention against malaria

The quarter is underway, and journal club is back. This week we will discuss Tusting et al.’s meta-analysis of socioeconomic development as an intervention against malaria.


I wonder if the forest plot is here to stay?


It presents a lot of information, but maybe it could emphasize the important parts more. There is great benefit to having a standard way to present systematic review data, however, so any change should either bring a huge benefit or be just a little tweak.


Filed under global health

IHME Seminar: Ambient Air Pollution and Cardiovascular Disease

The semester is starting up again, and that means that weekly IHME seminars are starting up again. This week, we heard from Joel Kaufman, a doctor and UW professor who knows quite a lot about how air pollution is bad for the heart. He had some great historical photos of air pollution from the Great Smog of London, which I had not heard of before. Searching later led me to this collection in the Guardian. Dr. Kaufman also had some recent photos of extreme air pollution, which looked sort of like this one.

I remember when I started this blog, I had a goal to draw connections between the issues in global health and the methods and results of theoretical computer science. What does the air-pollution/cardio-health talk inspire along these lines? Well, there are two “big data” sources going on here: continuously updated measurements of air quality from a number of geographically dispersed sensors, and regularly conducted CT scans of participants in a large longitudinal study. It was only an hour-long talk, so I’m not sure what problems arise when you put these things together, but I bet you can’t store it all in memory, even on a pretty large computer. And that’s one definition of big…


Filed under Uncategorized

Comparing spline models

I need a graphic for dismod that makes the point as nicely as this one:

smooth_splines

It’s from P. Dierckx, Curve and Surface Fitting with Splines, which also has some pretty pictures of bivariate splines doing their thing:

bisplines


Filed under machine learning

Will I attend a MOOC?

The author of one of the best books on data visualization is giving a massive open online course (MOOC) this fall. I’m going to check it out. You may be interested, too.

http://www.thefunctionalart.com/2013/09/the-third-introduction-to-infographics.html


Filed under dataviz, education

GeoJSON for Norway Counties

This may come in handy: http://gangerolf.blogspot.com/2012/09/norway-in-geojson.html

I know I have seen a nice one for USA somewhere as well.


Filed under dataviz

PyMC3 coming along

I have been watching the development of PyMC3 from a distance for some time now, and finally have had a chance to play around with it myself. It is coming along quite nicely! Here is a notebook Kyle posted to the mailing list recently which has a clean demonstration of using Normal and Laplace likelihoods in linear regression: http://nbviewer.ipython.org/c212194ecbd2ee050192/variable_selection.ipynb
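To give a feel for why the choice between Normal and Laplace likelihoods matters, here is a small numpy-only sketch. It is not PyMC3 code and not what the linked notebook does: it compares maximum-likelihood point estimates rather than posteriors, and the simulated data and the helper `fit_laplace` are assumptions made for this example. Under a Normal likelihood the maximum-likelihood fit is ordinary least squares; under a Laplace likelihood it is least absolute deviations, approximated here by iteratively reweighted least squares.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: y = 1 + 2x + noise, with the last 10 points pushed far off.
x = np.linspace(0, 1, 100)
y = 1.0 + 2.0 * x + rng.normal(0, 0.2, size=x.size)
y[-10:] += 20.0  # gross outliers

X = np.column_stack([np.ones_like(x), x])  # columns: intercept, slope

# Normal likelihood: the maximum-likelihood fit is ordinary least squares.
beta_normal = np.linalg.lstsq(X, y, rcond=None)[0]


def fit_laplace(X, y, n_iter=50, eps=1e-6):
    """Approximate least-absolute-deviations fit by reweighted least squares."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(n_iter):
        # Weight each point by 1/|residual| so squared error mimics absolute error.
        w = 1.0 / np.maximum(np.abs(y - X @ beta), eps)
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return beta


beta_laplace = fit_laplace(X, y)

print("Normal  (intercept, slope):", beta_normal)
print("Laplace (intercept, slope):", beta_laplace)
```

In this simulation the Laplace fit stays much closer to the true slope of 2, because its heavier tails make the outliers less surprising and so less influential.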


Filed under statistics

Global Data Viz in translation

IHME has recently worked with the World Bank to release a series of regional reports on relevant findings from the Global Burden of Disease 2010 Study. It is cool to see this work getting disseminated, and now even in non-English editions. This raises questions for data visualization translations, like should 1990 and 2010 be in reversed positions when accompanying right-to-left text?

gh_dv_q


Filed under dataviz