Category Archives: statistics

Journal Club: Prognosis of patients with HIV-1 infection

Oh, how the quarter gets away. What happened in our journal club since I last recorded a paper? Well, I will start catching up now. The first thing that happened, just over a month ago, was a paper on prognosis for lovers of survival curves: Prognosis of patients with HIV-1 infection starting antiretroviral therapy in sub-Saharan Africa: a collaborative analysis of scale-up programmes. The authors’ interpretation of their results:

INTERPRETATION:
Prognostic models should be used to counsel patients, plan health services, and predict outcomes for patients with HIV-1 infection in sub-Saharan Africa.

Comments Off on Journal Club: Prognosis of patients with HIV-1 infection

Filed under statistics

MCMC Convergence Diagnostics

I have recently revisited my approach to deciding when an MCMC run has gone on long enough, and I'm collecting some of the relevant material here (a small sketch of the R_hat computation follows the links):

Last time I thought about it: https://healthyalgorithms.com/2010/04/19/practical-mcmc-advice-when-to-stop/

Original paper for R_hat approach: http://www.stat.columbia.edu/~gelman/research/published/itsim.pdf

Presentation comparing several approaches: http://www.people.fas.harvard.edu/~plam/teaching/methods/convergence/convergence_print.pdf

Published comparison: http://www.jstor.org/stable/2291683

Blog about a cool visual approach: http://andrewgelman.com/2009/12/24/visualizations_1/

Discussion on cross-validated: http://stats.stackexchange.com/questions/507/what-is-the-best-method-for-checking-convergence-in-mcmc

Book with a chapter on this referenced there: http://www.amazon.com/dp/1441915753/?tag=stackoverfl08-20 (available as an eBook from UW Library, how convenient!)

Another related blog: http://xianblog.wordpress.com/2012/11/28/mcmc-convergence-assessment/
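
Since the R_hat (potential scale reduction factor) diagnostic from the Gelman and Rubin paper is the approach I keep coming back to, here is a minimal sketch of the computation from multiple chains. The function name and the simplified formula (which skips the degrees-of-freedom correction in the paper) are my own, for illustration only:

import numpy as np

def R_hat(chains):
    # chains: array of shape (m, n), m independent chains of n post-warm-up
    # samples of a single scalar parameter
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()      # mean within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)    # between-chain variance
    var_hat = (n - 1.) / n * W + B / n         # pooled estimate of posterior variance
    return np.sqrt(var_hat / W)

# four well-mixed chains should give a value very close to 1
print R_hat(np.random.normal(size=(4, 1000)))

Values much larger than 1 suggest the chains have not mixed yet; the links above cover the caveats.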

Comments Off on MCMC Convergence Diagnostics

Filed under statistics

IHME Seminar: Emily Fox on Bayesian Dynamic Modeling

This week we had a seminar on Bayesian Dynamic Modeling from UW Stats professor Emily Fox. The video is archived here! This talk had some of the most successful use of embedded video that I’ve seen in a presentation. I don’t think it made it into the recording perfectly, though, so imagine dancing honeybees while you watch.

Comments Off on IHME Seminar: Emily Fox on Bayesian Dynamic Modeling

Filed under machine learning, statistics

Journal Club: Assessment of statistical models

This week in journal club, we will be reading Green et al., Use of posterior predictive assessments to evaluate model fit in multilevel logistic regression. I like posterior predictive checks, and if my students like them too, here are some of the stats papers we might read in the future (a minimal sketch of the idea follows the links):

GelmanMengStern1996.pdf

dogs.pdf

http://onlinelibrary.wiley.com/doi/10.1002/1097-0258(20000915/30)19:17/18%3C2377::AID-SIM576%3E3.0.CO;2-1/pdf

http://onlinelibrary.wiley.com/doi/10.1002/sim.1403/pdf
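
For concreteness, here is a minimal sketch of a posterior predictive check in PyMC2; the data, the Poisson model, and the test statistic (the maximum count) are all made up for illustration:

import pymc as mc, numpy as np

y_obs = np.array([3, 1, 4, 1, 5, 9, 2, 6])   # made-up count data

mu = mc.Uniform('mu', 0, 100)
y = mc.Poisson('y', mu=mu, value=y_obs, observed=True)
y_rep = mc.Poisson('y_rep', mu=mu, size=len(y_obs))   # replicated data, given mu

m = mc.MCMC([mu, y, y_rep])
m.sample(10000, burn=5000)

# posterior predictive p-value for the test statistic T(y) = max(y)
T_obs = y_obs.max()
T_rep = m.trace('y_rep')[:].max(axis=1)
print 'Pr(T(y_rep) >= T(y_obs)) =', (T_rep >= T_obs).mean()

If that probability is very close to 0 or 1, the replicated data systematically fail to reproduce this feature of the observed data.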

Comments Off on Journal Club: Assessment of statistical models

Filed under statistics

Where does the term “covariate” come from?

I’ve been hard at work revising my DisMod-MR book, and one thing that has been fun is recognizing the jargon that is so embedded in my working language that I’ve never thought about it before. What is a “covariate”? It is any data column in my tabular array that is not the most important one. But how do I know that? This word is not actually English.

I asked where the term comes from on CrossValidated, and got a good answer, as well as a link to a whole website of earliest known uses of some of the words of mathematics.

The first use of this word in a JSTOR-archived article is in Proceedings of a Meeting of the Royal Statistical Society held on July 16th, 1946, and it captures all the basics of how I am using it today (although my setting is observational, not experimental):

A simple example occurs in crop-cutting experiments. In the Indian Statistical Institute the weight of green plants of jute or the weight of paddy immediately after harvesting are recorded on an extensive scale. In only a small fraction of cases (of the order of 10 %) the jute plant is steeped in water, retted and the dry fibre extracted and its weight determined directly, or the paddy is dried, husked and the weight of rice measured separately. These auxiliary measurements serve to supply the regression relation between the weight of green plants of jute or the weight of paddy immediately after harvesting and the yield of dry fibre of jute or of husked rice, respectively, which can then be used to estimate the corresponding final yields from the more extensive weights taken immediately after harvesting. Such a procedure simplifies the field work enormously without any appreciable loss of precision in the final results.

Such methods, in which the estimates made in later surveys are based on correlations determined in earlier surveys, may perhaps be called “covariate sampling”.

Comments Off on Where does the term “covariate” come from?

Filed under statistics

Statistics in Python: Bootstrap resampling with numpy and, optionally, pandas

I’m almost ready to do all my writing in the IPython notebook. If only there were a drag-and-drop solution to move it into a WordPress blog. The next closest thing: an IPython Notebook on GitHub’s Gist, linked from here. This one is about bootstrap resampling with numpy and, optionally, pandas.
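
As a teaser, here is the basic pattern, with made-up data; the helper function below is just a sketch of the idea, not the notebook itself:

import numpy as np, pandas as pd

df = pd.DataFrame({'x': np.random.normal(5., 2., size=100)})   # made-up data

def bootstrap_ci(s, stat=np.mean, n_reps=1000, alpha=.05):
    # percentile bootstrap confidence interval for stat(s)
    s = np.asarray(s)
    reps = [stat(s[np.random.randint(len(s), size=len(s))])    # resample with replacement
            for _ in range(n_reps)]
    return np.percentile(reps, 100*alpha/2), np.percentile(reps, 100*(1-alpha/2))

print bootstrap_ci(df.x)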

Comments Off on Statistics in Python: Bootstrap resampling with numpy and, optionally, pandas

Filed under statistics

PyMC3 coming along

I have been watching the development of PyMC3 from a distance for some time now, and have finally had a chance to play around with it myself. It is coming along quite nicely! Here is a notebook Kyle posted to the mailing list recently, which has a clean demonstration of using Normal and Laplace likelihoods in linear regression: http://nbviewer.ipython.org/c212194ecbd2ee050192/variable_selection.ipynb
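
I won't reproduce Kyle's notebook here, but to give a flavor of the new with-block model syntax, here is a minimal sketch of a linear regression with a Normal likelihood (swap in a Laplace for the robust version). The data are made up, and the argument names reflect my reading of the current development version, so treat them as provisional:

import numpy as np
import pymc3 as pm

# made-up data for a simple linear regression
n = 100
x = np.random.uniform(-1, 1, size=n)
y_obs = 1. + 2.*x + np.random.normal(0., .5, size=n)

with pm.Model() as model:
    alpha = pm.Normal('alpha', mu=0., sd=10.)
    beta = pm.Normal('beta', mu=0., sd=10.)
    sigma = pm.Uniform('sigma', 0, 10)
    mu = alpha + beta*x
    # for a Laplace likelihood, use pm.Laplace('y', mu=mu, b=sigma, observed=y_obs)
    y = pm.Normal('y', mu=mu, sd=sigma, observed=y_obs)

    step = pm.NUTS()
    trace = pm.sample(2000, step)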

Comments Off on PyMC3 coming along

Filed under statistics

Regression Modeling in Python: Patsy Spline

I’ve been watching the next generation of PyMC come together over the last few months, and there is some very exciting stuff happening. The part on GLM regression led me to a different project that is also of interest: a regression-modeling minilanguage called Patsy, which “brings the convenience of R ‘formulas’ to Python.”

This package recently introduced a method for spline regression, and avoided all puns in naming. Impressive.
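
Here is a minimal example of what the spline part looks like; the data are made up, and I'm fitting the basis with ordinary least squares just to keep the sketch short (bs() is the B-spline basis that patsy exposes in its formula language):

import numpy as np, pandas as pd
from patsy import dmatrix

# made-up data: a noisy nonlinear relationship
df = pd.DataFrame({'x': np.linspace(0, 10, 200)})
df['y'] = np.sin(df.x) + np.random.normal(0., .2, size=len(df))

# cubic B-spline basis for x, with 5 degrees of freedom
X = dmatrix('bs(x, df=5, degree=3)', df, return_type='dataframe')

# ordinary least squares on the spline basis
beta = np.linalg.lstsq(X, df.y)[0]
df['y_hat'] = np.dot(X, beta)

The nice part is that changing the formula string (more df, a different degree, interactions with other columns) changes the design matrix without touching the fitting code.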

Comments Off on Regression Modeling in Python: Patsy Spline

Filed under statistics

Classic EM in Python: Multinomial sampling

In the classic paper on the EM algorithm, the extensive example section begins with a multinomial modeling example that is theoretically very similar to the warm-up problem on 197 animals:

We can think of the complete data as an n \times p matrix x whose (i,j) element is unity if the i-th unit belongs in the j-th of p possible cells, and is zero otherwise. The i-th row of x contains p-1 zeros and one unity, but if the i-th unit has incomplete data, some of the indicators in the i-th row of x are observed to be zero, while the others are missing and we know only that one of them must be unity. The E-step then assigns to the missing indicators fractions that sum to unity within each unit, the assigned values being expectations given the current estimate of \phi. The M-step then becomes the usual estimation of \phi from the observed and assigned values of the indicators summed over the units.

In practice, it is convenient to collect together those units with the same pattern of missing indicators, since the filled in fractional counts will be the same for each; hence one may think of the procedure as filling in estimated counts for each of the missing cells within each group of units having the same pattern of missing data.

When I first made some data to try this out, it looked like this:

import pymc as mc, numpy as np, pandas as pd, random

n = 100000
p = 5

# rdirichlet returns only the first p-1 probabilities, so append the last one
pi_true = mc.rdirichlet(np.ones(p))
pi_true = np.hstack([pi_true, 1-pi_true.sum()])
x_true = mc.rmultinomial(1, pi_true, size=n)

x_obs = np.array(x_true, dtype=float)  # copy as floats, so nan can mark missing cells
for i in range(n):
    for j in random.sample(range(p), 3):
        x_obs[i,j] = np.nan

At first, I was pretty pleased with myself when I managed to make a PyMC model and an E-step and M-step that converged to something like the true value of \pi. The model is not super slick:

pi = mc.Uninformative('pi', value=np.ones(p)/p)

x_missing = np.isnan(x_obs)

# start x at a valid completion of the observed data: zero out the missing
# cells, and if a row then has no unity, put one in its first missing cell
x_initial = x_obs.copy()
x_initial[x_missing] = 0.
for i in range(n):
    if x_initial[i].sum() == 0:
        j = np.where(x_missing[i])[0][0]
        x_initial[i,j] = 1.

@mc.stochastic
def x(pi=pi, value=x_initial):
    # complete-data indicators, one unity per row
    return mc.multinomial_like(value, 1, pi)

@mc.observed
def y(x=x, value=x_obs):
    # constrain x to agree with the non-missing cells of the observed data
    if np.allclose(x[~x_missing], value[~x_missing]):
        return 0
    else:
        return -np.inf

And the E-step/M-step parts are pretty simple:

def E_step():
    # fill each missing indicator with its conditional expectation, given pi
    x_new = np.array(x_obs, dtype=float)
    for i in range(n):
        if x_new[i, ~x_missing[i]].sum() == 0:
            # the unity must be in one of the missing cells:
            # spread it over them in proportion to the current pi
            conditional_pi_sum = pi.value[x_missing[i]].sum()
            for j in np.where(x_missing[i])[0]:
                x_new[i,j] = pi.value[j] / conditional_pi_sum
        else:
            # the unity was observed, so the missing cells must all be zero
            x_new[i, x_missing[i]] = 0.
    x.value = x_new

def M_step():
    # ML estimate of pi: normalized column sums of the filled-in indicators
    counts = x.value.sum(axis=0)
    pi.value = counts / counts.sum()

But the way the values converge does look nice:
[Figure: EM estimates of pi converging over successive iterations]

The thing that made me feel silly was comparing this fancy-pants approach to the result of averaging all of the non-empty cells of x_obs:

ests = pd.DataFrame(dict(pr=pi_true, true=x_true.mean(0),
                    naive=pd.DataFrame(x_obs).mean(), em=pi.value),
                    columns=['pr', 'true', 'naive', 'em']).sort('true')
print np.round_(ests, 3)
      pr   true  naive     em
2  0.101  0.101  0.100  0.101
0  0.106  0.106  0.108  0.108
3  0.211  0.208  0.209  0.208
1  0.269  0.271  0.272  0.271
4  0.313  0.313  0.314  0.313

Simple averages are just as good as EM, at least for the simplest distribution I could think of based on the example.

Seeing why this EM business is worth the effort requires a more elaborate model of missingness. I made one, but it is a little bit messy. Can you make one that is nice and neat?

Comments Off on Classic EM in Python: Multinomial sampling

Filed under statistics, Uncategorized

Classic EM in Python: Warm-up problem of 197 animals in PyMC

The classic paper on the EM algorithm begins with a little application in multinomial modeling:

Rao (1965, pp. 368-369) presents data in which 197 animals are distributed multinomially into four categories, so that the observed data consist of y = (y_1, y_2, y_3, y_4) = (125, 18, 20, 34).

A genetic model for the population specifies cell probabilities \left(\frac12 + \frac14\pi, \frac14(1-\pi), \frac14(1 -\pi), \frac14\pi\right) for some \pi with 0\leq \pi \leq 1.

Solving this analytically sounds very much like a 1960s pastime to me (answer: \pi = \frac{15 + \sqrt{53809}}{394}), and a modern computer with PyMC can make short work of numerically approximating this maximum likelihood estimate. It takes no more than this:

import pymc as mc, numpy as np

pi = mc.Uniform('pi', 0, 1)
y = mc.Multinomial('y', n=197, 
                   p=[.5 + .25*pi, .25*(1-pi), .25*(1-pi), .25*pi],
                   value=[125, 18, 20, 34], observed=True)

mc.MAP([pi,y]).fit()
print pi.value

But the point is to warm up, not to innovate. The EM way is to introduce a latent variable x = (x_1, x_2, x_3, x_4, x_5), set y_1 = x_1+x_2, y_2=x_3, y_3=x_4, y_4=x_5, and work with a multinomial distribution for x that induces the multinomial distribution above on y. The art is in choosing a good latent variable, mapping, and distribution, where “good” is measured by the computation it yields:

import pymc as mc, numpy as np

pi = mc.Uniform('pi', 0, 1)

@mc.stochastic
def x(n=197, p=[.5, .25*pi, .25*(1-pi), .25*(1-pi), .25*pi], value=[125.,0.,18.,20.,34.]):
    return mc.multinomial_like(np.round_(value), n, p)

@mc.observed
def y(x=x, value=[125, 18, 20, 34]):
    if np.allclose([x[0]+x[1], x[2], x[3], x[4]], value):
        return 0
    else:
        return -np.inf

It is no longer possible to get a good fit from an mc.MAP object for this model (why?), but EM does not need one. The EM approach is to alternate between two steps:

  • Update x (E-step, because it is set to its conditional expectation based on the current value of \pi)
  • Update \pi (M-step, because it is chosen to maximize the likelihood for the current value of x)

This is simply implemented in PyMC and quick to get the goods:

def E_step():
    # split the 125 counts in the first observed cell between the two
    # latent cells, in proportion to their current probabilities
    x.value = [125 * .5 / (.5 + .25*pi.value),
               125 * .25*pi.value / (.5 + .25*pi.value), 18, 20, 34]

def M_step():
    # maximum likelihood estimate of pi from the (fractional) complete-data counts
    pi.value = (x.value[1] + 34.) / (x.value[1] + 34. + 18. + 20.)

for i in range(5):
    print 'iter %2d: pi=%0.4f, X=%s' % (i, pi.value, x.value)
    E_step()
    M_step()

It is not fair to compare speeds without making both run to the same tolerance, but when that is added EM is 2.5 times faster. That could be important sometimes, I suppose.

1 Comment

Filed under statistics