## Using “potentials” in Bambi to complicate a regression model

I have had my eye on a python package called Bambi for a while now, because I often need a regression model that is a little more complicated that sklearn.linear_model.LinearRegression but not complicated enough to make a whole new PyMC model.

Here is a minimal example (adapted from Awesome Open Source):

import bambi as bmb
import numpy as np, pandas as pd, arviz as az

data = pd.DataFrame({
"y": np.random.normal(size=50),
"x1": np.random.normal(size=50),
"x2": np.random.normal(size=50)
})

model = bmb.Model("y ~ x1 + x2", data)
results = model.fit()
az.summary(results)


One cool thing about Bambi is that while it is simpler that writing a whole new PyMC model, it is a lot like writing a PyMC model. For example, if I need to add an informative prior, that is pretty easy:

priors = {'x1': bmb.Prior("Uniform", lower=0, upper=.05)}
model = bmb.Model("y ~ x1 + x2", data,
priors=priors)
results = model.fit()
az.summary(results)


And if I need a more complex distribution on that prior, Bambi exposes a “potential” parameter that puts additional terms in the posterior distribution, just like PyMC:

potentials = [
(('x1', 'x2'),
lambda x1, x2: bmb.math.switch((x1+x2 <= 0.0), 0, -999)),
]

model = bmb.Model("y ~ x1 + x2", data,
potentials=potentials)
results = model.fit()
az.summary(results)



I’m guessing that the syntax will continue to evolve, which is just one more reason Bambi is a python package that I am going to continue to watch.

Comments Off on Using “potentials” in Bambi to complicate a regression model

Filed under Uncategorized

## Our nine phase approach to building a public health simulation

We wrote this up for a conference, but it didn’t have proceedings, so I’m putting it online here: Christine Allen, James Collins, Zane Rankin, Kate Wilson, Derrick Tsoi, Kelly Compton, Enabling Model Complexity Through an Improved Workflow, presented at Modeling World Systems Conference, Washington DC, May 13-15, 2019.

2019 Enabling Model Complexity Through an Improved Workflow MWS_paper Christine Allen

(The paper refers to eight phases of model development, but there is a ninth phase, which Christine wanted to keep secret: celebrate the successful completion of a modeling project.)

Comments Off on Our nine phase approach to building a public health simulation

Filed under Uncategorized

## Creative nonfiction and Professor Gloxman

I published some creative nonfiction about Census data and re-identification attacks on twitter, and I’m linking to it here so I can find it again more easily:

Comments Off on Creative nonfiction and Professor Gloxman

Filed under Uncategorized

## CSV version of social-distancing mandate table

The IHME COVID-19 health service utilization forecasting team has updated our pre-print on how we make the projections on https://covid19.healthdata.org/

You can find it here while it makes its way onto medRxiv (and eventually makes it through the peer-review process, I hope!)

The preprint is now up to 68 pages, and a table starting on page 65 lists all of the dates that David Pigott and his army of data analysts compiled as part of this work. I think it might be useful to other researchers, and if it is, I’m sure they have better things to do than transform it into a machine-readable format.

Filed under Uncategorized

## Hospital data potentially relevant to COVID-19

In the same spirit of my last post, here is a notebook I developed to extract information on hospital capacity from the Form 2552-10 that hospitals file regularly with Medicare. If you know how to get this information from somewhere else, know how to tangle with this data better, or know other things about it that you think I should know… please let me know!

1 Comment

Filed under disease modeling

## Nursing home data potentially relevant to COVID-19 outbreak in King County

I have recently started helping out with some computer modeling efforts to support outbreak response, and I’m learning that there are a lot of projects already underway. To learn from others, and also to perhaps be helpful to people who I don’t know who are also working on this, I’m posting this write-up of an estimate I just produced: estimates of the number of people in nursing homes by age-group and facility.

As you may know, many of the deaths so far in the COVID-19 outbreak have been among people who were in nursing homes (28 out of 37, as of March 3, 2020). How many people are in nursing homes? And, since age seems to be an important determinant of disease severity, how old are they? I used open sources to make estimates for all nursing facilities in Washington State. If this is useful to you, great! And if you know how to do this better, please let me know!

Here is a csv file you can use if you are doing disease modeling. I’ve put a Jupyter Notebook with all the code to derive these estimates on the web, and below I have collected some details on the potential data sources that this and other future estimates might use.

## Medicare Minimum Data Set (MDS) 3.0

Medicare collects quality improvement data regularly from all skilled nursing facilities, and publishes summaries from this “Minimum Data Set (MDS) 3.0” exercise.  You can find some information about it online: https://www.cms.gov/Research-Statistics-Data-and-Systems/Computer-Data-and-Systems/Minimum-Data-Set-3-0-Public-Reports/Minimum-Data-Set-3-0-Frequency-Report

MDS provides data in 10-year age groups, but the data on the web gives only state-level values:

Can we get more detail? This url (https://www.resdac.org/cms-data/files/mds-3.0) seems like it would have the full data file somewhere, but I was not able to locate it.

## Decennial Census Summary File One (SF1)

There is also data with fine geographic precision available from the decennial census.  The “Summary File One (SF1)” includes tables on the number of people living in skilled nursing facilities in each census block. This is available at the county and MSA level, stratified by sex and 5-year age groups (top coded at 85+), but it is now 10 years out of date, so perhaps it is not that useful.  https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=DEC_10_SF1_PCO5&prodType=table

It does have the sex ratio, maybe that is useful.  MDS must have that, too, though. It is also available at the census block level, stratified by sex and broad age groups (<18, 22-64, 69+, if I’m reading it correctly).

## Washington Aging and Long-Term Support Administration (ALTSA)

https://www.dshs.wa.gov/altsa/management-services-division/nursing-facility-rates-and-reports

This state agency maintains a list of nursing homes in Washington State, and has bed counts for each nursing home. It does not provide an age breakdown, though.  It puts WA State capacity at 19,332 beds.

ALTSA and MDS together show WA State is using 84% of available beds in nursing homes.

## Model

It seems a bit grandiose to call this a statistical model, although I think it is technically accurate to say that I have used a “first-order log-linear model” to estimate the number of individuals in nursing homes by facility and by age group.  The formula is the following, where N(age, facility) is the number of individuals in a given age group and a given facility:

N(age, facility) = total number * age share * facility share

## Conclusion

For convenience, here is a link to the csv file that I produced with this approach, and here is a link to the Jupyter Notebook with all the code to derive these estimates from open sources.  If you know how to make estimates like this that are more accurate, please let me know!

1 Comment

Filed under disease modeling

## Value of Privacy in Census Data (Holocaust Remembrance Day edition)

In October of 2019, the Harvard Data Science Review hosted two workshop sessions on Differential Privacy in the 2020 Census. I watched it remotely, and I thought it was interesting, but I don’t remember anything really surprising being presented. Simson Garfinkel from Census Bureau asked a question towards the end that stuck with me, though.  Can people who advocate for privacy protection in census data do some sort of cost-benefit analyses to quantify the value of an imprecise census?

I don’t know of any work which has presented such an analysis explicitly.  But I did read a chapter in a history book that struck me as relevant.  It is a bit heavy, so I have never before thought it was appropriate to push into the conversation.  But today is Holocaust Remembrance Day, so if not today, well…

The history book is IBM and the Holocaust by Edwin Black.  The chapter is “France and Holland” and these two countries, both occupied by Nazis, are the paired comparison for valuing census privacy. Black writes, “German intentions in both countries were nearly identical and unfolded in a similar sequence throughout the war years.  But everything about the occupation of these lands and their involvement with Hitler’s Holleriths was very different. For the Jews of these two nations, their destinies would also be quite different.”

In Holland, the Nazis found a census director who, although not an anti-Semite, loved tallying population metrics to the exclusion of considering their possible effects on populations.  “Theoretically,” he wrote, “the collection of data for each person can be so abundant and complete, that we can finally speak of a paper human representing the natural human.” He was tasked to create a registry of Dutch Jews for the Nazi occupiers, “[t]he registry much [contain] age, profession, and gender … [and] the category (Jew, Mixed I, Mixed II) to which the registered belongs.” It was not an easy task, but the Dutch census workers succeeded, and, as the director wrote, “the census office was able to contribute ways and means of carrying out its often difficult task.” (All quotes from Black.)

In France, the Nazis thought things were proceeding similarly. But there, the census director was Rene Carmille, and he was also a secret agent with the French Resistance.  Charged with creating a registry like that in Holland, he employed sabotage: making tabulations slowly, damaging punch card machinery to prevent recording of Jewish ethnic category, and other heroic acts of sabotage. Carmille’s loyalty to the resistance was discovered by the Nazis when he used his registry to organize resistance combat units in Algeria. As Black puts it, “the holes were never punched, the answers were never tabulated.  More than 100,000 cards of Jews sitting in his office – never handed over. He foiled the entire enterprise.”

The chapter ends with quantitative outcomes. In Holland, Nazis murdered more than 70% of the Jews.  In France, less than 25%. Is it too much to conclude that the lack of precise census data saved 45 of every 100 Jews of occupied France? 135,000 total deaths averted.

1 Comment

Filed under census

## Convex optimization in Python: Non-negative least squares with Pyomo and Ipopt

Last month I attended a CNSTAT workshop https://sites.nationalacademies.org/DBASSE/CNSTAT/DBASSE_196518 on differentially private (DP) census data and left thinking that I need to get more hands-on experience with non-negative least squares optimization.

In the DP approach to disclosure avoidance that Census Bureau is planning, there is a “post-processing” step that improves the acceptability of imprecise DP counts of the number of individuals in a subgroup, like the voting age males who are non-Hispanic whites in King County census tract 89 http://ihmeuw.org/50cg. This improvement relies on non-negative least squares optimization. It finds fractional counts that sum to a known control total, that are not negative, and that minimize the sum of squared differences between these optimized counts and the DP imprecise counts. Sam Petti and I wrote out the details here, https://gatesopenresearch.org/articles/3-1722/v1 if you want to know more about this.

This blog is not about the DP details, but about how, in practice, I might try doing this sort of optimization. As anyone who has a PhD in Algorithms, Combinatorics, and Optimization can tell you, this sort of constrained convex optimization can be computed efficiently, meaning in time that scales like a polynomial function of the size of the instance. But when I was leaving graduate school, good luck if you wanted to actually solve a specific instance.

So I am quite pleased now to see how open-source software is making it possible today.

Ipopt https://github.com/coin-or/Ipopt#ipopt is an open-source interior point solver that can handle a quadratic objective function with linear equality and inequality constraints. My colleague Brad Bell has been a far of it for a decade, but until now I’ve just used it indirectly through his code.

Pyomo https://github.com/Pyomo/pyomo#pyomo-overview is a pythonic interface for many optimization solvers, including Ipopt, and now that I want to do something without Brad’s help, I’m glad to have it.

Here is a stripped down version of the optimization I want to do

$\min \left\{ \sum_{i=1}^I (x_i - c_i)^2 : x_i \geq 0, \sum_{i=1}^I x_i = C\right\}$

I’m completely dedicated to the Anaconda python distribution at this point (as setup and used in Software Carpentry), and if you are using this approach you can get everything you need to do such an optimization with the following

conda install -c conda-forge pyomo ipopt


You can test if this is working by executing the following little snippet of code in an IPython shell or Jupyter Notebook:

from pyomo.core import *
from pyomo.opt import SolverFactory

m = ConcreteModel()
m.x = Var([1,2], within=NonNegativeReals)
m.objective = Objective(expr=m.x[1]**2 + m.x[2]**2)
m.constraint = Constraint(expr=m.x[1] + m.x[2] == 1)

solver = SolverFactory('ipopt')
results = solver.solve(m)

print('Optimal value of x:',
(value(m.x[1]), value(m.x[2])))


Here is a challenge, if you like such a thing: use this test and the nonnegative least squares formula above to make a function that takes a list of imprecise counts and a float for a control total, and returns the optimized (fractional) counts for them.

It does take a bit of fiddling to get the pyomo python right, and then there are plenty of options to Ipopt that might make the optimization work better in more complicated cases. I found the book Pyomo — Optimization Modeling in Python useful for the former (and a pdf from a talk entitled Pyomo Tutorial – OSTI.GOV a good substitute before I got my hands on the book). For the latter, I found an extensive webpage of Ipopt options gave me some good hints, especially when combined with web searching, which provided only some small utility on its own.)

It is so cool that this is now possible. Thanks to all who made it so, including some old friends who must have been working on this back when I was a summer intern at IBM.

And for readers who don’t like or don’t have time for the challenge of coding it up themselves (perhaps including myself a day or a year from now), here is a solution:

def nonnegative_optimize(imprecise_counts, control_total):
"""optimize the imprecise counts so that they sum to
the control total and are non-negative

Parameters
----------
imprecise_counts : list-like of floats
control_total : float

Results
-------
returns optimized_counts, which are close to imprecise
counts, but not negative, and match control total in
aggregate
"""
imprecise_counts = list(imprecise_counts)

model = ConcreteModel()
model.I = range(len(imprecise_counts))
model.x = Var(model.I, within=NonNegativeReals)
model.objective = Objective(
expr=sum((model.x[i] - imprecise_counts[i])**2
for i in model.I))
model.constraint = Constraint(
expr=summation(model.x) == control_total)

solver = SolverFactory('ipopt')
results = solver.solve(model)
optimized_counts = [value(model.x[i])
for i in model.I]

return optimized_counts


p.s. if you want to see what this looks like in action, or for convenience in replication efforts, find a Jupyter Notebook version of the code from this blog here: https://gist.github.com/aflaxman/d7b39fc8cd5805e344a66f9c9d16acf7

Comments Off on Convex optimization in Python: Non-negative least squares with Pyomo and Ipopt

Filed under census

## Slides from my CSSS Seminar on DP and the 2020 US Census

Last week I got to present for a whole hour on the changes to the disclosure avoidance system being developed for the 2020 decennial census. Thank you Center for Statistics in the Social Sciences for being a great convener and thanks to the attendees for being a great audience.

Here are the slides, including some I made just for talking in this enumeration district:

Comments Off on Slides from my CSSS Seminar on DP and the 2020 US Census

Filed under census

## Differential Privacy and the FYI Guy

Gene Balk might  be the highest profile user of Census data in Seattle. He is a columnist for the Seattle Times, and as the “FYI Guy”, his columns center on facts in figures. Will the Census Bureau’s dedication to differential privacy ruin his work?

Let’s investigate.  His July 17th column was titled “For first time this decade, a dip in King County’s white population, census data shows”.  As you might suspect from the title, it turns entirely on census data, specifically the recently published 2018 Annual Population Estimates, which include estimates of the racial composition of the United States, each state and county, by year.  In the column, Gene focused on how the race-specific population numbers changed from 2017 to 2018:

A provocative topic, for sure.  I read recently that demographers avoid drawing attention to the national version of this issue. I would have thought that perhaps it would not be as controversial in a liberal city like Seattle.  Perhaps I would have been wrong, as demonstrated by the editor’s note on the Seattle Times column:

## To differential privacy

If I have understood correctly, there is not currently differential privacy in the disclosure avoidance system for Population Estimates.  But what if there was?  That would likely mean that the reported values would be noisy versions of the version now available on the Census website.  Instead of saying that there were an estimated 744,955 people living in Seattle on July 1, 2018, it might say that there were 744,955 + X – Y, where X and Y are numbers drawn randomly from a geometric distribution. (It might be a bit more complicated than that, but I don’t think too much more complicated.)  There is a crucial and yet-to-be-made decision about the appropriate standard deviation of this distribution, often described as the “privacy loss budget”, and also known as epsilon.

The privacy loss epsilon that people have been talking about for the decennial census is 0.25, which I recently learned is just for testing purposes.

Let’s reproduce the key figure with an epsilon of noise added. I started by trying epsilon of 0.01, since that sounds very private to me.  I’m not being very precise here, but I am following some code from the Census Bureau’s End-to-End test.

For the purposes of having something that visibly changes, let’s also try a reduced epsilon of 0.001, which sounds too small, to me.  Is this a useful way to investigate how DP might change things?  It looks like even at this level of privacy loss, in 99 out of 100 noisy versions the story is the same: a dip in White population of King County.

At least in this case, DP will not change the story for the FYI guy!

(Notebook to reproduce these figure here.)