## Convex optimization in Python: Non-negative least squares with Pyomo and Ipopt

Last month I attended a CNSTAT workshop https://sites.nationalacademies.org/DBASSE/CNSTAT/DBASSE_196518 on differentially private (DP) census data and left thinking that I need to get more hands-on experience with non-negative least squares optimization.

In the DP approach to disclosure avoidance that Census Bureau is planning, there is a “post-processing” step that improves the acceptability of imprecise DP counts of the number of individuals in a subgroup, like the voting age males who are non-Hispanic whites in King County census tract 89 http://ihmeuw.org/50cg. This improvement relies on non-negative least squares optimization. It finds fractional counts that sum to a known control total, that are not negative, and that minimize the sum of squared differences between these optimized counts and the DP imprecise counts. Sam Petti and I wrote out the details here, https://gatesopenresearch.org/articles/3-1722/v1 if you want to know more about this.

This blog is not about the DP details, but about how, in practice, I might try doing this sort of optimization. As anyone who has a PhD in Algorithms, Combinatorics, and Optimization can tell you, this sort of constrained convex optimization can be computed efficiently, meaning in time that scales like a polynomial function of the size of the instance. But when I was leaving graduate school, good luck if you wanted to actually solve a specific instance.

So I am quite pleased now to see how open-source software is making it possible today.

Ipopt https://github.com/coin-or/Ipopt#ipopt is an open-source interior point solver that can handle a quadratic objective function with linear equality and inequality constraints. My colleague Brad Bell has been a far of it for a decade, but until now I’ve just used it indirectly through his code.

Pyomo https://github.com/Pyomo/pyomo#pyomo-overview is a pythonic interface for many optimization solvers, including Ipopt, and now that I want to do something without Brad’s help, I’m glad to have it.

Here is a stripped down version of the optimization I want to do

$\min \left\{ \sum_{i=1}^I (x_i - c_i)^2 : x_i \geq 0, \sum_{i=1}^I x_i = C\right\}$

I’m completely dedicated to the Anaconda python distribution at this point (as setup and used in Software Carpentry), and if you are using this approach you can get everything you need to do such an optimization with the following

conda install -c conda-forge pyomo ipopt


You can test if this is working by executing the following little snippet of code in an IPython shell or Jupyter Notebook:

from pyomo.core import *
from pyomo.opt import SolverFactory

m = ConcreteModel()
m.x = Var([1,2], within=NonNegativeReals)
m.objective = Objective(expr=m.x[1]**2 + m.x[2]**2)
m.constraint = Constraint(expr=m.x[1] + m.x[2] == 1)

solver = SolverFactory('ipopt')
results = solver.solve(m)

print('Optimal value of x:',
(value(m.x[1]), value(m.x[2])))


Here is a challenge, if you like such a thing: use this test and the nonnegative least squares formula above to make a function that takes a list of imprecise counts and a float for a control total, and returns the optimized (fractional) counts for them.

It does take a bit of fiddling to get the pyomo python right, and then there are plenty of options to Ipopt that might make the optimization work better in more complicated cases. I found the book Pyomo — Optimization Modeling in Python useful for the former (and a pdf from a talk entitled Pyomo Tutorial – OSTI.GOV a good substitute before I got my hands on the book). For the latter, I found an extensive webpage of Ipopt options gave me some good hints, especially when combined with web searching, which provided only some small utility on its own.)

It is so cool that this is now possible. Thanks to all who made it so, including some old friends who must have been working on this back when I was a summer intern at IBM.

And for readers who don’t like or don’t have time for the challenge of coding it up themselves (perhaps including myself a day or a year from now), here is a solution:

def nonnegative_optimize(imprecise_counts, control_total):
"""optimize the imprecise counts so that they sum to
the control total and are non-negative

Parameters
----------
imprecise_counts : list-like of floats
control_total : float

Results
-------
returns optimized_counts, which are close to imprecise
counts, but not negative, and match control total in
aggregate
"""
imprecise_counts = list(imprecise_counts)

model = ConcreteModel()
model.I = range(len(imprecise_counts))
model.x = Var(model.I, within=NonNegativeReals)
model.objective = Objective(
expr=sum((model.x[i] - imprecise_counts[i])**2
for i in model.I))
model.constraint = Constraint(
expr=summation(model.x) == control_total)

solver = SolverFactory('ipopt')
results = solver.solve(model)
optimized_counts = [value(model.x[i])
for i in model.I]

return optimized_counts


p.s. if you want to see what this looks like in action, or for convenience in replication efforts, find a Jupyter Notebook version of the code from this blog here: https://gist.github.com/aflaxman/d7b39fc8cd5805e344a66f9c9d16acf7

Filed under census

## Slides from my CSSS Seminar on DP and the 2020 US Census

Last week I got to present for a whole hour on the changes to the disclosure avoidance system being developed for the 2020 decennial census. Thank you Center for Statistics in the Social Sciences for being a great convener and thanks to the attendees for being a great audience.

Here are the slides, including some I made just for talking in this enumeration district:

Comments Off on Slides from my CSSS Seminar on DP and the 2020 US Census

Filed under census

## Differential Privacy and the FYI Guy

Gene Balk might  be the highest profile user of Census data in Seattle. He is a columnist for the Seattle Times, and as the “FYI Guy”, his columns center on facts in figures. Will the Census Bureau’s dedication to differential privacy ruin his work?

Let’s investigate.  His July 17th column was titled “For first time this decade, a dip in King County’s white population, census data shows”.  As you might suspect from the title, it turns entirely on census data, specifically the recently published 2018 Annual Population Estimates, which include estimates of the racial composition of the United States, each state and county, by year.  In the column, Gene focused on how the race-specific population numbers changed from 2017 to 2018:

A provocative topic, for sure.  I read recently that demographers avoid drawing attention to the national version of this issue. I would have thought that perhaps it would not be as controversial in a liberal city like Seattle.  Perhaps I would have been wrong, as demonstrated by the editor’s note on the Seattle Times column:

## To differential privacy

If I have understood correctly, there is not currently differential privacy in the disclosure avoidance system for Population Estimates.  But what if there was?  That would likely mean that the reported values would be noisy versions of the version now available on the Census website.  Instead of saying that there were an estimated 744,955 people living in Seattle on July 1, 2018, it might say that there were 744,955 + X – Y, where X and Y are numbers drawn randomly from a geometric distribution. (It might be a bit more complicated than that, but I don’t think too much more complicated.)  There is a crucial and yet-to-be-made decision about the appropriate standard deviation of this distribution, often described as the “privacy loss budget”, and also known as epsilon.

The privacy loss epsilon that people have been talking about for the decennial census is 0.25, which I recently learned is just for testing purposes.

Let’s reproduce the key figure with an epsilon of noise added. I started by trying epsilon of 0.01, since that sounds very private to me.  I’m not being very precise here, but I am following some code from the Census Bureau’s End-to-End test.

For the purposes of having something that visibly changes, let’s also try a reduced epsilon of 0.001, which sounds too small, to me.  Is this a useful way to investigate how DP might change things?  It looks like even at this level of privacy loss, in 99 out of 100 noisy versions the story is the same: a dip in White population of King County.

At least in this case, DP will not change the story for the FYI guy!

(Notebook to reproduce these figure here.)

Filed under Uncategorized

## Census Workshop at UW

I have become very interested in the 2020 US Census. There is a workshop at UW about it tomorrow. Expect more technical content about the decennial census on this blog soon. http://ccde.com.washington.edu/census2020/

Comments Off on Census Workshop at UW

Filed under Uncategorized

## #stats_book_club

My colleagues at IHME have been running a Statistics reading group for a year or more now, and this quarter I have joined them to carefully read Cameron Davidson-Pilon’s book Probabilistic Programming & Bayesian Methods for Hackers. We are now three sessions in, and it is going really well, IMHO. I’ve been doing some good thinking about what it takes to get started in applied statistics and Bayesian methods.

More to come.

Filed under Uncategorized

## SWC Jokes

More thoughts on my recent 12 hours of Software-Carpentry-inspired teaching: one feedback card that I will keep in my rainy-day folder said the learner liked my jokes.

The jokes came in after the second break on the first day, before I figured out that 15 minutes was the right length for the break.  I was trying to bring the group back together after only 5 minutes off, and having trouble. “Knock knock,” I said, not too loudly.  “Who’s there?” answers some handful of learners who heard me over the racket. Now the room was starting to focus on this. But what did I have to deliver? “Isabel,” I offered, thanks to my 7-year-old neighbor.

Do you know this one? I need to get some Python-relevant material for future courses.  Anyway, more of the class was now working with me on it. “Isabel who?” they politely offered. “Is a bell necessary on a bicycle?” Definitely a winner… you never know what will go over until you try it on stage.

Filed under Uncategorized

## SWC-inspired 12 hours

I’ve recently completed 12 hours of teaching Introduction to Python and SQL for an audience of new Institute for Health Metrics and Evaluation (IHME) staff and fellows.  SWC is a gem! (I have been thinking this for a while.)

In retrospect, what worked and what might I do differently next time?

Some SWC mechanics that worked well: Live coding, Hands-on exercises, Sticky notes, Jupyter notebooks, and friendly teaching assistants.

Some things to change: Remember to give the big-picture framing for each section, Do more explanation of solutions after hands-on exercises, Share the syllabus ahead of time, and emphasize that this is *introduction* material.

Some changes that I made mid-stream: longer breaks (15 minutes every hour or so), connect the examples to IHME-specific domains.

I also did not use an etherpad until we got through Creating Functions (Section 6 in the Python Inflammation Lession). That might have been too much typing in the first two sessions, and it was definitely appreciated.

Comments Off on SWC-inspired 12 hours

Filed under Uncategorized