Category Archives: census

Value of Privacy in Census Data (Holocaust Remembrance Day edition)

In October of 2019, the Harvard Data Science Review hosted two workshop sessions on Differential Privacy in the 2020 Census. I watched them remotely and thought they were interesting, but I don’t remember anything really surprising being presented. Simson Garfinkel from the Census Bureau asked a question towards the end that stuck with me, though: can people who advocate for privacy protection in census data do some sort of cost-benefit analysis to quantify the value of an imprecise census?

I don’t know of any work that has presented such an analysis explicitly.  But I did read a chapter in a history book that struck me as relevant.  It is a bit heavy, so I have never before thought it appropriate to push it into the conversation.  But today is Holocaust Remembrance Day, so if not today, well…

The history book is IBM and the Holocaust by Edwin Black.  The chapter is “France and Holland,” and these two countries, both occupied by the Nazis, are the paired comparison for valuing census privacy. Black writes, “German intentions in both countries were nearly identical and unfolded in a similar sequence throughout the war years.  But everything about the occupation of these lands and their involvement with Hitler’s Holleriths was very different. For the Jews of these two nations, their destinies would also be quite different.”

In Holland, the Nazis found a census director who, although not an anti-Semite, loved tallying population metrics to the exclusion of considering their possible effects on those populations.  “Theoretically,” he wrote, “the collection of data for each person can be so abundant and complete, that we can finally speak of a paper human representing the natural human.” He was tasked with creating a registry of Dutch Jews for the Nazi occupiers: “[t]he registry must [contain] age, profession, and gender … [and] the category (Jew, Mixed I, Mixed II) to which the registered belongs.” It was not an easy task, but the Dutch census workers succeeded, and, as the director wrote, “the census office was able to contribute ways and means of carrying out its often difficult task.” (All quotes from Black.)

In France, the Nazis thought things were proceeding similarly. But there, the census director was Rene Carmille, and he was also a secret agent with the French Resistance.  Charged with creating a registry like the one in Holland, he employed sabotage: making tabulations slowly, damaging punch card machinery to prevent the Jewish ethnic category from ever being recorded, and other heroic acts. Carmille’s loyalty to the resistance was discovered by the Nazis when he used his registry to organize resistance combat units in Algeria. As Black puts it, “the holes were never punched, the answers were never tabulated.  More than 100,000 cards of Jews sitting in his office – never handed over. He foiled the entire enterprise.”

The chapter ends with quantitative outcomes. In Holland, the Nazis murdered more than 70% of the Jews; in France, less than 25%. Is it too much to conclude that the lack of precise census data saved 45 of every 100 Jews of occupied France? That is roughly 135,000 deaths averted in total.


Convex optimization in Python: Non-negative least squares with Pyomo and Ipopt

Last month I attended a CNSTAT workshop https://sites.nationalacademies.org/DBASSE/CNSTAT/DBASSE_196518 on differentially private (DP) census data and left thinking that I need to get more hands-on experience with non-negative least squares optimization.

In the DP approach to disclosure avoidance that the Census Bureau is planning, there is a “post-processing” step that improves the acceptability of imprecise DP counts of the number of individuals in a subgroup, like the voting age males who are non-Hispanic whites in King County census tract 89 http://ihmeuw.org/50cg. This improvement relies on non-negative least squares optimization: it finds fractional counts that sum to a known control total, that are not negative, and that minimize the sum of squared differences between these optimized counts and the imprecise DP counts. Sam Petti and I wrote out the details here https://gatesopenresearch.org/articles/3-1722/v1, if you want to know more about this.

This post is not about the DP details, but about how, in practice, I might actually do this sort of optimization. As anyone who has a PhD in Algorithms, Combinatorics, and Optimization can tell you, this sort of constrained convex optimization can be computed efficiently, meaning in time that scales like a polynomial function of the size of the instance. But when I was leaving graduate school, good luck to you if you actually wanted to solve a specific instance.

So I am quite pleased now to see how open-source software is making it possible today.

Ipopt https://github.com/coin-or/Ipopt#ipopt is an open-source interior point solver that can handle a quadratic objective function with linear equality and inequality constraints. My colleague Brad Bell has been a fan of it for a decade, but until now I’ve just used it indirectly through his code.

Pyomo https://github.com/Pyomo/pyomo#pyomo-overview is a pythonic interface for many optimization solvers, including Ipopt, and now that I want to do something without Brad’s help, I’m glad to have it.

Here is a stripped down version of the optimization I want to do:

\min \left\{ \sum_{i=1}^I (x_i - c_i)^2 : x_i \geq 0, \sum_{i=1}^I x_i = C\right\}
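To make the formula concrete, here is a tiny worked example with made-up numbers (nothing from any actual census file): take imprecise counts c = (-1, 3) and control total C = 4. Projecting c onto the plane x_1 + x_2 = 4 adds (4 - 2)/2 = 1 to each coordinate, giving (0, 4), which already satisfies the non-negativity constraints. So the optimized counts are x = (0, 4): the negative noisy count is pushed up to zero and the shortfall is absorbed by the other count.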

I’m completely dedicated to the Anaconda python distribution at this point (as set up and used in Software Carpentry), and if you are using this approach you can get everything you need to do such an optimization with the following:

conda install -c conda-forge pyomo ipopt

You can test if this is working by executing the following little snippet of code in an IPython shell or Jupyter Notebook:

from pyomo.core import *
from pyomo.opt import SolverFactory

# a tiny test problem: minimize x1^2 + x2^2
# subject to x1 + x2 == 1, with x1, x2 >= 0
m = ConcreteModel()
m.x = Var([1,2], within=NonNegativeReals)
m.objective = Objective(expr=m.x[1]**2 + m.x[2]**2)
m.constraint = Constraint(expr=m.x[1] + m.x[2] == 1)

solver = SolverFactory('ipopt')
results = solver.solve(m)

print('Optimal value of x:',
      (value(m.x[1]), value(m.x[2])))  # should be close to (0.5, 0.5)

Here is a challenge, if you like such a thing: use this test and the non-negative least squares formula above to make a function that takes a list of imprecise counts and a float for a control total, and returns the optimized (fractional) counts for them.

It does take a bit of fiddling to get the Pyomo Python right, and then there are plenty of options to Ipopt that might make the optimization work better in more complicated cases. I found the book Pyomo — Optimization Modeling in Python useful for the former (and a pdf from a talk entitled Pyomo Tutorial – OSTI.GOV was a good substitute before I got my hands on the book). For the latter, I found that an extensive webpage of Ipopt options gave me some good hints, especially when combined with web searching, which provided only limited utility on its own.
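For example, here is a minimal sketch of how one might pass a couple of Ipopt options through Pyomo, reusing the little model m from the test snippet above. The option names max_iter and tol are standard Ipopt options, but the specific values here are just placeholders, not recommendations:

solver = SolverFactory('ipopt')
solver.options['max_iter'] = 5000  # allow more interior point iterations
solver.options['tol'] = 1e-8       # tighten the convergence tolerance
results = solver.solve(m, tee=True)  # tee=True echoes the Ipopt log to the console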

It is so cool that this is now possible. Thanks to all who made it so, including some old friends who must have been working on this back when I was a summer intern at IBM.

And for readers who don’t like or don’t have time for the challenge of coding it up themselves (perhaps including myself a day or a year from now), here is a solution:

from pyomo.core import *  # same imports as in the test snippet above
from pyomo.opt import SolverFactory

def nonnegative_optimize(imprecise_counts, control_total):
    """optimize the imprecise counts so that they sum to
    the control total and are non-negative
    
    Parameters
    ----------
    imprecise_counts : list-like of floats
    control_total : float
    
    Returns
    -------
    optimized_counts : list of floats, close to the imprecise
    counts, but not negative, and matching the control total
    in aggregate
    """
    imprecise_counts = list(imprecise_counts)
    
    model = ConcreteModel()
    model.I = range(len(imprecise_counts))
    model.x = Var(model.I, within=NonNegativeReals)  # non-negativity constraint
    model.objective = Objective(                     # sum of squared differences
        expr=sum((model.x[i] - imprecise_counts[i])**2
                 for i in model.I))
    model.constraint = Constraint(                   # counts sum to control total
        expr=summation(model.x) == control_total)
    
    solver = SolverFactory('ipopt')
    results = solver.solve(model)
    optimized_counts = [value(model.x[i])
                        for i in model.I]
        
    return optimized_counts
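And here is a quick usage example with made-up numbers (not from any actual census file): three imprecise counts, one of which has gone negative, and a control total of 10.

imprecise_counts = [3.4, -0.3, 6.1]  # hypothetical noisy DP counts
optimized_counts = nonnegative_optimize(imprecise_counts, 10.0)
print(optimized_counts)  # non-negative values that sum to 10 and stay close to the inputs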

P.S. If you want to see what this looks like in action, or for convenience in replication efforts, you can find a Jupyter Notebook version of the code from this post here: https://gist.github.com/aflaxman/d7b39fc8cd5805e344a66f9c9d16acf7


Slides from my CSSS Seminar on DP and the 2020 US Census

Last week I got to present for a whole hour on the changes to the disclosure avoidance system being developed for the 2020 decennial census. Thank you, Center for Statistics in the Social Sciences, for being a great convener, and thanks to the attendees for being a great audience.

Here are the slides, including some I made just for talking in this enumeration district:
