
How much did a new privacy tool distort the 2020 U.S. census?

The 2020 U.S. census released last month is more than a snapshot of the country’s 331 million residents. It is also a promise from the Census Bureau to protect the identity of every person whose information is hidden within the billions of pieces of demographic data.

Census officials decided several years ago that the best way to keep that promise was to apply a mathematical tool called differential privacy. Last fall, the agency used it to make slight adjustments to all the data collected on age, race/ethnicity, gender, and household status before it was made public on 12 August.

Agency officials concede that applying differential privacy subtly distorts those data. To compensate, they suggested people view the census results like an Impressionist painting: blurry up close but clear once you take a few steps back.

But that suggestion isn’t very helpful for researchers who sweat the statistical details. So last month more than 50 of them asked the Census Bureau to make public an interim data file that would help them analyze exactly how differential privacy scrambled the data they rely upon to understand the country’s ever-changing demographics.

“It is a way to let researchers look under the hood,” says Joe Salvo, the former chief demographer for New York City and one of the signers of a 12 August letter to acting Census Director Ron Jarmin. (An earlier version of the letter, drafted by Cynthia Dwork, Gary King, and Ruth Greenwood of Harvard University, was published in The Boston Globe on 26 July.) The added transparency, they say, would also bolster public confidence in an agency that had to overcome both political interference and the coronavirus pandemic to perform the constitutionally mandated decennial census.

Census officials declined repeated requests to make someone available to discuss the issue. But a spokesperson said the agency “is considering the request … and will disclose its decision publicly once made.”

Bring in the noise

Census data serve many audiences. The U.S. government, for instance, uses them to determine how to distribute more than $1 trillion a year in myriad federal programs, and states redraw voting districts based on the new demographic information.

What the researchers want released is what they call the “noisy measurements file.” It’s noisy because Census officials applied differential privacy, injecting statistical noise to make sure none of the people who filled out the census, or who were included through other means, could be identified in the vast array of tables that are made public.

Differential privacy provides a mathematical guarantee that whatever someone can learn about you from analyzing a data set doesn’t depend on whether your data are included, explains Dwork, a theoretical computer scientist who co-authored a 2006 paper that described the underlying concept. Operationally, it allows the agency to decide how much “noise” to inject into any data set. The more noise, the greater the degree of privacy. But more privacy comes at the cost of accuracy; perfect privacy would require disclosing no information, a condition that would negate the purpose of the decennial census.
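The tradeoff Dwork describes can be sketched with the classic Laplace mechanism from that 2006 line of work. (The bureau’s actual 2020 system, known as TopDown, uses a different noise distribution and a far more elaborate procedure; the counts and privacy budgets below are purely illustrative.)

```python
import numpy as np

def laplace_mechanism(true_count, epsilon, sensitivity=1.0, rng=None):
    """Release a count with Laplace noise calibrated to the budget epsilon.

    `sensitivity` is how much one person's data can change the statistic
    (1 for a simple head count). A smaller epsilon means a larger noise
    scale: more privacy, less accuracy.
    """
    if rng is None:
        rng = np.random.default_rng()
    return true_count + rng.laplace(0.0, sensitivity / epsilon)

# The same hypothetical block-level count released under two budgets.
rng = np.random.default_rng(0)
print(laplace_mechanism(48, epsilon=10.0, rng=rng))  # typically lands near 48
print(laplace_mechanism(48, epsilon=0.1, rng=rng))   # can land far off, even below zero
```

As epsilon is driven toward zero the noise scale grows without bound and the released number reveals essentially nothing, which is the “perfect privacy discloses no information” limit the article mentions.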

The Census Bureau’s chief scientist, John Abowd, believes differential privacy is far superior to the technique the Bureau had used in the 1990, 2000, and 2010 censuses to safeguard privacy. That method, called swapping, involved altering specific characteristics of an individual, one at a time. But swapping can no longer protect privacy, the letter notes, because it is “vulnerable to technical advances in data reconstruction science.”

For researchers, swapping had another major disadvantage: it was completely opaque. The Bureau neither disclosed how often it was used nor gave details on how it was applied, saying such secrecy was necessary to prevent someone from reverse engineering the process and fingering individuals. As a result, demographers had no way to calculate how much swapping might have distorted the publicly available data. “Scientists pretended [swapping] was not happening because they couldn’t adjust for it,” says Erica Groshen, an economist at Cornell University, who also signed onto the letter.

In contrast, differential privacy is based on algorithms that don’t need to be kept secret, and Census officials have disclosed how much noise they have injected.

However, the noisy file that results isn’t fit for public consumption. In the process of masking the raw data, the algorithms yield negative numbers, fractions, and other statistical oddities, such as a household headed by a 2-year-old. Researchers aren’t bothered by such oddities, Dwork says, but they would confound most public officials or citizens trying to understand their community’s changing demographics.

So Census officials produce a third version of the data, called the post-processed file, that cleans up those confusing numbers. The file still contains data altered by the application of differential privacy, however. And it is hard for researchers to spot those changes.
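The cleanup step can be illustrated with a toy projection: clip negative values, rescale to a fixed total, then round while repairing the rounding drift. This is only a stand-in for the bureau’s optimization-based post-processing, and the counts are invented:

```python
import numpy as np

def postprocess(noisy, total):
    """Project noisy counts onto nonnegative integers that sum to `total`."""
    x = np.clip(np.asarray(noisy, dtype=float), 0, None)
    if x.sum() == 0:
        x = np.ones_like(x)          # degenerate case: spread the total evenly
    x *= total / x.sum()             # rescale to hit the invariant total
    ints = np.floor(x).astype(int)
    # Hand the leftover units to the cells with the largest remainders.
    shortfall = total - ints.sum()
    order = np.argsort(-(x - ints))
    ints[order[:shortfall]] += 1
    return ints

noisy = [12.7, -3.1, 0.4, 41.9]      # raw noisy measurements: negatives, fractions
print(postprocess(noisy, total=50))  # nonnegative integers summing to exactly 50
```

The cleaned file looks sensible, but as the article notes, every cell has been nudged twice: once by the noise and once by the projection, and only the first nudge has a published distribution.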

“It will take considerable statistical effort and expertise to measure and correct for all the biases,” the letter asserts. “Making available the noisy measurements file is an easy solution and the key to ensuring that analysts can easily use the data.”

The impact on redistricting

Census officials created another obstacle for researchers when they decided not to tamper with the total population of each state when applying differential privacy. Those politically sensitive numbers are used to apportion the 435 seats in the U.S. House of Representatives, an exercise mandated by the Constitution. Holding state totals invariant required balancing out any population shifts within a state. In practice, that meant redistributing people away from the largest demographic subgroups: Someone from a metropolitan area might be added to a rural area, for example, or a white resident assigned to a majority-minority neighborhood. Those adjustments led to media reports of “ghost” residents inflating the population of a hamlet, or of an urban neighborhood in which nobody resided.
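The effect of an invariant can be seen in a few lines: once the state total must come out exact, any noise that raises one area’s count has to be paid back by lowering another’s. The tract counts here are hypothetical, and the bureau enforces its invariants through optimization rather than this uniform shift:

```python
import numpy as np

rng = np.random.default_rng(2)
true_counts = np.array([9000.0, 700.0, 150.0, 150.0])  # hypothetical tracts in one state
state_total = true_counts.sum()                        # held invariant for apportionment

noisy = true_counts + rng.laplace(0.0, 25.0, size=4)
# Enforce the invariant: shift the counts so the state total is exact again.
adjusted = noisy + (state_total - noisy.sum()) / 4
shifts = adjusted - true_counts
print(shifts, shifts.sum())  # areas trade "ghost" residents; the shifts cancel statewide
```

At the state level the books balance perfectly, which is exactly why the error has to show up somewhere below the state level.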

Releasing the noisy file would be a boon to the work of Michael McDonald, a political scientist at the University of Florida who signed the letter. McDonald has created a tool that lets the public draw their own redistricting maps. Knowing how many white residents might have been added to a predominantly Black congressional district, he says, could determine whether a newly drawn district violates federal laws designed to ensure equal access to voting and protect minority representation. “It could affect whether the requirements of the [federal] Voting Rights Act are met in a particular district,” he says.

The potential impact of differential privacy on redistricting is part of a broader problem facing those who work with federal statistics, Groshen says. “Because we live in a society where data, lawsuits, and special interest groups have become so pervasive, the ‘accuracy’ of the Decennial Census is now much harder to define and much easier to challenge,” she writes in an upcoming paper with Daniel Goroff of the Sloan Foundation about how researchers can make good use of privacy-protected data.   

That new reality is why former Census Director John Thompson chose not to sign onto the letter despite being a staunch supporter of the Bureau’s use of differential privacy. “It’s a good idea, scientifically,” says Thompson, who was nominated by President Barack Obama in 2012 and ousted shortly after President Donald Trump took office in 2017. “But I’m worried that [releasing a second set of numbers] could lead to lawsuits over redistricting. So I’m against it for political reasons.”

Steve Ruggles, a demographer at the University of Minnesota, Twin Cities, where he leads a center that provides census and survey data to a global network of researchers, didn’t sign the letter because he believes differential privacy is not needed to ensure privacy and that it degrades data quality. But on 2 August he wrote separately to Abowd, urging him to release the noisy file “in the interest of transparency.”

Groshen, who led the Bureau of Labor Statistics from 2013 to 2017, concedes it would be “a heavy lift” for the Census Bureau to prepare the noisy file for release. “As a former head of a federal statistical agency, I’m very reluctant to make their lives more difficult. But releasing the file would send a signal that providing researchers with the information they need is a high priority for the Bureau. And I think that is a very important message.”

Some co-signers worry that, for all its value, releasing the noisy file could also lead to unforeseen negative consequences. “We don’t know what happened behind the scenes,” says one researcher about the months of work that went into producing the post-processed file. “What if it exposes a haphazard process within the Bureau, or problems that it encountered in applying differential privacy?” (In 2020, a senior Census official explained to a conference of state legislators that “post-processing error tends to be much larger than differential privacy error.”)

But Dwork and her colleagues say researchers—and the general public—can handle the truth. “It is time to help [census data] users stop pretending there is no noise,” they write, “and empower them to address uncertainty in data with mathematical rigor.”

Source: Science Mag