ORLANDO, Fla. — First came the “noise” — minor errors the U.S. Census Bureau decided to introduce into the 2020 census data to protect participants’ privacy. Now the bureau is looking into “synthetic data,” manipulating the numbers widely used for economic and demographic research to obscure the identities of the people who provided information.
The moves have some researchers up in arms, worried that the statistical agency could sacrifice accuracy in its zeal to protect privacy. Census Bureau statisticians disclosed at a virtual conference last week that they will work over the next three years toward developing a method to create “synthetic data” for files on individuals and homes that already are devoid of personalized information. These files, known as American Community Survey microdata, are used by researchers to create customized tables tailored to their research.
Census Bureau statisticians said more privacy protections are needed as technological innovations magnify the threat of people being identified through their confidential survey answers. Computing power is now so vast that it can quickly crunch third-party data sets that combine personal information from credit rating and social media companies, purchasing records, voting patterns, and public documents, among other things.
“It’s a balancing act. The law requires us to do competing things. We need to release statistics on the nation to allow people to make good decisions. But we also have to protect the privacy of our respondents,” said Rolando Rodriguez, a Census Bureau statistician, at the conference.
But critics say the proposal, coupled with an ongoing effort to add minor inaccuracies to the 2020 census data to protect participants’ privacy, undermines the Census Bureau’s credibility as the go-to provider of precise data about the U.S. population.
University of Minnesota demographer Steven Ruggles said bluntly that synthetic data “will not be suitable for research.” “The Census Bureau is inventing imaginary threats to confidentiality to sharply reduce public access to data,” Ruggles said. “I do not think this will stand because society needs information to function.”
The microdata is gathered every year from the American Community Survey with a sample size of 3.5 million households, extrapolated across populations of all sizes, from the entire nation down to neighborhoods. This provides a wide range of estimates on the nation’s demographic makeup and housing characteristics. The microdata is used in the drafting of around 12,000 research papers a year, Ruggles said.
The synthetic data are created by taking variables in the microdata to build models recreating the variables’ interrelationships, and then constructing a simulated population based on the models. Scholars would conduct their research using the simulated population, or synthetic data, and then, if they want, submit their analyses to the Census Bureau to be double-checked against the actual data.
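The general idea can be illustrated with a toy sketch in Python. It is not the Census Bureau's method, which the statisticians said is still to be developed. The two variables (age and income), the invented "actual" records, the simple linear model and the function names `fit_model`, `synthesize` and `correlation` are all assumptions made for illustration.

```python
import random
import statistics

def fit_model(ages, incomes):
    """Capture how the two variables relate: the age distribution, plus a
    least-squares line predicting income from age and the leftover spread."""
    mean_age = statistics.mean(ages)
    mean_income = statistics.mean(incomes)
    sxy = sum((a - mean_age) * (i - mean_income) for a, i in zip(ages, incomes))
    sxx = sum((a - mean_age) ** 2 for a in ages)
    slope = sxy / sxx
    intercept = mean_income - slope * mean_age
    residuals = [i - (intercept + slope * a) for a, i in zip(ages, incomes)]
    return {
        "mean_age": mean_age,
        "sd_age": statistics.stdev(ages),
        "slope": slope,
        "intercept": intercept,
        "residual_sd": statistics.stdev(residuals),
    }

def synthesize(model, n_people, rng):
    """Build a simulated population by drawing from the fitted model.
    No row corresponds to any actual respondent."""
    people = []
    for _ in range(n_people):
        age = rng.gauss(model["mean_age"], model["sd_age"])
        income = (model["intercept"] + model["slope"] * age
                  + rng.gauss(0, model["residual_sd"]))
        people.append((age, income))
    return people

def correlation(xs, ys):
    """Pearson correlation, standing in for a researcher's analysis."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)

# Invented "actual" microdata (an assumption): income tends to rise with age.
actual_ages = [rng.gauss(45, 12) for _ in range(5000)]
actual_incomes = [20000 + 800 * a + rng.gauss(0, 10000) for a in actual_ages]

synthetic = synthesize(fit_model(actual_ages, actual_incomes), 5000, rng)
synthetic_ages = [a for a, _ in synthetic]
synthetic_incomes = [i for _, i in synthetic]

# The researcher runs the analysis on the synthetic data; the agency could
# then rerun the same analysis on the actual data to validate it.
actual_corr = correlation(actual_ages, actual_incomes)
synthetic_corr = correlation(synthetic_ages, synthetic_incomes)
print(f"actual: {actual_corr:.3f}  synthetic: {synthetic_corr:.3f}")
```

In this sketch the synthetic population reproduces only the relationships the model was built to capture. Any pattern left out of the model, such as a nonlinear effect or a link to a third variable, would be absent from the simulated records, which is the kind of accuracy loss critics are concerned about.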