Much of the data that governments, companies, and individuals use on a daily basis aggregates information that is confidential to some degree. The annual number of reported cases of a given disease is a matter of public record and is important to understand: Perhaps the trend suggests some program is making progress or that corrective measures are needed. However, in general, the names of individuals with the disease are protected by law.
As big data sets are used in more and more contexts, and as data volumes grow massively to feed machine-learning algorithms, protecting user privacy becomes an ever larger concern.
One answer is simply to do without the data. And that may be an appropriate policy response to certain types of data collection. However, it's certainly not a complete solution, given that aggregated data can improve outcomes in everything from healthcare to energy usage to traffic management. Rather, we need ways to use data that don't compromise individual privacy, especially when data is shared, such as with other researchers who wish to reproduce or extend an original study.
Anonymize the data
One conceptually simple approach is to directly anonymize the data. For example, medical images might be shared to compare the effectiveness of different diagnostic techniques; removing the patient’s name and other specific identifying information could work in such a case. (Often such data is actually pseudonymized, with a trusted party replacing personally identifiable information fields with one or more artificial identifiers or pseudonyms.)
It’s also common to generalize fields. The patient’s age or age range in our example may be relevant; their specific birthday probably isn’t. So a birthday field of 1/1/90 might simply become an age of 30, or an age range of 30-35. But surely a birthday alone isn’t an identifier! Not by itself, but one of the big challenges with anonymization is that it’s not always clear what can be used to identify someone and what can’t, especially once you start correlating with other data sources, including public ones. Even if a birthday isn’t a unique identifier, it’s one more piece of information for someone trying to narrow down the possibilities.
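To make that concrete, here is a minimal sketch in Python of pseudonymizing and generalizing a single record. The field names and the five-year age band are hypothetical choices for illustration, not part of any particular standard; a true pseudonymization scheme would also have a trusted party retain a mapping table, which this sketch omits.

```python
import uuid
from datetime import date

def pseudonymize_and_generalize(record, today=date(2020, 6, 1)):
    """Replace direct identifiers with a random pseudonym and
    coarsen the birth date into a five-year age band."""
    birth = record["birthdate"]
    age = today.year - birth.year - ((today.month, today.day) < (birth.month, birth.day))
    band_low = (age // 5) * 5
    return {
        "pseudonym": uuid.uuid4().hex,              # stands in for name, record number, etc.
        "age_band": f"{band_low}-{band_low + 4}",   # e.g. "30-34" instead of 1/1/90
        "diagnosis": record["diagnosis"],           # analytic fields are kept
    }

patient = {"name": "Jane Doe", "birthdate": date(1990, 1, 1), "diagnosis": "measles"}
print(pseudonymize_and_generalize(patient))
```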
Even combining data so that only the aggregated data is seen isn’t a panacea.
Imagine this scenario: A company runs an employee satisfaction survey that includes questions about the manager each person reports to. The aggregated results for the entire company are shared with all; managers also see what those reporting up to them answered in aggregate. There’s not much anonymity if a manager has only one person – or even just a few – reporting to them. (A common approach for this sort of situation is to show only results that are aggregated across some minimum number of people.)
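Here is a minimal sketch of that suppression rule in Python; the threshold of five and the field names are arbitrary choices for illustration.

```python
MIN_GROUP_SIZE = 5  # arbitrary illustrative threshold

def aggregate_by_manager(responses):
    """Return the average satisfaction score per manager, suppressing
    any group smaller than MIN_GROUP_SIZE to avoid exposing individuals."""
    groups = {}
    for r in responses:
        groups.setdefault(r["manager"], []).append(r["score"])

    results = {}
    for manager, scores in groups.items():
        if len(scores) >= MIN_GROUP_SIZE:
            results[manager] = sum(scores) / len(scores)
        else:
            results[manager] = None  # suppressed: too few respondents
    return results
```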
At a much larger scale, organizations like the U.S. Census Bureau have long had to deal with the challenges of publishing large numbers of tables that slice data in many different ways. A great deal of research over time has produced guidelines for working with data in this manner.
Particularly challenging today: Rather than being published in static tables, datasets are now often available in electronic form for ad hoc queries. This makes it much easier to narrow down results to one or a small number of identifiable individuals or companies by running multiple queries. Even in the absence of a single field that uniquely identifies someone, the totality of data like zip code, age, salary, homeownership, and so forth, can at a minimum narrow down the possibilities.
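A toy example, with entirely made-up records, shows how quickly a handful of quasi-identifiers can narrow things down.

```python
# Hypothetical published records: no names, only quasi-identifiers.
records = [
    {"zip": "02139", "age": 34, "salary_band": "90-100k", "homeowner": True},
    {"zip": "02139", "age": 34, "salary_band": "60-70k",  "homeowner": False},
    {"zip": "02139", "age": 51, "salary_band": "90-100k", "homeowner": True},
]

def narrow(records, **known_facts):
    """Filter to records consistent with what an attacker already knows."""
    return [r for r in records
            if all(r.get(k) == v for k, v in known_facts.items())]

# Knowing only a target's ZIP code and age leaves two candidates;
# adding homeownership (say, from public property records) leaves exactly one.
print(len(narrow(records, zip="02139", age=34)))                   # 2
print(len(narrow(records, zip="02139", age=34, homeowner=True)))   # 1
```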
Differential privacy
One of the problems with traditional anonymization methods is that it’s often not well understood how well they’re actually protecting privacy. Techniques like those described, which collectively fall under the umbrella of statistical disclosure control, are often based on intuition and empirical observation.
However, in a 2006 paper, Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam D. Smith provided a mathematical definition of the privacy loss associated with any release of data drawn from a statistical database. This relatively new approach, called differential privacy (specifically, ε-differential privacy), brings more rigor to the process of preserving privacy in statistical databases. A differentially private algorithm injects carefully calibrated random noise into a data set or into query results, in a mathematically rigorous way, to protect individual privacy.
Because the data is “fuzzed” to the extent that any given response could plausibly have been any other valid response, the results may not be quite as accurate as the raw data, depending on the technique used. But research has also shown that it’s possible to produce very accurate statistics from a database while still ensuring high levels of privacy.
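As a rough sketch of how the noise injection works in the simplest case, the classic Laplace mechanism answers a counting query by adding noise with scale 1/ε. The ε value and the data below are purely illustrative; choosing ε in practice is a policy decision, not a technical one.

```python
import random

def dp_count(records, predicate, epsilon=0.5):
    """Return a differentially private count: the true count plus Laplace
    noise with scale 1/epsilon. A count has sensitivity 1, since adding or
    removing any one person changes it by at most 1."""
    true_count = sum(1 for r in records if predicate(r))
    # The difference of two independent Exponential(epsilon) draws
    # is distributed as Laplace(0, 1/epsilon).
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

# Example: a noisy count of patients over 40 in a tiny hypothetical data set.
patients = [{"age": 34}, {"age": 51}, {"age": 47}, {"age": 29}]
print(dp_count(patients, lambda r: r["age"] > 40))
```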
Differential privacy remains an area of active research. However, the technique is already in use. For example, the Census Bureau will use differential privacy to protect the results of the 2020 census.
Fully homomorphic encryption
Other areas of research address additional privacy concerns. Fully homomorphic encryption lets a third party perform complicated processing of data without being able to see it. Homomorphic encryption is essentially a technique that extends public-key cryptography; in fact, it was first proposed shortly after the RSA cryptosystem was invented.
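The connection to RSA is easy to see: textbook (unpadded) RSA is already multiplicatively homomorphic, meaning the product of two ciphertexts decrypts to the product of the plaintexts. Here is a toy sketch with deliberately tiny, insecure parameters.

```python
# Toy RSA with tiny, insecure numbers, purely to show the homomorphic property.
p, q = 61, 53
n = p * q                  # 3233
e = 17                     # public exponent
d = 2753                   # private exponent: (e * d) % lcm(p-1, q-1) == 1

encrypt = lambda m: pow(m, e, n)
decrypt = lambda c: pow(c, d, n)

a, b = 7, 6
product_of_ciphertexts = (encrypt(a) * encrypt(b)) % n
# Decrypting the product of the ciphertexts yields the product of the plaintexts.
assert decrypt(product_of_ciphertexts) == (a * b) % n
print(decrypt(product_of_ciphertexts))  # 42
```

Fully homomorphic schemes extend this kind of property to both addition and multiplication, which is what allows arbitrary computation over encrypted data.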
This technique is very expensive computationally and is mostly not practical yet. However, if realized, it would provide an additional level of protection against data leaks when using public cloud or other service providers to analyze data sets.
Secure multi-party computation
While technically distinct from homomorphic encryption, secure multi-party computation (typically abbreviated MPC) can tackle a similar class of problems. Essentially, MPC replaces a trusted third party with a protocol, one that preserves certain security properties, such as privacy and correctness, even if some of the parties collude and maliciously attack the protocol. In general, you can think of MPC as distributing shares of cryptographic secrets among the parties doing the computation. No single party can decrypt any of the inputs, which may contain confidential data, but all have access to the aggregated outputs. In practice, a third party can also do the computation in such a way that a data analyst doesn’t have access to the inputs either.
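Here is a minimal sketch of the secret-sharing idea behind many MPC protocols, using additive sharing modulo a large prime. It illustrates the concept rather than any specific production protocol, and the payroll figures are made up.

```python
import random

MODULUS = 2**61 - 1  # a large prime; all arithmetic is done modulo this value

def share(secret, n_parties=3):
    """Split a secret into n random shares that sum to it modulo MODULUS.
    Any n-1 of the shares look uniformly random and reveal nothing alone."""
    shares = [random.randrange(MODULUS) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % MODULUS)
    return shares

def reconstruct(shares):
    return sum(shares) % MODULUS

# Three companies each split a confidential payroll total into shares.
payrolls = [1_200_000, 850_000, 2_300_000]
all_shares = [share(p) for p in payrolls]

# Each computing party sums the share it received from every company; combining
# the per-party sums reveals only the aggregate, never any individual input.
party_sums = [sum(col) % MODULUS for col in zip(*all_shares)]
print(reconstruct(party_sums))  # 4350000
```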
One example of when this technique is useful is when companies have data that they are willing to allow the government or other organization to use for some purpose – but only as long as no one else can see their particular data. This was the case with a project done by Boston University with the City of Boston regarding gender-based wage gaps. Companies were willing to participate in the study, but for legal and other reasons, they weren’t willing to share numbers in a form that others could read. MPC provided a solution.
Data is becoming ever more important for optimizing businesses and establishing appropriate government policies. However, even given the best of intentions – an assumption that is admittedly not always warranted – data can leak confidential information if it’s not handled correctly.
And handling it correctly may mean looking beyond historical rules of thumb and ad hoc approaches. While some techniques remain works in progress to greater or lesser degrees, it’s going to be increasingly important for those analyzing data to understand the new options available for their tool kit.