Driving value from personal data is often framed as a balancing act between robust data privacy and creating business insights, but this is a false dichotomy. One approach for forging a middle path is to secure data and share it through fine-grained controls. This allows organizations to capitalize on the insights derived from sensitive data (both PII and other varieties) while mitigating the risk that comes with making that data available. We can frame this approach as a risk decision. In contrast, some organizations pit data privacy against driving business value and reduce the choice to two extremes: prioritizing either data privacy or the business. These two extremes represent the most and least risk-averse positions, and they highlight how securing personal data and creating business value are each incomplete without the other:
- The most risk-averse approach to sensitive data entails keeping it under lock and key or simply not collecting it in the first place. While this eliminates many concerns about the improper use of data or violating privacy laws, it guarantees that the data delivers zero additional business value. Unused data of this kind is at best an expense without ROI and at worst a liability in the event of a data breach. This extreme shows that when personal data is collected, it should provide some business value.
- Unfettered access across the whole organization represents the least risk-averse approach to PII. While this eliminates friction around data access for deriving additional business value from PII, it exposes the organization to both ethical and legislative privacy violations. In this case, even if the personal data is encrypted at rest, all of it can still be accessed through a single compromised credential. This approach is even less realistic than the first, but it exists as an extreme to highlight the dire need for some protections on personal information.
The gap between these two extremes becomes bridgeable through a risk-decision approach, which relies on the fact that sensitive data is often only a handful of columns that can be secured through targeted measures. Unlike general encryption, the risk-decision approach can protect data both at rest and in use. There are several methods that can be applied to the sensitive columns to produce privacy-enhanced data sets, and the best method will depend on the business’s needs and the risk of potential reidentification. A few of the most popular fine-grained controls include:
- De-identification: This is the most popular route because it can be tailored to the specific needs of the company to balance data utility and privacy when the data is accessed. De-identification centers on removing or obscuring identifiers, for example by masking an individual’s birth date to show only the year. In this example, de-identification allows the business to still use an individual’s birth year without needing to provide wider access to the more sensitive PII of a full birth date.
- Anonymization: This is an irreversible approach that leaves data unable to identify any individual, directly or indirectly, and it can be implemented through a wide range of methods. One example of anonymization in practice is replacing someone’s birth date with an age range. This method, known as generalization, lets the business use an approximate age (e.g. 28-35 years old) without accessing PII while still serving needs such as certain types of market segmentation.
- Pseudonymization: This method is recommended by the European Data Protection Board (EDPB) and has been supported by the Court of Justice of the European Union (CJEU). The process replaces sensitive data with a type-, length-, and format-preserving “token” or pseudonym that is unreadable to viewers. The process can be reversed to restore the original value from the token. A tokenization system can be used to de-identify data by replacing identifiers and quasi-identifiers with pseudonyms. The system maintains referential integrity (e.g. a token for Bob Smith is consistent across systems), which allows joins and analytical operations to be performed. Because the underlying data can be restored, data sets maintain their data utility. The most common use case for this method is to de-risk data lakes, allowing wider access for analysis because PII is inherently kept private.
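To make the three controls above concrete, here is a minimal Python sketch under some loud assumptions. The function and class names (`mask_birth_date`, `generalize_age`, `TokenVault`) are hypothetical, the age buckets use a fixed width rather than a custom range like 28-35, and the toy tokenizer is an in-memory lookup table that does not preserve type, length, or format the way a real tokenization system would. It only illustrates masking, generalization, and reversible, consistent tokens.

```python
import secrets
from datetime import date


def mask_birth_date(birth_date: date) -> int:
    """De-identification by masking: expose only the birth year."""
    return birth_date.year


def generalize_age(birth_date: date, today: date, width: int = 10) -> str:
    """Generalization: replace a birth date with a fixed-width age range."""
    # Subtract one year if this year's birthday has not happened yet.
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    age = today.year - birth_date.year - (0 if had_birthday else 1)
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


class TokenVault:
    """Toy reversible tokenizer (hypothetical, in-memory, not production-grade).

    The same input always maps to the same token, so joins on tokenized
    columns still work; only the vault can map a token back to its value.
    """

    def __init__(self) -> None:
        self._token_by_value: dict[str, str] = {}
        self._value_by_token: dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        if value not in self._token_by_value:
            # Random token, so it reveals nothing about the original value.
            token = "tok_" + secrets.token_hex(8)
            self._token_by_value[value] = token
            self._value_by_token[token] = value
        return self._token_by_value[value]

    def detokenize(self, token: str) -> str:
        return self._value_by_token[token]
```

In this sketch, masking and generalization discard information for good, while the vault keeps a mapping that lets an authorized caller restore the original value. That difference is what separates the irreversible controls from pseudonymization.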
One effective approach is to centralize policy but decentralize enforcement in a way that is tailored to each business unit. An example of this could be generating synthetic data for one branch to enhance privacy and model performance while providing masked data for another so that it can verify a client’s identity by checking a select piece of personal information, such as a birth year.
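One way to picture centralized policy with decentralized enforcement is a single policy table that each business unit applies to its own reads. The sketch below is illustrative only: the unit names, column names, and the `POLICY` and `enforce` identifiers are all hypothetical, and simple redaction stands in for the synthetic-data case, which is well beyond a short example.

```python
from datetime import date
from typing import Any, Callable

Record = dict[str, Any]
Treatment = Callable[[Any], Any]


def birth_year_only(value: date) -> int:
    """Mask a full birth date down to the year."""
    return value.year


def redact(_value: Any) -> str:
    """Hide the value entirely."""
    return "[REDACTED]"


def passthrough(value: Any) -> Any:
    """Leave non-sensitive columns unchanged."""
    return value


# Central policy: one table states how each unit may see each sensitive column.
POLICY: dict[str, dict[str, Treatment]] = {
    "identity_verification": {"birth_date": birth_year_only},
    "modeling": {"name": redact, "birth_date": redact},
}


def enforce(record: Record, unit: str) -> Record:
    """Local enforcement: apply the central policy for one business unit."""
    rules = POLICY[unit]
    return {col: rules.get(col, passthrough)(val) for col, val in record.items()}
```

Columns not listed in the policy pass through unchanged here, which keeps the sketch short. A stricter design could redact by default and require each column to be allowed explicitly.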
These techniques allow us to trade small levels of specificity for meaningful advances in privacy. Embracing fine-grained controls creates more nuanced choices and allows us to reject the false opposition between protecting data privacy and creating business value. Instead, this approach invites us to look at these choices as a risk decision.
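The trade between specificity and privacy can be made visible with a rough uniqueness check over quasi-identifiers. The sketch below uses invented toy rows and a hypothetical helper, `count_unique`, which counts the records whose quasi-identifier combination appears exactly once. It is a simple illustration, not a formal reidentification-risk measure.

```python
from collections import Counter
from datetime import date
from typing import Any, Callable

Row = dict[str, Any]


def count_unique(rows: list[Row], key: Callable[[Row], tuple]) -> int:
    """Count rows whose quasi-identifier combination is shared by no other row."""
    counts = Counter(key(row) for row in rows)
    return sum(1 for row in rows if counts[key(row)] == 1)


# Invented toy data for illustration only.
rows = [
    {"zip": "02139", "gender": "F", "birth_date": date(1990, 6, 15)},
    {"zip": "02139", "gender": "F", "birth_date": date(1990, 11, 2)},
    {"zip": "02139", "gender": "M", "birth_date": date(1985, 1, 20)},
    {"zip": "02139", "gender": "M", "birth_date": date(1985, 7, 4)},
]


def full_key(row: Row) -> tuple:
    """ZIP code, gender, and full birth date."""
    return (row["zip"], row["gender"], row["birth_date"])


def coarse_key(row: Row) -> tuple:
    """Same columns, but with the birth date masked to the year."""
    return (row["zip"], row["gender"], row["birth_date"].year)


unique_before = count_unique(rows, full_key)
unique_after = count_unique(rows, coarse_key)
```

In these toy rows every record is unique on the full birth date and none is unique once the date is reduced to a year. Real data will rarely be this clean, but the same count can help compare candidate treatments.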
For more on the topic, here are some related articles, studies, and datasets:
- “EDPB Guidelines - Data Protection by Design and by Default”, with the relevant passages for “pseudonymization” on pages 6, 19, 20, and 24
- “EU General Court Clarifies When Pseudonymized Data is Considered Personal Data”, an article by Inside Privacy on the CJEU ruling
- “Security & Privacy Considerations in Artificial Intelligence & Machine Learning — Part-6: Up close with Privacy” in Towards Data Science
- “Privacy-Enhancing Cryptography to Complement Differential Privacy”, a blog post by the National Institute of Standards and Technology (NIST)
- “Simple Demographics Often Identify People Uniquely”, a project by the Data Privacy Lab outlining how individuals can often be identified by combining their ZIP code, gender, and birth date
- Example datasets from Dunnhumby created by IIA’s Brian Sampsel using some of the techniques mentioned above
This blog was inspired by discussions within IIA’s Analytics Leadership Consortium (ALC), a private network of senior analytics executives across diverse industries. ALC members meet monthly to share best practices, explore analytics innovation, and grow as leaders through confidential peer-to-peer exchange. You can learn more about the ALC here.