Pseudonymization

Pseudonymization is a procedure by which the most identifying fields within a data record are replaced by one or more artificial identifiers, or pseudonyms. There can be a single pseudonym for a collection of replaced fields or a pseudonym per replaced field. The purpose is to render the data record less identifying and therefore lower customer or patient objections to its use. Data in this form is suitable for extensive analytics and processing.
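The two options described above (one pseudonym per replaced field, or a single pseudonym for a collection of fields) can be sketched as follows. This is only an illustration under stated assumptions: the use of a keyed hash (HMAC-SHA256), the key value, the field names and the function names are all hypothetical and are not prescribed by any particular pseudonymization scheme.

```python
import hashlib
import hmac

# Hypothetical secret key; assumed to be stored separately from the data.
SECRET_KEY = b"example-key-kept-apart-from-the-data"


def pseudonym(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable artificial identifier derived from value with a keyed hash."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # shortened for readability in this sketch


def pseudonymize_per_field(record: dict, fields: list) -> dict:
    """Replace each listed field with its own pseudonym."""
    out = dict(record)
    for name in fields:
        # The field name is mixed in so equal values in different fields differ.
        out[name] = pseudonym(name + ":" + str(record[name]))
    return out


def pseudonymize_combined(record: dict, fields: list) -> dict:
    """Replace the listed fields with one pseudonym covering all of them."""
    out = {k: v for k, v in record.items() if k not in fields}
    out["pseudonym"] = pseudonym("|".join(str(record[f]) for f in fields))
    return out


record = {"nhs_number": "943 476 5919", "name": "Jane Doe", "attendance": "2014-03-02"}
print(pseudonymize_per_field(record, ["nhs_number", "name"]))
print(pseudonymize_combined(record, ["nhs_number", "name"]))
```

In both variants the non-identifying field (here the attendance date) is left untouched, so the record remains usable for analysis.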

The choice of which data fields to pseudonymize is partly subjective, but it should include all highly selective fields, such as the NHS number in the UK. Less selective fields, such as Birth Date or Postal Code, are often also included because they are usually available from other sources and therefore make a record easier to identify. Pseudonymizing these less identifying fields removes most of their analytic value and should therefore be accompanied by the introduction of new derived and less identifying forms, such as Year of Birth or a larger Postal Code region.
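Deriving less identifying forms of Birth Date and Postal Code might look like the sketch below. The field names and the `generalize` function are hypothetical, and the postal region is assumed to be the first part of a UK-style postcode written with a space (for example "SW1A" from "SW1A 1AA"); other postal systems would need a different rule.

```python
from datetime import date


def generalize(record: dict) -> dict:
    """Replace birth date and postal code with coarser derived fields."""
    out = dict(record)
    birth = out.pop("birth_date")
    out["year_of_birth"] = birth.year
    postcode = out.pop("postal_code")
    # Assumption: the part before the space denotes a larger postal region.
    out["postal_region"] = postcode.split()[0]
    return out


record = {
    "birth_date": date(1980, 5, 17),
    "postal_code": "SW1A 1AA",
    "diagnosis": "asthma",
}
print(generalize(record))
```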

Data fields that are less identifying, such as Date of Attendance, are usually not pseudonymized. It is important to realize that this is because too much statistical utility is lost in doing so, not because the data cannot be identified. For example, given prior knowledge of a few attendance dates, it is easy to identify someone's data in a pseudonymized dataset by selecting only those people with that pattern of dates. This is an example of an Inference attack.
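The attendance-date attack described above can be illustrated with a small sketch. The dataset, the pseudonyms and the function name are invented for illustration; the point is only that a few known dates may be enough to single out one pseudonym.

```python
def matching_pseudonyms(attendances, known_dates):
    """Return the pseudonyms whose attendance dates include every known date."""
    dates_by_pseudonym = {}
    for pseudonym, attended_on in attendances:
        dates_by_pseudonym.setdefault(pseudonym, set()).add(attended_on)
    known = set(known_dates)
    return sorted(p for p, dates in dates_by_pseudonym.items() if known <= dates)


# Hypothetical pseudonymized dataset: (pseudonym, date of attendance).
ATTENDANCES = [
    ("a1", "2014-01-03"), ("a1", "2014-02-10"), ("a1", "2014-03-21"),
    ("b2", "2014-01-03"), ("b2", "2014-02-11"),
    ("c3", "2014-02-10"), ("c3", "2014-03-21"),
]

# An attacker who knows two of a person's attendance dates narrows the
# candidates to the pseudonyms that share that pattern.
print(matching_pseudonyms(ATTENDANCES, ["2014-01-03", "2014-02-10"]))
```

With a single known date two pseudonyms remain candidates; adding a second date leaves only one, after which every other field in that person's records is exposed.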

The vulnerability of pseudonymized data to Inference attacks is commonly overlooked. A famous example is the AOL search data scandal. It illustrates that there is no way to universally protect pseudonymized data whilst allowing general analysis of it.

Protecting statistically useful pseudonymized data from re-identification requires:

  1. a sound Information security base
  2. controlling the risk that the analysts, researchers or other data workers cause a privacy breach

The pseudonym allows data to be tracked back to its origins, which distinguishes pseudonymization from anonymization, where all person-related data that could allow backtracking has been purged (a more precise distinction is given in [1]). Pseudonymization is relevant, for example, to patient-related data that has to be passed on securely between clinical centers.
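One way to picture the ability to track data back is a lookup table held by whoever issues the pseudonyms. The sketch below is an assumption-laden illustration, not a description of any specific system: the `PseudonymTable` class and its method names are hypothetical, and pseudonyms are simply random tokens.

```python
import secrets


class PseudonymTable:
    """Issues random pseudonyms and keeps the mapping needed to track back."""

    def __init__(self):
        self._forward = {}  # identifier -> pseudonym
        self._reverse = {}  # pseudonym -> identifier

    def pseudonym_for(self, identifier: str) -> str:
        """Return the existing pseudonym for identifier, or issue a new one."""
        if identifier not in self._forward:
            token = secrets.token_hex(8)
            while token in self._reverse:  # extremely unlikely collision
                token = secrets.token_hex(8)
            self._forward[identifier] = token
            self._reverse[token] = identifier
        return self._forward[identifier]

    def reidentify(self, pseudonym: str) -> str:
        """Track a pseudonym back to the original identifier."""
        return self._reverse[pseudonym]


table = PseudonymTable()
p = table.pseudonym_for("943 476 5919")
print(p, "->", table.reidentify(p))
```

Destroying the table removes this route back to the original identifier, although the remaining fields may still allow re-identification by inference.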

Tools that enable users to pseudonymize their own data have recently been introduced,[2] but they have not gained acceptance among users or succeeded in the market. This suggests that pseudonymization remains a machine process rather than a user task.

One application of pseudonymization is the creation of datasets for De-identification research by replacing identifying words with words from the same category (e.g. replacing a name with a random name from a names dictionary).[3][4][5] In this case, however, it is in general not possible to track data back to its origins.
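Replacement with words from the same category can be sketched as below. This is a simplified illustration, not the method of the cited works: the names dictionary is a short hypothetical list, and the identifying words are assumed to have been found already (in practice they come from annotation or an automated detector).

```python
import random
import re

# Hypothetical names dictionary used as the source of surrogate names.
NAMES = ["Alice", "Bob", "Carol", "David"]


def replace_names(text, known_names, rng):
    """Replace every occurrence of a known name with a random dictionary name."""
    if not known_names:
        return text
    alternatives = "|".join(re.escape(name) for name in known_names)
    pattern = re.compile(r"\b(" + alternatives + r")\b")
    # No mapping from surrogate to original is stored, so the replacement
    # cannot be undone from the output alone.
    return pattern.sub(lambda match: rng.choice(NAMES), text)


note = "Patient John Smith seen by Dr. Jones."
print(replace_names(note, ["John", "Smith", "Jones"], random.Random()))
```

Because each occurrence is replaced independently and nothing is recorded, the same original name may receive different surrogates, which is one reason the data cannot be tracked back.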

References

  1. Anonymity, Unlinkability, Undetectability, Unobservability, Pseudonymity, and Identity Management – A Consolidated Proposal for Terminology, http://dud.inf.tu-dresden.de/literatur/Anon_Terminology_v0.31.pdf
  2. Rawassizadeh, R., Heurix, J., Khosravipour, S., & Tjoa, A. M. (2011, August). LiDSec-A Lightweight Pseudonymization Approach for Privacy-Preserving Publishing of Textual Personal Information. In Availability, Reliability and Security (ARES), 2011 Sixth International Conference on (pp. 603-608). IEEE.
  3. Ishna Neamatullah, Margaret M Douglass, Li-wei H Lehman, Andrew Reisner, Mauricio Villarroel, William J Long, Peter Szolovits, George B Moody, Roger G Mark and Gari D Clifford, Automated de-identification of free-text medical records, BMC Medical Informatics and Decision Making 2008, 8:32, http://www.biomedcentral.com/1472-6947/8/32
  4. Ishna Neamatullah, Automated De-Identification of Free-Text Medical Records, http://www.physionet.org/physiotools/deid/doc/ishna-meng-thesis.pdf
  5. Deleger L et al. Preparing an annotated gold standard corpus to share with extramural investigators for de-identification research. J Biomed Inform (2014), http://dx.doi.org/10.1016/j.jbi.2014.01.014