Quasi-identifier

Quasi-identifiers are pieces of information that are not unique identifiers in themselves, but are sufficiently well correlated with an entity that they can be combined with other quasi-identifiers to create a unique identifier.[1]

Quasi-identifiers can thus, when combined, become personally identifying information. This process is called re-identification. As an example, Latanya Sweeney has shown that although gender, birth date and postal code do not individually identify a person, the combination of all three is sufficient to identify 87% of individuals in the United States.[2]
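The effect of combining quasi-identifiers can be illustrated with a small sketch. The records below are invented for illustration and are not drawn from Sweeney's study; the function name unique_fraction is likewise an assumption. The sketch measures what fraction of records is singled out by a given set of fields.

```python
from collections import Counter

# Hypothetical toy records (invented values). No single field is unique
# to any one record.
records = [
    {"gender": "F", "birth_date": "1970-01-01", "postal_code": "02138"},
    {"gender": "F", "birth_date": "1970-01-01", "postal_code": "02139"},
    {"gender": "M", "birth_date": "1970-01-01", "postal_code": "02138"},
    {"gender": "M", "birth_date": "1985-06-15", "postal_code": "02139"},
    {"gender": "F", "birth_date": "1985-06-15", "postal_code": "02139"},
    {"gender": "F", "birth_date": "1985-06-15", "postal_code": "02139"},
]


def unique_fraction(rows, fields):
    """Return the fraction of rows whose values on `fields` match no other row."""
    keys = [tuple(row[f] for f in fields) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)


# Each field on its own singles out nobody in this toy data set.
for field in ("gender", "birth_date", "postal_code"):
    print(field, unique_fraction(records, [field]))

# Combined, the three fields single out four of the six records.
print("combined", unique_fraction(records, ["gender", "birth_date", "postal_code"]))
```

In this toy data set each field alone leaves every record indistinguishable from at least one other, while the three fields together isolate four of the six records.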

The term was introduced by Tore Dalenius in 1986.[3] Since then, quasi-identifiers have been the basis of several attacks on released data. For instance, Sweeney linked health records to publicly available information to locate the hospital records of the then-governor of Massachusetts, using a combination of quasi-identifiers that identified him uniquely,[4][5] and Sweeney, Abu and Winn used public voter records to re-identify participants in the Personal Genome Project.[6] Additionally, Arvind Narayanan and Vitaly Shmatikov made use of quasi-identifiers to de-anonymize data released by Netflix.[7]
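Attacks of this kind generally work by linking a released data set to a second, identified data set on the quasi-identifiers they share. The following is a minimal sketch of such a linkage, not a reconstruction of any of the cited studies: the data, the field names and the function link are all hypothetical.

```python
from collections import defaultdict

# Assumed set of quasi-identifiers shared by both data sets.
QUASI_IDENTIFIERS = ("gender", "birth_date", "postal_code")

# Hypothetical released records: names removed, quasi-identifiers kept.
released = [
    {"gender": "M", "birth_date": "1950-03-02", "postal_code": "02138", "diagnosis": "A"},
    {"gender": "F", "birth_date": "1962-11-20", "postal_code": "02139", "diagnosis": "B"},
]

# Hypothetical public list that carries names alongside the same fields.
public = [
    {"name": "Person One", "gender": "M", "birth_date": "1950-03-02", "postal_code": "02138"},
    {"name": "Person Two", "gender": "F", "birth_date": "1962-11-20", "postal_code": "02139"},
    {"name": "Person Three", "gender": "F", "birth_date": "1962-11-20", "postal_code": "02139"},
]


def link(released_rows, public_rows, fields=QUASI_IDENTIFIERS):
    """Pair each released row with a name when exactly one public row matches."""
    index = defaultdict(list)
    for row in public_rows:
        index[tuple(row[f] for f in fields)].append(row["name"])

    matches = []
    for row in released_rows:
        candidates = index.get(tuple(row[f] for f in fields), [])
        if len(candidates) == 1:  # an unambiguous match re-identifies the record
            matches.append((candidates[0], row["diagnosis"]))
    return matches


print(link(released, public))
```

Here only the first released record is re-identified. The second matches two people in the public list, so the linkage is ambiguous and the record is left unresolved.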

Motwani and Xu warn that the publication of large volumes of government and business data containing quasi-identifiers could enable privacy breaches.[8]

References

  1. "Glossary of Statistical Terms: Quasi-identifier". OECD. November 10, 2005. Retrieved 29 September 2013.
  2. Sweeney, Latanya. Simple demographics often identify people uniquely. Carnegie Mellon University, 2000. http://dataprivacylab.org/projects/identifiability/paper1.pdf
  3. Dalenius, Tore. Finding a Needle In a Haystack or Identifying Anonymous Census Records. Journal of Official Statistics, Vol.2, No.3, 1986. pp. 329–336. http://www.jos.nu/Articles/abstract.asp?article=23329
  4. Anderson, Nate. Anonymized data really isn’t—and here’s why not. Ars Technica, 2009. http://arstechnica.com/tech-policy/2009/09/your-secrets-live-online-in-databases-of-ruin/
  5. Barth-Jones, Daniel C. The 're-identification' of Governor William Weld's medical information: a critical re-examination of health data identification risks and privacy protections, then and now. June 4, 2012.
  6. Sweeney, Latanya, Akua Abu, and Julia Winn. "Identifying participants in the personal genome project by name." Available at SSRN 2257732 (2013).
  7. Narayanan, Arvind and Shmatikov, Vitaly. Robust De-anonymization of Large Sparse Datasets. The University of Texas at Austin, 2008. https://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf
  8. Rajeev Motwani and Ying Xu (2007). Efficient Algorithms for Masking and Finding Quasi-Identifiers (PDF). Proceedings of the Conference on Very Large Data Bases (VLDB).

See also