Methods Inf Med 2005; 44(05): 687-692
DOI: 10.1055/s-0038-1634025
Original Article
Schattauer GmbH

Protecting Genomic Sequence Anonymity with Generalization Lattices

B. A. Malin
1   Data Privacy Laboratory, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA

Publication History

Received: 13 October 2004

Accepted: 28 April 2005

Publication Date: 07 February 2018 (online)

Summary

Objectives: Current genomic privacy technologies assume that the identity behind genomic sequence data is protected when personal information, such as demographics, is obscured, removed, or encrypted. While demographic features can directly compromise an individual’s identity, recent research demonstrates that such protections are insufficient because sequence data itself is susceptible to re-identification. To counteract this problem, we introduce an algorithm for anonymizing a collection of person-specific DNA sequences.

Methods: The technique is termed DNA lattice anonymization (DNALA) and is based upon the formal privacy protection schema of k-anonymity. Under this model, it is impossible to observe or learn features that distinguish one genetic sequence from at least k-1 other entries in a collection. To maximize the information retained in protected sequences, we incorporate a concept generalization lattice to learn the distance between two residues in a single nucleotide region. The lattice provides the most similar generalized concept for two residues (e.g., adenine and guanine are both purines).
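
As an illustration of the lattice idea only, the following minimal Python sketch generalizes two aligned sequences to a shared form. It rests on assumptions not stated in this summary: the lattice is taken to be the IUPAC nucleotide ambiguity codes ordered by set inclusion, and the distance is taken to be the number of lattice levels both residues climb to reach their common concept. The names `generalize`, `distance`, `generalize_pair`, and `is_k_anonymous` are hypothetical, and the published DNALA lattice, distance, and pairing procedure may differ.

```python
from collections import Counter

# Assumed lattice: IUPAC ambiguity codes ordered by set inclusion.
# Keys are sorted base sets; values are the one-letter codes.
CODE_FOR = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "AC": "M", "GT": "K", "CG": "S", "AT": "W",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}
BASES_OF = {code: frozenset(bases) for bases, code in CODE_FOR.items()}


def generalize(x, y):
    """Return the most specific code that covers both residues."""
    union = BASES_OF[x] | BASES_OF[y]
    return CODE_FOR["".join(sorted(union))]


def distance(x, y):
    """Lattice levels both residues climb to reach the shared concept (assumed metric)."""
    top = len(BASES_OF[generalize(x, y)])
    return (top - len(BASES_OF[x])) + (top - len(BASES_OF[y]))


def generalize_pair(s, t):
    """Generalize two aligned, equal-length sequences position by position."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return "".join(generalize(a, b) for a, b in zip(s, t))


def is_k_anonymous(sequences, k):
    """True if every released sequence appears at least k times."""
    return all(n >= k for n in Counter(sequences).values())


shared = generalize_pair("ACGT", "GCTT")
print(shared)                                  # RCKT
print(is_k_anonymous([shared, shared], k=2))   # True
```

Here adenine and guanine at the first position generalize to R (purine), matching the example in the text, and releasing the shared form for both records makes the pair indistinguishable, which is the k = 2 case of k-anonymity.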

Results: The method is tested and evaluated with several publicly available human population datasets ranging in size from 30 to 400 sequences. Our findings suggest the anonymization schema is feasible for the protection of sequence privacy.

Conclusions: The DNALA method is the first computational disclosure control technique for general DNA sequences. Given the computational nature of the method, guarantees of anonymity can be formally proven. There is room for improvement and further validation, but this research provides the groundwork from which future researchers can construct genomic anonymization schemas tailored to specific data-sharing scenarios.

 