Protecting Genomic Sequence Anonymity with Generalization Lattices

B. A. Malin

doi:10.1055/s-0038-1634025

Methods Inf Med 2005; 44(05): 687-692
DOI: 10.1055/s-0038-1634025

Original Article

Schattauer GmbH

¹Data Privacy Laboratory, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA

Publication History

Received: 13 October 2004

Accepted: 28 April 2005

Publication Date: 07 February 2018 (online)

Summary

Objectives: Current genomic privacy technologies assume the identity of genomic sequence data is protected if personal information, such as demographics, is obscured, removed, or encrypted. While demographic features can directly compromise an individual’s identity, recent research demonstrates such protections are insufficient because sequence data itself is susceptible to re-identification. To counteract this problem, we introduce an algorithm for anonymizing a collection of person-specific DNA sequences.

Methods: The technique is termed DNA lattice anonymization (DNALA), and is based upon the formal privacy protection schema of k-anonymity. Under this model, it is impossible to observe or learn features that distinguish one genetic sequence from k-1 other entries in a collection. To maximize information retained in protected sequences, we incorporate a concept generalization lattice to learn the distance between two residues in a single nucleotide region. The lattice provides the most similar generalized concept for two residues (e.g. adenine and guanine are both purines).
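The lattice idea in the Methods paragraph can be illustrated with a minimal Python sketch. It is not the published DNALA implementation. It assumes a small three-level lattice (single bases, then purine R or pyrimidine Y, then any base N, following IUPAC ambiguity codes), and a distance defined as the number of lattice levels the two residues climb to reach their shared concept. The lattice shape, the distance definition, and the function names `generalize`, `generalize_pair`, and `is_k_anonymous` are all illustrative assumptions.

```python
from collections import Counter

# Assumed lattice: base -> purine (R) or pyrimidine (Y) -> any base (N).
PARENT = {"A": "R", "G": "R", "C": "Y", "T": "Y", "R": "N", "Y": "N", "N": None}


def ancestors(code):
    """Return the code followed by its successively more general concepts."""
    chain = []
    while code is not None:
        chain.append(code)
        code = PARENT[code]
    return chain


def generalize(x, y):
    """Return (most specific shared concept, levels climbed by x plus y)."""
    chain_y = ancestors(y)
    for steps_x, concept in enumerate(ancestors(x)):
        if concept in chain_y:
            return concept, steps_x + chain_y.index(concept)
    raise ValueError(f"no shared concept for {x!r} and {y!r}")


def generalize_pair(seq1, seq2):
    """Generalize two aligned sequences position by position.

    Returns the generalized sequence and the summed (hypothetical) distance.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to equal length")
    concepts, total = [], 0
    for a, b in zip(seq1, seq2):
        concept, cost = generalize(a, b)
        concepts.append(concept)
        total += cost
    return "".join(concepts), total


def is_k_anonymous(records, k):
    """True if every released record is identical to at least k-1 others."""
    return all(count >= k for count in Counter(records).values())


released, cost = generalize_pair("ACGT", "GCTT")
print(released, cost)  # RCNT 6
print(is_k_anonymous([released, released], 2))  # True
```

In this sketch, releasing the same generalized sequence for both members of a pair gives k = 2: adenine and guanine generalize to the purine code R at a lower cost than guanine and thymine, which only share the top concept N.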

Results: The method is tested and evaluated with several publicly available human population datasets ranging in size from 30 to 400 sequences. Our findings imply the anonymization schema is feasible for the protection of sequence privacy.

Conclusions: The DNALA method is the first computational disclosure control technique for general DNA sequences. Given the computational nature of the method, guarantees of anonymity can be formally proven. There is room for improvement and validation, though this research provides the groundwork from which future researchers can construct genomics anonymization schemas tailored to specific data-sharing scenarios.

Keywords

Privacy - anonymity - databases - genetic variation - genomic data - sequence analysis
