Evaluation of a Binary Semi-supervised Classification Technique for Probabilistic Record Linkage

D. Nasseh; J. Stausberg

doi:10.3414/ME14-01-0087


Methods Inf Med 2016; 55(02): 136-143
DOI: 10.3414/ME14-01-0087

Original Articles

Schattauer GmbH


D. Nasseh¹, J. Stausberg¹

¹IBE, Ludwig-Maximilians-Universität München, Munich, Germany

Further Information

Publication History

Received: 25 August 2014

Accepted: 25 March 2015

Publication date: 08 January 2018 (online)


Summary

Background: The process of merging data from different data sources is referred to as record linkage. A medical environment with stricter privacy protection requirements demands the transformation of clear-text attributes, such as first name or date of birth, into one-way encrypted pseudonyms. When performing automated or privacy-preserving record linkage, a binary classification may be needed to decide whether two records represent the same entity. Classification is the last of the four main phases of the record linkage process: preprocessing, indexing, matching, and classification. How the choice of binary classification technique depends on project specifications, in particular data quality, has not yet been studied extensively.
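The transformation of clear-text attributes into one-way pseudonyms can be illustrated with a keyed hash. This is a minimal sketch and not the scheme used in the study: the choice of HMAC-SHA-256, the normalization step, the helper names `normalize` and `pseudonymize`, and the example key are all assumptions made for illustration.

```python
import hashlib
import hmac
import unicodedata


def normalize(value: str) -> str:
    """Hypothetical preprocessing: trim, uppercase, and drop diacritics."""
    decomposed = unicodedata.normalize("NFKD", value.strip().upper())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Map a clear-text attribute to a one-way pseudonym (hex digest)."""
    message = normalize(value).encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()


# Illustrative key only; a real deployment would manage keys securely.
KEY = b"example-project-key"

record_a = {"first_name": "Jürgen ", "date_of_birth": "1970-01-31"}
record_b = {"first_name": "jurgen", "date_of_birth": "1970-01-31"}

pseudo_a = {field: pseudonymize(v, KEY) for field, v in record_a.items()}
pseudo_b = {field: pseudonymize(v, KEY) for field, v in record_b.items()}

# Equal normalized values give equal pseudonyms, so exact comparison still works.
agreement = {field: pseudo_a[field] == pseudo_b[field] for field in pseudo_a}
print(agreement)
```

With a hash of this kind, a single differing character produces an unrelated pseudonym, so only exact agreement per attribute can be observed. Tolerating typing errors on encrypted data requires other encodings.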

Objectives: The aim of this work is to introduce and evaluate an automatable semi-supervised binary classification system for record linkage that is capable of competing with or even surpassing advanced automated techniques from the domain of unsupervised classification.

Methods: This work describes the rationale leading to the model, the final implementation of an automatable semi-supervised binary classification system, and the comparison of its classification performance with that of an advanced active learning approach from the domain of unsupervised learning. The performance of both systems was measured on a broad variety of artificial test sets (n = 400), based on real patient data, each with distinct and unique characteristics.
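The abstract does not specify the model itself, so the following is only a generic illustration of what semi-supervised binary classification of record pairs can look like. It is not the system evaluated in the study. It assumes that each candidate pair has a one-dimensional match weight, that a few pairs have been labeled by hand, and it uses simple self-training with a nearest-centroid rule. The function names, the `margin` parameter, and all numbers are hypothetical.

```python
from statistics import mean


def centroid_threshold(labeled):
    """Midpoint between the mean weight of matches and of non-matches.

    `labeled` must contain at least one pair of each class.
    """
    match_mean = mean(w for w, is_match in labeled if is_match)
    non_match_mean = mean(w for w, is_match in labeled if not is_match)
    return (match_mean + non_match_mean) / 2


def self_train(labeled, unlabeled, margin=1.0, max_rounds=20):
    """Grow the labeled set with confidently classified unlabeled pairs."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        threshold = centroid_threshold(labeled)
        # Pseudo-label only weights that lie clearly on one side.
        confident = [w for w in pool if abs(w - threshold) >= margin]
        if not confident:
            break
        labeled += [(w, w > threshold) for w in confident]
        pool = [w for w in pool if abs(w - threshold) < margin]
    return centroid_threshold(labeled)


# Hypothetical data: two hand-labeled pairs and eight unlabeled match weights.
seeds = [(12.0, True), (-8.0, False)]
weights = [10.5, 11.0, 9.0, -7.0, -6.5, -9.0, 2.5, 1.0]

threshold = self_train(seeds, weights)
print(f"decision threshold: {threshold:.4f}")
```

Pairs with a weight above the returned threshold would be classified as matches, all others as non-matches. The unlabeled weights shift the threshold away from the value implied by the two seeds alone, which is the basic idea behind using unlabeled data.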

Results: The classification performance of both methods, measured as F-measure, was close on test sets with the maximum defined data quality: 0.996 for semi-supervised classification and 0.993 for unsupervised classification. It diverged incrementally on test sets of worse data quality, dropping to 0.964 for semi-supervised classification and 0.803 for unsupervised classification.
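For reference, the F-measure is commonly defined as the harmonic mean of precision and recall. The sketch below assumes the balanced variant (F1) and uses made-up counts. The abstract does not state which variant was used or which counts underlie the reported values.

```python
def f_measure(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Balanced F-measure (F1) from pair-level classification counts."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only, not taken from the study.
score = f_measure(true_pos=90, false_pos=10, false_neg=30)
print(round(score, 3))
```

True non-matches do not enter the formula, which makes the measure suitable for record linkage, where non-matching pairs usually far outnumber matching ones.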

Conclusions: Aside from supplying a viable model for semi-supervised classification in automated probabilistic record linkage, the tests conducted on a large number of test sets suggest that semi-supervised techniques might generally be capable of outperforming unsupervised techniques, especially on data with lower levels of data quality.

Keywords

Medical record linkage - classification

