Methods Inf Med 2016; 55(02): 136-143
DOI: 10.3414/ME14-01-0087
Original Articles
Schattauer GmbH

Evaluation of a Binary Semi-supervised Classification Technique for Probabilistic Record Linkage

D. Nasseh 1, J. Stausberg 1
1   IBE, Ludwig-Maximilians-Universität München, Munich, Germany

Publication History

received: 25 August 2014

accepted: 25 March 2015

Publication date:
08 January 2018 (online)

Summary

Background: Record linkage is the process of merging data from different data sources. In a medical environment with heightened privacy-protection requirements, clear-text attributes such as first name or date of birth must be transformed into one-way encrypted pseudonyms. Automated or privacy-preserving record linkage may then require a binary classification that decides whether two records refer to the same entity. Classification is the last of the four main phases of the record linkage process: preprocessing, indexing, matching, and classification. How the choice of binary classification technique depends on project specifications, in particular data quality, has not yet been studied extensively.
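To make the idea of a one-way pseudonym concrete, the following is a minimal Python sketch. It is not the authors' procedure: the summary does not name a hash function or a normalization scheme, so the use of HMAC-SHA-256, the lowercase/strip normalization, and the names `pseudonymize` and `secret_key` are all assumptions made for illustration.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Return a one-way pseudonym for a clear-text attribute.

    Assumption: HMAC-SHA-256 with a shared secret key stands in for the
    'one-way encrypted pseudonym' mentioned in the text; the actual
    scheme is not specified there.
    """
    # Hypothetical normalization so that trivial spelling differences
    # (case, surrounding whitespace) map to the same pseudonym.
    normalized = value.strip().lower()
    digest = hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


# Illustrative values only; the key and the names are made up.
key = b"example-key-not-for-production"
first = pseudonymize("Maria ", key)
second = pseudonymize("maria", key)
other = pseudonymize("Marie", key)
```

In this sketch `first` and `second` are identical, while `other` differs completely even though the names differ by a single letter. Exact hashing of this kind therefore hides small typing errors from the matching phase, which is one reason data quality matters for the later classification.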

Objectives: This work introduces and evaluates an automatable semi-supervised binary classification system for record linkage that is capable of competing with, or even surpassing, advanced automated techniques from the domain of unsupervised classification.

Methods: This work describes the rationale leading to the model, the final implementation of an automatable semi-supervised binary classification system, and the comparison of its classification performance with that of an advanced active learning approach from the domain of unsupervised learning. The performance of both systems was measured on a broad variety of artificial test sets (n = 400) that were based on real patient data and had distinct and unique characteristics.

Results: On test sets with the maximum defined data quality, the classification performance of both methods, measured as F-measure, was close: 0.996 for semi-supervised classification and 0.993 for unsupervised classification. It diverged incrementally as data quality worsened, dropping to 0.964 for semi-supervised classification and 0.803 for unsupervised classification.
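For readers unfamiliar with the metric, the sketch below shows how an F-measure can be computed for a binary match/non-match classification. It assumes the balanced F1 variant (the harmonic mean of precision and recall); the summary does not state which variant was used. The function name `f_measure` and the counts in the example are hypothetical and are not taken from the study.

```python
def f_measure(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Balanced F-measure (F1) from the counts of a binary classification.

    true_pos:  record pairs correctly classified as matches
    false_pos: non-matching pairs wrongly classified as matches
    false_neg: matching pairs wrongly classified as non-matches
    """
    if true_pos == 0:
        # Without any correct match, precision or recall is zero
        # (or undefined), so the score is reported as 0.
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration: 90 correct matches,
# 10 false matches, 10 missed matches.
score = f_measure(90, 10, 10)
```

With these made-up counts, precision and recall are both 0.9, so the F-measure is also 0.9. True non-matches do not enter the formula, which suits record linkage, where non-matching pairs vastly outnumber matching ones.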

Conclusions: Besides supplying a viable model for semi-supervised classification in automated probabilistic record linkage, the tests conducted on a large number of test sets suggest that semi-supervised techniques might generally be capable of outperforming unsupervised techniques, especially on data of lower quality.

 