Methods Inf Med 2008; 47(05): 448-453
DOI: 10.3414/ME0529
Original Article
Schattauer GmbH

DataLink Record Linkage Software Applied to the Cancer Registry of Murcia, Spain

M. Márquez Cid (1, 2), M. D. Chirlaque (1, 2), C. Navarro (1, 2)

1   Epidemiology Department, Regional Health Council, Murcia, Spain
2   CIBER Epidemiología y Salud Pública (CIBERESP), Spain

Publication History

Received: 10 January 2008

Accepted: 21 April 2008

Publication Date: 20 January 2018 (online)

Summary

Objectives: Record linkage between data sets is relatively simple when each data set contains a common identifier that is unique, universal, and permanent. Such identifiers are rarely available, so probabilistic methods are needed to identify corresponding records. DataLink was tested to determine whether clustering techniques improve performance with only a minimal loss of accuracy.
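
As a rough illustration of what a probabilistic method computes, the sketch below scores a pair of records by summing log-likelihood ratios over fields, in the style of the Fellegi-Sunter model. The field names and the m/u probabilities are hypothetical values chosen for illustration; they are not taken from the study, and this is not DataLink's actual implementation.

```python
import math

# Hypothetical per-field probabilities (illustrative only, not from the study):
#   m = P(field agrees | both records refer to the same person)
#   u = P(field agrees | records refer to different people)
FIELD_PROBS = {
    "surname": (0.95, 0.01),
    "birth_date": (0.97, 0.003),
    "sex": (0.99, 0.5),
}


def match_weight(rec_a, rec_b, probs=FIELD_PROBS):
    """Return the summed log2 likelihood ratio over all fields."""
    total = 0.0
    for field, (m, u) in probs.items():
        if rec_a[field] == rec_b[field]:
            # Agreement adds evidence for a match.
            total += math.log2(m / u)
        else:
            # Disagreement adds evidence against a match.
            total += math.log2((1 - m) / (1 - u))
    return total
```

Pairs whose weight exceeds an upper threshold would be treated as matches, pairs below a lower threshold as non-matches, and the remainder sent for manual review; the thresholds themselves would have to be chosen for the data at hand.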

Methods: The study uses cancer registry data, including hospital discharge and pathology reports from two hospitals in the Murcia Region for 2002-2003. These data are standardized before DataLink is run. The original version of DataLink compares every record with every other record; two later versions apply clustering, which filters candidate records on one or more variables. Computing time and the proportion of matches detected were measured for each version.
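
The difference between comparing all records and filtering first can be sketched as follows. This is a minimal example that assumes the clustering step behaves like a simple blocking filter (only records sharing the same values on the chosen variables are compared); the abstract does not describe the actual mechanism, and the record fields and function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs(records):
    """Candidate pairs when every record is compared with every other."""
    return list(combinations(range(len(records)), 2))


def blocked_pairs(records, keys):
    """Candidate pairs restricted to records that share the values of `keys`."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[tuple(rec[k] for k in keys)].append(idx)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs


# Made-up records for illustration only.
records = [
    {"surname": "GARCIA", "birth_year": 1950, "sex": "F"},
    {"surname": "GARCIA", "birth_year": 1950, "sex": "F"},
    {"surname": "LOPEZ", "birth_year": 1948, "sex": "M"},
    {"surname": "GARCIA", "birth_year": 1962, "sex": "F"},
]

print(len(all_pairs(records)))                      # number of pairs without filtering
print(len(blocked_pairs(records, ["birth_year"])))  # pairs after filtering on one variable
```

The number of candidate pairs grows quadratically without filtering, which is why a filter can cut computing time so sharply. The cost is that a true duplicate with an error in the filtering variable is never compared, which would account for a small loss of matches.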

Results: The two clustering versions achieve 96.1% and 96.2% accuracy, respectively. Compared with the original version, they reduce computing time by 97.3% and 98.6% and lose 0.36% and 1.07% of true duplicates, respectively.

Conclusions: DataLink implements deterministic and probabilistic record linkage to eliminate duplicates and to merge new information with existing cases. The standardization of variables to a common format has been adapted to the characteristics of Spanish-language data. Clustering techniques sharply reduce computing time while keeping accuracy high in the detection of corresponding records.
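
A standardization step for Spanish-language names might look like the sketch below. The rules shown (removing diacritics, uppercasing, splitting on punctuation, and dropping common particles) are assumptions made for illustration; the abstract does not list the rules DataLink applies, and the particle list is hypothetical.

```python
import re
import unicodedata

# Hypothetical list of name particles to ignore when comparing names.
PARTICLES = {"DE", "DEL", "LA", "LAS", "LOS", "Y"}


def standardize_name(raw):
    """Reduce a name to uppercase ASCII tokens without particles."""
    # Decompose accented letters and drop the combining marks.
    # Note: this also turns "ñ" into "n"; a real pipeline might keep it.
    decomposed = unicodedata.normalize("NFKD", raw)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Keep only runs of letters, which also splits hyphenated surnames.
    tokens = re.findall(r"[A-Z]+", no_marks.upper())
    return " ".join(t for t in tokens if t not in PARTICLES)


print(standardize_name("María de los Ángeles  Pérez-Muñoz"))
# prints "MARIA ANGELES PEREZ MUNOZ"
```

Normalizing both files in the same way before linkage means that spelling variants such as accented and unaccented forms compare as equal.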

 