CC BY-NC-ND 4.0 · Methods Inf Med 2024; 63(01/02): 021-034
DOI: 10.1055/s-0044-1778693
Original Article

Development and Validation of a Natural Language Processing Algorithm to Pseudonymize Documents in the Context of a Clinical Data Warehouse

1   Sorbonne Université, Inserm, Université Sorbonne Paris Nord, Laboratoire d'Informatique Médicale et d'Ingénierie des Connaissances pour la e-Santé (LIMICS), Paris, France
Perceval Wajsbürt
2   Innovation and Data Unit, IT Department, Assistance Publique-Hôpitaux de Paris, Paris, France
2   Innovation and Data Unit, IT Department, Assistance Publique-Hôpitaux de Paris, Paris, France
2   Innovation and Data Unit, IT Department, Assistance Publique-Hôpitaux de Paris, Paris, France
Alexandre Mouchet
2   Innovation and Data Unit, IT Department, Assistance Publique-Hôpitaux de Paris, Paris, France
Martin Hilka
2   Innovation and Data Unit, IT Department, Assistance Publique-Hôpitaux de Paris, Paris, France
2   Innovation and Data Unit, IT Department, Assistance Publique-Hôpitaux de Paris, Paris, France
Funding This study has been supported by grants from the Assistance Publique-Hôpitaux de Paris (AP-HP) Foundation.

Abstract

Objective This study addresses the deidentification of clinical reports, which is needed to open data for research while protecting patient privacy. It highlights the difficulties of sharing tools and resources in this domain and presents the experience of the Greater Paris University Hospitals (AP-HP, for Assistance Publique-Hôpitaux de Paris) in implementing systematic pseudonymization of text documents from its Clinical Data Warehouse.

Methods We annotated a corpus of clinical documents with 12 types of identifying entities and built a hybrid system that merges the predictions of a deep learning model with those of manually written rules.
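
As an illustration of what such a hybrid system can look like, the following minimal sketch merges entity spans from a statistical model with spans from a handwritten rule. It is not the authors' implementation: the merge policy (rule spans take priority, and model spans are kept only where they do not overlap an already kept span), the phone-number pattern, and the labels `NAME` and `PHONE` are all assumptions made for the example, and the model output is passed in as a plain list.

```python
import re
from typing import List, NamedTuple


class Span(NamedTuple):
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    label: str   # hypothetical entity type, e.g. "NAME" or "PHONE"


# Hypothetical rule: ten-digit numbers starting with 0, written in pairs
# optionally separated by spaces or dots (e.g. "01 23 45 67 89").
PHONE_RE = re.compile(r"\b0\d(?:[ .]?\d{2}){4}\b")


def rule_spans(text: str) -> List[Span]:
    """Return the spans found by the handwritten rule."""
    return [Span(m.start(), m.end(), "PHONE") for m in PHONE_RE.finditer(text)]


def overlaps(a: Span, b: Span) -> bool:
    """True when the two character ranges share at least one character."""
    return a.start < b.end and b.start < a.end


def merge_spans(model: List[Span], rules: List[Span]) -> List[Span]:
    """Keep every rule span, then add model spans that overlap nothing kept so far."""
    merged = list(rules)
    for span in model:
        if not any(overlaps(span, kept) for kept in merged):
            merged.append(span)
    return sorted(merged)


def pseudonymize(text: str, spans: List[Span]) -> str:
    """Replace each span with a placeholder built from its label."""
    pieces = []
    cursor = 0
    for span in sorted(spans):
        pieces.append(text[cursor:span.start])
        pieces.append(f"<{span.label}>")
        cursor = span.end
    pieces.append(text[cursor:])
    return "".join(pieces)
```

Calling `merge_spans(model_predictions, rule_spans(text))` and passing the result to `pseudonymize` yields the masked text. A production system would typically substitute plausible surrogate values rather than bare placeholders.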

Results and Discussion The system reaches an overall F1-score of 0.99. We discuss implementation choices and present experiments that quantify the effort such a task involves, including the effects of dataset size, document type, language model, and rule addition. We share guidelines and code under a 3-Clause BSD license.
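
For readers unfamiliar with the metric, the F1-score is the harmonic mean of precision and recall. The sketch below computes it over sets of entities, assuming exact-match scoring on `(start, end, label)` triples. The matching criterion used in the study is not stated in this abstract, so this granularity is an assumption, and the function name is hypothetical.

```python
from typing import Set, Tuple

# An entity is identified by its character offsets and its label.
Entity = Tuple[int, int, str]


def precision_recall_f1(gold: Set[Entity], predicted: Set[Entity]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 between gold and predicted entities."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

In a deidentification setting, recall is usually the figure to watch most closely, since a missed entity leaves identifying information in the released text.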

Authors' Contribution

All authors designed the study. X.T. drafted the manuscript. All authors interpreted data and made critical intellectual revisions of the manuscript. X.T. did the literature review. P.W. checked all the annotations. P.W., A.C. and B.D. developed the deidentification algorithms. P.W. conducted the experiments and computed the statistical results. X.T., A.M., M.H. and R.B. supervised the project.


Data Availability Statement

Access to the Clinical Data Warehouse's raw data can be granted following the process described on its Web site: eds.aphp.fr. Prior validation of the access by the local institutional review board is required. Researchers from outside AP-HP must also sign a collaboration contract.



Publication History

Received: 24 March 2023

Accepted: 28 November 2023

Article published online: 05 March 2024

© 2024. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution-NonDerivative-NonCommercial License, permitting copying and reproduction so long as the original work is given appropriate credit. Contents may not be used for commercial purposes, or adapted, remixed, transformed or built upon. (https://creativecommons.org/licenses/by-nc-nd/4.0/)

Georg Thieme Verlag KG
Stuttgart · New York

 