Methods of Information in Medicine

CC BY-NC-ND 4.0 · Methods Inf Med 2024; 63(01/02): 021-034
DOI: 10.1055/s-0044-1778693

Original Article

Development and Validation of a Natural Language Processing Algorithm to Pseudonymize Documents in the Context of a Clinical Data Warehouse

Xavier Tannier¹, Perceval Wajsbürt², Alice Calliger², Basile Dura², Alexandre Mouchet², Martin Hilka², Romain Bey²

¹Sorbonne Université, Inserm, Université Sorbonne Paris Nord, Laboratoire d'Informatique Médicale et d'Ingénierie des Connaissances pour la e-Santé (LIMICS), Paris, France

²Innovation and Data Unit, IT Department, Assistance Publique-Hôpitaux de Paris, Paris, France

Funding This study has been supported by grants from the Assistance Publique-Hôpitaux de Paris (AP-HP) Foundation.

Abstract

Objective This study addresses the deidentification of clinical reports, which is needed to open data for research purposes while ensuring patient privacy. It highlights the difficulties faced in sharing tools and resources in this domain and presents the experience of the Greater Paris University Hospitals (AP-HP, Assistance Publique-Hôpitaux de Paris) in implementing systematic pseudonymization of text documents from its Clinical Data Warehouse.

Methods We annotated a corpus of clinical documents according to 12 types of identifying entities and built a hybrid system that merges the results of a deep learning model with those of manually written rules.
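The idea of a hybrid system can be illustrated with a minimal sketch. It is not the authors' implementation: the two regular expressions, the `Span` type, the labels, and the overlap policy (keep the longest span, prefer a rule on ties) are all assumptions made for illustration, and the model predictions are passed in as a plain list rather than produced by a real neural network.

```python
import re
from typing import List, NamedTuple


class Span(NamedTuple):
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    label: str   # entity type, e.g. "NAME" (hypothetical label set)
    source: str  # "rule" or "model"


# Hypothetical hand-written rules; a real rule set would be far larger.
RULES = {
    "MAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\b0\d(?:[ .]?\d{2}){4}\b"),
}


def rule_spans(text: str) -> List[Span]:
    """Return every span matched by the hand-written rules."""
    return [
        Span(m.start(), m.end(), label, "rule")
        for label, pattern in RULES.items()
        for m in pattern.finditer(text)
    ]


def merge_spans(model: List[Span], rules: List[Span]) -> List[Span]:
    """Merge both sources, dropping any span that overlaps a kept one.

    Assumed policy: longer spans win; on equal length, rules win.
    """
    candidates = sorted(
        model + rules,
        key=lambda s: (-(s.end - s.start), s.source != "rule", s.start),
    )
    kept: List[Span] = []
    for s in candidates:
        if all(s.end <= k.start or s.start >= k.end for k in kept):
            kept.append(s)
    return sorted(kept, key=lambda s: s.start)


def mask(text: str, spans: List[Span]) -> str:
    """Replace each non-overlapping span with a placeholder such as <NAME>."""
    out, cursor = [], 0
    for s in spans:
        out.append(text[cursor:s.start])
        out.append(f"<{s.label}>")
        cursor = s.end
    out.append(text[cursor:])
    return "".join(out)


if __name__ == "__main__":
    text = "Dr Martin, tel 01 23 45 67 89, mail a.b@aphp.fr"
    name_at = text.index("Martin")
    phone_at = text.index("01")
    # Stand-in for model output: one name and a truncated phone number.
    model_spans = [
        Span(name_at, name_at + 6, "NAME", "model"),
        Span(phone_at, phone_at + 5, "PHONE", "model"),
    ]
    print(mask(text, merge_spans(model_spans, rule_spans(text))))
```

In this sketch the truncated phone span from the stand-in model is discarded in favor of the longer rule match, which is one simple way the two sources can complement each other.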

Results and Discussion Our results show an overall F1-score of 0.99. We discuss implementation choices and present experiments to better understand the effort involved in such a task, including the impact of dataset size, document types, language models, and rule addition. We share guidelines and code under a 3-Clause BSD license.
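For readers unfamiliar with the metric, the F1-score is the harmonic mean of precision and recall. The sketch below computes it with exact matching of (start, end, label) entities; this matching granularity is an assumption for illustration, since the abstract does not state whether scoring is done at the token or entity level.

```python
from typing import Set, Tuple

# An entity is identified by its character offsets and its type.
Entity = Tuple[int, int, str]


def precision_recall_f1(gold: Set[Entity], pred: Set[Entity]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 between gold and predicted entities."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up annotations, not data from the study.
    gold = {(0, 5, "NAME"), (10, 20, "DATE"), (30, 35, "ZIP")}
    pred = {(0, 5, "NAME"), (10, 20, "DATE"), (40, 45, "NAME")}
    print(precision_recall_f1(gold, pred))
```

In a deidentification setting, recall deserves particular attention, because every missed entity is identifying information left in the text.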

Keywords

natural language processing - pseudonymization - electronic health reports - Clinical Data Warehouse - named entity recognition

Authors' Contribution

All authors designed the study. X.T. drafted the manuscript. All authors interpreted data and made critical intellectual revisions of the manuscript. X.T. did the literature review. P.W. checked all the annotations. P.W., A.C. and B.D. developed the deidentification algorithms. P.W. conducted the experiments and computed the statistical results. X.T., A.M., M.H. and R.B. supervised the project.

Data Availability Statement

Access to the Clinical Data Warehouse's raw data can be granted following the process described on its Web site: eds.aphp.fr. Prior validation of the access by the local institutional review board is required. For non-AP-HP researchers, a signed collaboration contract is also mandatory.

Supplementary Material

Publication History

Received: March 24, 2023

Accepted: November 28, 2023

Article published online: March 5, 2024

© 2024. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution-NonDerivative-NonCommercial License, permitting copying and reproduction so long as the original work is given appropriate credit. Contents may not be used for commercial purposes, or adapted, remixed, transformed or built upon. (https://creativecommons.org/licenses/by-nc-nd/4.0/)

Georg Thieme Verlag KG
Stuttgart · New York

