Keywords
natural language processing - machine learning - adverse drug events
Introduction
The rise of electronic medical records has heightened the importance of natural language processing (NLP) techniques in health care due to the vast amount of textual data they generate.[1] Given the widespread interest in NLP within computer science, the volume of research on medical NLP has experienced a remarkable surge annually. This trend has also been supported by numerous medical NLP workshops, such as CLEF eHealth,[2] [3] n2c2[4] (formerly known as i2b2), MADE,[5] and MEDIQA.[6] However, despite the substantial body of research, the availability of privacy-compliant medical text data remains limited, particularly in non-English languages.[7] [8] [9]
To address this limitation, we organized a series of medical NLP workshops with open datasets (the MedNLP Series) at an international conference, NII Testbeds and Community for Information Access Research (NTCIR): MedNLP-1,[10] MedNLP-2,[11] MedNLPDoc,[12] and MedWeb.[13] In MedNLP-1, we introduced a foundational NLP task, named entity recognition (NER), utilizing dummy medical records crafted by medical professionals. MedNLP-2 focused on a term normalization task, again employing dummy medical records developed by medical experts. The MedNLPDoc workshop was designed to encompass a comprehensive task. Beginning with a medical record sourced from a medical textbook, participants were tasked with identifying an appropriate disease name represented by International Classification of Diseases codes. In MedWeb, a disease tweet classification task was designed to simulate the use of social media data in the medical and health care domains; dummy Twitter data were created in Japanese and translated into English and Chinese.
Past workshops in the MedNLP Series, summarized in [Table 1], successfully produced valuable datasets. However, two major problems have been identified: (1) the data were not real clinical texts but dummy records or sample texts from medical textbooks, and (2) the datasets were limited to Japanese, which made it difficult to compare results with those of English-based workshops ([Table 2]).
Table 1
Past MedNLP Series workshops proposed by the authors
Workshop | Year | Corpus | Task
MedNLP-1[10] | 2012–2013 | Dummy HR written by clinicians | NER
MedNLP-2[11] | 2013–2014 | Dummy HR written by clinicians | NEN
MedNLPDoc[12] | 2015–2016 | Dummy HR extracted from clinical textbooks | NEN
MedWeb[13] | 2016–2017 | Dummy Tweets obtained by crowdsourcing | TC
Abbreviations: HR, health record; MedNLP: medical natural language processing; NEN, named entity normalization; NER, named entity recognition; TC, text classification.
Table 2
Real-MedNLP tasks
Task | Corpus | Format
1) Just 100 Training | CR/RR | NER
2) Guideline Learning | CR/RR | NER
3) Applications | CR | ADE
3) Applications | RR | CI
Abbreviations: ADE, adverse drug event; CI, case identification; CR, case report; MedNLP: medical natural language processing; NER, named entity recognition; RR, radiology report.
To address these aspects, during 2021 to 2022, we proposed and organized Real-MedNLP, the first workshop in the MedNLP Series that handles real and parallel medical text. Our data comprised two document types: (1) case reports (CRs; MedTxt-CR) and (2) radiology reports (RRs; MedTxt-RR). Both corpora are realistic medical/clinical texts based on the materials available on the Internet, where realistic means that real case-report articles constitute MedTxt-CR, and MedTxt-RR contains newly written (dummy) RRs that interpret commonly available real radiology images. Furthermore, we manually translated the original Japanese text into English, enabling us to develop the first benchmark for cross-lingual medical NLP. Considering the data, we redesigned the workshop scheme to achieve our goal of promoting systems applicable at the bedside. This reintroduces the aforementioned challenging restrictions in medical NLP: limited dataset sizes. The proposed task format is as follows:
-
Low-resource NER (Tasks 1 and 2): participants are expected to extract medical expressions from text, although only a limited number of annotated documents are available for training machine learning models. This reflects real-world medical NLP, which often suffers from a scarcity of annotated text in hospitals or their departments owing to annotation costs. We further defined two tasks: Just 100 Training (Task 1) and Guideline Learning (Task 2). This task setting is called low-resource in NLP research[14]; Task 2 explicitly corresponds to zero- or few-shot learning in the machine learning field.
-
Applications (Task 3): corresponding to the two document types, we propose two practical and useful MedNLP applications in actual clinical work. For CRs, we designed an information extraction task for adverse drug events (ADE) reporting (i.e., pharmacovigilance) characterized by a different approach from relation extraction, which is usually adopted in existing workshops such as i2b2 2009.[15] We propose a novel case identification (CI) task for RRs to detect reports originating from identical patients.
These demanding tasks offer exciting prospects for advancing practical systems that can enhance various medical services, including phenotyping,[16] drug repositioning,[17] drug target discovery,[18] precision medicine,[19] clinical text-input methods,[20] [21] and electronic health record summarization/aggregation.[22]
This study provides an account of the materials used, detailed task definitions, evaluation metrics employed, an overview of participants' approaches, and the overall results achieved during the Real-MedNLP workshop.
Materials
Corpora: MedTxt
Overview
The textual datasets (corpora) released to workshop participants were named MedTxt. Two types of medical and clinical documents were used as corpora: CRs and RRs. The two corpora are parallel in Japanese (JA, original) and English (EN, translated). For example, the Japanese CR corpus is identified as MedTxt-CR-JA.
Case Reports: MedTxt-CR
A CR is a medical research paper in which doctors describe specific clinical cases. CRs aim to share clinically notable issues with other doctors, particularly those in medical societies. The format of CRs is similar to that of discharge summaries, which are clinical documents written by doctors to record the treatment history of discharged patients. Because popular English medical NLP corpora are often composed of discharge summaries (e.g., MIMIC-III), techniques for CR analysis can be smoothly extended to the analysis of discharge summaries.
MedTxt-CR-JA comprises open-access CRs obtained from the Japanese scholarly publication platform J-Stage.[23]
[Fig. 1] shows an annotated sample. As the number of medical societies that produce open-access publications is limited, the types of patients and diseases reported in open-access CRs are highly biased. To reduce the bias caused by the publication policy (whether or not to prefer open access) of each medical society, we selected 224 CRs based on the term "frequencies" in a Japanese disease-name dictionary, MANBYO-DIC (J-MeDic),[24] which records the frequency of each term in Japanese medical corpora. These CRs were manually translated from Japanese (MedTxt-CR-JA) into English (MedTxt-CR-EN) while retaining the named entity annotations (described later). They were divided into 148 training and 76 test documents.
Fig. 1 Annotated sample of the case report corpora in English (MedTxt-CR-EN). The entity notations stand for D = diseases and symptoms with the modality “certainty” such as positive (+) and negative (−); A = anatomical parts; Time = time expressions with the modality “type” such as date (DATE), age (AGE), and medically specific (MED); Tt/k/v = test set/item/values with the modality “state” such as executed (+); and Mk = medicine name with the modality “state” such as executed (+).
Radiology Reports: MedTxt-RR
An RR is a clinical document written by a radiologist to share a patient's status with physicians. Each RR discusses a single radiological examination, such as radiography, computed tomography (CT), or magnetic resonance imaging. An RR contains (1) descriptions of all normal and abnormal findings and (2) interpretations of the findings, including disease diagnosis and recommendations for the next clinical test or treatment. Although most radiology AI (artificial intelligence) research focuses only on images because image-based AI has drawn much attention, NLP on radiology-report text also has the potential for a wide variety of clinical applications.[25]
MedTxt-RR[26] consists of 135 RRs. [Fig. 2] shows an annotated sample. MedTxt-RR aims to capture the diversity of expressions used by different radiologists to describe the same diagnosis; one of the difficulties in analyzing RRs is this variety of writing styles. However, relying solely on RRs from medical institutions presents limitations: in typical clinical settings, only one report is generated per radiological examination, restricting the data available for in-depth analysis. MedTxt-RR-JA was created to overcome this problem by crowdsourcing, in which nine radiologists independently wrote RRs for the same series of 15 lung cancer cases.
Fig. 2 Annotated sample of the radiology report corpora in English (MedTxt-RR-EN). The entity notations stand for D = diseases and symptoms with the modality “certainty” such as positive (+) and suspicious (?); A = anatomical parts; Time = time expressions with the modality “type” such as date (DATE); and Tt = test sets with the modality “state” such as executed (+).
MedTxt-RR-EN is an English translation of MedTxt-RR-JA by nine translators, one corresponding to each radiologist. We divided the reports into 72 training reports (8 cases) and 63 test reports (7 cases).
Tasks and Annotations
Tasks 1 and 2: Low-Resource NER Challenge
Because NER is the most fundamental information extraction task for medical NLP, we designed a challenge regarding NER with a limited number of clinical reports (∼100). This corpus size falls into so-called low-resource NLP, in which training models becomes challenging.[14] This setting is the de facto standard in medical NLP, mainly because of the innate difficulty of medical concept annotation and privacy concerns regarding patient health records. To address this challenge, we defined two tasks based on the size of the available training data:
-
Task 1) Just 100 Training: we provided ∼100 reports for training, corresponding to the standard few-resource (or few-shot) supervised learning.
-
Task 2) Guideline Learning: we provided annotation guidelines containing only a handful of annotated sentences, simulating human annotator training, in which human annotators usually learn from annotation guidelines defined by NLP researchers.
Although we provided only a few training data points for both tasks, workshop participants could use any other resources outside this project (e.g., medical dictionaries and medically pretrained language models) if they found them useful for their methods.
We adopted the following entity types from an existing medical NER scheme[27] [28]:
-
Diseases and symptoms <d> with the modality “certainty”:
-
positive: the existence is recognized.
-
negative: the absence is confirmed.
-
suspicious: the existence is speculated.
-
general: hypothetical or common knowledge mentions.
-
Anatomical parts <a>
-
Time <timex3> with the modality “type”:
-
date: a calendar date.
-
time: a clock time.
-
duration: a continuous period.
-
set: frequency consisting of multiple dates, times, and periods.
-
age: a person's age.
-
med: medicine-specific time expressions such as “post-operative.”
-
misc: exceptional time expressions other than the above.
-
Test <t-test/key/val> with the modality "state" (the "state" values listed under Medicine below apply to both Test and Medicine):
-
Medicine <m-key/val> with the modality “state”:
-
scheduled: treatment is planned.
-
executed: treatment was executed.
-
negated: treatment was canceled.
-
other: an exceptional state other than the above.
Detailed definitions and information on modality are provided in Yada et al[27] and Chapter 2 of our annotation guidelines.[28] Several batches of the Japanese corpus were annotated separately by multiple native Japanese speakers without medical knowledge and then translated into English while retaining the annotated entities. This annotation scheme enables reasonably coherent, high-quality annotation even if annotators lack medical knowledge.[27] [29]
The detailed statistics of the entity annotations in the datasets are presented in [Table 3]. We evaluated the following tag sets.
Table 3
Named entities of the training sets of MedTxt-CR and MedTxt-RR
Entity | Modality | CR-JA | CR-EN | RR-JA | RR-EN
 | # of texts | 148 | 148 | 72 | 72
 | # of characters (mean) | 84,471 (570) | 40,383 (272) | 16,861 (234) | 8,488 (117)
<a> | total | 823 | 819 | 464 | 465
<d> | total | 2,348 | 2,346 | 884 | 883
 | "positive" | 1,695 | 1,693 | 465 | 462
 | "suspicious" | 80 | 80 | 191 | 191
 | "negative" | 251 | 251 | 149 | 148
 | "general" | 302 | 302 | 1 | 1
<t-test> | total | 387 | 388 | 26 | 27
 | "scheduled" | 0 | 0 | 0 | 0
 | "executed" | 362 | 363 | 19 | 19
 | "negated" | 7 | 7 | 2 | 2
 | "other" | 18 | 18 | 5 | 6
<timex3> | total | 1,353 | 1,353 | 29 | 29
 | "date" | 539 | 539 | 26 | 26
 | "time" | 53 | 53 | 0 | 0
 | "duration" | 82 | 82 | 2 | 2
 | "set" | 34 | 34 | 0 | 0
 | "age" | 189 | 189 | 0 | 0
 | "med" | 428 | 428 | 1 | 1
 | "misc" | 28 | 28 | 0 | 0
<m-key> | total | 344 | 344 | 0 | 0
 | "scheduled" | 0 | 0 | 0 | 0
 | "executed" | 266 | 266 | 0 | 0
 | "negated" | 27 | 27 | 0 | 0
 | "other" | 51 | 51 | 0 | 0
<m-val> | total | 64 | 64 | 0 | 0
 | "scheduled" | 0 | 0 | 0 | 0
 | "executed" | 0 | 0 | 0 | 0
 | "negated" | 2 | 2 | 0 | 0
 | "other" | 0 | 0 | 0 | 0
<t-key> | total | 524 | 524 | 1 | 1
<t-val> | total | 427 | 427 | 0 | 0
Abbreviations: CR, case report; EN, English; JA, Japanese; RR, radiology report.
-
MedTxt-CR: <d>, <a>, <t-test>, <timex3>, <m-key>, <m-val>, <t-key>, and <t-val> (all types above).
-
MedTxt-RR: <d>, <a>, <t-test>, and <timex3> (a subset of the types above).
Teams could choose whether or not to predict the modalities of the entities.
Task 3: Applications
Task3-CR: Adverse Drug Effect Extraction
In this application, tailored for MedTxt-CR, systems were supposed to analyze input CRs and extract any ADE information. Unlike the typical relation-extraction formulation of past ADE extraction tasks,[4] we set the objective as table slot filling, that is, independently predicting the degree of involvement in ADEs for each mentioned disease and medicine. We thereby attempted to address an issue with standard ADE extraction, in which even medical professionals find it difficult to identify ADEs from text alone, leading to annotation difficulties. As shown in [Fig. 3], the ADE information consists of two tables: the <d>-table for disease/symptom names and the <m-key>-table for drug names. For each entity in these tables, the four levels of ADE certainty (ADEval), based on Kelly et al,[30] were as follows:
Fig. 3 Task 3—ADE application, wherein each disease or medication entity mentioned in case reports is labeled with the degree of involvement in adverse drug events (ADEval), ranging from 0 to 3. ADE, adverse drug effect.
3: Definitely | 2: Probably | 1: Unlikely | 0: Unrelated
For disease names, ADEvals were interpreted as the likelihood of being an ADE; for medicine names, as the likelihood of causing an ADE. To annotate these labels, we asked two annotators to follow the author's perspective on whether drugs and symptoms were related to an ADE (i.e., the writer's perception). However, if the annotators noticed other possible ADEs that were not explicitly pointed out in the report, we allowed them to label ADEval ≥ 1 as well (i.e., the reader's perception). Note that one annotator has experience as a pharmacist and the other holds a master's degree in biology. [Table 4] presents the distribution of ADEvals in the training set.
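For illustration only, the slot-filling output for a single case report can be thought of as two small tables mapping each mentioned entity to an ADEval; the entities and labels below are hypothetical examples, not taken from the corpus.

```python
# Hypothetical example of the two ADE tables for one case report:
# every mentioned disease/symptom and medicine receives an ADEval in {0, 1, 2, 3}.
ade_tables = {
    "<d>-table": {"interstitial pneumonia": 3, "rheumatoid arthritis": 0},
    "<m-key>-table": {"methotrexate": 3, "prednisolone": 0},
}
print(ade_tables["<m-key>-table"]["methotrexate"])  # 3: definitely involved in an ADE
```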
Table 4
ADEval distributions for each entity type in the training set of Task 3 ADE application
ADEval | Disease | Medicine | (total)
0 | 1,217 | 103 | 1,320
1 | 33 | 28 | 61
2 | 57 | 22 | 79
3 | 123 | 47 | 170
Total | 1,430 | 200 | 1,630
Abbreviation: ADE, adverse drug event.
Task3-RR: Case Identification
This application was specifically designed for MedTxt-RR. Given the unsorted RRs, the participants were required to identify sets of RRs that diagnosed identical images, as depicted in [Fig. 4]. MedTxt-RR was created by collecting RRs from multiple radiologists who independently diagnosed the same CT images; this original correspondence between RRs and CT images was used as the gold standard label (document level).
Fig. 4 Task 3—CI application, where radiology reports written by different radiologists are grouped by the described cases. CI, case identification.
In addition to an educational purpose in which trainee radiologists practice writing reports on shared images, this task would contribute to better NLP models that accurately understand the clinical content of RRs without being confused by synonyms or paraphrases, as MedTxt-RR contains RRs with almost the same clinical content but with various expressions.
Data Availability
The training portions of MedTxt corpora are publicly available at NTCIR Test Collection.[31]
Methods
Baseline Systems
Overview
We provide baseline systems for each task on the Japanese corpora.[32] All systems adopt straightforward approaches to solving the tasks. The models described below are based on a BERT[33] model pretrained on Japanese corpora,[34] which tokenizes Japanese text using the morphological analyzer MeCab.[35]
Tasks 1 and 2: NER Models
We fine-tuned the individual models on each training set using the same NER-training scheme as Devlin et al.[33] The model predicts all entity types defined in Yada et al,[27] instead of only targeting the subsets for the tasks, because more entity types would provide more context to the model, even though task complexity may increase.
We released the baseline model trained on MedTxt-CR-JA at the Hugging Face Hub.[36]
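For illustration, a fine-tuned token-classification model of this kind can be applied with the Hugging Face transformers pipeline API; the snippet below is a minimal usage sketch, and the model identifier is a hypothetical placeholder, not the actual name of the released baseline.

```python
# A minimal usage sketch (assuming the transformers library is installed);
# MODEL_ID is a placeholder for the baseline released at the Hugging Face Hub.
from transformers import pipeline

MODEL_ID = "path-or-hub-id-of-the-medtxt-ner-baseline"  # hypothetical placeholder
ner = pipeline("token-classification", model=MODEL_ID,
               aggregation_strategy="simple")  # merge word pieces into entity spans

for ent in ner("A 67-year-old man presented with interstitial pneumonia."):
    print(ent["entity_group"], ent["word"], round(ent["score"], 3))
```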
Task3-CR: ADE Classifier
We solved this application task using an entity-wise classification scheme. For each disease or drug entity in the table row, we designed a model input consisting of four parts separated by [SEP] special tokens ([Fig. 5]): (1) the document ID, (2) contextual tokens around the targeted entity, (3) the targeted entity itself, and (4) its entity type name (i.e., “disease” or “drug”). Specifically, Part (2) contains 50 + 50 characters before and after the target entity, including the entity itself. For simplicity, the context around the first mention is extracted when the targeted token appears multiple times in the document.
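A minimal sketch, with illustrative names and text, of how such a model input could be assembled from the four parts described above; this is not the authors' released implementation.

```python
# Sketch of the baseline's input construction for Task3-CR (ADE); illustrative only.
def build_ade_input(doc_id, text, entity, entity_type, window=50, sep="[SEP]"):
    """Concatenate (1) the document ID, (2) ~50 characters of context before and
    after the first mention of the entity, (3) the entity itself, and
    (4) its type name ("disease" or "drug"), separated by [SEP] tokens."""
    start = text.find(entity)        # first mention only, as described in the text
    start = max(start, 0)            # fall back to the document start if not found
    context = text[max(0, start - window): start + len(entity) + window]
    return f" {sep} ".join([doc_id, context, entity, entity_type])

print(build_ade_input("CR-001",
                      "Interstitial pneumonia developed after methotrexate was started.",
                      "methotrexate", "drug"))
```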
Fig. 5 Input and output formats of the baseline model for Task3-CR (ADE). Although the real inputs were in Japanese, an English sample is used in this figure for readability.
Task3-RR: CI Clusterer
We exhaustively classified all document pairs to identify RRs describing identical patients. For each pair of given documents, the model judges whether the inputs describe an identical patient (i.e., binary classification; [Fig. 6]). Each document pair is separated by a [SEP] token. Considering permutation, we regard a document pair (Article text 1, Article text 2) as identical patient reports if and only if both predictions of (Article text 1, Article text 2) and (Article text 2, Article text 1) result in “identical-patient.”
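A minimal sketch of this symmetric pairing rule follows; predict_identical is a hypothetical stand-in for the fine-tuned pair classifier, and the retained links would subsequently be merged into patient clusters (e.g., as connected components).

```python
# Sketch of the symmetric pairwise rule for Task3-RR (CI); not the released code.
from itertools import combinations

def link_identical_reports(reports, predict_identical):
    """Keep a report pair only if BOTH orderings (i, j) and (j, i) are
    classified as "identical-patient" by the pair classifier."""
    links = []
    for i, j in combinations(range(len(reports)), 2):
        if predict_identical(reports[i], reports[j]) and \
           predict_identical(reports[j], reports[i]):
            links.append((i, j))
    return links

# Toy usage with a dummy classifier that compares a case-ID prefix.
reports = ["case4: nodule in the right upper lobe", "case4: upper-lobe mass", "case5: ..."]
same_prefix = lambda a, b: a.split(":")[0] == b.split(":")[0]
print(link_identical_reports(reports, same_prefix))  # [(0, 1)]
```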
Fig. 6 Input and output formats of the baseline model for Task3-RR (CI). Although the real inputs were in Japanese, an English sample is used in this figure for readability. CI, case identification.
Participating Teams and Systems
Nine teams participated in the Real-MedNLP workshop. [Table 5] lists the teams, their demographics, and the number of systems submitted by each team for each task. Notably, our workshop attracted global industry participants (i.e., five of the nine teams) from China, Japan, and the United States, demonstrating a high worldwide demand for practical medical NLP solutions. Two teams were multidisciplinary, with medical and computer science researchers collaborating. Most teams adopted pretrained language models as their methodological basis, often combined with external medical knowledge, data augmentation, or both. Each team's approach is summarized below; refer to the corresponding system papers for NTCIR-16 Real-MedNLP[32] [37] [38] [39] [40] [41] [42] [43] [44] [45] for more information.
Table 5
Team demographic and the number of systems developed by each team
The 12 task columns give the number of submitted systems per task, corpus, and language (–, no submission).

Team | #members | Country | Affiliation | Task 1 CR-JA | Task 1 CR-EN | Task 1 RR-JA | Task 1 RR-EN | Task 2 CR-JA | Task 2 CR-EN | Task 2 RR-JA | Task 2 RR-EN | Task 3 CR-JA | Task 3 CR-EN | Task 3 RR-JA | Task 3 RR-EN
AMI | 3 | Japan | Industry | 2 | – | 2 | – | 1 | – | 1 | – | – | – | – | –
FRDC | 4 | China | Industry | – | 2 | – | – | – | – | – | – | – | 10 | – | 10
GunNLP | 1 | Japan | University | – | – | – | – | – | – | – | – | – | – | 1 | –
Baseline[a] | 6 | Japan | University | 1 | – | 1 | – | 1 | – | 1 | – | 1 | – | 1 | –
NICTmed | 4 | Japan | Institute | 4 | 4 | – | – | – | – | – | – | 2 | 2 | 1 | 1
NTTD | 4 | Japan | Industry | 1 | – | 1 | – | – | – | – | – | – | – | – | –
SRCB | 6 | China | Industry | – | 5 | – | 3 | – | – | – | – | – | 6 | – | –
Syapse | 5 | US | Industry | – | – | – | – | – | – | – | 1 | – | 1 | – | 1
Zukyo[a] | 11 | Japan and Switzerland | University and institute | 4 | 4 | 4 | 4 | – | – | – | – | – | – | 1 | –
Total | | | | 12 | 15 | 8 | 7 | 2 | 0 | 2 | 1 | 3 | 19 | 4 | 12
Abbreviations: CR, case report; EN, English; JA, Japanese; RR, radiology report.
a Stands for multidisciplinary (medicine + computer science) teams.
-
AMI (Task1-CR-JA, Task1-RR-JA, Task2-CR-JA, Task2-RR-JA): this team[37] adopted the medically pretrained Japanese BERT (UTH-BERT[46]) as its base model. For Task 1, two systems were proposed: an ensemble of four UTH-BERT models and a fine-tuned UTH-BERT with a CRF layer. For Task 2, a knowledge-guided pipeline was proposed in which UTH-BERT's NER predictions were corrected using medical dictionaries such as J-MeDic,[24] Hyakuyaku-Dictionary,[47] and comeJisyo[48] along with an additional vocabulary extended by bootstrapping.
-
FRDC (Task1-CR-EN, Task3-CR-EN (ADE), Task3-RR-EN (CI)): this team[38] submitted two systems utilizing BioBERT[49] for Task1-CR-EN. One system involved fine-tuning exclusively, whereas the other integrated data augmentation techniques, including label-wise token replacement, synonym replacement, mention replacement, and shuffling within segments.[50] For Task 3, this team proposed a vocabulary-adapted BERT (VART) model that was continuously trained from a fine-tuned BERT but with out-of-vocabulary words from the initial fine-tuning. In Task3-CR-EN (ADE), VART was trained with multiple NLP tasks to classify the ADEval for each entity based on its contextual text and to predict the entity type (disease or drug). Task3-RR-EN (CI) was solved using a combination of two main methods: (1) key-feature clustering to find case-specific information, such as tumor size, and (2) K-means clustering based on document embeddings from Sentence-BERT[51] to cluster the remaining cases unidentified in Step (1).
-
GunNLP (Task3-RR-JA (CI)): this team[39] applied collaborative filtering to an entity-frequency matrix created from the bag-of-words representation of RRs.
-
NAISTSOC (Baseline) (Task1-CR-JA, Task1-RR-JA, Task2-CR-JA, Task2-RR-JA, Task3-CR-JA (ADE), Task3-RR-JA (CI)): this multidisciplinary team[32] provided the aforementioned task baselines for the Japanese corpus tracks.
-
NICTmed (Task1-CR-EN, Task1-CR-JA, Task3-CR-EN (ADE), Task3-CR-JA (ADE), Task3-RR-EN (CI), Task3-RR-JA (CI)): this team[40] investigated the effectiveness of two multilingual language models, multilingual BERT (mBERT)[33] and XLM-RoBERTa.[52] While simply fine-tuning them for Task1-CR (NER), the team addressed Task3-CR (ADE) by treating ADEval as an additional attribute of the named entities. Task3-RR (CI) was solved by K-means clustering of documents vectorized by mBERT.
-
NTTD (Task1-CR-JA, Task1-RR-JA): this team[41] fine-tuned the Japanese BERT using data augmentation (i.e., synonym replacement and shuffling within segments).[50]
-
SRCB (Task1-CR-EN, Task1-RR-EN, Task3-CR-EN (ADE)): this team[42] adopted BERT,[33] BioBERT, Clinical BERT,[53] PubMed BERT,[54] and entity BERT[55] as the base models for Task 1. These were fine-tuned by span-based prompting (e.g., token prediction with the prompt "[span] is a [MASK] entity," where [span] is one of the possible spans and [MASK] is the span's entity type), along with data augmentation (i.e., language model-based token replacement). For Task 3 (ADE), the team used PubMed BERT, Clinical BERT, and BioBERT; after multitask learning (medicine/disease classification and cloze-test tasks) and two-stage training,[56] the models were fine-tuned with prompt learning (i.e., prediction of ADE-causing drug and disease pairs) assisted by data augmentation (back translation via Chinese and random feature dropout).
-
Syapse (Task2-RR-EN, Task3-CR-EN (ADE), Task3-RR-EN (CI)): this team[43] did not perform any fine-tuning on the given training datasets. Instead, it applied standard NLP modules, such as MetaMap[57] and ScispaCy,[58] in a pipeline. First, the pipeline obtains entities for Task 2. For the ADE application, an additional SciBERT[59] module measures the similarity between medicine and disease embedding pairs, regarding high-similarity pairs as high-ADEval entities. A bag-of-entities representation of documents was used for the CI application to measure document similarity.
-
Zukyo (Task1-CR-JA, Task1-RR-JA, Task1-CR-EN, Task1-RR-EN, Task3-RR-JA (CI)): this multidisciplinary team addressed the tasks separately by language. The Japanese sub-team[44] fine-tuned an ensemble of Japanese BERT models using data augmentation (random entity replacement constrained by entity types). For the Japanese CI application, the sub-team manually re-annotated each sentence of the given RR corpus using the TNM classification,[60] the international standard of cancer staging. The same ensemble NER architecture was fine-tuned separately for sentence-wise sequential labeling of TNM.
The English sub-team[45] adopted domain-specific BERT models to tackle Task 1: BioBERT,[49] ClinicalBERT,[53] and RoBERTa (general domain).[61] Furthermore, entity attributes were predicted by an SVM using bags of contextual words around the entities. The training dataset was augmented by random entity replacement constrained by entity types.
Evaluation Metrics
Tasks 1 and 2
We employed the common F1 measure as the NER metric; specifically, we adopted its micro-average over entity classes (i.e., micro F1). Furthermore, we considered the following two factors to enable an evaluation specific to few-resource NER:
-
Boundary factor (exact/partial): the standard NER metric treats a prediction as correct if and only if the predicted span is identical to the gold standard (i.e., exact match). We also introduce partial match to Tasks 1 and 2: if the predicted span overlaps the gold-standard span, the prediction is regarded as "partially correct," obtaining a diminished score. This is intended for downstream tasks in which partially identified NEs are still useful, such as large-scale medical record analysis and highlighting important note sections. To calculate the partial-match score, we considered the proportion of common characters between the gold-standard and predicted entities. Given a gold-standard entity e_i (1 ≤ i ≤ n) and a predicted entity ê_j (1 ≤ j ≤ m) whose spans overlap, we first calculate the entity-level partial-match precision p_{i,j} = |c|/|ê_j| and recall r_{i,j} = |c|/|e_i|, where |e_k| stands for the character length of e_k and |c| denotes the common character length between e_i and ê_j. We then obtain the system-level partial-match precision P_partial and recall R_partial by micro-averaging these entity-level scores over all predicted entities and all gold-standard entities, respectively. Finally, we calculate the partial F1 score, i.e., their harmonic mean.
-
Frequency factor: in our few-resource NER setting, substantial portions of NEs in the corpora appear only a few times, making the tasks challenging for machine-learning models. To measure model performance in identifying such rare NEs, we designed a novel weighted metric that penalizes correct guesses more heavily as the predicted entity appears more frequently in the training dataset. For each gold-standard entity i in the test set, we multiplied the entity-level precision and recall scores by the weight w_i based on the term frequency f_i of the same entity in the training set: w_i = 1/(log_e(f_i + 1) + 1). This weighting portrays the extent to which the system relied on high-frequency entities in the training phase, as well as how well the system captured low-frequency entities. A minimal computational sketch of both evaluation factors follows this list.
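The snippet below is a minimal sketch, under our reading of the definitions above, of the entity-level partial-match scores and the frequency weight; it is illustrative only and not the official evaluation script.

```python
import math

def partial_scores(gold_span, pred_span):
    """Entity-level partial-match scores for two overlapping character spans
    (start, end): precision = |c| / |predicted|, recall = |c| / |gold|,
    where |c| is the length of the common character span."""
    common = max(0, min(gold_span[1], pred_span[1]) - max(gold_span[0], pred_span[0]))
    precision = common / (pred_span[1] - pred_span[0])
    recall = common / (gold_span[1] - gold_span[0])
    return precision, recall

def frequency_weight(train_freq):
    """Frequency-factor weight w_i = 1 / (log_e(f_i + 1) + 1); unseen entities
    keep the full weight 1.0, and frequent entities are down-weighted."""
    return 1.0 / (math.log(train_freq + 1) + 1.0)

print(partial_scores((10, 20), (14, 22)))   # (0.75, 0.6)
print(round(frequency_weight(9), 3))        # 0.303
```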
Task 3
For the ADE application, we employed two levels of evaluation for the ADEval classification: entity level and report level.
-
Entity level: the precision (P), recall (R), and F1-score (F) of each ADEval (= 0, 1, 2, 3) were micro-averaged for the disease and medicine entities.
-
Report level: we regard a report that contains at least one entity with ADEval ≥ 1 as a POSITIVE-REPORT, and otherwise as a NEGATIVE-REPORT. This binary classification scheme evaluates the report-wise P, R, and F values for the POSITIVE-REPORT class (a minimal sketch of this rule is given after this list).
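As a simple illustration of the report-level rule (not the official evaluation code), a report can be labeled from the ADEvals of its entities as follows:

```python
# Report-level rule: POSITIVE-REPORT if any entity has ADEval >= 1.
def report_label(entity_adevals):
    return "POSITIVE-REPORT" if any(v >= 1 for v in entity_adevals) else "NEGATIVE-REPORT"

print(report_label([0, 0, 3]))  # POSITIVE-REPORT
print(report_label([0, 0, 0]))  # NEGATIVE-REPORT
```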
For the CI application, we adopted standard metrics for supervised clustering: adjusted normalized mutual information (AdjMI),[62] the Fowlkes–Mallows (FM) score,[63] and binary accuracy. We aimed to penalize random predictions and predictions that split clinically similar documents into numerous clusters; both AdjMI and FM are robust to these errors. AdjMI is a chance-adjusted variant of mutual information, a popular clustering metric. FM provides a useful complementary estimate of performance, as it ranges from zero to one.
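For reference, both clustering metrics are available in scikit-learn (an assumption on our part; the workshop's exact evaluation script may differ), which makes the evaluation easy to reproduce:

```python
# A minimal sketch of the CI clustering evaluation with scikit-learn.
from sklearn.metrics import adjusted_mutual_info_score, fowlkes_mallows_score

gold_case_ids = [0, 0, 0, 1, 1, 1, 2, 2, 2]   # which case each report describes
pred_clusters = [0, 0, 1, 1, 1, 1, 2, 2, 0]   # a system's predicted clusters

print(adjusted_mutual_info_score(gold_case_ids, pred_clusters))  # AdjMI
print(fowlkes_mallows_score(gold_case_ids, pred_clusters))       # FM
```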
Results
System Notations
Distinct systems proposed by a team X, for example, are denoted with numbers, such as "X-1" and "X-2." For readability, the scores are multiplied by 100, except for the Task 3 CI metrics ([Table 9]).
Tasks 1 and 2
[Table 6] lists the F1 measures per evaluation factor obtained for Tasks 1 and 2. Since predicting the entity modalities is optional, we separately report the scores of the modality-aware match ("+mod") from the modality-agnostic match ("label").
Table 6
Results of Tasks 1 and 2
Task | Corpus | Language | System ID | Exact label | Exact label (weighted) | Exact +mod | Exact +mod (weighted) | Partial label | Partial label (weighted) | Partial +mod | Partial +mod (weighted)
1 | CR | JA | AMI-1 | 61.33 | 51.95 | – | – | 78.41 | 68.12 | – | –
1 | CR | JA | AMI-2 | 61.24 | 51.88 | – | – | 78.46 | 68.19 | – | –
1 | CR | JA | Baseline | 65.25 | 55.50 | 59.21 | 49.93 | 77.27 | 66.89 | 69.77 | 59.93
1 | CR | JA | NICTmed-1 | 56.96 | 47.37 | 52.49 | 43.33 | 72.67 | 62.30 | 65.52 | 55.74
1 | CR | JA | NICTmed-2 | 60.76 | 50.48 | 56.02 | 46.21 | 72.57 | 61.64 | 65.96 | 55.62
1 | CR | JA | NICTmed-3 | 55.50 | 46.50 | 51.71 | 43.15 | 75.22 | 64.89 | 68.28 | 58.50
1 | CR | JA | NICTmed-4 | 58.13 | 48.63 | 54.20 | 45.15 | 74.64 | 63.81 | 68.21 | 57.96
1 | CR | JA | NTTD-1 | 61.89 | 51.98 | – | – | 73.61 | 62.93 | – | –
1 | CR | JA | Zukyo-1 | 30.88 | 23.83 | 25.91 | 19.63 | 55.14 | 47.12 | 44.88 | 37.77
1 | CR | JA | Zukyo-2 | 35.85 | 29.68 | 30.13 | 24.59 | 63.95 | 56.33 | 53.07 | 46.25
1 | CR | JA | Zukyo-3 | 26.56 | 21.95 | 22.47 | 18.36 | 58.65 | 52.04 | 48.20 | 42.32
1 | CR | JA | Zukyo-4 | 27.73 | 23.34 | 23.08 | 19.10 | 59.67 | 53.46 | 49.63 | 44.03
1 | CR | EN | FRDC-1 | 43.21 | 38.50 | – | – | 56.48 | 51.24 | – | –
1 | CR | EN | FRDC-2 | 43.71 | 38.90 | – | – | 56.55 | 51.22 | – | –
1 | CR | EN | NICTmed-1 | 46.83 | 40.92 | 42.45 | 37.01 | 69.99 | 62.80 | 62.42 | 55.83
1 | CR | EN | NICTmed-2 | 48.60 | 42.47 | 44.06 | 38.43 | 69.90 | 62.52 | 62.95 | 56.16
1 | CR | EN | NICTmed-3 | 49.18 | 43.26 | 44.80 | 39.38 | 72.39 | 65.28 | 64.86 | 58.40
1 | CR | EN | NICTmed-4 | 51.45 | 45.25 | 46.96 | 41.27 | 71.42 | 64.04 | 64.81 | 58.08
1 | CR | EN | SRCB-1 | 59.80 | 52.55 | 54.84 | 48.09 | 73.72 | 65.35 | 67.69 | 59.94
1 | CR | EN | SRCB-2 | 63.37 | 56.16 | 58.53 | 51.81 | 78.80 | 70.42 | 72.69 | 64.88
1 | CR | EN | SRCB-3 | 62.31 | 55.15 | 57.49 | 50.80 | 77.90 | 69.47 | 71.81 | 63.94
1 | CR | EN | SRCB-4 | 59.33 | 52.65 | 54.52 | 48.31 | 77.84 | 70.05 | 71.56 | 64.35
1 | CR | EN | SRCB-5 | 60.33 | 53.64 | 55.40 | 49.17 | 78.25 | 70.34 | 71.80 | 64.44
1 | CR | EN | Zukyo-1 | 45.56 | 39.65 | 29.57 | 25.89 | 70.32 | 63.03 | 44.79 | 40.05
1 | CR | EN | Zukyo-2 | 51.97 | 45.89 | 33.35 | 29.50 | 73.76 | 66.38 | 47.11 | 42.28
1 | CR | EN | Zukyo-3 | 51.16 | 44.78 | 32.63 | 28.67 | 72.20 | 64.53 | 46.09 | 41.11
1 | CR | EN | Zukyo-4 | 49.18 | 43.17 | 30.77 | 27.05 | 71.91 | 64.55 | 45.26 | 40.46
1 | RR | JA | AMI-1 | 15.05 | 11.65 | – | – | 96.39 | 56.68 | – | –
1 | RR | JA | AMI-2 | 89.26 | 51.81 | – | – | 96.14 | 57.69 | – | –
1 | RR | JA | Baseline | 84.88 | 48.71 | 80.79 | 46.74 | 92.69 | 55.36 | 87.78 | 52.81
1 | RR | JA | NTTD-1 | 87.03 | 49.92 | – | – | 93.85 | 55.80 | – | –
1 | RR | JA | Zukyo-1 | 58.11 | 31.91 | 42.59 | 25.50 | 82.01 | 49.71 | 57.27 | 36.93
1 | RR | JA | Zukyo-2 | 60.22 | 32.78 | 43.63 | 25.72 | 83.70 | 50.78 | 58.94 | 37.81
1 | RR | JA | Zukyo-3 | 57.79 | 31.27 | 42.24 | 24.80 | 82.13 | 50.03 | 58.57 | 37.64
1 | RR | JA | Zukyo-4 | 56.74 | 30.96 | 42.16 | 24.77 | 82.01 | 50.24 | 58.84 | 38.03
1 | RR | EN | SRCB-1 | 82.60 | 54.96 | 79.19 | 52.62 | 92.86 | 64.02 | 88.62 | 60.95
1 | RR | EN | SRCB-2 | 82.66 | 55.00 | 78.74 | 52.31 | 92.93 | 64.06 | 88.05 | 60.59
1 | RR | EN | SRCB-3 | 80.61 | 53.58 | 77.19 | 51.05 | 92.24 | 63.88 | 87.87 | 60.50
1 | RR | EN | Zukyo-1 | 75.92 | 49.57 | 63.50 | 41.07 | 90.85 | 63.16 | 74.10 | 50.62
1 | RR | EN | Zukyo-2 | 79.97 | 52.99 | 67.07 | 44.00 | 91.32 | 63.25 | 75.51 | 51.63
1 | RR | EN | Zukyo-3 | 78.77 | 51.92 | 65.32 | 42.64 | 91.56 | 63.46 | 74.69 | 51.04
1 | RR | EN | Zukyo-4 | 78.95 | 52.45 | 65.45 | 43.09 | 91.70 | 63.81 | 75.13 | 51.65
2 | CR | JA | AMI-1 | 37.10 | 36.44 | 37.10 | 36.44 | 61.63 | 60.91 | 61.63 | 60.91
2 | CR | JA | Baseline | 25.12 | 24.74 | 19.49 | 19.12 | 45.89 | 45.47 | 34.64 | 34.24
2 | RR | JA | AMI-1 | 64.85 | 62.17 | 51.33 | 49.58 | 88.43 | 85.71 | 68.64 | 66.85
2 | RR | JA | Baseline | 62.55 | 60.13 | 46.68 | 44.62 | 82.89 | 80.39 | 60.94 | 58.80
2 | RR | EN | Syapse-1 | 54.96 | 54.22 | 50.37 | 49.68 | 82.89 | 81.79 | 75.99 | 74.95
Abbreviations: CR, case report; EN, English; JA, Japanese.
Note: Bold font indicates the best score for each evaluation metric.
The best scores in Task1-CR-JA range from 49.93 (exact, +mod, weighted) to 78.46 (partial, label), most of which were achieved by our baseline, whereas SRCB-2 consistently achieved the best scores, ranging from 51.81 to 78.80, in Task1-CR-EN. The best scores in Task1-RR-JA range from 46.74 to 96.39 (by the AMI systems and our baseline), whereas scores of 52.62 to 92.93 (by SRCB-1 and -2) were the best in Task1-RR-EN. In Task2-CR-JA and Task2-RR-JA, AMI-1 outperformed our baseline, achieving scores of 36.44 to 61.63 and 49.58 to 88.43, respectively. The only participating system, Syapse-1, scored 49.68 to 82.89 in Task2-RR-EN. No system was submitted to Task2-CR-EN.
Task 3
ADE Application
[Tables 7] and [8] list the results of the ADE application on MedTxt-CR. Note that the test dataset does not contain any ADEval = 2 entities. Three systems, including our baseline, participated in the Japanese corpus track (JA). At the entity level, the NICTmed systems performed better than our baseline, prioritizing precision; their P and F tended to be higher. At the report level, our baseline achieved the best recall (77.78), whereas NICTmed-1 performed best in terms of P (37.50) and F (48.00).
Table 7
Results of Task 3 for MedTxt-CR-JA
System ID | ADEval=0 P | ADEval=0 R | ADEval=0 F | ADEval=1 P | ADEval=1 R | ADEval=1 F | ADEval=3 P | ADEval=3 R | ADEval=3 F | Report-level P | Report-level R | Report-level F
Baseline | 95.21 | 76.04 | 84.55 | 0.00 | 0.00 | 0.00 | 6.98 | 52.94 | 12.33 | 12.73 | 77.78 | 21.88
NICTmed-1 | 95.76 | 97.67 | 96.71 | 0.00 | 0.00 | 0.00 | 12.50 | 11.76 | 12.12 | 37.50 | 66.67 | 48.00
NICTmed-2 | 96.05 | 97.00 | 96.52 | 0.00 | 0.00 | 0.00 | 27.59 | 47.06 | 34.78 | 25.00 | 44.44 | 32.00
Note: Italic font indicates the best score for each evaluation metric.
Table 8
Results of Task 3 for MedTxt-CR-EN
System ID | ADEval=0 P | ADEval=0 R | ADEval=0 F | ADEval=1 P | ADEval=1 R | ADEval=1 F | ADEval=3 P | ADEval=3 R | ADEval=3 F | Report-level P | Report-level R | Report-level F
FRDC-1 | 95.70 | 94.94 | 95.32 | 20.00 | 5.26 | 8.33 | 62.50 | 26.32 | 37.04 | 22.22 | 66.67 | 33.33
FRDC-2 | 95.79 | 97.00 | 96.39 | 14.29 | 5.26 | 7.69 | 43.75 | 36.84 | 40.00 | 29.41 | 55.56 | 38.46
FRDC-3 | 95.95 | 93.52 | 94.72 | 6.25 | 5.26 | 5.71 | 28.57 | 21.05 | 24.24 | 19.35 | 66.67 | 30.00
FRDC-4 | 96.05 | 92.10 | 94.03 | 25.00 | 5.26 | 8.70 | 22.22 | 42.11 | 29.09 | 18.92 | 77.78 | 30.43
FRDC-5 | 95.87 | 95.26 | 95.56 | 0.00 | 0.00 | 0.00 | 56.25 | 47.37 | 51.43 | 25.93 | 77.78 | 38.89
FRDC-6 | 96.14 | 94.47 | 95.30 | 25.00 | 10.53 | 14.81 | 50.00 | 21.05 | 29.63 | 21.21 | 77.78 | 33.33
FRDC-7 | 95.67 | 94.31 | 94.99 | 0.00 | 0.00 | 0.00 | 33.33 | 26.32 | 29.41 | 19.35 | 66.67 | 30.00
FRDC-8 | 96.42 | 97.79 | 97.10 | 20.00 | 5.26 | 8.33 | 47.62 | 52.63 | 50.00 | 50.00 | 77.78 | 60.87
FRDC-9 | 96.35 | 91.79 | 94.01 | 0.00 | 0.00 | 0.00 | 23.81 | 52.63 | 32.79 | 18.92 | 77.78 | 30.43
FRDC-10 | 95.87 | 95.26 | 95.56 | 7.14 | 5.26 | 6.06 | 26.92 | 36.84 | 31.11 | 23.08 | 66.67 | 34.29
NICTmed-1 | 96.53 | 96.68 | 96.61 | 0.00 | 96.68 | 0.00 | 31.25 | 52.63 | 39.22 | 25.00 | 55.56 | 34.48
NICTmed-2 | 95.39 | 98.10 | 96.73 | 0.00 | 0.00 | 0.00 | 40.00 | 42.11 | 41.03 | 40.00 | 44.44 | 42.11
SRCB-1 | 96.57 | 97.95 | 97.25 | 14.29 | 5.26 | 7.69 | 60.00 | 63.16 | 61.54 | 50.00 | 66.67 | 57.14
SRCB-2 | 96.57 | 97.95 | 97.25 | 0.00 | 0.00 | 0.00 | 59.09 | 68.42 | 63.41 | 50.00 | 66.67 | 57.14
SRCB-3 | 96.28 | 98.10 | 97.18 | 0.00 | 0.00 | 0.00 | 60.00 | 63.16 | 61.54 | 50.00 | 55.56 | 52.63
SRCB-4 | 96.41 | 97.63 | 97.02 | 0.00 | 0.00 | 0.00 | 57.14 | 63.16 | 60.00 | 50.00 | 66.67 | 57.14
SRCB-5 | 95.88 | 99.37 | 97.60 | 0.00 | 0.00 | 0.00 | 78.57 | 57.89 | 66.67 | 60.00 | 33.33 | 42.86
SRCB-6 | 95.99 | 98.26 | 97.11 | 33.33 | 5.26 | 9.09 | 55.56 | 52.63 | 54.05 | 50.00 | 44.44 | 47.06
Syapse-1 | 97.02 | 97.63 | 97.32 | 30.00 | 31.58 | 30.77 | 100.0 | 26.32 | 41.67 | 50.00 | 88.89 | 64.00
Note: Italic font indicates the best score for each evaluation metric.
Further, 19 systems were submitted to the English track (EN). Syapse-1 achieved the best scores most frequently (i.e., in five metrics: P for ADEval = 0, F for ADEval = 1, P for ADEval = 3, and R and F at the report level). SRCB-2, -5, and -6 performed best in several metrics. These scores were higher than those for JA, with an average difference of ∼20 to 30 points.
CI Application
[Table 9] shows the evaluation scores for Task3-RR, the CI application. On the Japanese corpus (JA), Zukyo-1 achieved the highest scores for all metrics: 0.3409 in AdjMI and 0.3622 in FM. On the English corpus (EN), FRDC-1 achieved 0.8437 in AdjMI and 0.8436 in FM. Overall, the EN scores tended to be higher than the JA scores.
Table 9
Performance of each system of Task 3 for MedTxt-RR (CI) in multiple evaluation metrics
Language | System ID | AdjMI | FM | Binary Acc
JA | GunNLP-1 | 0.1988 | 0.2674 | 0.7675
JA | Baseline | 0.1489 | 0.1814 | 0.8131
JA | NICTmed-1 | -0.0117 | 0.1170 | 0.7680
JA | Zukyo-1 | 0.3409 | 0.3622 | 0.8285
EN | FRDC-1 | 0.8437 | 0.8436 | 0.9595
EN | FRDC-2 | 0.8116 | 0.8110 | 0.9508
EN | FRDC-3 | 0.8122 | 0.8126 | 0.9514
EN | FRDC-4 | 0.8122 | 0.8126 | 0.9514
EN | FRDC-5 | 0.8122 | 0.8126 | 0.9514
EN | FRDC-6 | 0.8122 | 0.8126 | 0.9514
EN | FRDC-7 | 0.8261 | 0.8166 | 0.9524
EN | FRDC-8 | 0.8123 | 0.8119 | 0.9514
EN | FRDC-9 | 0.8123 | 0.8119 | 0.9514
EN | FRDC-10 | 0.8255 | 0.8150 | 0.9519
EN | NICTmed-1 | -0.0045 | 0.1085 | 0.7809
EN | Syapse-1 | 0.7309 | 0.6992 | 0.9206
Extreme prediction (isolate all samples) | | -4.7901 | 0.0000 | 0.0000
Abbreviations: EN, English; JA, Japanese.
Note: Italic font indicates the best score for each evaluation metric.
Discussions
Task 1
A distinction in the nature of the corpus is evident in the higher scores achieved for the RR corpus compared with the CR corpus. This observation aligns with the tendency for RRs to exhibit linguistic simplicity when compared with CRs.[27] The CR corpus has a large vocabulary (7,369 unique tokens) that covers most medical fields, whereas the RR corpus has a smaller number of unique tokens (i.e., 1,182).
The performances of the top-tier systems in the two languages (JA vs. EN) were similar (an average difference of ∼5 points), indicating that task difficulty was independent of language at this training-data size (∼100 documents). This may be attributable to the pretrained language models, as discussed below.
For the boundary factor (exact vs. partial match), the partial scores were at least 10 points higher than the exact scores, regardless of the corpus or language. Remarkably, the best scores for the modality-agnostic unweighted partial match were close to 80 for CR and 95 for RR. This indicates that the best systems captured medically important phrases, at least partially, despite the relatively small training data.
For the frequency factor (weighted or not), we did not observe a change in the rank of the top systems even after weighting (except for the partial unweighted modality-agnostic match in RR-JA), which suggests that the best systems did not rely too much on high-frequency entities.
Finally, we discuss the approaches adopted by the participating teams. Their common methods are (1) language models and (2) data augmentation.
-
Language models: almost all systems employ Transformer-based language models, and many teams adopt domain-specific pretrained models, such as BioBERT and Clinical BERT in EN, and UTH-BERT in JA. Now that these pretrained models drive contemporary NLP, even models without additional techniques, such as our baseline, perform well enough.
-
Data augmentation: given the few-resource issues, many systems use data augmentation techniques. The results showed that machine translation–based methods (e.g., SRCB) contributed to the performance more than simple rule-based methods (e.g., FRDC). Owing to improvements in machine translation, round-trip translation would generate semantically correct samples; conversely, rule-based augmentation, such as random word swapping, might break the medical appropriateness of sentences.
Task 2
While Task 1 provided a small corpus of ∼100 documents, our new challenge, Task 2, only included ∼100 sentences in the annotation guidelines for model training. This challenge can be observed in the exact-match performance: even the best systems reached only 50 to 70% of the highest scores of their Task 1 counterparts. However, the partial-match scores of the best systems in Task 2 were rather close to those in Task 1, that is, within a 10-point difference in most cases. For instance, AMI-1 scored 61.63 (partial, +mod, unweighted) in CR-JA, an 8.14-point difference from our baseline's 69.77 in Task1-CR-JA. AMI-1 also achieved 88.43 (partial, label, unweighted) in RR-JA, a performance that seems sufficiently high for certain practical applications, such as medical concept–based document retrieval. Thus, this challenge revealed the potential feasibility of NER based on only the few samples prepared for human annotators.
Task 3
ADE Application
At the entity level, the average F-scores of the submitted systems were proportional to the number of corresponding ADEval entities in the training set ([Table 4]). Report-level ADE performance tended to be inconsistent with entity-level performance; a better entity-level system was not necessarily a better report-level system. Although the corpora are parallel, most EN systems performed much better than the JA systems. For this task, domain-specific language models contributed effectively to the results: most EN systems are based on medically pretrained language models, such as BioBERT, Clinical BERT, and PubMed BERT, whereas the JA systems adopted only general-domain BERT models.
We then focused on effective approaches, particularly for EN. Regarding the F-scores for ADEval = 3 and at the report level, which mostly correspond to ADE signal detection, the SRCB systems generally performed well (averages of 61.2 and 52.3, respectively). They trained models on automatically generated snippets that explicitly explained which entity in a report was related to an ADE, which seemed to enhance both the local and global ADE contexts. In addition, Syapse-1 performed best at the report level (64.00 F); its method compares medicine and disease entities embedded by SciBERT per document, and such within-document drug-disorder relations would contribute to report-level performance.
CI Application
Among the Japanese systems, Zukyo-1 achieved the highest scores, suggesting the effectiveness of sentence classification for determining TNM staging, even with limited external knowledge available.
FRDC-1, which uses heuristics for cancer size matching and Sentence-BERT encoding,[51] achieved the highest performance of all the systems. As shown in [Table 10], the RRs of cases 4 and 5 were successfully grouped into a single cluster, suggesting that matching lesion size is helpful for case distinction.
Table 10
Number of clusters into which each case was split by each system in Task 3 for MedTxt-RR (CI)
Case ID | 4 | 5 | 7 | 8 | 10 | 14 | 15
TNM | T2aN0M0 | T2bN0M0 | T3N1M0 | T3N3M0 | T4N0M0 | T4N3M1a | T2N2M1c
GunNLP-1 | 5 | 6 | 3 | 3 | 3 | 5 | 2
Baseline | 6 | 7 | 5 | 8 | 6 | 5 | 5
NICTmed-1 | 6 | 5 | 5 | 6 | 5 | 5 | 5
Zukyo-1 | 4 | 5 | 3 | 4 | 4 | 3 | 2
FRDC-1 | 1 | 1 | 2 | 2 | 1 | 2 | 3
FRDC-2 | 1 | 1 | 2 | 3 | 1 | 2 | 3
FRDC-3 | 1 | 1 | 2 | 3 | 1 | 2 | 3
FRDC-4 | 1 | 1 | 2 | 3 | 1 | 2 | 3
FRDC-5 | 1 | 1 | 2 | 3 | 1 | 2 | 3
FRDC-6 | 1 | 1 | 2 | 3 | 1 | 2 | 3
FRDC-7 | 1 | 1 | 2 | 2 | 1 | 2 | 3
FRDC-8 | 1 | 1 | 2 | 2 | 2 | 2 | 3
FRDC-9 | 1 | 1 | 2 | 2 | 2 | 2 | 3
FRDC-10 | 1 | 1 | 2 | 2 | 1 | 2 | 3
NICTmed-1 | 5 | 5 | 6 | 7 | 6 | 5 | 5
Syapse-1 | 2 | 2 | 1 | 1 | 1 | 3 | 4
Gold Standard | 1 | 1 | 1 | 1 | 1 | 1 | 1
Abbreviation: CI, case identification.
Although both used a NER-based approach, a large discrepancy was observed between the scores of the GunNLP-1 and Syapse-1 systems. This may reflect differences in the availability of biomedical knowledge bases between Japanese and English. Whereas Syapse-1 used UMLS to normalize biomedical entities, GunNLP-1 had to create bag-of-entity vectors only from the training set, which probably had difficulty dealing with unseen entities in the test set.
As listed in [Table 11], most systems grouped the test cases into the same number of clusters as the gold standard, although the true cluster number was not disclosed. In this task, the test sample size strongly hints at the true cluster number, as exploited by FRDC-1.
Table 11
Cluster sizes created by each system in Task 3 for MedTxt-RR (CI)
Corpus | System ID | Cluster number | Cluster sizes
 | Gold standard | 7 | 9, 9, 9, 9, 9, 9, 9
RR-JA | GunNLP-1 | 8 | 18, 17, 9, 8, 4, 3, 2, 2
RR-JA | Baseline | 33 | 19, 5, 4, 3, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
RR-JA | NICTmed-1 | 7 | 11, 11, 10, 9, 8, 7, 7
RR-JA | Zukyo-1 | 7 | 13, 11, 11, 8, 7, 7, 6
RR-EN | FRDC-1 | 7 | 10, 9, 9, 9, 9, 9, 8
RR-EN | FRDC-2 | 7 | 11, 9, 9, 9, 9, 9, 7
RR-EN | FRDC-3 | 7 | 10, 10, 9, 9, 9, 9, 7
RR-EN | FRDC-4 | 7 | 10, 10, 9, 9, 9, 9, 7
RR-EN | FRDC-5 | 7 | 10, 10, 9, 9, 9, 9, 7
RR-EN | FRDC-6 | 7 | 10, 10, 9, 9, 9, 9, 7
RR-EN | FRDC-7 | 7 | 10, 10, 9, 9, 9, 9, 7
RR-EN | FRDC-8 | 7 | 10, 9, 9, 9, 9, 9, 8
RR-EN | FRDC-9 | 7 | 10, 9, 9, 9, 9, 9, 8
RR-EN | FRDC-10 | 7 | 11, 9, 9, 9, 9, 9, 7
RR-EN | NICTmed-1 | 9 | 12, 10, 8, 8, 7, 6, 6, 5, 1
RR-EN | Syapse-1 | 9 | 12, 12, 9, 9, 9, 7, 2, 2, 1
Abbreviation: CI, case identification.
In summary, the effective strategies differed between Japanese (RR-JA) and English (RR-EN). For the RR-EN, the embedding distance with the help of a knowledge base works well and can be applied in other clinical specialties beyond lung cancer. For the RR-JA, the lack of external public knowledge motivated participants to adopt a more dataset-specific approach, resulting in comparatively lower performance and a limited possibility of application beyond lung cancer.
Limitations
Our workshop has two major limitations. First, relatively few teams participated in the new tasks we designed: guideline learning, ADE, and CI. The numbers of participants in the two languages were also unbalanced. Although the small number of results prevented a finer analysis, we hope these tasks will attract more attention.
Second, we translated the original Japanese corpora into English to create bilingual parallel corpora for our tasks, which may have produced unnatural medical texts in English. It is generally known that the writing style of clinical documents varies in languages and nations. Our English corpora may deviate from the standard writing of typical English CRs and RRs. However, we believe that medical parallel corpora will help international communities understand clinical writing styles in non-English languages, which is important for language-independent MedNLP applications in the future.
Clinical or Public Health Implications
The designed tasks were oriented toward real-world clinical document processing. Although they do not directly affect patient health, the participating teams proposed MedNLP techniques to extract information useful for medical research and analysis from texts (e.g., phenotyping and ADE). In the future, application systems adapting these techniques will support the work and study of medical workers, benefiting patients.
Conclusion
This study introduced the Real-MedNLP workshop, which encompassed three distinct medical NLP tasks conducted on bilingual parallel corpora (English and Japanese): NER, ADE extraction, and CI. The participating teams employed a dual approach, which involved (1) implementing data augmentation techniques and (2) utilizing domain-specific pre-trained language models like BioBERT and ClinicalBERT. These strategies partially addressed the challenges associated with limited resources in MedNLP. However, the performance in tasks involving extremely low-resource settings, such as Task 2 (guideline learning), remained insufficient. Specifically, for newly devised tasks like Task 3 ADE and CI applications, significant effort was required to establish evaluation methodologies that accurately captured their performance characteristics.
Future Work
Since our three tasks and other medical tasks await NLP solutions, organizing and sharing approaches and results worldwide is important. We believe that our datasets and results will boost future research. The results of this workshop provide a rigorous "baseline" for medical information extraction, as it was held right before the rise of large language models (LLMs) such as ChatGPT[64] and Gemini.[65] By comparing their performance on our tasks with our results based on pre-LLM cutting-edge techniques, we can accurately gauge the capability of LLMs in low-resource medical NLP.
Furthermore, we organized a successor of Real-MedNLP at NTCIR-17 (2022–2023), entitled "MedNLP-SC," where "SC" stands for social media and clinical text.[66] [67] This new workshop posed information extraction tasks on patient-generated and doctor-generated texts, in which the low-resource setting remains central, given our experience in Real-MedNLP. Evaluating and comparing the outcomes of that workshop with those of the current one will be another focus of future research.