CC BY-NC-ND 4.0 · Methods Inf Med
DOI: 10.1055/a-2405-2489
Original Article

Cross-lingual Natural Language Processing on Limited Annotated Case/Radiology Reports in English and Japanese: Insights from the Real-MedNLP Workshop

Shuntaro Yada
1   Graduate School of Science and Technology, Nara Institute of Science and Technology, Nara, Japan
,
Yuta Nakamura
2   22nd Century Medical and Research Center, The University of Tokyo Hospital, Tokyo, Japan
,
Shoko Wakamiya
1   Graduate School of Science and Technology, Nara Institute of Science and Technology, Nara, Japan
,
Eiji Aramaki
1   Graduate School of Science and Technology, Nara Institute of Science and Technology, Nara, Japan
Funding This work was supported by JST AIP Trilateral AI Research Grant Number JPMJCR20G9 and MHLW Program Grant Number JPMH21AC500111 (formerly JST AIP-PRISM Grant Number JPMJCR18Y1), Japan.
 

Abstract

Background Textual datasets (corpora) are crucial for the application of natural language processing (NLP) models. However, corpus creation in the medical field is challenging, primarily because of privacy issues with raw clinical data such as health records. Thus, the existing clinical corpora are generally small and scarce, and medical NLP (MedNLP) methodologies that perform well despite limited data availability are required.

Objectives We present the outcomes of the Real-MedNLP workshop, which was conducted using limited and parallel medical corpora. Real-MedNLP exhibits three distinct characteristics: (1) limited annotated documents: the training data comprise only a small set (∼100) of annotated case reports (CRs) and radiology reports (RRs). (2) Bilingually parallel: the constructed corpora are parallel in Japanese and English. (3) Practical tasks: the workshop addresses both fundamental tasks, such as named entity recognition (NER), and applied practical tasks.

Methods We propose three tasks: NER of ∼100 available documents (Task 1), NER based only on annotation guidelines for humans (Task 2), and clinical applications (Task 3) consisting of adverse drug effect (ADE) detection for CRs and identical case identification (CI) for RRs.

Results Nine teams participated in this study. The best systems achieved 0.65 and 0.89 F1-scores for CRs and RRs in Task 1, whereas the top scores in Task 2 decreased to 50 to 70% of those in Task 1. In Task 3, ADE reports were detected with up to a 0.64 F1-score, and CI reached up to 0.96 binary accuracy.

Conclusion Most systems adopted medical-domain–specific pretrained language models, often combined with data augmentation methods. Despite the challenge of limited corpus size in Tasks 1 and 2, recent approaches are promising because the partial-match scores reached ∼0.8–0.9 F1-scores. The Task 3 applications revealed that differences in the availability of external language resources affected performance per language.



Introduction

The rise of electronic medical records has heightened the importance of natural language processing (NLP) techniques in health care due to the vast amount of textual data they generate.[1] Given the widespread interest in NLP within computer science, the volume of research on medical NLP has experienced a remarkable surge annually. This trend has also been supported by numerous medical NLP workshops, such as CLEF eHealth,[2] [3] n2c2[4] (formerly known as i2b2), MADE,[5] and MEDIQA.[6] However, despite the substantial body of research, the availability of privacy-compliant medical text data remains limited, particularly in non-English languages.[7] [8] [9]

To address this limitation, we organized a series of medical NLP workshops with open datasets (the MedNLP Series) at an international conference, NII Testbeds and Community for Information Access Research (NTCIR): MedNLP-1,[10] MedNLP-2,[11] MedNLPDoc,[12] and MedWeb.[13] In MedNLP-1, we introduced a foundational NLP task, named entity recognition (NER), utilizing dummy medical records crafted by medical professionals. MedNLP-2 focused on a term normalization task, again employing dummy medical records developed by medical experts. The MedNLPDoc workshop was designed to encompass a comprehensive task. Beginning with a medical record sourced from a medical textbook, participants were tasked with identifying an appropriate disease name represented by International Classification of Diseases codes. In MedWeb, a disease tweet classification task was designed to simulate the use of social media data in the medical and health care domains; dummy Twitter data were created in Japanese and translated into English and Chinese.

Past workshops in the MedNLP Series, summarized in [Table 1], successfully produced valuable datasets. However, two major problems were identified: (1) the data were not real clinical texts but dummy records or sample texts from medical textbooks, and (2) the datasets were limited to Japanese, which made it difficult to compare the results with those of English-based workshops ([Table 2]).

Table 1

Past MedNLP Series workshops proposed by the authors

Workshop         Year        Corpus                                        Task
MedNLP-1[10]     2012–2013   Dummy HR written by clinicians                NER
MedNLP-2[11]     2013–2014   Dummy HR written by clinicians                NEN
MedNLPDoc[12]    2015–2016   Dummy HR extracted from clinical textbooks    NEN
MedWeb[13]       2016–2017   Dummy Tweets obtained by crowdsourcing        TC

Abbreviations: HR, health record; MedNLP, medical natural language processing; NEN, named entity normalization; NER, named entity recognition; TC, text classification.


Table 2

Real-MedNLP tasks

Task                    Corpus   Format
1) Just 100 Training    CR/RR    NER
2) Guideline Learning   CR/RR    NER
3) Applications         CR       ADE
                        RR       CI

Abbreviations: ADE, adverse drug event; CI, case identification; CR, case report; MedNLP, medical natural language processing; NER, named entity recognition; RR, radiology report.


To address these aspects, during 2021 to 2022, we proposed and organized Real-MedNLP, the first workshop in the MedNLP Series that handles real and parallel medical text. Our data comprised two document types: (1) case reports (CRs; MedTxt-CR) and (2) radiology reports (RRs; MedTxt-RR). Both corpora are realistic medical/clinical texts based on the materials available on the Internet, where realistic means that real case-report articles constitute MedTxt-CR, and MedTxt-RR contains newly written (dummy) RRs that interpret commonly available real radiology images. Furthermore, we manually translated the original Japanese text into English, enabling us to develop the first benchmark for cross-lingual medical NLP. Considering the data, we redesigned the workshop scheme to achieve our goal of promoting systems applicable at the bedside. This reintroduces the aforementioned challenging restrictions in medical NLP: limited dataset sizes. The proposed task format is as follows:

  • Low-resource NER (Tasks 1 and 2): participants are supposed to extract medical expressions from text, although only a limited number of annotated documents are available for training machine learning models. This reflects real-world MedNLP, which often suffers from a scarcity of annotated text in hospitals or their departments owing to annotation costs. We further defined two tasks: Just 100 Training (Task 1) and Guideline Learning (Task 2). This task set is called low-resource in NLP research[14]; Task 2 explicitly corresponds to zero- or few-shot learning in the machine learning field.

  • Applications (Task 3): corresponding to the two document types, we propose two practical and useful MedNLP applications in actual clinical work. For CRs, we designed an information extraction task for adverse drug events (ADE) reporting (i.e., pharmacovigilance) characterized by a different approach from relation extraction, which is usually adopted in existing workshops such as i2b2 2009.[15] We propose a novel case identification (CI) task for RRs to detect reports originating from identical patients.

These demanding tasks offer exciting prospects for advancing practical systems that can enhance various medical services, including phenotyping,[16] drug repositioning,[17] drug target discovery,[18] precision medicine,[19] clinical text-input methods,[20] [21] and electronic health record summarization/aggregation.[22]

This study provides an account of the materials used, detailed task definitions, evaluation metrics employed, an overview of participants' approaches, and the overall results achieved during the Real-MedNLP workshop.



Materials

Corpora: MedTxt

Overview

The textual datasets (corpora) released to workshop participants were named MedTxt. Two types of medical and clinical documents were used as corpora: CRs and RRs. These two corpora are parallel in Japanese (JA, original) and English (EN, translated). For example, the Japanese CR corpus is identified as MedTxt-CR-JA.



Case Reports: MedTxt-CR

A CR is a medical research paper in which doctors describe specific clinical cases. CRs aim to share clinically notable issues with other doctors, particularly those in medical societies. The format of CRs is similar to that of discharge summaries, which are clinical documents written by doctors to record the treatment history of discharged patients. Because popular English medical NLP corpora are often composed of discharge summaries (e.g., MIMIC-III), techniques for CR analysis can be readily extended to the analysis of discharge summaries.

MedTxt-CR-JA comprises open-access CRs obtained from the Japanese scholarly publication platform J-Stage.[23] [Fig. 1] shows an annotated sample. As the number of medical societies that produce open-access publications is limited, the types of patients and diseases reported in open-access CRs are highly biased. To reduce the bias caused by the publication policy (whether to prefer open access or not) of each medical society, we selected 224 CRs based on the term frequencies in a Japanese disease-name dictionary, MANBYO-DIC (J-MeDic),[24] which records the frequency of each term in Japanese medical corpora. These CRs were manually translated from Japanese (MedTxt-CR-JA) to English (MedTxt-CR-EN) while retaining the named entity annotations (described later). They were divided into 148 training and 76 test documents.

Fig. 1 Annotated sample of the case report corpora in English (MedTxt-CR-EN). The entity notations stand for D = diseases and symptoms with the modality “certainty” such as positive (+) and negative (−); A = anatomical parts; Time = time expressions with the modality “type” such as date (DATE), age (AGE), and medically specific (MED); Tt/k/v = test set/item/values with the modality “state” such as executed (+); and Mk = medicine name with the modality “state” such as executed (+).


Radiology Reports: MedTxt-RR

An RR is a clinical document written by a radiologist to share a patient's status with physicians. Each RR discusses a single radiological examination, such as radiography, computed tomography (CT), or magnetic resonance imaging. An RR contains (1) descriptions of all normal and abnormal findings and (2) interpretations of the findings, including disease diagnosis and recommendations for the next clinical test or treatment. Although most radiology AI (artificial intelligence) research focuses only on images because image-based AI has drawn much attention, NLP on radiology-report text also has the potential for a wide variety of clinical applications.[25]

MedTxt-RR[26] consists of 135 RRs. [Fig. 2] shows an annotated sample. MedTxt-RR aims to capture the diversity of expressions used by different radiologists to describe the same diagnosis; one of the difficulties in analyzing RRs is the variety of writing styles. However, relying solely on RRs from medical institutions presents limitations: in typical clinical settings, only one report is generated per radiological examination, restricting the data available for in-depth analysis. MedTxt-RR-JA was created to overcome this problem by crowdsourcing, in which nine radiologists independently wrote RRs for the same series of 15 lung cancer cases.

Fig. 2 Annotated sample of the radiology report corpora in English (MedTxt-RR-EN). The entity notations stand for D = diseases and symptoms with the modality “certainty” such as positive (+) and suspicious (?); A = anatomical parts; Time = time expressions with the modality “type” such as date (DATE); and Tt = test sets with the modality “state” such as executed (+).

MedTxt-RR-EN is an English translation of MedTxt-RR-JA produced by nine translators, one per radiologist. We divided the reports into 72 training reports (8 cases) and 63 test reports (7 cases).



Tasks and Annotations

Tasks 1 and 2: Low-Resource NER Challenge

Because NER is the most fundamental information extraction task for medical NLP, we designed a challenge regarding NER on a limited number of clinical reports (∼100). This corpus size fits the so-called low-resource NLP setting, in which training models becomes challenging.[14] This setting is the de facto standard in medical NLP, mainly because of the innate difficulty of medical concept annotation and privacy concerns regarding patient health records. To address this challenge, we defined two tasks based on the size of the available training data:

  • Task 1) Just 100 Training: we provided ∼100 reports for training, corresponding to the standard few-resource (or few-shot) supervised learning.

  • Task 2) Guideline Learning: we provided annotation guidelines containing only a handful of annotated sentences, simulating human annotator training, in which human annotators usually learn from annotation guidelines defined by NLP researchers.

Although we provided only a few training data points for both tasks, workshop participants could use any other resources outside this project (e.g., medical dictionaries and medically pretrained language models) if they found them useful for their methods.

We adopted the following entity types from an existing medical NER scheme[27] [28]:

  • Diseases and symptoms <d> with the modality “certainty”:

    • positive: the existence is recognized.

    • negative: the absence is confirmed.

    • suspicious: the existence is speculated.

    • general: hypothetical or common knowledge mentions.

  • Anatomical parts <a>

  • Time <timex3> with the modality “type”:

    • date: a calendar date.

    • time: a clock time.

    • duration: a continuous period.

    • set: frequency consisting of multiple dates, times, and periods.

    • age: a person's age.

    • med: medicine-specific time expressions such as “post-operative.”

    • misc: exceptional time expressions other than the above.

  • Test <t-test/key/val> with the modality “state”:

  • Medicine <m-key/val> with the modality “state”:

    • scheduled: treatment is planned.

    • executed: treatment was executed.

    • negated: treatment was canceled.

    • other: an exceptional state other than the above.

Detailed definitions and information on modality are provided in Yada et al[27] and Chapter 2 of our annotation guidelines.[28] Several batches of the Japanese corpus were annotated separately by multiple native Japanese speakers without medical knowledge and then translated into English while retaining the annotated entities. This annotation scheme enables coherent annotation of reasonable quality even if annotators lack medical knowledge.[27] [29]

The detailed statistics of the entity annotations in the datasets are presented in [Table 3]. We evaluated the following tag sets.

Table 3

Named entities of the training sets of MedTxt-CR and MedTxt-RR

Dataset

CR-JA

CR-EN

RR-JA

RR-EN

# of texts

148

148

72

72

# of characters (mean)

84,471 (570)

40,383 (272)

16,861 (234)

8,488 (117)

<a>

total

823

819

464

465

<d>

total

2,348

2,346

884

883

“positive”

1,695

1,693

465

462

“suspicious”

80

80

191

191

“negative”

251

251

149

148

“general”

302

302

1

1

<t-test>

total

387

388

26

27

“scheduled”

0

0

0

0

“executed”

362

363

19

19

“negated”

7

7

2

2

“other”

18

18

5

6

<timex3>

total

1,353

1,353

29

29

“date”

539

539

26

26

“time”

53

53

0

0

“duration”

82

82

2

2

“set”

34

34

0

0

“age”

189

189

0

0

“med”

428

428

1

1

“misc”

28

28

0

0

<m-key>

total

344

344

0

0

“scheduled”

0

0

0

0

“executed”

266

266

0

0

“negated”

27

27

0

0

“other”

51

51

0

0

<m-val>

total

64

64

0

0

“scheduled”

0

0

0

0

“executed”

0

0

0

0

“negated”

2

2

0

0

“other”

0

0

0

0

<t-key>

total

524

524

1

1

<t-val>

total

427

427

0

0

Abbreviations: CR, case report; EN, English; JA, Japanese; RR, radiology report.


  • MedTxt-CR: <d>, <a>, <t-test>, <timex3>, <m-key>, <m-val>, <t-key>, and <t-val> (all types above).

  • MedTxt-RR: <d>, <a>, <t-test>, and <timex3> (a subset of the types above).

Teams could choose whether or not to predict the modalities of the entities.



Task 3: Applications

Task3-CR: Adverse Drug Effect Extraction

In this application, tailored for MedTxt-CR, the systems were supposed to analyze input CRs and extract any ADE information. Unlike the typical relation-extraction formulation in past ADE extraction tasks,[4] we set the objective to table slot filling, that is, to independently predict the degree of involvement in ADEs for each mentioned disease and medicine. Thus, we attempted to address an issue with standard ADE extraction in which even medical professionals find it difficult to identify ADEs only from the text, leading to annotation difficulties. As shown in [Fig. 3], the ADE information consists of two tables: <d>-table for disease/symptom names and m-key-table for drug names. For each entity in these tables, the four levels of ADE certainty (ADEval) based on Kelly et al[30] were as follows:

Fig. 3 Task 3—ADE application, wherein each disease or medication entity mentioned in case reports is labeled with the degree of involvement in adverse drug events (ADEval), ranging from 0 to 3. ADE, adverse drug effect.

3: Definitely  |  2: Probably  |  1: Unlikely  |  0: Unrelated

For disease names, ADEvals were interpreted as the likelihood of being an ADE; for medicine names, as the likelihood of causing an ADE. To annotate these labels, we asked two annotators to follow the report author's perspective on whether drugs and symptoms were related to an ADE (i.e., the writer's perception). However, if the annotators noticed other possibilities of ADEs that were not explicitly pointed out in the report, we allowed them to label ADEval ≥ 1 as well (i.e., the reader's perception). Note that one annotator has experience as a pharmacist and the other holds a master's degree in biology. [Table 4] presents the distribution of ADEvals in the training set.

Table 4

ADEval distributions for each entity type in the training set of Task 3 ADE application

ADEval   Disease   Medicine   (total)
0        1,217     103        1,320
1        33        28         61
2        57        22         79
3        123       47         170
Total    1,430     200        1,630

Abbreviation: ADE, adverse drug event.




Task3-RR: Case Identification

This application was specifically designed for MedTxt-RR. Given the unsorted RRs, the participants were required to identify sets of RRs that diagnosed identical images, as depicted in [Fig. 4]. MedTxt-RR was created by collecting RRs from multiple radiologists who independently diagnosed the same CT images; this original correspondence between RRs and CT images was used as the gold standard label (document level).

Fig. 4 Task 3—CI application, where radiology reports written by different radiologists are grouped by the described cases. CI, case identification.

In addition to an educational purpose in which trainee radiologists practice writing reports on shared images, this task would contribute to better NLP models that accurately understand the clinical content of RRs without being confused by synonyms or paraphrases, as MedTxt-RR contains RRs with almost the same clinical content but with various expressions.



Data Availability

The training portions of the MedTxt corpora are publicly available from the NTCIR Test Collection.[31]



Methods

Baseline Systems

Overview

We proposed baseline systems for each task on the Japanese corpora.[32] All systems adopt straightforward approaches to solving the tasks. The models described below are based on a BERT[33] model pretrained on Japanese corpora,[34] which tokenizes Japanese text using the morphological analyzer MeCab.[35]



Task 1 and 2: NER Models

We fine-tuned the individual models on each training set using the same NER-training scheme as Devlin et al.[33] The model predicts all entity types defined in Yada et al,[27] instead of only targeting the subsets for the tasks, because more entity types would provide more context to the model, even though task complexity may increase.

We released the baseline model trained on MedTxt-CR-JA at the Hugging Face Hub.[36]
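For illustration only, a minimal sketch of such a fine-tuning setup is shown below. This is not the released baseline script; the model name, label set, and the BIO-tagged dataset train_ds are placeholder assumptions, and a fast tokenizer (supporting word_ids) is assumed.

# Minimal sketch of NER fine-tuning with Hugging Face transformers.
# Assumptions: `train_ds` is a datasets.Dataset with "tokens" and "ner_tags"
# columns; the model name below is a placeholder for the Japanese BERT used.
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

MODEL_NAME = "cl-tohoku/bert-base-japanese"  # placeholder pretrained model
LABELS = ["O", "B-d", "I-d", "B-a", "I-a"]   # truncated example tag set

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME, num_labels=len(LABELS))

def encode(example):
    # Tokenize pre-split words and align word-level BIO tags to sub-tokens;
    # sub-tokens after the first one of each word are masked with -100.
    enc = tokenizer(example["tokens"], is_split_into_words=True, truncation=True)
    labels, previous = [], None
    for word_id in enc.word_ids():
        labels.append(-100 if word_id is None or word_id == previous
                      else example["ner_tags"][word_id])
        previous = word_id
    enc["labels"] = labels
    return enc

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ner-baseline", num_train_epochs=10),
    train_dataset=train_ds.map(encode),
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()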



Task3-CR: ADE Classifier

We solved this application task using an entity-wise classification scheme. For each disease or drug entity (i.e., each table row), we designed a model input consisting of four parts separated by [SEP] special tokens ([Fig. 5]): (1) the document ID, (2) contextual tokens around the targeted entity, (3) the targeted entity itself, and (4) its entity type name (i.e., “disease” or “drug”). Specifically, Part (2) contains the 50 + 50 characters before and after the target entity, including the entity itself. For simplicity, when the targeted token appears multiple times in the document, the context around the first mention is extracted.

Fig. 5 Input and output formats of the baseline model for Task3-CR (ADE). Although the real inputs were in Japanese, an English sample is used in this figure for readability.
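For illustration only, the input construction described above can be sketched as follows; the function and sample text are hypothetical and do not reproduce the baseline code.

# Hypothetical sketch of assembling the four-part [SEP]-separated input
# for one disease or drug entity.
def build_ade_input(doc_id, text, start, end, entity_type, window=50):
    """Concatenate document ID, +-50-character context, the entity
    mention, and its type name with [SEP] separators."""
    entity = text[start:end]
    context = text[max(0, start - window): end + window]
    return " [SEP] ".join([doc_id, context, entity, entity_type])

# Example usage with a made-up snippet:
sample = "She developed interstitial pneumonia after starting methotrexate."
start = sample.find("interstitial pneumonia")
print(build_ade_input("CR-001", sample, start,
                      start + len("interstitial pneumonia"), "disease"))
# The resulting string is fed to a 4-class (ADEval 0-3) sequence
# classification model.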


Task3-RR: CI Clusterer

We exhaustively classified all document pairs to identify RRs describing identical patients. For each pair of given documents, the model judges whether the inputs describe an identical patient (i.e., binary classification; [Fig. 6]). The two documents are separated by a [SEP] token. To account for input order (permutation), we regard a document pair (Article text 1, Article text 2) as identical-patient reports if and only if both predictions of (Article text 1, Article text 2) and (Article text 2, Article text 1) result in “identical-patient.”

Fig. 6 Input and output formats of the baseline model for Task3-RR (CI). Although the real inputs were in Japanese, an English sample is used in this figure for readability. CI, case identification.
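The pairwise decision rule can be sketched as follows; the classifier call is a stub standing in for the fine-tuned BERT pair classifier.

from itertools import combinations

def predict_identical(doc_a, doc_b):
    """Stub for the binary classifier scoring the pair (doc_a, doc_b)."""
    raise NotImplementedError

def identical_pairs(reports):
    """reports: dict mapping report ID -> text.
    Returns ID pairs judged to describe the same patient; a pair counts
    only if BOTH input orderings are predicted 'identical-patient'."""
    pairs = set()
    for i, j in combinations(sorted(reports), 2):
        if predict_identical(reports[i], reports[j]) and \
           predict_identical(reports[j], reports[i]):
            pairs.add((i, j))
    return pairs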


Participating Teams and Systems

Nine teams participated in the Real-MedNLP workshop. [Table 5] lists the teams, their demographics, and the number of systems submitted by each team for each task. Notably, our workshop attracted global industry participants (i.e., five of the nine teams) from China, Japan, and the United States, demonstrating a high demand for practical medical NLP solutions worldwide. Two teams were multidisciplinary, with medical and computer science researchers collaborating. Most teams adopted pretrained language models as their methodological basis, often combined with external medical knowledge, data augmentation, or both. Each team's approach is summarized below; refer to the corresponding system papers for NTCIR-16 Real-MedNLP[32] [37] [38] [39] [40] [41] [42] [43] [44] [45] for more information.

Table 5

Team demographic and the number of systems developed by each team

Team demographic

# of submitted systems

Task 1

Task 2

Task 3

CR

RR

CR

RR

CR

RR

Team

#members

Country

Affiliation

JA

EN

JA

EN

JA

EN

JA

EN

JA

EN

JA

EN

AMI

3

Japan

Industry

2

2

1

1

FRDC

4

China

Industry

2

10

10

GunNLP

1

Japan

University

1

Baseline[a]

6

Japan

University

1

1

1

1

1

1

NICTmed

4

Japan

Institute

4

4

2

2

1

1

NTTD

4

Japan

Industry

1

1

SRCB

6

China

Industry

5

3

6

Syapse

5

US

Industry

1

1

1

Zukyo[a]

11

Japan and Switzerland

University and institute

4

4

4

4

1

Total

12

15

8

7

2

0

2

1

3

19

4

12

Abbreviations: CR, case report; EN, English; JA, Japanese; RR, radiology report.


a Stands for multidisciplinary (medicine + computer science) teams.


  • AMI (Task1-CR-JA, Task1-RR-JA, Task2-CR-JA, Task2-RR-JA): this team[37] adopted the medically pretrained Japanese BERT (UTH-BERT[46]) as its base model. For Task 1, two systems were proposed: an ensemble of four UTH-BERT models and a fine-tuned UTH-BERT with a CRF layer. For Task 2, a knowledge-guided pipeline was proposed in which UTH-BERT's NER predictions were corrected using medical dictionaries such as J-MeDic,[24] Hyakuyaku-Dictionary,[47] and comeJisyo[48] along with an additional vocabulary extended by bootstrapping.

  • FRDC (Task1-CR-EN, Task3-CR-EN (ADE), Task3-RR-EN (CI)): this team[39] submitted two systems utilizing BioBERT[49] for Task1-CR-EN. One system involved fine-tuning exclusively, while the other integrated data augmentation techniques, including label-wise token replacement, synonym replacement, mention replacement, and shuffling within segments.[50] For Task 3, this team proposed a vocabulary-adapted BERT (VART) model that was continuously trained from a fine-tuned BERT, but with out-of-vocabulary words from the initial fine-tuning. In Task3-CR-EN (ADE), VART was trained with multiple NLP tasks to classify the ADEval for each entity based on its contextual text and predict the entity type (disease or drug). Task3-RR-EN (CI) was solved using a combination of two main methods: (1) key feature clustering to find case-specific information, such as tumor size, and (2) K-means clustering based on document embedding using sentence BERT[51] to cluster the remaining cases unidentified in Step (1).

  • GunNLP (Task3-RR-JA (CI)): this team[39] applied collaborative filtering to an entity-frequency matrix created from the bag-of-words representation of RRs.

  • NAISTSOC (Baseline) (Task1-CR-JA, Task1-RR-JA, Task2-CR-JA, Task2-RR-JA, Task3-CR-JA (ADE), Task3-RR-JA (CI)): this multidisciplinary team[32] provides the aforementioned task baselines for Japanese corpus tracks.

  • NICTmed (Task1-CR-EN, Task1-CR-JA, Task3-CR-EN (ADE), Task3-CR-JA (ADE), Task3-RR-EN (CI), Task3-RR-JA (CI)): this team[40] investigated the effectiveness of two multilingual language models, multilingual BERT (mBERT)[33] and XLM-RoBERTa.[52] While simply fine-tuning them for Task1-CR (NER), the team addressed Task3-CR (ADE) by treating ADEval as an additional attribute of the named entities. Task3-RR (CI) was solved by K-means clustering of documents vectorized by mBERT.

  • NTTD (Task1-CR-JA, Task1-RR-JA): this team fine-tuned the Japanese BERT using data augmentation (i.e., synonym replacement and shuffling within segments).[50]

  • SRCB (Task1-CR-EN, Task1-RR-EN, Task3-CR-EN (ADE)): this team[42] adopted BERT,[33] BioBERT, clinical BERT,[53] PubMed BERT,[54] and entity BERT[55] as the base models for Task 1. These are fine-tuned by span-based prompting (e.g., token prediction with the prompt “[span] is a [MASK] entity,” where [span] is one of the possible spans and [MASK] is the span's entity type), along with data augmentation (i.e., language model-based token replacement). For Task 3 (ADE), the team used PubMed BERT, Clinical BERT, and BioBERT, and after multitask learning (medicine/disease classification and cloze-test tasks) and two-stage training,[56] they were fine-tuned with prompt learning (i.e., ADE-causing drug and disease pair prediction) assisted by data augmentation (back translation via Chinese and random feature dropout).

  • Syapse (Task2-RR-EN, Task3-CR-EN (ADE), Task3-RR-EN (CI)): this team[43] did not perform any fine-tuning on the given training datasets. Instead, it applied standard NLP modules, such as MetaMap[57] and ScispaCy,[58] in a pipeline. First, the pipeline obtains entities for Task 2. For the ADE application, an additional SciBERT[59] module measures the similarity between medicine and disease embedding pairs and regards high-similarity pairs as high-ADEval entities. A bag-of-entity representation of documents was used for the CI application to measure document similarity.

  • Zukyo (Task1-CR-JA, Task1-RR-JA, Task1-CR-EN, Task1-RR-EN, Task3-RR-JA (CI)): this multidisciplinary team addressed the tasks separately per language. The Japanese sub-team[44] fine-tuned an ensemble of Japanese BERT models using data augmentation (random entity replacement constrained by entity types). For the Japanese CI application, the sub-team manually re-annotated each sentence of the given RR corpus using the TNM classification,[60] the international standard of cancer staging. The same ensemble NER architecture was fine-tuned separately for the sentence-wise sequential labeling of TNM.

The English sub-team[45] adopted domain-specific BERT models to tackle Task 1: BioBERT,[49] ClinicalBERT,[53] and RoBERTa (general domain).[61] Furthermore, entity attributes were predicted by an SVM using bags of contextual words around the entities. The training dataset was augmented by random entity replacement constrained by entity types.



Evaluation Metrics

Tasks 1 and 2

We employed the common F1 measure as the NER metric; specifically, we adopted its micro-average over entity classes (i.e., micro F1). Furthermore, we considered the following two factors to enable an evaluation specific to few-resource NER:

  • Boundary factor (exact/partial): the standard NER metric treats a prediction as a correct NE match if and only if the predicted span is identical to the gold standard (i.e., exact match). We also introduce a partial match for Tasks 1 and 2: if the predicted span overlaps the gold-standard span, the prediction is regarded as “partially correct,” obtaining a diminished score. This is intended for downstream tasks in which partially identified NEs are still useful, such as large-scale medical record analysis and highlighting important note sections. We considered the proportion of common characters between the gold-standard and predicted entities to calculate the partial-match score. Given that a gold-standard entity e_i (1 ≤ i ≤ n) and a predicted entity ê_j (1 ≤ j ≤ m) overlap in their spans, we first calculate the entity-level partial-match precision p_j = |c| / |ê_j| and recall r_i = |c| / |e_i|, where |e_k| stands for the character length of entity e_k and |c| denotes the common character length between e_i and ê_j. We then obtain the system-level partial-match precision P_partial and recall R_partial by averaging the entity-level scores over the m predicted and n gold-standard entities, respectively:

P_partial = (Σ_j p_j) / m,  R_partial = (Σ_i r_i) / n.

Finally, we calculate the partial F1-score, i.e., their harmonic mean.

  • Frequency factor: in our few-resource NER setting, substantial portions of NEs in the corpora appear only a few times, making the tasks challenging for machine learning models. To measure model performance in identifying such rare NEs, we designed a novel weighted metric that penalizes correct guesses more heavily as the predicted entity appears more frequently in the training dataset. For each gold-standard entity i in the test set, we multiplied the entity-level precision and recall scores by the weight w_i = 1 / (ln(f_i + 1) + 1), where f_i is the term frequency of the same entity in the training set. This weighting portrays the extent to which the system relied on high-frequency entities in the training phase, as well as how well the system captured low-frequency entities. (A computational sketch of both evaluation factors follows this list.)
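The following sketch illustrates both factors under the simplifying assumptions that entities are given as character-offset spans and that each entity is matched against its best-overlapping counterpart; the official scoring script may differ in detail.

import math

def overlap(a, b):
    # Number of characters shared by two (start, end) spans.
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def partial_match_prf(gold_spans, pred_spans):
    # System-level partial-match precision, recall, and F1 for one document.
    if not gold_spans or not pred_spans:
        return 0.0, 0.0, 0.0
    # Entity-level precision: overlapped length relative to each prediction.
    p = [max(overlap(g, pr) for g in gold_spans) / (pr[1] - pr[0]) for pr in pred_spans]
    # Entity-level recall: overlapped length relative to each gold entity.
    r = [max(overlap(g, pr) for pr in pred_spans) / (g[1] - g[0]) for g in gold_spans]
    precision, recall = sum(p) / len(p), sum(r) / len(r)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def frequency_weight(train_frequency):
    # Down-weights entities that appeared often in the training set:
    # w = 1 / (ln(f + 1) + 1).
    return 1.0 / (math.log(train_frequency + 1) + 1.0)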

Task 3

For the ADE application, we employed two levels of evaluation for the ADEval classification: entity level and report level.

  • Entity level: the precision (P), recall (R), and F1-score (F) of each ADEval (= 0, 1, 2, 3) were micro-averaged for the disease and medicine entities.

  • Report level: we regard a report that contains at least one entity with ADEval ≥ 1 as a POSITIVE-REPORT and otherwise as a NEGATIVE-REPORT. This binary classification scheme evaluates the report-wise P, R, and F values of the POSITIVE-REPORT class, as sketched below.
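A minimal sketch of this report-level evaluation, assuming each report is represented by the list of its entities' gold or predicted ADEval values, could look as follows.

def is_positive(adevals):
    # A report is a POSITIVE-REPORT if any entity has ADEval >= 1.
    return any(v >= 1 for v in adevals)

def report_level_prf(gold_reports, pred_reports):
    gold = [is_positive(r) for r in gold_reports]
    pred = [is_positive(r) for r in pred_reports]
    true_pos = sum(g and p for g, p in zip(gold, pred))
    precision = true_pos / sum(pred) if sum(pred) else 0.0
    recall = true_pos / sum(gold) if sum(gold) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1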

For the CI application, we adopted standard metrics for supervised clustering: adjusted normalized mutual information (AdjMI),[62] the Fowlkes–Mallows (FM) score,[63] and binary accuracy. We aimed to penalize predictions that are random or that split clinically similar documents into numerous clusters; both AdjMI and FM are robust against such errors. AdjMI is a chance-adjusted variant of mutual information, a popular clustering metric. The FM score also provides an intuitive estimate of performance, as it ranges from zero to one.
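Assuming that scikit-learn's adjusted mutual information can stand in for AdjMI and that cluster assignments are given as integer labels aligned by report, scoring a submission could look like the sketch below (binary accuracy follows the task-specific definition and is not reproduced here).

from sklearn.metrics import adjusted_mutual_info_score, fowlkes_mallows_score

gold = [0, 0, 0, 1, 1, 2, 2, 2]   # toy gold case assignments for 8 reports
pred = [0, 0, 1, 1, 1, 2, 2, 2]   # a hypothetical system's clustering

print("AdjMI:", adjusted_mutual_info_score(gold, pred))
print("FM   :", fowlkes_mallows_score(gold, pred))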



Results

System Notations

Distinct systems proposed by a team (e.g., team X) are denoted with numbers, such as “X-1” and “X-2.” For readability, we multiplied the scores by 100, except for the Task 3 metrics.



Tasks 1 and 2

[Table 6] lists the F1 measures per evaluation factor obtained for Tasks 1 and 2. Since predicting the entity modalities was optional, we report the scores of the modality-aware match (“+mod”) separately from those of the modality-agnostic match (“label”).

Table 6

Results of Tasks 1 and 2

Exact match

Partial match

label

+mod

label

+mod

Task

Corpus

Language

System ID

weighted

weighted

weighted

weighted

1

CR

JA

AMI-1

61.33

51.95

78.41

68.12

AMI-2

61.24

51.88

78.46

68.19

Baseline

65.25

55.50

59.21

49.93

77.27

66.89

69.77

59.93

NICTmed-1

56.96

47.37

52.49

43.33

72.67

62.30

65.52

55.74

NICTmed-2

60.76

50.48

56.02

46.21

72.57

61.64

65.96

55.62

NICTmed-3

55.50

46.50

51.71

43.15

75.22

64.89

68.28

58.50

NICTmed-4

58.13

48.63

54.20

45.15

74.64

63.81

68.21

57.96

NTTD-1

61.89

51.98

73.61

62.93

Zukyo-1

30.88

23.83

25.91

19.63

55.14

47.12

44.88

37.77

Zukyo-2

35.85

29.68

30.13

24.59

63.95

56.33

53.07

46.25

Zukyo-3

26.56

21.95

22.47

18.36

58.65

52.04

48.20

42.32

Zukyo-4

27.73

23.34

23.08

19.10

59.67

53.46

49.63

44.03

EN

FRDC-1

43.21

38.50

56.48

51.24

FRDC-2

43.71

38.90

56.55

51.22

NICTmed-1

46.83

40.92

42.45

37.01

69.99

62.80

62.42

55.83

NICTmed-2

48.60

42.47

44.06

38.43

69.90

62.52

62.95

56.16

NICTmed-3

49.18

43.26

44.80

39.38

72.39

65.28

64.86

58.40

NICTmed-4

51.45

45.25

46.96

41.27

71.42

64.04

64.81

58.08

SRCB-1

59.80

52.55

54.84

48.09

73.72

65.35

67.69

59.94

SRCB-2

63.37

56.16

58.53

51.81

78.80

70.42

72.69

64.88

SRCB-3

62.31

55.15

57.49

50.80

77.90

69.47

71.81

63.94

SRCB-4

59.33

52.65

54.52

48.31

77.84

70.05

71.56

64.35

SRCB-5

60.33

53.64

55.40

49.17

78.25

70.34

71.80

64.44

Zukyo-1

45.56

39.65

29.57

25.89

70.32

63.03

44.79

40.05

Zukyo-2

51.97

45.89

33.35

29.50

73.76

66.38

47.11

42.28

Zukyo-3

51.16

44.78

32.63

28.67

72.20

64.53

46.09

41.11

Zukyo-4

49.18

43.17

30.77

27.05

71.91

64.55

45.26

40.46

RR

JA

AMI-1

15.05

11.65

96.39

56.68

AMI-2

89.26

51.81

96.14

57.69

Baseline

84.88

48.71

80.79

46.74

92.69

55.36

87.78

52.81

NTTD-1

87.03

49.92

93.85

55.80

Zukyo-1

58.11

31.91

42.59

25.50

82.01

49.71

57.27

36.93

Zukyo-2

60.22

32.78

43.63

25.72

83.70

50.78

58.94

37.81

Zukyo-3

57.79

31.27

42.24

24.80

82.13

50.03

58.57

37.64

Zukyo-4

56.74

30.96

42.16

24.77

82.01

50.24

58.84

38.03

EN

SRCB-1

82.60

54.96

79.19

52.62

92.86

64.02

88.62

60.95

SRCB-2

82.66

55.00

78.74

52.31

92.93

64.06

88.05

60.59

SRCB-3

80.61

53.58

77.19

51.05

92.24

63.88

87.87

60.50

Zukyo-1

75.92

49.57

63.50

41.07

90.85

63.16

74.10

50.62

Zukyo-2

79.97

52.99

67.07

44.00

91.32

63.25

75.51

51.63

Zukyo-3

78.77

51.92

65.32

42.64

91.56

63.46

74.69

51.04

Zukyo-4

78.95

52.45

65.45

43.09

91.70

63.81

75.13

51.65

2

CR

JA

AMI-1

37.10

36.44

37.10

36.44

61.63

60.91

61.63

60.91

Baseline

25.12

24.74

19.49

19.12

45.89

45.47

34.64

34.24

RR

JA

AMI-1

64.85

62.17

51.33

49.58

88.43

85.71

68.64

66.85

Baseline

62.55

60.13

46.68

44.62

82.89

80.39

60.94

58.80

EN

Syapse-1

54.96

54.22

50.37

49.68

82.89

81.79

75.99

74.95

Abbreviations: CR, case report; EN, English; JA, Japanese.


Note: Bold font indicates the best score for each evaluation metric.


The best scores range from 49.93 (exact, +mod, weighted) to 78.46 (partial, label) across the evaluation factors in Task1-CR-JA, most of which were achieved by our baseline, whereas SRCB-2 consistently achieved the best scores, ranging from 51.81 to 78.80, in Task1-CR-EN. The best scores in Task1-RR-JA range from 46.74 to 96.39 (by the AMI systems and our baseline), whereas scores of 52.62 to 92.93 (by SRCB-1 and -2) were the best in Task1-RR-EN. In Task2-CR-JA and Task2-RR-JA, AMI-1 outperformed our baseline, achieving scores of 36.44–61.63 and 49.58–88.43, respectively. The only participating system, Syapse-1, scored 49.68 to 82.89 in Task2-RR-EN. No system was submitted to Task2-CR-EN.



Task 3

ADE Application

[Tables 7] and [8] list the results of the ADE application on MedTxt-CR. Note that the test dataset does not contain any ADEval = 2 entities. Three systems, including our baseline, participated in the Japanese corpus track (JA). At the entity level, the NICTmed systems performed better than our baseline by prioritizing precision; their P and F values tended to be higher. At the report level, our baseline achieved the best recall (77.78), whereas NICTmed-1 performed best in terms of P (37.50) and F (48.00).

Table 7

Results of Task 3 for MedTxt-CR-JA

             ADEval = 0             ADEval = 1            ADEval = 3             Report level
System ID    P      R      F        P      R      F       P      R      F        P      R      F
Baseline     95.21  76.04  84.55    0.00   0.00   0.00    6.98   52.94  12.33    12.73  77.78  21.88
NICTmed-1    95.76  97.67  96.71    0.00   0.00   0.00    12.50  11.76  12.12    37.50  66.67  48.00
NICTmed-2    96.05  97.00  96.52    0.00   0.00   0.00    27.59  47.06  34.78    25.00  44.44  32.00

Note: Italic font indicates the best score for each evaluation metric.


Table 8

Results of Task 3 for MedTxt-CR-EN

ADEval = 0

ADEval = 1

ADEval = 3

Report-level

System ID

P

R

F

P

R

F

P

R

F

P

R

F

FRDC-1

95.70

94.94

95.32

20.00

5.26

8.33

62.50

26.32

37.04

22.22

66.67

33.33

FRDC-2

95.79

97.00

96.39

14.29

5.26

7.69

43.75

36.84

40.00

29.41

55.56

38.46

FRDC-3

95.95

93.52

94.72

6.25

5.26

5.71

28.57

21.05

24.24

19.35

66.67

30.00

FRDC-4

96.05

92.10

94.03

25.00

5.26

8.70

22.22

42.11

29.09

18.92

77.78

30.43

FRDC-5

95.87

95.26

95.56

0.00

0.00

0.00

56.25

47.37

51.43

25.93

77.78

38.89

FRDC-6

96.14

94.47

95.30

25.00

10.53

14.81

50.00

21.05

29.63

21.21

77.78

33.33

FRDC-7

95.67

94.31

94.99

0.00

0.00

0.00

33.33

26.32

29.41

19.35

66.67

30.00

FRDC-8

96.42

97.79

97.10

20.00

5.26

8.33

47.62

52.63

50.00

50.00

77.78

60.87

FRDC-9

96.35

91.79

94.01

0.00

0.00

0.00

23.81

52.63

32.79

18.92

77.78

30.43

FRDC-10

95.87

95.26

95.56

7.14

5.26

6.06

26.92

36.84

31.11

23.08

66.67

34.29

NICTmed-1

96.53

96.68

96.61

0.00

96.68

0.00

31.25

52.63

39.22

25.00

55.56

34.48

NICTmed-2

95.39

98.10

96.73

0.00

0.00

0.00

40.00

42.11

41.03

40.00

44.44

42.11

SRCB-1

96.57

97.95

97.25

14.29

5.26

7.69

60.00

63.16

61.54

50.00

66.67

57.14

SRCB-2

96.57

97.95

97.25

0.00

0.00

0.00

59.09

68.42

63.41

50.00

66.67

57.14

SRCB-3

96.28

98.10

97.18

0.00

0.00

0.00

60.00

63.16

61.54

50.00

55.56

52.63

SRCB-4

96.41

97.63

97.02

0.00

0.00

0.00

57.14

63.16

60.00

50.00

66.67

57.14

SRCB-5

95.88

99.37

97.60

0.00

0.00

0.00

78.57

57.89

66.67

60.00

33.33

42.86

SRCB-6

95.99

98.26

97.11

33.33

5.26

9.09

55.56

52.63

54.05

50.00

44.44

47.06

Syapse-1

97.02

97.63

97.32

30.00

31.58

30.77

100.0

26.32

41.67

50.00

88.89

64.00

Note: Italic font indicates the best score for each evaluation metric.


Further, 19 systems were submitted to the English track (EN). Syapse-1 achieved the best scores most frequently (i.e., in five metrics: P for ADEval = 0, F for ADEval = 1, P for ADEval = 3, and R and F at the report level). SRCB-2, -5, and -6 performed best in several metrics. These scores were higher than those for JA, with an average difference of ∼20 to 30 points.



CI Application

[Table 9] shows the evaluation scores for Task3-RR, the CI application. In the Japanese corpus (JA), Zukyo-1 achieved the highest scores for all metrics: 0.3409 in AdjMI and 0.3622 in FM. For the English corpus (EN), FRDC-1 achieved 0.8437 AdjMI and 0.8436 FM. Overall, the EN scores tended to be higher than the JA scores.

Table 9

Performance of each system of Task 3 for MedTxt-RR (CI) in multiple evaluation metrics

Language   System ID                                  AdjMI     FM       Binary Acc
JA         GunNLP-1                                   0.1988    0.2674   0.7675
           Baseline                                   0.1489    0.1814   0.8131
           NICTmed-1                                  -0.0117   0.1170   0.7680
           Zukyo-1                                    0.3409    0.3622   0.8285
EN         FRDC-1                                     0.8437    0.8436   0.9595
           GunNLP-2                                   0.8116    0.8110   0.9508
           GunNLP-3                                   0.8122    0.8126   0.9514
           GunNLP-4                                   0.8122    0.8126   0.9514
           GunNLP-5                                   0.8122    0.8126   0.9514
           GunNLP-6                                   0.8122    0.8126   0.9514
           GunNLP-7                                   0.8261    0.8166   0.9524
           GunNLP-8                                   0.8123    0.8119   0.9514
           GunNLP-9                                   0.8123    0.8119   0.9514
           GunNLP-10                                  0.8255    0.8150   0.9519
           NICTmed-1                                  -0.0045   0.1085   0.7809
           Syapse-1                                   0.7309    0.6992   0.9206
           Extreme prediction (isolate all samples)   -4.7901   0.0000   0.0000

Abbreviations: EN, English; JA, Japanese.


Note: Italic font indicates the best score for each evaluation metric.




Discussions

Task 1

A difference in the nature of the two corpora is evident in the higher scores achieved for the RR corpus compared with the CR corpus. This observation aligns with the tendency of RRs to exhibit linguistic simplicity compared with CRs.[27] The CR corpus has a large vocabulary (7,369 unique tokens) that covers most medical fields, whereas the RR corpus has a smaller number of unique tokens (i.e., 1,182).

The performances of the top-tier systems in the two languages (JA vs. EN) were similar (an average difference of ∼5 points), indicating that task difficulty was independent of the language at this size of training data (∼100 documents). This is likely a benefit of pretrained language models, as discussed subsequently.

For the boundary factor (exact vs. partial match), the partial scores were at least 10 points higher than the exact scores, regardless of the corpus or language. Remarkably, the best scores for the modality-agnostic unweighted partial match were close to 80 for CR and 95 for RR. This indicates that the best systems captured medically important phrases, at least partially, despite the relatively small training data.

For the frequency factor (weighted or not), we did not observe a change in the rank of the top systems even after weighting (except for the partial unweighted modality-agnostic match in RR-JA), which suggests that the best systems did not rely too much on high-frequency entities.

Finally, we discuss the approaches adopted by the participating teams. Their common methods are (1) language models and (2) data augmentation.

  • Language models: almost all systems employ Transformer-based language models, and many teams adopt domain-specific pretrained models, such as BioBERT and Clinical BERT in EN, and UTH-BERT in JA. Now that these pretrained models drive contemporary NLP, even models without additional techniques, such as our baseline, perform well enough.

  • Data augmentation: given the few-resource issues, many systems use data augmentation techniques. The results showed that machine translation–based methods (e.g., SRCB) contributed to the performance more than simple rule-based methods (e.g., FRDC). Owing to improvements in machine translation, round-trip translation would generate semantically correct samples; conversely, rule-based augmentation, such as random word swapping, might break the medical appropriateness of sentences.



Task 2

While Task 1 provided a small corpus of ∼100 documents, our new challenge, Task 2, provided only the ∼100 sentences contained in the annotation guidelines for model training. The difficulty of this setting can be observed in the exact-match performance: even the best systems reached only 50 to 70% of the highest scores of their Task 1 counterparts. However, the partial-match scores of the best systems in Task 2 were rather close to those in Task 1, that is, within only a 10-point difference in most cases. For instance, AMI-1 scored 61.63 (partial, +mod, unweighted) in CR-JA, an 8.14-point difference from our baseline's 69.77 in Task1-CR-JA. AMI-1 also achieved 88.43 (partial, label, unweighted) in RR-JA, a performance that seems sufficiently high for certain practical applications, such as medical concept–based document retrieval. Thus, this challenge revealed the potential feasibility of NER based on only the few samples intended for human annotators.



Task 3

ADE Application

At the entity level, the average F-scores of the submitted systems were roughly proportional to the number of corresponding ADEval entities in the training set ([Table 4]). Report-level ADE performance tended to be inconsistent with entity-level performance; a better entity-level system was not necessarily a better report-level system. Although the corpora were parallel, most EN systems performed much better than the JA systems. For this task, domain-specific language models contributed effectively to the results: most EN systems are based on medically pretrained language models, such as BioBERT, Clinical BERT, and PubMed BERT, whereas the JA systems adopted only general-domain BERT models.

We then focused on effective approaches, particularly for EN. Regarding the F-scores for ADEval = 3 and at the report level, which mostly correspond to ADE signal detection, the SRCB systems generally performed well (averages of 61.2 for ADEval = 3 and 52.3 at the report level). They trained models on automatically generated snippets that explicitly explained which entity in a report was related to an ADE, which seemed to enhance the local and global ADE contexts. In addition, Syapse-1 performed best at the report level (64.00 F); its method compares medicine and disease entities embedded by SciBERT per document, and such drug-disorder relations inside each document would contribute to report-level performance.



CI Application

Among the Japanese systems, Zukyo-1 achieved the highest scores, suggesting the effectiveness of sentence classification for determining TNM staging, even with limited available knowledge.

FRDC-1, which uses heuristics for cancer size matching and Sentence-BERT encoding,[51] achieved the highest performance of all the systems. As shown in [Table 10], the RRs of cases 4 and 5 were successfully grouped into a single cluster, suggesting that matching lesion size is helpful for case distinction.

Table 10

Number of clusters into which each case was split by each system in Task 3 for MedTxt-RR (CI)

Case ID         4         5         7        8        10       14        15
TNM             T2aN0M0   T2bN0M0   T3N1M0   T3N3M0   T4N0M0   T4N3M1a   T2N2M1c
GunNLP-1        5         6         3        3        3        5         2
Baseline        6         7         5        8        6        5         5
NICTmed-1       6         5         5        6        5        5         5
Zukyo-1         4         5         3        4        4        3         2
FRDC-1          1         1         2        2        1        2         3
FRDC-2          1         1         2        3        1        2         3
FRDC-3          1         1         2        3        1        2         3
FRDC-4          1         1         2        3        1        2         3
FRDC-5          1         1         2        3        1        2         3
FRDC-6          1         1         2        3        1        2         3
FRDC-7          1         1         2        2        1        2         3
FRDC-8          1         1         2        2        2        2         3
FRDC-9          1         1         2        2        2        2         3
FRDC-10         1         1         2        2        1        2         3
NICTmed-1       5         5         6        7        6        5         5
Syapse-1        2         2         1        1        1        3         4
Gold Standard   1         1         1        1        1        1         1

Abbreviation: CI, case identification.


Although both used an NER-based approach, a large discrepancy was observed between the scores of the GunNLP-1 and Syapse-1 systems. This may reflect differences in the availability of biomedical knowledge bases between Japanese and English: whereas Syapse-1 used the UMLS to normalize biomedical entities, GunNLP-1 had to create bag-of-entity vectors only from the training set and therefore probably had difficulty dealing with unseen entities in the test set.

As listed in [Table 11], most systems grouped the test cases into the same number of clusters as the gold standard, although the true cluster number was not disclosed. In this task, the true cluster number can readily be inferred from the test sample size, as exploited by FRDC-1.

Table 11

Cluster sizes created by each system in Task 3 for MedTxt-RR (CI)

Corpus   System ID       Cluster number   Cluster size
         Gold standard   7                9, 9, 9, 9, 9, 9, 9, 9, 9
RR-JA    GunNLP-1        8                18, 17, 9, 8, 4, 3, 2, 2
         Baseline        33               19, 5, 4, 3, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
         NICTmed-1       7                11, 11, 10, 9, 8, 7, 7
         Zukyo-1         7                13, 11, 11, 8, 7, 7, 6
RR-EN    FRDC-1          7                10, 9, 9, 9, 9, 9, 8
         FRDC-2          7                11, 9, 9, 9, 9, 9, 7
         FRDC-3          7                10, 10, 9, 9, 9, 9, 7
         FRDC-4          7                10, 10, 9, 9, 9, 9, 7
         FRDC-5          7                10, 10, 9, 9, 9, 9, 7
         FRDC-6          7                10, 10, 9, 9, 9, 9, 7
         FRDC-7          7                10, 10, 9, 9, 9, 9, 7
         FRDC-8          7                10, 9, 9, 9, 9, 9, 8
         FRDC-9          7                10, 9, 9, 9, 9, 9, 8
         FRDC-10         7                11, 9, 9, 9, 9, 9, 7
         NICTmed-1       9                12, 10, 8, 8, 7, 6, 6, 5, 1
         Syapse-1        9                12, 12, 9, 9, 9, 7, 2, 2, 1

Abbreviation: CI, case identification.


In summary, the effective strategies differed between Japanese (RR-JA) and English (RR-EN). For RR-EN, embedding distances combined with a knowledge base worked well and could be applied in clinical specialties beyond lung cancer. For RR-JA, the lack of external public knowledge motivated participants to adopt more dataset-specific approaches, resulting in comparatively lower performance and a limited possibility of application beyond lung cancer.



Limitations

Our workshop has two major limitations. First, relatively few teams participated in the newly designed tasks: guideline learning, ADE, and CI. The numbers of participants in the two languages were also unbalanced. Although the small number of results prevented a finer analysis, we hope these tasks will attract more attention.

Second, we translated the original Japanese corpora into English to create bilingual parallel corpora for our tasks, which may have produced unnatural medical texts in English. It is generally known that the writing style of clinical documents varies across languages and nations, and our English corpora may deviate from the standard writing of typical English CRs and RRs. However, we believe that medical parallel corpora will help international communities understand clinical writing styles in non-English languages, which is important for language-independent MedNLP applications in the future.



Clinical or Public Health Implications

The designed tasks were oriented toward real-world clinical document processing. Although they do not directly affect patient health, the participating teams proposed MedNLP techniques to extract information useful for medical research and analysis from texts (e.g., phenotyping and ADE detection). In the future, application systems adopting these techniques will support the work and study of medical workers, ultimately benefiting patients.



Conclusion

This study introduced the Real-MedNLP workshop, which encompassed three distinct medical NLP tasks conducted on bilingual parallel corpora (English and Japanese): NER, ADE extraction, and CI. The participating teams employed a dual approach: (1) implementing data augmentation techniques and (2) utilizing domain-specific pretrained language models such as BioBERT and ClinicalBERT. These strategies partially addressed the challenges associated with limited resources in MedNLP. However, performance in extremely low-resource settings, such as Task 2 (guideline learning), remained insufficient. In particular, for newly devised tasks such as the Task 3 ADE and CI applications, significant effort was required to establish evaluation methodologies that accurately capture their performance characteristics.



Future Work

Since our three tasks, as well as other medical tasks, are awaiting NLP solutions, organizing and sharing approaches and results worldwide is important. We believe that our datasets and results will boost future research. The results of this workshop also provide a rigorous “baseline” for medical information extraction, since the workshop was held right before the rise of large language models (LLMs) such as ChatGPT[64] and Gemini.[65] By comparing their performance on our tasks with our results based on pre-LLM cutting-edge techniques, we can accurately gauge the capability of LLMs in low-resource medical NLP.

Furthermore, we organized a successor of Real-MedNLP at NTCIR-17 (2022 to 2023), entitled “MedNLP-SC,” where “SC” stands for social media and clinical text.[66] [67] This new workshop posed information extraction tasks on patient-generated and doctor-generated texts, in which the low-resource setting remains relevant given our experience in Real-MedNLP. Evaluating and comparing the outcomes of that workshop with those of the current one will be another focus of future research.



Conflict of Interest

None declared.

Acknowledgment

The authors also greatly appreciate the NTCIR-16 chairs for their efforts in organizing the NTCIR-16 conference. Finally, the authors thank all participants for their contributions to our Real-MedNLP workshop.

Ethical Approval Statement

This study did not require the participants to be involved in any physical or mental intervention. Furthermore, as it did not utilize personally identifiable information, the study was exempt from institutional review board approval in accordance with the Ethical Guidelines for Medical and Health Research Involving Human Subjects outlined by the Japanese national government.


  • References

  • 1 Aramaki E, Wakamiya S, Yada S, Nakamura Y. Natural language processing: From bedside to everywhere. Yearb Med Inform 2022; 31 (01) 243-253
  • 2 Suominen H, Salanterä S, Velupillai S. et al. Overview of the ShARe/CLEF eHealth evaluation lab 2013. In: Information Access Evaluation. Multilinguality, Multimodality, and Visualization. Berlin, Heidelberg:: Springer; 2013: 212-231
  • 3 Névéol A, Cohen KB, Grouin C. et al. Clinical information extraction at the CLEF eHealth evaluation lab 2016. CEUR Workshop Proc 2016; 1609: 28-42
  • 4 Henry S, Buchan K, Filannino M, Stubbs A, Uzuner O. 2018 n2c2 shared task on adverse drug events and medication extraction in electronic health records. J Am Med Inform Assoc 2020; 27 (01) 3-12
  • 5 Jagannatha A, Liu F, Liu W, Yu H. Overview of the first natural language processing challenge for extracting medication, indication, and adverse drug events from electronic health record notes (MADE 1.0). Drug Saf 2019; 42 (01) 99-111
  • 6 Ben Abacha A, Mrabet Y, Zhang Y, Shivade C, Langlotz C, Demner-Fushman D. Overview of the MEDIQA 2021 shared task on summarization in the medical domain. In: Proceedings of the 20th Workshop on Biomedical Language Processing. Association for Computational Linguistics; 2021: 74-85
  • 7 He B, Dong B, Guan Y. et al. Building a comprehensive syntactic and semantic corpus of Chinese clinical texts. J Biomed Inform 2017; 69: 203-217
  • 8 Campillos L, Deléger L, Grouin C, Hamon T, Ligozat AL, Névéol A. A French clinical corpus with comprehensive semantic annotations: development of the medical entity and relation LIMSI annOtated text corpus (MERLOT). Lang Resour Eval 2018; 52 (02) 571-601
  • 9 Oliveira LESE, Peters AC, da Silva AMP. et al. SemClinBr - a multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks. J Biomed Semantics 2022; 13 (01) 13
  • 10 Morita M, Kano Y, Ohkuma T, Miyabe M, Aramaki E. Overview of the NTCIR-10 MedNLP task. In: Proceedings of the 10th NTCIR Conference on Evaluation of Information Access Technologies, NTCIR-10, National Center of Sciences, Tokyo. National Institute of Informatics (NII); 2013
  • 11 Morita M, Kano Y, Ohkuma T, Aramaki E. Overview of the NTCIR-11 MedNLP task. In: Proceedings of the 11th NTCIR Conference on Evaluation of Information Access Technologies, NTCIR-11, National Center of Sciences, Tokyo. National Institute of Informatics (NII); 2014
  • 12 Morita M, Kano Y, Ohkuma T, Aramaki E. Overview of the NTCIR-12 MedNLPDoc task. In: Proceedings of the 12th NTCIR Conference on Evaluation of Information Access Technologies, NTCIR-12, National Center of Sciences, Tokyo. National Institute of Informatics (NII); 2016
  • 13 Wakamiya S, Morita M, Kano Y, Ohkuma T, Aramaki E. Overview of the NTCIR-13 MedWeb task. In: Proceedings of the 13th NTCIR Conference on Evaluation of Information Access Technologies, NTCIR-13, National Center of Sciences, Tokyo. National Institute of Informatics (NII); 2017
  • 14 Uzuner O, Solti I, Xia F, Cadag E. Community annotation experiment for ground truth generation for the i2b2 medication challenge. J Am Med Inform Assoc 2010; 17 (05) 519-523
  • 15 Hedderich MA, Lange L, Adel H, Strötgen J, Klakow D. A survey on recent approaches for natural language processing in low-resource scenarios. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021: 2545-2568
  • 16 Kirby JC, Speltz P, Rasmussen LV. et al. PheKB: a catalog and workflow for creating electronic phenotype algorithms for transportability. J Am Med Inform Assoc 2016; 23 (06) 1046-1052
  • 17 Xue H, Li J, Xie H, Wang Y. Review of drug repositioning approaches and resources. Int J Biol Sci 2018; 14 (10) 1232-1244
  • 18 Öztürk H, Özgür A, Schwaller P, Laino T, Ozkirimli E. Exploring chemical space using natural language processing methodologies for drug discovery. Drug Discov Today 2020; 25 (04) 689-705
  • 19 Roberts K, Demner-Fushman D, Voorhees EM. et al. Overview of the TREC 2017 precision medicine track. In: Proceedings of the Twenty-Sixth Text REtrieval Conference. 2017: 26
  • 20 Biswal S, Xiao C, Glass LM, Westover B, Sun J. CLARA: Clinical report auto-completion. In: Proceedings of the Web Conference 2020. WWW '20. Association for Computing Machinery; 2020: 541-550
  • 21 Yazdani A, Safdari R, Golkar A, RNiakanKalhori S. Words prediction based on N-gram model for free-text entry in electronic health records. Health Inf Sci Syst 2019; 7 (01) 6
  • 22 Pivovarov R, Elhadad N. Automated methods for the summarization of electronic health records. J Am Med Inform Assoc 2015; 22 (05) 938-947
  • 23 J-Stage. Accessed April 18, 2024 at: https://www.jstage.jst.go.jp/
  • 24 Ito K, Nagai H, Okahisa T, Wakamiya S, Iwao T, Aramaki E. J-MeDic: A Japanese disease name dictionary based on real clinical usage. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA); 2018. Accessed September 4, 2024 at: https://aclanthology.org/L18-1375
  • 25 Pons E, Braun LM, Hunink MG, Kors JA. Natural language processing in radiology: a systematic review. Radiology 2016; 279 (02) 329-343
  • 26 Nakamura Y, Hanaoka S, Nomura Y. et al. Clinical comparable corpus describing the same subjects with different expressions. Stud Health Technol Inform 2022; 290: 253-257
  • 27 Yada S, Joh A, Tanaka R, Cheng F, Aramaki E, Kurohashi S. Towards a versatile medical-annotation guideline feasible without heavy medical knowledge: starting from critical lung diseases. In: Proceedings of the 12th Language Resources and Evaluation Conference. European Language Resources Association; 2020: 4565-4572. Accessed September 4, 2024 at: https://www.aclweb.org/anthology/2020.lrec-1.561
  • 28 Yada S, Aramaki E, Tanaka R, Cheng F, Kurohashi S. Medical/Clinical Text Annotation Guidelines. 2021
  • 29 Yada S, Tanaka R, Cheng F, Aramaki E, Kurohashi S. Versatile annotation guidelines for clinical-medical text with an application to critical lung diseases [in Japanese]. J Nat Lang Process 2022; 29 (04) 1165-1197
  • 30 Kelly CR, Kunde SS, Khoruts A. Guidance on preparing an investigational new drug application for fecal microbiota transplantation studies. Clin Gastroenterol Hepatol 2014; 12 (02) 283-288
  • 31 NTCIR test collection. Accessed September 4, 2024 at: https://research.nii.ac.jp/ntcir/data/data-en.html
  • 32 Nishiyama T, Nishidani M, Ando A, Yada S, Wakamiya S, Aramaki E. NAISTSOC at the NTCIR-16 Real-MedNLP task. In: Proceedings of the 16th NTCIR Conference on Evaluation of Information Access Technologies. 2022: 330-333
  • 33 Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics; 2019: 4171-4186
  • 34 BERT models for Japanese NLP. Accessed April 18, 2023 at: https://huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking
  • 35 MeCab. Accessed April 18, 2023 at: https://taku910.github.io/mecab/
  • 36 MedNER-CR-JA. Accessed April 18, 2023 at: https://huggingface.co/sociocom/MedNER-CR-JA
  • 37 Hiai S, Nagayama S, Kojima A. AMI team at the NTCIR-16 Real-MedNLP task. In: Proceedings of the 16th NTCIR Conference on Evaluation of Information Access Technologies. 2022: 297-304
  • 38 Zhong Z, Fang L, Cao Y. FRDC at NTCIR-16 Real-MedNLP task. In: Proceedings of the 16th NTCIR Conference on Evaluation of Information Access Technologies. 2022: 311-315
  • 39 Noguchi R. GunNLP at the NTCIR-16 Real-MedNLP task: Collaborative filtering-based similar case identification method via structured data “case matrix”. In: Proceedings of the 16th NTCIR Conference on Evaluation of Information Access Technologies. 2022: 349-352
  • 40 Ideuchi M, Tsuchiya M, Wang Y, Utiyama M. NICTmed at the NTCIR-16 Real-MedNLP task. In: Proceedings of the 16th NTCIR Conference on Evaluation of Information Access Technologies. 2022: 339-344
  • 41 Shao S, Jin G, Satoh D, Nomura Y. NTTD at the NTCIR-16 Real-MedNLP task. In: Proceedings of the 16th NTCIR Conference on Evaluation of Information Access Technologies. 2022: 345-348
  • 42 Zhang Y, Cheng R, Luo L, Gao H, Jiang S, Dong B. SRCB at the NTCIR-16 Real-MedNLP task. In: Proceedings of the 16th NTCIR Conference on Evaluation of Information Access Technologies. 2022: 305-310
  • 43 Holmes B, Gagorik A, Loving J, Green F, Huang H. Syapse at the NTCIR-16 Real-MedNLP task. In: Proceedings of the 16th NTCIR Conference on Evaluation of Information Access Technologies. 2022: 334-338
  • 44 Fujimoto K, Nishio M, Sugiyama O. et al. Approach for named entity recognition and case identification implemented by ZuKyo-JA sub-team at the NTCIR-16 Real-MedNLP task. In: Proceedings of the 16th NTCIR Conference on Evaluation of Information Access Technologies. 2022: 322-329
  • 45 Nooralahzadeh F, Horvath AN, Krauthammer M. Leveraging token-based concept information and data augmentation in few-resource NER: ZuKyo-EN at the NTCIR-16 Real-MedNLP task. In: Proceedings of the 16th NTCIR Conference on Evaluation of Information Access Technologies. 2022: 316-321
  • 46 Kawazoe Y, Shibata D, Shinohara E, Aramaki E, Ohe K. A clinical specific BERT developed using a huge Japanese clinical text corpus. PLoS One 2021; 16 (11) e0259763
  • 47 Hyakuyaku-Dictionary. Accessed April 18, 2023 at: https://sociocom.naist.jp/hyakuyaku-dic/
  • 48 ComeJisyo. Accessed April 18, 2023 at: https://ja.osdn.net/projects/comedic/
  • 49 Lee J, Yoon W, Kim S. et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020; 36 (04) 1234-1240
  • 50 Dai X, Adel H. An analysis of simple data augmentation for named entity recognition. In: Proceedings of the 28th International Conference on Computational Linguistics. International Committee on Computational Linguistics; 2020: 3861-3867
  • 51 Reimers N, Gurevych I. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics; 2019: 3982-3992
  • 52 Conneau A, Khandelwal K, Goyal N. et al. Unsupervised cross-lingual representation learning at scale. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics; 2020: 8440-8451
  • 53 Alsentzer E, Murphy J, Boag W. et al. Publicly available clinical BERT embeddings. In: Proceedings of the 2nd Clinical Natural Language Processing Workshop. Association for Computational Linguistics; 2019: 72-78
  • 54 Gu Y, Tinn R, Cheng H. et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans Comput Healthc 2021; 3 (01) 1-23
  • 55 Lin C, Miller T, Dligach D, Bethard S, Savova GK. EntityBERT: Entity-centric masking strategy for model pretraining for the clinical domain. In: Proceedings of the 20th Workshop on Biomedical Language Processing. Association for Computational Linguistics; 2021: 191-201
  • 56 Zhou B, Cui Q, Wei XS, Chen ZM. BBN: Bilateral-branch network with cumulative learning for long-tailed visual recognition. Published online 2019
  • 57 Aronson AR. Effective mapping of biomedical text to the UMLS metathesaurus: The MetaMap program. In: Proceedings of the AMIA Symposium. American Medical Informatics Association; 2001: 17-21
  • 58 Neumann M, King D, Beltagy I, Ammar W. ScispaCy: Fast and robust models for biomedical natural language processing. Published online February 2019. Accessed September 4, 2024 at: https://arxiv.org/abs/1902.07669
  • 59 Beltagy I, Lo K, Cohan A. SciBERT: A pretrained language model for scientific text. Published online March 2019. Accessed September 4, 2024 at: https://arxiv.org/abs/1903.10676
  • 60 Brierley JD, Gospodarowicz MK, Wittekind C. TNM Classification of Malignant Tumours. Oxford: John Wiley & Sons; 2017
  • 61 Zhuang L, Wayne L, Ya S, Jun Z. A robustly optimized BERT pre-training approach with post-training. In: Proceedings of the 20th Chinese National Conference on Computational Linguistics. Chinese Information Processing Society of China; 2021: 1218-1227. Accessed September 4, 2024 at: https://aclanthology.org/2021.ccl-1.108
  • 62 Vinh NX, Epps J, Bailey J. Information theoretic measures for clusterings comparison: Is a correction for chance necessary? In: Proceedings of the 26th Annual International Conference on Machine Learning. ICML '09. Association for Computing Machinery; 2009: 1073-1080
  • 63 Fowlkes EB, Mallows CL. A method for comparing two hierarchical clusterings. J Am Stat Assoc 1983; 78 (383) 553-569
  • 64 ChatGPT. Accessed September 4, 2024 at: https://chat.openai.com/
  • 65 Gemini Team Google. Gemini: A family of highly capable multimodal models. Published online December 2023. Accessed September 4, 2024 at: https://arxiv.org/abs/2312.11805
  • 66 Nakamura Y, Hanaoka S, Yada S, Wakamiya S, Aramaki E. NTCIR-17 MedNLP-SC Radiology Report Subtask overview: Dataset and solutions for automated lung cancer staging. In: Proceedings of the 17th NTCIR Conference on Evaluation of Information Access Technologies. 2022: 145-151
  • 67 Wakamiya S, Pereira LK, Raithel L. et al. NTCIR-17 MedNLP-SC social media adverse drug event detection: Subtask overview. In: Proceedings of the 17th NTCIR Conference on Evaluation of Information Access Technologies. 2022: 131-141

Address for correspondence

Eiji Aramaki, PhD
Graduate School of Science and Technology, Nara Institute of Science and Technology
Nara
Japan   

Publication History

Received: 07 June 2023

Accepted: 23 August 2024

Accepted Manuscript online:
29 August 2024

Article published online:
29 October 2024

© 2024. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution-NonDerivative-NonCommercial License, permitting copying and reproduction so long as the original work is given appropriate credit. Contents may not be used for commercial purposes, or adapted, remixed, transformed or built upon. (https://creativecommons.org/licenses/by-nc-nd/4.0/)

Georg Thieme Verlag KG
Stuttgart · New York

Fig. 1 Annotated sample of the case report corpora in English (MedTxt-CR-EN). The entity notations stand for D = diseases and symptoms with the modality “certainty” such as positive (+) and negative (−); A = anatomical parts; Time = time expressions with the modality “type” such as date (DATE), age (AGE), and medically specific (MED); Tt/k/v = test set/item/values with the modality “state” such as executed (+); and Mk = medicine name with the modality “state” such as executed (+).
Fig. 2 Annotated sample of the radiology report corpora in English (MedTxt-RR-EN). The entity notations stand for D = diseases and symptoms with the modality “certainty” such as positive (+) and suspicious (?); A = anatomical parts; Time = time expressions with the modality “type” such as date (DATE); and Tt = test sets with the modality “state” such as executed (+).
Fig. 3 Task 3—ADE application, wherein each disease or medication entity mentioned in case reports is labeled with the degree of involvement in adverse drug events (ADEval), ranging from 0 to 3. ADE, adverse drug effect.
Fig. 4 Task 3—CI application, where radiology reports written by different radiologists are grouped by the described cases. CI, case identification.
Fig. 5 Input and output formats of the baseline model for Task3-CR (ADE). Although the real inputs were in Japanese, an English sample is used in this figure for readability.
Fig. 6 Input and output formats of the baseline model for Task3-RR (CI). Although the real inputs were in Japanese, an English sample is used in this figure for readability. CI, case identification.