2 Methods
We collected most of the papers reviewed for this study in November 2022 using Google Scholar. Although not perfect, this search engine aggregates publications from various databases. Additionally, some publicly available archives and anthologies (such as the Anthology of the Association for Computational Linguistics) rely on Google's indices for their paper retrieval. To automate our search, we utilized an available API[1]. Due to time and resource constraints, we did not create individual queries for LoE. Instead, we used the general terms “multilingual” and “cross-lingual,” as these terms are frequently mentioned in papers addressing tasks in LoE. Furthermore, in the existing NLP literature, the term “multilingual” can refer to a wide range of topics, including the implementation of language-agnostic pipelines [[10]], learning language representations in a multilingual space, and experimenting with downstream tasks in LoE. Multilingual language models are often used in the latter case [[11], [12]].
Our main query was (multilingual OR cross-lingual) AND (medical OR clinical) AND (NLP OR “natural language processing”), with a time window restriction of 2020-2022. To retrieve results for our selected topics, we used targeted queries with specific keywords, including (corpus OR database), (NER OR “named entity recognition”), and negation. These targeted queries returned 7,990, 3,280, and 5,950 hits, respectively. We ordered the results by relevance[2] and manually inspected the top 100 titles and abstracts in each category, selecting papers that address the predefined downstream tasks in LoE and present publicly accessible non-English and, when available, multilingual models and datasets. Despite the relevance-based ordering, we encountered many irrelevant hits, confirming the limitations of automatic querying of online databases. We used the same query to search PubMed and medRxiv, selecting 10 papers from PubMed and one already published paper from medRxiv that met our criteria. Furthermore, we analyzed the retrieved publications to assemble a list of shared tasks from 2020 to 2022, examining their proceedings for multilingualism, cross-lingual methods, and LoE. Eleven papers in our survey come from arXiv.org, seven of which were later published in peer-reviewed venues.
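The sketch below illustrates this querying procedure. It is only an approximation: the open-source scholarly package stands in for the API wrapper referenced above, and the screening loop is a simplification of our manual inspection step.

```python
# Hedged sketch of the automated Google Scholar search; the `scholarly`
# package is an illustrative substitute for the API wrapper referenced above.
from scholarly import scholarly

BASE = ('(multilingual OR cross-lingual) AND (medical OR clinical) '
        'AND (NLP OR "natural language processing")')
TOPICS = {
    "corpora":  '(corpus OR database)',
    "NER":      '(NER OR "named entity recognition")',
    "negation": 'negation',
}

for topic, keywords in TOPICS.items():
    # Restrict the time window to 2020-2022; Scholar orders results by relevance.
    hits = scholarly.search_pubs(f"{BASE} AND {keywords}", year_low=2020, year_high=2022)
    # Only the top 100 titles and abstracts per topic were screened manually.
    for rank, pub in zip(range(100), hits):
        print(topic, rank + 1, pub["bib"].get("title", ""))
```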
As an additional point of reference, we collected all available NLP review and survey papers in the biomedical and clinical domain that were published in the last two years and mention multilingualism [[1], [3], [13], [14], [15]]. Although we tried to be as complete as possible, this paper is not meant to be a comprehensive review. Limitations may stem from the use of search engines, language-agnostic queries, and our focus on papers written in English.
3 Results
Section 3.1 introduces new corpora in LoE as well as some multilingual resources. Section 3.2 lists new large Pretrained Language Models (PLMs) in LoE and discusses various pretraining strategies. Sections 3.3 and 3.4 present new publications and advances in two foundational medical NLP tasks, namely Named Entity Recognition (NER) and negation detection. Special attention is given to recent shared tasks in LoE, which we cover in Section 3.5. While conducting our survey, we identified an unparalleled rise in Spanish medical NLP; we dedicate Section 3.6 to this phenomenon.
3.1 New Multilingual Resources and Monolingual Datasets in LoE
There is a well-known need for biomedical databases and annotated corpora in different languages. The issue has remained at the center of attention for many years, improving only slowly. In their 2018 survey, Névéol et al. listed annotated corpora in French, German, Greek, Japanese, Portuguese, Spanish, and Swedish [[4]]. In our survey, we address annotated corpora in LoE that have been made public in the years 2020-2022. We note that the number of languages has more than doubled, with data now available in Arabic, Basque, Bengali, Catalan, traditional and simplified Chinese, Dutch, French, German, Italian, Japanese, Korean, Portuguese, Romanian, Russian, Serbian, Spanish, and Vietnamese. The publicly available datasets can be found in [Table 1].
Table 1 Multilingual resources and non-English datasets.
A notable trend concerns the availability of multilingual corpora. The European Clinical Case Corpus (E3C) is a collection of clinical narratives and descriptive clinical documents in five languages [[16]]. The corpus contains annotations for temporal information as well as clinical entities. While English, French and Spanish texts were sourced from publicly available corpora, clinical cases in Italian and Basque had to be manually extracted from various sources. Significant advances in the quality of Neural Machine Translation (NMT) opened new opportunities for the creation of silver standard corpora. To foster the development of multilingual tools, the LivingNER [[17]] and DisTEMIST [[18]] corpora provide gold standard annotated data in Spanish together with the machine-translated and post-edited counterparts in English, Portuguese, French, Italian, Romanian, Catalan, and Galician. The two corpora feature annotations of species mentions and disease mentions, respectively, which have been transferred to all the languages. NMT facilitated the development of RuMedNLI [[19]], the first comprehensive open Russian medical language understanding benchmark, which is a Russian version of MedNLI [[20]]. The corpus was initially translated using Google Translate and DeepL MT systems, followed by post-editing from a team of native speakers and medical specialists who corrected numerous domain-specific translation errors, as well as cultural and localization issues, including drug name adaptations and measurement conversions.
Several corpora have been developed for German, most of them approximating real clinical data. These include the machine-translated GERNERMED corpus [[21]] and the fully synthetic GPTNERMED corpus [[22]]. Both of these silver standard corpora were created from open resources and used to train models that were then evaluated on gold standard data. The Graz Synthetic Clinical Corpus (GraSCCo) is composed of original clinical documents that have undergone several rounds of anonymization and significant linguistic alterations [[23]]. The corpus was subsequently compared syntactically and semantically with real, non-shareable German clinical data and found to be a suitable dataset for training clinical language models. GGPONC [[24]] and GGPONC 2.0 [[25]] contain German clinical guidelines for oncology and can be used as a proxy for real clinical data. Finally, BRONCO [[26]] is a dataset of de-identified real clinical documents; however, this corpus is rather small and its sentences are randomly shuffled across documents to prevent deanonymization.
Two French medical datasets, CAS [[27]] and CLISTER [[28]], comprise clinical cases extracted from the scientific literature. Yada et al. [[29]] created a Japanese dataset for NER consisting of case and radiology reports. Additionally, there are now two Korean NER datasets that feature multiple question-answer pairs [[30], [31]]. Other resources include SIMONERO, a Romanian medical treebank with gold standard morphological annotations; BanglaBioMed [[32]], a biomedical named-entity annotated Bengali corpus; and UIT-ViNewsQA [[33]], a crowd-sourced Vietnamese dataset. Boudjellal et al. [[34]] created an Arabic dataset for NER and entity linking, employing a silver standard annotation scheme; this dataset was later used to fine-tune an NER model, yielding good results.
Unfortunately, many corpora in LoE remain unavailable to the public for various reasons (ethics, data sensitivity, company policy, etc.). Nevertheless, they are often featured in publications that carry detailed and valuable information on the specificities of a particular LoE (Ukrainian [[35]]), the resource selection (Arabic [[36]]), the annotation process (Tibetan [[37]]), or the evaluation of different machine learning methods (French [[38]]).
Aside from annotated corpora, there is an interest in generalizable methods of corpus creation that are also extendable to other languages or adaptable to other domains. Vivaldi and Rodriguez [[39]] pooled together 13 available mono-, multi-, and cross-lingual resources in seven languages, covering high- (English), medium- (French, German, Spanish), and low-resource languages (Arabic, Basque, and Catalan), to create a multilingual terminology and a language-agnostic methodology for medical semantic tagging.
Overall, we report 30 newly available medical corpora in LoE ([Table 1]), with 20 of them evaluated on NER and four on negation and uncertainty, for a total of 15 distinct languages (19 if silver standard corpora are also taken into account). Moreover, nine of these datasets are used in the shared tasks in Spanish, French, and Japanese. Considering that multiple datasets are now available for some LoE (for example, Spanish and German), there is an opportunity to pool them into benchmarks for evaluating language models on biomedical and clinical tasks. This would make a great contribution to biomedical NLP research since domain-appropriate comprehensive benchmarks and leaderboards are still lacking for many languages.
3.2 Pretrained Language Models for LoE
Large Pretrained Language Models (PLMs) have become the state-of-the-art method for solving various NLP tasks. There is now a plethora of biomedical and clinical PLMs available in English [[13], [40]]. Moreover, there is a steadily growing number of PLMs in LoE. We report biomedical and clinical language-specific PLMs that have become available since 2020 in [Table 2]. Most models are initialized with the weights of their language-specific general-domain counterparts [[41], [42], [43]], which has been the go-to method for the creation of domain-specific models. However, the general-domain vocabulary created within the model during pretraining is not representative of the biomedical or clinical domain and can hurt downstream performance. Similarly, training models from scratch on mixed-domain data is disputed by the biomedical NLP community, especially in a low-resource setting [[44]]. Overall, in-domain pretraining is arguably the best option, provided that training data and computational resources are available [[45], [46], [47]].
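To illustrate the continual (domain-adaptive) pretraining strategy discussed above, the following minimal sketch further pretrains a general-domain checkpoint with masked language modeling on an in-domain corpus using the Hugging Face transformers library. The checkpoint name, file path, and hyperparameters are placeholders, not the exact setups of the cited models.

```python
# Hedged sketch of continual (domain-adaptive) pretraining: a general-domain
# checkpoint is further pretrained with masked language modeling on an
# in-domain corpus. Model name, file path, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

checkpoint = "bert-base-german-cased"            # general-domain starting point (placeholder)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# One clinical/biomedical text segment per line (hypothetical file).
corpus = load_dataset("text", data_files={"train": "clinical_corpus_de.txt"})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="clinical-german-bert", num_train_epochs=3,
                           per_device_train_batch_size=16, learning_rate=5e-5),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
```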
Table 2 Medical language models in LoE from the years 2020-2022.
Nevertheless, mixing closely related domains, such as biomedical and clinical data, can boost performance even in a mid-resource scenario [[48]]. Moreover, mixed-domain pretraining can be unavoidable when resources in the target language are scarce. Despite the controversy about mixed-domain pretraining, there persists an assumption that out-of-domain text contributes to better language modeling: pretraining on a domain-specific corpus alone can be insufficient to learn proper language representations [[43]].
Another question that arises when it comes to PLMs in the biomedical domain is whether to use monolingual or multilingual models for tasks in LoE. General-domain monolingual BERT models tend to outperform multilingual models (for example, BERTje for Dutch [[49]] or EstBERT for Estonian [[50]]). Similar tendencies can be observed in the medical domain. The language-specific Dutch model, RobBERT, outperformed multilingual BERT (mBERT) on the multilabel classification of chest imaging requests and report items [[51]]. The Swedish KB-BERT model outperformed mBERT when fine-tuned for the de-identification task, albeit marginally [[52]].
The characteristics of training and fine-tuning data can strongly affect the outcome of fine-tuning. The large French BERT model CamemBERT [[53]], fine-tuned on French biomedical articles, yielded worse results than the fine-tuned mBERT base model; the amount of biomedical fine-tuning data was presumably insufficient to alter the knowledge gained by CamemBERT during pretraining [[12]]. The large Portuguese BERT model also performed worse than mBERT in a related study [[54]]. The authors suspected that “catastrophic forgetting” was the culprit, i.e., fine-tuning the monolingual Portuguese BERT on clinical data might have “erased” the knowledge learned during pretraining. This might be due to vast differences in the linguistic characteristics of the pretraining and fine-tuning data. The effect of forgetting might be less noticeable in the multilingual model because of the greater variability of its pretraining data.
In summary, there is an undisputed need for better biomedical and clinical representations in many languages. Training and releasing domain-specific LMs for LoE is a great way to fulfill this need. However, many factors need to be taken into account, including the size, compatibility, and homogeneity of training and fine-tuning data, linguistic transferability of involved languages, optimization of hyperparameters, as well as hardware availability. Combining large generic data with minimal in-domain data when training a model from scratch might not be effective. Yet continual pretraining on a small-sized domain-specific corpus can be the right solution, depending on the task. Finally, the size of the vocabulary and the number of parameters impact the carbon footprint of a model's training, which is becoming a commonly reported issue for consideration [[12]]. [Figure 1] provides an overview of the decision-making process for training a model in a new language, informed by insights from the reviewed papers.
Fig. 1 Tree diagram describing the choice of the optimal type of LM according to the nature of the data.
3.3 Named Entity Recognition, Normalization, and Linking for LoE
Named Entity Recognition (NER) is a foundational task for efficient information extraction. Unsurprisingly, it is known as “the most studied task in the biomedical and clinical NLP literature” [[48]]. Medical named entities have a more complicated structure than entities in other domains. They represent a wide range of specific information including diseases, symptoms, treatments, drugs, and anatomical concepts. Such a variety of labels complicates NER and calls for domain-specific resources and solutions which are often lacking.
NER poses specific challenges in each language, which calls for a variety of language-specific solutions. For instance, capitalization is an orthographic cue that may help identify named entities in English or French. This does not apply to German, however, where all nouns are capitalized, or to Semitic languages such as Arabic and Hebrew, which do not use capital letters at all. Semitic languages represent vowels as optional diacritics, which increases potential ambiguity. Transliteration from Latin characters into Semitic languages and cross-linguistic correspondences between consonants result in orthographic variations of proper nouns. Bitton et al. [[55]] propose an unsupervised method to build a synthetic dataset linking possible Hebrew transliterations with Unified Medical Language System (UMLS [[39]]) concepts. This dataset is then used to train a model that maps the transliterated Hebrew entities to the corresponding Concept Unique Identifier (CUI) code. Further, Arabic clinical documents typically contain terms written in the Latin alphabet, such as locations, first names, or proper nouns. These terms can be easily removed (if the task allows it) so that the documents can be processed with a purely Arabic model [[36]].
Non-segmented languages, such as Japanese, Chinese, or Tibetan, pose a real challenge to proper tokenization and NER. Leveraging multi-level representations can alleviate the loss of semantics caused by erroneous segmentation in non-segmented languages [[37], [56]]. Wang et al. [[57]] outperformed the current Chinese NER state-of-the-art by navigating character- and word-level trees via different-length paths in order to combine character representations with word and position embeddings.
Despite the constant need for training data, the usual data augmentation methods such as paraphrasing, noising, and translation do not always work for NER. These methods can negatively affect the original semantics, sample diversity, and domain specificity. A novel medical data augmentation (MDA) method based on medical knowledge graphs was evaluated on Chinese NER and relational classification with good results [[58]].
Transfer learning and pretrained language models, such as mBERT and BERT, have shown considerable success in NER tasks for LoE, often outperforming traditional methods like BiLSTM-CRF and CRF. For example, the mBERT model significantly outperformed the character-level BiLSTM-CRF method for Korean clinical NER [[59]]. In an extremely low-resource setting, the Japanese medical UTH-BERT [[60]] provided benefits only for radiology reports, owing to their linguistic similarity to the training data; for other cases, the general BERT model produced more favorable results [[29]]. These models have also demonstrated their effectiveness in languages such as Portuguese [[61]] and Romanian [[62]]. Kepler et al. conducted a comprehensive evaluation of state-of-the-art NER methods on Serbian clinical narratives and found that combining existing models in a majority voting ensemble produced the best F1 score of 89.2%, showcasing the potential of hybrid approaches [[63]].
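For readers unfamiliar with how such PLM-based NER systems are typically set up, the sketch below configures a multilingual encoder as a token classifier. The checkpoint, tag set, and example sentence are illustrative placeholders rather than those used in the cited studies, and the pipeline output is only meaningful after the omitted fine-tuning step has been run.

```python
# Hedged sketch of PLM-based clinical NER as token classification; the tag set
# and checkpoint are illustrative placeholders.
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

labels = ["O", "B-DISEASE", "I-DISEASE", "B-DRUG", "I-DRUG"]   # hypothetical tag set
checkpoint = "bert-base-multilingual-cased"                    # mBERT as the starting point
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# ... fine-tune `model` with Trainer on a BIO-annotated clinical corpus (omitted) ...

ner = pipeline("token-classification", model=model, tokenizer=tokenizer,
               aggregation_strategy="simple")
print(ner("Le patient présente une hypertension traitée par ramipril."))
```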
German clinical NER research has flourished in recent years, owing to the release of several annotated corpora, the development of PLMs, and the availability of high-quality English-German MT systems. Novel pipelines [[21], [64]] machine-translate open-access, high-quality, high-resource (English) datasets, project the annotations with the assistance of neural word alignment tools, and train a model on the result, thus bypassing potential issues related to privacy, security, and bureaucracy. The model is then evaluated on either an existing gold standard corpus or a custom-made out-of-distribution dataset. Alternatively, training data can be entirely synthesized by a generative model using a few-shot method.
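The projection step in such pipelines essentially transfers BIO labels across word-alignment links. The sketch below illustrates the idea on a toy English-German pair; the alignment pairs are hand-written here but would in practice come from a neural aligner, and the heuristic is a simplification rather than the exact procedure of the cited works.

```python
# Hedged sketch of cross-lingual annotation projection: BIO labels on the
# English source tokens are carried over to the machine-translated German
# tokens via word-alignment pairs (toy alignments; in practice produced by a
# neural word aligner).
def project_bio_labels(src_labels, alignments, tgt_len):
    """alignments: iterable of (src_idx, tgt_idx) pairs; returns target BIO labels."""
    tgt_labels = ["O"] * tgt_len
    for src_idx, tgt_idx in sorted(alignments, key=lambda pair: pair[1]):
        label = src_labels[src_idx]
        if label == "O":
            continue
        entity_type = label.split("-", 1)[1]
        # Continue the span (I-) if the previous target token carries the same type.
        previous = tgt_labels[tgt_idx - 1] if tgt_idx > 0 else "O"
        prefix = "I-" if previous.endswith(entity_type) else "B-"
        tgt_labels[tgt_idx] = prefix + entity_type
    return tgt_labels

src_tokens = ["The", "patient", "receives", "ibuprofen"]
src_labels = ["O", "O", "O", "B-DRUG"]
tgt_tokens = ["Der", "Patient", "erhält", "Ibuprofen"]
alignments = {(0, 0), (1, 1), (2, 2), (3, 3)}
print(project_bio_labels(src_labels, alignments, len(tgt_tokens)))
# -> ['O', 'O', 'O', 'B-DRUG']
```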
The NER task is often combined with entity linking, also known as Biomedical Entity Linking (BEL) or grounding, where a medical term is disambiguated by linking it to a unique concept identifier in a knowledge base or terminology such as the UMLS. Entity linking generally depends on the availability of terminological resources in the target language, which creates problems for LoE. For example, French is one of the most represented LoE in the UMLS metathesaurus, yet only 4% of all the available concepts can be associated with at least one label term in French [[65]]. One of the most straightforward ways to overcome this challenge has been the use of human or machine translation: entities in LoE are simply translated into English and then linked to a concept identifier [[66]]; however, translating medical jargon poses its own difficulties.
Overall, the resources for BEL in LoE are extremely scarce. According to a recent review on BEL [[67]], there are only three BEL corpora available in LoE: one in French [[68]] and two in Spanish [[69], [70]]. Such scarcity of data has a profound negative effect on BEL in LoE. Nevertheless, deep learning techniques and multilingual representations offer a promising alternative to annotation-dependent NER and BEL approaches in LoE. Leveraging the data-driven knowledge of pretrained language models and available multilingual terminologies allows for good results without labeled data or translation [[65]]. The cross-lingual extension of SapBERT (the model that currently achieves state-of-the-art results in BEL) pools all the synonyms in the UMLS that share the same CUI to obtain similar representations, regardless of the language [[71]].
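In practice, this kind of representation-based linking amounts to a nearest-neighbor search between mention embeddings and UMLS synonym embeddings. The sketch below shows the idea with the cross-lingual SapBERT checkpoint released by its authors on the Hugging Face hub (verify the exact hub name or substitute any compatible encoder); the two-entry synonym dictionary and the German mention are illustrative only.

```python
# Hedged sketch of SapBERT-style BEL: mentions and UMLS synonyms are embedded
# with the same encoder, and a mention is linked to the CUI of its nearest
# synonym. Checkpoint name and the tiny synonym dictionary are illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

checkpoint = "cambridgeltl/SapBERT-UMLS-2020AB-all-lang-from-XLMR"  # cross-lingual SapBERT (verify name)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint).eval()

def embed(terms):
    batch = tokenizer(terms, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        cls = model(**batch).last_hidden_state[:, 0]      # [CLS] pooling, as in SapBERT
    return torch.nn.functional.normalize(cls, dim=-1)

umls = {"C0020538": "hypertensive disease", "C0011849": "diabetes mellitus"}
cuis, synonyms = list(umls.keys()), list(umls.values())
synonym_embeddings = embed(synonyms)

mention = "Hypertonie"                                    # German mention to be linked
scores = embed([mention]) @ synonym_embeddings.T
best = scores.argmax().item()
print(f"{mention} -> {cuis[best]} ({synonyms[best]})")
```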
3.4 Recent Advances in Negation Resolution for LoE
Negation is a complex linguistic phenomenon that changes the meaning of an utterance. It plays a crucial role in biomedical text mining since its identification can greatly affect the quality and veracity of retrieved information. Negation exhibits great diversity in its syntactic and morphological representation across languages, which poses additional challenges to methods such as transfer learning [[72]]. The lack of annotated data hampers research on negation even more than its linguistic particularities. Before 2020, there were nine LoE resources out of a total of 19 publicly available negation-annotated corpora [[73], [74]]. However, only two LoE were covered in the medical domain: Spanish, with the IULA Spanish Clinical Record Corpus [[75]], and French, with two corpora of clinical texts [[27], [76]]. Four additional corpora have been released since then: a cancer dataset [[77]] and a clinical corpus annotated for negation and uncertainty [[78]] in Spanish, a multipurpose clinical corpus in Brazilian Portuguese [[79]], and a German corpus of discharge summaries [[26]] ([Table 1]).
Despite the advancement of transfer learning for negation detection [[76], [79], [80]], rule-based [[27]] and supervised machine learning approaches [[76], [77], [81], [82], [83]] for LoE continue to be researched and employed. One paper presented a corpus-free approach, which is an attractive prospect when no annotated data are available [[84]]; however, this work relied on gazetteers and the similarity of anamneses. A particularly interesting and challenging aspect of negation detection is the resolution of its scope, i.e., the identification of all negated tokens in a sentence. We found six publications dedicated to biomedical negation scope resolution in LoE. Two papers explore zero-shot transfer-learning methods across languages and domains for French and Spanish [[80], [85]], with the best F1 score of 90.8% for Spanish. Supervised BiLSTM-based methods achieve 84.8% in Brazilian Portuguese [[83]] and 90.0% in Spanish [[78]]. Overall, despite some reports on the inapplicability of negation detection across domains [[74]], deep learning methods have proven feasible for transferring negation scope knowledge across corpora and languages [[77], [80], [85], [86]].
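To make the distinction between negation detection and scope resolution concrete, the toy rule-based sketch below marks every token within a fixed window after a negation trigger as negated, in the spirit of NegEx-style systems; the trigger list and window size are toy values, not taken from the cited work. Its over-extended scope on the example sentence illustrates why dedicated scope resolution methods are needed.

```python
# Toy NegEx-style rule-based negation detection: any token within a fixed
# window after a negation trigger is treated as negated. Triggers and window
# size are illustrative, not those of the cited systems.
NEGATION_TRIGGERS = {"no", "not", "denies", "without"}
WINDOW = 5   # number of tokens after a trigger considered in scope

def negated_tokens(tokens):
    scope_end = -1
    flagged = []
    for i, token in enumerate(tokens):
        if token.lower() in NEGATION_TRIGGERS:
            scope_end = i + WINDOW
        elif i <= scope_end:
            flagged.append((i, token))
    return flagged

tokens = "Patient denies chest pain but reports dyspnea".split()
print(negated_tokens(tokens))
# The naive window wrongly extends past "but", flagging "reports" and "dyspnea"
# as negated; proper scope resolution would stop at the clause boundary.
```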
3.5 Shared Tasks
A major problem in the field of medical NLP is that contributions and results are often based on private, non-shareable, and ethically sensitive clinical data. This leads to a lack of common public benchmarks for evaluating and comparing new work, and the problem is even more acute for data in LoE. To overcome data availability issues, shared tasks provide non-sensitive data extracted from public sources such as social media [[87]] or the medical literature [[27], [88]]. In other cases, datasets might contain synthetic examples [[70]]. Furthermore, these datasets remain available to the public beyond the challenge time frame, making them solid medical NLP benchmarks [[89]]. Moreover, such annotated datasets save time and expense for other research groups, because the creation of an annotated gold standard requires months of work by trained domain experts [[90], [91]].

Our query revealed a number of medical NER challenges for the Spanish language ([Table 3]), including CLEF eHealth (2020-21) [[92], [93], [94], [95], [96], [97], [98], [99], [100], [101]], IberLEF (2020-22) [[17], [70], [102], [103], [104], [105], [106], [107], [108], [109], [110], [111], [112], [113], [114], [115]], and CLEF BioASQ (2022) [[116], [117], [118], [119]]. Furthermore, IberLEF 2022 [[17]] and BioASQ 2022 [[116]] released their datasets translated into seven other languages, encouraging future contributions to multilingual medical NLP. For the French language, the CAS corpus [[27]] was used in DEFT [[53], [120], [121], [122], [123], [124]], an annual French-language text-mining challenge. The 2020 edition of DEFT involved the automatic annotation of 13 different medical entity types, while the 2021 edition proposed to identify the patient's clinical profile through multilabel classification of diseases using the Medical Subject Headings (MeSH) thesaurus. Similarly to DEFT 2020, the NTCIR 16 challenge [[29], [125]] consisted in identifying a wide variety of medical entity types in Japanese.
Table 3 Shared tasks in LoE related to clinical NLP from challenges in 2020-2022. Task: NER = Named Entity Recognition, EL = Entity Linking, MLC = Multi-Label Classification. Models: X = multilingual LM, I = monolingual LM pretrained on LoE, E = English LM.
Shared tasks provide excellent opportunities to test recently published PLMs and to build new ones. For example, the Spanish Medical RoBERTa model, released in 2021 [[48]], was tested and validated in the IberLEF 2022 [[17]], BioASQ 2022 [[116]], and SocialDisNER 2022 [[87], [126], [127], [128], [129], [130]] challenges. UTH-BERT, a model pretrained on Japanese clinical text and published in 2021 [[60]], was among the top models for the Japanese NER task of NTCIR 16. Alternatively, some LMs were built specifically for a shared task:
- IberLEF 2020 [[115]]: XLM-RoBERTa [[131]] + Galén dataset (private real-world de-identified clinical cases in Spanish), mBERT + Galén, BETO [[132]] + Galén [[133]];
- DEFT 2020 [[120]]: CamemBERT [[134]] + French abstracts extracted from PubMed [[121]];
- DEFT 2021 [[123]]: FlauBERT [[135]] + a corpus of medical emergency cases obtained through house calls (SOS Médecins) [[124]].
Multilingual models such as mBERT and XLM-RoBERTa are frequently used in non-English language challenges. Due to domain constraints, English models might perform better than the available language-specific models, as in the NER task of NTCIR 16. In shared tasks, we also observed a tendency to create a monolingual medical model when none is available, often through continual pretraining of existing models.
3.6 The Special Case of Spanish
Our survey uncovered a large increase in biomedical NLP resources for the Spanish language. Spanish is one of the most spoken languages in the world, yet it lacked linguistic data and trained models until very recently. BETO [[132]], the first publicly available BERT-based general-domain monolingual Spanish model, was released in 2020. Several additional general-domain Spanish LMs have emerged since then, e.g., IXAmBERT [[136]] (which also covers Basque and English) and MarIA, an entire family of Spanish LMs based on RoBERTa and GPT-2 [[137]].
López-García et al. obtained clinical language models in Spanish through the continual pretraining of the mBERT, BETO, and XLM-RoBERTa models [[138]]. Carrino et al. released several Spanish models based on RoBERTa, evaluating mixed-domain pretraining [[48]] and training from scratch [[46]].
Research on Spanish now dominates biomedical NLP publications concerning LoE. Indeed, based on the queries detailed in the Methods section, 59 out of 97 articles related to NER in LoE involve the Spanish language. In the case of negation papers, 9 out of 25 were about Spanish. Furthermore, the queried articles were related to 12 different shared tasks, with 9 of them using Spanish data (see [Table 3]).
The growth in quantity and quality of Spanish language resources could be attributed to investments made by the Spanish government through initiatives such as the Digital Agenda for Spain and the Spanish Strategy for Science, Technology, and Innovation. Among their various objectives, these initiatives specifically support information and communication technology, striving to establish Spain as a global leader in language technology and innovation[3]. In particular, the Plan de Impulso de las Tecnologías del Lenguaje (Plan TL)[4] included, especially from 2018 onwards, a dedicated flagship project focused on Health and Biomedicine. The promoted shared tasks were directly related to high-impact biomedical use cases and healthcare data exploitation scenarios; more than 15 healthcare-related shared tasks with data in Spanish were organized over the past five years. The shared tasks promoted by the Plan TL stimulated participation and research in biomedical NLP well beyond research groups located in Spanish-speaking countries, attracting diverse teams from around the world.
4 Discussion and Conclusion
In recent years, novel NLP technologies have revolutionized various research areas dealing with text analysis, including biomedical and clinical text mining. This survey clearly reflects that trend, demonstrating that significant changes have taken place between 2020 and 2022. The period started with the dominance of transformer-based models, beginning with BERT; currently, we are witnessing the introduction and deployment of very large models, such as GPT-3 [[139]], that achieve feats previously thought to be out of reach. These powerful models provide new opportunities while accentuating the resource gap between different research labs. Indeed, the release of the ChatGPT[5] conversational agent in November 2022 sent shockwaves through many industries, including the medical field [[140]]. Given its recent emergence, we leave the analysis of this extremely dynamic and highly impactful development to a future survey.
Medical NLP has adopted new paradigms for processing data outside of the English language. The training of language- and domain-specific models is on the rise. Most often, biomedical models for LoE are further pretrained or fine-tuned from the weights of their language-specific general-domain counterparts. However, several factors must be considered when training a given model, such as the size and domain composition of the training data as well as its linguistic compatibility with the target texts (see [Figure 1]). Finally, while monolingual models often outperform their multilingual counterparts, the latter should always be considered when dealing with low-resource languages.
Another interesting trend of the past few years is the increased use of generated texts in the creation of corpora. The advanced quality of NMT allows for the creation of large, multilingual parallel silver standard corpora, such as LivingNER. For high-resource languages like German, it is now possible to bypass many constraints associated with data sensitivity and sparsity of annotated resources. High-quality open datasets can be automatically translated, and annotations projected using neural aligners. Furthermore, the GPT family of models offers an opportunity to create training data with just a few well-engineered prompts. Nevertheless, validation with gold-standard data remains essential.
Negation detection and NER are foundational tasks in biomedical NLP that require language-specific awareness and customized methods. For instance, lexical, morphological, and semantic characteristics of many non-Indo-European languages may be too specific to follow an NLP pipeline designed for English. Nevertheless, both tasks are often solved using PLMs and transfer-learning. Entity linking and negation scope resolution still suffer from the lack of annotated data in many LoE. It would be reasonable to investigate the power of large multilingual generative models to start closing that gap.
Overall, there is a noticeable acceleration in biomedical NLP research for LoE. We determined that significantly more papers were dedicated to this area of research in the last two years than over the previous decade [[4]], presenting new labeled corpora, trained models, and shared tasks. All these new resources provide an opportunity for the creation of new comprehensive benchmarks in LoE. The specific advances seen for the Spanish language, likely the result of targeted research funding and support, are a particularly good example of how the current dynamic research environment can lead to considerable advances in a limited period of time.
NLP techniques have the potential to render medical documents much more valuable for primary and secondary research. They can be used to make specific clinical documentation more easily accessible to medical practitioners (primary usage), and, once de-identified and aggregated, the documents can be used to derive previously unseen correlations among medical events, such as between specific treatments and outcomes (secondary usage). The fact that NLP tools for the medical domain are increasingly being developed in LoE goes a long way toward enabling their use across different countries and language communities. The widespread adoption of such technologies in turn has the potential to enable large-scale evidence-based medical research, which will have positive outcomes for public health.