Keywords
natural language processing - semantic textual similarity - clinical semantic textual similarity - bidirectional encoder representations from transformers
Introduction
Semantic textual similarity (STS) aims to compute the degree of semantic equivalence between texts based on their semantic content and meaning. It is common in many general English domain tasks such as text summarization, question answering, machine translation, information retrieval, dialog systems, plagiarism detection, and query ranking.[1] Although STS is similar to plagiarism detection, there are two major differences. First, plagiarism detection determines whether texts are similar, whereas STS measures the degree of similarity. Second, plagiarism detection compares texts against documents from the internet, whereas STS compares texts chosen according to the research interest. STS is also related to paraphrase detection and textual entailment. The difference is that STS aims to capture a graded level of semantic equivalence, whereas paraphrase detection and textual entailment are binary yes/no decisions.[1][2]
[Fig. 1] shows an example of semantic textual similarity.
Fig. 1 Semantic textual similarity example. Given a sentence pair, a model computes a semantic similarity score on a scale from 0 (low semantic similarity) to 5 (high semantic similarity).
Due to its application across diverse tasks, many approaches to computing semantic similarity have been proposed. Existing approaches include corpus-based and knowledge-based models,[3] machine learning-based models,[4][5][6][7] neural network-based models,[8][9][10][11][12][13] and BERT-based models.[11][14] Corpus-based methods measure the degree of similarity between texts using information extracted exclusively from a large corpus. Knowledge-based methods measure semantic similarity based on information extracted from semantic networks or structured resources such as dictionaries, encyclopedias, thesauruses, Wikipedia, or WordNet.
Chen et al[5] achieved the best performance in the 2018 clinical STS shared task.[15] Their proposed model combined traditional machine learning and deep learning. They trained a model with 63 features, which included string-based, entity-based, number-based, and deep learning-based similarity features. Moreover, Zhao et al[7] used latent semantic analysis to learn vector-space representations, together with handcrafted features. Although traditional natural language processing (NLP) approaches such as designing handcrafted features achieve good performance, they suffer from sparsity due to the lack of large annotated datasets and from language ambiguity.[10]
Mueller and Thyagarajan[9] proposed a Siamese long short-term memory (LSTM) network for labeled data consisting of sentence pairs of variable length. Their approach relies on pretrained word embeddings[16] and synonym augmentation. Further, Tai et al[13] proposed Tree-LSTMs, which use syntactic trees to construct sentence representations. The standard LSTM model determines its hidden state from the current time-step input and the previous time-step's hidden state. In contrast, the Tree-LSTM model determines its hidden state from an input vector and the hidden states of all child units. The basic idea is that, by reflecting the syntactic properties of a sentence, the tree network can propagate information more efficiently than the standard sequential architecture.
Recently, bidirectional encoder representations from transformers (BERT)[14] has achieved state-of-the-art performance on more than 10 NLP tasks. It is a popular approach for transfer learning and has been proven effective in achieving good accuracy on small datasets.[14][17] It can be used for tasks whose input is a sentence pair, such as sentence-pair regression, question answering, and natural language inference. It learns distinctive embeddings for the sentences, which helps the model differentiate between them.
SemEval (semantic evaluation) shared tasks have been held since 2012 to encourage the development of automated methods for STS tasks.[1][2][18][19][20][21] English STS has been widely studied, with proposed state-of-the-art systems achieving high correlation (Pearson correlation score >80%) with human judgment.[2] However, these previous tasks focus on the general English domain. Very few resources exist for STS tasks in the clinical domain because access to clinical data is restricted to protect patient privacy and confidentiality.[15][22] Wang et al[15][22] created an English clinical STS dataset from actual notes at the Mayo Clinic and organized shared tasks in 2018 and 2019. They removed all protected health information from the dataset, which can be accessed by signing a Data Use Agreement.
In this study we created two datasets for Japanese clinical STS: (1) Japanese case reports (CR dataset) and (2) Japanese electronic medical records (EMR dataset). As previously mentioned, resources are scarce in the clinical domain because data privacy prohibits public sharing of medical data. To overcome this challenge, we created one dataset from a public resource and made it publicly available.[a] Specifically, the CR dataset was created by extracting case reports from CiNii,[b] a Japanese database containing research publications in Japanese and English. Although research publications differ from real clinical texts, they have been widely used in clinical NLP research to fill the gap left by the lack of publicly available real clinical texts. We also created a second dataset, the EMR dataset, from real clinical documents; however, this dataset is not publicly available.
Moreover, we used a BERT-based approach to capture the semantic similarity between texts. Recently, many pretrained models, both general domain and domain-specific, have been developed. We investigate the performance of general domain and clinical domain pretrained Japanese BERT models on clinical domain datasets. Therefore, our contributions include:
- Creating a publicly available dataset for Japanese sentence-level clinical STS from a public resource (CiNii), due to privacy issues associated with hospital clinical data.
- Comparing the performance of the general and clinical Japanese BERT models.
Methods
Materials
This study used two Japanese datasets: case reports (CR dataset) and EMR documents (EMR dataset). We created the CR dataset from case reports, which are publicly available.[c] By using the CR dataset, model performance can be measured on a publicly shareable dataset. In contrast, the EMR dataset was generated from medical documents and is not publicly available. Both datasets consist of sentence pairs annotated on a scale from 0 (low semantic similarity) to 5 (high semantic similarity), where 0 means the two sentences are completely dissimilar, i.e., their meanings do not overlap, and 5 means the two sentences are completely equivalent in meaning.
Japanese Case Reports (CR Dataset)
We created a publicly available dataset[c] to motivate research on Japanese clinical STS. Few resources exist for clinical STS tasks due to the data privacy and confidentiality issues that prohibit public sharing of medical data. To overcome this challenge and create a publicly available dataset, we extracted Japanese case reports from CiNii,[d] a Japanese database containing articles published in Japanese journals and conference proceedings. The case reports were extracted from CiNii in PDF format (1,747 documents). The PDF documents were then converted to text using optical character recognition (OCR) and split into sentences. Sentences that would generally not be found in real clinical documents, such as references and author affiliations, were removed.
After extracting all sentences, we created a dataset by considering all possible combinations of sentence pairs. This resulted in a huge number of sentence pairs. Choosing sentence pairs randomly would likely have resulted in a dataset with highly imbalanced semantic similarity scores. Therefore, we adopted the approach used in previous tasks (SemEval[1][2][18][19][20][21] and MedSTS[22]). These previous studies use string similarity to select sentence pairs for annotation. Although string similarity measures cannot entirely capture semantic similarity, they capture some level of surface/syntactic similarity and hence significantly reduce the human effort required to select sentence pairs for annotation.
In this study, we used the Python simstring library[e] to compute cosine similarity between sentence pairs. Cosine similarity returns a score between 0 and 1. About 4,000 sentence pairs spanning all scores (0 to 1) were then selected for annotation by staff with a medical background. The annotator assigned each sentence pair a similarity score from 0 (low semantic similarity) to 5 (high semantic similarity) depending on the semantic similarity. A second annotator annotated 10% of the data, and the annotators had a weighted Cohen's kappa agreement of 0.67, which can be regarded as acceptable for NLP tasks.[23] We used the same annotation guidelines as previous STS tasks,[1][2][15][18][19][20][21][22] as shown in [Supplementary Appendix A], available in the online version only. [Fig. 2] shows the distribution of semantic similarity scores in the CR dataset.
Fig. 2 Distribution of semantic similarity scores in the Japanese case report dataset.
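As a rough illustration of this pre-selection step, the sketch below computes character n-gram cosine similarity between candidate sentence pairs and samples pairs across the whole 0 to 1 range. It is a minimal sketch only: the n-gram size, number of bins, and per-bin sample size are illustrative assumptions, not the settings of the simstring-based pipeline actually used.

```python
from itertools import combinations

def char_ngrams(text: str, n: int = 3) -> set:
    """Character n-grams of a sentence (n = 3 is an assumed setting)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def cosine_similarity(a: str, b: str, n: int = 3) -> float:
    """Cosine similarity between binary character n-gram profiles, in [0, 1]."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    if not x or not y:
        return 0.0
    return len(x & y) / ((len(x) * len(y)) ** 0.5)

def select_candidate_pairs(sentences, per_bin=400, bins=10):
    """Bucket all sentence pairs by similarity so the annotation sample
    covers the whole 0-1 range (bin count and sizes are illustrative)."""
    buckets = [[] for _ in range(bins)]
    for s1, s2 in combinations(sentences, 2):
        score = cosine_similarity(s1, s2)
        idx = min(int(score * bins), bins - 1)
        if len(buckets[idx]) < per_bin:
            buckets[idx].append((s1, s2, score))
    return [pair for bucket in buckets for pair in bucket]
```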
Japanese Electronic Medical Documents (EMR Dataset)
This dataset was created from actual Japanese medical documents consisting of radiography reports and electronic health record (EHR) notes. The EHR notes consist of progress notes. The radiography reports were provided by the National Cancer Center Japan, and the EHR notes were provided by Osaka University Hospital.[24] We filtered the medical documents for patients with more than one entry and created document pairs in chronological order as [d_t, d_{t+1}]. We asked the annotator to read the sentences in d_t and determine their semantic similarity with the sentences in d_{t+1}. This dataset consists of approximately 2,000 sentence pairs annotated with semantic similarity scores from 0 to 5, in the same way as the CR dataset. [Fig. 3] shows the distribution of the semantic similarity scores in the EMR dataset.
Fig. 3 Distribution of semantic similarity scores in the EMR dataset. EMR, electronic medical record.
Model
We adopted BERT[14] since it has been proven effective in achieving good accuracy on small datasets[14][17][25] like ours. Data scarcity is one of the biggest challenges in NLP, as most NLP tasks require large amounts of training data to achieve reasonable accuracy. Dataset creation and annotation are expensive in terms of time and labor, and data are often unavailable, especially in the clinical domain. This challenge can be addressed by pretraining general domain language models on huge amounts of unlabeled data.[1] These pretrained models can then be fine-tuned for specific tasks. Training the original BERT model from scratch takes a long time, whereas fine-tuning a pretrained model reduces both training time and memory usage.
Pretrained BERT models, both general domain and domain-specific, have been developed. General domain models are pretrained on cross-domain texts and therefore lack domain-related knowledge. In addition, the linguistic characteristics of general domain and clinical domain texts differ, hence the need for domain-specific BERT models.[26] In this study we investigate the performance of a general Japanese BERT[f] and a clinical Japanese BERT[27] model. The general Japanese BERT is pretrained on Japanese Wikipedia texts, while the clinical Japanese BERT is pretrained on Japanese clinical texts (mainly notes by physicians and nurses) at the University of Tokyo Hospital.[27]
The most common way to use BERT is a feature-based approach in which fine-tuning is not required; instead, the BERT output vectors are used like word embeddings. The output of the BERT CLS token can also be used as a feature vector. The CLS (classification) token is a special BERT token added at the start of a sequence that represents the entire sequence.[14] Reimers and Gurevych[11] suggested that averaging the BERT output or using the CLS token does not achieve good performance. They investigated different pooling methods for the BERT output, such as mean and maximum pooling. However, the best strategy for extracting feature vectors is still an open problem.
[Fig. 4] shows an overview of our model. The input consists of the token sequences of the two sentences concatenated by a special token, [SEP]. The input sequence also has a [SEP] token at the end to mark the end of the input. The first token of the input sequence is the BERT special classification token, [CLS]. BERT encodes the sentence pair and passes the final hidden state of the [CLS] token as a representation of the input sequence. The output of the [CLS] token is fed to a fully connected linear output layer to calculate the semantic similarity score. The CR and EMR datasets are annotated on a discrete scale from 0 to 5 (i.e., 0, 1, 2, 3, 4, 5). We therefore approached the task as a classification problem and used a linear classifier with cross-entropy loss.[28]
Fig. 4 Overview of our model.
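A minimal sketch of this setup with the Hugging Face transformers library is shown below. The checkpoint name, maximum sequence length, and other hyperparameters are assumptions for illustration; the paper only specifies a Japanese BERT encoder whose [CLS] representation feeds a linear classification layer trained with cross-entropy loss.

```python
import torch
from transformers import BertJapaneseTokenizer, BertForSequenceClassification

# Checkpoint name is an assumption; the paper states only that a general
# Japanese BERT pretrained on Japanese Wikipedia was used.
MODEL_NAME = "cl-tohoku/bert-base-japanese"

tokenizer = BertJapaneseTokenizer.from_pretrained(MODEL_NAME)
model = BertForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=6  # similarity scores 0-5 treated as classes
)

def encode_pair(sentence1: str, sentence2: str):
    # Builds "[CLS] sentence1 [SEP] sentence2 [SEP]" as described for Fig. 4.
    return tokenizer(sentence1, sentence2, truncation=True,
                     max_length=128, return_tensors="pt")

def similarity_score(sentence1: str, sentence2: str) -> int:
    """Predict a semantic similarity score (0-5) for a sentence pair."""
    with torch.no_grad():
        logits = model(**encode_pair(sentence1, sentence2)).logits
    return int(logits.argmax(dim=-1))

# Fine-tuning uses the standard cross-entropy loss returned by the model:
# loss = model(**encode_pair(s1, s2), labels=torch.tensor([gold_score])).loss
```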
Results
Experimental Settings
The CR and EMR datasets were each split into a 70% training set and a 30% test set. We also prepared an additional training set, n2c2, by translating the n2c2/OHNLP English dataset into Japanese using Googletrans, a Python library that communicates with the Google Translate API[g]. This n2c2/OHNLP dataset was provided in the 2019 n2c2/OHNLP Clinical Semantic Textual Similarity shared task and is discussed in Wang et al.[15][22] In our experiments we train on different combinations of these datasets to see how data with different language variability affect model performance.
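As an illustration of the translation step, the sketch below uses the googletrans wrapper around the Google Translate API. The exact batching and post-processing applied to the n2c2/OHNLP sentences are not described in the text, so this is only a plausible outline.

```python
from googletrans import Translator  # unofficial Google Translate API wrapper

translator = Translator()

def translate_to_japanese(sentences):
    """Translate English n2c2/OHNLP sentences into Japanese (illustrative sketch)."""
    return [translator.translate(s, src="en", dest="ja").text for s in sentences]

# Each translated sentence pair keeps the similarity label of the English original.
```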
We consider two experimental settings, which we refer to as strict and relaxed. In the strict setting, we use the six-point scale of semantic similarity scores (i.e., 0, 1, 2, 3, 4, 5) as defined in the data annotation guidelines. In the relaxed setting, we use a four-point scale in which scores 1 and 2 and scores 3 and 4 are combined, i.e., (0, [1, 2], [3, 4], 5). In the annotation guidelines (refer to [Supplementary Appendix A], available in the online version only), the annotators stated that it was sometimes difficult to choose between semantic similarity scores 3 and 4, because in some cases it is difficult to decide what constitutes "important" and "unimportant" information. The same problem was experienced for semantic similarity scores 1 and 2. Therefore, we consider the relaxed setting for uniformity. We also expect that this setting can improve classification performance.
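The strict-to-relaxed conversion is simply a relabeling of the gold and predicted scores; a minimal sketch:

```python
# Map the six strict scores (0-5) onto the four relaxed classes
# 0, [1, 2], [3, 4], 5 used in the relaxed evaluation setting.
STRICT_TO_RELAXED = {0: 0, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3}

def to_relaxed(strict_score: int) -> int:
    return STRICT_TO_RELAXED[strict_score]
```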
Performance of General and Clinical Japanese BERT Models
We evaluated performance using two metrics: the Pearson correlation coefficient (as in the previous STS shared tasks[1][2][18][19][20][21][22]) and classification accuracy between the predicted scores and the gold scores. [Tables 1] and [2] show the results for the CR and EMR test sets, respectively. Both models, the general Japanese BERT and the clinical Japanese BERT, achieved good performance. In the CR results, the general Japanese BERT achieved a Pearson correlation of 0.904 and 72% accuracy, whereas the clinical Japanese BERT's best Pearson correlation and accuracy were 0.890 and 69%, respectively, in the strict setting. In the relaxed setting, the general Japanese BERT's highest Pearson score and accuracy were 0.882 and 79%, respectively, while the clinical Japanese BERT achieved a Pearson score of 0.862 and 75% accuracy.
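A minimal sketch of how these two metrics can be computed, assuming SciPy and scikit-learn (the paper does not name the evaluation tooling):

```python
from scipy.stats import pearsonr
from sklearn.metrics import accuracy_score

def evaluate(gold_scores, predicted_scores):
    """Pearson correlation and classification accuracy between gold and predicted scores."""
    pearson, _ = pearsonr(gold_scores, predicted_scores)
    accuracy = accuracy_score(gold_scores, predicted_scores)
    return pearson, accuracy
```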
Table 1
CR results based on Pearson correlation and classification accuracy
| Model | Training data | Pearson (strict) | Accuracy (strict) | Pearson (relaxed) | Accuracy (relaxed) |
| --- | --- | --- | --- | --- | --- |
| General Japanese BERT | CR | 0.904 | 72% | 0.878 | 79% |
| General Japanese BERT | EMR | 0.730 | 33% | 0.749 | 53% |
| General Japanese BERT | n2c2 | 0.716 | 35% | 0.705 | 55% |
| General Japanese BERT | CR + EMR | 0.897 | 71% | 0.882 | 78% |
| General Japanese BERT | CR + EMR + n2c2 | 0.895 | 71% | 0.879 | 78% |
| Clinical Japanese BERT | CR | 0.890 | 67% | 0.854 | 75% |
| Clinical Japanese BERT | EMR | 0.745 | 29% | 0.696 | 47% |
| Clinical Japanese BERT | n2c2 | 0.656 | 25% | 0.613 | 39% |
| Clinical Japanese BERT | CR + EMR | 0.885 | 68% | 0.862 | 76% |
| Clinical Japanese BERT | CR + EMR + n2c2 | 0.870 | 69% | 0.855 | 75% |
Abbreviations: BERT, bidirectional encoder representations from transformers; CR, case report; EMR, electronic medical record.
Table 2
EMR results based on Pearson correlation and classification accuracy
| Model | Training data | Pearson (strict) | Accuracy (strict) | Pearson (relaxed) | Accuracy (relaxed) |
| --- | --- | --- | --- | --- | --- |
| General Japanese BERT | CR | 0.692 | 53% | 0.692 | 68% |
| General Japanese BERT | EMR | 0.864 | 79% | 0.860 | 84% |
| General Japanese BERT | n2c2 | 0.569 | 33% | 0.558 | 63% |
| General Japanese BERT | CR + EMR | 0.856 | 79% | 0.857 | 85% |
| General Japanese BERT | CR + EMR + n2c2 | 0.875 | 81% | 0.870 | 86% |
| Clinical Japanese BERT | CR | 0.685 | 44% | 0.693 | 62% |
| Clinical Japanese BERT | EMR | 0.845 | 76% | 0.824 | 82% |
| Clinical Japanese BERT | n2c2 | 0.521 | 23% | 0.513 | 52% |
| Clinical Japanese BERT | CR + EMR | 0.862 | 79% | 0.848 | 83% |
| Clinical Japanese BERT | CR + EMR + n2c2 | 0.848 | 78% | 0.833 | 82% |
Abbreviations: BERT, bidirectional encoder representations from transformers; CR, case report; EMR, electronic medical record.
In the EMR results, the general Japanese BERT's best Pearson score and accuracy were 0.875 and 81%, respectively, while the clinical Japanese BERT achieved a Pearson score of 0.862 and 79% accuracy in the strict setting. In the relaxed setting, the general Japanese BERT's Pearson score and accuracy were 0.870 and 86%, respectively, whereas the clinical Japanese BERT's were 0.848 and 83%. Although both BERT models performed well, overall the general Japanese BERT model achieved the highest performance on both datasets.
Discussion
Effect of Training Data
In the CR results, training only on the CR dataset achieved the highest performance in the strict setting (Pearson correlation of 0.904 and accuracy of 72%). In the relaxed setting, training on CR + EMR achieved the best Pearson correlation score of 0.882, but training only on CR achieved the highest accuracy of 79%. We expected that training on more data would improve performance, but training only on CR performed best. Note that training only on the n2c2 or EMR datasets achieved average performance in terms of Pearson correlation (0.716 and 0.730 for the strict setting; 0.705 and 0.749 for the relaxed setting), but the classification accuracy is relatively low (35 and 33% for the strict setting; 55 and 53% for the relaxed setting). Although the clinical Japanese BERT was trained on clinical texts, it achieved lower performance on the CR test data. This could be because case reports and real hospital text data differ in vocabulary, abbreviations, linguistic patterns, and even sentence length.
In the EMR results, training on a combination of all the datasets achieved the highest performance (Pearson correlation of 0.875 and accuracy of 81% in the strict setting; Pearson correlation of 0.870 and accuracy of 86% in the relaxed setting). The EMR training set was small, so adding more data provided more training examples for our model and hence improved performance. Although both the EMR and n2c2 datasets were created from real hospital documents, training on the n2c2 dataset achieved the lowest performance (Pearson correlation of 0.569 and 33% accuracy in the strict setting; Pearson correlation of 0.558 and 63% accuracy in the relaxed setting). This could be because the n2c2 and EMR datasets were created from different types of clinical notes: our EMR dataset consists of sentences from radiography reports and progress notes, while the n2c2 dataset consists of sentences from other types of clinical notes. Further, the n2c2 dataset was machine-translated from English to Japanese using Google Translate. Our preliminary manual check suggested that the translation quality was sufficient and that most medical terms were translated adequately; nevertheless, performance could possibly be improved by adopting better translation models. Precisely comparing the relation between machine translation quality and STS performance is left for future work.
In the EMR results, we expected the clinical Japanese BERT to achieve the best performance since it is trained on clinical texts, but the general Japanese BERT attained the highest performance. Although this may be surprising, since domain-specific pretraining is generally expected to perform better, the result suggests that semantic textual similarity relies more on fundamental linguistic features. This finding encourages clinical applications based on semantic textual similarity, since widely available general domain BERT models would work well. Moreover, the high performance of the general Japanese BERT could also be due to the fact that it is trained on a wide range of texts and can therefore generalize well.
Error Analysis
[Tables 3] and [4] show error examples for the CR and EMR test sets, respectively. In the CR results, example (a) in [Table 3] illustrates the abbreviation expansion problem. Abbreviation expansion is a major problem in other NLP tasks as well, and a precise method for handling it is needed in future work. Example (b) is a case of language variability: although the sentences are similar in meaning, the choice of words varies greatly. In example (c), the model assigned a higher score to the sentence pair because the sentences are highly similar and have only a minor difference ("Yamada type I or type II" in the first sentence, and "Yamada type III" in the second sentence). Although this kind of difference is important in the clinical domain, in the general English domain this sentence pair could be treated as semantically equal. In example (d), the sentences are roughly equivalent; our model assigned a score of 4, whereas the gold score is 3.
Table 3
Error analysis in the CR dataset
Table 4
Error analysis in the EMR dataset
Abbreviation: EMR, electronic medical record.
In the EMR results, sentence lengths vary greatly, down to very short sentences (one or two words). Example (a) in [Table 4] shows a typical example of the short sentences found in EMR notes. Our model was not able to correctly classify sentence pairs containing very short sentences. In the sentence pairs of examples (b) and (c), our model assigned a lower score because, although the sentences have high semantic similarity, the choice of words is quite different. For example, in (c), "almost no change" and "slightly decreased" are semantically close. It is easy for humans to capture this kind of meaning but difficult for machines. The sentence pairs of examples (d) and (e) show a positive–negative relationship. Our model was not able to capture negation, and in the future it will be necessary to train the model to identify this kind of relationship.
Conclusion
STS tasks have been widely studied, especially in the general English domain. However, only a few resources exist for STS tasks in the clinical domain and in languages other than English, such as Japanese. To bridge this gap, we created a publicly available dataset for Japanese clinical STS. The dataset consists of approximately 4,000 sentence pairs extracted from Japanese case reports and annotated with a semantic similarity score from 0 (low semantic similarity) to 5 (high semantic similarity).
We used a BERT-based approach to capture semantic similarity between clinical domain texts. In our experiments we achieved a high Pearson correlation between the predicted scores and the gold scores (0.904 on the CR dataset; 0.875 on the EMR dataset). We also compared the performance of general and clinical Japanese BERT models. Although both models achieved good performance, the general Japanese BERT achieved higher performance than the clinical Japanese BERT on our clinical domain datasets. Though this may be surprising, because domain-specific pretraining is generally known to perform better, the results suggest that semantic textual similarity relies more on fundamental linguistic features. This finding particularly encourages clinical applications based on semantic textual similarity, since widely available general domain BERT models would work well.