Keywords
natural language processing - semantic textual similarity - clinical semantic textual
similarity - bidirectional encoder representations from transformers
Introduction
Semantic textual similarity (STS) aims to compute the degree of semantic equivalence
between texts based on their semantic content and meaning. It is widely used in many general
English domain tasks such as text summarization, question answering, machine translation,
information retrieval, dialog systems, plagiarism detection, and query ranking.[1] Although STS is similar to plagiarism detection, there are two major differences.
First, plagiarism detection determines whether texts are similar, whereas STS determines the
degree of similarity. Second, plagiarism detection compares texts against documents from the internet,
whereas STS compares texts chosen according to the research interest. STS is also
related to paraphrase detection and textual entailment. The difference is that
STS aims to capture the level of semantic equivalence, whereas paraphrase detection
and textual entailment produce a binary yes/no decision.[1] [2]
[Fig. 1] shows an example of semantic textual similarity.
Fig. 1 Semantic textual similarity example. Given a sentence pair, a model computes a semantic
similarity score on a scale from 0 (low semantic similarity) to 5 (high semantic similarity).
Due to its applications across diverse tasks, many approaches for computing semantic similarity
have been proposed. Existing approaches include corpus-based and knowledge-based models,[3] machine learning-based models,[4] [5] [6] [7] neural network-based models,[8] [9] [10] [11] [12] [13] and BERT-based models.[11] [14]
Corpus-based methods measure the degree of similarity between texts by using information
extracted exclusively from a large corpus. Knowledge-based methods measure semantic
similarity based on information extracted from semantic networks or structured resources
such as dictionaries, encyclopedias, thesauruses, Wikipedia, or WordNet.
Chen et al[5] achieved the best performance in the 2018 clinical STS shared task.[15] Their proposed model combined traditional machine learning and deep learning. They
trained a model with 63 features, which included string-based, entity-based, number-based,
and deep learning-based similarity features. Moreover, Zhao et al[7] used latent semantic analysis to learn vector-space representations, together with
handcrafted features. Although traditional NLP approaches such as designing handcrafted
features achieve good performance, they suffer from sparsity due to the lack of large
annotated datasets and language ambiguity.[10]
Mueller and Thyagarajan[9] proposed a Siamese long short-term memory (LSTM) network for labeled data consisting
of sentence pairs of variable length. Their approach relies on pretrained word embeddings[16] and synonym augmentation. Further, Tai et al[13] proposed Tree-LSTMs, which use syntactic trees to construct sentence representations.
The standard LSTM model determines the hidden state from the current time-step input
and the previous time-step's hidden state. In contrast, the Tree-LSTM model determines its
hidden state from an input vector and the hidden states of all child units. The basic
idea is that, by reflecting the syntactic properties of a sentence, the tree network can
propagate information more efficiently than the standard sequential architecture.
Recently, bidirectional encoder representations from transformers (BERT)[14] has achieved state-of-the-art performance in more than 10 NLP tasks. It is a popular
approach for transfer learning and has been proven effective in achieving good
accuracy on small datasets.[14] [17] It can be used for tasks whose input is a sentence pair, such as sentence-pair regression,
question answering, and natural language inference. It learns distinctive embeddings
for the sentences, which helps the model differentiate between them.
SemEval (semantic evaluation) shared tasks have been held since 2012 to encourage
the development of automated methods for STS tasks.[1] [2] [18] [19] [20] [21] English STS has been widely studied,
with proposed state-of-the-art systems achieving high correlation (Pearson correlation score >80%) with human judgment.[2]
However, these previous tasks focus on the general English domain. Very few resources exist
for STS tasks in the clinical domain due to restricted access to clinical
data because of patient privacy and confidentiality.[15] [22] Wang et al[15] [22] created an English clinical STS dataset from actual notes at Mayo Clinic and organized
shared tasks in 2018 and 2019. In their dataset, they removed all protected health
information, and the dataset can be accessed by signing a Data Use Agreement.
In this study, we created two datasets for Japanese clinical STS: (1) Japanese case
reports (CR dataset) and (2) Japanese electronic medical records (EMR dataset). As
previously mentioned, the scarcity of resources in the clinical domain is due to
data privacy, which prohibits public sharing of medical data. To overcome this challenge,
we created one dataset from a public resource and made it publicly available.[a] Specifically, the CR dataset was created by extracting case reports from CiNii,[b] a Japanese database containing research publications in Japanese and English. Although
research publications differ from real clinical texts, they have been widely
used in clinical natural language processing (NLP) research to compensate for the
lack of publicly available real clinical texts. We also created a second dataset,
the EMR dataset, from real clinical documents; however, this dataset is not publicly
available.
Moreover, we used a BERT-based approach to capture the semantic similarity between
texts. Recently, many pretrained models, both general domain and domain specific, have
been developed. We investigate the performance of general and clinical domain pretrained
Japanese BERT models on clinical domain datasets. Therefore, our contributions include:
- Creating a publicly available dataset for Japanese sentence-level clinical STS from a public resource (CiNii), due to the privacy issues associated with hospital clinical data.
- Comparing the performance of the general and clinical Japanese BERT models.
Methods
Materials
This study used two Japanese datasets: case reports (CR dataset) and EMR documents
(EMR dataset). We created the CR dataset from case reports, which are publicly available.[c] By using the CR dataset, model performance can be measured on a publicly shareable
dataset. In contrast, the EMR dataset was generated from medical documents and is
not publicly available. The datasets consist of sentence pairs annotated on a scale
from 0 (low semantic similarity) to 5 (high semantic similarity), where 0 means that
the two sentences are completely dissimilar, i.e., their meanings do not overlap,
and 5 means that the two sentences are semantically equivalent.
Japanese Case Reports (CR Dataset)
We created a publicly available dataset[c] to motivate research on Japanese clinical STS. Few resources exist for clinical
STS tasks due to data privacy and confidentiality issues that prohibit public sharing
of medical data. To overcome this challenge and create a publicly available dataset,
we extracted Japanese case reports from CiNii,[d] a Japanese database containing articles published in Japanese journals
and conference proceedings. Japanese case reports were extracted from CiNii in PDF format (1,747
documents). The PDF documents were then converted to text using optical character recognition (OCR) and split into sentences.
Sentences that generally would not be found in real clinical documents, such as references
and author affiliations, were removed.
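The specific conversion and sentence-splitting tools are not central to the method, so the following is only a minimal sketch of the sentence-splitting step, assuming a simple rule-based splitter on Japanese (and ASCII) sentence terminators; the sample text is hypothetical.

```python
import re

def split_japanese_sentences(text: str) -> list[str]:
    """Split OCR-extracted text into sentences on Japanese and ASCII terminators."""
    # Keep each terminator attached to its sentence, then drop empty fragments.
    parts = re.split(r"(?<=[。．!?！？])\s*", text)
    return [s.strip() for s in parts if s.strip()]

raw = "胸部X線で異常陰影を認めた。精査目的で入院となった。"
print(split_japanese_sentences(raw))
# ['胸部X線で異常陰影を認めた。', '精査目的で入院となった。']
```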
After extracting all sentences, we created a dataset by considering all possible combinations
of sentence pairs. This resulted in a huge number of sentence pairs. Choosing sentence
pairs randomly would likely have resulted in a dataset where the semantic similarity
scores are highly imbalanced. Therefore, we adopted the approach used in previous
tasks (SemEval[1] [2] [18] [19] [20] [21] and MedSTS[22]). These previous studies use string similarity approaches to select sentence pairs
for annotation. Although string similarity cannot entirely capture semantic similarity,
it can capture some level of surface/syntactic similarity and hence significantly
reduce the human effort required to select sentence pairs for annotation.
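To illustrate this selection strategy, the sketch below scores candidate pairs with character n-gram cosine similarity. It is an assumption-laden approximation (using scikit-learn rather than the simstring library described in the next paragraph), and the sentences and variable names are hypothetical.

```python
from itertools import combinations
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def char_ngram_cosine(a: str, b: str, n: int = 2) -> float:
    """Cosine similarity between character n-gram count vectors of two sentences."""
    vec = CountVectorizer(analyzer="char", ngram_range=(n, n))
    m = vec.fit_transform([a, b])
    return float(cosine_similarity(m[0], m[1])[0, 0])

# Hypothetical sentences extracted from case reports.
sentences = ["胸部X線で異常陰影を認めた。", "胸部X線写真で異常影を認めた。", "退院後の経過は良好である。"]

# Score all possible pairs, then sample pairs across the whole 0-1 range
# so that the annotated similarity labels do not end up highly imbalanced.
scored_pairs = [(s1, s2, char_ngram_cosine(s1, s2)) for s1, s2 in combinations(sentences, 2)]
scored_pairs.sort(key=lambda pair: pair[2])
```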
In this study, we used the Python simstring library[e] to compute cosine similarity between the sentence pairs. Cosine similarity returns
a score between 0 and 1. About 4,000 sentence pairs spanning all scores (0 to 1) were
then selected for annotation by staff with a medical background. The annotator assigned
each sentence pair a similarity score from 0 (low semantic similarity) to 5 (high
semantic similarity) depending on the semantic similarity. A second annotator annotated
10% of the data, and the annotators had a weighted Cohen's kappa agreement of 0.67,
which can be regarded as acceptable for NLP tasks.[23] We used the same annotation guidelines as used in previous STS tasks,[1] [2] [15] [18] [19] [20] [21] [22] as shown in [Supplementary Appendix A], available in the online version only. [Fig. 2] shows the distribution of semantic similarity scores in the CR dataset.
Fig. 2 Distribution of semantic similarity scores in the Japanese case report dataset.
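For reference, weighted Cohen's kappa on the doubly annotated subset can be computed with scikit-learn as sketched below. The annotator score vectors are hypothetical, and the linear weighting scheme is an assumption (the paper does not state which weighting was used).

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical score vectors from the two annotators on the doubly annotated 10% subset.
annotator_1 = [0, 2, 5, 3, 4, 1, 0, 5]
annotator_2 = [0, 1, 5, 4, 4, 1, 0, 4]

# Linear weighting penalizes disagreements in proportion to their distance on the 0-5 scale.
kappa = cohen_kappa_score(annotator_1, annotator_2, weights="linear")
print(f"weighted kappa: {kappa:.2f}")
```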
Japanese Electronic Medical Documents (EMR Dataset)
This dataset was created from actual Japanese medical documents consisting of radiography
reports and electronic health record (EHR) notes. The EHR notes consist of progress
notes. The radiography reports were provided by the National Cancer Center Japan,
and the EHR notes were provided by Osaka University Hospital.[24] We filtered the medical documents for patients with more than one entry and created
document pairs in chronological order as [d_t, d_t+1]. We asked the annotator to read sentences in d_t
and determine their semantic similarity with sentences in d_t+1. This dataset consists of approximately 2,000 sentence pairs annotated with semantic
similarity scores from 0 to 5, similarly to the CR dataset. [Fig. 3] shows the distribution of the semantic similarity scores in the EMR dataset.
Fig. 3 Distribution of semantic similarity scores in the EMR dataset. EMR, electronic medical
record.
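A minimal sketch of the chronological pairing step described above is shown here; the record format, field names, and sample values are hypothetical and only illustrate grouping by patient and pairing consecutive documents [d_t, d_t+1].

```python
from collections import defaultdict

# Hypothetical record format: (patient_id, ISO date, document_text)
records = [
    ("P001", "2019-04-01", "doc A"),
    ("P001", "2019-04-15", "doc B"),
    ("P002", "2019-05-02", "doc C"),
]

by_patient = defaultdict(list)
for patient_id, date, text in records:
    by_patient[patient_id].append((date, text))

# For each patient with more than one entry, pair consecutive documents [d_t, d_t+1].
pairs = []
for docs in by_patient.values():
    docs.sort()  # chronological order by date
    pairs += [(docs[i][1], docs[i + 1][1]) for i in range(len(docs) - 1)]
```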
Model
We adopted BERT[14] since it has been proven effective in achieving good accuracy for small datasets[14] [17] [25] like ours. Data scarcity is one of the biggest challenges in NLP,
as most NLP tasks require large amounts of training data to achieve reasonable
accuracy. Dataset creation and annotation are expensive in terms of time and labor,
and data are often unavailable, especially in the clinical domain. This challenge can be
addressed by pretraining general domain language models on huge amounts of unlabeled
data.[1] These pretrained models can then be fine-tuned for specific tasks. Training the original BERT model
from scratch takes a lot of time, so fine-tuning a pretrained model reduces time and memory usage.
Pretrained BERT models, both general domain and domain specific, have been developed.
General domain models are pretrained on cross-domain texts and therefore lack domain-related
knowledge. In addition, the linguistic characteristics of general domain texts and clinical
domain texts differ, which creates the need for domain-specific BERT models.[26] In this study, we investigate the performance of general Japanese BERT[f] and clinical Japanese BERT[27] models. The general Japanese BERT is pretrained on Japanese Wikipedia texts, while
the clinical Japanese BERT is pretrained on Japanese clinical texts (mainly notes
by physicians and nurses) at the University of Tokyo Hospital.[27]
The most common approach to using BERT is a feature-based approach where fine-tuning
is not required and the BERT vectors are instead used like word embeddings. The output
of the BERT CLS token can also be used as a feature vector. The CLS (classification)
token is a special BERT token added at the start of a sequence that represents the
entire sequence.[14] Reimers and Gurevych[11] suggested that averaging the output of BERT or using the CLS token does not achieve
good performance. They investigated different pooling methods for the BERT output,
such as mean and maximum pooling. However, the best strategy for extracting feature
vectors is still an open problem.
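The two feature-extraction strategies mentioned above (CLS token vs. mean pooling) can be sketched with the Hugging Face transformers library as follows. The checkpoint name is only a placeholder for some Japanese BERT model, not necessarily the one used in this study.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Checkpoint name is illustrative; any Japanese BERT model loadable with AutoModel would do.
name = "cl-tohoku/bert-base-japanese"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

enc = tokenizer("胸部X線で異常陰影を認めた。", return_tensors="pt")
with torch.no_grad():
    hidden = model(**enc).last_hidden_state           # (1, seq_len, hidden_size)

cls_vector = hidden[:, 0, :]                           # feature vector from the [CLS] position
mask = enc["attention_mask"].unsqueeze(-1)             # ignore padding when averaging
mean_vector = (hidden * mask).sum(1) / mask.sum(1)     # mean pooling over real tokens
```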
[Fig. 4] shows the overview of our model. The input consists of the token sequences of the
two sentences concatenated by a special token, [SEP]. The input sequence also has
a [SEP] token at the end to mark the end of the input. The first token of the input
sequence is the BERT special classification token, [CLS]. BERT encodes the sentence
pair and passes the final hidden state of the [CLS] token as a representation of the input
sequence. The output of the [CLS] token is passed to a fully connected linear output
layer to calculate the semantic similarity score. The CR and EMR datasets are annotated
on a discrete scale from 0 to 5 (i.e., 0, 1, 2, 3, 4, 5). We therefore approached the task as a classification
problem and used a linear classifier with cross-entropy loss.[28]
Fig. 4 Overview of our model.
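A minimal sketch of this setup is shown below, using a transformers sequence-classification head (a linear layer on the [CLS] output trained with cross-entropy), which matches the architecture described above. The checkpoint name, example sentences, and gold label are assumptions for illustration only.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Checkpoint name is a placeholder; num_labels=6 corresponds to similarity scores 0-5.
name = "cl-tohoku/bert-base-japanese"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=6)

s1 = "胸部X線で異常陰影を認めた。"
s2 = "胸部X線写真で異常影を認めた。"
# The tokenizer builds the [CLS] s1 [SEP] s2 [SEP] input sequence for the pair.
enc = tokenizer(s1, s2, return_tensors="pt", truncation=True, max_length=128)

labels = torch.tensor([4])                 # hypothetical gold similarity score for this pair
out = model(**enc, labels=labels)          # linear head on [CLS] output + cross-entropy loss
loss, logits = out.loss, out.logits
predicted_score = int(logits.argmax(dim=-1))
```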
Results
Experimental Settings
The CR and EMR datasets were each split into a 70% training set and a 30% test set.
We also prepared an additional training set, n2c2, by translating the n2c2/OHNLP English
dataset into Japanese using Googletrans, a Python library that communicates
with the Google Translate API.[g] The n2c2/OHNLP dataset was provided in the 2019 n2c2/OHNLP Clinical Semantic Textual
Similarity shared task and is discussed in Wang et al.[15] [22] In our experiments, we use different combinations of the training data to see how
datasets with different language variability affect model performance.
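The translation step can be sketched with the Googletrans library as below; this is not necessarily the exact script used, the example sentence pair is hypothetical, and the synchronous call shown here depends on the library version (behavior of the unofficial Google Translate API wrapper may vary across releases).

```python
from googletrans import Translator

translator = Translator()

def translate_to_japanese(sentence: str) -> str:
    """Translate one English sentence to Japanese via the Googletrans wrapper."""
    return translator.translate(sentence, src="en", dest="ja").text

# Hypothetical n2c2/OHNLP-style sentence pair.
en_pair = ("The patient denies chest pain.", "No chest pain was reported by the patient.")
ja_pair = tuple(translate_to_japanese(s) for s in en_pair)
```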
We consider two experimental settings, which we refer to as strict and relaxed. In the strict setting, we use the six-point semantic similarity scale (i.e., 0, 1, 2, 3, 4, 5)
as described in the data annotation guidelines. In the relaxed setting, we use a four-point scale in which we combine scores 1 and 2, and scores 3 and 4, i.e.,
(0, [1, 2], [3, 4], 5). In the annotation guidelines (refer to [Supplementary Appendix A], available in the online version only), the annotators stated that it was sometimes
difficult to choose between semantic similarity scores 3 and 4. This is because, in
some cases, it is difficult to decide what constitutes “important” and “unimportant”
information. The same problem was experienced for semantic similarity scores
1 and 2. Therefore, we consider the relaxed setting for uniformity. We also expect
that this setting can improve classification performance.
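The relaxed setting amounts to a simple relabeling of the gold and predicted scores, as sketched below; the specific indices chosen for the merged bins are an assumption, since only the grouping (0, [1, 2], [3, 4], 5) matters.

```python
# Collapse the six-point scale to the relaxed four-point scale: 0, {1, 2}, {3, 4}, 5.
RELAXED = {0: 0, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3}

def to_relaxed(score: int) -> int:
    return RELAXED[score]

strict_labels = [0, 1, 2, 3, 4, 5]
print([to_relaxed(s) for s in strict_labels])  # [0, 1, 1, 2, 2, 3]
```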
Performance of General and Clinical Japanese BERT Models
We evaluated the performance using two evaluation metrics: the Pearson correlation
coefficient (as in the previous STS shared tasks[1] [2] [18] [19] [20] [21] [22]) and classification accuracy between the predicted scores and the gold scores.
[Tables 1] and [2] show the results for the CR and EMR test sets, respectively. Both models, the general
Japanese BERT and the clinical Japanese BERT, achieved good performance. In the
CR results, the general Japanese BERT achieved a Pearson correlation of 0.904 and
72% accuracy, whereas the clinical Japanese BERT's best Pearson correlation and accuracy
were 0.890 and 69%, respectively, in the strict setting. In the relaxed setting, the
general Japanese BERT's highest Pearson score and accuracy were 0.882 and 79%, respectively,
while the clinical Japanese BERT achieved a Pearson score of 0.862 and 75% accuracy.
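For completeness, the two metrics can be computed as sketched below with SciPy and scikit-learn; the gold and predicted score vectors are hypothetical.

```python
from scipy.stats import pearsonr
from sklearn.metrics import accuracy_score

# Hypothetical gold and predicted similarity scores on a test set.
gold = [0, 2, 5, 3, 4, 1, 0, 5]
pred = [0, 2, 4, 3, 4, 2, 0, 5]

pearson, _ = pearsonr(gold, pred)          # correlation between predicted and gold scores
accuracy = accuracy_score(gold, pred)      # fraction of exactly matching scores
print(f"Pearson: {pearson:.3f}, accuracy: {accuracy:.0%}")
```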
Table 1
CR results based on Pearson correlation and classification accuracy

Model | Training data | Pearson (strict) | Accuracy (strict) | Pearson (relaxed) | Accuracy (relaxed)
General Japanese BERT | CR | 0.904 | 72% | 0.878 | 79%
General Japanese BERT | EMR | 0.730 | 33% | 0.749 | 53%
General Japanese BERT | n2c2 | 0.716 | 35% | 0.705 | 55%
General Japanese BERT | CR + EMR | 0.897 | 71% | 0.882 | 78%
General Japanese BERT | CR + EMR + n2c2 | 0.895 | 71% | 0.879 | 78%
Clinical Japanese BERT | CR | 0.890 | 67% | 0.854 | 75%
Clinical Japanese BERT | EMR | 0.745 | 29% | 0.696 | 47%
Clinical Japanese BERT | n2c2 | 0.656 | 25% | 0.613 | 39%
Clinical Japanese BERT | CR + EMR | 0.885 | 68% | 0.862 | 76%
Clinical Japanese BERT | CR + EMR + n2c2 | 0.870 | 69% | 0.855 | 75%

Abbreviations: BERT, bidirectional encoder representations from transformers; CR, case report; EMR, electronic medical record.
Table 2
EMR results based on Pearson correlation and classification accuracy

Model | Training data | Pearson (strict) | Accuracy (strict) | Pearson (relaxed) | Accuracy (relaxed)
General Japanese BERT | CR | 0.692 | 53% | 0.692 | 68%
General Japanese BERT | EMR | 0.864 | 79% | 0.860 | 84%
General Japanese BERT | n2c2 | 0.569 | 33% | 0.558 | 63%
General Japanese BERT | CR + EMR | 0.856 | 79% | 0.857 | 85%
General Japanese BERT | CR + EMR + n2c2 | 0.875 | 81% | 0.870 | 86%
Clinical Japanese BERT | CR | 0.685 | 44% | 0.693 | 62%
Clinical Japanese BERT | EMR | 0.845 | 76% | 0.824 | 82%
Clinical Japanese BERT | n2c2 | 0.521 | 23% | 0.513 | 52%
Clinical Japanese BERT | CR + EMR | 0.862 | 79% | 0.848 | 83%
Clinical Japanese BERT | CR + EMR + n2c2 | 0.848 | 78% | 0.833 | 82%

Abbreviations: BERT, bidirectional encoder representations from transformers; CR, case report; EMR, electronic medical record.
In the EMR results, the general Japanese BERT's best Pearson score and accuracy were
0.875 and 81%, respectively, while the clinical Japanese BERT achieved a Pearson score
of 0.862 and 79% accuracy in the strict setting. In the relaxed setting, the general
Japanese BERT's Pearson score and accuracy were 0.870 and 86%, respectively, whereas
the clinical Japanese BERT's were 0.848 and 83%, respectively. Although both BERT models
performed well, overall the general Japanese BERT model achieved the highest performance
on both datasets.
Discussion
Effect of Training Data
In the CR results, training only on the CR dataset achieved the highest performance
in the strict setting (Pearson correlation of 0.904 and accuracy of 72%). In the relaxed
setting, training on CR + EMR achieved the best Pearson correlation score of 0.882,
but training only on CR achieved the highest accuracy of 79%.
We expected that training on more data would improve performance, but training
only on CR performed best. Note that training only on the n2c2 or EMR datasets achieved
average performance in terms of Pearson correlation (0.716 and 0.730 for the
strict setting; 0.705 and 0.749 for the relaxed setting). Nevertheless, the classification
accuracy is relatively low (35 and 33% for the strict setting; 55 and 53% for the
relaxed setting). Although the clinical Japanese BERT was trained on clinical texts, it
achieved low performance on the CR test data. This could be attributed to the fact
that case reports and real hospital text data differ in terms of vocabulary,
abbreviations, linguistic patterns, and even sentence length.
In the EMR results, training on a combination of all the datasets achieved the highest
performance (Pearson correlation of 0.875 and accuracy of 81% for the strict setting;
Pearson correlation of 0.870 and accuracy of 86% in the relaxed setting). The EMR
training set was small, and therefore adding more data provided more training examples
for our model, improving performance. Although both the EMR and n2c2 datasets
were created from real hospital documents, training on the n2c2 dataset achieved the lowest
performance (Pearson correlation of 0.569 and 33% accuracy for the strict setting;
Pearson correlation of 0.558 and 63% accuracy for the relaxed setting). This could be
because the n2c2 and our EMR datasets were created from different types
of clinical notes. Our EMR dataset consisted of sentences from radiography reports and
progress notes, while the n2c2 dataset consisted of sentences from other
types of clinical notes. Further, the n2c2 dataset was translated from English to
Japanese using Google Translate machine translation. The quality of the machine translation
was sufficient, and most medical terms were translated correctly. Although our preliminary
manual check of the translated sentences looked sufficient, the performance of the
proposed method could be improved by adopting better translation models. Precisely relating
machine translation quality to STS performance is left for future work.
In the EMR results, we expected the clinical Japanese BERT to achieve the best performance
since it is trained on clinical texts, but the general Japanese BERT attained the
highest performance. Although this could be surprising, since domain-specific pretraining
is generally expected to perform better, the result suggests that semantic textual
similarity relies more on fundamental linguistic features. This finding therefore
encourages clinical applications based on semantic textual similarity, since widely
available, general domain BERT models would work well. Moreover, the high performance
of the general Japanese BERT could also be due to the fact that it is trained on a wide
range of texts and therefore generalizes well.
Error Analysis
[Tables 3] and [4] show error examples for the CR and EMR test sets, respectively. In the CR results,
example (a) in [Table 3] shows an example of the abbreviation expansion problem. Abbreviation expansion is a
major problem in other NLP tasks as well, and a precise method to handle it is needed in
future work. Example (b) is a case of language variability: although
the sentences are similar in meaning, the choice of words varies greatly. In example
(c), the model assigned a higher score to the sentence pair because the sentences
are in fact highly similar and have only a minor difference (“Yamada type I or type II” in
the first sentence, and “Yamada type III” in the second sentence). Although this kind
of difference is important in the clinical domain, in the general English domain this
sentence pair could be treated as semantically equal. In example (d), the sentences
are roughly equivalent, and although our model assigned a score of 4, the gold score
should be 3.
Table 3
Error analysis in the CR dataset
Table 4
Error analysis in the EMR dataset
Abbreviation: EMR, electronic medical record.
In the EMR results, the sentence length varies greatly, with some very short sentences (one
or two words). Example (a) in [Table 4] shows a typical example of the short sentences found in EMR notes. Sentence
lengths in the EMR dataset differ widely, and our model was not able to correctly classify
sentence pairs with very short sentences. In the sentence pairs of examples (b) and (c),
our model assigned a lower score because, although the sentences have high semantic
similarity, the choice of words is quite different. For example, in (c), “almost no
change” and “slightly decreased” have close meanings semantically. It is easy for humans
to capture this kind of meaning but difficult for machines to capture this
kind of similarity. The sentence pairs of examples (d) and (e) show a case of a positive–negative
relationship. Our model was not able to capture negation, and in the future it is necessary
to train our model to identify this kind of relationship.
Conclusion
STS tasks have been widely studied, especially in the general English domain. However,
only a few resources exist for STS tasks in the clinical domain and in languages other
than English, such as Japanese. To bridge this gap, we created a publicly available
dataset for Japanese clinical STS. The dataset consists of approximately 4,000 sentence
pairs extracted from Japanese case reports and annotated with a semantic similarity score
from 0 (low semantic similarity) to 5 (high semantic similarity).
We used a BERT-based approach to capture semantic similarity between clinical domain
texts. In our experiments, we achieved a high Pearson correlation between the
predicted scores and the gold scores (0.904 on the CR dataset; 0.875 on the EMR dataset).
In this study, we also compared the performance of the general and clinical Japanese
BERT models. Although both models achieved good performance, the general Japanese
BERT achieved higher performance than the clinical Japanese BERT on
our clinical domain datasets. Though this could be surprising, because domain-specific
pretraining is generally known to perform better, the results suggest that semantic
textual similarity relies more on fundamental linguistic features. This finding particularly
encourages clinical applications based on semantic textual similarity, since widely
available general domain BERT models would work well.