Key words
Ankle - Natural Language Processing - Text Mining - Fibula Fracture - Automatic Classification - Data Set
Introduction
The analysis of electronic health records (EHRs) lays the basis for a developing healthcare system, as it enables access to large data volumes [1] [2] [3], which support research and can ultimately increase patient safety and decrease healthcare costs [4] [5]. Radiological reports are a particularly rich source of compact clinical information within an EHR. These reports document information about the patient's health status and the radiologist's interpretation of medical findings. However, written radiological reports are often unstructured, which poses a challenge for their conversion into a computer-based representation [1] [6].
Machine learning (ML) and natural language processing (NLP) are subsections of artificial intelligence. Classic ML methods can model data, such as radiology reports, using (un-)supervised learning methods [7]. This typically requires pre-processing by means of NLP in order to extract machine-readable features from unstructured texts. In this step, feature extractors transform the raw data into a suitable internal representation. During this feature extraction, uncorrelated or superfluous features may be deleted, which can improve the accuracy of learning algorithms. Nevertheless, the complexity of the natural language used in free-text reports and the variations among the different dictation styles of radiologists can be problematic [8]. Thus, the choice of feature extraction methods during pre-processing of texts is particularly important [9]. In contrast, modern ML methods, such as neural networks (NN), have the capability to perform an end-to-end approach. This includes feature extraction in the training pipeline of the model as one of many tunable hyperparameters, potentially leading to a better adapted model. After the conversion of unstructured free-text reports into feature vectors, classifiers can detect, extract, and classify patterns during (un-)supervised learning [6] [10]. Such structured information can, e. g., be the classification of patients into different groups.
NNs have become the gold standard for text processing, as they can achieve reliable results [11]. The current generation of NN-based models is derived from large transformer language models, such as BERT [12]. Adaptations for the medical domain include BioBERT [13] and ClinicalBERT [14]. BioBERT was mainly trained on 4.5 billion words from PubMed abstracts and 13.5 billion words from PMC articles. ClinicalBERT was trained on nearly 2 million anonymized notes written by clinical physicians.
However, classic ML methods such as support vector machines (SVM) have also been demonstrated to be suitable for the high-dimensional vectors extracted by NLP and are thus used in recent studies [15]. Logistic regression (LR) is a well-established method that provides robust results [16].
Reports of X-ray images of the ankle are a suitable candidate for testing a feature extraction/classification system, as fractures of the distal fibula are common, accounting for 70 % of all ankle fractures [17]. Distal fibula fractures can be isolated or combined with distal tibia fractures (bimalleolar or trimalleolar fractures) [18]. Unstable ankle fractures are usually treated by open reduction and internal fixation [19] [20]. Consequently, plenty of pre- and postoperative X-ray images of the lower fibula exist in every hospital with a trauma or orthopedic unit. As postoperative complications can potentially lead to long-term impairments [18], further research that takes these enormous data volumes into account can ultimately improve patient safety.
Text mining (a term commonly used to denote the task of NLP) [6] techniques for radiological reports have previously been proposed to support the detection and surveillance of various diseases, including bone fractures [5] [21] [22] [23]. The aim of this study was to find the best feature extraction method for free-text radiological reports and to classify reports of ankle X-rays with respect to fractures of the distal fibula.
Materials and methods
This retrospective, IRB-approved study was performed between 02/2019 and 01/2020. We assessed de-identified free-text radiological reports of two-plane ankle X-rays from patients treated at Hannover Medical School between 01/2015 and 09/2019.
Training dataset
Due to a lack of existing data, we established a novel German language report dataset. A designated search engine based on the Enterprise Clinical Research Data Warehouse of the Hannover Medical School comprising pseudonymized clinical data of > 2.3 million patients was used to identify radiographs of the ankle. Data was used exclusively from inpatients who consented to the usage of their data for research purposes. The search was conducted using the search term “OSG in 2 Ebenen” (ankle X-ray in two planes). A radiologist manually assigned class labels to 3268 reports according to whether the report described a fracture of the distal fibula or not. Reports were excluded if no statement about the distal fibula was made. Only texts directly reporting on the presence (e. g., “dislocated fracture of the distal fibula”) or absence (e. g., “no fracture of the distal fibula”) were included in the training dataset. Reports describing tibial involvement (bi- or trimalleolar fractures), other fractures, and combined reports covering X-rays beyond the ankle were included in the analysis. Another dataset containing 400,000 radiology reports was used to train the Doc2Vec models (see below).
For the freely available dataset (link: https://doi.org/10.26068/mhhrpm/20230208-000), a further de-identification step was performed manually to remove names of patients and physicians as well as dates, where applicable.
Pre-processing
As classification is performed on numerical data, the first steps of ML on the texts were cleaning, normalizing, and pre-processing the data, which transformed the text into machine-readable numerical vectors ([Fig. 1]). We used the nltk stopword list to remove stopwords and a self-programmed script to remove HTML tags. Since stemming of German words and clinically used abbreviations altered their literal meaning and thus negatively impacted the area under the curve (AUC), we decided not to use a stemmer. Furthermore, we removed the words “nicht”, “viel”, and “sehr” (engl. “not”, “much”, “very”) from the stopword list.
Fig. 1 Machine learning workflow in this study. BOW: bag-of-words; NMF: non-negative matrix factorization; TF-IDF: term frequency-inverse document frequency; PCA: principal component analysis; LDA: latent Dirichlet allocation; LR: logistic regression; SVM: support vector machines; NN: neural networks.
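A minimal sketch of this cleaning step is shown below, assuming nltk's German stopword list and a simple regular expression for tag removal; the exact script used in the study is not reproduced here, and the tokenization pattern is an illustrative assumption.

```python
# Illustrative pre-processing sketch (not the authors' original script).
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

KEEP = {"nicht", "viel", "sehr"}  # removed from the stopword list, as described above
GERMAN_STOPWORDS = set(stopwords.words("german")) - KEEP
TAG_RE = re.compile(r"<[^>]+>")   # crude HTML tag remover (assumed pattern)

def clean_report(text: str) -> str:
    """Strip HTML tags, lowercase, tokenize, and drop stopwords; no stemming is applied."""
    text = TAG_RE.sub(" ", text)
    tokens = re.findall(r"\w+", text.lower())
    return " ".join(t for t in tokens if t not in GERMAN_STOPWORDS)

print(clean_report("<b>Die distale Fibula ist nicht frakturiert.</b>"))
# -> "distale fibula nicht frakturiert"
```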
The feature extraction methods bag-of-words (BOW), term frequency-inverse document frequency (TF-IDF), principal component analysis (PCA), non-negative matrix factorization (NMF), latent Dirichlet allocation (LDA), and Doc2Vec were used for pre-processing. BOW is the simplest and most commonly used method for text representation [24], and TF-IDF is likewise a robust and common pre-processing method. Since they count the frequency of word occurrences in a text, both techniques transform text data into very high-dimensional vectors. NMF, PCA, and LDA are methods for dimensionality reduction. PCA is one of the most commonly used methods in the basic literature [25], leading to solid results. Simply put, PCA reduces a dataset of potentially correlated features to a set of values that are linearly uncorrelated. NMF is an easily interpretable linear technique that is robust for word and vocabulary recognition while compressing the original text into smaller data vectors. LDA is popular in topic modeling, where the main topics in a text are extracted and classified [26]. Doc2Vec is a method that uses deep learning (a technique based on neural networks (NN)) to train a model that not only transforms texts into vectors but also models how similar these texts are. The various methods were compared by accuracy (acc) and AUC.
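The following sketch illustrates how these six representations could be computed with scikit-learn and gensim. The example reports, component counts, and vector sizes are illustrative assumptions, not the configuration used in the study.

```python
# Illustrative feature extraction with scikit-learn and gensim (assumed hyperparameters).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import PCA, NMF, LatentDirichletAllocation
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

reports = [
    "distale fibula nicht frakturiert",
    "dislozierte fraktur der distalen fibula",
    "keine fraktur der distalen fibula abgrenzbar",
    "fraktur der distalen fibula",
]

# Count-based representations (sparse, high-dimensional)
bow = CountVectorizer().fit_transform(reports)       # bag-of-words counts
tfidf = TfidfVectorizer().fit_transform(reports)     # TF-IDF weights

# Dimensionality reduction applied to the count/TF-IDF vectors
pca = PCA(n_components=2).fit_transform(tfidf.toarray())            # linearly uncorrelated components
nmf = NMF(n_components=2, init="nndsvda").fit_transform(bow)        # non-negative factors
lda = LatentDirichletAllocation(n_components=2).fit_transform(bow)  # topic mixtures

# Doc2Vec: a shallow neural network learns dense document embeddings
tagged = [TaggedDocument(words=r.split(), tags=[i]) for i, r in enumerate(reports)]
d2v = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=20)
doc_vectors = [d2v.infer_vector(r.split()) for r in reports]
```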
Supervised learning
The pre-processed data were randomly divided into training and test datasets, with an additional validation dataset for the neural network, in order to avoid overfitting and thus obtain more reliable results. During training, the algorithms never came into contact with the test data; it was kept separate for the evaluation of the trained algorithms on unseen data. Three different ML algorithms were trained on the resulting feature vectors: NN, SVM, and LR. The algorithms were optimized for AUC and evaluated with 10-fold cross-validation on the training dataset.
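A minimal sketch of this step, assuming scikit-learn implementations (with MLPClassifier standing in for the neural network and random placeholder data), is shown below; the actual architectures and hyperparameters of the study may differ.

```python
# Illustrative train/test split and 10-fold cross-validation optimized for ROC AUC.
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# X: feature vectors from one of the extraction methods above; y: fracture labels.
# Random placeholder data is used here instead of the report dataset.
rng = np.random.default_rng(0)
X = rng.random((200, 50))
y = rng.integers(0, 2, 200)

# Hold back a test set that the classifiers never see during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

classifiers = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(),  # decision_function is sufficient for ROC AUC scoring
    "NN": MLPClassifier(hidden_layer_sizes=(64,),
                        early_stopping=True,  # splits off an internal validation set
                        max_iter=500),
}
for name, clf in classifiers.items():
    auc = cross_val_score(clf, X_train, y_train, cv=10, scoring="roc_auc")
    print(f"{name}: mean cross-validated AUC = {auc.mean():.3f}")
```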
Results
Training dataset
We assessed 3268 unstructured radiological reports of two-plane ankle X-rays. 640 reports were excluded, as they did not directly report on the distal fibula, so that it could not be determined whether a distal fibula fracture was present or not. The remaining 2628 free-text reports were included in the training dataset. Of those, 41 % (1076) described a fracture of the distal fibula, and 59 % (1552) stated that no fracture of the distal fibula was present. The free-text reports were short, with a median length of 646 characters (interquartile range (IQR) 514–824).
In line with the open data initiative for research transparency, the dataset is published under the following link: https://doi.org/10.26068/mhhrpm/20230208-000.
Machine Learning
Six feature extraction methods (BOW, TF-IDF, PCA, NMF, LDA, Doc2Vec) were used to train three different ML algorithms (NN, SVM, and LR), which were optimized for the AUC. The trained models were used to predict the labels of the test data, and the AUC was calculated. The BOW model achieved the best results (AUC: NN 0.99; SVM 0.97; LR 0.97), closely followed by TF-IDF (AUC: NN 0.99; SVM 0.96; LR 0.96). In combination with NN, NMF achieved similar results (AUC 0.98). For details, refer to [Table 1] (AUC data) and [Table 2] (accuracy data).
Table 1
Overview table of AUC values of various feature extraction methods used to train different ML algorithms and evaluated with 10-fold cross-validation on the training dataset.
BOW: bag of words; LDA: Latent Dirichlet allocation; LR: Logistic regression; NMF: Non-negative matrix factorization; NN: Neural network; PCA: Principal component analysis; SVM: Support vector machine; TF-IDF: Term frequency-inverse document frequency.
|         | NN   | SVM  | LR   | Average AUC |
| Dummy   | 0.5  |      |      |             |
| BOW     | 0.99 | 0.97 | 0.97 | 0.977       |
| TF-IDF  | 0.99 | 0.96 | 0.96 | 0.970       |
| NMF     | 0.98 | 0.90 | 0.90 | 0.927       |
| PCA     | 0.95 | 0.91 | 0.90 | 0.920       |
| LDA     | 0.94 | 0.89 | 0.88 | 0.903       |
| Doc2Vec | 0.94 | 0.90 | 0.85 | 0.897       |
Table 2
Overview table of accuracy values of various feature extraction methods used to train different ML algorithms and evaluated with 10-fold cross-validation on the training dataset.
BOW: bag of words; LDA: Latent Dirichlet allocation; LR: Logistic regression; NMF: Non-negative matrix factorization; NN: Neural network; PCA: Principal component analysis; SVM: Support vector machine; TF-IDF: Term frequency-inverse document frequency.
|         | NN   | SVM  | LR   | Average Accuracy |
| Dummy   | 0.5  |      |      |                  |
| BOW     | 0.96 | 0.97 | 0.97 | 0.967            |
| TF-IDF  | 0.95 | 0.96 | 0.97 | 0.960            |
| NMF     | 0.94 | 0.91 | 0.90 | 0.917            |
| PCA     | 0.91 | 0.90 | 0.90 | 0.903            |
| LDA     | 0.88 | 0.89 | 0.88 | 0.883            |
| Doc2Vec | 0.87 | 0.90 | 0.86 | 0.877            |
Discussion
In this manuscript, we describe our approach to classify unstructured radiograph reports according to fractures of the distal fibula. Special attention was paid to various feature extraction methods for pre-processing. To do so, we created a manually labeled novel German language report dataset, which is not yet available across the German medical NLP landscape in this format and is specifically based on radiological findings. We invite other groups to use our dataset, which is available as open data (link: https://doi.org/10.26068/mhhrpm/20230208-000).
Our automated classification pipeline was able to reliably detect findings of fractures of the distal fibula. BOW was the most reliable feature extraction method for the tested models in combination with the aforementioned dataset. TF-IDF achieved AUC values very similar to BOW. TF-IDF is characterized by a lower number of dimensions. However, this does not confer a relevant advantage, as the employed models (especially neural networks) can reliably handle high-dimensional data such as that provided by methods like BOW. Non-negative matrix factorization (NMF) proved to be a solid alternative for producing vectors with lower dimensions. In conjunction with the supervised learning method NN, NMF achieved AUC values similar to BOW and TF-IDF. The selection of an appropriate feature extraction method for pre-processing significantly impacted the results of the machine learning model, meaning that, in our tests, the best classification method could not compensate for an ill-suited feature extraction method. In this study, the choice of document representation during pre-processing might thus be more important than the choice of classifier for the ML part.
In various studies, open-source datasets in English were used to compare innovative feature extraction methods with established techniques. Kim et al., e. g., performed a comparison of BOW, Doc2Vec, TF-IDF, and a self-developed text representation method (bag-of-concepts). Contrary to our results, Doc2Vec showed the best results, and TF-IDF outscored BOW [27]. In contrast to our study, Kim et al. classified non-medical texts. Similar results were presented in a study comparing TF-IDF, LDA, and Doc2Vec on several datasets, one of which was EHR-based [28]. Doc2Vec showed the best results, while LDA and TF-IDF were on par. However, comparability with our study is limited, as medical and non-medical texts were not analyzed separately. Furthermore, in our study, Doc2Vec was trained on a specific type of medical text (radiology reports), which might limit the diversity of its informational content. This might imply that text representation methods need to be tailored to the type of text. However, further research is necessary to substantiate this hypothesis.
For further studies, it could be interesting to evaluate the impact of the inclusion of various medical texts on the results. A suitable dataset to validate (or refute) our results in future studies might be a German preprint dataset published by Borchert et al. [29], which was not available at the time of our analysis.
Large transformer-based language models for the medical domain, such as BioBERT and ClinicalBERT, could not be applied to our dataset, as they specifically target the English language. Currently, this type of model is not publicly available for the German language in the radiological domain. However, we see the potential of this development and are contributing our anonymized dataset of German clinical notes as open access.
Conclusion
The future of improved patient care relies on the utilization of big data. The health sector has experienced widespread digitalization in recent years, which has led to a continuously growing amount of patient data. As radiology was among the first specialties for which computerization became obligatory for daily work, it is widely digitized. Therefore, a significant amount of data is digitally stored in radiology reports. Unfortunately, these reports mostly consist of unstructured text, which is a major obstacle for the rapid extraction and subsequent use of information by clinicians and researchers [6]. As a result, radiology reports are often used only once, by the clinician who ordered the study, and are rarely used again [8].
ML information extraction techniques provide an effective method to automatically identify and classify free-text radiology reports, which can be useful in various clinical and non-clinical settings. An automated classification can support diagnostic surveillance, e. g., assist in the management of cases that require follow-up or even monitor public health-related trends such as increases in disease activity in a hospital or on a population level. Moreover, it can support cohort building for epidemiologic studies and also provide query-based case retrieval.
This study shows that automated classification of unstructured reports of ankle radiographs can reliably detect findings of fractures of the distal fibula. Special attention was paid to various methods for pre-processing, and the BOW model proved to be a particularly suitable feature extraction method for our setting. This automated classification system can serve as a reference for future studies as well as for decision-support systems, which might prospectively improve clinical management and patient safety.
Limitations
It needs to be emphasized that comparability between the studies mentioned is limited due to the varying pipeline setups and datasets used. Contrary to the discussed studies, our dataset was in German, which might impact the results. Furthermore, this project was narrowly focused on extracting a single type of information: the presence or absence of a fracture of the distal fibula. Information on other fractures or pathologies was not extracted. We set up a binary classification system, which did not classify the fractures into different subclasses. Furthermore, it needs to be assessed whether the classification system can reliably be used for other radiology reports.
Regarding the dataset, although the exam description should be “OSG in 2 Ebenen”, we cannot guarantee that the search term is exhaustive. Lastly, the achieved results might be over-adapted to the training dataset, which is a common problem in ML. To rule this out, the system will be validated on an independent, previously unseen dataset.
- Text mining techniques have the potential to support the detection and surveillance of diseases.
- In this manuscript, we describe our approach to automatically classify unstructured radiograph reports according to fractures of the distal fibula.
- Our automated classification system as well as the enclosed dataset might serve as a reference for future studies as well as decision-support systems, which could potentially improve clinical management and patient safety.