Development of a Medical-text Parsing Algorithm Based on Character Adjacent Probability Distribution for Japanese Radiology Reports

N. Nishimoto; S. Terae; M. Uesugi; K. Ogasawara; T. Sakurai

doi:10.3414/ME9127

Methods Inf Med 2008; 47(06): 513-521
DOI: 10.3414/ME9127

Original Article

Schattauer GmbH

Authors

N. Nishimoto¹, S. Terae², M. Uesugi¹, K. Ogasawara³, T. Sakurai¹

¹Department of Medical Informatics, Hokkaido University, Sapporo, Japan
²Department of Radiology, Hokkaido University, Sapporo, Japan
³Department of Health Sciences, Hokkaido University, Sapporo, Japan

Publication History

Publication Date:
18 January 2018 (online)

Summary

Objectives: The objectives of this study were to investigate the transitional probability distribution of medical term boundaries between characters and to develop a parsing algorithm specifically for medical texts.

Methods: Medical terms in Japanese computed tomography (CT) reports were identified using the ChaSen morphological analysis system. MeSH-based medical terms (51,385 entries), obtained from the Metathesaurus of the Unified Medical Language System (UMLS, 2005AA), were added to ChaSen as a medical dictionary. A radiographer corrected the parsing results for 300 CT reports. In addition, two radiologists checked the medical term parsing of 200 CT sentences.

Results: We obtained modified inter-annotator agreement scores for the text corrected by the radiologists. We retrieved the transitional probability as the conditional probability of a character given its preceding context, using uni-gram, bi-gram, and tri-gram models. The highest transitional probability P(C_i | C_{i-2} C_{i-1}) was 1.00. For example, the anatomical-location term “pulmonary hilum” was parsed as a tri-gram.
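
As an illustration of the quantity reported above, the following is a minimal sketch of how a character-level transitional probability P(C_i | C_{i-2} C_{i-1}) could be estimated from raw text by relative frequency. It is not the authors' implementation: the function name `transitional_probabilities`, the `order` parameter, and the toy input string are assumptions made for this example, and the sketch omits any smoothing or escape mechanism that a prediction-by-partial-matching model would use.

```python
from collections import Counter


def transitional_probabilities(sentences, order=2):
    """Estimate P(c_i | previous `order` characters) by relative frequency.

    `sentences` is an iterable of strings; contexts never cross sentence
    boundaries. order=2 gives the tri-gram case P(C_i | C_{i-2} C_{i-1}),
    order=1 the bi-gram case, and order=0 plain character frequencies.
    """
    context_counts = Counter()  # how often each context was followed by any character
    ngram_counts = Counter()    # how often each (context, next character) pair occurred
    for text in sentences:
        for i in range(order, len(text)):
            context = text[i - order:i]
            context_counts[context] += 1
            ngram_counts[(context, text[i])] += 1
    return {
        (context, char): count / context_counts[context]
        for (context, char), count in ngram_counts.items()
    }


# Toy input chosen for illustration only; it is not data from the study.
probs = transitional_probabilities(["肺門部肺門部肺野"], order=2)
for (context, char), p in sorted(probs.items()):
    print(f"P({char} | {context}) = {p:.2f}")
```

In this toy string the context 肺門 is always followed by 部, so the estimate for that pair is 1.00, whereas the context 部肺 is followed by two different characters and each continuation receives 0.50. A probability of 1.00 of this kind indicates that the three characters always occur together in the sample.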

Conclusions: Retrieval of transitional probability will improve the accuracy of parsing compound medical terms.

Keywords

Prediction by partial matching - transitional probability - natural language processing - Unified Medical Language System - compound terms
