Methods Inf Med 2008; 47(06): 513-521
DOI: 10.3414/ME9127
Original Article
Schattauer GmbH

Development of a Medical-text Parsing Algorithm Based on Character Adjacent Probability Distribution for Japanese Radiology Reports

N. Nishimoto (1), S. Terae (2), M. Uesugi (1), K. Ogasawara (3), T. Sakurai (1)
1   Department of Medical Informatics, Hokkaido University, Sapporo, Japan
2   Department of Radiology, Hokkaido University, Sapporo, Japan
3   Department of Health Sciences, Hokkaido University, Sapporo, Japan

Publication History

Publication Date:
18 January 2018 (online)

Summary

Objectives: The objectives of this study were to investigate the transitional probability distribution of medical term boundaries between characters and to develop a parsing algorithm specifically for medical texts.

Methods: Medical terms in Japanese computed tomography (CT) reports were identified using the ChaSen morphological analysis system. MeSH-based medical terms (51,385 entries), obtained from the metathesaurus in the Unified Medical Language System (UMLS, 2005AA), were added as a medical dictionary for ChaSen. A radiographer corrected the set of results containing 300 parsed CT reports. In addition, two radiologists checked the medical term parsing of 200 CT sentences.

Results: We obtained modified inter-annotator agreement scores for the text corrected by the radiologists. We retrieved the transitional probability as the conditional probability of a uni-gram, bi-gram, and tri-gram. The highest transitional probability $P(C_i \mid C_{i-2} C_{i-1})$ was 1.00. As an example of an anatomical location, the term “pulmonary hilum” was parsed as a tri-gram.
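To illustrate what a character-level transitional probability looks like, the following minimal sketch estimates P(C_i), P(C_i | C_{i-1}), and P(C_i | C_{i-2} C_{i-1}) by maximum-likelihood counting. It is not the authors' implementation: the function names, the plain count-ratio estimator, and the four toy strings (chosen for illustration, not taken from the study's CT reports) are all assumptions.

```python
from collections import Counter


def ngram_counts(texts, n):
    """Count character n-grams, treating each string separately
    so that no n-gram spans two strings."""
    counts = Counter()
    for text in texts:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def transitional_probability(texts, context, char):
    """Maximum-likelihood estimate of P(char | context).

    context is "" for a uni-gram, one character for a bi-gram,
    and two characters for a tri-gram. Returns 0.0 when the
    context was never observed with a following character.
    """
    n = len(context) + 1
    joint = ngram_counts(texts, n)
    # Sum over every n-gram that begins with the context, i.e. every
    # occurrence of the context that is followed by some character.
    context_total = sum(c for gram, c in joint.items() if gram.startswith(context))
    if context_total == 0:
        return 0.0
    return joint[context + char] / context_total


# Hypothetical toy corpus (illustrative strings only, not study data).
toy_corpus = ["肺門部", "肺門部リンパ節", "肺野", "肝門部"]

p_uni = transitional_probability(toy_corpus, "", "肺")
p_bi = transitional_probability(toy_corpus, "肺", "門")
p_tri = transitional_probability(toy_corpus, "肺門", "部")
print(p_uni, p_bi, p_tri)
```

In this toy corpus the two-character context 肺門 is always followed by 部, so the tri-gram estimate reaches 1.0, whereas the bi-gram estimate for 門 after 肺 is lower because 肺 is also followed by 野. A real system would likely need smoothing for unseen contexts, which this sketch omits.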

Conclusions: Retrieval of transitional probability will improve the accuracy of parsing compound medical terms.

 