Abstract
Evaluating natural language processing (NLP) systems in the clinical domain is a difficult
task that is important for the advancement of the field. A number of NLP systems have
been reported that extract information from free-text clinical reports, but few
of these systems have been evaluated. The evaluations that were performed reported good
performance, but the results were often weakened by ineffective evaluation methods. In
this paper we describe a set of criteria aimed at improving the quality of NLP evaluation
studies. We present an overview of NLP evaluations in the clinical domain and also
discuss the Message Understanding Conferences (MUC) [1-4]. Although these conferences
constitute a series of NLP evaluation studies performed outside of the clinical domain,
some of the results are relevant within medicine. In addition, we discuss a number
of factors that contribute to the complexity inherent in the task of evaluating
natural language systems.
Keywords
Natural Language Processing - Medical Language Processing - Evaluation of Clinical
Information Resources