Abstract
A method is presented for assigning classification codes to pathology reports by searching
an archive collection for similar reports. The search key is textual similarity,
which serves as an estimate of the true, semantic similarity. This method does not require explicit
modeling, and can be applied to any language or any application domain that uses natural
language reporting. A number of simulation experiments were run to assess the accuracy
of the method and to indicate the role of archive size and of the transfer of
document collections across laboratories. In at least 63% of the simulation trials,
the most similar archive text offered a suitable classification on organ, origin and
diagnosis. In 85 to 90% of the trials, the archive's best solution was found within
the first five similar reports. The results indicate that the method is suitable for
its purpose: suggesting potentially correct classifications to the reporting diagnostician.
Keywords
Natural Language Processing - Nomenclature - Information Storage and Retrieval
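
The retrieval idea summarized in the abstract can be illustrated with a minimal sketch. It is not the similarity measure used in the paper, which the abstract does not specify: cosine similarity over raw word counts is assumed here as a stand-in for textual similarity. The function names, the toy archive and the classification codes are all hypothetical and serve only to show how the codes of the most similar archived reports could be offered as suggestions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def suggest_codes(new_report, archive, k=5):
    """Return the k (similarity, code) pairs of the most similar archived reports.

    `archive` is a list of (report_text, classification_code) pairs.
    """
    query = Counter(tokenize(new_report))
    scored = [(cosine(query, Counter(tokenize(text))), code)
              for text, code in archive]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical toy archive; the codes are invented for illustration only.
ARCHIVE = [
    ("skin biopsy showing basal cell carcinoma", "SKIN-BIOPSY-BCC"),
    ("colon polyp tubular adenoma", "COLON-POLYP-ADENOMA"),
    ("skin excision benign naevus", "SKIN-EXCISION-NAEVUS"),
]

suggestions = suggest_codes("biopsy of skin with basal cell carcinoma", ARCHIVE, k=2)
for score, code in suggestions:
    print(f"{score:.2f}  {code}")
```

Returning the top k candidates rather than a single best match mirrors the reported finding that the archive's best solution usually appears among the first five similar reports; the final choice stays with the reporting diagnostician.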