Methods Inf Med 1999; 38(02): 96-101
DOI: 10.1055/s-0038-1634180
Original Article
Schattauer GmbH

Term Domain Distribution Analysis: a Data Mining Tool for Text Databases

J. A. Goldman, W. W. Chu, D. S. Parker, R. M. Goldman
1   Computer Science Department, University of California, Los Angeles, CA, USA

Publication History

Publication Date:
08 February 2018 (online)

Abstract

In this paper, we present a case history illustrating a real-world application of a technique for data mining of text databases. The technique, which we call Term Domain Distribution Analysis (TDDA), tracks term frequencies over specific finite domains and reports significant deviations from the expected frequency distributions over these domains as hypotheses. TDDA is part of a larger framework, the Digital Filter Model, for data mining of text documents. In the case study presented, the domain of terms was the pair {right, left}, over which we expected a uniform distribution. Applied to term frequencies in a thoracic lung cancer database, TDDA led to the surprising discovery that primary lung cancer tumors appear in the right lung more often than in the left lung, at a ratio of 3:2. Treating this text discovery as a hypothesis, we verified the relationship against reports of primary lung tumor sites in the medical literature, using a standard χ² statistic. We subsequently developed a working theoretical model of lung cancer that may explain the discovery. This discovery and our model may change how oncologists view the mechanisms of primary lung tumor location.
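
The core step described in the abstract, counting the terms of a finite domain and comparing the counts with an expected uniform distribution, can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the tokenization rule, and the toy report snippets are all assumptions made for the example, and the p-value shortcut applies only to a two-term domain (one degree of freedom).

```python
import math
import re


def domain_counts(documents, domain):
    """Count how often each term of a finite domain occurs in the documents."""
    counts = {term: 0 for term in domain}
    for text in documents:
        # Hypothetical tokenization: lowercase alphabetic runs only.
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in counts:
                counts[token] += 1
    return counts


def chi_square_uniform(counts):
    """Chi-square goodness-of-fit statistic against a uniform distribution."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no domain terms were observed")
    expected = total / len(counts)
    return sum((obs - expected) ** 2 / expected for obs in counts.values())


def p_value_one_df(statistic):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom.

    For one degree of freedom the survival function is erfc(sqrt(x / 2)),
    so no statistics library is needed for a two-term domain.
    """
    return math.erfc(math.sqrt(statistic / 2.0))


# Made-up snippets for illustration only; they are not data from the paper.
toy_reports = [
    "Mass in the right upper lobe.",
    "Nodule in left lower lobe; right hilum clear.",
    "Right pleural effusion.",
]

counts = domain_counts(toy_reports, ("right", "left"))
statistic = chi_square_uniform(counts)
print(counts, statistic, p_value_one_df(statistic))
```

With so few observations the toy example is far from significant; a deviation would be announced as a hypothesis only when the p-value falls below a chosen threshold, and the threshold itself is a design choice that the abstract does not specify.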

 