Summary
Objectives: Graphical displays can make data more understandable; however, large graphs can challenge
human comprehension. We have previously described a filtering method to provide high-level
summary views of large data sets. In this paper we demonstrate our method for setting
and selecting thresholds to limit graph size while retaining important information
by applying it to large single and paired data sets, taken from patient and bibliographic
databases.
Methods: Four case studies are used to illustrate our method. The data are either patient
discharge diagnoses (coded using the International Classification of Diseases, Ninth Revision, Clinical
Modification [ICD9-CM]) or MEDLINE citations (coded using the Medical Subject Headings
[MeSH]). We use combinations of different thresholds to obtain filtered graphs for
detailed analysis. The case studies demonstrate in detail how thresholds are set and
selected, including thresholds for node counts, class counts, ratio values, p values
(for diff data sets), and percentiles of selected class count thresholds.
The main steps are data preparation, data manipulation, computation, and threshold
selection and visualization. We also describe the data models for the different types
of thresholds and the considerations for threshold selection.
Results: The filtered graphs are 1%-3% of the size of the original graphs. For our case studies,
the graphs provide 1) the most heavily used ICD9-CM codes, 2) the codes with the most
patients in a research hospital in 2011, 3) a profile of publications on “heavily
represented topics” in MEDLINE in 2011, and 4) validated knowledge about adverse effects
of rosiglitazone and interesting new areas in the ICD9-CM hierarchy
associated with patients taking pioglitazone.
Conclusions: Our filtering method reduces large graphs to a manageable size by removing relatively
unimportant nodes. The graphical method provides summary views based on computation
of usage frequency and semantic context of hierarchical terminology. The method
is applicable to large data sets (such as a hundred thousand records or more) and
can be used to generate new hypotheses from data sets coded with hierarchical terminologies.
Keywords
Data mining method - data filtering method - threshold setting - threshold selection
- data visualization - hierarchical terminology - data analysis - clinical data repository