Keywords
Circulating tumor DNA - proteomics - metabolomics - decision support techniques - algorithms - omics integration
The Continued Promise of Cancer Informatics
The future of cancer informatics is predicated on the continued development of methodologies that can identify key altered pathways that are susceptible to molecular targeted or immunologic therapies[1]. The increasing customization of medical treatment to specific patient characteristics has been made possible through continued advances in a) our understanding of the physiologic mechanisms of disease through the proliferation of omics data (e.g., proteomics, metabolomics), and b) computing systems (e.g., patient matching algorithms) that facilitate the matching and development of targeted agents[2]. These advancements allow for improved outcomes and for reduced exposure to the adverse effects of unnecessary treatment. They can help us better decipher the inter-tumor (between patients) and intra-tumor (different tumors within the same patient) heterogeneity that is often a hurdle to treatment success and can contribute to both treatment failure and drug resistance[3]. Importantly, omics-based cancer medicine is here. In 2017, nearly 50% of the early-stage pipeline assets and 30% of late-stage molecular entities of pharmaceutical companies involved the use of biomarker tests[4]. Further, over a third of recent drug approvals have had DNA-based biomarkers included in their original US Food and Drug Administration (FDA) submissions[5]. We are also thinking about cancer informatics differently. Algorithmically, there has been a shift from gene signatures to nonlinear approaches such as neural networks and advanced aggregative techniques that model complex relationships among patients[6]. Importantly, these approaches are the root of cohort matching algorithms that aim to find “patients like my patient.” Results of these algorithms are simpler to understand and have propelled the growth of clinical trial matching algorithms. National trials such as NCI-MATCH[7], which match patients whose tumors harbor specific alterations to targeted medications, are a simple first step in this paradigm shift. The ability to perform complex matching, and to define matching rules, has relied on the growth of aggregated patient datasets and the ability to quickly assess tumor omics data.
This brief review focuses on three cancer omics data growth areas - proteomics, metabolomics, and circulating tumor and cell-free DNA. These omics approaches all aim to enrich our current, complex model of the relationships among genes. We will also touch upon the paradigm shift from singular omics signatures to patient cohort matching - a shift that may more readily take advantage of the large repositories of omics data that are being created. [Figure 1] underscores the foci of this review.
Fig. 1 Selected data and algorithm growth areas in cancer medicine.
Within the past several years, tumor omics technologies have been integrating into clinical practice. Concurrently, through this omics data we have increased our understanding of the underlying pathophysiology of not only the tumor, but also the patient/tumor interaction. Acquisition of this omics data, which is a focus of this review, has required improvements in detection techniques and data analysis. For example, assaying proteins using immunohistochemistry (IHC) - the use of single antibodies that bind to single proteins of interest in cancer tissue - is now being supplanted by mass spectrometry, which allows massively parallel identification of hundreds of proteins simultaneously. However, it has taken improved computer performance (and supercomputer clusters) to accurately identify this large number of proteins in a reasonable amount of time. This advancing field, proteomics, provides a far more accurate readout than IHC, which is often subjective and difficult to parallelize.
Similarly, metabolite biomarkers have traditionally been singular molecules detected by immunoassay in the clinic. The chemotherapeutic drug methotrexate, for example, is quantified via immunoassay[8]. However, immunoassays measure only single, known metabolites, and combinations of metabolites are often more clinically relevant than singular metabolites. With this in mind, metabolomics has emerged as a new omics field of study that aims to measure the abundances of all small molecules detectable in biospecimens including blood, tissue, urine, and breath, among others. Typically, mass spectrometry (MS) and nuclear magnetic resonance (NMR) techniques are applied to measure hundreds to thousands of metabolites in a given sample.
Advanced DNA sequencing, which ushered in the genomic revolution, has also improved greatly. Our ability to perform DNA sequencing with trace amounts of starting material (low-passage reads), with improved fidelity and detection, is allowing us to detect circulating tumor DNA from the blood. Circulating tumor DNA (ctDNA) is tumor-derived, fragmented DNA circulating in the blood, alongside cell-free DNA (cfDNA) from other sources (including normal cells), with fragments measuring about 150 bp. ctDNA provides an overview of the genomic reservoir of different tumor clones and their genomic diversity. ctDNA may finally provide a means of assaying intra-patient tumor heterogeneity - allowing us to get a sense of the relative abundance of genomic alterations across metastatic deposits within a patient. In the following sections, we will delve into each of the areas introduced above.
Proteomics
Description of Technology: The field of molecular therapeutics is a relatively novel approach that targets abnormalities in signaling pathways that play critical roles in tumor development and progression. While the genetic abnormalities of many conditions have been studied intensively, they do not always correlate with the phenotype of the disease. One possible explanation of this phenomenon is the lack of predictable changes in protein expression and function based solely on genetic information. One gene can encode multiple proteins; protein concentration is temporally dynamic; protein compartmentalization is paramount to function; and proteins are post-translationally modified. All of this complexity underscores the importance of studying the proteome. Proteomics is fundamentally the study of proteins and their structure, functional status, localization, and interactions. This has only become possible as our understanding of proteins and their post-translational landscape has grown. Kinases and phosphatases, which control the reversible process of phosphorylation and are dysregulated in many diseases including cancer, have been studied individually for many years. However, only with the application of larger scale technologies can we begin to understand the networks that control cellular phenotypes. Protein and lipid phosphorylation regulates cell survival signaling, and targeting kinases and phosphatases has proven paramount for improving therapeutic intervention in some diseases. In this regard, it is critical to define qualified cellular targets for cancer diagnosis and prognosis, as well as to accurately predict and monitor responsiveness to therapies. Mutation profiling of selected genes or the whole exome can provide insights into possible activated pathways; however, to look at specific end points that are targetable, one must examine the functional units of these mutational events, i.e., the proteins.
Recent Advancements: There are multiple ways to examine events at the level of the protein. These range from Western blot-level technologies, which can examine a few proteins at a time, to mass spectrometry-based (MS-based) shotgun proteomics, which can theoretically measure a very large subset of, if not the entire, proteome. Broadly, most proteomic studies can be broken down into two categories: array-based and direct measurement. Array-based proteomic measurements are typically dependent on an antibody or substrate for a specific protein. Antibody-based proteomics platforms have been examined for the last 40 years and are still yielding exciting results. The most commonly used techniques for multiplexed assays are reverse phase protein arrays (RPPA), multiplexed immunofluorescence, and antibody-based chips/beads. These techniques provide a quantitative assay that analyzes nanoliter amounts of sample for potentially hundreds of proteins simultaneously. These antibody-based assays determine quantitative levels of protein expression, as well as protein modifications such as phosphorylation, cleavage, and fatty acid modification[9][10][11]. Most techniques either array complex protein samples and then probe with specific antibodies (e.g., RPPA), or array antibodies or specific ligands and then probe with a protein mixture. In essence, these assays have major strengths in identifying and validating cellular targets, characterizing signaling pathways and networks, as well as determining on- and off-target activity of novel drugs. One downside to array-based systems is the inherent reliance on quality antibodies or known substrates, which may or may not exist for all proteins of interest in a particular study. However, recent work has demonstrated a tissue-based map of the human proteome utilizing transcriptomic and multiplexed IHC-based techniques[12]. Similarly, The Cancer Protein Atlas (TCPA, http://tcpaportal.org/tcpa/) has examined samples collected during The Cancer Genome Atlas (TCGA) project and annotated selected samples with RPPA results. These initiatives provide a rich source of data at multiple levels, from genes to transcripts to proteins[13].
Direct measurement techniques are based on the identification and quantification of the protein itself, without relying on analyte-based technologies that are solely dependent on the quality of the antibody. Most direct measurement techniques are based on MS approaches. MS-based proteomics techniques can be organized as bottom-up shotgun approaches, which are able to accurately identify multiple proteins from complex mixtures. Complementary methods including stable isotope labeling (SILAC), tandem mass tags (TMT), and isobaric tags (iTRAQ) can be used in tandem with bottom-up approaches to measure relative or absolute concentrations of some or all proteins in complex mixtures. One of the inherent weaknesses of early MS-based approaches was the limited ability for absolute quantification of protein amounts. Given that many signaling events within cells are based upon changes in post-translational modifications with very small changes in total protein concentration, it was of particular interest to develop techniques that allowed for quantification of these changes. Perhaps one of the most exciting recent advances in MS-based proteomics techniques is the application of selected/multiple reaction monitoring (SRM/MRM) to quantify certain peptides of interest[14]. In contrast to the array-based techniques described above, SRM-based methods can accurately measure multiple peptides from a single protein and theoretically measure multiple post-translational modifications simultaneously, independent of reliance on antibodies. In a study of multiple MS-based platforms, strong quantitative correlation to an immunoassay-based platform was observed for SRM using a synthetic peptide internal standard[15]. This contrasted with poor correlation for spectral counting, extracted ion chromatograms (XIC), and non-normalized SRM. The inherent flexibility of the various sectors associated with MS-based assay systems also allows multiple questions to be asked that may not be feasible with antibody-based systems. For example, MS imaging can provide molecular resolution of 2D tissue specimens[16]. This allows not only identification of biomarkers but also mapping of their spatial relationships within samples. This enhanced level of information may be critical for defining pathway interactions or even more accurate molecular diagnostics.
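To make the internal-standard idea concrete, the following minimal sketch (not any specific vendor pipeline; the peak areas and spiked-in amount are hypothetical) estimates the amount of an endogenous peptide from the ratio of its summed SRM transition peak areas to those of a synthetic, heavy-labeled standard of known quantity.

```python
# Minimal sketch of SRM/MRM quantification via a spiked-in synthetic
# (heavy isotope-labeled) internal standard. All numbers are hypothetical.

def srm_quantify(light_peak_areas, heavy_peak_areas, heavy_amount_fmol):
    """Estimate endogenous ('light') peptide amount from transition peak areas.

    light_peak_areas / heavy_peak_areas: summed peak areas for the endogenous
    and internal-standard transitions of the same peptide.
    heavy_amount_fmol: known spiked-in amount of the synthetic standard.
    """
    light = sum(light_peak_areas)
    heavy = sum(heavy_peak_areas)
    if heavy == 0:
        raise ValueError("Internal standard not detected")
    # Assumes equal ionization/fragmentation efficiency for light and heavy
    # forms, so the area ratio approximates the molar ratio.
    return heavy_amount_fmol * light / heavy

# Hypothetical example: three monitored transitions per peptide form.
endogenous = [12500.0, 8300.0, 4100.0]
standard = [25000.0, 16900.0, 8200.0]
print(f"Estimated peptide amount: {srm_quantify(endogenous, standard, 50.0):.1f} fmol")
```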
Clinical Utility: Proteomics has a significant role to play in the translational analysis of patient tissue samples. The US National Institutes of Health (NIH) has recognized the importance of clinical proteomics with the establishment of the Office of Clinical Cancer Proteomics Research (OCCPR). One of the largest working groups established through the OCCPR is the Clinical Proteomic Tumor Analysis Consortium (CPTAC). Vasaikar et al., with support from the CPTAC, recently published a new data portal that links TCGA clinical genomics data with MS-derived proteomics data in a similar fashion to the work performed by the TCPA utilizing RPPA arrays[13][17]. Of note, the CPTAC initiative has produced a number of publications[18][19][20][21][22] and freely available datasets for use by the broader omics community. As an example, Mundt et al. have recently published an MS-based proteomics study on patient-derived xenografts to identify potential mechanisms of intrinsic and adaptive resistance to phosphoinositide 3-kinase inhibitors that will likely have clinical impact in the near future[21]. Much of the clinical utility of proteomics research will be driven by sample availability and quantity with accurate links to clinical data. While RPPA analysis requires very small amounts of sample, there is also a limited proteome sample space to test. MS-based techniques can test a wide sample space of the proteome, but this requires sample amounts that can hinder research. More recent MS techniques that focus on quantitative analysis of a subset of the proteome can be performed on smaller sample amounts and likely represent the future of MS-based clinical tests.
Data Challenges: Proteomics is a relatively new field in the world of large scale omics datasets. Older technologies that are array-based have relatively straightforward datasets. Most RPPA datasets are reported in normalized linear or log2 median-centered scales based on the detection ranges of the specific equipment being utilized. There is minimal data manipulation outside of identifying the linear range of each sample being measured through the use of internal standards and then extrapolating absolute quantification through interpolation of a standard curve[23]. However, with the explosion of MS-based proteomic techniques, cross-platform data analysis and sharing have been associated with their fair share of growing pains. The Human Proteome Organization (HUPO) has taken the lead in defining requirements for proteomic submission and repositories. Tasks that our colleagues in the genomics world have taken for granted over the past 15 years are now being reinvented for the proteomics field. The inherent complexity of proteomics data and the multiple platforms utilized make sharing data a non-trivial affair. There is also an issue of technology outpacing our reporting ability. While peptide and spectral libraries have been and continue to be important for most MS proteomic analysis and deposition (PRIDE and PeptideAtlas being the major resources)[24][25][26], there is also a need for a common library of molecular transitions with the explosion of SRM/MRM techniques. PASSEL has been and continues to be the leading resource for SRM transition datasets[27]. Probably the most important advance in dataset submission and dissemination has been the continued development of the ProteomeXchange (PX) consortium. Beginning in 2011, the PX consortium has continually added members, allowing common data formats for all proteomic datasets. This will allow remarkable opportunities for data reanalysis and reinterpretation that our genomics colleagues have been enjoying for more than 10 years.
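As a small illustration of the RPPA reporting conventions mentioned above, the sketch below (all intensity values, dilution amounts, and the column layout are hypothetical; production RPPA pipelines differ) shows log2 median-centering of sample intensities and interpolation of a dilution standard curve to estimate amounts.

```python
import numpy as np

# Minimal sketch of two steps mentioned above for array-based data:
# (1) log2 median-centering of per-array sample intensities, and
# (2) interpolating a standard (dilution) curve to estimate amounts.
# All values below are hypothetical.

def log2_median_center(intensities):
    """Return log2 intensities centered on the per-array median."""
    log2_vals = np.log2(np.asarray(intensities, dtype=float))
    return log2_vals - np.median(log2_vals)

def interpolate_standard_curve(signal, std_signal, std_amount):
    """Estimate amounts by linear interpolation within the standard curve.

    std_signal / std_amount: measured signal and known amounts of a serially
    diluted standard; signal: sample measurements to convert. Values outside
    the standard range are clipped to its endpoints.
    """
    order = np.argsort(std_signal)
    return np.interp(signal, np.asarray(std_signal)[order],
                     np.asarray(std_amount)[order])

samples = [1800.0, 950.0, 4200.0, 2300.0]
print(log2_median_center(samples))

std_amount = [0.25, 0.5, 1.0, 2.0, 4.0]            # arbitrary units
std_signal = [400.0, 780.0, 1500.0, 2900.0, 5600.0]
print(interpolate_standard_curve(samples, std_signal, std_amount))
```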
Future Directions: As more data from multiple tumor types become available, the ability to link genome to proteome, and ultimately phenotype and treatment choices, becomes less of a holy grail and more a clinical reality. The integrated information will reveal potential therapeutic targets or biomarkers to accurately predict or rapidly define intracellular signaling networks and functional outcomes affected by therapeutics. Clinically, we are starting to see this through large basket-type trials incorporating genomic data matched to targeted drugs, independent of tumor type. The understanding of the proteomic context of a genomic alteration will be key to expanding the repertoire of successful biomarker-driven clinical trials. RPPA- and MS-based phosphoproteome investigation is already being explored in the context of pathway activation and targeted therapies. Similarly, utilizing targeted genomic mutation panels identifies a subset of ovarian cancer patients that may be sensitive to poly ADP ribose polymerase (PARP) inhibition, but incorporating proteomic analysis can also help identify possible responders in genomically unselected populations treated with cytotoxic chemotherapy and/or PARP inhibitors[28][29][30][31].
Metabolomics
Description of Technology: The overall aim in metabolomics studies is to measure levels of small molecules, less than 1,500 Daltons, in a given biospecimen (e.g., blood, tissue, urine, breath). The combination of various extraction (e.g., enrichment of lipids or protein-bound metabolites) and analytical techniques generates metabolic profiles that span many known and unknown metabolic pathways. Such metabolic profiles are a rich resource for defining phenotypes of distinct diseases such as cancer, and they reflect alterations in the genome, epigenome, proteome, and environment (exposures and lifestyle). For this reason, metabolomics is increasingly applied to complement other omics characterization of cells and clinical samples[32][33], and is invaluable for uncovering putative clinical biomarkers, therapeutic targets, and aberrant biological mechanisms and pathways that are associated with cancer[34][35][36][37][38][39][40].
Metabolites can be categorized as endogenous, naturally produced by the host or cells under study, or exogenous, including drugs, foods, and cosmetics among others. While the goal is to measure all metabolites in a given biospecimen, analogous to measuring all gene levels in transcriptomic studies, current analytical acquisition techniques can only capture a fraction of metabolites given one assay or platform[36][41]. For example, as of April 2018, the Human Metabolome Database[41][42][43] contains information on 114,100 metabolites, yet only 22,287 (19.5%) have been detected in human biospecimens. Also, unlike genomics and transcriptomics where one can measure genome-wide features (e.g., expression, variants) with one assay, metabolomics requires multiple analytical techniques and instrumentation for a broad coverage of metabolites (e.g., polar and nonpolar metabolites). In practice, a specific combination of sample preparation (e.g., enrichment of nonpolar metabolites) and analytical technique is often optimized for a certain class of compounds (e.g., lipids)[36].
The two main analytical approaches for measuring metabolites are NMR and MS[44][45][46][47]. Abundance detection by MS is typically preceded by a molecule separation technique such as liquid (LC) or gas (GC) chromatography. While NMR is considered the gold standard for compound identification (when analyzing singular compounds in pure form) and produces quantitative measures, MS-based methods are more sensitive (e.g., able to detect low abundance metabolites) and detect more metabolites (e.g., several hundreds to thousands)[48].
Of note, metabolomics studies can be classified as targeted[49] or untargeted[50]. In targeted studies, a small (∼1-150) panel of metabolites with known chemical characteristics and annotations is measured, and the sample preparation and analytical platform used are optimized to minimize experimental artifacts. Examples of artifacts are fragmentation and adduct formation (e.g., addition of sodium or hydrogen ions) in electrospray ionization[51]. Measurements can be performed using standards and thus produce quantitative or semi-quantitative measurements. In contrast, untargeted metabolomics aims to detect all possible metabolites given a biospecimen. Untargeted approaches yield relative measurements of thousands of signals that represent known metabolites, experimental artifacts (e.g., adducts), or unidentified metabolites[52]. While many more metabolites can be captured with untargeted approaches, it is very challenging to annotate signals and identify metabolites[51]. Verification of metabolite identity requires prediction of elemental composition from accurate masses and, eventually, further experimentation (NMR being the gold standard) that requires the use of a purified standard for the metabolite of interest[45][53][54][55][56]. If a purified standard is not commercially available, one must be synthesized in-house, and thus this validation process can take several years. Ultimately, a targeted approach is favorable if there is a priori knowledge of the biological system or disease under study because measurements are quantitative and the data quality is high[52]. However, despite the high level of noise and the increased complexity in data analysis, untargeted approaches are favorable for discovering novel biomarkers or generating data-driven hypotheses[52].
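As a small illustration of the annotation problem just described, the sketch below (candidate neutral masses and the observed peak are hypothetical) computes m/z values for two common positive-mode adducts and flags candidates whose predicted m/z falls within a ppm tolerance of an observed untargeted feature.

```python
# Minimal sketch of accurate-mass annotation for an untargeted feature:
# predict m/z for common positive-mode adducts of candidate metabolites and
# keep candidates within a ppm tolerance. Candidate masses and the observed
# peak below are hypothetical.

PROTON = 1.007276   # proton mass (Da)
SODIUM = 22.989218  # Na+ adduct mass shift (Da): sodium minus an electron

ADDUCTS = {
    "[M+H]+":  lambda m: m + PROTON,
    "[M+Na]+": lambda m: m + SODIUM,
}

def annotate(observed_mz, candidates, ppm_tol=10.0):
    """Return (name, adduct, predicted m/z, ppm error) for matching candidates."""
    hits = []
    for name, neutral_mass in candidates.items():
        for adduct, fn in ADDUCTS.items():
            predicted = fn(neutral_mass)
            ppm_error = (observed_mz - predicted) / predicted * 1e6
            if abs(ppm_error) <= ppm_tol:
                hits.append((name, adduct, round(predicted, 5), round(ppm_error, 2)))
    return hits

# Hypothetical candidate neutral monoisotopic masses (Da).
candidates = {"glucose": 180.06339, "fructose": 180.06339, "citrate": 192.02700}
print(annotate(203.05261, candidates))  # matches the [M+Na]+ adduct of a hexose
```

Note that accurate mass alone cannot distinguish isomers (here, glucose versus fructose), which is one reason further experimentation against purified standards remains the gold standard for identification.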
Recent Advancements: As metabolomics strategies are increasingly being applied in biomedical research, advances in automation and improved quantification of NMR- and MS-based methods are producing high throughput, reproducible data[44][57][58]. Integration of NMR and LC-MS techniques is increasingly applied to enhance reproducibility and metabolite identification, and to ensure measurement integrity[59]. Such improvements in data acquisition techniques are critical for expanding the coverage of metabolites that can be reliably measured. At the same time, these advances are producing larger datasets, requiring the construction of databases and the development of data analysis methods, tools, and pipelines. Currently, the two major sources of publicly available data are the Metabolomics Workbench[60] and MetaboLights[61]. The Metabolomics Workbench, sponsored by the NIH Common Fund, also provides access to analytical protocols (e.g., sample preparation and analysis), metabolite standards, computational tools, and training. While metabolomics data are very informative, for example for uncovering putative clinical biomarkers, understanding how metabolites are produced and what their functions are further deepens our understanding of disease phenotypes and mechanisms. In turn, this mechanistic understanding can guide the search for putative drug targets. With this in mind, integration of metabolomics data with other omics datasets, including genome, proteome, and microbiome, is increasingly performed[62][63]. Integration of omics datasets includes numerical integration techniques such as canonical correlations or multivariate modeling, and network/pathway-based approaches[64][65][66][67][68][69]. Furthermore, open-source, user-friendly software for metabolomics analysis and interpretation through pathway analysis has been critical for guiding analysis and interpretation of the data. Examples include XCMS[70][71], MetaboAnalyst[72][73], and Metabox[74].
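As a toy illustration of the numerical integration approaches mentioned above (canonical correlations across omics blocks), the sketch below fits a two-component canonical correlation analysis between simulated metabolite and protein matrices for the same samples using scikit-learn; a real analysis would use measured abundances and careful normalization.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Toy sketch of numerical omics integration via canonical correlation
# analysis (CCA): find paired linear combinations of metabolite and protein
# features that co-vary across the same samples. Data are simulated.

rng = np.random.default_rng(0)
n_samples = 40
latent = rng.normal(size=(n_samples, 2))  # shared structure across both blocks
metabolites = latent @ rng.normal(size=(2, 30)) + 0.5 * rng.normal(size=(n_samples, 30))
proteins = latent @ rng.normal(size=(2, 50)) + 0.5 * rng.normal(size=(n_samples, 50))

cca = CCA(n_components=2)
met_scores, prot_scores = cca.fit_transform(metabolites, proteins)

# Correlation of the paired canonical variates (one value per component).
for k in range(2):
    r = np.corrcoef(met_scores[:, k], prot_scores[:, k])[0, 1]
    print(f"Canonical component {k + 1}: r = {r:.2f}")
```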
Data Challenges: Unlike genomics, a reference metabolome does not exist, and it is currently impossible to measure all metabolites in a given biospecimen. This lack of a reference causes many data analysis issues, particularly for untargeted metabolomics studies where the identification of metabolites is difficult to pin down[75]. The field also suffers from multiple metabolite naming conventions. In fact, different naming conventions are more appropriate for certain types of data acquisition techniques. For example, while untargeted metabolomics approaches cannot resolve differing stereochemistry or double bond position/geometry, other approaches can identify metabolites with more or less granularity[60]. Translation services, including RefMet[60] and the Chemical Translation Service[76], help in that regard. Also, the multitude of data acquisition techniques makes it difficult to organize the data in a standardized fashion[77]; instrumentation vendors have specific data formats that are tied to proprietary software, and conversion of these file formats to open-source formats can require specific operating systems or software licenses. Differences in how the data were generated also make it difficult to compare results across studies. With many missing identities and different resolutions of identification, it is difficult to map a metabolite from one study to another. Standardization is thus critical for handling such challenges but is still in its nascence[60][77]. Standard protocols for downstream data analyses, including quality control, transformation/normalization, and differential analysis, are also difficult to establish, namely due to differences in experimental study design and data acquisition. Although publicly available tools and software aim to provide standard approaches[70][71][72][73][74][78], detailed descriptions of the parameters (e.g., mass divided by charge number [m/z] range allowed for binning features) and cutoffs used are often lacking in published work, making reproducibility of results difficult.
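To illustrate why such parameters matter, the sketch below (the ppm tolerance and peak list are hypothetical) groups untargeted feature m/z values into bins using a fixed ppm tolerance; established pipelines such as XCMS additionally use retention time and peak shape, so this is only a caricature of one step.

```python
# Minimal sketch of one preprocessing parameter discussed above: grouping
# untargeted feature m/z values into bins with a fixed ppm tolerance so that
# the same putative feature can be aligned across samples. Real pipelines
# also use retention time and peak shape; values below are hypothetical.

def bin_mz(mz_values, ppm_tol=10.0):
    """Group sorted m/z values; a new bin starts when the gap exceeds ppm_tol."""
    bins = []
    for mz in sorted(mz_values):
        if bins and (mz - bins[-1][-1]) / bins[-1][-1] * 1e6 <= ppm_tol:
            bins[-1].append(mz)
        else:
            bins.append([mz])
    return bins

peaks = [203.0525, 203.0526, 203.0530, 255.2319, 255.2322, 301.1410]
for b in bin_mz(peaks):
    print(b)
```

Changing the ppm tolerance changes how many features are reported, which is exactly why undocumented cutoffs hinder reproducibility.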
Clinical Utility: Metabolomics plays an increasing role in clinical and translational research as large initiatives such as the Consortium of Metabolomics Studies (https://epi.grants.cancer.gov/comets/) and the NIH Common Fund's Metabolomics Program (https://commonfund.nih.gov/metabolomics) are generating large-cohort metabolomics datasets (>1,000 participants). Because metabolomics profiles help define disease phenotypes and reflect alterations in the genome, epigenome, proteome, and environment (exposures and lifestyle), metabolites are ideal candidates for biomarker discovery in many diseases including cancer[37][38][39][79][80][81]. With this in mind, metabolomics is playing a larger role in precision medicine, requiring continued efforts in data acquisition and analysis[82]. Metabolomics is also increasingly integrated with other omics information and is analyzed in the context of biological pathways and networks, with the aim of identifying mechanisms that underlie diseases and finding novel therapeutic targets[34].
Future Directions: In October 2017, the NIH Common Fund released funding opportunities to promote efforts in public accessibility and reuse of metabolomics data, development of computational tools for analysis (including omics integration) and interpretation of metabolomics data, and development of approaches to identify unknown metabolites. Thus, we anticipate further development of open-source, publicly available computational tools and infrastructures to facilitate metabolomics analysis. Since metabolomics is increasingly applied to biospecimens from large (>1,000) cohorts and consortia, it is now possible to integrate other omics data, as well as clinical and environmental contexts, in the analyses. The complexity of harmonizing data across cohorts and incorporating clinical and environmental data necessitates further standardization and computational infrastructure. Of special interest, the impact of alterations in the microbiome (dysbiosis) on metabolic pathways is particularly relevant, since these dysbiosis-metabolome relationships can be causative or indicative of a myriad of human diseases[83][84], including obesity and diabetes[85][86][87], cardiovascular diseases[88][89][90], inflammatory diseases[91], and cancer[64][92][93]. We thus expect an increase in multi-tiered studies that apply a holistic approach to understanding diseases, including integration of molecular information from host and environment. Concurrently, as pathway information and identification of metabolites increase, strategies that take into account the kinetics of metabolites (e.g., metabolic flux and networks) will become more and more applicable to clinical metabolomics studies. Lastly, while the classical view of the molecular dogma is that metabolite levels are modulated by the epigenome, genome, and proteome, there have also been examples where metabolites regulate epigenetic events (i.e., going against the grain of the molecular dogma direction)[94][95][96]. The future of metabolomics and its potential for uncovering biomarkers and deciphering mechanisms will surely necessitate modeling of complex bi-directional relationships within omics and environmental context information.
Cell-free DNA
Description of Technology: Both normal and malignant cells shed DNA into the circulation, and next-generation sequencing technologies are capable of detecting small amounts of cfDNA, making the blood a potential repository for tumor genomic profiling. ‘Liquid biopsy,’ once validated, could enable the detection of cancer as a screening tool, track evidence for residual disease after cancer treatment, monitor patients for response to therapy, and discover meaningful mechanisms of resistance to cancer therapies. With this wealth of previously unavailable information, liquid biopsy could lead to the development of new assays, biomarkers, and targeted treatments to help cancer patients live longer, better lives. It is important to note that cell-free/circulating tumor DNA is only one aspect of ‘liquid biopsies,’ and there are multiple advances with other assays outside of the scope of this review, including circulating tumor cells[97][98][99][100], other nucleic acids[101], exosomes and other extracellular vesicles[102][103][104], and integrated biomarkers[105].
The presence of cell-free nucleic acids in the blood was first described in 1947 by Mandel and Metais[106], and three decades later, Leon et al. demonstrated that cancer patients had greater amounts of cfDNA relative to healthy controls[107]. Stroun et al. demonstrated both that tumor DNA was detectable specifically in plasma[108][109] and that specific genomic alterations could be identified[110]. Of note, cfDNA is distinct from and not derived from circulating tumor cells, although the two are correlated and both are increased in patients with advanced cancer[111]. Major advancements in cfDNA were first made in the field of perinatology, leading to the early, minimally invasive detection of fetal chromosomal anomalies from maternal plasma in widespread clinical use today[112]. The remarkable advances in sequencing technology over the past two decades, from Sanger sequencing to allele-specific PCR to the advent of massively parallel sequencing ('next generation sequencing')[113][114], along with advances in bioinformatics analysis[115] and rapid reductions in cost, have facilitated an increasing ability to interrogate cfDNA to profile tumors.
Recent Advancements: To date, most clinical applications of cfDNA sequencing have focused on tracking specific mutations[111][116][117][118][119][120][121][122] or sequencing targeted panels of cancer-related genes[123][124][125][126][127], particularly in metastatic cancer. In general, cfDNA is present in a greater proportion of patients and in larger amounts in metastatic cancers relative to primary tumors. In the metastatic setting, particularly in cancer types that are in many cases inaccessible (e.g., lung primary or metastases) or are higher-risk lesions to sample in terms of potential complications, cfDNA genomic approaches may offer potential benefits relative to tumor biopsy. Tumors are known to be heterogeneous and biopsies inherently only sample a small localized region of a single metastatic site[128], introducing potential bias that may be overcome by cfDNA as a ‘sink’ of all metastatic sites in a patient[129]. Taking a patient-centered approach in the metastatic setting is critical - avoiding painful and inconvenient biopsies has the potential to improve quality of life. In one study, 34% of breast cancer patients undergoing metastatic biopsy described anxiety pre-biopsy and 59% described post-biopsy pain[130].
Clinical Utility: The only FDA-approved ‘liquid biopsy’ companion diagnostic to date is the cobas® EGFR Mutation Test v2 for the detection of exon 19 deletions or exon 21 (p.L858R) substitution mutations in the epidermal growth factor receptor (EGFR) gene to identify patients with metastatic non-small cell lung cancer eligible for treatment with erlotinib[131]. However, in cancers harboring mutations that are known to be prognostic or predictive, plasma-based cfDNA assays have demonstrated utility in disease management and are increasingly used clinically[132][133][134]. In addition, cfDNA targeted panel sequencing assays of cancer-related genes are used in lieu of metastatic tumor biopsy sequencing in clinical practice, including commercial tests such as Guardant360® and FoundationACT®[126][135]. In the clinical setting, genomic profiling via cfDNA has been associated with more rapid turnaround of genomic results than tissue biopsies, frequently due to delays in accessing or obtaining tissue[136]. In the non-metastatic setting, there is great interest and excitement around the potential to develop patient tumor-specific panels of mutations for the highly sensitive detection of minimal residual disease after initial cancer treatment[137][138]. In addition, multiple groups and commercial ventures are investigating whether cfDNA could be used as a novel screening approach for cancer diagnosis[139], including the STRIVE Breast Cancer Screening Cohort (NCT03085888) supported by Grail, Inc. However, the optimal technical approach for cfDNA as a detection methodology remains unclear, and large studies to assess sensitivity and specificity are only recently underway. Another approach is to incorporate cfDNA into a multi-analyte assay for cancer screening, such as CancerSEEK[105]. The CancerSEEK assay integrates a cfDNA PCR-based assay for a panel of common cancer mutations with established circulating protein biomarkers.
The promise of cfDNA is immense, yet there remain several key unresolved challenges, including how well tumor-derived cfDNA mirrors tissue-derived tumor DNA, how to analyze the tumor-normal DNA admixture present in circulation, how to better assess the tumor-derived fraction of cfDNA, and how to account for clonal hematopoiesis of indeterminate potential (CHIP). While cfDNA appears to demonstrate overall high concordance with tumor biopsies[140][141][142], it is unclear whether cfDNA can serve as a comprehensive proxy for tumor biopsy in all contexts. Further, assays vary in their detection and reporting of genomic alterations from plasma[143].
Data Challenges: Circulating DNA in plasma is an admixture of normal DNA, shed primarily from leukocytes, and tumor DNA, which presents challenges for the analysis and interpretation of sequencing data. In the context of a large amount of tumor-derived DNA in the circulation (high ‘tumor fraction’), for example a tumor fraction greater than 10%, standard next-generation sequencing approaches may be applied. However, in many contexts tumor fraction is very low, particularly at diagnosis, in the setting of minimal residual disease, or in some ‘low cfDNA shedding’ cancer types and patient tumors. Highly specific assays may detect tumor fractions as low as 0.02% for panel sequencing and 0.00025% using approaches such as droplet digital PCR for specific known alterations[138]. A major remaining challenge is to understand the sensitivity of assays for mutation detection to ensure that a negative test truly reflects the absence of tumor-derived DNA and not a limitation of the assay or bioinformatic approaches.
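The sensitivity question above is, at its core, a sampling problem: at a given tumor fraction and depth, only a handful of fragments covering a locus will be tumor-derived. The back-of-the-envelope sketch below (illustrative numbers only; it ignores sequencing error, unique-molecule deduplication, and error-suppression chemistry) models the expected number of mutant reads at a heterozygous tumor mutation and the probability of observing at least one.

```python
# Back-of-the-envelope model for cfDNA mutation detection: at tumor fraction
# f and depth N over a heterozygous tumor mutation, each read is mutant with
# probability ~ f/2. This ignores sequencing error and molecule-level
# deduplication, so it is an optimistic upper bound, not an assay spec.

def expected_mutant_reads(tumor_fraction, depth):
    return depth * tumor_fraction / 2.0

def prob_at_least_one_mutant_read(tumor_fraction, depth):
    p = tumor_fraction / 2.0
    return 1.0 - (1.0 - p) ** depth

for tf in (0.10, 0.01, 0.0002):           # 10%, 1%, 0.02% tumor fraction
    for depth in (500, 5000, 30000):
        mu = expected_mutant_reads(tf, depth)
        p_detect = prob_at_least_one_mutant_read(tf, depth)
        print(f"tf={tf:.4%} depth={depth:>6}: "
              f"expected mutant reads={mu:7.2f}, P(>=1)={p_detect:.3f}")
```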
As cfDNA assays seek to expand the breadth of sequencing (e.g., whole genome sequencing), efficient and cost-effective methods to screen blood samples for adequate amounts of tumor-derived DNA will be critical. Although sequencing costs continue to decline, identifying samples unlikely to provide usable sequence data should improve efficiencies. Most assays that determine tumor fraction depend on prior knowledge of tumor-specific mutations. Recent efforts suggest that low-coverage (approximately 0.1X) whole genome sequencing of cfDNA may offer the ability to quantify tumor fraction without the need for prior knowledge of tumor mutations[140].
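As a simplified illustration of how copy number signal relates to tumor fraction without prior mutation knowledge, the sketch below inverts a standard two-population mixing model for a single clonal copy number segment; dedicated low-coverage WGS tools estimate this jointly across the genome, and the log-ratio values here are hypothetical.

```python
# Simplified two-population mixing model relating an observed log2 copy
# ratio at a clonal segment to tumor fraction (tf):
#   observed copies = tf * C_tumor + (1 - tf) * 2   (normal cells diploid)
#   log2_ratio      = log2(observed copies / 2)
# Real tools estimate this genome-wide from low-coverage data; the segment
# values below are hypothetical.

def tumor_fraction_from_log2_ratio(log2_ratio, tumor_copies):
    if tumor_copies == 2:
        raise ValueError("A copy-neutral segment carries no mixing signal")
    observed_copies = 2.0 * (2.0 ** log2_ratio)
    tf = (observed_copies - 2.0) / (tumor_copies - 2.0)
    return min(max(tf, 0.0), 1.0)   # clamp to the valid range

# Hypothetical clonal single-copy loss (1 copy) and single-copy gain (3 copies).
print(tumor_fraction_from_log2_ratio(-0.15, tumor_copies=1))  # ~0.20
print(tumor_fraction_from_log2_ratio(0.13, tumor_copies=3))   # ~0.19
```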
Another challenge involves deconvolution of genomic alterations present in leukocytes as a consequence of CHIP from tumor-specific alterations[144][145]. CHIP is the expansion of a clonal hematopoietic progenitor identified through common genomic alterations, present at increasing frequency as individuals age. Typically, the ‘normal’ DNA used to distinguish germline from somatic tumor mutations is derived from peripheral blood cells, and the frequency of CHIP - potentially more than 10% of patients over the age of 65 - suggests that methods to identify and account for CHIP will be critical.
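One common mitigation, hinted at above, is to sequence matched white blood cell DNA and subtract variants seen there from the plasma call set; the minimal sketch below (all variant calls and coordinates are hypothetical) illustrates that subtraction at the level of simple variant keys.

```python
# Minimal sketch of CHIP filtering: variants detected in matched white blood
# cell (buffy coat) DNA are treated as hematopoietic in origin and removed
# from the plasma cfDNA call set. Real pipelines also weigh variant allele
# fractions, germline databases, and CHIP-associated genes (e.g., DNMT3A,
# TET2, ASXL1). All calls and coordinates below are hypothetical.

def filter_chip(plasma_calls, wbc_calls):
    """Keep plasma variants absent from the matched WBC call set."""
    wbc_keys = {(v["chrom"], v["pos"], v["ref"], v["alt"]) for v in wbc_calls}
    return [v for v in plasma_calls
            if (v["chrom"], v["pos"], v["ref"], v["alt"]) not in wbc_keys]

plasma = [
    {"chrom": "7", "pos": 55191822, "ref": "T", "alt": "G", "gene": "EGFR"},
    {"chrom": "2", "pos": 25234373, "ref": "C", "alt": "T", "gene": "DNMT3A"},
]
wbc = [
    {"chrom": "2", "pos": 25234373, "ref": "C", "alt": "T", "gene": "DNMT3A"},
]

for v in filter_chip(plasma, wbc):
    print(f"Likely tumor-derived: {v['gene']} {v['chrom']}:{v['pos']} {v['ref']}>{v['alt']}")
```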
Although most efforts to date have focused on tracking specific alterations known to be present in a tumor biopsy or sequencing targeted panels of cancer-related genes, there is growing evidence that cfDNA offers the potential to obtain exome- and genome-level tumor sequencing data. Work from several groups has demonstrated the feasibility of genome-wide copy number analysis in cancer patients from plasma via shallow or low-coverage sequencing of cfDNA[140][146][147][148][149]. Further efforts in this regard demonstrate the feasibility of exome sequencing of cfDNA in the context of adequate tumor fraction[140][141][142][146]. Comprehensive profiling is useful, particularly as blood can readily be collected serially, enabling tracking of the evolution of resistance while patients are on therapy. As we gain a greater understanding of the importance of non-driver mutations and regulatory elements in carcinogenesis and cancer progression, more comprehensive tumor genomic profiling from blood offers the potential for discovery in addition to detection, response tracking, or biomarker identification. In addition, more sensitive methods of detecting and isolating tumor-derived DNA or alterations from plasma may improve assay sensitivity[150][151].
Future Directions: cfDNA is increasingly prevalent in oncology practice, from the first FDA-approved cfDNA biomarker to commercial cfDNA targeted panel sequencing assays. However, a recent American Society of Clinical Oncology (ASCO) and College of American Pathologists joint review reinforced that widespread use in clinical practice is not yet recommended until there is evidence of clinical validity and utility[152]. Despite this, there is growing evidence that personalized, highly sensitive mutation-based assays will be feasible for the assessment of minimal residual disease and potentially for tracking early recurrence. These advances may translate to cfDNA assays that could be used for screening and early primary detection as well, but such uses require clinical validation first. Finally, technological and computational advances are facilitating comprehensive genomic profiling exclusively from plasma. There remains the hope that new minimally invasive ‘liquid biopsy’ assays could improve outcomes by identifying cancer earlier and more specifically while also facilitating a greater understanding of novel susceptibilities and targets.
Cohort Matching Algorithms
Description of Technology: Traditional biomarker analysis focuses on trying to figure out what distinguishes one patient from another. Broadly speaking, cohort-matching algorithms are centered either on similar features or on similar outcomes. Using feature selection methods, biomarkers with the strongest association to the feature of interest are identified and then validated in an independent test set. These biomarker selection processes universally assume that there is a global ground truth regarding the biomarker-phenotype relationship that is stable across multiple settings[153]. Unfortunately, this biomarker selection paradigm results in a tendency to divide patients into increasingly small subsets that may have no clinical relevance. Moreover, this fragmentation of previously “common” diseases results in a collection of “rare” subtypes that are then progressively challenging to study[154][155], as there are an endless number of biomarker-subtype-therapy combinations. An alternative to this biomarker proliferation is the idea of binning patients together based on potential outcome similarity - pattern-matching at a patient level. In other words, rather than focus on how patients are dissimilar, focus on how sets of patients respond similarly to a medication. One can then leverage omics/phenomics comparisons at a patient level through more holistic pattern matching. This allows any number of omics technologies to define a patient-patient similarity strength.
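To make the notion of a patient-patient similarity strength concrete, the sketch below (the alteration profiles are hypothetical) scores similarity between patients as the Jaccard overlap of their sets of molecular alterations; in practice, any omics layer, or a weighted combination of layers, could feed such a score.

```python
# Minimal sketch of a patient-patient similarity strength computed as the
# Jaccard overlap of molecular alteration sets. The profiles below are
# hypothetical; any omics layer could feed such a score.

def jaccard(a, b):
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

patients = {
    "P1": {"KRAS_G12D", "TP53_mut", "CDKN2A_del"},
    "P2": {"KRAS_G12V", "TP53_mut", "SMAD4_del"},
    "P3": {"EGFR_L858R", "TP53_mut"},
}

query = "P1"
ranked = sorted(((other, jaccard(patients[query], alts))
                 for other, alts in patients.items() if other != query),
                key=lambda pair: pair[1], reverse=True)
for other, score in ranked:
    print(f"{query} vs {other}: Jaccard = {score:.2f}")
```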
Recent Advances: There is currently no standard means of patient matching using omics data; rather, there is an assortment of heuristics and cohort matching metrics[156][157][158]. Feature matching algorithms assume that retained features are critical determinants of outcomes such as survival and are optimal for situations where the biomarker is directly linked to the outcome. A straightforward approach to feature matching is to assign matches based on exact feature overlap - for two patients to be a match they must share all features. Foundation Medicine's PatientMatch tool[159] is an example of this exact matching approach. More complex feature matching schemes have also been developed using Bayesian approaches[160]. Other feature matching algorithms include the PHenotypic Interpretation of Variants in Exomes (PHIVE), which matches human phenotypic profiles to mouse model phenotypes to prioritize variants found in whole exome sequencing[157], and DECIPHER[161], which enables international querying of karyotype, genetic, and phenotypic information for matches. In contrast to feature matching, the outcome-matching approach allows features to be weighted based on their discriminatory power. Frequently used algorithms are weighted K-nearest neighbor, random forest techniques, and deep learning (e.g., artificial neural networks)[162][163][164]. Outcomes matching attempts to match patients with other patients who may have a similar outcome to the same therapy based on phenotypic and omic predictors. A patient's features could potentially be compared not just from patient to patient (e.g., Patients Like Me) to infer outcomes, but also from patient to cell lines (e.g., the Connectivity Map project[165]) and from a patient's electronic health record (EHR) to other patients' EHRs[166].
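The sketch below contrasts the two families just described using toy data: an exact-overlap feature match (in the spirit of, but not the implementation of, exact matching tools) and an outcome-oriented weighted k-nearest-neighbor prediction over feature vectors; all features, weights, and outcomes are hypothetical.

```python
import numpy as np

# Toy contrast of the two matching families described above.
# (1) Exact feature matching: two patients match only if their feature sets
#     are identical.
# (2) Outcome matching: features are weighted by assumed discriminatory
#     power and a weighted k-nearest-neighbor vote predicts response.
# All features, weights, and outcomes below are hypothetical.

def exact_match(features_a, features_b):
    return set(features_a) == set(features_b)

def weighted_knn_response(query, cohort_vectors, cohort_outcomes, weights, k=3):
    """Predict probability of response from the k closest weighted profiles."""
    q = np.asarray(query) * weights
    dists = [np.linalg.norm(q - np.asarray(v) * weights) for v in cohort_vectors]
    nearest = np.argsort(dists)[:k]
    return float(np.mean([cohort_outcomes[i] for i in nearest]))

print(exact_match({"KRAS_G12D", "TP53_mut"}, {"KRAS_G12D", "TP53_mut"}))  # True

weights = np.array([2.0, 1.0, 0.5])           # assumed feature importances
cohort = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [1, 0, 0]]
responded = [1, 1, 0, 1]                      # 1 = responded to therapy
print(weighted_knn_response([1, 0, 1], cohort, responded, weights, k=3))
```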
Clinical Utility: The landscape today is dominated by feature matching strategies. These have been applied to clinical trial recruitment, most notably for national endeavors such as the NCI-MATCH trial. Much of the feature matching work today has focused on improving clinical trial accrual by prompting physicians to generate referrals[167]. Although seemingly simple, clinical trial matching algorithms have shown up to a 90% reduction in time spent identifying trial candidates[168]. GeneYenta matches phenotypically similar patients with regard to rare diseases[169] by weighting predictive features. Algorithms have been written to evaluate single nucleotide variant (SNV) frequencies between patients and non-small cell lung cancer cell lines to predict chemotherapeutic response[156]. Startup efforts such as MatchTX (http://match-tx.com) are attempting to reimagine social networking tools to help clinicians find the best patient matches. Although the data sources, data types, and methods are heterogeneous, matching techniques at their core employ heuristic approaches to discover and vet the best profiles from large clinical databases.
Data Challenges: Cohort matching algorithms need to be capable of subsuming disparate data types and methods of comparison. Unfortunately, the data types used in the matching process are varied and can be subjective or objective phenotypic measurements. Definitions of pathogenicity[170] remain a huge problem, as do incomplete datasets and datasets lacking standardized ontologies. Preprocessing steps will need to be developed to organize the data into viable features to be used by matching algorithms. A further complication is the possibility that predictive models may require subsuming disparate, unstandardized data types simultaneously[63][171]. EHR and omics interoperability remains a primary impediment to more robust algorithm generation. This will require concerted standardization among data sets, including vocabulary mapping and normalization.
Future Directions: As interoperability is a key impediment to the omics revolution, this has spurred efforts such as the Genomic Data Commons[172], which aims to “provide a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine.” Other consortia efforts such as the Global Alliance for Genomics and Health (GA4GH)[173] and Health Level Seven International's Fast Healthcare Interoperability Resources (FHIR)[174] are enabling the development of application programming interfaces (APIs) and standards convergence. For example, the GA4GH Beacon Project allows federated queries to detect the existence of specific genomic variants across a variety of genome resources. Coalescing large datasets such that meaningful matching can occur has also been a thrust of recent developments. ASCO has built a learning system called CancerLinQ[175] to help facilitate the integration of data from multiple participating community oncology practice sites in an attempt to standardize data, facilitate research, and provide personalized cancer care through patient matching. Academic and selected larger oncology groups are participating in consortia such as ORIEN[173], GENIE[176], and the International Cancer Genome Consortium (ICGC)[177] and are building their respective frameworks for identifying patient cohorts. The “Sync for Science”[178] endeavor, sponsored by the NIH and the Office of the National Coordinator for Health Information Technology, will permit patients to directly donate their data to support innovative match-based algorithms for predictive purposes, thus contributing to precision medicine research. Sync for Science is also an integral part of the patient engagement portion of the NIH ‘All of Us’ initiative (https://allofus.nih.gov). Enhancing and perhaps complicating the field further, individual hospital systems such as the Swedish Cancer Institute and the Henry Ford Hospital system are developing their own precision medicine repositories. Commercial pathology laboratories - such as Caris and Foundation Medicine - have their internal datasets to mine. Other efforts like Syapse's Open Precision Oncology Network[179] allow aggregated cancer genomics data to be pulled from all participating health systems. These consortia and businesses all rely on patient matching as part of their core strengths.
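As an illustration of the kind of federated variant-existence query the GA4GH Beacon Project enables, the sketch below issues a Beacon-style GET request; the endpoint URL is a placeholder and the parameter names follow the v1 style of the protocol, so the target beacon's documentation should be consulted before relying on them.

```python
import requests

# Illustrative Beacon-style federated query: "does any dataset behind this
# beacon contain this exact variant?" The URL below is a placeholder and the
# parameters follow the Beacon v1 style; check the specific beacon's
# documentation for its actual endpoint and required fields.

BEACON_URL = "https://beacon.example.org/query"   # placeholder endpoint

params = {
    "assemblyId": "GRCh38",
    "referenceName": "7",
    "start": 55191822,          # 0-based position (hypothetical variant)
    "referenceBases": "T",
    "alternateBases": "G",
}

response = requests.get(BEACON_URL, params=params, timeout=30)
response.raise_for_status()
print(response.json().get("exists"))   # True if any dataset contains the variant
```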
Conclusion
The sequencing of the genome has ushered in a new era of personalized cancer informatics. But the DNA genome is simply a first layer in a complex biological environment onto which many other omics data can be overlaid. We are in a time of growth. Metabolomics and proteomics are driving us closer to the tumor phenotype and, importantly, its response to treatment in real time. ctDNA/cfDNA may help us understand clonal tumor evolution non-invasively within the patient. These new omics datatypes will almost certainly help us tailor and adjust therapy for oncology patients. With these new datatypes, and the understanding that data must be centralized, we are witnessing, too, an explosion of clinical/omics datasets aggregated by consortia and industry partners. As these datasets grow, so too will the need for more sophisticated cohort matching algorithms that bring clarity and useful, actionable insights. These are exciting times. The cancer omics revolution continues to march forth rapidly and will hopefully continue to improve our ability to practice precision oncology.