3 Inference About Discoveries Based on Integration of Diverse Data Sets
Pages 13-29

The Chapter Skim interface presents what we've algorithmically identified as the most significant single chunk of text within each page of the chapter.


From page 13...
... Last, Jeffrey Morris (MD Anderson Cancer Center) discussed incorporating biological knowledge into statistical model development to reduce the parameter space and increase the biological coherence of results.
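
To make the idea concrete, here is a minimal sketch of knowledge-based filtering before model fitting; the gene names, the 40-gene "pathway," and the LassoCV model are illustrative stand-ins, not details from Morris's talk:

```python
# Sketch: shrinking the parameter space with prior biological knowledge.
# The "pathway" gene list is a hypothetical stand-in for a curated
# resource (e.g., an MSigDB gene set); it is not from the presentation.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
all_genes = [f"g{i}" for i in range(5000)]         # full feature space
pathway_genes = set(all_genes[:40])                # assumed prior knowledge

X = rng.normal(size=(120, len(all_genes)))         # toy expression matrix
y = X[:, :5].sum(axis=1) + rng.normal(size=120)    # toy clinical outcome

# Keep only features supported by biological knowledge before modeling,
# reducing p from 5,000 to 40 and making the fit more coherent.
keep = [j for j, g in enumerate(all_genes) if g in pathway_genes]
model = LassoCV(cv=5).fit(X[:, keep], y)
print("nonzero coefficients:", np.count_nonzero(model.coef_))
```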
From page 14...
... Hero described the potential benefits of integrating diverse data sets, including the development of better predictors and better descriptive models. However, realizing these benefits is difficult because assessment of bias and replicability is challenging, especially in high-dimensional cases, and may require more sophisticated methods.
From page 15...
... There are also practical challenges for using big data, such as the tremendous increase in the amount of information stored in the cloud, said Hero.
From page 16...
... The novel aspect of this method, Hero explained, is that the positive sum-to-one constraint in the factor model avoids known problems of masking and interference faced by principal component analysis. This factor analysis method was more effective for predicting onset of symptoms than other methods in the literature and was validated with additional data sets (Huang et al., 2011; Bazot et al., 2013).
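
The constraint can be emulated with a generic projected alternating scheme. The sketch below enforces positive, sum-to-one factor scores by projecting them onto the probability simplex after each gradient step; it is a toy illustration of the constraint, not the estimator published in Huang et al. (2011) or Bazot et al. (2013):

```python
# Toy factor model with nonnegative, sum-to-one factor scores, in the
# spirit of the constraint Hero described; all data here are synthetic.
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (Duchi et al., 2008)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1)[0][-1]
    theta = (css[rho] - 1) / (rho + 1)
    return np.maximum(v - theta, 0)

rng = np.random.default_rng(1)
X = np.abs(rng.normal(size=(100, 30)))    # samples x features, toy data
k = 4                                     # number of factors
S = rng.dirichlet(np.ones(k), size=100)   # scores start on the simplex
F = np.abs(rng.normal(size=(k, 30)))      # factor loadings

for _ in range(200):
    # Update loadings by least squares, clipped to stay nonnegative.
    F = np.maximum(np.linalg.lstsq(S, X, rcond=None)[0], 0)
    # Gradient step on the scores, then project each row back onto the
    # simplex so the scores remain positive and sum to one.
    grad = (S @ F - X) @ F.T
    step = 1.0 / (np.linalg.norm(F, 2) ** 2 + 1e-9)
    S = np.apply_along_axis(project_simplex, 1, S - step * grad)

print("reconstruction error:", np.linalg.norm(X - S @ F))
```

Each sample is thus described by a convex combination of factors, the property on which the masking-and-interference argument turns.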
From page 17...
... in what Hero called "the blessing of high dimensionality." He noted, however, that other inference tasks -- for example, full uncertainty quantification -- are more demanding in terms of sample size. Hero concluded by emphasizing the importance of rightsizing the inference task to available data by first detecting those network nodes with edges and prioritizing them for further data collection and estimation (Firouzi et al., 2017).
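
The screening step Hero recommended can be illustrated in a few lines: threshold the sample correlation matrix at a high level and rank variables by how many screened neighbors they have. The 0.5 threshold and the toy data below are illustrative only:

```python
# Sketch of correlation screening to "rightsize" the inference task:
# flag variables with strong sample correlations and prioritize them
# before attempting full network estimation on all p variables.
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 2000                                 # few samples, many variables
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.2 * rng.normal(size=n)    # plant one strong edge

R = np.corrcoef(X, rowvar=False)
np.fill_diagonal(R, 0.0)
degree = (np.abs(R) > 0.5).sum(axis=0)          # screened neighbor counts

hubs = np.argsort(degree)[::-1][:10]            # candidates for follow-up
print("prioritized nodes:", hubs, "degrees:", degree[hubs])
```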
From page 18...
... In such large data sets, data are frequently missing, and the analyst does not have coverage across all data modalities for all subjects. For example, Nobel described data from the Genotype-Tissue Expression (GTEx)
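
A first diagnostic in such settings is simply to tabulate which modalities are available for which subjects; the sketch below uses invented modality names loosely echoing a GTEx-style design:

```python
# Sketch of summarizing modality coverage in a multimodal study.
# The availability pattern and modality names are invented.
import pandas as pd

avail = pd.DataFrame(
    {"genotype":         [1, 1, 1, 0, 1],
     "expression_liver": [1, 0, 1, 0, 0],
     "expression_blood": [1, 1, 0, 1, 1]},
    index=[f"subject_{i}" for i in range(5)],
).astype(bool)

print("per-modality coverage:")
print(avail.mean())                                # fraction of subjects covered
print("complete cases:", avail.all(axis=1).sum())  # usable by complete-case analyses
```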
From page 19...
... Nobel described a similar procedure to identify variable sets that are differentially correlated in two sample groups, illustrating the method using cancer subtypes from the TCGA data set. Identifying differentially correlated variables is a useful form of exploratory analysis to generate hypotheses worthy of further study, said Nobel.
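
For a single pair of variables, differential correlation can be tested with the classical Fisher z-transform, as sketched below; Nobel's procedure searches over variable sets, which this simpler pairwise test does not attempt to replicate:

```python
# Sketch: test whether corr(x, y) differs between two sample groups
# using the Fisher z-transform; data are synthetic.
import numpy as np
from scipy import stats

def diff_corr_pvalue(x1, y1, x2, y2):
    """Two-sided p-value for H0: the two groups share one correlation."""
    r1 = np.corrcoef(x1, y1)[0, 1]
    r2 = np.corrcoef(x2, y2)[0, 1]
    z1, z2 = np.arctanh(r1), np.arctanh(r2)       # Fisher z-transform
    se = np.sqrt(1 / (len(x1) - 3) + 1 / (len(x2) - 3))
    return 2 * stats.norm.sf(abs(z1 - z2) / se)

rng = np.random.default_rng(4)
x1 = rng.normal(size=80)
y1 = x1 + 0.3 * rng.normal(size=80)               # correlated in group 1
x2 = rng.normal(size=80)
y2 = rng.normal(size=80)                          # uncorrelated in group 2
print("p-value:", diff_corr_pvalue(x1, y1, x2, y2))
```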
From page 20...
... An online participant asked if model development, analysis, and results might be impacted by the fact that much of the data was collected years ago and there are likely differences and improvements in data curation and storage practices over time. Nobel answered first, saying that in his experience there is no such thing as a final data set; rather, the data used in analyses are always evolving.
From page 21...
... STATISTICAL DATA INTEGRATION FOR LARGE SCALE MULTIMODAL MEDICAL STUDIES
Genevera Allen, Rice University and Baylor College of Medicine

Genevera Allen provided an overview of data integration for large-scale multimodal medical studies. Large-scale medical cohort studies typically have many types of data -- including clinical evaluations; EHRs; images; gene and protein expression; and social, behavioral, and environmental information -- and the objective of data integration in this context is to combine all of these to better understand complex diseases.
From page 22...
... Similarly, structural neuroimaging data collected in the ROS and MAP studies before 2012 relied on a 1.5 tesla magnet that was replaced with a 3 tesla magnet to provide greater resolution; how to reconcile these two batches of imaging data remains an open question, said Allen. Another critical challenge is that not all data modalities are measured for every subject, which creates missing or misaligned data and can result in a very limited sample size if the analysis is restricted only to patients for whom complete data are available (Figure 3.3A).
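
As a crude illustration of the batch problem, the sketch below standardizes each imaging batch to a common location and scale; real harmonization (for example, ComBat-style empirical Bayes adjustment) is considerably more careful, and the data here are synthetic:

```python
# Naive per-batch standardization as a first-pass way to put two
# scanner eras on a common scale; toy data, not the ROS/MAP pipeline.
import numpy as np

rng = np.random.default_rng(3)
batch = np.array([0] * 60 + [1] * 40)        # 0 = 1.5 T era, 1 = 3 T era
X = rng.normal(size=(100, 5))                # toy imaging features
X[batch == 1] = X[batch == 1] * 1.8 + 0.7    # simulate a batch shift

for b in (0, 1):
    mask = batch == b
    mu, sd = X[mask].mean(axis=0), X[mask].std(axis=0)
    X[mask] = (X[mask] - mu) / sd            # center and scale within batch

print("batch means after adjustment:",
      X[batch == 0].mean().round(3), X[batch == 1].mean().round(3))
```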
From page 23...
... More challenging are data-driven discoveries that provide new biological knowledge; for instance, the ROS and MAP studies not only aim to predict if a patient will get Alzheimer's, but also seek to know why, said Allen. She then discussed a novel method for integrating mixed multimodal data using network models and exponential families to model a joint multivariate distribution.
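
One common way to operationalize that idea is node-wise neighborhood estimation: regress each variable on all the others with a GLM matched to its type, then read edges off the nonzero coefficients. The sketch below is a toy stand-in for the mixed graphical models Allen described, not her published estimator:

```python
# Node-conditional exponential-family sketch: linear lasso for
# continuous nodes, l1-penalized logistic regression for the binary node.
import numpy as np
from sklearn.linear_model import LassoCV, LogisticRegressionCV

rng = np.random.default_rng(5)
n = 200
cont = rng.normal(size=(n, 3))                               # continuous nodes
binary = (cont[:, 0] + rng.normal(size=n) > 0).astype(int)   # binary node tied to node 0
X = np.column_stack([cont, binary])
is_binary = [False, False, False, True]

edges = set()
for j in range(X.shape[1]):
    others = np.delete(np.arange(X.shape[1]), j)
    if is_binary[j]:
        fit = LogisticRegressionCV(penalty="l1", solver="liblinear", cv=5)
        coef = fit.fit(X[:, others], X[:, j]).coef_.ravel()
    else:
        coef = LassoCV(cv=5).fit(X[:, others], X[:, j]).coef_
    for k, c in zip(others, coef):
        if abs(c) > 1e-6:
            edges.add(tuple(sorted((j, k))))    # symmetrize with an "OR" rule

print("recovered edges:", sorted(edges))
```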
From page 24...
... Referring back to the ROS and MAP studies, she said longitudinal studies with mixed multimodal data also present open statistical problems related to aligning data collected at different times. In the bigger picture, the ROS and MAP studies are two of many ongoing projects looking at aging and cognitive health, which creates the opportunity for meta-analysis across similar integrative studies to increase statistical power; this is an objective of the Accelerating Medicines Partnership–Alzheimer's Disease project.
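
The power gain from pooling comparable studies is easiest to see in a fixed-effect inverse-variance meta-analysis; the effect estimates and standard errors below are invented for illustration:

```python
# Fixed-effect inverse-variance meta-analysis on made-up study summaries.
import numpy as np

est = np.array([0.30, 0.22, 0.41])    # per-study effect estimates (toy)
se = np.array([0.15, 0.12, 0.20])     # per-study standard errors (toy)

w = 1 / se**2                         # weight each study by its precision
pooled = np.sum(w * est) / np.sum(w)
pooled_se = np.sqrt(1 / np.sum(w))
print(f"pooled effect: {pooled:.3f} +/- {pooled_se:.3f}")
# The pooled standard error (~0.085) is smaller than any single study's,
# which is the sense in which meta-analysis increases statistical power.
```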
From page 25...
... Morris summarized three main classes of statistical tasks described in the preceding presentation by Genevera Allen: 1. Building of predictive models that integrate diverse data and allow a larger set of possible predictors to search over, which is difficult with mixed multimodal data sets; 2.
From page 26...
... A reasonable middle ground to aim for is identifying and treating cancer subtypes that share many biological characteristics, said Morris. To develop consensus regarding molecular subtypes of colorectal cancer, Morris and colleagues participated in an international consortium led by Sage Bionetworks that combined information from 18 different studies with mRNA data from approximately 4,000 patients.
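
Subtype consensus exercises of this kind typically rest on consensus clustering: cluster many perturbed versions of the data and record how often each pair of samples co-clusters. The sketch below applies the idea to synthetic data with two planted subtypes; it is not the consortium's actual pipeline:

```python
# Consensus clustering sketch: repeat k-means on random subsamples and
# average the co-clustering indicator over runs; toy data only.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (30, 20)),
               rng.normal(3, 1, (30, 20))])     # two planted subtypes
n = X.shape[0]
co = np.zeros((n, n))
counts = np.zeros((n, n))

for rep in range(50):
    idx = rng.choice(n, size=int(0.8 * n), replace=False)   # subsample
    labels = KMeans(n_clusters=2, n_init=10, random_state=rep).fit_predict(X[idx])
    same = labels[:, None] == labels[None, :]
    co[np.ix_(idx, idx)] += same
    counts[np.ix_(idx, idx)] += 1

consensus = co / np.maximum(counts, 1)          # co-clustering frequency
print("within-subtype consensus:", consensus[:30, :30].mean().round(2))
print("between-subtype consensus:", consensus[:30, 30:].mean().round(2))
```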
From page 27...
... gene set enrichment, and (C) gene-level methylation with known biological information allowed researchers to infer that methylation drives differential gene expression in one colorectal cancer subtype.
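
The kind of integrative check behind that inference can be sketched as follows: within a subtype, correlate each gene's methylation with its own expression, and flag strongly negative associations as consistent with methylation-driven silencing. The data and the cutoff are invented:

```python
# Per-gene methylation-expression association on synthetic data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_samples, n_genes = 60, 200
meth = rng.uniform(0, 1, size=(n_samples, n_genes))          # toy beta values
expr = -2.0 * meth + rng.normal(size=(n_samples, n_genes))   # methylation represses

flagged = []
for g in range(n_genes):
    r, p = stats.spearmanr(meth[:, g], expr[:, g])
    if r < -0.4 and p < 0.05:                                # illustrative cutoff
        flagged.append(g)

print(f"{len(flagged)} of {n_genes} genes look methylation-driven")
```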
From page 28...
... Allen responded that a lot of good work has been done using Bayesian networks to integrate mixed multimodal data, and the framework she presented using exponential families to represent a multivariate distribution could be applied with Bayesian networks and priors. Many Bayesian approaches model dependencies between mixed data types in the latent hierarchical structure of the model; this avoids challenges related to scaling of data across modalities but is often more difficult to interpret.
From page 29...
... Regarding the second question, Allen agreed that there is an insufficient sample size relative to the number of parameters to rely on likelihood-based inference, even in large medical cohort studies with thousands of subjects. In the cases where Allen has applied this approach, she relied on biological knowledge to filter the data, as discussed by Morris, before fitting the network.
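
A minimal version of that filter-then-fit strategy, with the 30-variable subset below standing in for a knowledge-derived filter:

```python
# Restrict to a biologically motivated subset so a likelihood-based
# network estimate is feasible at the available sample size; toy data.
import numpy as np
from sklearn.covariance import GraphicalLassoCV

rng = np.random.default_rng(8)
n, p = 150, 2000
X = rng.normal(size=(n, p))          # far more variables than samples

keep = np.arange(30)                 # pretend prior knowledge selects 30 variables
model = GraphicalLassoCV().fit(X[:, keep])
upper = model.precision_[np.triu_indices(30, k=1)]
print("estimated edges:", int((np.abs(upper) > 1e-4).sum()))
```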

