8 Large Databases and Consortia
Pages 105-126

The Chapter Skim interface presents what we've algorithmically identified as the most significant single chunk of text within every page in the chapter.


From page 105...
... The speakers were Charles Danko of Cornell University, Alexis Battle of Johns Hopkins University, Rahul Satija of the New York Genome Center, Saurabh Sinha of the University of Illinois at Urbana-Champaign, and Genevieve Haliburton of the Chan Zuckerberg Initiative. The second session, which focused more on consortia, had no presenters but instead was made up entirely of a panel discussion.
From page 106...
... One of the major goals of his lab has been to "deconvolve what these chromatin states might look like."
FIGURE 8-1 Representation of ChRO-seq data and interpretation of those data. SOURCE: Charles Danko presentation, slide 2.
From page 107...
... Because of this, Danko noted that "transcription is nearly identical to the histone modification for the marks they were studying." Another important question, said Danko, is whether the relationship between transcription and each histone modification varies with cell type. They compared models trained on K562 cells with data from a variety of
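The excerpt describes the comparison but not the mechanics. A minimal sketch of that kind of cross-cell-type test is below: a regressor is trained to predict a histone-mark signal from binned nascent-transcription features in one cell type and then evaluated on another. The arrays, feature layout, and choice of a random-forest model are illustrative assumptions, not the speakers' actual pipeline.

```python
# Hypothetical sketch: predict a histone mark from binned nascent-transcription
# signal (ChRO-seq-like), training on one cell type and testing on another.
# All arrays here are random placeholders, not real ChRO-seq or ChIP-seq data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# X_*: transcription signal in windows around each genomic bin (n_bins x n_features)
# y_*: measured histone-mark signal (e.g., H3K4me3) in the same bins
X_k562, y_k562 = rng.random((5000, 20)), rng.random(5000)    # training cell type
X_other, y_other = rng.random((5000, 20)), rng.random(5000)  # held-out cell type

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_k562, y_k562)

# If transcription and the mark carry nearly the same information, a model
# trained in K562 should transfer: correlation in the new cell type stays high.
r, _ = pearsonr(model.predict(X_other), y_other)
print(f"cross-cell-type correlation: {r:.2f}")
```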
From page 108...
... " He closed with a series of challenges for this work: First, cell types that people think of as static actually vary a lot. This can create problems when researchers do not take the presence of biological variation into account.
From page 109...
... To address these issues, Battle has become involved in the Genotype-Tissue Expression (GTEx) project, which is a database focused on tissue-specific gene expression data (GTEx Consortium, 2017; Aguet et al., 2019)
From page 110...
... Even though this was a study with a small sample size on a single cell type, she said, "we're starting to see explanation of some GWAS hits that are not explained by static tissue." Next, Battle turned to the issue of the effects of rare variants, which, she said, are completely missed by the sorts of studies she had been describing with GWAS hits and eQTLs. Each individual has about 50,000 variants in his or her genome that appear with a minor allele frequency (MAF)
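For reference, minor allele frequency is simply the frequency of the less common allele at a site. The sketch below computes it from a 0/1/2-coded diploid genotype matrix and flags variants below an illustrative 1 percent threshold; the matrix and the threshold are assumptions for illustration, not GTEx data.

```python
# Minimal sketch of computing minor allele frequency (MAF) from diploid
# genotypes coded as 0/1/2 alternate-allele counts, then flagging rare variants.
import numpy as np

rng = np.random.default_rng(1)
genotypes = rng.integers(0, 3, size=(1000, 500))  # individuals x variants

alt_freq = genotypes.sum(axis=0) / (2 * genotypes.shape[0])  # alt-allele frequency
maf = np.minimum(alt_freq, 1 - alt_freq)                     # fold to the minor allele

rare = maf < 0.01  # illustrative threshold; such variants lack power in standard GWAS/eQTL designs
print(f"{rare.sum()} of {len(maf)} variants are rare at this threshold")
```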
From page 111...
... By adding RNA from personal transcriptomic data, it is possible to take any state-of-the-art method and make it "better at identifying variants that actually look functional." Watershed can be used to inform disease association for rare variants in large studies. The bottom line, she concluded, is that "gene expression is very helpful for interpreting disease variants."
AN ATLAS OF ATLASES
To begin his presentation, Rahul Satija noted that in recent years the quantity and types of data available to researchers in functional genomics have grown dramatically.
From page 112...
... The datasets should share some of the same underlying biological cell populations, but there can be some populations that are present in one dataset but not in the other. The first step is to use a method called canonical correlation analysis to project cells into a common cellular space.
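The excerpt names canonical correlation analysis but not the mechanics. One simplified form of the co-embedding step is sketched below: after standardizing each shared gene within each dataset, the SVD of the cross-product of the two expression matrices yields low-dimensional coordinates for the cells of both datasets in a common space. The matrices, dimensions, and normalization details are illustrative assumptions, not the published implementation.

```python
# Minimal sketch of a diagonalized-CCA-style co-embedding of two single-cell datasets.
import numpy as np

def cca_embed(X: np.ndarray, Y: np.ndarray, k: int = 20):
    """X: cells_a x genes, Y: cells_b x genes, both restricted to shared genes.
    Returns k-dimensional embeddings for the cells of each dataset."""
    # Standardize each gene within each dataset.
    Xs = (X - X.mean(0)) / (X.std(0) + 1e-8)
    Ys = (Y - Y.mean(0)) / (Y.std(0) + 1e-8)
    # SVD of the cross-product gives canonical correlation vectors for the cells.
    U, _, Vt = np.linalg.svd(Xs @ Ys.T, full_matrices=False)
    return U[:, :k], Vt[:k, :].T

rng = np.random.default_rng(3)
emb_a, emb_b = cca_embed(rng.random((300, 2000)), rng.random((400, 2000)))
print(emb_a.shape, emb_b.shape)  # (300, 20) (400, 20)
```

With both datasets in this shared space, mutual nearest neighbors between them can serve as candidate "anchors" for the integration step described on the next page.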
From page 113...
... FIGURE 8-2  Representation of how to identify "anchors" across datasets. SOURCES: Rahul Satija presentation, slide 6; Stuart et al., 2019.
From page 114...
... They found an exact match between the von Economo neurons and a particular cell type in mouse brains. Because there was something known about what that cell type does in mice, it offered some insights into the human von Economo neurons and opened the way to a series of experiments with which to learn more (Hodge et al., 2020)
From page 115...
... This alignment can also facilitate comparisons between different experimental models, such as aligning human data to mouse data, or to data from other species to perform comparative genomics.
A CLOUD-BASED PLATFORM FOR GENOMICS DATA MINING
The rate at which genomics data are generated is rapidly increasing each year, said Saurabh Sinha of the University of Illinois at Urbana-Champaign (UIUC)
From page 116...
... There are many public databases that provide extremely valuable information about genes, proteins, and their properties and relationships, and the people developing KnowEnG wanted the information in these databases to be available to inform the analysis of a user's data. To do that, they captured all that information in a massive heterogeneous network, where the nodes were genes, proteins, and their properties, and the edges represented the relationships that these knowledge bases captured between nodes, such as protein–protein interactions, gene ontology information, pathways, and so on.
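As a concrete, though hypothetical, illustration of such a heterogeneous network, the sketch below builds a tiny typed graph in which genes, a Gene Ontology term, and a pathway are nodes and labeled edges record the relationship each knowledge base contributes. The specific entries and attribute names are examples, not KnowEnG's actual schema.

```python
# Illustrative heterogeneous knowledge network: typed nodes and typed edges.
import networkx as nx

G = nx.Graph()
G.add_node("TP53", kind="gene")
G.add_node("MDM2", kind="gene")
G.add_node("GO:0006915", kind="go_term", name="apoptotic process")
G.add_node("hsa04115", kind="pathway", name="p53 signaling pathway")

G.add_edge("TP53", "MDM2", relation="protein_protein_interaction")
G.add_edge("TP53", "GO:0006915", relation="gene_ontology_annotation")
G.add_edge("TP53", "hsa04115", relation="pathway_membership")

# A user's gene-level results can then be interpreted in the context of this graph,
# for example by inspecting a gene's typed neighborhood.
for _, nbr, data in G.edges("TP53", data=True):
    print(nbr, data["relation"])
```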
From page 117...
... . The input was gene expression data from multiple cell lines, and the knowledge network was used to "smooth the gene expression data so that each gene's expression not only reflects its measured expression, but also the activity or expression of its network neighbors." The smoothed expression profiles of each of the cell lines were then correlated with drug response data on those cell lines to identify genes whose expression was most predictive of cytotoxicity in response to a particular drug.
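One common way to implement that kind of smoothing is random-walk-with-restart-style propagation over the gene network, sketched below; whether the platform uses exactly this form is not stated in the excerpt, and the adjacency matrix, expression values, drug-response vector, and restart weight are all illustrative assumptions.

```python
# Hedged sketch: propagate each cell line's expression over a gene-gene knowledge
# network, then correlate each gene's smoothed profile with drug response.
import numpy as np
from scipy.stats import pearsonr

def network_smooth(expr, adj, alpha=0.5, n_iter=50):
    """expr: genes x cell_lines; adj: genes x genes knowledge-network adjacency."""
    # Row-normalize the adjacency so propagation averages over a gene's neighbors.
    W = adj / np.maximum(adj.sum(axis=1, keepdims=True), 1e-12)
    S = expr.copy()
    for _ in range(n_iter):
        S = alpha * (W @ S) + (1 - alpha) * expr  # mix neighbors' signal with the measurement
    return S

rng = np.random.default_rng(4)
n_genes, n_lines = 200, 30
adj = (rng.random((n_genes, n_genes)) < 0.02).astype(float)
adj = np.maximum(adj, adj.T)                      # symmetric, unweighted toy network
expr = rng.random((n_genes, n_lines))
cytotoxicity = rng.random(n_lines)                # response of each cell line to one drug

smoothed = network_smooth(expr, adj)
correlations = [pearsonr(smoothed[g], cytotoxicity)[0] for g in range(n_genes)]
top_genes = np.argsort(np.abs(correlations))[::-1][:10]  # genes most predictive of cytotoxicity
print(top_genes)
```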
From page 118...
... SUPPORTING DEVELOPMENT OF METHODS AND TOOLS
Genevieve Haliburton of the Chan Zuckerberg Initiative (CZI) described what the organization is doing to support biological research.
From page 119...
... The technology side of CZI is helping develop the data coordination platform for the HCA, which is intended to be robust and scalable with a high-quality user interface and will house all of the data that are being generated. The goal is to build a platform that people can interact with on multiple levels, she said, "not necessarily just computational biologists but also people who have a hypothesis and want to go look at it." Currently, the platform allows only downloads of individual study data, but there are plans to include standardized multi-study, multimodal integration.
From page 120...
... As an example, she described a "normjam," where methods developers gathered to discuss high-level questions regarding normalization. "When I say normalizing here," she explained, "I'm just talking about removing the technical variation from, say, a single-cell RNA-seq" in order to focus on the biological variation and not any variation due to differences in the equipment or methods.
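As one concrete example of what removing technical variation can mean in practice, the sketch below applies simple sequencing-depth normalization and a log transform to a counts matrix. This is only one of many strategies and is not presented as what the normjam participants settled on; the counts, target sum, and transform are illustrative.

```python
# Simple depth normalization for single-cell RNA-seq counts, as one example of
# removing a technical source of variation (differences in sequencing depth per cell).
import numpy as np

def depth_normalize(counts: np.ndarray, target_sum: float = 1e4) -> np.ndarray:
    """counts: cells x genes raw UMI counts. Scale each cell to a common total,
    then log-transform to stabilize variance."""
    per_cell_total = counts.sum(axis=1, keepdims=True)
    scaled = counts / np.maximum(per_cell_total, 1) * target_sum
    return np.log1p(scaled)

raw = np.random.default_rng(5).poisson(1.0, size=(100, 500)).astype(float)
normalized = depth_normalize(raw)
print(normalized.shape)
```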
From page 121...
... "We can order different events with extremely high precision and actually get to the point where we can almost understand causality or at least know that A comes before B, and therefore, B did not cause A." One of the first papers from his lab included data from an adult mouse brain and from a developing mouse brain, Satija said, "and the data from the developing mouse brain looked kind of like a cloud. It looked like the cells hadn't differentiated yet." So using the anchoring technique he described in his talk, his team found anchors between the developing cells and the differentiated cells, which allowed them to use the adult dataset, where there was plenty of diversity, to guide their analysis of the early embryo.
From page 122...
... The discussion panel consisted of Felicity Jones of the Friedrich Miescher Laboratory of the Max Planck Society, Alexis Battle of Johns Hopkins University, Saurabh Sinha of UIUC, Rahul Satija of the New York Genome Center, and Sean Hanlon of the National Cancer Institute. On the first day of the workshop, Aviv Regev offered her thoughts on the importance of large initiatives (see Chapter 2 for further description of Regev's keynote address)
From page 123...
... The first draft of the atlas is expected to span at least 100 million cells, including most major tissues and systems from healthy donors of both sexes with geographic and ethnic diversity and some age diversity. Ultimately, the comprehensive atlas is expected to have up to 10 billion cells representing all tissues, organs, and systems as well as full organs, again from a diverse group of healthy donors but also with mini-cohorts representing various disease conditions.
From page 124...
... Some consortium grants explicitly require multiple principal investigators with complementary expertise. In his case there was a technology developer, an immunologist, and a computational biologist.
From page 125...
... We have trainees that lead a number of the working groups and really contribute." An audience member suggested that with the costs of collecting genomics data dropping and the ease of collecting those data increasing, large consortia are likely to become less important. Battle agreed, but said that there are still some areas, particularly in human genetics, where the necessary datasets are too large for a single lab to collect at this time, "and if we are waiting for it to get cheap enough for one lab to do that or even have the time to do it, we would be waiting quite a long time." Furthermore, a consortium like GTEx makes a major contribution by enforcing the uniformity of its data, which contributes to increased signal during processing.
From page 126...
... Hanlon added that the Human Tumor Atlas program was funded through the Cancer Moonshot Program, which has a goal of making data accessible to a broader community -- not just biometricians, but also cancer biologists, and even patients and clinicians. Marc Halfon of the University at Buffalo argued that the agencies that fund the development of databases should be ready to provide the necessary funding to maintain the databases over time.

