Webinar: GDC Bioinformatics Pipelines

Date: Monday, September 30, 2019

Time: 2:00 PM - 3:00 PM (EDT)

Location: Web Conference (See WebEx information below)


Zhenyu Zhang, Ph.D., GDC Bioinformatics Manager, University of Chicago

Colin Reid, GDC User Services, University of Chicago

The GDC bioinformatics pipelines support the alignment of DNA and RNA sequence data against a common reference genome build, and the generation of derived data. GDC pipelines are implemented using data processing software and algorithms selected in consultation with the expert genomics community. This webinar will provide an overview of GDC bioinformatics pipelines and demonstrate how generated data is made available through GDC analysis tools.

Summer Internships

We are accepting applications for our 2019 Summer Internships! Interns will contribute toward biomedical research through analytical solutions and will develop technical skills across data engineering, data science, bioinformatics, and software engineering. Interns will have opportunities to learn from staff mentors with experience building petabyte-scale research infrastructure.

How Data Commons Can Support Open Science


April 23, 2019

By Robert L. Grossman

In the discussion about open science, we refer to the need for having data commons. What are data commons and why might a community develop one? I offer a brief introduction and describe how data commons can support open science.

Data commons are used by projects and communities to create open resources that accelerate the rate of discovery and increase the impact of the data they host. Notice what data commons are not: they are not a place for an individual researcher working on an isolated project to ignore FAIR principles and dump data simply to satisfy data management and data sharing requirements.

More formally, data commons are software platforms that co-locate: 1) data, 2) cloud-based computing infrastructure, and 3) commonly used software applications, tools and services to create a resource for managing, analyzing and sharing data with a community.

The key ways that data commons support open science include:

  1. Data commons make data available so that they are open and can be easily accessed and analyzed.

  2. Unlike a data lake, data commons curate data using one or more common data models and harmonize them by processing them with a common set of pipelines so that different datasets can be more easily integrated and analyzed together. In this sense, data commons reduce the cost and effort required for the meaningful analysis of research data.

  3. Data commons save time for researchers by integrating and supporting commonly used software tools, applications and services. Data commons use different strategies for this: the commons themselves can include workspaces that support data analysis; other cloud-based resources, such as the NCI Cloud Resources that support the GDC, can be used for the analysis; or the analysis can be done via third-party applications, such as Jupyter notebooks, that access data through APIs exposed by the data commons.

  4. Data commons also save money and resources for a research community, since each research group in the community does not have to create its own computing environment and host the same data. Since operating data commons can be expensive, an increasingly popular model is to not charge for accessing data in a commons, and instead either provide cloud-based credits or allotments to those interested in analyzing the data or pass the charges for data analysis on to the users.

A good example of how data commons can support open science is the Genomic Data Commons (GDC) that was launched in 2016 by the National Cancer Institute (NCI). The GDC has over 2.7 PB of harmonized genomic and associated clinical data and is used by over 100,000 researchers each year. In an average month, 1–2 PB or more of data are downloaded or accessed from it.

The GDC supports an open data ecosystem that includes large-scale cloud-based workspaces, as well as Jupyter notebooks, RStudio notebooks, and more specialized applications that access GDC data via the GDC API. The GDC saves the research community time and effort since research groups have access to harmonized data that have been curated with respect to a common data model and processed with a common set of bioinformatics pipelines. By using a centralized cloud-based infrastructure, the GDC also reduces the total cost for cancer researchers to work with large genomic datasets, since each research group does not need to set up and operate its own large-scale computing infrastructure.
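As a rough illustration of what "accessing GDC data via the GDC API" can look like from a notebook, the sketch below only builds a request URL and does not send it. The endpoint, the filter layout, the field names and the project identifier are assumptions based on the public GDC API and should be checked against its documentation; the function name `build_files_query` is made up for this example.

```python
import json
from urllib.parse import urlencode, urlparse, parse_qs

# Assumed public endpoint for file metadata; verify against the GDC API docs.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_files_query(project_id, size=5):
    """Build (but do not send) a file-metadata query URL for one project."""
    # Assumed filter layout: an operator plus a field/value pair.
    filters = {
        "op": "in",
        "content": {"field": "cases.project.project_id", "value": [project_id]},
    }
    params = {
        "filters": json.dumps(filters),
        "fields": "file_id,file_name,data_type",  # assumed field names
        "format": "JSON",
        "size": str(size),
    }
    return f"{GDC_FILES_ENDPOINT}?{urlencode(params)}"


# "TCGA-BRCA" is used here only as an illustrative project identifier.
url = build_files_query("TCGA-BRCA")
query = parse_qs(urlparse(url).query)
```

The resulting URL could be passed to any HTTP client. Keeping the query construction separate from the network call makes it easy to inspect or log the request before it is sent.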

Based upon this success, a number of other communities are building their own data commons or considering it.

For more information about data commons and data ecosystems that can be built around them, see:

  • Robert L. Grossman, Data Lakes, Clouds and Commons: A Review of Platforms for Analyzing and Sharing Genomic Data, Trends in Genetics, 2019, Volume 35, pages 223–234. doi: 10.1016/j.tig.2018.12.006. Also see: arXiv:1809.01699

  • Robert L. Grossman, Progress Towards Cancer Data Ecosystems, The Cancer Journal: The Journal of Principles and Practice of Oncology, May/June 2018, Volume 24 Number 3, pages 122–126. doi: 10.1097/PPO.0000000000000318

About: Robert L. Grossman is the Frederick H. Rawson Distinguished Service Professor in Medicine and Computer Science and the Jim and Karen Frank Director of the Center for Translational Data Science (CTDS) at the University of Chicago. He is also the Director of the not-for-profit Open Commons Consortium (OCC), which manages and operates cloud computing and data commons infrastructure to support scientific, medical, healthcare and environmental research.

Originally published at http://sagebionetworks.org.

Webinar: Gen3 Data Modeling

Data commons, data ecosystems, and science gateways are modern ways that scientists and researchers conduct analyses over petabytes of research data. Many of these are powered by Gen3, an emerging technology available to the community through an open-source software suite.

In this webinar, you will learn how Gen3 services allow users to submit, index, and query data based on a data model. We will break down the steps needed to create a data dictionary and show how to use the Gen3 dictionary tools to create YAML schemas for the data model. We will also review the Sheepdog and Peregrine services and how they check your data submissions against the data model and facilitate your data queries.
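To give a feel for what "checking a submission against the data model" means, here is a minimal sketch of validating one record against a dictionary-style node definition. It is not the Sheepdog implementation: the schema layout, the property names and the `validate_record` helper are hypothetical, and the schema is written as a Python dict rather than YAML so that the example is self-contained.

```python
# Hypothetical node definition. A real data dictionary would keep this in a
# YAML schema; the property names below are made up for illustration.
SAMPLE_SCHEMA = {
    "required": ["submitter_id", "sample_type"],
    "properties": {
        "submitter_id": {"type": str},
        "sample_type": {"type": str, "enum": ["tumor", "normal"]},
        "days_to_collection": {"type": int},
    },
}


def validate_record(record, schema):
    """Return a list of problems found; an empty list means the record passes."""
    errors = []
    # Every required property must be present.
    for name in schema["required"]:
        if name not in record:
            errors.append(f"missing required property: {name}")
    # Every supplied property must be known, well typed, and within its enum.
    for name, value in record.items():
        spec = schema["properties"].get(name)
        if spec is None:
            errors.append(f"unknown property: {name}")
            continue
        if not isinstance(value, spec["type"]):
            errors.append(f"wrong type for {name}")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"value not allowed for {name}: {value}")
    return errors


good = {"submitter_id": "s-001", "sample_type": "tumor"}
bad = {"sample_type": "blood", "days_to_collection": "12"}

good_errors = validate_record(good, SAMPLE_SCHEMA)
bad_errors = validate_record(bad, SAMPLE_SCHEMA)
```

Returning a list of problems, rather than stopping at the first one, lets a submitter fix all issues in a record in a single pass.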

AACR Annual Meeting 2019

Exhibit Booth #2111

NCI Data Sharing & Informatics

Experts from the Center for Translational Data Science were at the AACR Annual Meeting again this year, representing the NCI Genomic Data Commons (GDC) at Exhibit Booth #2111. The GDC is a public resource for sharing and analyzing big data to support precision medicine. Michael Fitzsimons, PhD, Director of User Services, led Meet the Experts sessions where he gave a live demo of the GDC tools for analyzing, visualizing, and exploring genomic and clinical data. Researchers at the conference learned how they can use GDC data and tools to support their cancer research.

New GDC Community Tool

Colin Reid, a Scientific Support Analyst in the Center for Data Intensive Science at the University of Chicago, built a new community tool for the Genomic Data Commons. The RNASeq Tool downloads individual RNASeq files from the GDC and merges them into matrices identified by TCGA barcode. Learn more at https://gdc.cancer.gov/access-data/gdc-community-tools.
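The merge step described above can be pictured with a small sketch. This is not the RNASeq Tool's code: the two-column "gene, count" layout, the `merge_counts` function, the choice to fill missing genes with 0, and the example values are all assumptions made for illustration, and the barcodes only imitate the TCGA format.

```python
import csv
import io


def merge_counts(sample_tables):
    """Merge per-sample count tables into one gene-by-sample matrix.

    sample_tables maps a sample barcode to tab-separated text with one
    "gene<TAB>count" row per line (a simplified, hypothetical layout).
    """
    barcodes = sorted(sample_tables)
    counts = {}  # gene -> {barcode: count}
    for barcode in barcodes:
        reader = csv.reader(io.StringIO(sample_tables[barcode]), delimiter="\t")
        for row in reader:
            if len(row) != 2:
                continue  # skip blank or malformed lines
            gene, value = row
            counts.setdefault(gene, {})[barcode] = int(value)

    # One column per barcode; genes absent from a sample are filled with 0
    # (an assumption of this sketch, not necessarily the right choice).
    header = ["gene"] + barcodes
    rows = [
        [gene] + [counts[gene].get(barcode, 0) for barcode in barcodes]
        for gene in sorted(counts)
    ]
    return header, rows


# Made-up inputs: two samples with partially overlapping genes.
example = {
    "TCGA-AA-0001-01A": "GENE1\t10\nGENE2\t0\n",
    "TCGA-BB-0002-01A": "GENE1\t7\nGENE3\t3\n",
}
header, rows = merge_counts(example)
```

Sorting both barcodes and genes keeps the output deterministic, which makes matrices from repeated runs easy to compare.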

Supercomputing Conference

It's our favorite time of year! We're in Denver this week for the annual Supercomputing conference with the Open Commons Consortium, showcasing our work on:

  • innovative applications of data science in biology, medicine, health care, and the environment;
  • new releases of open-source data commons and data peering technology that support research communities, including specialized commons for cancer genomic data, liquid biopsy research, brain disorders, pediatric cancer and birth defects, weather data, and satellite imagery;
  • data intensive computing systems;
  • high performance analytics;
  • and a Thursday Birds-of-a-Feather session on Data Commons led by Robert Grossman in room 405-406-407.

Stop by booth #1653, ask for a demonstration, and learn more about how we've been working hard to make data-driven research easier and more accessible. We look forward to chatting with you!

NIH Biomedical Data Sharing Cloud Pilot

Two UChicago Groups Join NIH Biomedical Data Sharing Cloud Pilot

Two University of Chicago research groups will help build the pilot phase of an ambitious new National Institutes of Health initiative to make U.S. biomedical research data and tools accessible to more scientists.

The NIH Data Commons, a shared virtual space where scientists can work with the digital objects of biomedical research, will launch a 4-year pilot phase, the agency announced today. Globus, the UChicago-based non-profit research data management platform, and the Center for Data Intensive Science at UChicago are both part of the multi-institutional consortium receiving 12 awards totaling $9 million to implement this powerful new platform.

“Harvesting the wealth of information in biomedical data will advance our understanding of human health and disease,” said NIH Director Francis S. Collins, M.D., Ph.D. “However, poor data accessibility is a major barrier to translating data into understanding. The NIH Data Commons Pilot Phase is an important effort to remove that barrier.”

Researchers in medicine and biology increasingly work with massive datasets to better understand disease, find new treatments, and decode the basics of life. These data are rich with information, but create technical challenges due to their size, complexity, privacy requirements, and the specialized analytic tools needed for their analysis.

A “data commons” helps eliminate these barriers by creating a virtual, cloud-based platform where researchers can easily access and work with otherwise intractable datasets. For example, scientists at multiple institutions could share and compare patient genetic sequences to find potential new drug targets for a disease. Scientists can also extract more value from federally-funded research, as data collected by a single laboratory will be available for others to discover and build upon in their own work.

Other data-heavy sciences, such as astronomy and climate research, have constructed data commons, and last year the National Cancer Institute -- one of 27 institutes and centers at the NIH -- announced its Genomic Data Commons, built and managed by CDIS and the University of Chicago.

But building a data commons for the nearly $30 billion of research funded by the NIH each year is an even larger enterprise. The 4-year pilot phase for the NIH Data Commons will explore the feasibility and best practices for making digital objects available through collaborative platforms, applying the FAIR principles (Findable, Accessible, Interoperable, and Reusable) to more biomedical research data and tools.

Globus, a widely used platform for transferring, sharing, and discovering research data developed by the University of Chicago and Argonne National Laboratory, will partner with the USC Information Sciences Institute to provide cloud-based services that enable key capabilities for the NIH Data Commons pilot. Those services include new privacy and security measures for controlled-access data, leveraging tools for managing Protected Health Information that Globus is concurrently developing in an NCI-funded project. Globus also led the creation of the Materials Data Facility, a commons-like environment that enables researchers in the Materials Genome Initiative to share datasets.

“Globus is used by thousands of researchers in other scientific fields with intensive computational and data needs, and our platform is ready to help support the architecture of the new NIH Data Commons,” said Ian Foster, co-founder and director of Globus and Arthur Holly Compton Distinguished Service Professor of Computer Science at UChicago. “We’re excited to bring our mission of accelerating research to this important effort that will unlock new discoveries.”

The Center for Data Intensive Science (CDIS), led by Jim and Karen Frank Director Robert L. Grossman, will partner with the University of California Santa Cruz and the Broad Institute for their contribution to the pilot phase. Each institution has a strong track record of developing production-grade software platforms that currently support flagship scientific efforts, including the CDIS-developed NCI Genomic Data Commons at the University of Chicago. They will align these individual efforts in a collaboration called the Commons Alliance so that data commons can be the foundation for an open ecosystem of software applications and services developed by a research community.

“We have developed eight data commons that are used by thousands of researchers each day and that all interoperate with each other,” said Robert L. Grossman, the Frederick H. Rawson Professor of Medicine and Computer Science at the University of Chicago. “For this project, the Commons Alliance will be building an open platform so that researchers anywhere in the world can easily build their own custom applications over the NIH Data Commons to advance their own research.”

Three NIH-funded data sets on genotype-tissue expression, trans-omics for precision medicine, and model organism genomes will serve as test cases for the NIH Data Commons Pilot Phase. More data resources will be added once the pilot phase has achieved its primary objectives, the NIH announced in its news release. The trans-NIH Data Commons Pilot Phase receives funding from multiple NIH Institutes and Centers and is managed by the NIH Common Fund within the NIH Office of the Director.