New GDC Community Tool

Colin Reid, a Scientific Support Analyst in the Center for Data Intensive Science at the University of Chicago, built a new community tool for the Genomic Data Commons. The RNASeq Tool downloads individual RNASeq files from the GDC and merges them into matrices identified by TCGA barcode.
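To illustrate the kind of merge described above, here is a minimal sketch, not the RNASeq Tool's actual code. It assumes each per-sample file is a two-column, tab-separated list of gene ID and value, and that HTSeq-style summary rows begin with `__`. The function names and the barcodes in the usage lines are hypothetical.

```python
import csv
import io


def read_counts(handle):
    """Read one per-sample file: tab-separated gene ID and value (assumed layout)."""
    counts = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or row[0].startswith("__"):
            continue  # skip blank lines and HTSeq-style summary rows (assumption)
        counts[row[0]] = float(row[1])
    return counts


def merge_to_matrix(samples):
    """Merge {barcode: {gene: value}} into a gene-by-barcode matrix.

    Returns (genes, barcodes, rows), where rows[i][j] is the value for
    genes[i] in barcodes[j], or None if that sample lacks the gene.
    """
    barcodes = sorted(samples)
    genes = sorted(set().union(*(samples[b].keys() for b in barcodes)))
    rows = [[samples[b].get(g) for b in barcodes] for g in genes]
    return genes, barcodes, rows


# Usage with made-up barcodes and in-memory files standing in for downloads.
sample_a = read_counts(io.StringIO("GENE1\t5\nGENE2\t7\n__no_feature\t3\n"))
sample_b = read_counts(io.StringIO("GENE1\t1\n"))
genes, barcodes, rows = merge_to_matrix(
    {"TCGA-XX-0001": sample_a, "TCGA-XX-0002": sample_b}
)
```

Keeping a missing gene as `None` rather than zero is a design choice in this sketch, so that "not measured" is not confused with "measured as zero".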

Supercomputing Conference

It's our favorite time of year!  We're in Denver this week for the annual Supercomputing conference with the Open Commons Consortium, showcasing our work on:

  • innovative applications of data science in biology, medicine, health care, and the environment;
  • new releases of open-source data commons and data peering technology that support research communities, including specialized commons for cancer genomic data, liquid biopsy research, brain disorders, pediatric cancer and birth defects, weather data, and satellite imagery;
  • data intensive computing systems;
  • high performance analytics;
  • and a Thursday Birds-of-a-Feather session on Data Commons led by Robert Grossman in room 405-406-407.

Stop by booth #1653, ask for a demonstration, and learn more about how we've been working hard to make data-driven research easier and more accessible. We look forward to chatting with you!

NIH Biomedical Data Sharing Cloud Pilot

Two UChicago Groups Join NIH Biomedical Data Sharing Cloud Pilot

Two University of Chicago research groups will help build the pilot phase of an ambitious new National Institutes of Health initiative to make U.S. biomedical research data and tools accessible to more scientists.

The NIH Data Commons, a shared virtual space where scientists can work with the digital objects of biomedical research, will launch a 4-year pilot phase, the agency announced today. Globus, the UChicago-based non-profit research data management platform, and the Center for Data Intensive Science at UChicago are both part of the multi-institutional consortium receiving 12 awards totaling $9 million to implement this powerful new platform.

“Harvesting the wealth of information in biomedical data will advance our understanding of human health and disease,” said NIH Director Francis S. Collins, M.D., Ph.D. “However, poor data accessibility is a major barrier to translating data into understanding. The NIH Data Commons Pilot Phase is an important effort to remove that barrier.”

Researchers in medicine and biology increasingly work with massive datasets to better understand disease, find new treatments, and decode the basics of life. These data are rich with information, but create technical challenges due to their size, complexity, privacy requirements, and the specialized analytic tools needed for their analysis.

A “data commons” helps eliminate these barriers by creating a virtual, cloud-based platform where researchers can easily access and work with otherwise intractable datasets. For example, scientists at multiple institutions could share and compare patient genetic sequences to find potential new drug targets for a disease. Scientists can also extract more value from federally-funded research, as data collected by a single laboratory will be available for others to discover and build upon in their own work.

Other data-heavy sciences, such as astronomy and climate research, have constructed data commons, and last year the National Cancer Institute, one of 27 institutes and centers at the NIH, announced its Genomic Data Commons, built and managed by CDIS and the University of Chicago.

But building a data commons for the nearly $30 billion of research funded by the NIH each year is an even larger enterprise. The 4-year pilot phase for the NIH Data Commons will explore the feasibility and best practices for making digital objects available through collaborative platforms, applying the FAIR principles (Findable, Accessible, Interoperable, and Reusable) to more biomedical research data and tools.

Globus, a widely used platform for transferring, sharing, and discovering research data developed by the University of Chicago and Argonne National Laboratory, will partner with the USC Information Sciences Institute to provide cloud-based services that enable key capabilities for the NIH Data Commons pilot. Those services include new privacy and security measures for controlled-access data, leveraging tools for managing Protected Health Information that Globus is concurrently developing in an NCI-funded project. Globus also led the creation of the Materials Data Facility, a commons-like environment that enables researchers in the Materials Genome Initiative to share datasets.

“Globus is used by thousands of researchers in other scientific fields with intensive computational and data needs, and our platform is ready to help support the architecture of the new NIH Data Commons,” said Ian Foster, co-founder and director of Globus and Arthur Holly Compton Distinguished Service Professor of Computer Science at UChicago. “We’re excited to bring our mission of accelerating research to this important effort that will unlock new discoveries.”

The Center for Data Intensive Science (CDIS), led by Jim and Karen Frank Director Robert L. Grossman, will partner with the University of California Santa Cruz and the Broad Institute for their contribution to the pilot phase. Each institution has a strong track record of developing production-grade software platforms that currently support flagship scientific efforts, including the CDIS-developed NCI Genomic Data Commons at the University of Chicago. They will align these individual efforts in a collaboration called the Commons Alliance so that data commons can be the foundation for an open ecosystem of software applications and services developed by a research community.

“We have developed eight data commons that are used by thousands of researchers each day and that all interoperate with each other,” said Robert L. Grossman, the Frederick H. Rawson Professor of Medicine and Computer Science at the University of Chicago. “For this project, the Commons Alliance will be building an open platform so that researchers anywhere in the world can easily build their own custom applications over the NIH Data Commons to advance their own research.”

Three NIH-funded data sets on genotype-tissue expression, trans-omics for precision medicine, and model organism genomes will serve as test cases for the NIH Data Commons Pilot Phase. More data resources will be added once the pilot phase has achieved its primary objectives, the NIH announced in its news release. The trans-NIH Data Commons Pilot Phase receives funding from multiple NIH Institutes and Centers and is managed by the NIH Common Fund within the NIH Office of the Director.

New Data Commons for Brain Health

We're thrilled to partner with Cohen Veterans Bioscience and the Open Commons Consortium to establish the Brain Commons – a one-of-a-kind cloud-hosted platform for unleashing big data that will be critical for the understanding of brain conditions. The data commons platform is uniquely positioned to aggregate and manage large-scale imaging data, genomic data, wearables data, and clinical data, as well as enable machine learning and analytics at state-of-the-art computing speeds to accelerate our understanding of brain conditions and brain health.

GDC Workshop on Oct 12

Date: Thursday, October 12, 2017

Time: 12:00 PM - 1:00 PM (EDT)

Location: Web Conference (See WebEx information below)

Speaker: Michael Fitzsimons, Ph.D., GDC User Services Manager, University of Chicago


The workshop, Analyzing Data Using GDC Data Analysis, Visualization, and Exploration (DAVE) Tools, introduces users to GDC tools for analyzing data from cancer genomic studies. As an example, we will explore the most frequently mutated genes and mutations and perform a survival analysis for cases with and without these mutations; view the distribution of particular mutations and mutated genes across the GDC and visualize associated transcripts in a protein viewer; build custom gene sets for targeted analysis; perform integrated analysis on the most mutated cases in an OncoGrid; and analyze cases within and across projects.

Included Topics

  • Visualize the most frequently mutated genes and view the most frequent somatic mutations for a project
  • Perform a survival analysis for cases with a mutated form of a certain gene and cases without the mutation
  • Visualize mutations and their frequencies across protein domains
  • Build custom gene sets for targeted analysis
  • Plot all cases for a project in an OncoGrid and visualize the top 50 mutated genes affected by high impact mutations
  • View the number of cases affected by a particular mutation across all projects
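
For readers curious about what a survival analysis of mutated versus non-mutated cases involves, the sketch below computes a Kaplan-Meier survival curve in plain Python. This is an illustration under assumptions: the announcement does not say which estimator the GDC tools use, and the function name and sample numbers are made up.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate of survival probability.

    times:  follow-up duration for each case
    events: True if the event (e.g. death) was observed, False if censored
    Returns a list of (time, survival_probability) at each observed event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all cases that share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += 1 if data[i][1] else 0
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed  # events and censored cases both leave the risk set
    return curve


# Hypothetical follow-up data for two groups of cases.
mutated = kaplan_meier([1, 2, 3, 4], [True, False, True, True])
wild_type = kaplan_meier([2, 5, 6, 8], [False, True, False, True])
```

Comparing the two curves (for example with a log-rank test, not shown here) is one common way to ask whether survival differs between cases with and without a mutation.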

New platform for children's health issues

Investigators from the University of Chicago Medicine will play a central role in a five-year, $14.8 million effort by the National Institutes of Health, contingent upon available funding, to improve the understanding of inherited diseases.

The project, known as the Gabriella Miller Kids First pediatric data resource center, will be a multi-centered effort led by investigators at the Center for Data Driven Discovery in Biomedicine at the Children’s Hospital of Philadelphia (CHOP).

Two crucial components of the Kids First project, however, are the teams led by Robert L. Grossman, PhD, and Sam Volchenboum, MD, PhD, at the University of Chicago. Grossman and Volchenboum will play a central role in the technical underpinnings of the large-scale processing and sharing of genomic and clinical data for this important initiative.

Grossman, the Frederick H. Rawson Professor in Medicine and Computer Science and director of the Center for Data Intensive Science at the University of Chicago, heads up an operations center that runs numerous data commons, supporting more than 20,000 researchers across the world every month.

“Platforms that enable researchers to analyze securely large amounts of de-identified clinical and genomic data are one of our most powerful tools for making discoveries that improve children’s lives,” Grossman said.

Grossman’s team is known for its work on the NCI’s Genomic Data Commons (GDC), a federally funded, unified data system that promotes sharing of cancer genomic and clinical data between researchers. The GDC is a core component of the National Institutes of Health’s Precision Medicine Initiative.

Grossman will work closely with Volchenboum, an expert in pediatric cancers and director of the Center for Research Informatics at UChicago. Volchenboum’s team developed the world’s first international pediatric cancer data commons, housing data on more than 19,000 neuroblastoma patients from around the world.

Under their leadership, the Chicago team of engineers and scientists will design and operate the cloud-based, open-source software needed to establish the data coordination center within the Kids First data resource center.

“This is a critical step forward for the pediatric oncology community,” Volchenboum said. “The Kids First data resource center will provide a much-needed resource for pediatric researchers to leverage a large set of genomic and clinical data on children. These data will help us understand why some children develop cancer and how to best stratify and treat their disease.”


Engine for Precision Medicine

The NCI Genomic Data Commons as an engine for precision medicine

Jensen MA, Ferretti V, Grossman RL, Staudt LM. (2017). The NCI Genomic Data Commons as an engine for precision medicine. Blood. 130(4), 453-459. doi:10.1182/blood-2017-03-735654.


The National Cancer Institute Genomic Data Commons (GDC) is an information system for storing, analyzing, and sharing genomic and clinical data from patients with cancer. The recent high-throughput sequencing of cancer genomes and transcriptomes has produced a big data problem that precludes many cancer biologists and oncologists from gleaning knowledge from these data regarding the nature of malignant processes and the relationship between tumor genomic profiles and treatment response. The GDC aims to democratize access to cancer genomic data and to foster the sharing of these data to promote precision medicine approaches to the diagnosis and treatment of cancer.

Figure 3. User workflow. Diagram indicating user steps to authenticate and download GDC data. Red panels indicate the 3 means for accessing data: the Web-based Data Portal, the standalone Data Transfer Tools, and the programmatic API. “Token” is a short text file provided to an authenticated user that acts like a password to enable secure transfer of authorized controlled data, such as sequence alignments.
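
As a rough illustration of the programmatic path in that workflow, the sketch below builds (but does not send) a download request that carries the token. The base URL and the `X-Auth-Token` header name are assumptions based on the public GDC API documentation and may differ from the service as it existed when this was written. The function name and the file UUID are hypothetical.

```python
import urllib.request

# Assumed base URL for GDC file downloads; check the current GDC API docs.
GDC_DATA_ENDPOINT = "https://api.gdc.cancer.gov/data"


def build_download_request(file_uuid, token=None):
    """Prepare a request for one GDC file.

    Open-access files need no token. For controlled data, the token text
    (the contents of the downloaded token file) is sent in a header.
    """
    request = urllib.request.Request(f"{GDC_DATA_ENDPOINT}/{file_uuid}")
    if token:
        # Header name is an assumption taken from the GDC API documentation.
        request.add_header("X-Auth-Token", token.strip())
    return request


# Hypothetical UUID and token; nothing is sent over the network here.
request = build_download_request("00000000-0000-0000-0000-000000000000", "my-token\n")
# To actually download, one would pass the request to urllib.request.urlopen().
```

Because the token acts like a password, it is best read from a file with restricted permissions rather than written into a script.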

GOES-16 Data Now Available

Provisional data from NOAA's GOES-16 satellite is now available through the OCC Environmental Data Commons (EDC).  This is a joint project using data-sharing technology developed by the CDIS team at the University of Chicago.

The GOES-16 data is generated from three types of instruments: Earth sensing, solar imaging, and space environment measuring. In addition to the data, Docker images, Python notebooks and GDAL tools have also been made available as community tools.  

New GDC Data Analysis Tools

The latest Genomic Data Commons release now allows you to look at mutations and other genomic variants across all data in the GDC.  This new set of tools is transforming the GDC from a cancer genomics data repository into an interactive knowledge base.  Users can now interact intuitively with the data, and no downloading is necessary!

ASCO Annual Meeting 2017

The GDC team was at the ASCO Annual Meeting again this year, gathering feedback from oncologists and offering demonstrations of the Genomic Data Commons.  The GDC was showcased along with numerous other innovations in the field to the more than 30,000 oncology professionals from around the world who come together for this meeting every year.  Michael Fitzsimons gave a Meet the Experts talk and showed an early preview of the upcoming data analysis, visualization, and exploration tools.

BioIT World '17

We'll be at the annual Bio-IT World Conference again this year talking about our work building digital ecosystems to use and share biomedical data at scale.  Michael Fitzsimons, PhD, will be speaking on the Data Commons panel on Thursday, May 25th.

Bio-IT World Conference & Expo is building a global network for precision medicine by uniting the Bio-IT community.  It brings together more than 3,300 attendees from 41 countries to navigate the new era of precision medicine and build collaboration across the industry.

Translational Data Science Workshop

The TDS17 Workshop is an important step towards developing a community around translational data science.  Translational data science is a new term for an emerging field that applies data science principles, techniques, and technologies to challenging scientific problems that hold the promise of having an important impact on human or societal welfare.  The term is also used when data science principles, techniques, and technologies are applied to problems in different domains in general, including, but not restricted to, science and engineering research.  The workshop will bring together a group focused on this field to collaborate on writing a white paper on translational data science.

BloodPAC Milestone

We recently achieved a milestone in building out data commons technology for the Blood Profiling Atlas in Cancer (BloodPAC) by adding the first set of users.  BloodPAC is a consortium effort working to accelerate the development and validation of liquid biopsy assays to improve the outcomes of patients with cancer.   We are contributing our data commons technology to build out a collaborative infrastructure that enables sharing of information between stakeholders in industry, academia, and regulatory agencies. 

AACR Annual Meeting April 1-5

GDC Demonstrations

We will be at the AACR Annual Meeting again this year offering demonstrations of the Genomic Data Commons at the NCI Exhibit #1407.  Stop by and see what we've been working on!

Meet the Expert Session

Date: Monday, April 3rd

Time: 10:15AM – 10:45AM ET

Location: Exhibit Booth #1407

Title: Genomic Data Commons Live Demonstration

Presenter: Michael Fitzsimons, Genomic Data Commons