We believe that data must be open and accessible within the research community to collectively achieve the critical mass of data necessary to power data-driven research, insight, and discovery.
We believe that collaboration creates a knowledge pool that not only drives better software development, but also connects us to an active community in pursuit of shared social impact. We have long benefitted from open-source software and are committed to contributing to future generations of software and scholars.
We believe that rapid innovation is most effectively achieved through an open infrastructure environment where portability and compatibility are maximized and knowledge is distributed broadly.
Chicago Data Commons Model
1. Permanent Digital IDs
The data commons must have a digital ID service, and datasets in the data commons must have permanent, persistent digital IDs. Associated with each digital ID are access controls, which specify who can access the data, and metadata, which provides additional information about the data. Part of this requirement is that data can be accessed from the data commons through an API by specifying its digital ID.
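The requirement can be sketched as a minimal in-memory service. The class, method names, and record layout below are illustrative assumptions, not a real data commons API: a dataset is registered once, receives a permanent digital ID, and is thereafter retrieved through the API by that ID, subject to its access controls.

```python
import uuid

# Hypothetical digital ID service (in-memory sketch; names are assumptions).
class DigitalIDService:
    def __init__(self):
        self._records = {}

    def register(self, data, metadata, allowed_users):
        """Assign a permanent digital ID, with access controls and metadata."""
        digital_id = str(uuid.uuid4())
        self._records[digital_id] = {
            "data": data,
            "metadata": metadata,
            "allowed_users": set(allowed_users),
        }
        return digital_id

    def get(self, digital_id, user):
        """API access by digital ID, enforcing the access controls."""
        record = self._records[digital_id]
        if user not in record["allowed_users"]:
            raise PermissionError(f"{user} may not access {digital_id}")
        return record["data"]

service = DigitalIDService()
did = service.register(b"sequencing-reads", {"study": "example"}, ["alice"])
print(service.get(did, "alice"))  # b'sequencing-reads'
```

The key property is that the ID, not a file path or URL, is the permanent handle clients hold.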
2. Permanent Metadata
There must be a metadata service that returns the associated metadata for each digital ID. Because the metadata can be indexed, this provides a basic mechanism for the data to be discoverable.
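A sketch of such a metadata service follows; the catalog layout and query interface are illustrative assumptions. Metadata is keyed by digital ID, and a simple field match over the index makes datasets discoverable by query.

```python
# Hypothetical metadata service (in-memory sketch; names are assumptions).
class MetadataService:
    def __init__(self):
        self._by_id = {}  # digital ID -> metadata dict

    def put(self, digital_id, metadata):
        self._by_id[digital_id] = metadata

    def get(self, digital_id):
        """Return the permanent metadata associated with a digital ID."""
        return self._by_id[digital_id]

    def query(self, **criteria):
        """Discovery: return all digital IDs whose metadata matches."""
        return [
            did for did, md in self._by_id.items()
            if all(md.get(k) == v for k, v in criteria.items())
        ]

svc = MetadataService()
svc.put("id-001", {"assay": "RNA-seq", "organism": "human"})
svc.put("id-002", {"assay": "WGS", "organism": "human"})
print(svc.query(assay="RNA-seq"))  # ['id-001']
```

A production index would of course use a search engine or database rather than a linear scan, but the contract is the same: metadata in, matching digital IDs out.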
3. API-based Access
Data must be accessible through an API, not just by browsing through a portal. Part of this requirement is that a metadata service can be queried to return a list of digital IDs that can then be retrieved via the API. For data commons that contain controlled-access data, another component of the requirement is an authentication and authorization service, so that users can first be authenticated and the data commons can then check whether they are authorized to access the data.
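For controlled-access data, the flow above reduces to two checks before any bytes are returned. The token table, grant table, and function name below are assumptions for illustration; a real commons would delegate authentication to a standard identity provider.

```python
# Sketch of authenticated, authorized API access (all names are assumptions).
TOKENS = {"tok-123": "alice"}        # authentication: token -> user
GRANTS = {("alice", "did:100")}      # authorization: (user, digital ID) pairs
DATA = {"did:100": b"controlled-access-bytes"}

def fetch(digital_id, token):
    # Step 1: authenticate the caller.
    user = TOKENS.get(token)
    if user is None:
        raise PermissionError("authentication failed")
    # Step 2: check that this user is authorized for this dataset.
    if (user, digital_id) not in GRANTS:
        raise PermissionError("not authorized for this dataset")
    return DATA[digital_id]

print(fetch("did:100", "tok-123"))  # b'controlled-access-bytes'
```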
4. Data Portability
The data must be portable in the sense that a dataset in a data commons can be transported to another data commons and be hosted there. In general, if data access is through digital IDs (versus referencing the data's physical location), then software that references data shouldn't have to be changed when data is rehosted by a second data commons.
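The reason ID-based access makes data portable can be shown in a few lines. In this sketch (the resolver and location table are hypothetical), client code refers only to the digital ID; rehosting the dataset changes the resolver's mapping, not the client.

```python
# Hypothetical resolver mapping digital IDs to current physical locations.
LOCATIONS = {"did:42": "https://commons-1.example.org/data/42"}

def resolve(digital_id):
    return LOCATIONS[digital_id]

def analyze(digital_id):
    # Client code never embeds a physical location, only the digital ID.
    url = resolve(digital_id)
    return f"fetching {url}"

before = analyze("did:42")
LOCATIONS["did:42"] = "https://commons-2.example.org/data/42"  # rehosted
after = analyze("did:42")  # same client code, new location
```

Had `analyze` hard-coded the commons-1 URL, every downstream script would break when the dataset moved; with the indirection, only the resolver entry changes.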
5. Data Peering
By “data peering,” we mean an agreement between two data commons service providers to transfer data at no cost, so that a researcher at data commons 1 can access data in data commons 2. In other words, the two data commons agree to transport research data between them with no access charges, no egress charges, and no ingress charges.
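As a toy cost model (the commons names and the non-peered egress rate are illustrative assumptions), peering simply zeroes out transfer charges between the two parties to the agreement:

```python
# Peering agreements: unordered pairs of commons that waive transfer charges.
PEERS = {frozenset({"commons-1", "commons-2"})}
EGRESS_PER_GB = 0.09  # illustrative non-peered egress rate, USD per GB

def transfer_cost(src, dst, gigabytes):
    if frozenset({src, dst}) in PEERS:
        return 0.0  # peered: no access, egress, or ingress charges
    return gigabytes * EGRESS_PER_GB

print(transfer_cost("commons-1", "commons-2", 500))  # 0.0
```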
6. Pay for Computing
Because, in practice, researchers' demand for computing resources exceeds the available computing resources, computing must be rationed, either through allocations or by charging for use. Notice the asymmetry in how a data commons treats storage and computing infrastructure. When data is accepted into a data commons, there's a commitment to store it and make it available for a certain period of time, often indefinitely. In contrast, computing over data in a data commons is rationed in an ongoing fashion, as are the working storage and the storage required for derived data products, either by providing computing and storage allocations for this purpose or by charging for them. For simplicity, we refer to this requirement as “pay for computing,” even though the model is more complicated than that.
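The allocation side of this rationing can be sketched as a simple draw-down account. The unit (core-hours) and the accounting scheme are illustrative assumptions; hosted storage is committed up front, while each job charges against the researcher's remaining allocation.

```python
# Sketch of a rationed compute allocation (names and units are assumptions).
class Allocation:
    def __init__(self, core_hours):
        self.remaining = core_hours

    def charge(self, core_hours):
        """Deduct a job's usage; refuse jobs that exceed the allocation."""
        if core_hours > self.remaining:
            raise RuntimeError("allocation exhausted; request more or pay")
        self.remaining -= core_hours
        return self.remaining

alloc = Allocation(1000)
alloc.charge(250)       # run a job costing 250 core-hours
print(alloc.remaining)  # 750
```

A pay-as-you-go commons would replace the hard cap with metered billing, but either way the compute side is rationed while the archival storage commitment is not.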