April 28, 2015 | Vas Vasiliadis

We've talked a lot recently about data publication and inevitably the conversation becomes one of definition and semantics. For many, data publication means making publicly available the data used to support the results described in a published paper. For some, it means making data selectively available to other investigators within a research discipline or community. For others, it means sharing research findings with collaborators at multiple institutions throughout the course of a project. And, as you might imagine, there are myriad scenarios in between these three. So to guide these conversations we developed a simple framework that defines the dimensions to be considered as we build data publication capabilities into the Globus service.

The framework is summarized in the figure below and comprises data identification, description, curation, access, and preservation. When considering a publication use case, we should understand the requirements along each of these dimensions and then decide on how best to implement the system and related workflows.

[Figure: Globus Data Publication Framework]

Identification is arguably the most critical dimension. In its simplest form this may be a URL that points to a page somewhere on your department's web server; while this may be sufficient for informal publication of data within a project team, it seldom works for long-term preservation. Using a persistent identifier such as a DOI is usually the way to go in a more formal publication context.

Description usually comprises metadata associated with the data, and it runs the gamut from no metadata to extensive, domain-specific metadata. Standard schemas such as Dublin Core are often sufficient for data supporting a published paper, but adding more customized metadata may facilitate reproducibility, increase re-use, and improve provenance (and in some cases may be required by specific preservation mandates).
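To make the split between standard and customized metadata concrete, here is a minimal sketch in Python. It is not part of the Globus service or any Globus API; the `DatasetDescription` class, its field names, and the sample values (including the placeholder identifier) are hypothetical and exist only for illustration. The one thing taken from outside this post is the list of the fifteen element names in the Dublin Core Metadata Element Set.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The fifteen element names of the Dublin Core Metadata Element Set.
DUBLIN_CORE_ELEMENTS = frozenset({
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
})


@dataclass
class DatasetDescription:
    """Hypothetical container: standard fields plus domain-specific extras."""
    core: Dict[str, str] = field(default_factory=dict)    # Dublin Core fields
    domain: Dict[str, str] = field(default_factory=dict)  # customized fields

    def unknown_core_fields(self) -> List[str]:
        """Return keys in `core` that are not Dublin Core element names."""
        return sorted(set(self.core) - DUBLIN_CORE_ELEMENTS)


# Illustrative values only; the identifier below is a placeholder, not a real DOI.
record = DatasetDescription(
    core={
        "title": "Example sequencing run",
        "creator": "Example Lab",
        "date": "2015-04-28",
        "identifier": "doi:10.0000/example",
    },
    domain={"sequencing_platform": "hypothetical-sequencer"},
)

# A record that mistakenly uses "author" instead of the standard "creator".
mislabeled = DatasetDescription(core={"title": "Untitled", "author": "Someone"})

print(record.unknown_core_fields())
print(mislabeled.unknown_core_fields())
```

Keeping the standard fields apart from the domain-specific ones means a repository could validate the former against a shared schema while leaving each community free to define the latter.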

Curation is sometimes defined by broader policies that require review and validation of data before an institution is willing to put its name on it—this tends to be the case in digital preservation efforts managed by libraries. In less formal settings, a cursory look by a project team may be all that's required before the data are made available to others. Needless to say, curation workflows can be quite complex!

Access is another dimension that is dictated by policy. For example, many federally funded research projects are required to make data publicly available. At the other end of this scale, an investigator may restrict access to just a few colleagues while research is ongoing. The ability to define arbitrarily complex (or simple!) access policies is a key aspect of the Globus service.
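The two ends of that scale can be sketched in a few lines of Python. This is only an illustration of the idea, not how Globus implements access control; the `AccessPolicy` class, its fields, and the example addresses are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class AccessPolicy:
    """Hypothetical policy: either public, or limited to named users."""
    public: bool = False
    allowed_users: Set[str] = field(default_factory=set)

    def can_read(self, user: str) -> bool:
        """A user may read if the data are public or the user is listed."""
        return self.public or user in self.allowed_users


# Publicly available data, as a funding mandate might require.
open_policy = AccessPolicy(public=True)

# Data restricted to a few colleagues while research is ongoing.
restricted = AccessPolicy(
    allowed_users={"alice@example.org", "bob@example.org"}
)

print(open_policy.can_read("anyone@example.org"))
print(restricted.can_read("alice@example.org"))
print(restricted.can_read("carol@example.org"))
```

Real policies are rarely this simple (groups, embargo dates, and per-file rules all come up), but the same question sits underneath them: given this user and this dataset, is access allowed?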

Preservation has a number of underlying considerations, including durability of the storage medium, format of the published data, and reliability of the organization hosting the publication repository. Inevitably, there are tradeoffs among these considerations (unless money is no object!) but many robust options exist. For instance, in genomics research an investigator may choose to preserve raw output from a next-generation sequencer using a highly durable service like Amazon Glacier. At the same time, the investigator may publish intermediate analyses on a campus storage system, which is less durable but much more accessible to collaborators.

In the Globus service we strive to address many (but certainly not all) of these data publication scenarios. Our initial release supports a subset of them, but we can address a much broader set of requirements through customization. More info on Globus data publication functionality...