Science and Technology Facilities Council (STFC) workshop on Data Management and Transfer
On 4 February 2015, the Science and Technology Facilities Council (STFC) held a workshop on Data Management and Transfer, which set out to address some of the fundamental challenges relating to the long-term management of scientific research data.
The meeting was convened to establish how the core PERICLES technologies relate to a number of complementary data management projects undertaken at STFC, and to those for archiving and managing Magnetic Confinement Fusion data at the Culham Centre for Fusion Energy (CCFE/EPSRC), which currently holds around 10 petabytes of data. CCFE is now directing its focus towards the data volumes planned in the longer term for the Big Data ITER experiment being constructed in Cadarache, France (data rates of 40 GB/s over 400 s pulses).
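To give a feel for the scale these figures imply, the per-pulse volume can be worked out directly from the quoted rate and pulse length. The short sketch below is illustrative only: it assumes the 40 GB/s rate is sustained for the full 400 s pulse, uses decimal units (1 TB = 1000 GB, 1 PB = 1000 TB), and the function and constant names are hypothetical.

```python
def pulse_volume_tb(rate_gb_per_s: float, pulse_seconds: float) -> float:
    """Return the data volume of a single pulse in terabytes (decimal units)."""
    return rate_gb_per_s * pulse_seconds / 1000.0


# Figures quoted in the text above.
ITER_RATE_GB_PER_S = 40.0
ITER_PULSE_SECONDS = 400.0
CCFE_ARCHIVE_TB = 10_000.0  # "around 10 petabytes", taken as exactly 10 PB

# Assumption: the peak rate holds for the whole pulse.
per_pulse_tb = pulse_volume_tb(ITER_RATE_GB_PER_S, ITER_PULSE_SECONDS)

# Number of such pulses needed to equal the current CCFE holdings.
pulses_to_match_archive = CCFE_ARCHIVE_TB / per_pulse_tb

print(f"Per pulse: {per_pulse_tb:.0f} TB")
print(f"Pulses to match the current archive: {pulses_to_match_archive:.0f}")
```

Under these assumptions a single pulse produces about 16 TB, so roughly 625 pulses would match the volume currently held at CCFE.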
The central challenge, identified by STFC, is to implement a long-term data preservation environment that can manage large collections of scientific data, replicated across national research projects, which may form the basis of international collaborations.
The overall aim is not simply to place data in a repository, but to share large quantities of raw, pre-processed, and post-processed data across many laboratories and partners on a regular basis. This requirement has highlighted many challenges. Access to scientific data requires sophisticated access control policies, and innovative methods are needed to capture the various steps of scientific experiments (workflows), so that they can be re-run in future to reproduce results with some degree of authenticity and integrity (“reproducible science”).
CCFE and STFC have already developed a range of core technologies that help address many of these challenges. These include a “Provenance Metadata Gathering System”, which is designed to capture and record the processes by which derived data are produced; the Integrated Data Access Management (IDAM) tool, for data access, analysis, visualization, and modeling; and the JET toolkit for data acquisition, analysis, and archiving. Of particular interest is current work on capturing the provenance of scientific workflows, so that experiments can be replicated in future over different storage technologies.
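The general idea of recording workflow provenance so that a run can later be checked can be sketched in a few lines. The example below is not the Provenance Metadata Gathering System or any CCFE/STFC code; it is a minimal illustration that assumes each workflow step is a deterministic function from bytes to bytes, and all class, function, and step names are hypothetical. Each step is logged with SHA-256 checksums of its input and output, and a later re-run is compared against the stored log.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Callable, List


def sha256_hex(data: bytes) -> str:
    """Checksum used to identify a data object independently of where it is stored."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class StepRecord:
    """Provenance entry for one workflow step (hypothetical schema)."""
    name: str
    input_sha256: str
    output_sha256: str


@dataclass
class Workflow:
    """Runs steps in order and records what each one consumed and produced."""
    records: List[StepRecord] = field(default_factory=list)

    def run_step(self, name: str, func: Callable[[bytes], bytes], data: bytes) -> bytes:
        result = func(data)
        self.records.append(StepRecord(name, sha256_hex(data), sha256_hex(result)))
        return result

    def to_json(self) -> str:
        return json.dumps([asdict(r) for r in self.records], indent=2)


def verify_rerun(original: List[StepRecord], rerun: List[StepRecord]) -> bool:
    """A re-run reproduces the original if every step saw and produced the same bytes."""
    return original == rerun


# Two toy processing steps standing in for real analysis codes.
def calibrate(data: bytes) -> bytes:
    return data.replace(b".0", b".5")


def summarise(data: bytes) -> bytes:
    values = [float(v) for v in data.decode().split(",")]
    return str(sum(values)).encode()


def run_pipeline(raw: bytes) -> Workflow:
    wf = Workflow()
    calibrated = wf.run_step("calibrate", calibrate, raw)
    wf.run_step("summarise", summarise, calibrated)
    return wf


raw_data = b"1.0,2.0,3.0"
first_run = run_pipeline(raw_data)
second_run = run_pipeline(raw_data)  # e.g. years later, from a different store

print(first_run.to_json())
print("Reproduced:", verify_rerun(first_run.records, second_run.records))
```

Because the records refer to data by checksum rather than by location, the same comparison works regardless of which storage technology holds the bytes, which is the property the paragraph above is concerned with.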
The STFC and CCFE presentations resulted in the articulation of “aspirations”, which focused on the identification of an object storage technology and a data archive for nuclear fusion. This led into a discussion of how the PERICLES project goes further than object storage technologies by incorporating tables and fact stores.
John Burns, Paul Watry, and Jerome Fuselier’s presentation focused on the PERICLES Entity store, which is envisaged as a federated system providing access to distributed repositories of scientific and experimental data. The presentation traced the evolution of PERICLES from a mainly iRODS-based system to the latest version, which can potentially support any storage technology, including Ceph (a requirement for STFC and CCFE). It also demonstrated the use of the PERICLES tool to manage the data life cycle, or continuum, across shared collections in ways that might foster collaboration and data re-use. Part of this discussion was based on the project’s information model (LRM), which depicts the semantics of the data and their interrelationships and may be used as the basis for automatically deriving and annotating links; and on the QA approach, which is intended to ensure that the right data are easily accessible and that obsolete data are removed or refreshed.
Talks are now taking place to determine how many of the complementary technologies can be integrated, as a means of rapidly developing innovations in the curation of scientific data, and to explore how this discussion might be extended to promote observational and space science data, which are common to each.