The Entity Registry and Model Repository (ERMR)
A central objective of the PERICLES project is to develop and implement new ways of managing digital assets in evolving ecosystems, across different technologies and software infrastructures, in order to promote the long-term sustainability of collections and to foster collaborative research on shared data.
In this brave new world, greater focus is now being placed on the much larger challenge of exploiting big data to produce transformational breakthroughs across the physical and life sciences and beyond: in drug research and genomics, energy research, the arts and social sciences (and even in uncovering the secrets of the universe).
Such data exploitation requires running complex workflows that turn data from computer simulations into high-impact information.
This situation has created new challenges. Many scientific and engineering workflows are becoming highly complex. They are carried out by large and diverse teams of experts dispersed around the world, who use powerful computers to assimilate the huge amounts of data needed to drive scientific discovery, artistic invention, and innovation.
Information is often independently captured, curated, and transformed across the data life cycle with little communication between the parties in the chain. More often than not, the scientists, engineers, and artists involved become part of the data provenance, although their expertise and knowledge are seldom captured.
When these individuals move on, how do we capture their knowledge and expertise, so we can be assured that scientific results or artistic products are preserved with accuracy and context over time?
As part of the PERICLES project, we have constructed a sandbox infrastructure that underpins the project. It serves not only as the registry of digital entities to be preserved, but also as the long-term repository for the linked resource models that constitute the project’s primary innovation.
We call this infrastructure the Entity Registry and Model Repository (ERMR). The ERMR is a realistic, scalable data management system based on open standards that can be used to capture both data and the provenance of data and workflows.
As such, it represents our best attempt to create a service-oriented system that will support different communities, providing multiple federation layers for effective and scalable data sharing and preservation.
The system itself provides distributed and extensible storage capabilities, including object stores, database-based stores, and file-based triple stores. These capabilities support the modelling of data lifecycle management infrastructure. In PERICLES this modelling is facilitated, across different communities of practice, through the project’s Linked Resource Model (LRM) and related tools.
There are, of course, a number of different preservation systems and approaches available. The ERMR is distinguished by its support for the standards-based Cloud Data Management Interface (CDMI), which allows applications to create, retrieve, update, and delete data elements in the cloud. Based on the Apache Cassandra system, the ERMR platform is modern and built with heterogeneous and distributed data in mind.
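To make the CDMI operations mentioned above more concrete, the following is a minimal sketch of how a client might describe them as HTTP requests. The header name, media type, and body fields follow the CDMI specification; the base URL, the object path, and the helper names `CdmiRequest` and `build_request` are hypothetical and chosen only for illustration, and the sketch says nothing about how the ERMR lays out its own endpoints. It builds request descriptions without contacting any server.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Assumption: the CDMI version the target server speaks.
CDMI_VERSION = "1.1"
# Media type the CDMI specification defines for data objects.
CDMI_OBJECT = "application/cdmi-object"


@dataclass
class CdmiRequest:
    """Plain description of an HTTP request (hypothetical helper type)."""
    method: str
    url: str
    headers: dict
    body: Optional[str] = None


def build_request(base_url, path, operation, value=None, mimetype="text/plain"):
    """Describe a CDMI data-object request without sending it.

    `operation` is one of "create", "retrieve", "update", "delete".
    """
    url = base_url.rstrip("/") + "/" + path.lstrip("/")
    # Every CDMI request carries the specification version header.
    headers = {"X-CDMI-Specification-Version": CDMI_VERSION}

    if operation in ("create", "update"):
        # CDMI uses PUT with a JSON body for both creating and updating.
        headers["Content-Type"] = CDMI_OBJECT
        headers["Accept"] = CDMI_OBJECT
        body = json.dumps({"mimetype": mimetype, "value": value})
        return CdmiRequest("PUT", url, headers, body)
    if operation == "retrieve":
        headers["Accept"] = CDMI_OBJECT
        return CdmiRequest("GET", url, headers)
    if operation == "delete":
        return CdmiRequest("DELETE", url, headers)
    raise ValueError(f"unknown CDMI operation: {operation}")


if __name__ == "__main__":
    # Hypothetical endpoint and object path, for illustration only.
    req = build_request(
        "https://ermr.example.org/api/cdmi",
        "/archive/note.txt",
        "create",
        value="hello",
    )
    print(req.method, req.url)
    print(req.body)
```

A request description like this could be handed to any HTTP client. Keeping the construction separate from the transport makes it straightforward to inspect or test what would be sent before anything touches a live repository.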
Although the ERMR was initially designed for the PERICLES project, we have expressions of interest from a number of scientific and engineering laboratories seeking to take it forward for their own uses. Among the most prominent of these is the virtual engineering community, which is now working out how to design and build the world’s first fusion reactor to put power into the grid.
The legacy of the PERICLES project lies not only in a better understanding of how we can capture and curate the complex workflows which can turn data into knowledge, but also in the creation of tools and a sandbox infrastructure (ERMR) that can assist data managers who need to ensure the trustworthiness of complex data and workflows over time.