Space Science Data Portal
In this blog post we introduce the Space Science Data Portal, an application developed in the context of PERICLES to fit a preservation system into the workflow of the Belgian User Support and Operations Centre (B.USOC), with a particular focus on the SOLAR experiment that is operated by B.USOC.
This case study prototype spawned from PERICLES Work Package (WP) 6, which aims at producing technologies for demonstrating workflows and components that would allow the introduction of a preservation layer into existing repository systems. The prototype brings together several technologies developed in Work Packages 3 (Dependency and change management) and 4 (Capturing content semantics and environment), as it allows digital objects to be ingested into abstract domain-specific models while facilitating the reuse of this data in various existing and new contexts.
We interviewed Rani Pinchuk, Division Manager at Space Applications Services (SpaceApps), to find out more about the Space Science Data Portal.
What is the Space Science Data Portal?
The Space Science Data Portal is an online application that allows a user to visualize, explore, query and augment space science experiment data. It is template- and query-based, allowing full application-specific configuration of both the semantic content that is represented and the look and feel of the web application itself.
This sounds very complex, what is it actually?
In practice, the portal is a website that can be configured to provide a page per digital resource. For example, if the objects to be preserved concern a space science experiment, you may have specific scientific datasets of different types, along with any related information (or metadata). Suppose an operator (the person who operates the experiment) runs a script to process raw data using certain software, and the output is the higher-level data. The operator may then document this work by writing operation reports. The raw data, the processed data, the scripts used, the reports written, and even the information about the operator are all information to be kept in the preservation system. Each of these datasets has a specific type, and many links to other datasets. The portal allows the user to see each of them in dedicated pages designed for that type of data.
So will I have a page for each dataset?
Exactly. And such a page may include much information about the data. This information, the metadata, can include two things: (1) simple facts about the data, such as its type or the date on which it was created, and, typically more important, (2) other relevant digital resources that are related to the data presented on the page: the script that was used, the reports about the creation of the data, and so on.
It sounds useful to see all the information about one digital resource in one place, in one web page. Now I understand why it is called a portal. However, I am not sure how these pages are created.
Each page is created using a template, and a template is defined per digital resource type. In our space science example, you can define a template for a script that is used to process raw data and output higher-level data. This data is stored in our so-called knowledge base, a database created by ingesting digital resources from various application-specific data silos. The template may include all the details that are interesting about such a script (filename, notes, the time it was created, …) and the other relevant digital resources (the raw datasets that were processed by the script, the higher-level datasets that were the output of the script, the operation reports about running the script).
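To make the idea concrete, here is a small sketch of what a per-type template and page rendering could look like. The structure, field names and function names are all hypothetical illustrations, not the portal's actual configuration format:

```python
# Hypothetical sketch of a per-type portal template (not the actual
# PERICLES configuration format). Each resource type gets one template
# listing the metadata fields to show and the named links to other
# resource types.

SCRIPT_TEMPLATE = {
    "type": "script",
    "fields": ["filename", "notes", "created"],   # simple facts
    "links": {                                    # related resources
        "processed_raw_data": "raw_dataset",
        "produced": "higher_level_dataset",
        "documented_by": "operation_report",
    },
}

def render_page(template, resource):
    """Render a minimal text page for one resource from its template."""
    lines = [f"== {resource.get('filename', resource['id'])} =="]
    for field in template["fields"]:
        lines.append(f"{field}: {resource.get(field, 'unknown')}")
    for link_name in template["links"]:
        for target in resource.get(link_name, []):
            # Each target would be a hyperlink to that resource's own page.
            lines.append(f"{link_name} -> {target}")
    return "\n".join(lines)
```

The same template would be applied to every resource of type "script", so all script pages share one layout while showing their own metadata and links.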
How can I use the portal then? How can I visit the pages of the different digital resources?
The other relevant digital resources listed on each page include links to the respective pages describing those resources. So, practically, you can navigate between the portal pages of the different digital resources by following links, as shown in the picture below.
Is the portal useful only to see the information, or can I also update the contents?
This is a good question, as obviously the best approach for data preservation is to keep the data alive, meaning that comments and data about the digital resources, or new digital resources, can be added to the system while retaining traceability and versioning. The feature of updating the data through the portal is in development. The idea is that new pages can be created automatically according to existing templates, and the different fields in them can then be populated, which updates the metadata in the knowledge base.
The only tricky issue here is that I have to organize my data in a certain way – in what you call the knowledge base.
Well, organizing the data is a must when speaking about preservation of data. One cannot expect messy data to be understood by others, today or in the future. Actually, we believe that data preservation is, in many aspects, knowledge management for the long term.
The model that the portal relies on, though, is simple, and it is very close to the way people think. There are things of different types. Continuing the space science example, a specific script is such a thing, and it has the type "script". All things of a certain type have the same metadata: for example, all scripts have a filename, notes and a creation date. And all things of a certain type may have links to other relevant things. For example, a specific script can be linked to the raw data it processed, the higher-level data that resulted, etc.
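This model of typed things with metadata and named links can be sketched as plain data. The identifiers, field names and helper below are illustrative only, not the actual PERICLES schema:

```python
# A minimal sketch of the knowledge-base model described above: typed
# "things" with per-type metadata and named links to other things.
# All identifiers and names here are invented for illustration.

knowledge_base = {
    "calibrate.py": {
        "type": "script",
        "filename": "calibrate.py",
        "notes": "calibration script for SOLAR raw data",
        "created": "2015-03-01",
        "links": {"processed": ["raw-0042"], "produced": ["l1-0042"]},
    },
    "raw-0042": {"type": "raw_dataset", "links": {}},
    "l1-0042": {"type": "higher_level_dataset", "links": {}},
}

def things_of_type(kb, type_name):
    """All things of a certain type share the same metadata fields."""
    return [key for key, thing in kb.items() if thing["type"] == type_name]
```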
Modelling the data is simple, but it may require much work, as in our experience the initial situation is, in most cases, quite messy.
And don’t you have ways to facilitate this work?
Well, the feature of updating the model through the portal, which is in development, is intended to facilitate this work, as it provides an environment for managing and editing the metadata.
But in addition, we do have software components that facilitate the ingestion process into the knowledge base used by the portal. These components are called connectors, and they connect to existing data stores to import their data into the knowledge base. To configure such a connector, the general data model has to be defined first: what types of data are available in the data store, and what metadata and links to other types can be found.
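A connector of this kind could be sketched as follows. The mapping format, field names and example silo are all hypothetical, invented only to illustrate the idea of importing from an existing data store into the knowledge base:

```python
# Hypothetical connector sketch: a connector reads records from one
# existing data store (here faked as a list of dicts) and ingests them
# into the knowledge base, guided by a per-store mapping that says
# which type the records are and which fields are metadata vs. links.
# The mapping format is invented for illustration.

def ingest(silo_records, mapping, knowledge_base):
    """Copy records from a data silo into the unified knowledge base."""
    for record in silo_records:
        thing = {"type": mapping["type"], "links": {}}
        for src_field, meta_field in mapping["metadata"].items():
            thing[meta_field] = record[src_field]
        for src_field, link_name in mapping["links"].items():
            thing["links"][link_name] = record.get(src_field, [])
        knowledge_base[record[mapping["id"]]] = thing

# Example mapping for a silo of operation reports.
report_mapping = {
    "id": "report_id",
    "type": "operation_report",
    "metadata": {"author": "operator", "written": "created"},
    "links": {"about_script": "documents"},
}
```

Each existing data store would get its own mapping, so the connectors translate heterogeneous silos into the one shared model.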
Since we spoke about ingestion, I wonder how preservation packages are dealt with.
This is a good question, as it relates both to the data model that was described and to the portal. Usually we want to package the different digital resources together with the relevant data, so that in the long run, when the data is perhaps transferred elsewhere, the relevant information stays with it. The data model described above documents the relevant information of each digital resource, and this information is shown on each portal page. In fact, the links between the digital resources represent, in most cases, dependencies: for example, one cannot reuse or learn from the higher-level data if the script that processed the raw data is not available, or if the raw data itself is missing. These dependencies are very important to document, as they help us understand the risks in the data model: which digital resource is needed to understand and reuse other digital resources. A model that includes these dependencies is therefore very useful for designing the packages.
So, in a way, a package of a certain digital resource can be seen as a portal page that brings together the digital resource and the other relevant digital resources, according to a certain template defined for the type of the item we package. Navigating the portal can therefore help us define the package templates, and the templates themselves are similar to the portal templates.
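Assembling such a package amounts to following the dependency links from one resource and collecting everything reachable. The sketch below illustrates this; the data layout and names are hypothetical, matching no particular PERICLES component:

```python
# Sketch of using dependency links to assemble a preservation package:
# starting from one resource, follow only the link names that the
# package template declares as dependencies, and collect everything
# reachable. Data layout and names are illustrative only.

def collect_package(kb, root_id, dependency_links):
    """Return the set of resource ids that must travel with root_id."""
    package, frontier = set(), [root_id]
    while frontier:
        current = frontier.pop()
        if current in package:
            continue  # already collected; also guards against cycles
        package.add(current)
        links = kb.get(current, {}).get("links", {})
        for name, targets in links.items():
            if name in dependency_links:
                frontier.extend(targets)
    return package
```

For example, packaging a higher-level dataset would pull in the script that produced it and, transitively, the raw data that the script processed, so the package stays reusable on its own.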
Finally, and obviously, one can include a link from a portal page to the relevant package(s).
So a last question – how does this relate to the OAIS model?
You mainly asked about the portal, but it might be useful to briefly explain the data model we use. We call it the graph-oriented preservation architecture. Its main premise is that semantics, dependencies, workflows and procedures, and even policies and constraints are all modelled in a single unified graph-based representation. In such a graph, the concepts and data are represented as nodes, and their interrelationships as named edges.
As can be seen in the picture above this architecture incorporates the OAIS model.
- Digital resources can be linked in a meaningful way creating a semantic model.
- Some of the semantic links between resources can represent dependencies. Part of the graph can therefore form a dependency graph.
- Policies are hierarchical in nature, and therefore form a graph as well. Policies and processes relate to digital resources, and therefore the policy graph is merged with the semantic and dependency graphs.
- Workflows are built from related steps which, again, form a graph. Moreover, workflows are related to both policies and digital resources.
- The connectors populate the graph (by copying the data or creating references to the data in the graph).
- The relations between digital resources make it possible to automatically extract the context of a resource for preservation packages.
The portal exposes the graph, and therefore allows the user to navigate between the digital resources and understand the relationships and dependencies between them.
The portal is proprietary, tailored software. If you are interested in finding out more, please contact David De Weerdt from Space Applications Services at email@example.com or by telephone at +32-(0)2-721.54.84.