Pericles project
Promoting and Enhancing Reuse of Information throughout the Content Lifecycle taking account of Evolving Semantics


Exploring the potential of Information Encapsulation techniques




Information Encapsulation (IE) means the aggregation of information that belongs together. It can be carried out for diverse purposes at many different stages of the information lifecycle. In the field of Digital Preservation (DP), the main IE scenario is the aggregation of a Digital Object (DO) with its metadata. IE techniques are usually domain independent, but they differ in their features, so a technique can be selected that best fits the requirements of a given scenario.

This is one of the areas of work of PERICLES and we are pleased to announce the release of PeriCAT, an open-source tool which provides a set of IE techniques and a mechanism to capture the user scenario and to suggest the best fitting technique based on it.

PeriCAT [1] can be used in a sheer curation scenario. In this scenario a tool such as the PERICLES Extraction Tool (PET) [2] extracts information from the environment in which Digital Objects are created, to support the reusability of these objects. IE ensures that the information remains accessible even if the Digital Object leaves its creation environment. PeriCAT is thus a tool that supports the creation of self-describing objects and facilitates the long-term reusability of information.

Overview of the techniques

IE techniques can be divided into two main categories: Information Embedding and Packaging. Packaging refers to the aggregation of information, such as files or streams, as equal entities stored in an information container. In contrast, information embedding needs a carrier information entity in which the payload information is embedded.

Packaging techniques

simple.png

https://www.flickr.com/photos/stoolhog/8153406149 (CC BY-SA 2.0)

Simple archive formats used for packaging are, for example, ZIP and tar. Such container file formats aggregate other files in a single package. File containers are often combined with compression or encryption methods.
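As a rough illustration of simple packaging, the following sketch aggregates a Digital Object and a metadata file in one ZIP container and reads them back. It uses only the Python standard library; the file names and contents are made up for the example and the code is not taken from PeriCAT.

```python
import io
import zipfile


def package(files: dict[str, bytes]) -> bytes:
    """Aggregate named byte strings as equal entries of one in-memory ZIP container."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


def unpackage(container: bytes) -> dict[str, bytes]:
    """Read every entry back out of the container."""
    with zipfile.ZipFile(io.BytesIO(container)) as archive:
        return {name: archive.read(name) for name in archive.namelist()}


# Hypothetical example data: a Digital Object and its metadata file.
files = {"report.txt": b"digital object", "report.meta.xml": b"<meta/>"}
container = package(files)
restored = unpackage(container)
```

The restored entries are byte-identical to the originals, which is the property that makes packaging attractive for preservation scenarios.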

metadata.png

https://www.flickr.com/photos/quinnanya/7206706162 (CC BY-SA 2.0)

Especially in Digital Preservation it is common to add metadata files with a well-defined schema to archive packages. Examples are metadata files that follow the METS or OAI-ORE schemata, aggregated together with the Digital Objects in a container file.

structured.png

https://www.flickr.com/photos/oskay/2716368532 (CC BY 2.0)

Structured packaging encompasses standards which prescribe a well-defined directory and file structure. A popular example is the BagIt format.
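To give an idea of what such a structure looks like, here is a minimal sketch that writes a BagIt-style bag by hand: a bagit.txt declaration, a data directory for the payload, and a SHA-256 manifest. It assumes flat payload file names and is not a complete implementation or validator of the BagIt specification; the function names and example files are made up for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def make_bag(bag_dir: Path, payload: dict[str, bytes]) -> None:
    """Write a minimal BagIt-style bag: declaration, payload directory, manifest."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)
    manifest_lines = []
    for name, content in sorted(payload.items()):
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )


def verify_bag(bag_dir: Path) -> bool:
    """Recompute each payload checksum and compare it with the manifest entry."""
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        digest, relative_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag_dir / relative_path).read_bytes()).hexdigest()
        if actual != digest:
            return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    bag = Path(tmp) / "example_bag"
    make_bag(bag, {"image.tif": b"pixels", "notes.txt": b"context"})
    bag_is_valid = verify_bag(bag)
    bag_entries = sorted(p.name for p in bag.iterdir())
```

Because the manifest travels with the payload, anyone who receives the bag can check that no file was altered, without additional tooling.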

There are more specialised structured file containers for specific purposes, e.g. for the aggregation of all files which belong to a movie, or for the aggregation of all sources of a software application.

Embedding techniques

All information embedding techniques distinguish between the information that serves as carrier and the payload information that is embedded into the carrier. However, the embedding procedures and the features of the techniques are very diverse.

 

watermark.png

https://www.flickr.com/photos/ayca13/13742383254 (CC BY 2.0)

Digital Watermarking means the attachment of a payload image to a carrier image. The payload can either be attached visibly on top of the carrier, or embedded inside the carrier as a fragile, invisible watermark.

A watermark adds information to the carrier image, or serves as proof of the authenticity and origin of the image.

steganography.png

https://www.flickr.com/photos/70006548@N04/6359363281 (CC BY 2.0)

Steganography is the hiding of messages and other information in carrier information, whereby the focus lies on the payload and the carrier is of less importance.

Examples of steganographic algorithms for images are F5 and algorithms that change the Least Significant Bit (LSB).
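The idea behind Least Significant Bit embedding can be sketched in a few lines. The example below treats the carrier as a plain sequence of 8-bit samples (stand-ins for pixel values) and overwrites the lowest bit of each sample with one payload bit. It is a simplified illustration, not the algorithm used by any particular tool, and the extractor assumes the payload length is known.

```python
def embed_lsb(carrier: bytes, payload: bytes) -> bytes:
    """Hide the payload bits in the least significant bit of each carrier sample."""
    bits = [(byte >> shift) & 1 for byte in payload for shift in range(7, -1, -1)]
    if len(bits) > len(carrier):
        raise ValueError("payload exceeds the capacity of the carrier")
    stego = bytearray(carrier)
    for index, bit in enumerate(bits):
        stego[index] = (stego[index] & 0xFE) | bit  # clear the lowest bit, then set it
    return bytes(stego)


def extract_lsb(stego: bytes, payload_length: int) -> bytes:
    """Rebuild payload_length bytes from the lowest bits of the first samples."""
    out = bytearray()
    for start in range(0, payload_length * 8, 8):
        value = 0
        for sample in stego[start:start + 8]:
            value = (value << 1) | (sample & 1)
        out.append(value)
    return bytes(out)


carrier = bytes(range(200, 232))  # 32 stand-in pixel samples
stego = embed_lsb(carrier, b"DP")
recovered = extract_lsb(stego, 2)
max_change = max(abs(a - b) for a, b in zip(carrier, stego))
```

Each sample changes by at most one, which is why the payload is hard to see. The original lowest bits are overwritten, however, so the carrier cannot be restored from the result alone.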

features.png

https://www.flickr.com/photos/vancouverfilmschool/4839166526 (CC BY 2.0)

Features of file formats can be used for the embedding of information. These can either be intended features, such as the mechanism to attach files to a PDF file, or structural properties of the file format which can be exploited, such as the possibility to add text at the end of JPEG files.
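The JPEG case can be sketched as follows, assuming that a decoder stops at the end-of-image marker and ignores trailing bytes, which is typical but not guaranteed for every reader. The tag and length trailer are a hypothetical convention invented for this example so that the payload can be located again. The carrier here is a stand-in byte string, not a decodable image.

```python
SOI = b"\xff\xd8"   # JPEG start-of-image marker
EOI = b"\xff\xd9"   # JPEG end-of-image marker
MAGIC = b"IEPL"     # hypothetical tag marking an appended payload
TRAILER_SIZE = len(MAGIC) + 8


def append_payload(jpeg: bytes, payload: bytes) -> bytes:
    """Append the payload after the image data, followed by a tag and length trailer."""
    if not (jpeg.startswith(SOI) and jpeg.endswith(EOI)):
        raise ValueError("carrier does not look like a complete JPEG stream")
    return jpeg + payload + MAGIC + len(payload).to_bytes(8, "big")


def split_payload(combined: bytes) -> tuple[bytes, bytes]:
    """Return (original carrier bytes, payload) by reading the trailer."""
    trailer = combined[-TRAILER_SIZE:]
    if len(trailer) != TRAILER_SIZE or not trailer.startswith(MAGIC):
        raise ValueError("no appended payload found")
    length = int.from_bytes(trailer[len(MAGIC):], "big")
    body = combined[:-TRAILER_SIZE]
    if length > len(body):
        raise ValueError("trailer length is inconsistent with the data")
    return body[:len(body) - length], body[len(body) - length:]


fake_jpeg = SOI + b"stand-in image data" + EOI  # not a real image
combined = append_payload(fake_jpeg, b"creator=PET; created=unknown")
restored_carrier, restored_payload = split_payload(combined)
```

Since the carrier bytes are never modified, removing the appended part yields the original file again.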

frame.png

https://www.flickr.com/photos/oskay/156242299 (CC BY 2.0)

Information frames extend the carrier medium with additional space in which the payload is embedded. Examples are additional pixels added to an image, closing credits added to a movie, or text appended to a text file.

These techniques can be combined with other embedding techniques, e.g. to embed information in the information frame using steganography.

The IE techniques have different features and the technique to be used should be chosen based on the scenario requirements.

Choice of a technique

Many criteria have to be considered to choose the best fitting IE technique for a user scenario. Technical criteria must be fulfilled to be able to use an IE technique for a given dataset, e.g. the IE algorithm must be able to handle the file formats, and some techniques have constraints on the payload capacity.

In contrast to the technical constraints, scenario criteria are requirements specified by the user scenario. They encompass requirements on the algorithm regarding:

  • processability, robustness
  • time and space complexity
  • used disk space
  • restorability
  • risk of data loss
  • visibility, detectability
  • location of the encapsulated information
  • adoption, standards
  • security, confidentiality, authenticity

See the PERICLES deliverable D4.2 “Encapsulation of environmental information” [3] for a detailed description of the criteria.

The most desired requirement for a DP scenario is that the DO remains consistent in every bit. For IE techniques this means that they have to ensure that the DO either remains untouched or can be restored correctly.

Information Decapsulation is the process of separating encapsulated information entities from each other. We identified a set of metadata that facilitates the correct restoration of the original information, called Restoration Metadata. This metadata set encompasses checksums of the carrier and payload information, original file paths, storage location, encoding, and algorithm configuration parameters.
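A record of this kind could look like the sketch below. The class and field names are hypothetical and chosen to mirror the list above; they are not PeriCAT's actual data model. SHA-256 is used as the checksum, and the record is serialised as JSON so that it can be stored alongside the encapsulated data.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


@dataclass
class RestorationMetadata:
    """Hypothetical record of what is needed to verify a restoration."""
    carrier_checksum: str
    payload_checksum: str
    carrier_path: str
    payload_path: str
    storage_location: str
    encoding: str
    algorithm: str
    parameters: dict = field(default_factory=dict)

    def matches(self, carrier: bytes, payload: bytes) -> bool:
        """True if the restored data is identical to the original in every bit."""
        return (sha256_hex(carrier) == self.carrier_checksum
                and sha256_hex(payload) == self.payload_checksum)


carrier, payload = b"carrier bytes", b"payload bytes"
metadata = RestorationMetadata(
    carrier_checksum=sha256_hex(carrier),
    payload_checksum=sha256_hex(payload),
    carrier_path="images/photo.png",       # made-up example paths
    payload_path="metadata/photo.xml",
    storage_location="local",
    encoding="utf-8",
    algorithm="example-technique",
    parameters={"compression": "none"},
)
serialized = json.dumps(asdict(metadata))
reloaded = RestorationMetadata(**json.loads(serialized))
```

After decapsulation, calling matches on the restored carrier and payload tells whether the restoration is checksum valid.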

 

Decapsulation.png

 

Packaging techniques mostly fulfill the requirement that the packed objects can be restored so that each restored object is equal to the original object in all data bits, verifiable by a checksum. This is very different for steganography techniques, in which the carrier is usually altered irreversibly. How can such a technique be used in a scenario that requires a checksum-valid restoration?

One approach to a solution is a technique developed in PERICLES called Information Frames. It is in principle similar to the closing credits of a movie: the object is expanded with additional space in which information can be embedded.

 

imageImageInfoFrame.png

 

The figure above shows an information frame used for encapsulating two images. The pixels of the carrier image remain untouched, but the image is expanded by additional pixels which contain the payload image, and by pixels for the restoration metadata. The restoration metadata is embedded with steganography in the blue pixel frame. This information ensures that both the carrier and the payload files can be restored correctly in every bit during the decapsulation process.
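The principle can be sketched with images modelled as lists of pixel rows. In this simplified version the payload rows are placed below the carrier and a final row records the original dimensions in plain form, whereas the technique described above embeds the restoration metadata steganographically. All names are hypothetical, and storing dimensions directly as pixel values only works for small images.

```python
Image = list[list[int]]


def add_frame(carrier: Image, payload: Image, fill: int = 0) -> Image:
    """Expand the carrier with extra rows holding the payload and a dimension header."""
    carrier_h, carrier_w = len(carrier), len(carrier[0])
    payload_h, payload_w = len(payload), len(payload[0])
    width = max(carrier_w, payload_w, 4)  # the header row needs four cells

    def pad(row: list[int]) -> list[int]:
        return row + [fill] * (width - len(row))

    header = pad([carrier_h, carrier_w, payload_h, payload_w])
    return [pad(r) for r in carrier] + [pad(r) for r in payload] + [header]


def remove_frame(framed: Image) -> tuple[Image, Image]:
    """Use the header row to cut the carrier and payload back out."""
    carrier_h, carrier_w, payload_h, payload_w = framed[-1][:4]
    carrier = [row[:carrier_w] for row in framed[:carrier_h]]
    payload = [row[:payload_w] for row in framed[carrier_h:carrier_h + payload_h]]
    return carrier, payload


carrier_image = [[10, 20, 30], [40, 50, 60]]
payload_image = [[1, 2], [3, 4]]
framed = add_frame(carrier_image, payload_image)
restored_carrier, restored_payload = remove_frame(framed)
```

The carrier pixels appear unchanged in the top-left region of the framed image, so both images can be cut out again without loss.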

PeriCAT - A framework for Information Encapsulation

PeriCAT, the PERICLES Content Aggregation Tool, is a framework for IE techniques. It integrates a set of IE techniques from various domains, which can be used from within the framework for the encapsulation and decapsulation of information. PeriCAT provides a decision mechanism to suggest the best fitting IE technique for a given user scenario. For this it captures the scenario with a questionnaire and calculates virtual distances from the scenario to the different techniques. The technique with the lowest distance to the scenario is the suggested one. All techniques are displayed with their distances in a ranked list, the high score. The calculations are easily adjustable, and the high score is updated live whenever the user scenario changes.
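How such a distance-based suggestion might work can be sketched as a weighted Euclidean distance between a scenario and technique profiles. This is an assumption made for illustration; PeriCAT's actual formula, criteria names, and scores are not reproduced here, and all numbers below are invented example values on a scale from 0 to 1.

```python
import math

# Invented example profiles, not measurements of real techniques.
TECHNIQUES = {
    "zip packaging": {"restorability": 1.0, "visibility": 1.0, "disk_space": 0.6},
    "lsb steganography": {"restorability": 0.2, "visibility": 0.0, "disk_space": 0.9},
    "information frame": {"restorability": 1.0, "visibility": 0.7, "disk_space": 0.5},
}


def distance(scenario: dict[str, float], profile: dict[str, float],
             weights: dict[str, float]) -> float:
    """Weighted Euclidean distance over the criteria named in the scenario."""
    return math.sqrt(sum(
        weights.get(criterion, 1.0) * (wanted - profile[criterion]) ** 2
        for criterion, wanted in scenario.items()
    ))


def rank(scenario: dict[str, float],
         weights: dict[str, float]) -> list[tuple[str, float]]:
    """Return all techniques sorted by ascending distance to the scenario."""
    scored = [(name, distance(scenario, profile, weights))
              for name, profile in TECHNIQUES.items()]
    return sorted(scored, key=lambda item: item[1])


# Hypothetical questionnaire result: exact restoration matters most.
scenario = {"restorability": 1.0, "visibility": 0.0, "disk_space": 0.5}
weights = {"restorability": 3.0, "visibility": 1.0, "disk_space": 1.0}
ranking = rank(scenario, weights)
suggested = ranking[0][0]
```

Changing a questionnaire answer or a weight and calling rank again gives the updated list, which corresponds to the live update of the high score.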

PeriCAT_screenshot_highscore.png


PeriCAT is released on GitHub under the Apache License, Version 2.0, an open source license. More information about the tool, such as an installation guide and a quick start guide, can be found in the GitHub wiki. The deliverable about Information Encapsulation can be accessed at the PERICLES website.

 

 

[1] https://github.com/pericles-project/PeriCAT

[2] https://github.com/pericles-project/pet

[3] http://www.pericles-project.eu/deliverables/59 
