Pericles project
Promoting and Enhancing Reuse of Information throughout the Content Lifecycle taking account of Evolving Semantics

How much metadata is too much (and how little is too little)?



How much metadata is too much (and how little is too little)?
Practical Preservation and People,
December 3rd 2015, Public Records Office of Northern Ireland

The DPC metadata event brought together a spirited collection of people from projects and heritage institutions, all interested in seeking an answer to the question posed in the workshop's title: when collecting metadata, how much is too much?

It sounds like a dry subject. What is metadata but digital dust, fluff that settles on the surface of any archived object that is left lying around for long enough? What could be duller? Are metadata people not all library types in sensible shoes and ill-fitting trouser suits? (Spoiler warning: the answers are, respectively: well, yes it is, but Sherlock Holmes would have had a field day with all that dust; lots of things; and no, they are not.) During the day, we learned that metadata people sometimes have OAIS tattoos and the urge to design and play metadata card games: whether that confirms or confounds one's previous assumptions about the digital curator can safely be assigned to what PG Wodehouse used to call 'the psychology of the individual'.

OAIS tattoo photograph courtesy of Flickr user wlef70 / Creative Commons Licensed

Archived or catalogued objects, be they digital or physical or purely ghosts of a lost physical presence, are often shorn of the context that carries much of their significance. To some extent, metadata supplies that lost context. A logical reductio ad absurdum for this process is to imagine handing out every object with a free side plate holding a detailed model of its place in the world and its relation to all other objects. There are some objections to this approach: one, a complete model of the universe would seem a little difficult to capture; two, even if one did capture it, it would definitely be 'big data'.

Consider the question famously examined in the fairy-tale of Goldilocks and the Three Bears, in which Daddy Bear's porridge was Too Hot, Mummy Bear's was Too Cold, and Baby Bear's was Just Right. An absolute lack of metadata is too little under most ordinary circumstances. An infinite quantity of metadata is reasonably definable as too much. So which bits should we keep? The key point is that 'just right' is not an objective judgement: one can reach it only by consulting the little girl in question about her dietary requirements.

In a spirited keynote, Christian Keitel explained that OAIS knows this, which is why the concept of the segment of the public or, later, the Designated Community came to hold the mysterious significance in practical data preservation that it enjoys today. The question is, who are archives for? Where does one stop: with a specific, constrained, well-understood segment of the community? With the general public at large?

There is a tension here between the urge to broaden accessibility as widely as possible and the practical limitation of accessing and parameterising a real-world community that is not only impossibly broad but which also, in the context of digital preservation, very possibly does not yet exist at all. Clearly there is a risk of entering into an infinite loop of supposition and definition if pragmatically defensible boundaries are not drawn. Of course, there's also a risk of ending up in the world of futurism: science-fiction prototyping may have a place, but spirited discussions continue about whether or not this sort of speculation should be considered part of the archivist's day-job.

In the end, Keitel questioned the overall relevance of the concept of the designated community for archives, especially archives for which usage patterns are more difficult to predict. Designated communities are like a compass: they point the way, but that's all they can do.

The next set of presentations dealt with specific case studies.

Kathryn Cassidy described metadata use in the Digital Repository of Ireland, including the evaluation of other repositories' metadata practice that the DRI carried out to inform its own. Part of the puzzle from the perspective of the DRI is the problem of overloading the data provider. An overly large or complex set of metadata can place too great a burden on the creator or depositor of the data, making it less likely that data will be deposited at all. More metadata would often be nice – but if it comes at the expense of reduced data deposit rates, then it may come at too high a cost. Finally, Cassidy stressed the importance of practical activities, such as the publication of guidelines and briefing papers for a general audience.

Alex Green from the National Archives reinforced the message of practicality: metadata that can be 'supplied by the producer in an automated way' would simply have to suffice, although such metadata is often cryptic in nature, representative of parts of the story that the general public never ordinarily get to see. The provision of such data can sometimes lead to unforeseen results, or to complaints from those used to the relatively presentable nature of formally produced cataloguing records. Alex discussed the use of tools such as DROID, and of digital forensics tools in general, to briefly characterise sets of born-digital data. Again, part of the story is one of reviewing one's requirements to allow for issues of practicality: for example, do not ask for complex file formats such as XML when one can ask for XLS or CSV instead.

The final presentation of the morning dealt with the experiences of the Archaeology Data Service. Katie Green explained the specific needs of the ADS, an organisation that accepts that, unlike the earlier examples, it 'asks for a lot': not just the few 'core' metadata elements from previous cases, but a whole set of specific requirements, supported in each case with detailed guidance. There are complaints from some depositors: the cost of metadata creation is high, and the suggestion is often made that a shorter schema would suffice. Yet when a workshop was held to investigate this, the result was to identify an additional metadata field that it was felt should be included, but wasn't. To mitigate the issue, the ADS began to look at automated metadata generation from available data.

In the afternoon, five presentations were given; of these, the first and second were delivered remotely. The first was an introduction to the concept of minimal effort ingest, given by Bolette Ammitzboll Jurik and Asger Askov Blekinge: in short, this approach follows the guiding principle of ingest first, digest later, thus safeguarding the raw data object but raising some interesting problems as to how (or whether) to present or permit access to the material. Angela Dappert then presented on the subject of PREMIS and the THOR project. The presentation was unfortunately bitten by the old maxim 'Never work with children, animals or computers' – that is, the remote conferencing software in use somehow contrived to eat her slides, reluctantly allowing us access to a few at the end. However, the in-depth discussion of PREMIS was valuable and interesting.

Of the final three presentations, the first, by Herve L’hours, gave an unabashedly pragmatic viewpoint characterised by the phrase 'Metadata, at its best, is a solution to a problem'. Stakeholder management belongs in the equation: tick boxes are fun but they are no replacement for a real community of practice. Or, to put it another way, we have to actually want to fix it, and be engaged enough to know what that activity would imply.

The second, my own contribution to the day, dealt with the topic of semi-automated metadata extraction and why tracking of semantic and community change such as that explored within PERICLES may allow a more fluid approach to definition of a designated community (slides available here).

Yunhyong Kim spoke on a related topic, presenting the eventual goal of allowing rigid standards to develop over time by connecting case studies, data collection and analysis, and (eventually) flexible machine-learning models. The ideal endpoint of connecting formal standards and practical experience is an ongoing evolution in line with pragmatically monitored realities.

Finally, Kim asked a question which serves as an ideal sign-off for this report, getting to the heart of metadata as a living subject, rather than a dry exercise in tickboxes and observation. Here it is:

             Is metadata a love-letter to the future?

My view?… Maybe. Sometimes. And maybe it ought to be.

 

Storify: https://storify.com/emmatonkin/dpc-metadata
