Monitoring the semantic drift of index terms by an evolving vector field: a novel experimental approach
The evolution of semantics refers to shifts in the interpretation of a concept over time and/or contrasts in its interpretation between user communities, due to linguistic, socio-cultural or psychological factors. In the literature, this phenomenon goes by names such as “semantic change”, “semantic shift”, “semantic decay” and sometimes “concept drift”, all of which are grouped under the umbrella term “semantic drift”.
As a consequence of inevitable language change driven by an interplay of factors, semantic drift becomes highly relevant for an LTDP project like PERICLES, since it critically compromises access to digital objects (DOs) for future users, either (a) because the concepts and/or the words serving as their labels will have changed, or (b) because the same concept may carry different labels across separate user communities.
To address this problem, we have now carried out our first scalable experiment, to be followed by further experiments later this year. The yardstick against which we evaluated our statistics-based methodology was the semantic consistency of the results: regardless of the direction of evolution, such results must remain as close as possible to the way humans understand words.
To address the above concerns, we designed an experiment with the following objectives:
- Evaluate the semantic consistency of terms included in an input textual corpus within consecutive time periods;
- Investigate whether semantic drift can be detected among the term groups by analysing the changes in their semantic consistency.
Our methodology accordingly consisted of the following steps:
- Temporal splitting of a selected textual corpus;
- Building a vector space model for tracking the changes of the evolving text collection;
- Projecting that space onto a two-dimensional surface, by means of the emergent self-organizing maps algorithm, so that term clusters and shifts between them become more apparent;
- Validating semantic consistency among term clusters in the map, based on established semantic similarity metrics and an ontology defining the meanings of terms in the extracted vocabulary.
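The first two steps above can be sketched as follows. The mini-corpus, the period boundaries and the use of scikit-learn's TfidfVectorizer are illustrative assumptions only, not our actual preprocessing pipeline:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus of (year, review text) pairs, standing in for
# the timestamped Amazon reviews; the real corpus is vastly larger.
corpus = [
    (1998, "a gripping mystery novel with a twist ending"),
    (1999, "the novel drags but the mystery holds up"),
    (2005, "an ebook edition of the classic mystery"),
    (2006, "kindle formatting issues spoil this ebook"),
    (2011, "audiobook narration brings the novel alive"),
    (2012, "streaming the audiobook was seamless"),
]

# Step 1: temporal splitting into consecutive periods (boundaries invented).
periods = {"1998-2004": [], "2005-2010": [], "2011-2013": []}
for year, text in corpus:
    if year <= 2004:
        periods["1998-2004"].append(text)
    elif year <= 2010:
        periods["2005-2010"].append(text)
    else:
        periods["2011-2013"].append(text)

# Step 2: one shared vocabulary, one term-document matrix per period,
# so that term vectors remain comparable across periods.
vectorizer = TfidfVectorizer()
vectorizer.fit(text for docs in periods.values() for text in docs)
term_spaces = {
    name: vectorizer.transform(docs).T.toarray()  # rows = terms
    for name, docs in periods.items()
}
```

The per-period term vectors in `term_spaces` are what a self-organizing map would then be trained on.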
We applied the above methodology to the Amazon book reviews data set, which is publicly available as part of Stanford University's SNAP project. The data set spanned a period of 18 years and included approximately 12.8 million book reviews up to March 2013. To train the emergent self-organizing maps, we used Somoclu, an open-source tool developed in PERICLES (from a pre-project core). WordNet served as the underlying taxonomy, while the implementation of all semantic similarity metrics was based on WS4J. Below, a slideset shows how the structure of term classifications changed over the three periods under analysis (Fig. 1).
Fig. 1. From left to right, three phases of the evolving term classification space of the Amazon book review dataset. Green areas are inactive regions of the vector field, containing terms about to become important, whereas blue zones indicate currently important, protruding content. The regional boundaries show the outlines of the changing category structure, i.e. the placeholders where the terms, as best matching units (BMUs), are mapped.
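The taxonomy-based validation rests on similarity metrics such as Wu-Palmer, one of the measures WS4J implements over WordNet. As a self-contained illustration of the principle, here is the metric computed over a toy hypernym fragment; the taxonomy and terms below are invented, whereas the real experiment queried WordNet itself:

```python
# Toy taxonomy fragment (child -> parent), loosely modelled on WordNet's
# hypernym hierarchy; purely illustrative.
parent = {
    "novel": "book", "dictionary": "book",
    "book": "publication", "magazine": "publication",
    "publication": "work", "work": "entity",
}

def ancestors(term):
    """Path from a term up to the root, the term itself first."""
    path = [term]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def depth(term):
    return len(ancestors(term))  # the root has depth 1

def wu_palmer(t1, t2):
    """Wu-Palmer similarity: 2 * depth(LCS) / (depth(t1) + depth(t2))."""
    anc1 = ancestors(t1)
    lcs = next(a for a in ancestors(t2) if a in anc1)  # lowest common subsumer
    return 2 * depth(lcs) / (depth(t1) + depth(t2))
```

With this fragment, "novel" and "dictionary" (both kinds of book) come out more similar than "novel" and "magazine", matching intuition.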
Our overall aim was to investigate whether there is a relationship between term proximity in the map and the semantic similarity of the same terms, based on their relative positions in the WordNet taxonomy. The first results were statistically significant and gave us valuable insight into how to improve our experiment and evaluation design.
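The kind of relationship tested here can be sketched as a rank correlation between pairwise map distance and taxonomy-based similarity. All values below are synthetic stand-ins for the real ESOM coordinates and WordNet similarities, constructed only to show the shape of the test:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical data: 2-D BMU coordinates of 30 terms on the map, and a
# taxonomy-based similarity that (by construction) decays with map
# distance plus noise -- real values would come from ESOM and WordNet.
coords = rng.uniform(0, 50, size=(30, 2))
map_dist, tax_sim = [], []
for i in range(30):
    for j in range(i + 1, 30):
        d = np.linalg.norm(coords[i] - coords[j])
        map_dist.append(d)
        tax_sim.append(np.exp(-d / 25) + rng.normal(0, 0.05))

# A significantly negative rho means that terms close together on the
# map tend to be semantically similar in the taxonomy.
rho, pvalue = spearmanr(map_dist, tax_sim)
```

Spearman's rho is a natural choice here because it assumes only a monotonic, not a linear, relationship between map distance and similarity.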
Such experiments add new ideas to the toolkit developed by PERICLES. For instance, by replacing the semantics of digital objects with their functional dependencies in the same vector field, one can hopefully model developing functionalities among components of a digital ecosystem from a statistical perspective. For comparison, Fig 2 displays a map of technological changes in the preservable critical features of eight software-based art (SBA) items between 1981 and 1998, obtained by block clustering. The data are hypothetical, intended only to show how the importance of features drifts over two-year periods, with black cells indicating the absence, and green cells the presence, of a feature in a particular SBA item.
Fig 2. Block clustering map of hypothetical digital preservation data: as items of software-based art evolve technologically, so do their critical preservable features. The drift in critical features separates the eight items into four groups, overlapping with chronological periods.
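Block clustering of such a presence/absence matrix can be approximated with spectral co-clustering. The sketch below uses an invented 8 × 10 feature matrix and scikit-learn's SpectralCoclustering, not the method actually used to produce Fig 2:

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

# Hypothetical presence/absence matrix: 8 SBA items (rows) x 10 critical
# features (columns); 1 = feature present (green), 0 = absent (black).
# Consecutive item pairs share a feature block, mimicking two-year periods.
X = np.zeros((8, 10), dtype=int)
X[0:2, 0:3] = 1   # items 1-2, e.g. an early period
X[2:4, 3:5] = 1   # items 3-4
X[4:6, 5:8] = 1   # items 5-6
X[6:8, 8:10] = 1  # items 7-8

# Co-cluster rows and columns simultaneously into four biclusters,
# recovering the item groups and their associated feature sets.
model = SpectralCoclustering(n_clusters=4, random_state=0)
model.fit(X)
print(model.row_labels_)     # bicluster membership of each item
print(model.column_labels_)  # bicluster membership of each feature
```

On this clean block-diagonal matrix the four item groups fall out directly; real preservation data would be noisier and the biclusters correspondingly fuzzier.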
This could be a helpful parallel track to our ongoing and planned work on semantic drift, which involves (a) the use of evolving ontologies, (b) a study of the differences in terminologies adopted by different user communities, and (c) measuring semantic drift by means of visual recognition.
Authors: S. Kontopoulos, S. Darányi, P. Wittek