Pericles project
Promoting and Enhancing Reuse of Information throughout the Content Lifecycle taking account of Evolving Semantics


Visualizing data to detect semantic change - Interview with Peter Wittek

In 2015, an open-source tool called Somoclu was developed by Peter Wittek from project partner University of Borås as a key enabling tool in PERICLES WP4 for studying semantic drift on scalable data collections, producing content maps by the hundreds for visual analysis. Semantic drift (also often referred to as “semantic change” or “evolving semantics”) is an active and growing area of research that observes and measures changes in the meaning of concepts within knowledge representation models, along with their potential replacement by other meanings over time. Understanding this phenomenon and being able to anticipate trends play a central role in the management, long-term access, and reuse of digital collections. Since its release under the GNU General Public License v3, Somoclu has attracted vivid interest within the data science community, accumulating over 20,000 downloads on the Python Package Index and Machine Learning Open Source Software. The machine learning tool can be used for a variety of purposes, including anomaly detection, semantic and concept drift detection, and the study of contextual correlations. We interviewed software developer and research fellow Peter Wittek on the tool’s success.

Q: Can you tell us something about the domains that downloaded most, and why Somoclu is so popular in that area?

It is hard to gauge what people actually use it for, since users only contact the developers when they are stuck with something. I know for certain that a scientist in Hawaii used it to group the genomes of bacteria species living on the ocean floor, and another researcher from California used it to reconstruct genome sequences. I find these two examples representative because they highlight an unusual characteristic of emergent self-organising maps: they operate not only as discriminative learners but also as generative ones. This curious duality has also been exploited in our work on studying semantic drifts. I also get questions every now and then from people working in the pharmaceutical industry, but given the secretive nature of their work, I have never actually had a glimpse of what they use Somoclu for.

Q: What type of work would you describe as the main beneficiary of Somoclu?

The main use is undoubtedly interactive data exploration. You get a visual representation of your data that faithfully reproduces the local topology of the high-dimensional feature space. You get an idea of which of your data instances group together, which are far apart, which are the outliers, and how the different clusters relate to one another. Furthermore, since the map has points to which no data instance belongs, it gives you an idea of what could be there – this is the generative aspect of the learning algorithm. All of this is true for any emergent self-organising map implementation. Somoclu's main advantage is scale: it is a massively parallel implementation that scales from a laptop to a GPU-accelerated cluster. The second advantage is that the computational engine is exposed in programming languages that are popular in the machine learning and scientific communities: Python, R, and MATLAB.

Q: Can you envisage other domains or types of work where Somoclu would be useful for?

We have already come a long way from the original purpose of training maps on the sparse data we get from text mining, but unfortunately I am not enough of a visionary to come up with new domains of use. I conjecture that the pharmaceutical industry uses it to find ideas for new chemical compounds – this can be interesting in other industries where you have to mix many components, but only a few of the mixtures make any sense. The construction industry and the search for novel materials come to mind. A map can be trained further as we acquire new data points. We used this property to study how the semantics of words evolve over time, but it would be interesting to see it in other domains where dynamics are important, yet changes are not abrupt – that is, the shifting data instances have some momentum. You can think of the housing market and how the appeal of neighbourhoods shifts over time: for instance, you could train maps to figure out where the hot spots are or where they are likely to show up next. The sky is the limit: the methodology is straightforward, the scale is there, and now it is a matter of finding exciting applications.

To give you an idea of what it takes to use it: once you have installed any of the variants (command line, Python, R, or MATLAB), you only need your data and two parameters, the size of the map in each dimension. Somoclu has sane defaults for the rest of the parameters. As with any learning algorithm, it is a good idea to try it first on a small subset of your data of which you have some prior knowledge or understanding, so you can get a sense of what it does and how to interpret the outcome. Then you can experiment with the rest of the parameters: planar versus toroid topology, rectangular or hexagonal grid, neighbourhood functions, learning parameters, and so on. Self-organising maps are very intuitive once you have a grasp of what you actually see on the maps.
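Under the hood, training a self-organising map boils down to repeatedly finding each sample's best-matching unit on the grid and pulling that node and its neighbours towards the sample. The following is a minimal, illustrative NumPy sketch of that online training loop – not Somoclu's actual massively parallel implementation, and the linear radius and learning-rate schedules here are simplified assumptions:

```python
import numpy as np

def train_som(data, n_rows, n_cols, epochs=10, seed=0):
    """Minimal online self-organising map training loop (illustrative only)."""
    rng = np.random.default_rng(seed)
    dim = data.shape[1]
    # Codebook: one weight vector per grid node, randomly initialised.
    codebook = rng.random((n_rows * n_cols, dim))
    # Grid coordinates of each node, used by the neighbourhood function.
    grid = np.array([(r, c) for r in range(n_rows) for c in range(n_cols)], float)
    for epoch in range(epochs):
        # Neighbourhood radius and learning rate shrink linearly over training.
        radius = max(1.0, (max(n_rows, n_cols) / 2) * (1 - epoch / epochs))
        lrate = 0.5 * (1 - epoch / epochs)
        for x in data:
            # Best-matching unit: the node whose weights are closest to x.
            bmu = np.argmin(((codebook - x) ** 2).sum(axis=1))
            # Gaussian neighbourhood around the BMU, measured on the grid.
            d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
            h = np.exp(-d2 / (2 * radius ** 2))
            # Pull every node towards x, weighted by grid proximity to the BMU.
            codebook += lrate * h[:, None] * (x - codebook)
    return codebook.reshape(n_rows, n_cols, dim)
```

The two required parameters Peter mentions correspond to `n_rows` and `n_cols` here; everything else falls back to a default, mirroring Somoclu's "sane defaults" approach.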


Somoclu visual map

Figure 1: Analyzing semantic proximity: blue basins host content, brown ridges indicate tensions. For more details, refer to S. Darányi, P. Wittek, K. Konstantinidis, S. Papadopoulos, E. Kontopoulos. A Physical Metaphor to Study Semantic Drift. Proceedings of SuCCESS'16, 1st International Workshop on Semantic Change & Evolving Semantics, September 2016.

Comments
  • Albert Meroño Peñuela
    Vrije Universiteit Amsterdam
    Aug 17, 2016
    10:04 AM
    It was very easy to use and install. In a way if you have an interesting dataset (version dataset or time-span dataset) and you want to get out these visual maps telling you about how terminology is changing overtime it’s really useful, especially with the scarce knowledge I have on data-mining and self-organizing maps techniques in particular. So the experiment I did was using a dataset that is called ‘The Dynamic Linked Data Observatory’ and this is a compilation of more than 2000 snapshots of traversable linked-data on the web, which is quite a lot (about 2 terabytes of compressed data). I wanted to do a bit of data crunching over this dataset and come out with something that was usable by Somoclu, so come out with these sparse format matrices essentially telling you how frequent terms are in documents. There is something else we need to think about: how to translate this notion of term and document to linked-data because in linked-data we only have URIs, literals and named graphs. So that was a very interesting experiment and it was also a good test to the scalability of Somoclu in general and it worked very well. So this experiment was about 300,000 dimensions so unique terms occurring in around 5,000 different documents. In the end I got these visual maps. Another interesting remark is the new two different kind of expertise to interpret those. First you need somebody who is good at interpreting the visualization itself, because you need to be able to read what the maps are telling you. The other one is that you also need somebody who is good at the dataset itself since you need to relate stuff that is going on the map with phenomenon that you happen to know is present or happened in the source dataset. So this was more or less my experience using it.