Visualizing data to detect semantic change - Interview with Peter Wittek
In 2015, an open-source tool called Somoclu was developed by Peter Wittek of project partner University of Borås as a key enabling tool in PERICLES WP4. Its purpose was to study semantic drift in scalable data collections, providing content maps in the hundreds for visual analysis. Semantic drift (also often referred to as “semantic change” or “evolving semantics”) is an active and growing area of research that observes and measures changes in the meaning of concepts within knowledge representation models, along with their potential replacement by other meanings over time. Understanding this phenomenon and being able to anticipate trends plays a central role in the management of, and long-term access to and reuse of, digital collections. Since its release under the GNU General Public License v3, Somoclu has attracted lively interest in the data science community and accumulated more than 20,000 downloads across the Python Package Index and Machine Learning Open Source Software. The machine learning tool can be used for a variety of purposes, including anomaly detection, semantic and concept drift detection, and the study of contextual correlations. We interviewed software developer and research fellow Peter Wittek about the tool’s success.
Q: Can you tell us something about the domains that download Somoclu most, and why it is so popular in those areas?
It is hard to gauge what people actually use it for, since users only contact the developers if they are stuck with something. I know for certain that a scientist in Hawaii used it to group the genomes of bacterial species living on the ocean floor, and another researcher from California used it to reconstruct genome sequences. I find these two examples representative because they highlight an unusual characteristic of emergent self-organising maps: they operate both as discriminative learners and as generative learners. This curious duality has also been exploited in our work on studying semantic drifts. I also get a question every now and then from people working in the pharmaceutical industry, but given the secretive nature of their work, I have never actually had a glimpse of what they use Somoclu for.
Q: What type of work would you describe as the main beneficiary of Somoclu?
The main use is undoubtedly interactive data exploration. You get a visual representation of your data that faithfully reproduces the local topology of the high-dimensional feature space. You get an idea of which of your data instances group together, which are far apart, which are the outliers, and how the different clusters relate to one another. Furthermore, since the map has points to which no data instance belongs, it gives you an idea of what could be there – this is the generative aspect of the learning algorithm. All of this is true for any emergent self-organising map implementation. Somoclu's main advantage is scale: it is a massively parallel implementation that scales from a laptop to a GPU-accelerated cluster. The second advantage is that the computational engine is exposed in programming languages that are popular in the machine learning and scientific communities: Python, R, and MATLAB.
Q: Can you envisage other domains or types of work where Somoclu would be useful?
We have already come a long way from its original purpose of training maps on the sparse data we get from text mining, but unfortunately I am not much of a visionary when it comes to new domains of use. I conjecture that the pharmaceutical industry uses it to find ideas for new chemical compounds. This could be interesting in other industries where you have to mix many components, but only a few of the mixtures make any sense. The construction industry and the search for novel materials come to mind. A map can be trained further as we acquire new data points. We used this property to study how the semantics of words evolve over time, but it would be interesting to see it in other domains where dynamics are important, yet changes are not abrupt – that is, the shifting data instances have some momentum. You can think of the housing market and how the appeal of neighbourhoods shifts over time: for instance, you could train maps to figure out where hot spots are or where they are likely to show up next. The sky is the limit: the methodology is straightforward, the scale is there, now it is a matter of finding exciting applications.
To give you an idea of what it takes to use it: once you have installed any of the variants (command line, Python, R, or MATLAB), you only need your data and two parameters, namely the size of the map in each of its two dimensions. Somoclu has sane defaults for the rest of the parameters. As with any learning algorithm, it is a good idea to try it on a small subset of your data of which you have some prior knowledge or understanding, so you can get a sense of what it does and how to interpret the outcome. Then you can experiment with the rest of the parameters: planar versus toroid topology, rectangular or hexagonal grid, neighbourhood functions, learning parameters, and so on. Self-organising maps are very intuitive once you have a grasp of what you actually see on the maps.
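The workflow described above (data plus two map dimensions, tried first on a small, well-understood sample) can be illustrated with a minimal sketch. This is not Somoclu's code or API: it is a tiny pure-Python self-organising map with a planar topology and a rectangular grid, written only to show the idea. The function names `train_som` and `best_matching_unit`, the learning-rate and radius schedules, and the toy data are all assumptions made for illustration.

```python
import math
import random


def best_matching_unit(codebook, x):
    """Return the grid coordinates of the node whose weights are closest to x."""
    return min(
        codebook,
        key=lambda node: sum((w - v) ** 2 for w, v in zip(codebook[node], x)),
    )


def train_som(data, n_columns, n_rows, epochs=20, seed=0):
    """Train a toy planar, rectangular-grid SOM with online updates.

    Hypothetical helper, not Somoclu's implementation. Returns a dict
    mapping (column, row) -> weight vector.
    """
    rng = random.Random(seed)
    dim = len(data[0])
    codebook = {
        (c, r): [rng.random() for _ in range(dim)]
        for c in range(n_columns)
        for r in range(n_rows)
    }
    radius0 = max(n_columns, n_rows) / 2.0
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)
        for i in order:
            x = data[i]
            frac = step / total_steps
            rate = 0.5 * (1.0 - frac) + 0.01        # assumed linear decay
            radius = radius0 * (1.0 - frac) + 0.5   # assumed linear decay
            bc, br = best_matching_unit(codebook, x)
            for (c, r), w in codebook.items():
                grid_d2 = (c - bc) ** 2 + (r - br) ** 2
                h = math.exp(-grid_d2 / (2.0 * radius ** 2))  # Gaussian neighbourhood
                for k in range(dim):
                    w[k] += rate * h * (x[k] - w[k])
            step += 1
    return codebook


# Toy data with known structure: two well-separated clusters in 3-D.
data_rng = random.Random(42)
cluster_a = [[data_rng.gauss(0.0, 0.05) for _ in range(3)] for _ in range(20)]
cluster_b = [[1.0 + data_rng.gauss(0.0, 0.05) for _ in range(3)] for _ in range(20)]
data = cluster_a + cluster_b

# The only two parameters we choose: the map size in each dimension.
n_columns, n_rows = 6, 4
codebook = train_som(data, n_columns, n_rows)

bmu_a = best_matching_unit(codebook, [0.0, 0.0, 0.0])
bmu_b = best_matching_unit(codebook, [1.0, 1.0, 1.0])
hit_nodes = {best_matching_unit(codebook, x) for x in data}
empty_nodes = len(codebook) - len(hit_nodes)

print("BMU for cluster A centre:", bmu_a)
print("BMU for cluster B centre:", bmu_b)
print("Nodes with no data instance:", empty_nodes)
```

Because the toy data has a known two-cluster structure, it is easy to check that the two cluster centres land on different map nodes. The count of nodes with no data instance corresponds to the "empty" regions of the map mentioned earlier in the interview. For real work, the Somoclu package itself should be used; consult its documentation for the actual interface.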
Figure 1: Analyzing semantic proximity: blue basins host content, brown ridges indicate tensions. For more details, refer to S. Darányi, P. Wittek, K. Konstantinidis, S. Papadopoulos, E. Kontopoulos. A Physical Metaphor to Study Semantic Drift. Proceedings of SuCCESS'16, 1st International Workshop on Semantic Change & Evolving Semantics, September 2016.