Bringing tools for media monitoring to the public: outcomes of the ODYCCEUS summer school on ‘Democracy in the Age of Big Data and AI’ (FETFX Future Tech Week contribution)

Introduction

Social science is undergoing a revolution. The rise of social media, smart cities, digital archives, quantified selving, and personal agents is providing an unprecedented amount of data about the social relations and behaviors of very large groups of people, creating an opportunity to a more empirically grounded social theory. But this opportunity can only be realized by using powerful tools: for analyzing textual data, for finding patterns in data, and for visualizing those patterns in a way that brings out their meaning. These tools increasingly rely on complex systems science and Artificial Intelligence.

 At the same time we see that collective decision-making in contemporary societies is also undergoing profound change. Social media, Big Data, and AI are already having a profound impact on political and social processes. They have empowered ‘citizen-driven’ political movements such as the Arab spring or the French ‘Gilets Jaunes’ protests. They also provide new means to strategically influence public opinion as it was seen before the the Brexit refendum and the US presidential elections in 2016. Similarly, they could lead to rapid polarizations in political debates, such as those concerning migration or identity, and accelerate the spreading of fake news or hate speech.

These dynamics were the focus of the recent ODYCCEUS summer school on ‘Democracy in the Age of Big Data and AI’. In honour of the FETFX Future Tech Week, this page presents some of the summer school’s outcomes, demonstrating the on-going development and potential applications of media-monitoring tools in the Penelope ecosystem.    

Building (social) media observatories

The ODYCCEUS summer school mixed theoretical lectures with practical hands-on sessions and ateliers to examine how novel tools of computational social science can help us understand these phenomena and possibly facilitate future democratic decision processes. It introduces social scientists and media researchers to the latest methods, tools, and techniques and introduces AI researchers and complex systems scientists to the approaches and issues of social science so that they can come up with new tools or refine existing ones.


More specifically, participants learned how they could build Opinion Observatories that tap into social media to collect information about how certain actors are trying to manipulate political opinion in elections, how fake news gets fabricated and spreads, or how opinions get polarized and shift. To this end, participants engaged in a series of ateliers guided by tutors, each of which addressed a case study. Case studies considered a specific contentious issue using specific data sources.

Case 1: Climate Change (tutored by Artificial Intelligence Lab, Vrije Universiteit Brussel)

In this workshop, participants used data from the news website of the Guardian and Penelope components to develop a ‘science tracker’: a pipeline to reveal and track references to scientific publications and research institutions in news articles and comments related to climate change. Participants have thus built components to trace the relationships between scientific references and articles and those in news website commentaries, the occurrence of scientific references in articles over time, etc. As such, they have laid a basis for a further exploration of the role of scientific literature within the climate change debate.   

Case 2: Antagonisms, coalitions and populism (tutored by Max Planck Institute for Mathematics in the Natural Sciences, Leipzig)

The participants of the workshop decided to focus on a very specific discourse, which took place in the shadow of the big Brexit topic dominating UK politics:  The debate about badger cullings in the UK due to the high number of cases of bovine tubercolosis among cattle. The illness is partly spread by the badger population in the UK and Department for Environment, Food and Rural Affairs has hence initiated culling of badgers to reduce the reservoir of infection in wildlife.

Two data sources were used in the workshop: Twitter posts of the week of the Summer School, and all parliamentary speeches by the UK House of Commons since 2016. The aim was to produce an analysis allowing and making use of both close and distant reading to gain a comprehensive view of the debate.

Starting off with relatively close reading of the culling debate in the UK parliament, a new component was introduced and applied to the speeches: Statement graphs. There, sentences including the same words are linked to each other, which yields an overview about the language and topic structure of speeches and can be used to visualize differences and similarities in the use of expressions and words of different parties and politicians, or at different times.

The participants then analyzed the language of the different parties in the UK parliament, and found that the two major parties of the UK talk differently about the cullings: Labour politicians in a more detailed and emotional way, Tories in more abstract and rational terms.

Twitter was used to get an impression about the discourse on social media. Traffic on Twitter was accelerated by the announcement of extending the culling to eleven new areas on September 11. Analysis of the retweet network of posts including ‘badger‘ and ‘cull‘ showed that, to a big extent, only the voices opposing the culling measures were present and shared on Twitter. Statements in favour of the cullings were largely missing. An interesting follow-up question hence arose: How do we deal with data that is not there?

The results showed that even for a ‘minor‘ political topic (compared to and overshadowed by the very heated Brexit debate going on at the same time), the methods and pipelines the participants used and developed provided valuable insights on different scales of data aggregation and visualization, from close to distant reading. This is what should be pursued further in the on-going development of Penelope. The simple visualization techniques of the statement and retweet graphs turned out to provide a valuable systematic perspective on the data. Also more qualitatively oriented researchers can profit from those tools, that we plan to provide to Penelope soon.

Click here for the full presentation

Click here for a demonstrator with the statement graph

Case 3: Hate and excitable speech (tutored by the University of Amsterdam)  

This group’s project was aimed at mapping and tracing the spread of extreme speech and hate speech across online platforms. This involved a number of datasets across the one-year interval surrounding the election of Donald Trump as US president in November 2016. Two datasets were of particular interest here, one sourced from 4chan, an anonymous and ephemeral web forum that over the past few years has become increasingly prominent as a breeding ground of far right conspiracy theories and discourse. The other consisted of all comments left by users on Breitbart, a more traditional right-wing news site which emerged as an important supporter of Donald Trump’s campaign in 2016.

One strand of the research focused on a semantic analysis of these datasets. Using a word2vec analysis based on an expert list of known ‘antagonistic’ phrases on the platform – phrases used to identify either the ingroup (e.g. 4chan and its users) or the outgroup (those outside the forum’s subculture, ‘normies’ in 4chan vernacular), the full dataset could then be processed to see how particular phrases or topics could be mapped on this ingroup-versus-outgroup axis. This gave coherent results for a range of topics, from news outlets (where the right-wing Fox News ranked consistently ‘closer’ to the right-wing datasets than neutral or left-wing outlets) to political candidates (where Donald Trump was, unsurprisingly, closer, while Democrat Bernie Sanders was the furthest outward), providing an empirical and quantified impression of the self-described political identity of these ‘exciteable’ discussion spaces.

A second strand of research focused on tracing of one particular ‘phrasal meme’, the far right conspiracy theory of ‘white genocide’, which suggests that the ‘white race’ is under constant threat from immigration and miscegenation. By first subsetting all posts from the 4chan and Breitbart datasets containing this phrase, and then looking for ‘neologisms’ – that is, words not found in other language spaces – we formed an overview of topics discussed in relation to white genocide in these spaces. This produced two interesting insights. First, the semantic context of white genocide, as defined by these neologisms, is remarkably different between the spaces; while on 4chan the discussion is clearly focused on traditional antisemitic theories, Breitbart is quite specifically concerned with the ‘kalergi plan’, a related conspiracy theory. This shows how a current far right talking point can manifest differently in different online discussion spaces. Secondly, the discussion of ‘white genocide’ on especially Breitbart seemed to be completely divorced from the articles published by the outlet, suggesting that the comments section on the site is not so much a place to react to whatever it publishes, but rather a more general right-wing discussion forum where people simply continue an on-going ideological discussion.

Major parts of this research were conducted using 4CAT, a Penelope-based research tool, which allowed for a rapid iterative approach to data analysis. In this way, it also served as a testbed for these technologies, and demonstrated that through the use of a unified interface on top of existing analytical pathways, researchers from various backgrounds can effectively engage with large datasets.

Case 4: Migration and borders (tutored by the University of Paris, 7)

The outcomes of this workshop were documented in Claude Grasland (1), Romain Leconte (1), Etienne Toureille (1), Robin Lamarche-Perrin (2), Hong-Lan Botterman (2), Camilla Ferri( 3), Marloes Geboers (4), Audrey Lisador (1), Remy Poulain (2), Armin Pournaki (5), Thomas Rosenthal (6), ‘A multidimensional perspective on the salience analysis of migration in the press The case of 14 European newspapers (2014-2018)’ Socius (in preparation).

Abstract

The idea of “migrant crisis” has been criticized by migrations studies as an artefact produced by the public sphere. Nonetheless, this idea has spread worldly. This article proposes to study the salience of news related to migrations in a corpus of European press during a fiveyears period (2014-2018). When it comes to the comparison between different sources, between different languages and national contexts, methodological obstacles arise. Salience analysis should take in charge the social complexity by a cross-dimensional perspective. First, we propose a methodology for building a multi-sources, multinational and multi-languages corpusand to identify the topic. Then, the analysis reveals that the salience of migration in the press is not as global as supposed. Important differences of coverage exist. Thanks to an online platform, observations of salience patterns can be made by crossing time variables (period and span) with source attributes (language, editorial scope, location).

URL to related tool: https://claudegrasland.shinyapps.io/newsmig2/

Penelope at the ODYCCEUS case study workshop in Amsterdam

From May 15-17 2019, the ODYCCEUS consortium assembled in Amsterdam for a three-day conference. Over the course of a series of presentations and work sessions, progress on the ODYCCEUS project, including the Penelope infrastructure, was shared and discussed. On May 15, the delegates presented their ongoing work on the case studies to the group. In preparation for an upcoming volume on the project, these presentations were followed by a series of discussion sessions in which the epistemological and methodological implications of this work were explored in more depth. This for instance included reflections on the tensions between hermeneutic and causal approaches to knowledge production with digital methods.

A series of four work sessions on May 16 en 17 provided further opportunities to collaborate on the Penelope infrastructure, including its components and the observatories tied to the case studies. By engaging with the ODYCCEUS consortium in the informal setting of the Amsterdam Institute for Advanced Study, the various teams thus gained some valuable input on the future design of the Penelope platform. As such, the meeting sparked concrete plans for future collaborations among consortium members.

Tom Willaert presenting progress on the Penelope infrastructure

Semantic Frame Extractor online

A new Penelope Component, called Semantic Frame Extractor is now online!

Frame semantics is commonly used as a methodology for representing the meaning of linguistic utterances. While semantic frames have successfully been formalised on a large scale, it is still a major challenge to automatically extract them from raw text. This Penelope component overcomes this challenge by using precision language processing techniques. Concretely, the component takes a sentence (or a list of texts) and a frame of interest (e.g. ‘Causation’) as input and returns all instances of this frame, and its frame elements, that occur in the sentence (or list of texts). The language processing part of the semantic frame extractor has been developed within the Fluid Construction Grammar (FCG) framework. 

The OpenAPI specification of the component is available at https://app.swaggerhub.com/apis/EHAI/Semantic-Frame-Extractor-API/1.0.0. As all components, it can be used form any programming languages are via Penelope interfaces such as the Penelope Workbench.

Inspiration day of the Flemish Radio and Television (VRT)

The main ideas of the ODYCCEUS project were introduced to a delegation of the Flemish Radio and Television (VRT) at the inspiration day organised at the VUB AI Lab. Paul Van Eecke and Katrien Beuls demonstrated the semantic frame extractor, which detects semantic frames in news paper articles and tries to fill in their slots. An example of such a semantic frame is the Causation frame, with the verb “cause” that evokes the frame and phrases such as “climate change” and “rising sea levels” as the cause and effect slots.  You can read more about a first version of the semantic frame extractor in the online web demo.

A further follow-up meeting is planned with the VRT News editorial team to discuss a potential collaboration.

Workbench piloted at JRC in Ispra

The Penelope workbench was presented and piloted today at the EU Commission’s Joint Research Centre (JRC) in Ispra during the third Resonances Summer School, this year devoted to the theme of “Big Data”. The pilot took place in a session on “New tools for social media and fake news”, in which Luc Steels gave a talk and Michael Anslow (Sony CSL Paris) demonstrated the workbench. The Resonances Summer School is an initiative of the Sci-Art project, which is joining art and science to support EU policy making.

Guest lecture at VUB-SMIT

Katrien Beuls gave an invited talk at the VUB centre for Studies in Media, Innovation and Technology (SMIT) where she introduced the ODYCCEUS project. She talked about the goals of the project to create conceptual and social maps from online media and use these maps to make social discourse and social dynamics visible. The semantic frame extractor was also demonstrated, as it was used as an augmented reading tool for a newspaper article about climate change.

First pilot study of the Penelope workbench

The Penelope workbench, which is the first Penelope interface that was released to the public and allows social scientists to build pipelines with available Penelope components, has been piloted at the first ODYCCEUS conference in Leipzig this week. The results of the pilot study show that the participants rated the workbench as useful (8/10) and state that it met their expectations (7.5/10). The workbench can be tested on https://penelope.vub.be/workbench.