Components – The Penelope Platform

This page lists all components that have been made available as part of the Penelope platform. For more general information on Penelope components and examples of how to use them, see the Getting Started page. For contributing a component, see the Contributing page.

Spacy Natural Language Processing Tools

Contributed by EHAI – Vrije Universiteit Brussel.

This component provides access to a wide variety of natural language processing (NLP) tools:

Tokenization
Lemmatization
Noun chunking
Part-of-speech tagging
Named entity recognition
Dependency parsing
Word embeddings
Sentencization

With each API call, these tools can analyse either a single sentence or an array of text documents (for faster performance). The tools are available in 7 languages (en, de, es, pt, it, nl, fr). The NLP tools rely on the spaCy Python library.

The OpenAPI specification is available at https://app.swaggerhub.com/apis/EHAI/vub-spacy-services/1.0.0 or can be downloaded in JSON format here.

Margot Argument Miner

Contributed by University of Bologna and University of Modena e Reggio Emilia.

Argument mining refers to the automated analysis of unstructured textual documents with the goal of extracting text segments that express claims or premises (also known as argument components), and possibly clustering them into fully-fledged arguments and inferring support or attack relations. Systems that use argument mining can help analyse the structure and dynamics of public debate and develop sophisticated conversational agents that are able to sustain debates with human counterparts.

This component addresses the first stages of argument mining in English texts. It takes as input any text, such as a paragraph from Wikipedia or an op-ed from an online newspaper, and identifies premises and claims (indicating the degree of certainty).

A graphical user interface is available at http://margot.disi.unibo.it/

The OpenAPI specification is available at https://app.swaggerhub.com/apis/EHAI/penelope-margot-service/1.0.0

Multidimensional Outlier Explorer

Contributed by the Complex Networks team at LIP6 (UPMC / CNRS).

This component allows for the exploration of multidimensional datasets and for the detection of statistical outliers within them. It is therefore mainly a data exploration tool that gives a first glance at the data and helps formulate research hypotheses to be tested later.

The component takes as input a list of numeric observations each described according to several categorical dimensions. For example, in the case of Twitter data, it can be the number of tweets (numeric observation) that have been published by a given user (first dimension) about a given topic (second dimension) at a given date (third dimension). The input data hence takes the form of a list of quadruplets (user, topic, date, number of tweets) in a JSON format. Statistical outliers are then identified by first selecting some dimensions of interest, that is by subsetting or by aggregating the input dimensions. If needed, observations can also be normalised according to the marginal values along the selected dimensions, thus comparing the observed value to an expected value obtained by the uniform redistribution of the selected marginal values. Different statistical tests can then be chosen to measure the deviation between the observed and the expected values. The component finally returns a list of positive outliers in a JSON format, that is observations that are significantly higher than expected.
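The pipeline described above can be sketched in a few lines of Python. This is a minimal illustration, not the component's implementation: the function name, the record field names, the threshold, and the choice of a Poisson-style standardised deviation as the statistical test are all assumptions made for the example. It aggregates the input onto two selected dimensions, derives expected values from the marginals, and keeps only the observations that are higher than expected.

```python
from collections import defaultdict
from math import sqrt


def positive_outliers(records, dims, threshold=2.0):
    """Return cells whose observed value clearly exceeds the expected one.

    records   -- list of dicts with categorical keys and a numeric "count"
    dims      -- the two dimensions of interest; all others are aggregated away
    threshold -- hypothetical cut-off on the deviation score
    """
    d1, d2 = dims
    observed = defaultdict(float)  # aggregated value per (d1, d2) cell
    row = defaultdict(float)       # marginal along d1
    col = defaultdict(float)       # marginal along d2
    total = 0.0
    for r in records:
        observed[(r[d1], r[d2])] += r["count"]
        row[r[d1]] += r["count"]
        col[r[d2]] += r["count"]
        total += r["count"]
    if total == 0:
        return []

    outliers = []
    for (a, b), obs in observed.items():
        # Expected value if the marginals were redistributed uniformly.
        expected = row[a] * col[b] / total
        if expected <= 0:
            continue
        # Assumed test: Poisson-style standardised deviation.
        score = (obs - expected) / sqrt(expected)
        if score > threshold:
            outliers.append({d1: a, d2: b, "observed": obs,
                             "expected": expected, "score": score})
    return sorted(outliers, key=lambda o: o["score"], reverse=True)


# Toy quadruplets (user, topic, date, number of tweets); values are made up.
records = [
    {"user": "alice", "topic": "climate", "date": "d1", "count": 10},
    {"user": "alice", "topic": "climate", "date": "d2", "count": 10},
    {"user": "alice", "topic": "sports", "date": "d1", "count": 2},
    {"user": "bob", "topic": "climate", "date": "d1", "count": 2},
    {"user": "bob", "topic": "sports", "date": "d1", "count": 20},
]

# Aggregate over the date dimension and look at (user, topic) cells only.
outliers = positive_outliers(records, ("user", "topic"))
for o in outliers:
    print(o)
```

In this toy data every (user, topic) cell has an expected value of 11, so only the two cells with an observed value of 20 are reported. A different test or threshold would change which cells count as significant.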

The OpenAPI specification is available at https://app.swaggerhub.com/apis-docs/Lamarche-Perrin/outlier-explorer/1.0.1.

Semantic Frame Extractor

Contributed by EHAI – Vrije Universiteit Brussel.

Frame semantics is commonly used as a methodology for representing the meaning of linguistic utterances. While semantic frames have successfully been formalised on a large scale, it is still a major challenge to automatically extract them from raw text. This Penelope component overcomes this challenge by using precision language processing techniques. Concretely, the component takes a sentence (or a list of texts) and a frame of interest (e.g. ‘Causation’) as input and returns all instances of this frame, and its frame elements, that occur in the sentence (or list of texts). The language processing part of the semantic frame extractor has been developed within the Fluid Construction Grammar (FCG) framework.

The OpenAPI specification is available at https://app.swaggerhub.com/apis/EHAI/Semantic-Frame-Extractor-API/1.0.0.

Language Innovation Tracker

Contributed by EHAI – Vrije Universiteit Brussel.

This component extracts neologisms from texts. This is particularly helpful for gaining insight into discussions on platforms such as 4chan, where language is highly innovative and new words are often used to convey a highly non-neutral meaning and to distinguish between the in-group and the out-group. It also offers visualization functionality, including word clouds and tracking the popularity of neologisms through time.

The OpenAPI specification is available at https://app.swaggerhub.com/apis/EHAI/language-innovation-tracker/1.0.0.

Guardian Climate Change Data

Contributed by EHAI – Vrije Universiteit Brussel.

Penelope’s Guardian Climate Change Data Web API allows you to retrieve articles and reader comments from The Guardian newspaper that are tagged for climate change. You can also search through articles and comments using regular expressions.

The OpenAPI specification is available at https://app.swaggerhub.com/apis/EHAI/GuardianClimateChangeData/1.0.0.

4CAT Capture and Analysis

Contributed by UvA – DMI.

This component provides endpoints through which search queries for any of 4CAT’s datasets (comprising 4chan, reddit, 8chan and more) may be queued. The resulting datasets, which are identified with a unique ID, may be manipulated via other endpoints to generate new data, and so on. Example transformations include URL co-link analysis, tokenisation, activity graphs, and image walls.

The OpenAPI specification is available at https://app.swaggerhub.com/apis/oilab/4cat-tool/1.0.0

4CAT Standalone Data Processors

Contributed by UvA – DMI.

This API may be called to transform provided data through one of several data processors offered by 4CAT. A list of available data processors may be acquired via a separate endpoint. Data is supplied in a standardised CSV format; the return data format depends on the chosen processor. Example processors include tokenisation, URL co-link analysis and activity graphs.

The OpenAPI specification is available at https://app.swaggerhub.com/apis/oilab/4cat-standalone/1.0.0

4CAT Platform 4chan Data API

Contributed by UvA – DMI.

This API offers endpoints through which post data for given 4chan thread IDs or boards may be retrieved.

The OpenAPI specification is available at https://app.swaggerhub.com/apis/oilab/4cat-data/1.0.0

Network Tools

Contributed by MPI MIS Leipzig.

This component provides network science related tools to the Penelope ecosystem:

Statement Graph Generator: Create a co-occurrence graph from a list of statements with metadata. Each node is a statement and a link is created for every co-occurring word between nodes. Includes lemmatization and part-of-speech tagging using spaCy. Supports all current spaCy languages.

Louvain: Find the communities of a graph using the Louvain algorithm. Relies on python-louvain.

Giant component: Return the largest connected component of the input network.

Interactive Visualiser: Generate an interactive force-directed graph from a list of nodes and edges. Essentially a Python wrapper for the d3js-based force-graph library.
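To make the first and third tools more concrete, here is a rough, standard-library-only sketch of the two ideas. It is not the component's code: the function names are hypothetical, plain whitespace tokenisation stands in for the lemmatization and part-of-speech tagging the component performs, and co-occurring words are represented as an edge weight rather than as separate links.

```python
from collections import deque
from itertools import combinations


def statement_graph(statements):
    """Build weighted edges between statements that share words.

    Returns a dict mapping (i, j) index pairs to the number of distinct
    words the two statements have in common.
    """
    words = [set(s.lower().split()) for s in statements]
    edges = {}
    for i, j in combinations(range(len(statements)), 2):
        shared = words[i] & words[j]
        if shared:
            edges[(i, j)] = len(shared)
    return edges


def giant_component(n_nodes, edges):
    """Return the node set of the largest connected component (BFS)."""
    adjacency = {i: set() for i in range(n_nodes)}
    for i, j in edges:
        adjacency[i].add(j)
        adjacency[j].add(i)

    seen, best = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node] - component:
                component.add(neighbour)
                queue.append(neighbour)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Made-up statements for illustration.
statements = [
    "carbon taxes reduce emissions",
    "emissions keep rising",
    "football season starts",
    "taxes fund schools",
]
edges = statement_graph(statements)
giant = giant_component(len(statements), edges)
print(edges)
print(sorted(giant))
```

Here the third statement shares no word with the others, so it falls outside the largest connected component. A real pipeline would typically filter stop words as well, otherwise common function words link almost every pair of statements.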

The OpenAPI specification is available at https://app.swaggerhub.com/apis/pournaki/PenelopeNetworkComponents/1.0.0.

Try out the component using this example Jupyter notebook.

Parliament Dataset

Contributed by MPI MIS Leipzig.

Penelope’s Parliament Data Web API allows you to retrieve speeches from parliamentary discussions with relevant metadata (speech id, speaker, party, date, discussion title), based on a keyword search. There are two modes of retrieval: you can get only the speeches containing the search query (get-speeches), or get an aggregate of all the speeches in the same discussion, if one of them contains the search query (get-speeches-agg).
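The difference between the two retrieval modes can be illustrated on an in-memory list of speeches. The sketch below is only a local stand-in and does not call the API: the function names mirror the mode names, while the dictionary keys, the case-insensitive substring match, and the grouping of results by discussion title are assumptions made for the example.

```python
def get_speeches(speeches, query):
    """Return only the speeches whose text contains the query."""
    q = query.lower()
    return [s for s in speeches if q in s["text"].lower()]


def get_speeches_agg(speeches, query):
    """Return all speeches of every discussion in which at least one
    speech contains the query, grouped by discussion title."""
    matching = {s["discussion_title"] for s in get_speeches(speeches, query)}
    aggregated = {}
    for s in speeches:
        if s["discussion_title"] in matching:
            aggregated.setdefault(s["discussion_title"], []).append(s)
    return aggregated


# Made-up records; the field names are hypothetical.
speeches = [
    {"speech_id": 1, "speaker": "Speaker A",
     "discussion_title": "Energy policy", "text": "We must phase out coal."},
    {"speech_id": 2, "speaker": "Speaker B",
     "discussion_title": "Energy policy", "text": "Renewables create jobs."},
    {"speech_id": 3, "speaker": "Speaker C",
     "discussion_title": "Transport", "text": "Rail needs investment."},
]

direct = get_speeches(speeches, "coal")
grouped = get_speeches_agg(speeches, "coal")
print([s["speech_id"] for s in direct])
print({title: [s["speech_id"] for s in group] for title, group in grouped.items()})
```

The first mode returns a single speech, whereas the second also returns the reply from the same discussion even though it never mentions the keyword. This is useful when the surrounding debate matters more than the individual match.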

So far, the following datasets are included:

German Bundestag speeches (from September 2017)
UK House of Commons (from January 2016)

The OpenAPI specification is available at https://app.swaggerhub.com/apis/pournaki/PenelopeParliamentData/1.0.0.