Institut für Computerlinguistik

Dr. Johannes Graën

  • Academic Associate
+41 44 635 67 21

Current project

Short bio

I studied Computational Linguistics at the IMS in Stuttgart. During my studies, I spent two years in Barcelona, where I had the opportunity to participate in the PATExpert project at the TALN group and, later on, to write my diploma thesis on dialogue interaction in role-playing video games.

Following my graduation, I led the development of digital cinema content logistics in an international firm (now Ymagis), resuming work that I had carried out for several years as a student. From 2013 to 2017, I worked as a PhD student in the SPARCLING project, investigating methods for assembling and querying large multiparallel corpora with a special focus on multilingual alignment.

Having learned programming in high school, I used to write small applications to support my own language acquisition (vocabulary and inflection) and later on continued developing language learning software with schoolmates, which brought us several prizes in the Jugend forscht competition.  In 2017, I spent two months at Språkbanken in Gothenburg to learn about their approaches to corpus-based language learning applications.

Teaching and supervision

Since the fall semester 2014 (HS14), I have been giving the last lecture in the series on programming techniques for computational linguistics (PCL III), in HS15 together with Peter Makarov and in HS16 with Fabio Rinaldi. This lecture centers on designing and implementing interactive applications (in contrast to processing pipelines) with complex data structures. Another key aspect of the lecture is software development in teams using version control.

In spring semester 2014 (FS14), Martin Volk and I held a research seminar about methods and tools for large parallel corpora.

I (co-)supervised and worked with these students:

  • Stéphanie Lehner (2014) – Deutsche Substantivkomposita in parallelen Korpora: Erstellung und Evaluation eines multilingualen Goldstandards zur Optimierung der automatischen Übersetzungsbestimmung (Lizentiatsarbeit ≈ master's thesis) [thesis] [publication] [gold standard]
  • Raphael Stöcklin (2015) – Implementation einer Mehrwortsuche in grossen parallelen Korpora anhand von Bilingwis (Facharbeit ≈ scientific project over at most six months)
  • Phillip Ströbel (2017) – The “Raison d’Être” of Word Embeddings in Identifying Multiword Expressions in Bilingual Corpora (master's thesis) [thesis]
  • Selena Calleri & Barbara Pejkovic (2017) – Creation of a gold standard for hierarchical multilingual word alignment in six languages (English, French, German, Italian, Slovene, Spanish) [data]
  • Dominique Sandoz (2017) – web frontend and middleware for Multilingwis2 (programming project) [publication] [git repo]
  • Christopf Bless (contribution to the SPARCLING project as student assistant from 2016 to 2017):
    • Youquery – web frontend for the exploration of corpus association measures [publication]
    • SentStructure.js – D3.js-based library for visualizing annotation and alignment of corpus examples [git repo]
    • HAT (hierarchical alignment tool) – tool for efficient multilingual hierarchical word and sentence alignment [git repo] [example1] [example2]
  • Sarah Zurmühle (2018) – Erweiterung der Abfragesprache für das grosse multi-parallele Textkorpus des Multilingwis2 Projektes (bachelor thesis)
  • Tannon Kew & Anastassia Shaitarova (2018–2019) – definition of a flexible corpus format for multiparallel corpora, conversion of existing corpora into that format and export to Multilingwis [publication] [web page]
  • Jonathan Schaber (2019) – development of a search interface for the database in the DiFuPaRo research project


My main interests are parallel and multiparallel corpora and their exploitation for multilingual phraseology and computer-assisted language learning (CALL) applications. On the topic of language learning, I have mostly worked with Gerold Schneider.

These applications require multilingual alignment on different levels, in particular alignment of units larger than single tokens (e.g. phrases), a problem I dealt with in my dissertation.

Other aspects include efficient corpus storage and query systems for multiparallel corpora and visualization of their results.

Other interests and proficiencies

I previously worked for several years as a system administrator for Linux servers and as a DevOps engineer for applications based on PostgreSQL. During my time at the Department of Computational Linguistics, I had the chance to build our new IT infrastructure from scratch, based on virtualization (Proxmox) and distributed services. The architecture of our infrastructure has proven useful both for development and for providing services, that is, web applications and pure web services. Most of my skills in this area date back to my active time at Selfnet e.V. in Stuttgart.

During my studies, I used to regularly take language courses.  Besides German and English, I speak Spanish (C1), French, Portuguese, Italian (B1), Catalan and Swedish (A2).  I also took lessons in Russian, Polish, Czech, Turkish and Icelandic, but, up to now, I can merely read simple texts in these languages.

I enjoy garlic (as in Tzatziki or Gazpacho), volleyball (my previous team) and good movies (Uni-Film Stuttgart, Filmstelle Zürich, Texas cinema). My preferred red wines come from the Montsant region.



  • Multilingwis2 – a web-based search engine for exploration of word-aligned parallel and multiparallel corpora
  • Youquery – a web interface to explore properties of interlingual and intralingual corpus association measures
  • Cutter – frontend for a tokenization web service in several languages
  • Alignment Overlap – tool for exploring translations shared between multiple terms
  • Constellations – syntactic queries on word-aligned parallel corpora




Further information

  • Multilingwis – Multilingual Word Information System
  • Multilingual Corpus Queries
  • Hierarchical Alignment Tool
  • Cutter – Multilingual Tokenizer
  • Visual Association Measures
  • Alignment Overlap – Semantic Relatedness via Alignment Overlap
  • German Particle Verbs
  • European Network for Combining Language Learning with Crowdsourcing Techniques