Johannes Graën

Johannes Graën, Dipl.-Ling.

Short Bio

I studied Computational Linguistics at the IMS in Stuttgart.  During my studies, I spent two years in Barcelona, where I had the opportunity to participate in the PATExpert project at the TALN group and, later on, to write my diploma thesis on dialogue interaction in role-playing video games.

Following my graduation, I led the development of digital cinema content logistics in an international firm (now Ymagis), continuing work that I had carried out for several years as a student.  Since 2013, I have been working as a PhD student in the SPARCLING project (ending in 2017), investigating methods for assembling and querying large multiparallel corpora with a special focus on multilingual alignment.

Having learned programming in high school, I wrote small applications to support my own language acquisition (vocabulary and inflection) and later continued developing language learning software with schoolmates, which won us several prizes in the Jugend forscht competition.  In 2017, I spent two months at Språkbanken in Gothenburg to learn about their approaches to corpus-based language learning applications.

Teaching and supervision

Since the fall semester of 2014 (HS14), I have been giving the last lecture in the series on programming techniques for computational linguistics (PCL III), in HS15 together with Peter Makarov and in HS16 with Fabio Rinaldi.  This lecture centers on designing and implementing interactive applications (in contrast to processing pipelines) with complex data structures.  Another key aspect of the lecture is software development in teams using version control.

In the spring semester of 2014 (FS14), Martin Volk and I held a research seminar on methods and tools for large parallel corpora.

I (co-)supervised these students:

  • Stéphanie Lehner (2014) – Deutsche Substantivkomposita in parallelen Korpora: Erstellung und Evaluation eines multilingualen Goldstandards zur Optimierung der automatischen Übersetzungsbestimmung (Lizentiatsarbeit ≈ master's thesis) [pdf] [gold standard]
  • Raphael Stöcklin (2015) – Implementation einer Mehrwortsuche in grossen parallelen Korpora anhand von Bilingwis (Facharbeit ≈ scientific project over at most six months)
  • Phillip Ströbel (2017) – The “Raison d’Être” of Word Embeddings in Identifying Multiword Expressions in Bilingual Corpora (master's thesis) [pdf]
  • Selena Calleri & Barbara Pejkovic (2017) – Creation of a gold standard for hierarchical multilingual word alignment in six languages (English, French, German, Italian, Slovene, Spanish) [data]
  • Dominique Sandoz (2017) – web frontend and middleware for Multilingwis2 (programming project) [pdf] [git]
  • Christopf Bless (contribution to the SPARCLING project as student assistant from 2016 to 2017):
    • Youquery – web frontend for the exploration of corpus association measures [pdf]
    • SentStructure.js – D3.js-based library for visualizing annotation and alignment of corpus examples [git]
    • HAT (hierarchical alignment tool) – tool for efficient multilingual hierarchical word and sentence alignment [git] [example1] [example2]


My main interests are multiparallel corpora and their exploitation for multilingual phraseology and for CALL applications aimed at language learners with several L1s/L2s.

These applications require multilingual alignment on different levels, in particular alignment of units larger than single tokens (e.g. phrases), which is a key topic of my dissertation.

Other aspects include efficient storage and query systems for parallel and multiparallel corpora and the visualization of query results.

Other interests and proficiencies

I have been working for more than ten years as a SysAdmin (Linux) and DevOps engineer (PostgreSQL).  During my employment at the department of computational linguistics, I had the chance to build our new IT infrastructure from scratch using low-level virtualization (LXC) and shared storage (NFS and GlusterFS) in a cluster environment based on Proxmox.  This architecture allows us to distribute workload between servers at both the application and the service level.  Most of my skills in this area originate from my active time at Selfnet e.V.

During my studies, I regularly took language courses.  Besides German and English, I speak Spanish (C1); French, Portuguese and Italian (B1); and Catalan and Swedish (A2).  I also took lessons in Russian, Polish, Czech, Turkish and Icelandic, but so far I can only read simple texts in these languages.

I like garlic (as in tzatziki or gazpacho), volleyball (my team) and good movies (Uni-Film Stuttgart, Filmstelle Zürich).


  • Multilingwis2 – a web-based search engine for the exploration of word-aligned parallel and multiparallel corpora
  • Youquery – a web interface for exploring properties of interlingual and intralingual corpus association measures
  • Cutter – a frontend for a tokenization web service covering several languages