Software and data

This is a selection of software and datasets that I am or was involved in creating, listed in roughly reverse chronological order.


Nematus - an attention-based encoder-decoder model for neural machine translation

subword-nmt - subword segmentation scripts for neural machine translation, including byte-pair encoding (BPE)

Zmorge - Zurich Morphological Lexicon for German

clevertagger - morphologically informed POS-tagging

Bleualign - an MT-based sentence alignment tool

ParZu - The Zurich Dependency Parser for German (online demo)


x-stance, a multilingual multi-target dataset for stance detection.

ContraPro, a large-scale test set for the evaluation of context-aware pronoun translation in neural machine translation.

WMT 2017 systems: pre-trained neural models and training scripts for the WMT 2017 shared translation task.

ContraWSD, a test set for NMT evaluation of word sense disambiguation.

code docstring corpus, a parallel corpus of Python functions and documentation strings.

LingEval97, a test set of contrastive translation pairs for NMT evaluation.

WMT 2016 systems: pre-trained neural models for the WMT 2016 shared translation task.

WMT 2016 backtranslations: synthetic parallel data (back-translated monolingual data), used at WMT 2016.

WMT 2016 factors: linguistically annotated datasets (for factored neural MT).

WMT 2015 German treebank: dependency parses (with ParZu) of WMT 2015 training data.