Institute of Computational Linguistics

SMULTRON - Stockholm MULtilingual TReebank

Current version: 4.0

Version 1.0

SMULTRON (Stockholm MULtilingual TReebank) is a parallel treebank first developed by the Computational Linguistics Group at the Department of Linguistics at Stockholm University. Version 1.0 of the parallel treebank contains around 1000 sentences in English, German and Swedish. The sentences have been PoS-tagged and annotated with phrase structure trees. The trees have been aligned at the sentence, phrase and word levels. Additionally, the German and Swedish monolingual treebanks contain lemma information.

Version 2.0

The Institute of Computational Linguistics continues the work on the SMULTRON project. Version 2.0 is an extension of the original treebank with a new text type: 500 sentences from a user manual in English, German and Swedish.

Version 3.0

Yet another text genre and two new languages have been added to our parallel treebank: mountaineering reports in French and German as well as the Spanish version of the user manual.

Version 3.0 of the SMULTRON treebank contains around 2500 sentences in TIGER-XML format, in 12 treebank files combined in 9 alignment files.

Version 4.0

The current release integrates the German, Spanish and Cuzco Quechua treebanks of the SQUOIA project with the updated and corrected treebanks of the previous version. The corpus texts are technical reports related to Latin America, especially Peru, and two chapters from the testimonial narrative of Gregorio Condori Mamani, an indigenous Quechua speaker from Southern Peru. There are 8 parallel and aligned German-Spanish treebanks, each in TIGER-XML format, for a total of 4000 sentences. The 4 Cuzco Quechua dependency treebanks in PML format represent a parallel subcorpus of 2000 sentences.

We currently distribute version 4.0 of the SMULTRON treebanks: around 6500 sentences in TIGER-XML format, in 28 treebank files combined in 17 alignment files, plus approximately 2000 sentences in 4 dependency treebank files in PML format.
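
For readers who have not worked with TIGER-XML before, the following minimal sketch shows how one sentence of a phrase structure treebank in that format might be read with Python's standard library. The embedded sample is a hypothetical miniature written for illustration, not an excerpt from SMULTRON; its tag set and edge labels are assumptions, and the function name `read_sentences` is our own. Only the general TIGER-XML layout (`s`, `graph`, `terminals`/`t`, `nonterminals`/`nt`, `edge`) is relied upon.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature TIGER-XML document (NOT taken from SMULTRON).
# The PoS tags, categories and edge labels are illustrative assumptions.
SAMPLE = """<corpus id="demo">
  <body>
    <s id="s1">
      <graph root="s1_500">
        <terminals>
          <t id="s1_1" word="The" pos="DT"/>
          <t id="s1_2" word="garden" pos="NN"/>
          <t id="s1_3" word="blooms" pos="VBZ"/>
        </terminals>
        <nonterminals>
          <nt id="s1_500" cat="S">
            <edge label="SBJ" idref="s1_501"/>
            <edge label="HD" idref="s1_3"/>
          </nt>
          <nt id="s1_501" cat="NP">
            <edge label="DET" idref="s1_1"/>
            <edge label="HD" idref="s1_2"/>
          </nt>
        </nonterminals>
      </graph>
    </s>
  </body>
</corpus>"""


def read_sentences(root):
    """Collect tokens and phrase nodes for every <s> element under root."""
    sentences = []
    for s in root.iter("s"):
        # Terminals carry the word form and its PoS tag as attributes.
        tokens = [(t.get("word"), t.get("pos")) for t in s.iter("t")]
        # Each nonterminal lists its children through <edge idref="..."/>.
        phrases = {
            nt.get("id"): (nt.get("cat"), [e.get("idref") for e in nt.findall("edge")])
            for nt in s.iter("nt")
        }
        sentences.append({"id": s.get("id"), "tokens": tokens, "phrases": phrases})
    return sentences


sentences = read_sentences(ET.fromstring(SAMPLE))
for sent in sentences:
    print(sent["id"], " ".join(word for word, _ in sent["tokens"]))
```

To read a downloaded treebank file instead of the inline sample, `ET.parse(path).getroot()` can be passed to `read_sentences`; the alignment files and the PML dependency treebanks use different layouts and are not covered by this sketch.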

We plan to extend the treebank with further texts in other languages and complete the alignments for all language pairs.

Download

Please register here.

Documentation

Reference

Please cite SMULTRON as follows:

@MISC{Smultron2015,
  author = {Martin Volk and Anne Göhring and Annette Rios and Torsten Marek and Yvonne Samuelsson},
  year = 2015,
  title = {{SMULTRON (version 4.0) — The Stockholm MULtilingual parallel TReebank}},
  note = {An English-French-German-Quechua-Spanish-Swedish parallel treebank
          with sub-sentential alignments},
  howpublished = {http://www.cl.uzh.ch/research/parallelcorpora/paralleltreebanks_en.html},
  institution = {Institute of Computational Linguistics, University of Zurich}
}

Publications

This is a collection of the publications regarding the SMULTRON parallel treebank and its creation.