Domain-specific Statistical Machine Translation
The Department of Computational Linguistics investigates the use of small domain-specific corpora for Statistical Machine Translation (SMT). This research is motivated by our experience with industry partners who wish to build translation systems for specific application areas but have only a small amount of domain-specific training data. We work with a small parallel corpus of Alpine texts (5 million tokens): the publications of the Swiss Alpine Club (SAC) were digitized in the project Text+Berg digital, and parts of the resulting corpus are parallel (DE-FR). We investigated the combination of the Text+Berg corpus with other resources, for instance additional monolingual, parallel or comparable corpora, or other machine translation systems.
Focus of the research project
- Use of domain-specific parallel corpora for SMT: corpus creation, sentence alignment and cost-benefit analysis.
- Extraction of domain-specific translations from comparable corpora.
- Combination of domain-specific and out-of-domain parallel corpora.
- Combination of domain-specific and general-purpose machine translation systems.
- Use and improvement of NLP resources (name classifiers, PoS taggers, parsers) for English, French and German in order to improve SMT.
- Building tools for multilingual terminology visualisation.
- Building a parallel treebank DE-FR for evaluation purposes.
Project head:
Researchers:
The project was funded by the Swiss National Science Foundation and ran from 2010 to 2013.
Project results
- Demo SMT system - SMT systems for the Alpine domain (password-protected).
- Bilingwis - a translation visualisation and search tool.
- Bleualign - an MT-based sentence alignment tool.
- ParZu - the Zurich Dependency Parser for German (online demo).
- Some results of the project have been contributed to the Moses SMT toolkit.
- A German-French parallel treebank from the Alpine domain has been added to the Smultron treebank.
Publications
ZORA Publication List
- Exploiting Synergies Between Open Resources for German Dependency Parsing, POS-tagging, and Morphological Analysis. In: Recent Advances in Natural Language Processing (RANLP 2013), Hissar, Bulgaria, 7 September 2013 - 13 September 2013, 601-609.
- Promoting Flexible Translations in Statistical Machine Translation. In: Proceedings of the XIV Machine Translation Summit, Nice, 2 September 2013 - 6 September 2013, 207-214.
- A Multi-Domain Translation Model Framework for Statistical Machine Translation. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Sofia, Bulgaria, 4 August 2013 - 9 August 2013, 832-840.
- Dirt cheap web-scale parallel text from the Common Crawl. In: 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria, August 2013. Association for Computational Linguistics, 1374-1383.
- Mining for Domain-specific Parallel Text from Wikipedia. In: Proceedings of the Sixth Workshop on Building and Using Comparable Corpora, Sofia, Bulgaria, August 2013, 112-120.
- Using parallel treebanks for machine translation evaluation. In: The 11th International Workshop on Treebanks and Linguistic Theories, Lisbon, Portugal, 30 November 2012 - 1 December 2012, 145-156.
- Mixture-modeling with unsupervised clusters for domain adaptation in statistical machine translation. In: 16th EAMT Conference, Trento, Italy, 28 May 2012 - 29 May 2012, 185-192.
- Towards a Wikipedia-extracted alpine corpus. In: The Fifth Workshop on Building and Using Comparable Corpora, Istanbul, Turkey, 26 May 2012.
- Perplexity minimization for translation model domain adaptation in statistical machine translation. In: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, Avignon, France, April 2012. Association for Computational Linguistics, 539-549.
- Digging for names in the mountains: Combined person name recognition and reference resolution for German alpine texts. In: 5th Language & Technology Conference, Poznan, Poland, 25 November 2011 - 27 November 2011.
- From historic books to annotated XML: Building a large multilingual diachronic corpus. In: Conference of the German Society for Computational Linguistics and Language Technology (GSCL) 2011, Hamburg, Germany, 28 September 2011 - 30 September 2011, 75-80.
- The UZH system combination system for WMT 2011. In: Sixth Workshop on Statistical Machine Translation, Edinburgh, UK, 30 July 2011 - 31 July 2011. Association for Computational Linguistics, 166-170.
- Combining multi-engine machine translation and online learning through dynamic phrase tables. In: EAMT-2011: the 15th Annual Conference of the European Association for Machine Translation, Leuven, Belgium, 30 May 2011 - 31 May 2011.
- Iterative, MT-based sentence alignment of parallel texts. In: NODALIDA 2011, Nordic Conference of Computational Linguistics, Riga, 11 May 2011 - 13 May 2011.
- Strategies for reducing and correcting OCR errors. In: Sporleder, Caroline; van den Bosch, Antal; Zervanou, Kalliopi. Language Technology for Cultural Heritage. Berlin: Springer, 3-22.
- MT-based sentence alignment for OCR-generated parallel texts. In: The Ninth Conference of the Association for Machine Translation in the Americas (AMTA 2010), Denver, 31 October 2010 - 4 November 2010.
- Reducing OCR errors by combining two OCR systems. In: ECAI 2010 Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH 2010), Lisbon, Portugal, 16 August 2010, 61-65.
- Combining parallel treebanks and geo-tagging. In: Fourth Linguistic Annotation Workshop (LAW IV), Uppsala, 15 July 2010 - 16 July 2010.
- Challenges in building a multilingual alpine heritage corpus. In: Seventh International Conference on Language Resources and Evaluation (LREC), Malta, 19 May 2010 - 21 May 2010.