Domain-specific Statistical Machine Translation

The Department of Computational Linguistics investigates the use of small domain-specific corpora for Statistical Machine Translation (SMT). This research is motivated by our experience with industry partners who wish to build translation systems for specific application areas but have only a small amount of domain-specific training data. We work with a small parallel corpus of Alpine texts (5 million tokens): the publications of the Swiss Alpine Club (SAC) were digitized in the project Text+Berg digital, and parts of the corpus are parallel (DE-FR). We investigated the combination of the Text+Berg corpus with other resources, for instance additional monolingual, parallel or comparable corpora, or other machine translation systems.

Focus of the research project

Use of domain-specific parallel corpora for SMT: corpus creation, sentence alignment and cost-benefit analysis.
Extraction of domain-specific translations from comparable corpora.
Combination of domain-specific and out-of-domain parallel corpora.
Combination of domain-specific and general-purpose machine translation systems.
Use and improvement of NLP resources (name classifiers, PoS taggers, parsers) in English, French and German in order to improve SMT.
Building tools for multilingual terminology visualisation.
Building a parallel treebank DE-FR for evaluation purposes.

Project head:

Martin Volk

Researchers:

The project was funded by the Swiss National Science Foundation and ran from 2010 to 2013.

Project results

Demo SMT system - SMT systems for the Alpine domain (password-protected).
Bilingwis - a translation visualisation and search tool.
Bleualign - an MT-based sentence alignment tool.
ParZu - the Zurich Dependency Parser for German (online demo).
Some results of the project have been contributed to the Moses SMT toolkit.
A German-French parallel treebank from the Alpine domain has been added to the Smultron treebank.

Publications

ZORA Publication List


Sennrich, R., Volk, M., & Schneider, G. (2013). Exploiting Synergies Between Open Resources for German Dependency Parsing, POS-tagging, and Morphological Analysis. 601–609. http://www.aclweb.org/anthology/R/R13/R13-1079.pdf
Sennrich, R. (2013). Promoting Flexible Translations in Statistical Machine Translation. 207–214. http://www.mtsummit2013.info/files/proceedings/main/mt-summit-2013-sennrich.pdf
Sennrich, R., Schwenk, H., & Aransa, W. (2013). A Multi-Domain Translation Model Framework for Statistical Machine Translation. 832–840. http://www.aclweb.org/anthology/P13-1082
Plamada, M., & Volk, M. (2013). Mining for Domain-specific Parallel Text from Wikipedia. 112–120. http://www.aclweb.org/anthology/W13-2514
Smith, J. R., Saint-Amand, H., Plamada, M., Koehn, P., Callison-Burch, C., & Lopez, A. (2013). Dirt cheap web-scale parallel text from the Common Crawl. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1374–1383. http://www.aclweb.org/anthology/P13-1135
Plamada, M., & Volk, M. (2012). Using parallel treebanks for machine translation evaluation. 145–156. http://tlt11.clul.ul.pt/
Sennrich, R. (2012). Mixture-modeling with unsupervised clusters for domain adaptation in statistical machine translation. 185–192. http://hltshare.fbk.eu/EAMT2012/html/Papers/42.pdf
Plamada, M., & Volk, M. (2012, May 26). Towards a Wikipedia-extracted alpine corpus. The Fifth Workshop on Building and Using Comparable Corpora, Istanbul. http://www.lrec-conf.org/proceedings/lrec2012/workshops/16.BUCC2012%20Proceedings.pdf
Sennrich, R. (2012). Perplexity minimization for translation model domain adaptation in statistical machine translation. Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, 539–549. http://www.aclweb.org/anthology/E12-1055
Ebling, S., Sennrich, R., Klaper, D., & Volk, M. (2011, November 27). Digging for names in the mountains: Combined person name recognition and reference resolution for German alpine texts. 5th Language & Technology Conference, Poznan. https://doi.org/10.1007/978-3-319-08958-4_16
Jitca, M., Sennrich, R., & Volk, M. (2011). From historic books to annotated XML: Building a large multilingual diachronic corpus (No. 96). 75–80. http://www.corpora.uni-hamburg.de/gscl2011/downloads/AZM96.pdf
Sennrich, R. (2011). The UZH system combination system for WMT 2011. Proceedings of the Sixth Workshop on Statistical Machine Translation, 166–170. http://www.aclweb.org/anthology/W11-2120
Sennrich, R. (2011, May 31). Combining multi-engine machine translation and online learning through dynamic phrase tables. EAMT-2011: the 15th Annual Conference of the European Association for Machine Translation, Leuven.
Sennrich, R., & Volk, M. (2011, May 13). Iterative, MT-based sentence alignment of parallel texts. NODALIDA 2011, Nordic Conference of Computational Linguistics, Riga.
Volk, M., Furrer, L., & Sennrich, R. (2011). Strategies for reducing and correcting OCR errors. In C. Sporleder, A. van den Bosch, & K. Zervanou (Eds.), Language Technology for Cultural Heritage (pp. 3–22). Springer. https://doi.org/10.1007/978-3-642-20227-8_1
Sennrich, R., & Volk, M. (2010, November 4). MT-based sentence alignment for OCR-generated parallel texts. The Ninth Conference of the Association for Machine Translation in the Americas (AMTA 2010), Denver. http://amta2010.amtaweb.org/AMTA/papers/2-14-SennrichVolk.pdf
Volk, M., Marek, T., & Sennrich, R. (2010). Reducing OCR errors by combining two OCR systems. 61–65. http://ilk.uvt.nl/LaTeCH2010/paperlist.html
Volk, M., Goehring, A., & Marek, T. (2010, July 16). Combining parallel treebanks and geo-tagging. Fourth Linguistic Annotation Workshop (LAW IV), Uppsala.
Volk, M., Bubenhofer, N., Althaus, A., Bangerter, M., Furrer, L., & Ruef, B. (2010, May 21). Challenges in building a multilingual alpine heritage corpus. Seventh International Conference on Language Resources and Evaluation (LREC), Malta.
