NOAH's Corpus - Part-of-Speech Tagging for Swiss German
This project is supported by the Institute of Computational Linguistics, where it originated in a seminar in the spring semester of 2012.
Swiss German is a dialect continuum whose dialects differ considerably from Standard German, the official language of the German-speaking part of Switzerland. Nevertheless, natural language processing of Swiss German usually takes a detour through Standard German. As writing in Swiss German has become increasingly popular in recent years, we want to provide data and resources that serve as a stepping stone toward processing the dialects automatically.
We compiled NOAH's Corpus of Swiss German Dialects, which covers various text genres and is manually annotated with Part-of-Speech tags. The first release, from September 2014, contains 70'000 tokens; the current release, from May 2015, contains 115'000 tokens.
Furthermore, we used this corpus as the training set for a statistical Part-of-Speech tagger (BTagger) and achieved an accuracy of 90%.
In addition, we are building a dialect identification system based on character n-grams. The baseline system developed for five major dialects reached an F-score of 0.66.
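To illustrate the general idea behind character n-gram dialect identification, here is a minimal sketch. It is not the project's actual system: the helper `char_ngrams`, the class `NgramDialectClassifier`, the choice of trigrams, the add-one smoothing, and the toy training strings with the labels `dialect_a` and `dialect_b` are all assumptions made for this example. Each dialect gets a count profile of its character n-grams, and a new text is assigned to the dialect whose profile gives it the highest smoothed log-probability.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Return the character n-grams of a lowercased, space-padded string."""
    padded = " " * (n - 1) + text.lower() + " " * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramDialectClassifier:
    """Hypothetical baseline: one character n-gram count profile per label."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {}    # label -> Counter of n-grams seen in training
        self.totals = {}    # label -> total number of n-grams for that label
        self.vocab = set()  # all n-grams seen in any training text

    def fit(self, samples):
        """Train on an iterable of (text, label) pairs."""
        for text, label in samples:
            grams = char_ngrams(text, self.n)
            self.counts.setdefault(label, Counter()).update(grams)
            self.vocab.update(grams)
        self.totals = {lab: sum(c.values()) for lab, c in self.counts.items()}
        return self

    def score(self, text, label):
        """Add-one smoothed log-probability of the text under one label."""
        size = len(self.vocab) + 1  # one extra slot for unseen n-grams
        profile = self.counts[label]
        total = self.totals[label]
        return sum(
            math.log((profile[g] + 1) / (total + size))
            for g in char_ngrams(text, self.n)
        )

    def predict(self, text):
        """Return the label whose profile scores the text highest."""
        return max(self.counts, key=lambda lab: self.score(text, lab))


# Toy, made-up strings standing in for dialect-labelled sentences.
toy_samples = [
    ("aaba abab baab", "dialect_a"),
    ("abba aabb baba", "dialect_a"),
    ("zzyz zyzy yzzy", "dialect_b"),
    ("zyyz zzyy yzyz", "dialect_b"),
]
clf = NgramDialectClassifier(n=3).fit(toy_samples)
print(clf.predict("abab"))
```

With real data, the toy strings would be replaced by dialect-labelled sentences, and the n-gram order and smoothing would need tuning on held-out text.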
- Noëmi Aepli, Nora Hollenstein, Simon Clematide. NOAH 3.0: Recent Improvements in a Part-of-Speech Tagged Corpus for Swiss German Dialects. SwissText 2018: 116.
- Nora Hollenstein & Noëmi Aepli. A Resource for Natural Language Processing of Swiss German Dialects. GSCL 2015: 108.
- Nora Hollenstein & Noëmi Aepli. Compilation of a Swiss German Dialect Corpus and its Application to PoS Tagging. VarDial@COLING 2014: 85.
Dependency Parser for Swiss German
NOAH's Corpus served as a resource for another Swiss German NLP project: Universal Dependency Parsing for Swiss German.