NLP for Swiss German

NOAH's Corpus - Part-of-Speech Tagging for Swiss German

This project is supported by the Institute of Computational Linguistics, where it started as part of a seminar in the spring semester of 2012.

Swiss German is a dialect continuum whose dialects differ considerably from Standard German, the official language of the German-speaking part of Switzerland. Nevertheless, natural language processing of Swiss German usually takes a detour through Standard German. As writing in Swiss German has become increasingly popular in recent years, we want to provide data and resources that serve as a stepping stone towards processing the dialects automatically.

We compiled NOAH's Corpus of Swiss German Dialects, which covers various text genres and is manually annotated with part-of-speech tags. The first release, from September 2014, contains 70'000 tokens; the current release, from May 2015, contains 115'000 tokens.

Furthermore, we used this corpus as the training set for a statistical part-of-speech tagger (BTagger) and achieved an accuracy of 90%.
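For readers unfamiliar with the metric, tagging accuracy is usually computed per token: the share of tokens whose predicted tag matches the manually annotated one. The following is a minimal sketch of that calculation, not the evaluation code used in this project. The function name `tagging_accuracy` and the tag names in the example are hypothetical placeholders.

```python
def tagging_accuracy(gold_tags, predicted_tags):
    """Return the fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must have the same length")
    if not gold_tags:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


# Illustrative tags only; they are not taken from the corpus.
gold = ["PRON", "VERB", "ADV", "VERB"]
pred = ["PRON", "VERB", "ADV", "NOUN"]
accuracy = tagging_accuracy(gold, pred)  # 3 of 4 tokens match
```

An accuracy of 90% therefore means that, on average, nine out of ten tokens in the evaluation data received the same tag as in the manual annotation.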

In addition, we are building a dialect identification system based on character n-grams. The baseline system developed for five major dialects reached an F-score of 0.66.
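To illustrate the general idea of a character n-gram approach, here is a minimal sketch under stated assumptions. The page does not say which classifier the baseline uses, so the Naive Bayes model with add-one smoothing below is our own choice for illustration. The names `char_ngrams`, `NgramDialectClassifier`, the labels `dialect_a` and `dialect_b`, and the toy sentences are all hypothetical and are not taken from the corpus.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3, pad="_"):
    """Return the overlapping character n-grams of text, padded at both ends."""
    padded = pad + text.lower() + pad
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramDialectClassifier:
    """Naive Bayes over character n-grams with add-one smoothing (illustrative)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.doc_counts = Counter()         # label -> number of training texts
        self.vocab = set()                  # all n-grams seen in training

    def train(self, samples):
        for text, label in samples:
            grams = char_ngrams(text, self.n)
            self.counts[label].update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label in sorted(self.counts):
            total = sum(self.counts[label].values())
            # Log prior plus smoothed log likelihood of every n-gram.
            score = math.log(self.doc_counts[label] / total_docs)
            for gram in grams:
                count = self.counts[label][gram]
                score += math.log((count + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy sentences with placeholder labels, for illustration only.
TOY_SAMPLES = [
    ("i ha gäng gseit", "dialect_a"),
    ("mir hei gäng zyt", "dialect_a"),
    ("ich han immer gseit", "dialect_b"),
    ("mir händ immer ziit", "dialect_b"),
]

classifier = NgramDialectClassifier(n=3)
classifier.train(TOY_SAMPLES)
```

Character n-grams are attractive for this task because Swiss German has no standardized spelling: the model picks up on typical letter sequences rather than relying on a fixed vocabulary. A real system would need far more training text per dialect than this toy example.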

For downloads and more information, visit the official website: Swiss German Language Processing.

Publications

  • Noëmi Aepli, Nora Hollenstein, Simon Clematide. NOAH 3.0: Recent Improvements in a Part-of-Speech Tagged Corpus for Swiss German Dialects. SwissText 2018: 116.
  • Nora Hollenstein, Noëmi Aepli. A Resource for Natural Language Processing of Swiss German Dialects. GSCL 2015: 108.
  • Nora Hollenstein, Noëmi Aepli. Compilation of a Swiss German Dialect Corpus and its Application to PoS Tagging. VarDial@COLING 2014: 85.

Dependency Parser for Swiss German

NOAH's Corpus was also used as a resource for another Swiss German NLP project: Universal Dependency Parsing for Swiss German.