NLP for Swiss German

NOAH's Corpus

This project is supported by the Institute of Computational Linguistics, where it began as part of a seminar in the spring semester of 2012.

Swiss German is a dialect continuum whose dialects differ considerably from Standard German, the official language of the German-speaking part of Switzerland. Nevertheless, natural language processing of Swiss German usually takes a detour through Standard German. As writing in Swiss German has become increasingly popular in recent years, we would like to provide data and resources that serve as a stepping stone towards processing the dialects automatically.

We compiled NOAH's Corpus of Swiss German Dialects, which consists of texts from various genres, manually annotated with part-of-speech tags. The first release, from September 2014, contains 70'000 tokens; the current release, from May 2015, contains 115'000 tokens.

Furthermore, we used this corpus as the training set for a statistical part-of-speech tagger (BTagger) and achieved an accuracy of 90%.

In addition, we are building a dialect identification system based on character n-grams. The baseline system developed for five major dialects reached an F-score of 0.66.
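
To make the character n-gram idea concrete, here is a minimal sketch in Python. It is not the project's system: the similarity measure (cosine over raw n-gram counts), the n-gram range of 1 to 3, the macro-averaged F-score, the function names, and the two toy training strings with their placeholder labels are all assumptions chosen for illustration, and none of the data comes from the corpus.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Count character n-grams of length n_min..n_max in a padded, lowercased string."""
    padded = f" {text.lower()} "
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts


def train_profiles(labelled_texts):
    """Sum the n-gram counts of all training texts per label."""
    profiles = {}
    for text, label in labelled_texts:
        profiles.setdefault(label, Counter()).update(char_ngrams(text))
    return profiles


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict(profiles, text):
    """Return the label whose profile is most similar to the text."""
    grams = char_ngrams(text)
    # Sorting the labels makes tie-breaking deterministic.
    return max(sorted(profiles), key=lambda label: cosine(grams, profiles[label]))


def macro_f1(gold, predicted):
    """Unweighted mean of the per-label F1 scores."""
    labels = sorted(set(gold) | set(predicted))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: invented strings with placeholder labels.
TOY_TRAIN = [
    ("ich bi nöd dehei", "dialect_a"),
    ("i bi nid deheim", "dialect_b"),
]

if __name__ == "__main__":
    profiles = train_profiles(TOY_TRAIN)
    test_texts = ["nöd", "nid"]
    gold = ["dialect_a", "dialect_b"]
    predicted = [predict(profiles, text) for text in test_texts]
    print(predicted, macro_f1(gold, predicted))
```

A real system would train on far more text per dialect and would typically weight or smooth the n-gram counts. The sketch only shows how short character sequences can separate closely related varieties and how an F-score is computed from the resulting predictions.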

For downloads and more information, visit Swiss German Language Processing.

Project Heads: