Bulletin4Corpus

The Institute of Computational Linguistics is building a parallel corpus for German, French, Italian and English. The corpus contains the Credit Suisse Bulletin, which has been published since 1895, in part in four languages. The magazine covers economic and socially relevant topics and is therefore neither a banking magazine nor a traditional corporate magazine. This makes the Bulletin interesting as a training corpus for applications such as machine translation, since it provides access to another genre, one that suits newspaper and magazine texts, for instance.

Issues from the first edition of 1998 onward are published as PDF; older issues are available bound together as books. Text is extracted from the PDFs, annotated, and aligned at the article and sentence level. The books are scanned and will be integrated in later releases.

Furthermore, we have built a corpus from the “News” articles on the Credit Suisse website. It contains some 500 articles in German, French, Italian and English.

Project head:

Assistants: