Data

Here I would like to share with you data that is free to use to replicate experiments I have done on my own or together with my colleagues. Please mention the use of the data appropriately if it leads to a publication of yours :-).

Ground truth for Neue Zürcher Zeitung black letter period

NZZ title page

The Neue Zürcher Zeitung (NZZ) had been publishing in black letter from its very first issue in 1780 until 1947. From this time period, we randomly sampled one front page per year, resulting in a total of 167 pages. We chose front pages because they typically contain highly relevant material and because we wanted to make sure not to sample pages containing exclusively advertisements or stock information. During certain periods, the NZZ was published several times a day, and there were supplements, too. Due to incomplete metadata, the sample includes front pages from supplements.

We then manually corrected the text of these 167 pages. The so-created ground truth can serve several purposes, but was initially intended to improve OCR.

The ground truth can be downloaded from several sources:

  • GitHub
  • DOI
  • DOI