Temporal entity extraction from historical texts

NLP for historical texts

Motivated by research questions from the humanities, automatic processing of historical texts is an emerging subfield of NLP that deals with a set of unique problems.

The main issues of the automatic processing of historical texts are:

Spelling variation

Spelling in historical text is not only different from today's orthography, but it can also vary in the same text written by the same author. Unfortunately for NLP researchers, standardized orthography is a relatively recent development. Due to this, most of the available NLP tools cannot be directly applied to historical texts.

Unavailability of large corpora

Large corpora of historical text practically do not exist. Any acquisition of historical corpora requires a long process of digitization. The size of the acquired historical corpora cannot be compared to the size of corpora in a modern language. This limits the possibility of statistical text processing, which requires big corpora.

The Project

The Department of Computational Linguistics, together with the Swiss Law Sources Foundation (SLSF), addresses the above-mentioned problems by investigating methods for extracting temporal expressions from historical texts.

Purpose: to enrich the database of historical names and places of Switzerland (based on the texts of the SLSF) with additional temporal information extracted from the source texts.

Data

Being a research institution that publishes sources of old law up to 1798, the Swiss Law Sources Foundation provided us with about 30 volumes of digitized texts in German, French, Italian, Romansh or Latin depending on the canton of origin and creation time.

Image on the left: View on Appenzell from the "oldest" Appenzeller Landbuch, 1540 (LAAI, Bücher, Nr. 10).
Image on the right: Title page of the Landbuchs Appenzell Ausserrhoden, 1632 (KBAR, CM Ms. 16).

Methods and stages of the project

First stage: creation of a manually annotated Gold Standard

As a basis for our experiments, we created a manually annotated Gold Standard sample corpus of Early New High German. The corpus contains annotations of 800 temporal expressions of different kinds: explicit, implicit and relative. The annotations were performed in a subset of the TimeML mark-up language, i.e. temporal expressions are tagged with TIMEX3 tags.

Second stage: applying spelling normalisation techniques

Normalisation is a standard approach to processing historical texts. It reduces spelling variation and thus allows modern tools to be applied, e.g. the rule-based temporal tagger HeidelTime. We are evaluating various normalisation techniques in order to measure how much normalisation improves temporal entity extraction.
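
The idea of rule-based normalisation can be sketched as an ordered list of rewrite rules applied to each text. The rules and the sample phrase below are invented for illustration and are not the rule set used in the project; the function name `normalise` is likewise hypothetical.

```python
import re

# Illustrative rewrite rules (assumptions, not the project's actual rules).
# Order matters: "vff" first becomes "uff", which the next rule maps to "auf".
RULES = [
    # word-initial "v" before a non-vowel word character: vnd -> und, vff -> uff
    (re.compile(r"\bv(?=[^aeiouäöü\W])"), "u"),
    (re.compile(r"\buff\b"), "auf"),
    (re.compile(r"\bjar\b"), "jahr"),
    (re.compile(r"\bdry\b"), "drei"),
]


def normalise(text):
    """Lower-case the text and apply every rewrite rule in order."""
    text = text.lower()
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


print(normalise("Vnd vff den dritten tag nach dry jar"))
# prints: und auf den dritten tag nach drei jahr
```

The normalised output could then be passed to a tagger built for modern German. Hand-written rules like these are only one of several possible normalisation techniques.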
Third stage: training of an annotation-based system

We will use our Gold Standard corpus to train a machine learning system for automatic annotation.

Fourth stage: evaluation of the annotation-based system

We will evaluate the best-performing system for temporal entity extraction on a previously unseen historical corpus.
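
A common way to score such a system is precision, recall and F1 over exactly matching spans. The sketch below assumes that gold and predicted temporal expressions are given as (start, end) character offsets; the function name `span_prf` and the offsets are made up for illustration, and the project may have used a different matching criterion.

```python
def span_prf(gold_spans, predicted_spans):
    """Strict-match precision, recall and F1 over (start, end) spans."""
    gold = set(gold_spans)
    predicted = set(predicted_spans)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical offsets: two of the four predictions match a gold span exactly.
gold = [(10, 34), (50, 57), (80, 92)]
predicted = [(10, 34), (50, 55), (80, 92), (100, 104)]
precision, recall, f1 = span_prf(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Strict matching counts a prediction with a wrong boundary, such as (50, 55) above, as both a false positive and a missed gold span. A relaxed, overlap-based criterion would be more lenient.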

Project leaders:

Researchers:

Natalia Korchagina

The project was funded by the Swiss Law Sources Foundation. It started at the beginning of 2014 and was finished in 2019.

Publications

Korchagina, Natalia (2016). Building a Gold Standard for Temporal Entity Extraction from Medieval German Texts. In: Proceedings of the Conference on Language Technologies and Digital Humanities, Ljubljana, Slovenia, 29 September 2016 - 1 October 2016, 90-94.
Korchagina, Natalia (2017). Normalizing Medieval German Texts: from rules to deep learning. In: NoDaLiDa 2017 Workshop on Processing Historical Language, Gothenburg, 22 May 2017.
Korchagina, Natalia. Temporal Entity Extraction from Historical Texts. Doctoral thesis (to be published later in 2020).

The Gold Standard Corpus

For the experiments in this project, we created a Gold Standard of temporal annotations. The corpus contains 50 historical legal articles in Early New High German. It was annotated in a subset of the TimeML markup language for temporal annotation. The corpus contains about 34,000 tokens and is available here.
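
To give an impression of how TIMEX3 annotations can be read programmatically, here is a minimal sketch. The sample sentence is invented and does not come from the corpus. The attributes `tid`, `type` and `value` are standard TimeML attributes, but whether the corpus uses exactly these is an assumption, as is the helper name `extract_timex3`.

```python
import re

# Matches one TIMEX3 element and captures its attribute string and its text.
TIMEX3_RE = re.compile(r"<TIMEX3\s+([^>]*)>(.*?)</TIMEX3>", re.DOTALL)
# Matches a single name="value" attribute pair.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def extract_timex3(annotated_text):
    """Return one dict per TIMEX3 element: its text plus its attributes."""
    expressions = []
    for match in TIMEX3_RE.finditer(annotated_text):
        attributes = dict(ATTR_RE.findall(match.group(1)))
        expressions.append({"text": match.group(2), **attributes})
    return expressions


# Invented sample sentence (not taken from the Gold Standard corpus).
sample = (
    'Geben uff <TIMEX3 tid="t1" type="DATE" value="1540-11-11">'
    "sant Martins tag anno 1540</TIMEX3>, und sol weren "
    '<TIMEX3 tid="t2" type="DURATION" value="P3Y">dry jar</TIMEX3>.'
)

timexes = extract_timex3(sample)
for timex in timexes:
    print(timex["tid"], timex["type"], timex["value"], timex["text"])
```

A regular expression is sufficient for flat, non-nested tags like these. For complete TimeML documents an XML parser would be the more robust choice.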
