Temporal entity extraction from historical texts

NLP for historical texts

Motivated by research questions from the humanities, automatic processing of historical texts is an emerging subfield of NLP that deals with a set of unique problems.

The main issues in the automatic processing of historical texts are:

  • Spelling variation


    Spelling in historical texts not only differs from today's orthography, but can also vary within a single text written by a single author. Unfortunately for NLP researchers, standardized orthography is a relatively recent development. As a result, most available NLP tools cannot be applied directly to historical texts.
  • Unavailability of large corpora


    Large corpora of historical text are practically non-existent. Acquiring a historical corpus requires a long process of digitization, and the resulting corpora are far smaller than those available for modern languages. This limits the use of statistical text processing methods, which require large corpora.
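
To make the spelling problem concrete, here is a minimal sketch of how an exact lexicon lookup fails on a historical variant while a fuzzy comparison still finds the modern form. The word forms, the tiny lexicon and the similarity threshold are illustrative assumptions, not data or settings from the project.

```python
from difflib import SequenceMatcher

# Tiny modern lexicon, invented for illustration only.
MODERN_LEXICON = ["Jahr", "Tag", "Monat"]


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in the range 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_variant(token: str, lexicon: list[str], threshold: float = 0.7):
    """Return the most similar lexicon entry, or None if nothing is close enough.

    The threshold of 0.7 is an arbitrary assumption for this sketch.
    """
    best = max(lexicon, key=lambda word: similarity(token, word))
    return best if similarity(token, best) >= threshold else None


# "jar" is a common historical spelling of "Jahr" (year).
print("jar" in MODERN_LEXICON)               # exact lookup fails
print(match_variant("jar", MODERN_LEXICON))  # fuzzy lookup finds the modern form
```

Fuzzy matching of this kind only works when a suitable lexicon exists and quickly produces false matches on short words, which is one reason dedicated normalisation techniques are needed.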

The Project

The Institute of Computational Linguistics, together with the Swiss Law Sources Foundation (SLSF), addresses the above-mentioned problems by investigating methods for the extraction of temporal expressions from historical texts.

Purpose: to enrich the database of historical names and places of Switzerland (based on the texts of the SLSF) with additional temporal information extracted from the source texts.

Data

As a research institution that publishes sources of old law up to 1798, the Swiss Law Sources Foundation provided us with about 30 volumes of digitized texts in German, French, Italian, Romansh or Latin, depending on the canton of origin and the time of creation.

Image on the left: View of Appenzell from the "oldest" Appenzeller Landbuch, 1540 (LAAI, Bücher, Nr. 10).
Image on the right: Title page of the Landbuch Appenzell Ausserrhoden, 1632 (KBAR, CM Ms. 16).

Methods and stages of the project

  • First stage: creation of a manually annotated Gold Standard


    As a basis for our experiments, we created a manually annotated Gold Standard sample corpus of Early New High German. The corpus contains annotations of 800 temporal expressions of different kinds: explicit, implicit and relative. The annotations use a subset of the TimeML mark-up language, i.e. temporal expressions are tagged with TIMEX3 tags.
  • Second stage: applying spelling normalisation techniques


    Normalisation is a standard approach to the processing of historical texts. It reduces spelling variation and thus allows modern tools to be applied, e.g. the rule-based temporal tagger HeidelTime. We are evaluating various normalisation techniques in order to measure how much normalisation improves temporal entity extraction.
  • Third stage: training of an annotation-based system


    We will use our Gold Standard corpus to train a machine learning system for automatic annotation.
  • Fourth stage: temporal entity extraction and database enrichment


    Once the whole corpus of SLSF historical texts is annotated, we will extract the temporal information and add it to the existing entries in our database.
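
The pipeline of normalising, tagging and extracting can be sketched in a few lines. The following toy example is not the project's implementation and does not use HeidelTime: it applies two hand-written normalisation rules, wraps four-digit years in TIMEX3 tags, and reads the `value` attributes back out. The rules, the sample sentence and all function names are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical rewrite rules (historical spelling -> modern spelling).
# Real normalisation techniques are far more elaborate than this.
NORMALISATION_RULES = [
    (re.compile(r"\bjar\b", re.IGNORECASE), "Jahr"),
    (re.compile(r"\bvnd\b", re.IGNORECASE), "und"),
]

# Toy pattern for explicit years from 1000 to 1799. It would also match
# any other four-digit number in that range, so it is only a sketch.
YEAR = re.compile(r"\b(1[0-7]\d{2})\b")

# Reads the value attribute of a TIMEX3 opening tag.
TIMEX3_VALUE = re.compile(r'<TIMEX3 [^>]*value="([^"]+)"[^>]*>')


def normalise(text: str) -> str:
    """Apply each rewrite rule in turn."""
    for pattern, replacement in NORMALISATION_RULES:
        text = pattern.sub(replacement, text)
    return text


def tag_years(text: str) -> str:
    """Wrap every matched year in a TIMEX3 tag with a running tid."""
    counter = 0

    def wrap(match: re.Match) -> str:
        nonlocal counter
        counter += 1
        year = match.group(1)
        return f'<TIMEX3 tid="t{counter}" type="DATE" value="{year}">{year}</TIMEX3>'

    return YEAR.sub(wrap, text)


def extract_values(tagged: str) -> list[str]:
    """Collect the normalised values of all TIMEX3 tags in the text."""
    return TIMEX3_VALUE.findall(tagged)


sample = "im jar 1540 vnd 1632"  # invented sample, not from the SLSF texts
tagged = tag_years(normalise(sample))
print(tagged)
print(extract_values(tagged))
```

The extracted values are what would eventually be attached to database entries. Implicit and relative expressions cannot be captured by a pattern like this, which is why the project relies on a manually annotated Gold Standard and a trained system.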

The project is funded by the Swiss Law Sources Foundation and started at the beginning of 2014.