NLP for historical texts
Motivated by research questions from the humanities, automatic processing of historical texts is an emerging subfield of NLP that deals with a set of unique problems.
The main issues in the automatic processing of historical texts are:
Spelling variation
Spelling in historical texts not only differs from today's orthography but can also vary within a single text written by a single author. Unfortunately for NLP researchers, standardized orthography is a relatively recent development. As a result, most available NLP tools cannot be applied directly to historical texts.
Unavailability of large corpora
Large corpora of historical texts are practically nonexistent. Acquiring a historical corpus requires a long digitization process, and the resulting corpora are far smaller than those available for modern languages. This limits the use of statistical text processing methods, which require large corpora.
The Institute of Computational Linguistics, together with the Swiss Law Sources Foundation (SLSF), addresses the above-mentioned problems by investigating methods for extracting temporal expressions from historical texts.
Purpose: to enrich the database of historical names and places of Switzerland (based on the texts of the SLSF) with additional temporal information extracted from the source texts.
As a research institution that publishes sources of old law up to 1798, the Swiss Law Sources Foundation provided us with about 30 volumes of digitized texts in German, French, Italian, Romansh or Latin, depending on the canton of origin and the time of creation.
Methods and stages of the project
First stage: creation of a manually annotated Gold Standard
As a basis for our experiments, we created a manually annotated Gold Standard sample corpus of Early New High German. The corpus contains annotations of 800 temporal expressions of different kinds: explicit, implicit and relative. The annotations use a subset of the TimeML markup language: temporal expressions are marked with TIMEX3 tags.
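To illustrate what such annotations look like, the following minimal Python sketch reads TIMEX3 tags from an annotated string. The sample sentence, the function name extract_timex3 and the choice of attributes (tid, type, value) are assumptions made for this illustration only; they are not taken from the Gold Standard corpus.

```python
import xml.etree.ElementTree as ET

# Invented sample in the style of Early New High German (not from the corpus).
SAMPLE = (
    'Geben <TIMEX3 tid="t1" type="DATE" value="1480">anno 1480</TIMEX3>, '
    'gültig <TIMEX3 tid="t2" type="DURATION" value="P3Y">dry jar</TIMEX3> lang.'
)


def extract_timex3(annotated: str) -> list[dict]:
    """Return one dict per TIMEX3 element found in an annotated text fragment."""
    # Wrap the fragment in a dummy root so that it parses as well-formed XML.
    root = ET.fromstring(f"<doc>{annotated}</doc>")
    return [
        {
            "tid": el.get("tid"),
            "type": el.get("type"),
            "value": el.get("value"),
            "text": "".join(el.itertext()),
        }
        for el in root.iter("TIMEX3")
    ]


if __name__ == "__main__":
    for timex in extract_timex3(SAMPLE):
        print(timex)
```

Here the value attribute carries the normalised reading of the expression (a year, or a duration of three years), while the element text keeps the original historical spelling.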
Second stage: applying spelling normalisation techniques
Normalisation is a standard approach to processing historical texts. It reduces spelling variation and thus allows modern tools to be applied, e.g. the rule-based temporal tagger HeidelTime. We are evaluating various normalisation techniques to measure how much normalisation improves temporal entity extraction.
Third stage: training of an annotation-based system
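As a rough illustration of the idea, the sketch below combines a small word list with a few character rewrite rules. It is one simple possibility, not the set of techniques evaluated in the project; the lexicon entries, the rules and the function names are invented for this example.

```python
import re

# Hypothetical word list: historical spelling -> modern spelling.
LEXICON = {"vnd": "und", "dry": "drei", "jar": "Jahr"}

# Hypothetical character rules, applied only to words missing from the lexicon.
RULES = [
    (re.compile(r"^v(?=[nm])"), "u"),  # word-initial v before n/m: vnser -> unser
    (re.compile(r"th"), "t"),          # th -> t: thun -> tun
]


def normalise_token(token: str) -> str:
    """Normalise one word: lexicon lookup first, rewrite rules as fallback."""
    modern = LEXICON.get(token.lower())
    if modern is not None:
        return modern
    for pattern, replacement in RULES:
        token = pattern.sub(replacement, token)
    return token


def normalise(text: str) -> str:
    """Normalise every word in the text, leaving punctuation and spacing as is."""
    return re.sub(r"\w+", lambda m: normalise_token(m.group(0)), text)


if __name__ == "__main__":
    print(normalise("vnd dry jar"))
    print(normalise("vnser thun"))
```

The normalised output can then be passed to a tool built for modern text. Because the original spelling is lost in this step, a real pipeline would also need to keep an alignment between original and normalised tokens.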
We will use our Gold Standard corpus to train a machine learning system for automatic annotation.
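The learning method is not specified here. One common way to prepare TIMEX3 annotations for a sequence labelling system is to convert them into token-level BIO labels, as in the following sketch. The sample sentence, the whitespace tokenisation and the function name to_bio are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def to_bio(annotated: str) -> list[tuple[str, str]]:
    """Convert a TIMEX3-annotated fragment into (token, BIO label) pairs.

    Assumes that TIMEX3 elements are not nested and that splitting on
    whitespace is an acceptable tokenisation for the sketch.
    """
    root = ET.fromstring(f"<doc>{annotated}</doc>")
    # Text before the first TIMEX3 element is outside any expression.
    pairs = [(tok, "O") for tok in (root.text or "").split()]
    for el in root:
        kind = el.get("type", "DATE")
        for i, tok in enumerate("".join(el.itertext()).split()):
            prefix = "B-" if i == 0 else "I-"
            pairs.append((tok, prefix + kind))
        # Text following the element, up to the next element, is outside again.
        pairs.extend((tok, "O") for tok in (el.tail or "").split())
    return pairs


if __name__ == "__main__":
    # Invented sample sentence, not taken from the corpus.
    sample = 'Geben <TIMEX3 tid="t1" type="DATE" value="1480">anno 1480</TIMEX3> zu Bern'
    for token, label in to_bio(sample):
        print(token, label)
```

Pairs of this form can serve as training input for most sequence labellers, which then learn to predict the B-, I- and O labels for unseen text.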
Fourth stage: temporal entity extraction and database enrichment
Once the whole corpus of SLSF historical texts has been annotated, we will extract the temporal information and add it to the existing entries in our database.
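The enrichment step could look roughly like the following sketch, which uses an in-memory SQLite database. The schema (tables entries and entry_dates), the matching of entries by name and the function names are purely hypothetical; the actual database structure is not described here.

```python
import sqlite3


def demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE entries (id INTEGER PRIMARY KEY, name TEXT, kind TEXT);
        CREATE TABLE entry_dates (
            entry_id INTEGER REFERENCES entries(id),
            timex_value TEXT,
            source TEXT
        );
        INSERT INTO entries (name, kind) VALUES ('Bern', 'place');
        """
    )
    return conn


def enrich(conn: sqlite3.Connection, name: str, timex_value: str, source: str) -> bool:
    """Attach a temporal value to an existing entry; return False if no entry matches."""
    row = conn.execute("SELECT id FROM entries WHERE name = ?", (name,)).fetchone()
    if row is None:
        return False  # only existing entries are enriched, none are created
    conn.execute(
        "INSERT INTO entry_dates (entry_id, timex_value, source) VALUES (?, ?, ?)",
        (row[0], timex_value, source),
    )
    conn.commit()
    return True


if __name__ == "__main__":
    conn = demo_db()
    enrich(conn, "Bern", "1480", "anno 1480")
    print(conn.execute("SELECT * FROM entry_dates").fetchall())
```

Storing the dates in a separate table lets one entry carry several attested dates, each linked to the source expression it was extracted from.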