Natural Language Processing for Cultural Heritage Texts (Dr.-Ing. Michael Piotrowski)
Summarization and Generation: From “Extracting” to “Abstracting” (Prof. Dr. Manfred Stede)
Connectionist Modelling of Language and Cognitive Processes (Prof. Dr. Gert Westermann)
Introduction to Tree Adjoining Grammar (Wolfgang Maier M.A., Timm Lichte)
Contents of the course
Many large-scale digitization projects are currently under way that aim to preserve the cultural heritage contained in paper documents (in particular books) and to make it available on the Web. To make historical documents accessible online, however, it is not sufficient to scan them and run them through off-the-shelf OCR software. Furthermore, the specific linguistic properties of historical texts, such as non-standardized spelling, require special NLP methods and resources.
Natural language processing for cultural heritage data is therefore an expanding area of research, as is also demonstrated by the series of LaTeCH workshops taking place since 2007. What is more, the processing of historical texts shares a number of challenges with the processing of certain genres of very recent texts, such as Twitter, SMS, and chat messages or newsgroup and forum postings, including non-standard orthography and grammar and the profuse use of abbreviations. It has also become clear that using NLP on genres beyond the usual newspaper and newswire texts requires adapting NLP methods and tools to the domain of the texts.
The course will give an introduction to the use of NLP for historical texts. It will present examples of cultural heritage texts and highlight the differences to modern texts and the implications for NLP. The course will outline methods for the acquisition of cultural heritage texts. We will then discuss three important areas of research for historical texts:
- Methods and tools for the processing of noisy data: “Noise” in historical documents can come from different sources. Original text may be “noisy,” e.g., when it uses historical, non-standardized orthography, texts may mix different languages, e.g., Latin and a vernacular language, or noise may have been introduced by OCR or manual keying. The course will outline the effects of noise on applications, such as information retrieval, and present approaches to detecting, correcting, normalizing, or processing noise in texts.
- Domain adaptation: Cultural heritage texts usually differ in various respects (e.g., subject, genre, language variety) from the texts normally used to create language resources and tools. For example, the performance of a parser trained on the Wall Street Journal is lower when used on literary texts rather than news text. This means that NLP tools must be adapted to each domain to which they are to be applied. Cultural heritage texts obviously form no single domain, so the degree and form of adaptation necessary varies. The course will give an overview of current approaches to domain adaptation, specifically in relation to cultural heritage texts.
- Text mining of historical texts: Cultural heritage texts are not only digitized to preserve them but especially to make them more accessible, both for the general public and to researchers. For both groups, persons, place names, and dates are of particular interest. The identification of named entities in historical texts is, however, more difficult than in, say, current newspaper texts, since naming is often very inconsistent and ambiguous. The course will discuss approaches to mining historical texts for persons, place names, and dates. It will present examples of applications that enable new ways of accessing cultural heritage texts based on this information, e.g., spatial browsing.
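To make the normalization of historical spellings discussed above concrete, here is a minimal sketch in Python: it maps a historical word form to the closest modern form by fuzzy string matching against a modern-form lexicon. The toy lexicon, the cutoff value, and the example spellings are illustrative assumptions only, not part of any system presented in the course.

```python
import difflib

# Toy modern-form lexicon; a real system would use a full dictionary
# plus learned historical-to-modern rewrite rules.
MODERN_LEXICON = ["und", "über", "jahr", "stadt", "herr"]

def normalize(token, lexicon=MODERN_LEXICON, cutoff=0.6):
    """Map a historical spelling to its closest modern form, if any.

    Uses difflib's similarity ratio; tokens without a sufficiently
    close modern candidate are returned unchanged.
    """
    matches = difflib.get_close_matches(token.lower(), lexicon,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else token

# "vnd" and "jar" are common early modern German spellings
print(normalize("vnd"))   # historical u/v interchange
print(normalize("jar"))   # missing <h>
```

A dictionary-lookup step of this kind is only a baseline; it cannot distinguish a historical variant from a genuinely different word, which is why the course also covers context-sensitive approaches.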
More and more historical texts are becoming available through large-scale digitization projects; digital text is thus increasingly available for research in the humanities and the social sciences. However, to actually access the information contained in these texts, natural language processing methods and tools suitable for historical texts are required. As there is currently no textbook specifically covering NLP for cultural heritage texts, the course aims to give an introduction to the field and an overview of the state of the art. As noted above, many challenges in NLP for historical texts also arise in other evolving NLP applications, e.g., in the analysis of social media texts, so that the methods and techniques required for the effective processing of historical texts will also be of interest to students in other areas.
- Piotrowski, M. (2010). From Law Sources to Language Resources. In C. Sporleder, K. Zervanou (eds.), Proceedings of the ECAI 2010 Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH 2010), pp. 67–71.
- Proceedings of LaTeCH (Language Technology and Resources for Cultural Heritage)
- Proceedings of AND (Workshop on Analytics for Noisy Unstructured Text Data)
- Piotrowski, M. (2010). Leveraging back-of-the-book indices to enable spatial browsing of a historical document collection. In R. Purves, P. Clough, and C. Jones (eds.), Proceedings of the 6th Workshop on Geographic Information Retrieval, New York, NY, USA, pp. 89–90. ACM Press.
- Piotrowski, M., S. Läubli, and M. Volk (2010). Towards mapping of alpine route descriptions. In R. Purves, P. Clough, and C. Jones (eds.), Proceedings of the 6th Workshop on Geographic Information Retrieval, New York, NY, USA, pp. 15–16. ACM Press.
- International Journal on Document Analysis and Recognition, Special Issue on Noisy Text Analytics, Vol. 12, No. 3, September 2009. Guest Editors: D. Lopresti, S. Roy, K. Schulz, L. V. Subramaniam.
General familiarity with NLP methods and tools.
Requirements for ECTS Credit Points
- Attendance of all course sessions, active participation and oral contributions to the discussions in the course = 1 CP.
- During the course you will be given daily reading assignments with accompanying questions. You are expected to read the texts and to be able to answer the questions in class. One assignment will instead require you to write a short essay (2 pages ACM style) = 2 CPs.
For the remaining credit point, there are two options:
- EITHER: Register for and attend the Second Workshop on Systems and Frameworks for Computational Morphology (SFCM 2011), which takes place on August 26 in Zurich, and write a short workshop report (about 2 to 3 pages ACM style), to be handed in 2 weeks after the end of the Fall School = 1 CP.
- OR: After the Fall School: Completion of a short paper (4 pages ACM style) focusing on one aspect of the course, to be handed in 4 weeks after the end of the Fall School = 1 CP.
The vast majority of approaches to automatic text summarization follow the "sentence-extraction" paradigm, where robust statistical methods are used to identify the most relevant sentences of the text, which are then extracted and taken to constitute the summary. In the first phase of the course, we present the central methods and look at practical implementations. Also, we discuss the problem of evaluating the quality of summaries.
Then, we identify the shortcomings of sentence extraction and turn to the alternative approach of "abstraction", that is, the actual synthesis of an abstract with sentences that need not appear in this form in the source text. We view this as a special case of natural language generation (NLG), where text is produced not from pre-verbal representations (as in standard NLG) but from other texts; this scenario is known as "text-to-text" generation. We discuss the various subproblems involved, including sentence compression, paraphrasing, and simplification. (These are also relevant for applications beyond summarization.) At the end of the course, we weigh the pros and cons of extraction versus abstraction in terms of robustness, language and genre dependency, and quality of summaries.
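A toy example of sentence compression by deletion: strip parentheticals and a handful of hedging adverbs. Real compressors (e.g., Clarke 06) select syntactic constituents to delete with a trained model; the regular expression and the word list here are illustrative assumptions only.

```python
import re

# Hypothetical list of deletable modifiers; a real system would decide
# deletions over a parse tree, not over a fixed word list.
REMOVABLE = {"very", "really", "quite", "basically", "actually"}

def compress(sentence):
    """Crude compression by deletion: drop parentheticals and
    a few intensifying adverbs."""
    s = re.sub(r"\s*\([^)]*\)", "", sentence)   # remove (...) material
    tokens = [t for t in s.split()
              if t.lower().strip(".,;") not in REMOVABLE]
    return " ".join(tokens)

print(compress("The parser (trained on news) is really quite robust."))
```

Even this crude deletion scheme shows the central risk of compression: without syntactic guidance, nothing prevents the removal of material that is grammatically or semantically obligatory.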
The course breaks into six thematic blocks, which are sketched below. At various points, we will look into existing implementations of summarizers, evaluation modules, sentence compressors, etc.; thus participants will get acquainted both with the methodologies and with available software.
- Introduction and Overview
- how do human summarizers work?
- types of summaries: indicative / informative / critical-evaluative; generic / query-focused
- the role of genre and document structure
- extraction versus abstraction
- Statistical extraction: The basic approach
- first ideas by Luhn 58, Edmundson 65
- measurements for term and sentence relevance: Bieler/Dipper 08, etc.
- potential problems of extracts
- Multi-document summarization
- detecting similarity, avoiding redundancy: early approaches (Mani/Bloedorn 99, Barzilay et al. 99, etc.)
- the sentence ordering problem: Barzilay et al. 02
- graph-based methods for MDS: Erkan/Radev 04, etc.
- using ontologies for topic-based extraction: Hennig et al. 08, etc.
- Evaluating summaries
- features of good summaries
- measures for automatic evaluation: ROUGE (Lin/Hovy 03), Pyramid (Nenkova/Passonneau 04), BE (Hovy et al. 05)
- Beyond extraction
- text generation
- text-to-text generation: case studies (film subtitling, etc)
- sentence compression: deletion / generalization / aggregation (Clarke 06, etc.)
- Toward abstraction
- discourse structure for summarization (Marcu 00)
- multi-level-annotated text as the basis for automatic abstracting
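Of the evaluation measures in the outline above, ROUGE-1 recall is simple enough to sketch in a few lines: it is the (clipped) fraction of reference-summary unigrams that also occur in the candidate summary. Plain whitespace tokenization is a simplification here; the official toolkit does considerably more preprocessing.

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    """ROUGE-1 recall (cf. Lin/Hovy 03): overlap of unigrams between
    candidate and reference, with counts clipped to the reference,
    divided by the total number of reference unigrams."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / sum(ref.values())

print(rouge1_recall("the cat sat", "the cat sat on the mat"))
```

Because it rewards unigram overlap only, ROUGE-1 favors extracts over genuine abstracts, which is one reason the course treats evaluation as an open problem rather than a solved one.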
Requirements for ECTS Credit Points
- Attendance of all course sessions, active participation and oral contributions to the discussions in the course = 1 CP
- Daily assignments: either an exercise (hands-on work with texts and with summarization algorithms) or reading for the next class = 2 CPs
- After the Fall School: Completion of a short paper (4 pages ACM style) focusing on one aspect of the course, to be handed in 4 weeks after the end of the Fall School = 1 CP
This course will provide an introduction to connectionist (artificial neural network) modelling and the application of this approach to developing theories and explanations in language processing. No prior knowledge of computational modelling is required. We will cover the conceptual underpinnings of the modelling approach in general and discuss the function of specific models and the insights they can provide into human cognitive and language processing. Topics include: General Introduction to Modelling and Connectionist Models, Memory, Development of Speech Sounds, Word Learning and Lexical Development, Reading, Inflection Processing, Sentence Processing.
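As a small taste of the connectionist approach, here is a minimal perceptron, the simplest artificial neural network unit, trained with error-driven weight updates on logical OR. This toy is purely illustrative of the learning principle and is not one of the models discussed in the course.

```python
def train_perceptron(samples, epochs=20, lr=0.1):
    """Train a single threshold unit with the classic perceptron rule:
    after each example, nudge weights in proportion to the error."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            out = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = target - out            # 0 if correct, +/-1 otherwise
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b

# Logical OR is linearly separable, so the perceptron rule converges.
or_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w, b = train_perceptron(or_data)

def predict(x1, x2):
    return 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
```

The interesting cognitive questions arise with multi-layer networks and distributed representations, where learned behaviour can mirror developmental trajectories; the single unit above merely shows the weight-update mechanism they all build on.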
Grammar theories model the properties of well-formed strings (and, ex negativo, of ill-formed strings) of natural language. They make use of grammar formalisms (GB, MG, LFG, HPSG, ...), and ideally they are implemented in NLP applications. This course will provide an in-depth introduction to Tree Adjoining Grammar (TAG), a grammar formalism that offers an attractive trade-off between generative capacity and computational complexity. The first week highlights the formal machinery of TAG, TAG parsing, and the syntax-semantics interface. In the second week, we will turn to grammar engineering, first by looking into TAG analyses of several phenomena of English and German, and then by showing ways to build up and maintain wide-coverage grammars based on TAG.
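To make TAG's characteristic adjunction operation concrete, here is a toy sketch: elementary trees are nested (label, children) pairs, and an auxiliary tree, whose foot node is marked with a trailing '*', is spliced into an initial tree at a matching node. The tree shapes and labels are illustrative assumptions; real TAG implementations additionally handle adjunction constraints, feature structures, and multiple adjunction sites.

```python
# A node is a pair (label, [children]); leaves have an empty child list.

def adjoin(tree, aux, target_label):
    """Adjoin auxiliary tree `aux` at the first node of `tree` whose
    label equals `target_label` (simplified: a single adjunction site)."""
    label, children = tree
    if label == target_label:
        # Excise the subtree and plug it in at the foot of `aux`.
        return _plug_foot(aux, tree)
    return (label, [adjoin(c, aux, target_label) for c in children])

def _plug_foot(aux, subtree):
    """Replace the foot node (label + '*') of `aux` by `subtree`."""
    label, children = aux
    if label == subtree[0] + "*":
        return subtree
    return (label, [_plug_foot(c, subtree) for c in children])

def yield_of(tree):
    """Left-to-right sequence of leaf labels (the derived string)."""
    label, children = tree
    return [label] if not children else [w for c in children for w in yield_of(c)]

# Initial tree for "John sleeps" and auxiliary tree for "often":
initial = ("S", [("NP", [("John", [])]),
                 ("VP", [("V", [("sleeps", [])])])])
aux = ("VP", [("Adv", [("often", [])]), ("VP*", [])])

derived = adjoin(initial, aux, "VP")
print(yield_of(derived))
```

Adjunction is what lets TAG insert recursive modifiers into the middle of an elementary tree, and it is the source of TAG's mildly context-sensitive generative capacity discussed in the first week.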