

Institut für Computerlinguistik

Colloquium in Computational Linguistics, HS 2011

Every two weeks from 10:15 to 12:00 in room BIN 2.A.10.





Rinaldi, Hess, Volk

Organisation - IE intro - SASEBio


Massimiliano Ciaramita (Google)

Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition


Simon Clematide,
Lenz Furrer

Fraktur OCR: Project retrospective


Gintarė Grigonytė,
Anne Göhring, Annette Rios

NLP for assisting the building and evaluation of domain ontologies,
Spanish-Quechua Alignment


Dietrich Rebholz-Schuhmann

Biomedical terminology at EBI


Michael Strube

Coreference Resolution via Hypergraph Partitioning in the News and the Medical Domain


Maya Bangerter,
Mark Fishel

Analysis of geographic references in the Text+Berg corpus,
The SUMAT project


Fabio Rinaldi, SASEBio
After a brief introduction to the field of Information Extraction, I present the results achieved in the first year of the SASEBio project (Semi-Automated Semantic Enrichment of the Literature), in particular the participation in several text mining competitions (BioCreative III, BioNLP 2011, CALBC). I then focus on a more recent experiment on assisted curation in collaboration with the PharmGKB group at Stanford University.

M. Ciaramita, Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition
We use search engine results to address a particularly difficult domain adaptation problem, the adaptation of named entity recognition (NER) from news text to web queries. The key novelty of the method is that we submit a token with context to a search engine and use similar contexts in the search results as additional context for correctly disambiguating each token. We achieve strong gains in NER performance in-domain and out-of-domain.
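The core idea can be sketched as follows; `search_fn` is a stand-in for a real web search API (stubbed here with fixed snippets), and the feature names are purely illustrative, not the paper's actual feature set:

```python
from collections import Counter

def piggyback_features(token, context, search_fn, window=2):
    """Turn search-result snippets into extra bag-of-words features
    for disambiguating `token` in its local `context`."""
    query = " ".join(context + [token])
    features = Counter()
    for snippet in search_fn(query):
        words = snippet.lower().split()
        for i, w in enumerate(words):
            if w == token.lower():
                lo, hi = max(0, i - window), i + window + 1
                for neighbor in words[lo:i] + words[i + 1:hi]:
                    features["web=" + neighbor] += 1
    return features

def fake_search(query):
    # Stand-in for a real search API call.
    return ["book a flight to paris today",
            "paris is the capital of france"]

feats = piggyback_features("paris", ["flight", "to"], fake_search)
```

The resulting features would be concatenated with the NER model's ordinary local features.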

S. Clematide, Ranking interactions for a curation task
Among the key pieces of information that biomedical text mining systems are expected to extract from the literature are interactions among different types of biomedical entities (proteins, genes, diseases, drugs, etc.). Different types of entities might be considered; for example, protein-protein interactions have been extensively studied as part of the BioCreative competitive evaluations. However, more complex interactions, such as those among genes, drugs, and diseases, are increasingly of interest. We describe a machine-learning-based reranking approach for candidate interactions extracted from the literature. The results are evaluated using data derived from the PharmGKB database. The importance of a good ranking is particularly evident when the results are used to support human curators.
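A minimal sketch of such a reranker, with hand-set weights and toy features standing in for a trained model (all names illustrative, not the actual SASEBio implementation):

```python
def rerank(candidates, weights):
    """Sort candidate interactions by a linear score over their features."""
    def score(feats):
        return sum(weights.get(name, 0.0) * value for name, value in feats.items())
    return sorted(candidates, key=lambda cand: score(cand[1]), reverse=True)

# Hand-set weights and toy features; a real system would learn these.
weights = {"in_title": 2.0, "mention_count": 0.5, "negation_nearby": -1.5}
candidates = [
    (("geneA", "drugX"), {"mention_count": 1, "negation_nearby": 1}),
    (("geneB", "drugY"), {"in_title": 1, "mention_count": 3}),
]
ranked = rerank(candidates, weights)
```

A curator would then inspect the candidates from the top of the list downwards, which is why ranking quality matters more than raw extraction recall here.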

Lenz Furrer, Fraktur OCR: Project retrospective
Since the RRB-Fraktur project was already presented in FS 11, I would now like to offer some retrospective reflections after its completion. The main focus is on methods for correcting OCR errors: what was successful, what less so, and what else could have been tried with more time? In addition, I will present the prototype of a wiki application that can be used as a web-based correction system.
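One common family of OCR post-correction methods applies known character confusions (e.g. the Fraktur long s being read as "f") and validates candidates against a lexicon. A minimal sketch, not the project's actual system:

```python
def correct(word, lexicon, confusions):
    """Return a lexicon-validated correction of an OCR'd word obtained by
    substituting one character via a known confusion pair, else the word."""
    if word in lexicon:
        return word
    for i, ch in enumerate(word):
        for wrong, right in confusions:
            if ch == wrong:
                candidate = word[:i] + right + word[i + 1:]
                if candidate in lexicon:
                    return candidate
    return word

# Typical Fraktur confusions: long s read as 'f', 'u'/'n' swapped.
lexicon = {"sache", "haus"}
confusions = [("f", "s"), ("u", "n")]
fixed = correct("fache", lexicon, confusions)  # -> "sache"
```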

G. Grigonytė, NLP for assisting the building and evaluation of domain ontologies
An ontology is a knowledge representation structure made up of concepts and their interrelations. It typically captures a shared understanding of a particular domain. Ontology building can be addressed from the perspective of natural language processing, and many methods and approaches can be combined to automate this complex task. In this talk I will cover in particular: a) AUTOTERM, a linguistically oriented approach to domain terminology extraction; b) an unsupervised approach to discovering synonymy (and other semantic relationships) by aligning paraphrases in monolingual domain corpora; and c) the evaluation of domain ontologies based on domain terminologies. The talk summarizes my PhD project; the proposed methodology has been evaluated experimentally on two different domains: computer security and cancer research.

Anne Göhring, Annette Rios, Spanish-Quechua Alignment
Parallel treebanking is greatly facilitated by automatic word alignment. We work on building a trilingual treebank for German, Spanish and Quechua. We ran different alignment experiments on parallel Spanish-Quechua texts, measured the alignment quality, and compared these results to the figures we obtained aligning the corresponding Spanish-German texts. This preliminary work has shown us the best word segmentation to use for the agglutinative language Quechua with respect to alignment. We also acquired a first impression about how well Quechua can be aligned to Spanish, an important prerequisite for bilingual lexicon extraction, parallel treebanking or statistical machine translation.
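Alignment quality in such experiments is commonly measured with the Alignment Error Rate (AER) of Och and Ney, computed against gold-standard sure and possible links; a small sketch with invented toy links:

```python
def aer(hypothesis, sure, possible):
    """Alignment Error Rate: lower is better.
    All arguments are sets of (source_index, target_index) links;
    sure links are treated as a subset of the possible links."""
    a, s = set(hypothesis), set(sure)
    p = set(possible) | s
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))

sure = {(0, 0), (1, 2)}
possible = {(0, 0), (1, 2), (2, 1)}
hypothesis = {(0, 0), (1, 2), (2, 3)}
error = aer(hypothesis, sure, possible)  # 1 - (2 + 2) / (3 + 2) = 0.2
```

For an agglutinative language like Quechua, the choice of word vs. morpheme segmentation changes which (source_index, target_index) pairs even exist, which is why segmentation matters for this metric.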

D. Rebholz-Schuhmann, Semantic Interoperability between literature and data resources: from genes to diseases
The biomedical research community stores research results on biological entities in large-scale databases for experimental data analysis and, in addition, produces semantic resources such as ontologies for the annotation of these entities. For automatic literature analysis, we exploit these resources to improve information retrieval, information extraction, ontology development and the discovery of new knowledge. As a result, the literature research team at the European Bioinformatics Institute has established state-of-the-art solutions for high-throughput analyses of the literature. Current developments lead towards semantic interoperability through standardised data resources, e.g. large-scale annotated corpora, state-of-the-art terminological resources and linked open data. The presentation will focus on the benefits (and limitations) of existing ontological resources and will describe several solutions showing how conceptual knowledge can be successfully exploited to extract relevant information from the scientific literature. It will also discuss ongoing efforts to deliver gene-disease associations from different resources by exploiting semantic interoperability.

M. Strube, Coreference Resolution via Hypergraph Partitioning in the News and the Medical Domain
We describe a novel approach to coreference resolution which implements a global decision via hypergraph partitioning. In contrast to almost all previous approaches, we do not rely on separate classification and clustering steps, but perform coreference resolution globally in one step. Our hypergraph-based global model implemented within an end-to-end coreference resolution system outperforms strong baselines.
The presentation will begin with an overview of previous machine-learning-based approaches, whose shortcomings motivate our proposal. We show that our system can be ported easily to new domains: we participated successfully in the CoNLL-2011 shared task (coreference resolution in the news domain) as well as in the i2b2-2011 shared task (coreference resolution in the medical domain). We conclude with an outlook on future work and our grand vision of integrating several components into a larger discourse processing system.
Publications relevant for this presentation:
Jie Cai and Michael Strube (2010). End-to-end coreference resolution via hypergraph partitioning. In Proc. of COLING '10, pp. 143-151.
Jie Cai, Eva Mujdricza-Maydt and Michael Strube (2011). Unrestricted coreference resolution via global hypergraph partitioning. In Proc. of the CoNLL Shared Task '11, pp. 56-60.
Jie Cai, Eva Mujdricza-Maydt, Yufang Hou, and Michael Strube (2011). Weakly supervised graph-based coreference resolution for clinical data. In Proc. of the i2b2 '11 Shared Tasks and Workshop on Challenges in Natural Language Processing for Clinical Data. To appear.
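The hypergraph representation can be illustrated with a toy sketch: mentions sharing a relational feature (here only the head word) form a hyperedge. For brevity, a trivial connected-components pass stands in for the recursive spectral partitioning used in the actual system; everything below is illustrative, not the authors' code:

```python
def build_hyperedges(mentions, feature_fns):
    """One hyperedge per shared feature value (e.g. identical head word)."""
    edges = []
    for name, fn in feature_fns.items():
        buckets = {}
        for i, mention in enumerate(mentions):
            buckets.setdefault((name, fn(mention)), []).append(i)
        edges += [frozenset(b) for b in buckets.values() if len(b) > 1]
    return edges

def partition(n, edges):
    """Toy stand-in for partitioning: connected components via union-find."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for edge in edges:
        members = sorted(edge)
        for a, b in zip(members, members[1:]):
            parent[find(a)] = find(b)
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

mentions = ["Obama", "the president", "Obama", "Hillary Clinton"]
features = {"head": lambda m: m.split()[-1].lower()}
entities = partition(len(mentions), build_hyperedges(mentions, features))
```

The point of the hypergraph view is that one global partition of this structure yields the entities directly, with no separate pairwise classification and clustering stages.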

M. Bangerter, Analysis of geographic references in the Text+Berg corpus
For the volumes from 1957 onwards, the Text+Berg corpus contains German-French parallel texts. In my project I take advantage of this fact and investigate how the analysis of geographic references in the Text+Berg corpus can be improved by drawing on the parallel texts. The analysis of toponyms requires three steps:
1. recognition of possible name candidates,
2. disambiguation of the candidates and classification of the toponyms,
3. grounding of the toponyms.
First experiments have shown that using the parallel texts is helpful for the disambiguation step.
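A minimal sketch of how a parallel sentence can support the disambiguation step, using a hypothetical two-entry gazetteer (all names and coordinates are illustrative, not from the project):

```python
# Hypothetical mini-gazetteer: surface form -> candidate referents,
# each with its French name and approximate coordinates.
GAZETTEER = {
    "Freiburg": [
        {"fr": "Fribourg", "coord": (46.80, 7.15)},             # Fribourg, CH
        {"fr": "Fribourg-en-Brisgau", "coord": (47.99, 7.85)},  # Freiburg, DE
    ],
}

def ground(de_sentence, fr_sentence):
    """(1) recognise candidates by gazetteer lookup, (2) disambiguate via the
    French parallel sentence, (3) return the grounded coordinates."""
    fr_tokens = set(fr_sentence.split())
    results = {}
    for token in de_sentence.split():
        for referent in GAZETTEER.get(token, []):
            if referent["fr"] in fr_tokens:
                results[token] = referent["coord"]
                break
    return results

grounded = ground("Die Tour begann in Freiburg",
                  "Le tour a commencé à Fribourg")
```

The French rendering of the place name rules out one of the two readings, which is exactly the kind of signal the parallel texts contribute.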

Mark Fishel, The SUMAT project
We will present the recent developments of the SUMAT project, which aims at developing an online service of automatic translation of subtitles. In the first phase of the project the main challenges were collecting the parallel data for training the translation systems, normalizing and pre-processing it, and training the initial baseline machine translation systems.
