

Institut für Computerlinguistik

Colloquium in Computational Linguistics, FS 2014

Exploiting literature data
(plus reports on current research at the institute)

Time/venue: Roughly every two weeks, on Tuesdays from 10:15 to 12:00 in room BIN 2.A.10.

Lecturers: Dietrich Rebholz-Schuhmann, Michael Hess, Martin Volk


Dietrich Rebholz-Schuhmann: How golden can a silver standard corpus be? The answers from the Mantra project.

The Mantra project addressed solutions to improve entity recognition (ER) in parallel multilingual document collections. Large sets of documents in different languages, i.e. Medline titles, EMEA drug label documents and patent claims, have been prepared to enable ER in parallel documents. Each set of documents forms a corpus-language pair (CLP), and the number of documents per CLP varies from about 120,000 for patents up to 760,000 for Medline abstract titles. The documents (in different languages) have been processed with annotation solutions, and the annotations have been used to generate silver standard corpora (SSCs). With the help of a gold standard corpus (GSC, in several languages), the SSC generation has been optimised to achieve the best possible results. The gap between the SSCs and the GSC will form the core of this presentation.
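A silver standard corpus is typically built by combining the output of several independent annotation systems. As a minimal sketch of that idea (the actual Mantra harmonisation procedure is considerably more elaborate, and the spans and types below are invented for illustration), entity annotations could be merged by majority voting:

```python
from collections import Counter

def silver_standard(annotations_per_system, threshold=2):
    """Merge entity annotations from several systems by voting.

    annotations_per_system: list of sets of (start, end, entity_type)
    spans, one set per annotation system. A span enters the silver
    standard if at least `threshold` systems produced it exactly.
    """
    votes = Counter()
    for spans in annotations_per_system:
        votes.update(spans)
    return {span for span, n in votes.items() if n >= threshold}

# Three hypothetical annotators on one Medline title
sys_a = {(0, 7, "DISO"), (20, 29, "CHEM")}
sys_b = {(0, 7, "DISO"), (33, 40, "GENE")}
sys_c = {(0, 7, "DISO"), (20, 29, "CHEM")}

ssc = silver_standard([sys_a, sys_b, sys_c], threshold=2)
```

Raising the threshold trades recall for precision, which is exactly the knob a GSC lets one tune.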

Short bio

Dietrich Rebholz-Schuhmann is a medical doctor and computer scientist with a PhD in immunology. He headed a research team at LION Bioscience AG (Heidelberg, 1998-2003) and at the European Bioinformatics Institute, Hinxton (UK, 2003-2012). At the Department of Computational Linguistics he coordinates the EU MANTRA project (2012-2014).

Don Tuggener: A Hybrid Entity-Mention Pronoun Resolution Model for German Using Markov Logic Networks

Don will present a hybrid pronoun resolution system for German. It uses a rule-based entity-mention formalism to incrementally process discourse entities. Antecedent selection is performed with Markov Logic Networks (MLNs). The hybrid architecture yields a neat problem formulation in the MLNs with efficient inference complexity while retaining their full expressiveness. The system is compared to a rule-driven baseline and to an extension that uses a memory-based learner; the MLN hybrid significantly outperforms both.

Simone Teufel: Discourse Structure, Argumentation and Citations for Detecting Emerging Ideas -- example: RNAi

Simone will demonstrate how the recognition of rhetorical structure and argumentation in scientific articles is a useful and achievable task, one that could potentially advance text understanding and that can be exploited in many applications. She describes Argumentative Zoning, a method of shallowly structuring a text into zones according to speech-act type status, and explains the theory behind it. The FUSE project, which aims to detect new, emerging ideas in entire scientific disciplines, serves as an example. Simone will describe the natural-language-based features - in particular, the rhetorical ones - that help in the prediction task. The talk concludes with current work on tracing citations in a newly created corpus on the topic of RNAi. The idea here is to quantify and correlate a cited work's impact and status with various factors related to the rhetorical structure of the citing text, such as where in the text it is cited, and exactly how.

Martin Wettstein: Halbautomatische Inhaltsanalyse (semi-automatic content analysis)

Martin holds a major in communication science and a minor in computational linguistics. He is currently pursuing his PhD at the IPMZ (Institut für Publizistikwissenschaft und Medienforschung). His talk focuses on "Angrist 1.2", a solution (in Python) that produces query forms for relational data input in content analysis. It is especially useful for the application of hierarchical codebooks, as it allows data entry at different levels of the analysis without increasing the cognitive load on coders.

Nadine Stamm: Improving the recognition of profession titles in the Text+Berg project’s NER-code

Nadine will give an introduction to how instances of named entities (e.g. persons, titles) are represented in the Text+Berg corpus and how her solution for improved NER works. An evaluation will be presented as well.

Sampo Pyysalo: Biomedical event extraction and its applications

Automatic methods for the analysis of biomedical texts have matured considerably over the last 15 years. As tools for basic tasks have become established, the focus of efforts in domain information extraction has turned toward new challenges, such as detailed, ontology-based recognition and normalization of physical entity mentions and complex processes involving multiple entities in a variety of roles.
This talk will present these and related trends in the context of the BioNLP Shared Task (BioNLP ST) series of events, focusing on the Cancer Genetics and Pathway Curation event extraction tasks of BioNLP ST 2013. Following an introduction to the event extraction task setting and representation, I will discuss the state of the art in extraction methods and present manually annotated resources, available tools, and databases of analysis results. Current applications of the extraction technology such as semantic search and curation support tools will be introduced, with emphasis on remaining opportunities. Future directions for event extraction will be discussed with the theme of "scaling up": from few specific entity and event types to hundreds; from the molecular scale to higher levels of biological organization; and from small challenge datasets to analyses of millions of documents and databases encompassing the entire available domain literature.

Dr Sampo Pyysalo has been working on the development of resources and methods for biomedical information extraction with particular focus on supervised machine learning approaches, structured knowledge representations, and large-scale text mining. He has initiated and participated in the design and development of several annotated biomedical corpora, including BioNLP Shared Task and GENIA resources, leads the development of the open-source annotation tool BRAT, and has contributed to the creation of automatic structured analyses spanning the entire available biomedical literature through the development and deployment of text mining tools at the University of Tokyo, the UK National Centre for Text Mining, and the University of Turku. He has been a co-organizer of conferences, workshops, and challenges in this domain.

Annette Rios: Hybrid Machine Translation from Spanish to Quechua

The term hybrid machine translation refers to any combination of statistical MT with rule-based MT or example-based MT, or a mixture of all three approaches. In this talk, a hybrid MT system for the language pair Spanish-Cuzco Quechua will be presented. The core of the system is a classical, rule-based pipeline. However, as not all ambiguities can be resolved efficiently by rules, the system relies on statistical models for certain tasks.

Tilia Ellendorf: Using Databases for Information Extraction in the Biomedical Literature

Tilia presents the results of her ongoing work on extracting biomedical entities, and the relations between them, from the scientific literature. In addition to extracting chemical entities from the text, she now focuses on the identification of genes (and their protein products). The entities are extracted based on a database of interactions between chemicals and genes (CTD, the Comparative Toxicogenomics Database) and with the help of scientific databases for proteins (UniProt, EntrezGene) and other semantic resources, such as ontologies (ChEBI, Chemical Entities of Biological Interest).
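Database-driven entity extraction often starts from a plain dictionary lookup: surface forms harvested from resources such as CTD or UniProt are matched against the text. The lexicon entries below are made up for illustration, and a real pipeline adds normalisation, abbreviation expansion and disambiguation; this is only a minimal sketch of the lookup step:

```python
import re

def dictionary_tagger(text, lexicon):
    """Tag text spans that match database-derived entity names.

    lexicon: dict mapping a surface form to an entity type such as
    "CHEM" or "GENE". Longer names are tried first so that the
    longest match wins; overlapping shorter matches are discarded.
    """
    hits = []
    for name, etype in sorted(lexicon.items(), key=lambda kv: -len(kv[0])):
        for m in re.finditer(re.escape(name), text, flags=re.IGNORECASE):
            start, end = m.start(), m.end()
            # skip candidates overlapping an already accepted span
            if not any(s < end and start < e for s, e, _ in hits):
                hits.append((start, end, etype))
    return sorted(hits)

# Hypothetical miniature lexicon and sentence
lexicon = {"aspirin": "CHEM", "PTGS2": "GENE"}
entities = dictionary_tagger("Aspirin inhibits PTGS2 expression.", lexicon)
```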

Johannes Graen: Multi-layer parallel corpora for linguistic research

Johannes holds a master's degree in computational linguistics from the University of Zurich and is currently pursuing a PhD at the Institute of Computational Linguistics of the University of Zurich.
The talk will give an introduction to previous efforts to create a multi-parallel corpus database for storing, combining and querying several layers of annotations and alignments in order to answer linguistic questions empirically. Emphasis will be put on the cleaning and turn alignment of the Europarl corpus that forms the core of our database.

Laura Mascarell: Discourse-level lexical choice and consistency

Laura holds a diploma in computer software and a master's degree in information technologies, both from the Universitat Politècnica de Catalunya (Barcelona). She has specialized in natural language processing, software engineering and information systems. Currently she is pursuing her PhD in machine translation (at CL@UZH).
She will present her work on the consistent translation of German compound coreferences using two in-domain phrase-based SMT systems. In contrast to most statistical machine translation (SMT) systems, which translate at the sentence level and thereby introduce translation inconsistencies across a document, she presents a method to enforce consistency across the whole document. Her experimental results demonstrate that the correctness and consistency of compound coreferences can be improved.

Tobias Kuhn: Meme extraction from corpora of scientific literature using citation networks

Tobias holds a master's degree in computer science from the University of Zurich and a PhD in computer science from the Institute of Computational Linguistics of the University of Zurich. He then held a number of postdoc positions at the Universities of Chile, Zurich, Malta, Helsinki and Yale (Prof. Krauthammer), at the SIB, and now at ETH Zürich (Chair of Sociology). His research projects concern computational linguistics, bioinformatics, simulation, the semantic web, social systems, controlled natural languages, and artificial intelligence.
This talk is about the automatic extraction of scientific concepts, i.e. memes, from large corpora of scientific literature. This work shows that citation networks can provide powerful clues for interpreting large quantities of scientific texts, in particular for observing trends, tracking ideas, and detecting research fields. Our technique has the potential to improve existing approaches on terminology extraction, named-entity extraction, topic modeling, and keyphrase extraction, but it also has a remarkable performance on its own. We validated our simple meme formula with data from close to 50 million publication records from the Web of Science, PubMed Central, and the American Physical Society. Evaluations relying on human annotators, network randomizations, and comparisons with several alternative approaches confirm that our technique is accurate and effective, even without including linguistic or ontological knowledge and without the application of arbitrary thresholds or filters.
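The "meme formula" mentioned above combines how often a term occurs with how strongly it propagates along citation links. The sketch below is an illustrative reconstruction of that idea, not the exact published formula: a term scores highly if papers that cite meme-carrying papers tend to carry the meme themselves, while papers citing no meme-carrying paper rarely do.

```python
def meme_score(papers, meme, delta=1e-4):
    """Illustrative meme score: frequency times a propagation ratio.

    papers: dict paper_id -> {"terms": set of terms, "cites": set of
    paper_ids}. `delta` smooths the ratio so it is defined even when
    a group is empty or never carries the meme.
    """
    carriers = {pid for pid, p in papers.items() if meme in p["terms"]}
    frequency = len(carriers) / len(papers)

    # papers that cite at least one meme-carrying paper vs. those that don't
    cite_meme = [pid for pid, p in papers.items() if p["cites"] & carriers]
    no_cite_meme = [pid for pid, p in papers.items()
                    if p["cites"] and not p["cites"] & carriers]

    sticking = (sum(pid in carriers for pid in cite_meme) + delta) \
        / (len(cite_meme) + delta)
    sparking = (sum(pid in carriers for pid in no_cite_meme) + delta) \
        / (len(no_cite_meme) + delta)
    return frequency * sticking / sparking

# Tiny invented citation network: "rnai" spreads along citations
papers = {
    "p1": {"terms": {"rnai"}, "cites": set()},
    "p2": {"terms": {"rnai"}, "cites": {"p1"}},
    "p3": {"terms": set(), "cites": {"p1"}},
    "p4": {"terms": set(), "cites": {"p5"}},
    "p5": {"terms": set(), "cites": set()},
}
score = meme_score(papers, "rnai")
```

On this toy network the score is large because the meme appears only downstream of papers that already carry it, which is the propagation signal the talk describes.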
