

Institut für Computerlinguistik

Colloquium in Computational Linguistics, Spring Semester 2014

Biomedical text mining
(plus reports on current research at the institute)

Time/Place: Roughly every 14 days on Tuesdays from 10:15 to 12:00 in room BIN 2.A.10.

Lecturers: Fabio Rinaldi, Michael Hess, Martin Volk

Contact: Fabio Rinaldi



Speakers / Responsible

Topic / Slides

February 18, 2014

Manuela Weibel

Susanna Tron

Creation of two parallel corpora (German - Rumantsch Grischun) and their integration into a bilingual corpus word-search system

A verb-centered Sentiment Analysis for French

March 4, 2014

Pierre Zweigenbaum

Information extraction from clinical texts

March 18, 2014

Dietrich Rebholz-Schuhmann

Fabio Rinaldi

MANTRA: multilingual biomedical terminology enhancement. Results so far.

SASEBio: Semi-Automated Semantic Enrichment of the Biomedical Literature. A final balance.

April 1, 2014

Anne Goehring

Luzia Roth

SQUOIA: Building a Spanish-German dictionary for MT


April 15, 2014

Florian Leitner

From the BioCreative text mining challenges to transcription regulation network extraction

May 6, 2014

Sarah Ebling

Kyoko Sugisaki

Automatic translation of German train announcements into Swiss German Sign Language

Automated Detection of Syntax-related Style Violations in Legislative Drafts

May 20, 2014

Noah Bubenhofer

Diagram games: thoughts on a "visual linguistics", using geocollocations as an example

Pierre Zweigenbaum: Biomedical information extraction at LIMSI

Health has long been a favored testbed for artificial intelligence and natural language processing methods. LIMSI has been taking part in projects and international challenges on biomedical information extraction since 2007. This talk will present a sample of this work. It will first cover a few topics in biomedical information extraction, with different stances with respect to expert-based or data-driven methods: de-identification of clinical reports, detection of drug prescriptions, unsupervised word classes for supervised entity recognition, experiments in combining expert-based and data-driven classifiers, fully supervised abbreviation expansion, and co-reference resolution for relation detection. It will then briefly outline some recent projects which tackle French clinical language processing, machine translation of scientific abstracts, the search for similar cases based on clinical reports, pharmacovigilance in patient forums, and a dialoguing virtual patient to train health care professionals.

Short bio

Pierre Zweigenbaum has been a CNRS Senior Researcher at LIMSI (Orsay) since 2006, where he leads the Natural Language Processing group LIMSI/ILES. From 2003 to 2013 he was a part-time Professor at INALCO, where he taught Natural Language Engineering. Before 2007 he had been doing research at the Paris Hospital Trust (APHP) and INSERM for over twenty years. After two engineering degrees, he obtained a PhD in Computer Science from Télécom ParisTech (1985) and a Habilitation à diriger des recherches from Université Paris-Nord (1998). His main research interests are in Information Extraction in multilingual settings, with applications to the medical domain. With the LIMSI-CNRS team, he has been taking part since 2009 in the yearly i2b2 challenges on information extraction from clinical texts. He coordinated the EU-funded MENELAS project (1992-1995) on the analysis of natural language patient records and was leader of the workpackage on Machine Translation in Context in the FP7 T4ME/META-NET network of excellence (2010-2013). He is the chair of the recently created Francophone Special Interest Group of IMIA, which fosters the development of resources and tools to process French clinical texts. LIMSI/ILES stands for Information, Written and Signed Language. LIMSI/ILES performs fundamental and applied research on corpus collection and annotation, evaluation in natural language processing, paraphrasing and multilingualism, information extraction (including temporal and opinion), question answering, machine reading, sign language modeling, and methods to generate realistic signed productions.

Dietrich Rebholz-Schuhmann: MANTRA: multilingual biomedical terminology enhancement, results so far.

The CLEF-ER challenge (part of the Mantra project) addressed solutions to improve entity recognition (ER) in parallel multilingual document collections. It brought together researchers working on biomedical entity recognition, normalisation of entity mentions, and machine translation to address the challenges linked to the identification of concepts in the biomedical domain. A large set of documents in different languages, i.e. Medline titles, EMEA drug label documents and patent claims, has been prepared to enable ER in parallel documents. Each set of documents forms a corpus-language pair (CLP), for example the full set of Medline abstract titles in German is the “EMEA/de” CLP, and the number of documents for each CLP varies from about 120,000 for patents up to 760,000 for Medline abstract titles. The challenge participants have been asked to annotate entity mentions with concept unique identifiers (CUIs) in the documents of their preferred non-English language. The main ER task is concerned with the attribution of CUIs to entities in the non-English corpora, and a second task targets the identification of entity mentions against a silver standard corpus. The challenge participants could make use of the prepared terminological resources for entity normalisation and of the English silver standard corpora as input for concept candidates in the parallel non-English documents. Several evaluation measures have been applied to determine the best performing solutions against the different CLPs, e.g. the F1-measure for the entity recognition in the non-English languages (“evaluation A”) and for the assignment of the correct CUIs evaluated against an English silver standard corpus (SSC, “evaluation B”). The results will be discussed in the presentation.
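
As a rough illustration of the F1-measure mentioned above, the following sketch scores predicted annotations against a silver standard by exact match. The record layout (document id, start offset, end offset, CUI), the function name, and the sample data are assumptions made for illustration; the actual CLEF-ER evaluation procedure may differ.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    """Hypothetical annotation record; the field layout is an assumption."""
    doc_id: str
    start: int
    end: int
    cui: str


def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall and F1 between two annotation sets."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example data (the CUIs are placeholders, not taken from the challenge).
gold = [
    Annotation("doc1", 0, 8, "C0000001"),
    Annotation("doc1", 15, 22, "C0000002"),
    Annotation("doc2", 3, 10, "C0000003"),
]
predicted = [
    Annotation("doc1", 0, 8, "C0000001"),    # correct span and CUI
    Annotation("doc1", 15, 22, "C0000009"),  # correct span, wrong CUI
]

p, r, f1 = precision_recall_f1(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

Dropping the `cui` field from the comparison would turn this into a score for mention detection only, which is one way the two evaluations described above could be told apart.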

Short bio

Dietrich Rebholz-Schuhmann is a physician (Univ. Düsseldorf, 1988) with a doctorate in immunology (Univ. Düsseldorf, 1989) and a Master's degree in computer science (Univ. Passau, 1993). After his studies he worked as a senior scientist at gsf (Munich) in the field of image analysis and 3D visualization. From 1998 to 2003 he led a research team at LION Bioscience AG (Heidelberg) and developed novel text mining solutions. From 2003 to 2012 he was a research group leader for biomedical literature analysis at the European Bioinformatics Institute, Hinxton (UK). Since July 2012 he has been a senior research associate (Oberassistent) at the University of Zurich in the Department of Computational Linguistics, where he leads the project.

Fabio Rinaldi: SASEBio: Semi-Automated Semantic Enrichment of the Biomedical Literature. A final balance.

There are vast amounts of knowledge encoded in the scientific literature which could be made more easily accessible and useful to a broader range of users through the application of more effective software tools. Text mining is a new discipline which seeks to provide ways to find, extract and manipulate the knowledge which still remains to a large extent hidden in the literature. Text mining tools can already provide a very effective way to extract some specific types of information, but are not yet so advanced that their results can be used without human verification by domain experts. Therefore one very promising area of application of text mining technologies is within the process of database curation. The need to efficiently retrieve key information derived from experimental results and published in the scientific literature is of fundamental importance in biology. In order to help biologists, as well as in some cases medical practitioners, to efficiently find such information in the enormous quantity of published articles, several public and private institutions fund the construction and maintenance of specialized databases, whose role is to collect specific knowledge items and provide them in an easily accessible format. There are several dozen such databases, each specializing in a particular domain of the life sciences [1]. In this talk I will describe text mining activities conducted by my research group at the University of Zurich (OntoGene). The OntoGene group is supported by the Swiss National Science Foundation (project SASEBio: Semi-Automated Semantic Enrichment of the Biomedical Literature) and by Roche Pharmaceuticals. The SASEBio project focuses in particular on applications of text mining technologies to the process of biomedical database curation. The OntoGene team has participated in several competitive evaluations of biomedical text mining technologies, obtaining competitive results in all of them.
Some of these results will be discussed in the talk. Additionally, I will present ODIN (OntoGene Document Inspector), an interactive tool which allows database curators to leverage the results of the OntoGene text mining system and use them in their curation tasks.

Short bio

Fabio Rinaldi is the leader of the OntoGene research group at the University of Zurich and the principal investigator of the SASEBio project. He holds an MSc in Computer Science (University of Udine, Italy) and a PhD in Computational Linguistics (University of Zurich, Switzerland). He is the author of more than 100 scientific publications (including 19 journal papers) dealing with topics such as Ontologies, Text Mining, Text Classification, Document and Knowledge Management, Language Resources and Terminology.

Anne Goehring: SQUOIA: Building a Spanish-German dictionary for MT

I will describe the development of the Spanish-German dictionary used in our hybrid MT system. The compilation process relies entirely on open source tools and freely available language resources. The resulting bilingual dictionary currently contains around 33,700 entries.

Luzia Roth: Chunk tagging

Abstract: The first part of the talk presents the programming project on chunk tagging with structural tags using CRFs. Each structural tag indicates the relation to the preceding and the following token, and the tags thereby represent the chunk structures. The method is modeled on Skut's HMM-based approach [1]. The second part presents additional experiments for the programming project with alternative forms of evaluation and algorithms, as well as a comparison with an HMM tagger.
[1] Wojciech Skut and Thorsten Brants. Chunk Tagger -- Statistical Recognition of Noun Phrases. Proceedings of the ESSLLI Workshop on Automated Acquisition of Syntax and Parsing. 1998.
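
To make the idea of structural tags concrete, here is a minimal sketch that derives one tag per token from a chunked sentence and then rebuilds the chunks from the tags. The two-symbol tag alphabet (`=` for "same chunk as the neighbouring token", `|` for "chunk boundary"), the function names, and the example sentence are assumptions for illustration. They are not the tag set of the project or of Skut and Brants [1], and no CRF training is shown.

```python
def chunks_to_structure_tags(chunks):
    """Turn a list of non-empty chunks (lists of tokens) into (token, tag) pairs.

    Each tag has two symbols: the relation to the previous token and the
    relation to the next token. '=' means same chunk, '|' means chunk
    boundary. This simplified alphabet is an assumption for illustration.
    """
    tagged = []
    for chunk in chunks:
        for i, token in enumerate(chunk):
            prev_rel = "=" if i > 0 else "|"
            next_rel = "=" if i < len(chunk) - 1 else "|"
            tagged.append((token, prev_rel + next_rel))
    return tagged


def structure_tags_to_chunks(tagged):
    """Rebuild the chunks using only the 'relation to previous token' symbol."""
    chunks = []
    for token, tag in tagged:
        if tag[0] == "|" or not chunks:
            chunks.append([token])
        else:
            chunks[-1].append(token)
    return chunks


# Made-up example sentence, already split into chunks.
sentence = [["the", "old", "house"], ["stands"], ["on", "the", "hill"]]
tagged = chunks_to_structure_tags(sentence)
for token, tag in tagged:
    print(f"{token}\t{tag}")

# The tags carry enough information to recover the original chunking.
assert structure_tags_to_chunks(tagged) == sentence
```

In a tagger, such tags would be the per-token labels predicted by a sequence model (a CRF in the project described above), and the chunk structure would be read off the predicted tag sequence.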

Florian Leitner: From the BioCreative text mining challenges to transcription regulation network extraction.

More than a decade ago, biomedical text mining was a minor discipline of bioinformatics. Today, it plays the central role of providing the scientific background knowledge for nearly every larger project being published. Beyond traditional information retrieval (e.g., search engines), text mining extracts qualified, relevant content for the biomedical researcher. It addresses issues ranging from uncovering gene-disease relationships in neurodegenerative diseases to aiding in the design of models of the networks that drive cancer progression. This recent growth has led to a rich collection of algorithms and methodologies. To be able to make fair, direct comparisons between these approaches, several community challenges have been created, among the first of them the BioCreative challenges. An important objective for BioCreative has always been the promotion of publicly available tools, making it a driver of the applied aspect of text mining. During the BioCreative II challenge, extraction of protein interactions was the main topic. This topic was repeated with the II.5 challenge in collaboration with FEBS Letters and MINT, but now directly comparing the text mining systems to the authors' own annotations and the work of professional bio-curators. During BioCreative III, the focus switched from the interactions to the underlying experimental methods, information that is critical to biologists but too often neglected by text miners. In its latest installment, BioCreative IV, the detection of chemical entity mentions in text was the main theme, providing the community with a large corpus of manually annotated abstracts. Apart from the proteome, the transcriptome forms another central network in molecular biology, interfacing proteins and genes. To standardize the description of this interface, a large international consortium of biologists, curators (e.g., GOA), databases (e.g., IntAct), and text miners, led by Astrid Laegreid from NTNU, has been created.
It is now leading to the first literature-curated catalogue of mammalian TFs, and will soon cover their known target genes, too.

Short bio

Florian Leitner has just started his Juan de la Cierva research fellowship at the Universidad Politécnica de Madrid. He has been a post-doctoral researcher in the Structural Computational Biology group at the Spanish National Cancer Research Center (CNIO), where he also completed his research for a PhD degree in Molecular Biology. He holds a Master's degree in Molecular Biology from the University of Vienna, earned while working on post-translational protein modifications with Frank Eisenhaber at the IMP in Austria. Earlier, he had collaborated as an undergraduate with Rebecca Wade's group in Heidelberg on the annotation of protein structures, during which he achieved his first co-author publication. His doctoral thesis summarizes his work at the CNIO/Universidad Autónoma de Madrid under the supervision of Alfonso Valencia. He has published more than two dozen articles, pioneered the integration of biomedical text mining systems into a meta-system (the BioCreative Meta-Server), and was the main organizer and is now a co-organizer of the BioCreative text mining community challenges, where topics have been protein interaction detection, experimental method classification, and chemical entity recognition. He is currently using text mining for the detection of transcription regulation networks from literature. With his new fellowship in Pedro Larrañaga's Computational Intelligence Group, he will additionally be working on issues related to the Human Brain Project.

Sarah Ebling: Automatic translation of German train announcements into Swiss German Sign Language

This talk will report on the latest progress in the Trainslate project. In particular, the relevance and annotation of non-manual features in our parallel corpus of train announcements will be discussed. Preliminary ideas for the automatic generation of non-manual features will also be presented.

Kyoko Sugisaki: Automated Detection of Syntax-related Style Violations in Legislative Drafts

In my talk, I will present my ongoing PhD project "automatic detection of syntax-related style rules", whose aim is to develop a domain-specific tool for the detection of syntax-related style violations in Swiss German-language legal texts. In particular, I will talk about two components of the style checking tool: the recognition of grammatical functions and the detection of overly complex participle phrases.

Noah Bubenhofer: Diagram games: thoughts on a "visual linguistics", using geocollocations as an example

In this talk I argue for a "visual linguistics" as a research direction. The background is the semiotic foundations of diagrammatics, which can be used to describe the functions of visualizations in the research process. Visual analysis methods are especially useful in corpus-linguistic approaches to language data, where large amounts of data are prepared for exploratory analysis. Using so-called "geocollocations" (collocates of toponyms) as an example, I show one application from this area.
