Colloquium Schedule, Fall Semester 2017

Colloquium, Fall Semester 2017: Reports on current research at the institute, BA/MA theses, programming projects, guest talks

Time/Place: Roughly every 14 days on Tuesdays from 10:15 to 12:00, BIN-2.A.01

Organizers: Kyoko Sugisaki and Martin Volk

Contact: Kyoko Sugisaki

Date

Speaker / Topic

19 September

Programming project talk: Dominique Sandoz (Going further with multilingwis, a web based search engine for exploration of word-aligned parallel and multiparallel corpora)

Master’s thesis talk: Xi Rao (Automatic Labeling of Articles in International Investment Agreements Using Semi-Supervised Learning and Word Embeddings)

3 October

Master’s thesis talk: Parijat Ghoshal (Topic Modeling and Visualisation of Diachronic Trends in Biomedical Academic Articles)

Project talk: Dina Vishnyakova (Tracking and integrating the scientific output of researchers in the biomedical space)

17 October

Master’s thesis talk: Yvonne Gwerder (Named Entity Recognition in Digitized Historical Texts)

Current matters:

  • Norbert Fuchs (The Pigeon Hole Principle: Puzzles & Applications)
  • Peter Makarov (UZH at TAC KBP 2017: Event Nugget Detection via Joint Learning with Softmax-Margin Objective)

31 October

PhD project talk: Mathias Müller (Document-level context in deep recurrent neural networks)

Current matters:

  • Michi Amsler (CONLL_reader_utils: python library for working with parses)
  • Samuel Läubli (Perceptions of MT in Social Media)

14 November

Project talk: Tilia Ellendorff (PsyMine: crowd-style annotations and recognition of event sentences)

PhD project talk: Lenz Furrer (Disambiguating biomedical entities)

28 November

Guest talk: Don Tuggener (ZHAW): Word embeddings and weighted word overlap: A practical ensemble approach to Question Matching in a dialogue simulator

Guest talk: Jan Deriu (ZHAW): tbc

12 December

Guest talk: Prof. Ce Zhang (ETH Zürich): tbc

 

19 September

Dominique Sandoz:

Title: Going further with multilingwis, a web based search engine for exploration of word-aligned parallel and multiparallel corpora

Abstract: Searching and exploring multi-word units in large multiparallel corpora became possible in 2016 with a web-based search engine called 'multilingwis'. Owing to its success and to improvements in the technical components it was built from, the application was reimplemented as multilingwis2, which was released in early 2017 as a SPARCLING project. The new version includes state-of-the-art search techniques, new features and an easy-to-use web interface. This presentation will provide insight into the whys, whats and hows of this remake.

Xi Rao:

Title: Automatic Labeling of Articles in International Investment Agreements

Abstract: International investment agreements (IIAs) are international commitments amongst contracting parties to protect and promote investment. Although each treaty has a distinctive structure regarding placement and organization of information, IIAs as instruments of international law share underlying textual and legal structures. Treaty articles are important components of IIAs: some articles have been assigned titles, while others remain untitled. This work uses the collection of English-language IIA texts created by the project Diffusion of International Law under the Swiss Network for International Studies (SNIS), referred to as the SNIS corpus. Automatically assigning titles to untitled articles is crucial for analyzing treaty structure and content thoroughly; however, similar texts have been given numerous different titles owing to variability in factors such as negotiating partners, languages, and traditions. This variability leads to more than 5,000 distinct surface forms of titles after normalization. Hence, we first used k-means clustering with document embeddings as features to group the 34,524 titled articles into ten classes. The ten topics were then used as labels in multiclass classification tasks in which titles were assigned to 10,074 untitled articles. Expert annotations of 100 untitled articles served as the gold standard in the evaluation. We found that k-means clustering with retrained word embeddings tailored to the SNIS corpus brought about an increase of 30% in accuracy compared with a simple convolutional neural network (CNN) classifier, which scored highest amongst all supervised classifiers with an accuracy of 46%.
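The two-stage idea described in this abstract (cluster the titled articles, then use the clusters as labels for untitled ones) can be sketched in a few lines. This is only an illustration under stated assumptions, not the thesis code: the two-dimensional vectors stand in for document embeddings, two clusters stand in for the ten classes, the helper names `nearest` and `kmeans` are hypothetical, and a nearest-centroid rule replaces the classifiers evaluated in the thesis.

```python
def nearest(point, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    distances = [
        sum((p - c) ** 2 for p, c in zip(point, centroid))
        for centroid in centroids
    ]
    return distances.index(min(distances))


def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm, started from the given initial centroids."""
    for _ in range(iterations):
        # Assignment step: put every point into the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            clusters[nearest(point, centroids)].append(point)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            [sum(axis) / len(cluster) for axis in zip(*cluster)] if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids


# Made-up toy vectors standing in for document embeddings.
titled = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]]
untitled = [[0.1, 0.2], [0.8, 1.0]]

centroids = kmeans(titled, [titled[0], titled[2]])
labels = [nearest(vector, centroids) for vector in untitled]
print(labels)
```

Each untitled vector receives the index of the cluster it falls closest to; in the setting of the abstract, that index would correspond to one of the ten topic classes.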

3 October

Parijat Ghoshal:

Title: Topic Modeling and Visualisation of Diachronic Trends in Biomedical Academic Articles

Abstract: In the biomedical domain, the abundance of texts makes obtaining a thematic overview a challenging endeavour. This is partly because many of these texts are unlabelled and cannot always be assigned to a single thematic domain. Some texts remain thematically ambiguous, and sorting them neatly into thematic domains is impossible. It can therefore be helpful to use an unsupervised algorithm to sort a corpus of unlabelled data into topics. For my Master’s thesis, I applied latent Dirichlet allocation to a corpus of biomedical articles to generate topics automatically. I generated topic models based on articles from PubMed Central’s Open Access Subset and used them to observe diachronic trends on three levels. On the first level, I observed diachronic changes in the popularity of the topics themselves. On the second level, I checked how the popularity of the topic words within a topic evolved throughout the corpus. On the third level, I observed the popularity of common words that belong to documents about a certain topic. A companion website and a topic modelling pipeline were also created as outputs of this project.
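The first level of analysis, the change in topic popularity over time, can be illustrated with a short sketch. It is not the thesis pipeline: it assumes that a topic model has already produced one topic distribution per document, and the function name `topic_share_by_year`, the input format and the toy numbers are hypothetical.

```python
from collections import defaultdict


def topic_share_by_year(doc_topics):
    """Average the per-document topic distributions within each year.

    doc_topics: list of (year, distribution) pairs, where each distribution
    is a list of topic probabilities (assumed input format).
    Returns {year: [mean probability of topic 0, topic 1, ...]}.
    """
    sums = {}
    counts = defaultdict(int)
    for year, distribution in doc_topics:
        if year not in sums:
            sums[year] = [0.0] * len(distribution)
        for topic, probability in enumerate(distribution):
            sums[year][topic] += probability
        counts[year] += 1
    return {
        year: [total / counts[year] for total in sums[year]]
        for year in sorted(sums)
    }


# Made-up toy data: three documents, two topics.
docs = [
    (2010, [0.8, 0.2]),
    (2010, [0.6, 0.4]),
    (2011, [0.3, 0.7]),
]
shares = topic_share_by_year(docs)
for year, share in shares.items():
    print(year, [round(p, 2) for p in share])
```

Plotting the resulting per-year shares for one topic gives a simple popularity curve of the kind the abstract describes.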

Dina Vishnyakova: 

Title: Tracking and integrating the scientific output of researchers in the biomedical space

Abstract: Keeping abreast of scientific discoveries and of the scientists, institutions and companies that produce them is fundamental for innovation. But what are effective ways to stay up to date? A lot of energy is spent on tracking scientific publications and conferences or on identifying key researchers and potential collaborators. These efforts involve a substantial amount of systematic manual curation and do not manage to explore the entire breadth of scientific information that is available in the open domain. In particular, it is challenging to track the work of individual researchers. Scientific documents are spread in a manner that makes it difficult to build a complete representation of a researcher’s scientific work. An information scientist who conducts a literature review by collecting and interpreting information can use either a concept-centric or an author-centric method. Indeed, a significant share of literature reviews are conducted with an author-centric approach using queries based on author names. However, when a scientist goes through publications listed under the same author name, there is no guarantee that all of those publications belong to the same author. To make matters more challenging, affiliation information (mainly contact details) is often missing from publications, and a researcher may change institutions over time. A further problem is that author names can be recorded differently depending on local transliteration preferences, e.g. Müller in Germany may be written as Mueller or Muller in English. Different attempts have been made to solve these problems, for instance by assigning unique identifiers such as ORCID to authors, or by developing author name disambiguation (AND) methods, which are the focus of our work.
Several published algorithms and methodologies address AND in MEDLINE, the main source of biomedical publications, using both supervised and unsupervised machine learning. We started by developing our own methodology to disambiguate authorship, which does not rely only on comparing the available information but also includes descriptors for domains of research. We then chose a supervised machine learning algorithm, C4.5 (decision trees), which showed the best performance for AND, and trained it with the descriptors as supplementary features. We showed that this methodology improved over the current state of the art (Vishnyakova, 2016; Song, 2014). The results showed that these features helped to disambiguate authors even when information on the author’s affiliation was missing. This confirmed our assumption that additional features describing the main subjects and domains of a publication improve the results of AND. To train and evaluate our algorithm, we used an existing publicly available gold standard (a manually curated data set). Our subsequent analysis of this gold standard revealed several shortcomings, which particularly affected the disambiguation of Asian names. We therefore decided to create a new gold standard that is actually representative of MEDLINE. For this purpose we involved crowdsourcing platforms such as CrowdFlower and Amazon MTurk as well as a group of expert curators. The results produced by the experts were superior to those from the crowdsourcing platforms. Our evaluations on this newly created gold standard showed that we could achieve performance improvements in AND. Additionally, we developed a prototype tool that allows users to conduct author-centric searches on MEDLINE records. As a search engine, our prototype returns ranked lists of authors with information about their publications, affiliations, email addresses, etc.
In the future, the author lists will be linked to other open source databases such as ClinicalTrials.gov, grants and patents. With all the information about researchers’ activities in one place, one can gain a full overview of who is making progress in science.
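The transliteration problem mentioned in this abstract (Müller, Mueller, Muller) can be made concrete with a small normalisation sketch. It is not the disambiguation method of the talk, which relies on a trained decision tree with research-domain descriptors; it only shows how name variants could be mapped to a shared key before any comparison. The function `name_key` and the digraph rule are assumptions made for illustration.

```python
import unicodedata

# Digraph spellings of German umlauts, folded to the base vowel so that
# "Mueller" and "Müller" share a key. This rule is illustrative and
# deliberately coarse: it also shortens unrelated names containing "ue".
DIGRAPHS = {"ae": "a", "oe": "o", "ue": "u"}


def name_key(surname):
    """Map spelling variants of a surname to one comparison key."""
    key = surname.lower().replace("ß", "ss")
    # Decompose accented letters and drop the combining marks ("ü" -> "u").
    key = unicodedata.normalize("NFKD", key)
    key = "".join(ch for ch in key if not unicodedata.combining(ch))
    for digraph, vowel in DIGRAPHS.items():
        key = key.replace(digraph, vowel)
    return key


for variant in ["Müller", "Mueller", "Muller"]:
    print(variant, "->", name_key(variant))
```

A key of this kind only groups candidate records. Deciding whether two records with the same key belong to the same person is the actual disambiguation task described above.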

17 October

Yvonne Gwerder:

Title: Named Entity Recognition in Digitized Historical Texts

Abstract: We present an approach to automatically recognizing named entities in legal documents written in late medieval and early modern variants of German and French. We describe the transformation of the digitized texts into a structured XML format and show how resources for tokenization and OCR processing can be adapted and applied to this end. Named entities are extracted from indices of place and person names and subsequently detected in the texts via approximate string matching. The resulting pre-annotated texts are then additionally tagged with an off-the-shelf named entity recognition tool intended for the modern language. Finally, by training and testing our own machine learning models, we aim to illustrate the main possibilities and limitations characteristic of historical data.
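The step of detecting index entries in the texts via approximate string matching can be sketched with Python's standard `difflib` module. This is a minimal illustration, not the approach used in the thesis: the index entries, the spelling variants, the cutoff value and the helper `match_names` are all made up for the example.

```python
import difflib


def match_names(tokens, index, cutoff=0.8):
    """Return (token, index entry) pairs for tokens that resemble an index entry."""
    lowered = {entry.lower(): entry for entry in index}
    hits = []
    for token in tokens:
        # get_close_matches ranks candidates by SequenceMatcher similarity
        # and keeps only those scoring at least `cutoff`.
        close = difflib.get_close_matches(
            token.lower(), list(lowered), n=1, cutoff=cutoff
        )
        if close:
            hits.append((token, lowered[close[0]]))
    return hits


# Hypothetical place-name index and a tokenized text with spelling variants.
index = ["Zürich", "Bern", "Lausanne"]
tokens = ["zu", "Zürich", "Zurich", "Berne", "und"]

hits = match_names(tokens, index)
for token, entry in hits:
    print(token, "->", entry)
```

The cutoff controls the trade-off between catching historical spelling variants and wrongly matching short function words, so it would have to be tuned on real data.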

Norbert Fuchs:

Title: The Pigeon Hole Principle: Puzzles & Applications

Peter Makarov:

Title: UZH at TAC KBP 2017: Event Nugget Detection via Joint Learning with Softmax-Margin Objective

 

31 October

Mathias Müller:

Title: Document-level context in deep recurrent neural networks

Abstract: Preliminary experiments have shown that neural machine translation systems benefit from document-level context. Fortunately, encoder-decoder models, the most widely used approach to neural translation, are easily extended with more context. Any additional information (such as the previous source sentence) can be transformed into a hidden representation by an additional encoder. During decoding, access to the hidden states of this additional encoder is regulated by an additional attention mechanism. In this talk I will explain how to model this problem with a conditional GRU network with deep transition, an arbitrary number of layers and an arbitrary number of additional encoder and attention networks.
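The idea of giving the decoder one attention mechanism per encoder can be sketched as follows. This is a simplified illustration and not the model presented in the talk: it uses plain dot-product attention (the scoring function of the actual system is not stated in the abstract), omits the GRU entirely, and all names and toy vectors are hypothetical.

```python
import math


def attend(query, states):
    """Dot-product attention: softmax-weighted average of encoder states."""
    scores = [sum(q * s for q, s in zip(query, state)) for state in states]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    size = len(states[0])
    return [
        sum(weight * state[i] for weight, state in zip(weights, states))
        for i in range(size)
    ]


def decoder_context(query, encoders):
    """Concatenate one context vector per encoder (main plus additional ones)."""
    context = []
    for states in encoders:
        context.extend(attend(query, states))
    return context


# Toy hidden states: the current source sentence and the previous one.
current_source = [[1.0, 0.0], [0.0, 1.0]]
previous_source = [[0.5, 0.5]]

context = decoder_context([1.0, 0.0], [current_source, previous_source])
print([round(value, 3) for value in context])
```

Adding a further source of context amounts to appending another list of hidden states to `encoders`, which mirrors the "arbitrary number of additional encoder and attention networks" mentioned in the abstract.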

Michi Amsler:

Title: CONLL_reader_utils: python library for working with parses

Samuel Läubli:

Title: Perceptions of MT in Social Media

14 November

Tilia Ellendorff: 

Title: PsyMine: crowd-style annotations and recognition of event sentences

Abstract: The PsyMine project is about recognising etiological factors for psychiatric disorders in biomedical research papers. Biomedical text mining methods are applied with the aim of recognising and extracting etiological events. This year’s presentation will focus on crowd-style annotation of event structures and sentence classification for event recognition.

Lenz Furrer:

Title: Disambiguating biomedical entities

Abstract: In biomedical text mining, entity recognition is an important initial step for all kinds of applications. This task involves disambiguation on different levels: detecting relevant terms (i.e. distinguishing relevant from irrelevant text spans), determining their entity type (e.g. gene, chemical, disease), and assigning a unique identifier (also known as linking, grounding, normalisation, or concept recognition). One approach is a pipeline with a candidate-generation phase followed by filtering steps. This strategy, combining knowledge-based candidate generation with a neural network for filtering, was very successful in an evaluation on a standard annotated corpus of full-text scientific articles.
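The pipeline structure named in this abstract, candidate generation followed by filtering, can be sketched as below. This is a toy illustration under loud assumptions: the terminology, the identifiers and the cue words are invented, and a simple cue-word score stands in for the neural network filter used in the actual work.

```python
# Hypothetical mini-terminology: surface form -> (entity type, identifier).
# The identifiers are invented placeholders, not real database IDs.
TERMINOLOGY = {
    "cat": [("gene", "GENE:0001"), ("organism", "ORG:0002")],
    "aspirin": [("chemical", "CHEM:0003")],
}

# Invented context cues used by the stand-in filter.
CONTEXT_CUES = {
    "gene": {"expression", "gene", "mutation"},
    "organism": {"animal", "species"},
    "chemical": {"dose", "drug"},
}


def generate_candidates(tokens):
    """Knowledge-based step: look every token up in the terminology."""
    for position, token in enumerate(tokens):
        for entity_type, identifier in TERMINOLOGY.get(token.lower(), []):
            yield position, token, entity_type, identifier


def score(tokens, entity_type):
    """Stand-in for the filter model: share of cue words found in the sentence."""
    cues = CONTEXT_CUES[entity_type]
    present = cues & {token.lower() for token in tokens}
    return len(present) / len(cues)


def annotate(tokens, threshold=0.3):
    """Keep only the candidates whose filter score reaches the threshold."""
    return [
        candidate
        for candidate in generate_candidates(tokens)
        if score(tokens, candidate[2]) >= threshold
    ]


sentence = "Expression of the CAT gene was reduced".split()
annotations = annotate(sentence)
print(annotations)
```

The generation step deliberately over-produces (both readings of "CAT" are proposed), and the filter decides which reading fits the context. This division of labour is the point of the pipeline design.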

28 November

Don Tuggener (ZHAW):

Title: Behavioural Simulator for Professional Training based on Natural Language Interaction

Abstract: 

We present results of Question Matching experiments that combine two features, namely word embeddings and weighted word overlap. The aim is to match a given input question to a predefined set of available questions in a dialogue simulator for professional training. We find that both features have their strengths and weaknesses and that combining them with an ensemble SVM significantly improves on their individual performance.

The experiments were conducted as part of the CTI project "Behavioural Simulator for Professional Training based on Natural Language Interaction", which aims to replace the current keyboard and mouse input in a dialogue simulator with free speech input. One key challenge that arises in such a setting, besides handling speech input, is determining whether a given input fails to match any of the available questions, and what measures to take in such a situation. We present an approach that tries to detect such non-matching inputs.
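One of the two features, weighted word overlap, together with the rejection of non-matching inputs, can be sketched as follows. This is an illustration only: the abstract does not specify the weighting scheme or the rejection method, so the inverse-document-frequency weights, the fixed threshold and all function names are assumptions, and the word-embedding feature and the ensemble SVM are left out.

```python
import math


def idf_weights(questions):
    """Inverse document frequency of each word over the predefined questions."""
    count = len(questions)
    document_frequency = {}
    for question in questions:
        for word in set(question.lower().split()):
            document_frequency[word] = document_frequency.get(word, 0) + 1
    return {
        word: math.log(count / frequency) + 1.0
        for word, frequency in document_frequency.items()
    }


def weighted_overlap(a, b, weights):
    """Weight of the shared words divided by the weight of all words in either text."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    default = max(weights.values())  # unseen words are treated as rare
    total = sum(weights.get(word, default) for word in words_a | words_b)
    shared = sum(weights.get(word, default) for word in words_a & words_b)
    return shared / total if total else 0.0


def best_match(user_input, questions, threshold=0.5):
    """Return the closest predefined question, or None if nothing is close enough."""
    weights = idf_weights(questions)
    scored = [(weighted_overlap(user_input, q, weights), q) for q in questions]
    best_score, question = max(scored)
    return question if best_score >= threshold else None


# Made-up question inventory for the example.
questions = ["how are you today", "what is your name", "where do you live"]

print(best_match("what is your name please", questions))
print(best_match("tell me a joke", questions))
```

Returning `None` below the threshold is the simplest possible way to flag a non-matching input; the dialogue system can then ask the user to rephrase.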

Link to the CTI project: http://www.cl.uzh.ch/en/research/opinionmining/lifelike.html

Guest talks:

Jan Deriu (ZHAW): tbc

12 December:

Prof. Ce Zhang (ETH Zürich): tbc