Colloquium Schedule, Fall Semester 2017

Colloquium, Fall Semester 2017: Reports on current research at the institute, BA/MA theses, programming projects, guest talks

Time/Location: Roughly every 14 days on Tuesdays from 10:15 to 12:00, BIN-2.A.01

Organizers: Kyoko Sugisaki and Martin Volk

Contact: Kyoko Sugisaki


Speaker / Topic

19 September

Programming project talk: Dominique Sandoz (Going further with multilingwis, a web based search engine for exploration of word-aligned parallel and multiparallel corpora)

Master's thesis talk: Xi Rao (Automatic Labeling of Articles in International Investment Agreements Using Semi-Supervised Learning and Word Embeddings)

3 October

Master's thesis talk: Parijat Ghoshal (Topic Modeling and Visualisation of Diachronic Trends in Biomedical Academic Articles)

Project talk: Dina Vishnyakova (author name disambiguation in PubMed)

17 October

Master's thesis talk: Yvonne Gwerder (Named Entity Recognition in Digitized Historical Texts)


31 October

PhD project talk: Mathias Müller (Document-level context in deep recurrent neural networks)


14 November

Project talk: Tilia Ellendorff (PsyMine, a final balance)

PhD project talk: Lenz Furrer (Advances in biomedical entity annotation)

28 November

Guest talk: Don Tuggener (ZHAW): Word embeddings and weighted word overlap: A practical ensemble approach to Question Matching in a dialogue simulator

Guest talk: Jan Deriu (ZHAW): tbc

12 December

Guest talk: Prof. Ce Zhang (ETH Zürich): tbc


19 September

Dominique Sandoz:

Title: Going further with multilingwis, a web based search engine for exploration of word-aligned parallel and multiparallel corpora

Abstract: Searching and exploring multi-word units in large multiparallel corpora became possible in 2016 with a web-based search engine called 'multilingwis'. Owing to its success and to improvements in the technical components it was built from, the application was rebuilt as multilingwis2, which was released in early 2017 as a SPARCLING project. The new version includes state-of-the-art search techniques, new features and an easy-to-use web interface. This presentation will provide insight into the whys, whats and hows of this remake.

Xi Rao:

Title: Automatic Labeling of Articles in International Investment Agreements

Abstract: International investment agreements (IIAs) are international commitments amongst contracting parties to protect and promote investment. Although each treaty has a distinctive structure regarding the placement and organization of information, IIAs as instruments of international law share underlying textual and legal structures. Treaty articles are important components of IIAs: some articles have been assigned titles, while others remain untitled. This work uses the collection of English-language IIA texts created by the project Diffusion of International Law under the Swiss Network for International Studies (SNIS), referred to as the SNIS corpus. Automatically assigning titles to untitled articles is crucial for analyzing treaty structure and content thoroughly; however, similar texts have been given numerous different titles owing to variability in factors such as negotiating partners, languages, and traditions. This variability leads to more than 5,000 distinct surface forms of titles after normalization. Hence, we first used k-means clustering with document embeddings as features to assign the 34,524 titled articles to ten classes. The ten topics were then used as labels in multiclass classification tasks in which titles were assigned to 10,074 untitled articles. Expert annotations of 100 untitled articles served as the gold standard in the evaluation. We found that k-means clustering with retrained word embeddings tailored to the SNIS corpus brought about an increase of 30% in accuracy compared with a simple convolutional neural network (CNN) classifier, which scored the highest amongst all supervised classifiers with an accuracy of 46%.
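The two-step idea in this abstract (cluster the titled articles, then use the cluster ids as labels for untitled ones) can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration: the 2-dimensional "embeddings" and the example titles are invented, k is 2 rather than ten, and a nearest-centroid rule stands in for the classifiers evaluated in the thesis.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def nearest(vector, centroids):
    """Index of the centroid closest to the given vector."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(vector, centroids[i]))


def kmeans(vectors, k, iterations=20):
    """Plain k-means with a deterministic farthest-point initialisation."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centroids)))
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            clusters[nearest(v, centroids)].append(v)
        # Keep the old centroid if a cluster happens to be empty.
        centroids = [mean_vector(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids


# Hypothetical document embeddings of titled articles (toy values).
titled = {
    "Expropriation": [0.9, 0.1],
    "Nationalisation": [0.8, 0.2],
    "Dispute Settlement": [0.1, 0.9],
    "Arbitration": [0.2, 0.8],
}
vectors = list(titled.values())

# Step 1: cluster the titled articles into k classes.
centroids = kmeans(vectors, k=2)
titled_labels = [nearest(v, centroids) for v in vectors]

# Step 2: label an untitled article with the class of its nearest centroid.
untitled_label = nearest([0.7, 0.3], centroids)
```

In this toy setting the two expropriation-like titles end up in one class and the two dispute-related titles in the other, and the untitled vector is assigned to the former.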

3 October

Parijat Ghoshal: Topic Modeling and Visualisation of Diachronic Trends in Biomedical Academic Articles

Dina Vishnyakova: author name disambiguation in PubMed

17 October

Yvonne Gwerder:

Title: Named Entity Recognition in Digitized Historical Texts

Abstract: We present an approach to automatically recognizing Named Entities in legal documents written in late medieval and early modern variants of German and French. We describe the transformation of the digitized texts into a structured XML format and show how resources for tokenization and OCR processing can be adapted and applied to this end. Named Entities are extracted from indices of place and person names and subsequently detected in the texts via approximate string matching techniques. The resulting pre-annotated texts are then additionally tagged with a ready-made Named Entity Recognition tool intended for the modern language. Finally, by training and testing our own machine learning models, we aim to illustrate the main possibilities and limitations characteristic of historical data.
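As a rough illustration of detecting index names in historical spelling via approximate string matching, the following sketch uses Python's standard `difflib` module. The abstract does not say which matching technique was used, so `difflib` is only a stand-in; the name index, the sample sentence and the cutoff of 0.8 are invented for this example.

```python
import difflib

# Hypothetical index of place names: lowercased lookup form -> canonical form.
NAME_INDEX = {"zürich": "Zürich", "bern": "Bern", "basel": "Basel"}


def find_names(text, index=NAME_INDEX, cutoff=0.8):
    """Return (token, canonical name) pairs for tokens that approximately
    match an entry of the name index."""
    hits = []
    for token in text.split():
        # get_close_matches ranks candidates by SequenceMatcher similarity
        # and drops everything below the cutoff.
        candidates = difflib.get_close_matches(
            token.lower(), list(index), n=1, cutoff=cutoff)
        if candidates:
            hits.append((token, index[candidates[0]]))
    return hits


# Invented sentence with variant spellings of two indexed names.
matches = find_names("in der stat Zurich und ze Berne")
```

A lower cutoff finds more spelling variants but also more false positives on short function words, so the threshold would need tuning on real data.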


31 October

Mathias Müller:

Title: Document-level context in deep recurrent neural networks

Abstract: Preliminary experiments have shown that neural machine translation systems benefit from document-level context. Fortunately, the most widely used approach to neural translation, the encoder-decoder model, is easily extended with more context. Any additional information (such as the previous source sentence) can be transformed into a hidden representation by an additional encoder. During decoding, access to the hidden states of this additional encoder is regulated by an additional attention mechanism. In this talk I will explain how to model this problem with a conditional GRU network with deep transition, an arbitrary number of layers and an arbitrary number of additional encoder and attention networks.

14 November

Tilia Ellendorff: PsyMine, a final balance

Lenz Furrer: Advances in biomedical entity annotation

28 November

Don Tuggener (ZHAW):

Title: Behavioural Simulator for Professional Training based on Natural Language Interaction


Abstract: We present results of experiments for Question Matching that combine two features, i.e. word embeddings and weighted word overlap. The aim is to match a given input question to a pre-defined set of available questions in a dialogue simulator for professional training. We find that both features have their strengths and weaknesses and that combining them with an ensemble SVM improves their individual performance significantly.

The experiments were conducted as part of the CTI project "Behavioural Simulator for Professional Training based on Natural Language Interaction", which aims to replace the current keyboard and mouse input in a dialogue simulator with free speech input. One key challenge that arises in such a setting, besides handling speech input, is determining whether a given input matches none of the available questions and what measures to take in such a situation. We present an approach that tries to detect such non-matching inputs.

Link to the CTI project: http://www.cl.uzh.ch/en/research/opinionmining/lifelike.html
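The combination of the two features, together with the rejection of non-matching inputs, can be sketched roughly as follows. This is not the system from the talk: the questions, the 3-dimensional toy embeddings, the IDF-style weighting, the fixed 0.5/0.5 averaging (used here in place of the ensemble SVM) and the threshold of 0.5 are all assumptions made for illustration.

```python
import math

# Hypothetical set of questions available in the simulator.
QUESTIONS = [
    "where does it hurt",
    "how long have you had the pain",
    "do you take any medication",
]

# Toy word embeddings; real embeddings would have hundreds of dimensions.
TOY_EMBEDDINGS = {
    "hurt": [1.0, 0.0, 0.0],
    "pain": [0.9, 0.1, 0.0],
    "medication": [0.0, 1.0, 0.0],
    "pills": [0.0, 0.9, 0.1],
    "weather": [0.0, 0.0, 1.0],
}


def idf(word):
    """Smoothed inverse document frequency over the question set."""
    df = sum(word in q.split() for q in QUESTIONS)
    return math.log((len(QUESTIONS) + 1) / (df + 1)) + 1.0


def weighted_overlap(a, b):
    """IDF-weighted Jaccard overlap between the word sets of two texts."""
    words_a, words_b = set(a.split()), set(b.split())
    shared = sum(idf(w) for w in words_a & words_b)
    total = sum(idf(w) for w in words_a | words_b)
    return shared / total if total else 0.0


def sentence_vector(text):
    """Average embedding of the known words, or None if there are none."""
    vectors = [TOY_EMBEDDINGS[w] for w in text.split() if w in TOY_EMBEDDINGS]
    if not vectors:
        return None
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def embedding_similarity(a, b):
    """Cosine similarity of the averaged word embeddings (0.0 if undefined)."""
    va, vb = sentence_vector(a), sentence_vector(b)
    if va is None or vb is None:
        return 0.0
    dot = sum(x * y for x, y in zip(va, vb))
    norm = (math.sqrt(sum(x * x for x in va))
            * math.sqrt(sum(y * y for y in vb)))
    return dot / norm if norm else 0.0


def match_question(user_input, threshold=0.5):
    """Return the best-matching question, or None for a non-matching input."""
    text = user_input.lower()
    scored = [(0.5 * weighted_overlap(text, q)
               + 0.5 * embedding_similarity(text, q), q)
              for q in QUESTIONS]
    best_score, best_question = max(scored)
    return best_question if best_score >= threshold else None
```

With these toy values, "Are you on any pills" shares few words with the medication question and is matched mainly through the embedding score, while "What is the weather like" stays below the threshold for every question and is rejected.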

Guest talk:

Jan Deriu (ZHAW): tbc

12 December

Prof. Ce Zhang (ETH Zürich): tbc