Colloquium Schedule, Fall Semester 2019

Colloquium, Fall Semester 2019: reports on current research at the institute, Bachelor's and Master's theses, programming projects, guest talks

Time & place: every two weeks on Tuesdays, 10:15 to 12:00, BIN-2.A.10

Organizer: Simon Clematide


Speakers & Topics


Lenz Furrer (PhD CL UZH): Sequence Tagging for Concept Recognition 

Peter Makarov (PhD CL UZH):  Semi-supervised Historical Text Normalization

Mon/Tue, 7/8 October, in KOL G 217

Various candidate talks for the new professorship in Digital Linguistics at the Institute of Computational Linguistics.


Janis Goldzycher (BA CL UZH): Taxonomy Learning without Labeled Data: Building on TaxoGen

Ximena Gutierrez-Vasques (PhD UNAM Mexico): Measuring Language Complexity


Tatyana Ruzsics (PhD UZH): Multilevel Text Normalization with Sequence-to-Sequence Networks and Multisource Learning

Nikola Nikolov (PhD ETH/UZH): Abstractive Document Summarization without Parallel Data


Felix Morger (Språkbanken, University of Gothenburg, Sweden): A Review of Machine Learning Interpretability in Natural Language Processing

Jan Deriu (PhD UZH/ZHAW): A Benchmark for Lifelong Machine Learning for Question Answering over Structured Data

Tue, 26 November

Lonneke van der Plas (University of Malta): Analysing compounds and predicting their emergence over time

Thu, 28 November, 17:15

Raphael Winkelmann/Christoph Draxler: BAS Tools for the Processing of Spoken Language (ifi colloquium)

Fabio Rinaldi (UZH): 15 Years of Biomedical Text Mining


Lenz Furrer:  Sequence Tagging for Concept Recognition 

As our submission to the CRAFT shared task 2019, we present two neural approaches to concept recognition. Both systems perform joint named entity recognition (NER) and normalization (NEN), and both model the task as a sequence labeling problem. Our first system is a BiLSTM network with two separate outputs for NER and NEN, trained from scratch, whereas the second system is an instance of BioBERT fine-tuned on the concept-recognition task. We exploit two different strategies for extending concept coverage: ontology pretraining and back-off with a dictionary lookup.
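To make the sequence-labeling view concrete, the following is a minimal sketch, not the system presented in the talk. It assumes a tagger has already produced one NER label (B/I/O) and one optional concept ID per token, and it shows how such labels could be decoded into mention/concept pairs with a dictionary back-off. The function name `decode_concepts`, the tag scheme, and the identifiers `ID:1` and `ID:2` are hypothetical placeholders.

```python
from typing import Dict, List, Optional, Tuple


def decode_concepts(
    tokens: List[str],
    ner_tags: List[str],
    nen_tags: List[Optional[str]],
    dictionary: Dict[str, str],
) -> List[Tuple[str, Optional[str]]]:
    """Turn per-token NER (B/I/O) and NEN (concept ID or None) labels
    into (mention, concept_id) pairs. Hypothetical helper for illustration."""
    results: List[Tuple[str, Optional[str]]] = []
    span: List[str] = []                 # tokens of the mention being built
    span_ids: List[Optional[str]] = []   # predicted IDs for those tokens

    def flush() -> None:
        # Close the current mention, if any, and store it.
        if not span:
            return
        mention = " ".join(span)
        # Use the first predicted concept ID inside the span, if there is one.
        concept = next((c for c in span_ids if c is not None), None)
        if concept is None:
            # Back-off: look the lowercased mention up in a dictionary.
            concept = dictionary.get(mention.lower())
        results.append((mention, concept))
        span.clear()
        span_ids.clear()

    for token, ner, nen in zip(tokens, ner_tags, nen_tags):
        if ner == "B":
            flush()                      # a new mention starts here
            span.append(token)
            span_ids.append(nen)
        elif ner == "I" and span:
            span.append(token)           # continue the open mention
            span_ids.append(nen)
        else:
            flush()                      # "O" or a stray "I" ends any open mention
    flush()
    return results


tokens = ["The", "red", "blood", "cell", "count", "and", "hemoglobin"]
ner_tags = ["O", "B", "I", "I", "O", "O", "B"]
nen_tags = [None, "ID:1", "ID:1", "ID:1", None, None, None]
dictionary = {"hemoglobin": "ID:2"}      # made-up back-off entry

pairs = decode_concepts(tokens, ner_tags, nen_tags, dictionary)
print(pairs)
```

In this toy input the second mention receives no concept ID from the tagger, so its ID comes from the dictionary lookup, which illustrates how a back-off can extend concept coverage.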

Peter Makarov: Semi-supervised Historical Text Normalization

Text normalization is the task of mapping non-standard texts (informal, dialectal, historical) into modern standard language. In this talk, I report on ongoing work on semi-supervised training of generative neural models for the normalization of historical data. In contrast to most prior work, which treats this problem as character-level transduction of isolated words, we use sentential context to obtain a training signal. This leads to considerably more data-efficient training.

Janis Goldzycher: Taxonomy Learning without Labeled Data: Building on TaxoGen

Taxonomy learning is of great interest for automated knowledge acquisition, since taxonomies are not only a popular way to represent knowledge but also enable deductive reasoning and constitute an important step toward ontology learning. Taxonomies are made of hypernym relations. Most current methods need labeled data to extract hypernym relations. TaxoGen is a method for unsupervised learning of topical taxonomies using distributional semantics and a recursive, adaptive clustering process. I will talk about reimplementing TaxoGen, testing it with different embedding and clustering techniques, and introducing a new label score.

Ximena Gutierrez-Vasques: Measuring Language Complexity

Conceptualizing and quantifying linguistic complexity is not an easy task: many quantitative and qualitative dimensions must be taken into account. In particular, the languages of the world have different word production processes. Therefore, the amount of semantic and grammatical information encoded at the word level may vary significantly from language to language. It is important to quantify this morphological richness of languages and how it varies depending on their linguistic typology. This presentation summarizes some of the approaches presented at the Interactive Workshop on Measuring Language Complexity (IWMLC 2019).

Ximena Gutierrez-Vasques holds a PhD in computational linguistics from the National Autonomous University of Mexico (UNAM). She is currently on a postdoctoral stay at the University of Zurich (URPP Language and Space). Her research interests include NLP for low-resource languages, quantitative linguistics, and machine translation.

Nikola Nikolov: Abstractive Document Summarization without Parallel Data

Abstractive summarization typically relies on large collections of paired articles and summaries. However, in many cases, parallel data is scarce and costly to obtain. We develop an abstractive summarization system that relies only on large collections of example summaries and non-matching articles. Our approach consists of an unsupervised sentence extractor that selects salient sentences to include in the final summary and a sentence abstractor, trained on pseudo-parallel and synthetic data, that paraphrases each of the extracted sentences. We evaluate our method extensively on two tasks: the CNN/DailyMail benchmark, on which we compare our approach to fully supervised baselines, and the novel task of automatically generating a press release from a scientific journal article, which is well suited to our system. We show promising performance on both tasks, without relying on any article-summary pairs.

Tatyana Ruzsics: Multilevel Text Normalization with Sequence-to-Sequence Networks and Multisource Learning

We define multilevel text normalization as sequence-to-sequence processing that transforms naturally noisy text into a sequence of normalized units of meaning (morphemes) in three steps: 1) writing normalization, 2) lemmatization, 3) canonical segmentation. These steps are traditionally considered separate NLP tasks, with diverse solutions, evaluation schemes and data sources. We exploit the fact that all these tasks involve sub-word sequence-to-sequence transformation to propose a systematic solution for all of them using neural encoder-decoder technology. The specific challenge that we tackle is integrating the traditional know-how on separate tasks into the neural sequence-to-sequence framework to improve the state of the art. We address this challenge by enriching the general framework with mechanisms that allow processing the information on multiple levels of text organization (characters, morphemes, words, sentences) in combination with structural information (multilevel language model, part-of-speech) and heterogeneous sources (text, dictionaries). We show that our solution consistently improves on the current methods in all three steps. In addition, we analyze the performance of our system to show the specific contribution of the integrating components to the overall improvement. 

Felix Morger: A Review of Machine Learning Interpretability in Natural Language Processing

Machine learning interpretability is concerned with opening the black box of advanced machine learning systems. In natural language processing, this has come to mean understanding what linguistic competence machine learning systems acquire. In this presentation, I'd like to present natural language processing in terms of machine learning interpretability: how it relates to the field in general, how the discourse around it has evolved, what methodologies have become prominent, and what challenges remain.


Lonneke van der Plas: Analysing compounds and predicting their emergence over time

Compounding can be defined as the formation of a new lexeme by adjoining two or more lexemes. The compound word formation process is productive and, as a consequence, compounds are a common word type, but many occur with very low token counts. This creates challenges for NLP tools, and it raises questions about the processes that underlie the generation of novel compounds over time.

Lonneke van der Plas is a senior lecturer in Human Language Technology at the University of Malta. Before that, she was a junior professor at the Institute for Natural Language Processing (IMS), University of Stuttgart, where she led a research group in the framework of the SFB collaborative research center 732. She did a post-doc at the University of Geneva and earned her PhD from the University of Groningen.