Text Technology Colloquium HS 2021

Colloquium HS 2021: reports on current research at the institute, Bachelor's and Master's theses, programming projects, guest lectures

Time & venue: every two weeks on Tuesdays, 10:15–12:00, room BIN-2.A.01 (map)

Organizer: Dr. Tilia Ellendorff

Colloquium Schedule

Date        Speaker & Topic

21.09.21    CANCELLED
05.10.21    CANCELLED
19.10.21    Tannon Kew: Getting More with Less: Improving Specificity in Hospitality Review Response Generation through Data-Driven Data Curation
            Janis Goldzycher (short presentation): Adjusting the Word Embedding Association Test for Austrian, German and Swiss Demographics
02.11.21    Eva Vanmassenhove [ONLINE SESSION]: TBD
16.11.21    Anastassia Shaitarova: TBD
            Olga Sozinova: How do you tokenize? Humans vs. algorithms
30.11.21    Jason Armitage: TBD
            Marek Kostrzewa: TBD
14.12.21    Patrick Haller: TBD
            Noemi Aepli: TBD

Abstracts


19.10.2021 

Getting More with Less: Improving Specificity in Hospitality Review Response Generation through Data-Driven Data Curation
Neural network-based approaches to conditional text generation have been shown to deliver highly fluent and natural-looking texts across a wide variety of tasks. However, in open-ended tasks such as response generation or dialogue modelling, models tend to learn a strong, undesirable bias towards overly generic outputs. This can be at least partially attributed to characteristics of the underlying training data, suggesting that finding ways to improve the quality of the training data at scale is crucial. In this talk, I will present results from experiments aimed at improving thematic specificity in review response generation for the hospitality domain. These experiments focus primarily on data-driven approaches that quantify ‘genericness’ in the training corpus and subsequently filter out undesirable and uninformative examples. Using both automatic metrics and human evaluation, we show that such targeted data filtering, despite reducing the training data to 40% of its original size, considerably improves the specificity of the generated responses.
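The abstract does not spell out the genericness metric, but the filtering idea can be illustrated with a simple proxy: score each response by how much of it consists of the corpus's most frequent tokens, then keep only the least generic fraction. The function names, the `keep_ratio` and `vocab_top` parameters, and the scoring heuristic below are all illustrative assumptions, not the talk's actual method:

```python
# Hedged sketch: filtering a response corpus by a "genericness" score.
# Proxy metric: fraction of a response's tokens that belong to the
# corpus-wide most frequent tokens (purely illustrative).
from collections import Counter

def genericness(response, common_tokens):
    """Share of tokens drawn from the high-frequency vocabulary."""
    toks = response.lower().split()
    if not toks:
        return 1.0  # treat empty responses as maximally generic
    return sum(t in common_tokens for t in toks) / len(toks)

def filter_generic(responses, keep_ratio=0.4, vocab_top=50):
    """Keep the keep_ratio least generic responses (e.g. 40%, as in the talk)."""
    counts = Counter(t for r in responses for t in r.lower().split())
    common = {t for t, _ in counts.most_common(vocab_top)}
    scored = sorted(responses, key=lambda r: genericness(r, common))
    return scored[: max(1, int(len(scored) * keep_ratio))]
```

With such a score, "thank you for your kind review" ranks as far more generic than a response mentioning concrete hotel details, so aggressive filtering preferentially keeps the informative examples.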


Adjusting the Word Embedding Association Test for Austrian, German and Swiss Demographics

Introducing the Word Embedding Association Test (WEAT), Caliskan et al. (2017) showed that English word embeddings often contain human-like biases, such as racial and gender bias. Lauscher and Glavaš (2019) extended these tests to other languages, including German. However, these bias tests are still tailored to American demographics. I will argue that, for this reason, the tests provide only an imprecise measure of the biases in question, with multiple distorting factors. I will then present an experimental setup that takes into account the specific demographic circumstances of Austria, Germany and Switzerland, along with embedding tests for racial bias, anti-immigrant bias, gender bias and antisemitic bias. The results reveal more, and stronger, biases than previously found for German Wikipedia-based embeddings, and fewer, weaker biases for embeddings based on other corpora.
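The WEAT statistic itself (Caliskan et al., 2017) has a compact form: the difference between the mean associations of two target word sets with two attribute word sets, normalised by a standard deviation. A minimal plain-Python sketch, where the toy 2-d vectors used for testing stand in for real word embeddings:

```python
# Sketch of the WEAT effect size. X, Y are target word vectors,
# A, B are attribute word vectors; all are sequences of equal-length
# numeric vectors (real embeddings would be e.g. 300-dimensional).
import math

def cos(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def s(w, A, B):
    """Association of word vector w with attribute set A vs. B."""
    return (sum(cos(w, a) for a in A) / len(A)
            - sum(cos(w, b) for b in B) / len(B))

def weat_effect_size(X, Y, A, B):
    """Difference of mean associations over the sample std. dev. of X ∪ Y."""
    sx = [s(x, A, B) for x in X]
    sy = [s(y, A, B) for y in Y]
    pooled = sx + sy
    mu = sum(pooled) / len(pooled)
    std = math.sqrt(sum((v - mu) ** 2 for v in pooled) / (len(pooled) - 1))
    return (sum(sx) / len(sx) - sum(sy) / len(sy)) / std
```

Adapting the test to Austrian, German and Swiss demographics then amounts to choosing target and attribute word lists appropriate to those populations, not to changing this statistic.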

02.11.2021

16.11.2021

How do you tokenize? Humans vs. algorithms
Words can be split into segments by human annotators or by algorithms. The resulting segmentations vary, due both to differing linguistic intuitions among humans and to the particular designs of the algorithms. The main difference between these methods lies in the plausibility of the resulting segments. But how is the plausibility of a segmentation determined? There are several ways to make comparisons, such as measuring the overlap between segment sets or comparing their sizes. However, both of these methods are indirect and reveal little about the decisions taken while segmenting. In this study, we provide a new method for assessing the properties of segmentations based on an analysis of subword lengths. Our experiments on English, Finnish and Turkish data show that BPE finds more regularities in longer words, Morfessor tends to identify bigger, less regular chunks, and human annotators optimize segments in longer words so that they are neither too short nor too long.
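The kind of subword-length analysis described above can be sketched in a few lines: given a segmentation (one list of subwords per word), collect the subword lengths and summarise their distribution. The function name and the summary statistics chosen are illustrative assumptions; the segmentations in the test are invented, not from the study's data:

```python
# Hedged sketch: profiling a segmentation by its subword lengths,
# so that e.g. BPE, Morfessor and human segmentations can be compared.
from statistics import mean, pstdev

def length_profile(segmentations):
    """Summarise subword lengths for a list of per-word segmentations."""
    lengths = [len(seg) for word in segmentations for seg in word]
    return {
        "mean": mean(lengths),      # average subword length
        "stdev": pstdev(lengths),   # spread: low = regular-sized chunks
        "min": min(lengths),
        "max": max(lengths),
    }
```

Comparing such profiles across segmenters makes the talk's observations quantifiable: a segmenter producing "bigger, less regular chunks" shows a higher mean and standard deviation than one producing uniform short pieces.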

30.11.2021

14.12.2021