
Institut für Computerlinguistik

Text Technology/Digital Linguistics colloquium FS 2024

Time & Location: every 2-3 weeks on Tuesdays from 10:15 am to 12:00 pm in room BIN-2-A.10.
Please note that the room has changed from the previous semester.

Online participation via the MS Teams Team CL Colloquium is also possible.

Responsible: Marius Huber

Colloquium Schedule

20 Feb 2024

Andrianos Michail: Robustness of Multilingual Embedding Models in Historical News

Embedding models are fundamental components of semantic search engines and other Natural Language Processing (NLP) systems, as they provide us with powerful vectorized representations of text ("embeddings"). But how can we judge whether one embedding model is better than another, or identify avenues for their improvement? While for English and even English-X language pairs the situation appears mostly clear due to the availability of large-scale benchmarks, we still don't know much about the robustness of embeddings to the extremely heterogeneous texts we encounter "in the wild": texts in different languages, from different time periods, containing transcription errors and/or code-mixing, to name just a few common phenomena. To test such an open setting, we plan to build a testbed for embedding models from the IMPRESSO corpus, which contains millions of digitized, multilingual, and temporally and spatially distributed news texts from more than two centuries. Are current embedding models up to the challenge?
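As a minimal illustration of the role embeddings play in semantic search (not the IMPRESSO testbed itself), documents can be ranked by cosine similarity to a query vector; the tiny vectors below are hypothetical stand-ins for a real model's output:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings; a real system would obtain these from an embedding model.
query = np.array([0.9, 0.1, 0.3])
docs = {
    "doc_modern": np.array([0.8, 0.2, 0.4]),
    "doc_historical": np.array([0.1, 0.9, 0.2]),
}
ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
print(ranked)  # documents ordered by similarity to the query
```

Robustness testing then amounts to checking whether such rankings survive when the documents are noisy, historical, or in another language.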

Zifan Jiang: Recent Developments in Sign Language Processing: Towards Realistic Sign Language Machine Translation

Applying NLP tasks to sign languages is challenging, primarily due to data scarcity and the absence of a well-established methodology. While it is still unclear whether an end-to-end or a pipeline approach will take the lead, we see more basic problems to solve in sign language processing, including segmentation, alignment, and representation. On the one hand, we are working on releasing more and better-quality publicly available data. On the other hand, we draw inspiration from recent advances in LLMs and deep pretrained models to guide our research in tackling the above-mentioned basic problems.

5 Mar 2024

Bryan Eikema: Why Are Modes of Natural Language Generation Models Inadequate?

The highest-probability sequences of most neural language generation models tend to be degenerate in some way, a problem known as the inadequacy of the mode. While many approaches exist to tackle particular aspects of the problem, such as dealing with too-short sequences or excessive repetition, explanations of why it occurs in the first place are rarer and do not agree with each other. In this talk, we will discuss the current attempts at explaining this phenomenon and why we believe they do not paint a full picture. We will also provide an alternative hypothesis that links the inadequacy of the mode to the desire for our models to generalise to previously unseen contexts.
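The core puzzle can be sketched with a toy distribution over sequences: when probability mass is spread over many distinct fluent outputs, a degenerate output (e.g. the empty sequence) can still be the single most probable one. The numbers here are illustrative, not taken from any real model:

```python
# Toy sequence distribution: 100 distinct fluent sentences share 95% of the
# probability mass, while the empty sequence alone holds 5%.
probs = {"": 0.05}
for i in range(100):
    probs[f"sentence_{i}"] = 0.0095  # 100 * 0.0095 = 0.95

# The mode (argmax) is the empty sequence, even though sampling would
# almost always yield a fluent sentence.
mode = max(probs, key=probs.get)
print(repr(mode))
```

This is why the mode can be unrepresentative of typical samples: being the single most probable sequence says little when no individual sequence carries much mass.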

Mario Giulianelli: Measuring utterance uncertainty and predictability via simulation of contextually plausible alternatives

Viewing linguistic communication as information transmission between cognitive agents, successful language production can be understood as an act of reducing the uncertainty over future states that a comprehender may be anticipating. When an individual utters a sentence, they narrow down the comprehender's expectations, and they do so by an amount proportional to the contextual predictability of the utterance. I will discuss two recent studies that demonstrate how we can empirically estimate utterance uncertainty and predictability by simulating potential upcoming linguistic contributions using neural text generators. The first study introduces a statistical framework to quantify utterance uncertainty as production variability, and evaluates the alignment of language generators to the production variability observed in humans. We find that different types of production tasks exhibit distinct levels of lexical, syntactic, and semantic variability, and neural text generators generally achieve satisfactory calibration of uncertainty. In the second study, we use the previously introduced statistical framework to define a novel measure of utterance predictability, which we term information value. Information value quantifies predictability by measuring the distance from contextually plausible alternatives and offers advantages over traditional measures by disentangling various dimensions of uncertainty and being less influenced by surface form competition. Psycholinguistic experiments demonstrate that information value is a superior predictor of utterance acceptability in written and spoken dialogue compared to token-level surprisal aggregates, and that it complements surprisal in predicting eye-tracked reading times.
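The intuition behind information value can be sketched in a few lines: sample contextually plausible alternative utterances from a generator, represent them as vectors, and measure how far the observed utterance lies from those alternatives. The vector representations and the distance (Euclidean, averaged) below are simplified stand-ins for the components of the actual framework:

```python
import numpy as np

def information_value(observed: np.ndarray, alternatives: list) -> float:
    """Mean distance of the observed utterance's representation from the
    representations of contextually plausible alternatives (toy version)."""
    return float(np.mean([np.linalg.norm(observed - alt) for alt in alternatives]))

# Toy vectors; a real setup would embed outputs sampled from a text generator.
observed = np.array([1.0, 0.0])
alternatives = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(information_value(observed, alternatives))  # low value → predictable utterance
```

An utterance identical to all sampled alternatives has information value zero (fully predictable); the further it deviates from what comprehenders would anticipate, the higher the value.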

19 Mar 2024

Janis Goldzycher: Improving Adversarial Data Collection by Supporting Annotators: Lessons from GAHD, a German Hate Speech Dataset

Hate speech detection models are only as good as the data they are trained on. Datasets sourced from social media suffer from systematic gaps and biases, leading to unreliable models with simplistic decision boundaries. Adversarial datasets, collected by exploiting model weaknesses, promise to fix this problem. However, adversarial data collection can be slow and costly, and individual annotators have limited creativity. In this paper, we introduce GAHD, a new German Adversarial Hate speech Dataset comprising ca. 11k examples. During data collection, we explore new strategies for supporting annotators so that they create more diverse adversarial examples more efficiently, and we provide a manual analysis of annotator disagreements for each strategy. Our experiments show that the resulting dataset is challenging even for state-of-the-art hate speech detection models, and that training on GAHD clearly improves model robustness.

Juri Opitz: Metrics of Graph-Based Meaning Representations and their Interesting Applications

"Who does what to whom?" The goal of a graph-based meaning representation (in short: MR) is to represent the meaning of a text in a structured format. With an MR, we can explicate the meaning of a text and describe occurring events and entities, as well as their semantic relations. A metric of MRs, then, measures a distance (or similarity) between MRs. A main hypothesis of my PhD thesis was that such a meaning-focused similarity measurement can be useful for several important AI tasks, for instance, testing the capability of systems to produce meaningful output (system evaluation), or searching for similar texts (information retrieval). Moreover, due to the natural explicitness of MRs, I hypothesized that MR metrics could provide us with valuable explainability of their similarity measurement. Indeed, if texts reside in a space where their meaning has been isolated and structured, we might directly see in which aspects two texts are actually similar (or dissimilar). In this talk, I'll give a brief overview of some findings of my thesis, showing the usefulness of MR metrics for important AI applications, including explainable NLG evaluation and semantic search.
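A much-simplified sketch of an MR metric: treat each meaning graph as a set of (source, relation, target) triples and score their overlap with F1. Smatch-style metrics additionally align graph variables before matching; this toy version skips alignment, and the example triples are invented:

```python
def triple_f1(pred: set, gold: set) -> float:
    """F1 over (source, relation, target) triples of two meaning graphs.
    A simplified view of Smatch-style MR metrics (no variable alignment)."""
    if not pred or not gold:
        return 0.0
    overlap = len(pred & gold)
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# "The boy gives the girl a ball" vs. a graph missing the recipient.
g_gold = {("give", "ARG0", "boy"), ("give", "ARG1", "ball"), ("give", "ARG2", "girl")}
g_pred = {("give", "ARG0", "boy"), ("give", "ARG1", "ball")}
print(triple_f1(g_pred, g_gold))
```

The matched and unmatched triples directly show *which* aspects of meaning two texts share, which is the explainability argument made in the talk.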

16 Apr 2024

Sina Ahmadi: Multilingual Tokenization Parity

Language models have transitioned into ubiquitous commercial web APIs, with recent research highlighting their proficiency in multilingual applications. These APIs operate on a token-based pricing system, where the definition of a token varies depending on the specific model and training data, resulting in varying cost efficiencies across languages. Previous studies have identified several drawbacks of tokenization in multilingual settings, including increased costs, latency, and limitations in contextual learning. This talk discusses an ongoing project aimed at identifying critical factors influencing tokenization parity across languages.
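A minimal sketch of what tokenization (dis)parity means in practice: for parallel sentences, compare how many tokens each language consumes under a given tokenizer, since API cost scales with token count. The whitespace and byte-level tokenizers below are crude stand-ins for a real subword tokenizer such as BPE, and the sentence pair is invented:

```python
def token_premium(text: str, reference: str, tokenize=lambda s: s.split()) -> float:
    """Ratio of token counts for parallel texts under a tokenizer;
    a value > 1 means `text` costs more tokens than `reference`."""
    return len(tokenize(text)) / len(tokenize(reference))

# Parallel English/German sentences (illustrative example).
en = "the cat sat on the mat"
de = "die Katze sass auf der Matte"

byte_tok = lambda s: list(s.encode("utf-8"))  # byte-level proxy tokenizer
print(token_premium(de, en, byte_tok))  # > 1: German "costs" more here
```

With a byte-level proxy the German sentence already carries a premium over the English one; with real subword vocabularies, which are typically skewed towards high-resource languages, such premiums can be far larger.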

Masoumeh Chapariniya and Aref Farhadi Pour: Comparative Analysis of Modality Fusion Approaches for Audio-Visual Identity Identification and Verification

Multimodal learning involves integrating and combining information from various modalities or sources to enhance learning and comprehension. The fusion of data from different modalities can improve performance in identity recognition scenarios. In our paper, we compare three modality fusion approaches in identity identification and verification scenarios by processing two modalities: voice and face. We explore sensor fusion, feature fusion, and score fusion approaches. Our evaluation, conducted on the VoxCeleb2 dataset using k-fold cross-validation, shows that the feature fusion strategy achieves the highest performance, with an accuracy of 98.33% for identity identification and an equal error rate (EER) of 0.62% for verification.
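To make the fusion levels concrete, here is a toy contrast between feature fusion (combine modality embeddings before a single classifier) and score fusion (combine per-modality classifier scores afterwards). The vector dimensions and the equal weighting are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

def feature_fusion(voice_emb: np.ndarray, face_emb: np.ndarray) -> np.ndarray:
    """Feature-level fusion: concatenate modality embeddings into one
    vector that a single downstream classifier consumes."""
    return np.concatenate([voice_emb, face_emb])

def score_fusion(voice_score: float, face_score: float, w: float = 0.5) -> float:
    """Score-level fusion: weighted average of the scores produced by
    separate per-modality classifiers."""
    return w * voice_score + (1 - w) * face_score

voice = np.array([0.2, 0.8])        # toy voice embedding
face = np.array([0.6, 0.4, 0.1])    # toy face embedding
fused = feature_fusion(voice, face)
print(fused.shape)                  # one combined feature vector
print(score_fusion(0.9, 0.7))       # one combined decision score
```

Sensor fusion, the third approach compared in the talk, combines the raw signals even earlier, before any embedding is computed.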

30 Apr 2024

Chiara Tschirner: TBA


Pius von Däniken: TBA


14 May 2024

Alessia Battisti: TBA


Iuliia Thorbecke: TBA


28 May 2024

Lena Bolliger: TBA


Ann-Sophie Gnehm: TBA