Text Technology/Digital Linguistics colloquium HS 2025
Time & Location: every 2 weeks on Tuesdays from 10:15 am to 12:00 pm in room BIN-2-A.10.
Online participation via the MS Teams Team CL Colloquium is also possible.
Responsible: Juri Opitz
Colloquium Schedule
| Date | Speaker 1 | Speaker 2 |
|------------|--------------------|--------------------|
| 16.09.2025 | cancelled | |
| 30.09.2025 | Cui Ding | Negar Foroutan |
| 14.10.2025 | Chiara Tschirner | Anna Bondar |
| 28.10.2025 | Patrick Giedermann | Andrianos Michail |
| 11.11.2025 | Jannis Vamvas | Marius Huber |
| 25.11.2025 | Michelle Wastl | Gerard Sant |
| 09.12.2025 | Alexander Hoyle | Longqian Ming |
09 Dec 2025
Alexander Hoyle: Social-Science Oriented Applications of Natural Language Processing
Methods in natural language processing (NLP) have matured to the point where they can address complex real-world problems. However, the process of advancing machine learning and NLP relies on the evaluation of constrained and often artificial tasks that may bear no clearly valid relationship to real-world problems. This disconnect leads to failures in generalization and limits methods’ utility.
In contrast, the social sciences provide a rich problem space, where questions of validity are at the center: what and how should we measure? Here, moving from language data to quantifiable social constructs demands complex reasoning over language. The premise underpinning this talk is that an effective way to advance NLP as a field is to anchor it in the needs of social science. An emphasis on validity helps mitigate NLP's benchmark myopia while also advancing the study of social phenomena. The talk will focus on some recent contributions toward construct conceptualization and construct measurement in text, two key activities in the social sciences.
Longqian Ming: Breathing in Sign Language - Challenges and Plan
Breathing is an unceasing physiological activity in humans and many other animals. While research on spoken language has revealed the subtle adjustments breathing makes for speech fluency, conversation, and even infants' rhythm perception, breathing in sign language has remained largely unexplored for 45 years. Recent work has revealed unique sign language breathing patterns: breathing while signing is quicker, more evenly distributed, and less stable than speech breathing. Rapid signing movements lead to each breathing cycle spanning multiple signs. The respiratory signal recorded by belts appears to fluctuate with movement, yet these fluctuations cannot simply be dismissed as artifacts, since oral and nasal air channels may produce genuine respiratory fluctuations. Furthermore, unlike in speech, where exhalation clearly anchors the speech phase, no linguistic unit consistently aligns with respiratory boundaries in signing. Sign language breathing research therefore faces challenges that speech breathing research does not. This presentation will outline possible solutions and the experimental plan. Despite these difficulties, preliminary data have revealed mouth-produced sounds, opening an intriguing avenue for integrating acoustic cues into sign language recognition technology. An initial annotation scheme has been developed, laying the groundwork for an automatic labeling model based on current sound detection systems.
25 Nov 2025
Michelle Wastl: Recognizing Token-Level Semantic Differences in Crosslingual Related Documents
Recognizing token-level semantic differences is a largely unexplored task, and existing evaluation data has so far relied on synthetic augmentation from related tasks such as interpretable STS (iSTS). In this talk, I present the data acquisition process for SwissGov-RSD, the first human-annotated, naturalistic dataset of token-level semantic differences in crosslingual related documents. I discuss annotation challenges, design decisions, and benchmarking results across a range of state-of-the-art encoder models and LLMs. The findings show that although these models are theoretically capable of handling multiple languages and long input sequences, they perform surprisingly poorly on SwissGov-RSD out of the box as well as when trained on synthetic data. This opens up a wide range of opportunities for future work. I will conclude by describing work in progress on one such opportunity: how encoder models may be improved with a simple contrastive learning objective.
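The abstract does not specify the objective's exact form; as a hedged illustration, a token-level contrastive loss in the common InfoNCE style might look like the following sketch, where all names, shapes, and the temperature are assumptions:

```python
# Hypothetical sketch of an InfoNCE-style contrastive objective over aligned
# token pairs; not the talk's actual objective. Embeddings are random stand-ins.
import torch
import torch.nn.functional as F

def token_contrastive_loss(src_tokens, tgt_tokens, temperature=0.05):
    """src_tokens, tgt_tokens: (n, d) embeddings of n aligned token pairs.
    Aligned pairs are positives; other tokens in the batch act as negatives."""
    src = F.normalize(src_tokens, dim=-1)
    tgt = F.normalize(tgt_tokens, dim=-1)
    logits = src @ tgt.T / temperature        # (n, n) scaled cosine similarities
    labels = torch.arange(src.size(0))        # positives sit on the diagonal
    # symmetric loss: source-to-target and target-to-source
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

# usage with random stand-in embeddings
src = torch.randn(32, 768)
tgt = torch.randn(32, 768)
print(token_contrastive_loss(src, tgt))
```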
Gerard Sant: Tooling and Modality Choices in Sign Language Translation
Sign language processing (SLP) is still far behind spoken-language NLP in terms of tooling, reproducibility, and comparability of results. Many state-of-the-art systems rely on complex, ad-hoc codebases that are hard to adapt or even impossible to reproduce. This talk discusses ongoing work on sign language translation and the role of input modality. I will first outline MultimodalHugs, a lightweight extension of Hugging Face that makes it easier to run and compare experiments with non-text inputs such as pose sequences and video. Building on this, I will present the Modality Matters study, where we use MultimodalHugs to systematically compare pose data, precomputed video features, and end-to-end video as inputs for sign language translation on How2Sign under a shared experimental setup. I will conclude with a very brief outlook on other ongoing projects that reuse the same infrastructure, illustrating how a shared toolkit can lower the barrier to SLP research and enable more robust and fair experimental comparisons.
11 Nov 2025
Jannis Vamvas: Machine Translation for Romansh Language Varieties
Natural language processing for the Romansh language has historically focused on the standardized variety, Rumantsch Grischun, a scope that does not reflect actual language use. My talk presents an ongoing applied science project that aims to extend machine translation to the five Romansh idioms: Sursilvan, Sutsilvan, Surmiran, Puter, and Vallader. An important first step was the collection of parallel training data and reference translations, accomplished in close collaboration with Lia Rumantscha, RTR, and PHGR. A baseline NMT system developed with the new resources achieves state-of-the-art BLEU scores for all six target varieties. With this milestone reached, our focus is now shifting to more fundamental research questions that will require human evaluation.
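As an aside on the metric: corpus-level BLEU of the kind reported here is conventionally computed with the sacrebleu library. The snippet below is only an illustration with invented sentences, not the project's actual evaluation:

```python
# Illustrative only: corpus-level BLEU with sacrebleu, the de facto standard
# for MT evaluation. Sentences are invented placeholders.
import sacrebleu

hyps = ["the house is small", "the weather is nice today"]
# one reference stream, parallel to the hypotheses
refs = [["the house is small", "the weather is good today"]]

bleu = sacrebleu.corpus_bleu(hyps, refs)
print(f"BLEU = {bleu.score:.1f}")
```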
Marius Huber: Topological data analysis for parts-of-speech across languages
In this talk, I will present the SNF project ProPoSaL (Prototypes and Parts-of-Speech across Languages) and some of its results and methods. The goal of the project is to investigate the existence of adjectives as a prototypical PoS category across languages from the perspectives of NLP, neurolinguistics, and theoretical linguistics. In the NLP portion of the project, the idea is to analyze whether relationships among PoS categories in a given language (as hypothesized by theoretical linguistics) manifest themselves in the latent space of a language model. I will first present some results of this analysis for English and Mandarin Chinese, and then turn to the concepts from topological data analysis (TDA) that underlie it. To that end, I will illustrate the overall idea behind TDA as well as the concrete tool that we developed for and used in our analysis.
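To give a flavor of the TDA machinery (a hedged sketch with random stand-in embeddings, not the project's actual tool), persistent homology over a point cloud of token representations can be computed with the ripser.py library:

```python
# Hedged sketch: persistence diagrams over stand-in "latent vectors" for two
# hypothetical PoS categories. A real analysis would use token embeddings
# extracted from a language model.
import numpy as np
from ripser import ripser

rng = np.random.default_rng(0)
adjectives = rng.normal(size=(200, 32))        # stand-in latent vectors
verbs = rng.normal(loc=0.5, size=(200, 32))

for name, cloud in [("ADJ", adjectives), ("VERB", verbs)]:
    dgms = ripser(cloud, maxdim=1)["dgms"]     # H0 and H1 persistence diagrams
    h1 = dgms[1]
    # long-lived H1 features indicate loop-like structure in the point cloud
    persistence = h1[:, 1] - h1[:, 0] if len(h1) else np.array([0.0])
    print(name, "max H1 persistence:", float(persistence.max()))
```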
28 Oct 2025
Patrick Giedermann: ViClaim: A Multilingual Multilabel Dataset for Automatic Claim Detection in Videos
The growing influence of video content as a medium for communication and misinformation underscores the urgent need for effective tools to analyze claims in multilingual and multi-topic settings. Existing efforts in misinformation detection largely focus on written text, leaving a significant gap in addressing the complexity of spoken text in video transcripts. We introduce ViClaim, a dataset of 1,798 annotated video transcripts across three languages (English, German, Spanish) and six topics. Each sentence in the transcripts is labeled with three claim-related categories: fact-check-worthy, fact-non-check-worthy, or opinion. We developed a custom annotation tool to facilitate the highly complex annotation process. Experiments with state-of-the-art multilingual language models demonstrate strong performance in cross-validation (macro F1 up to 0.896) but reveal challenges in generalization to unseen topics, particularly for distinct domains. Our findings highlight the complexity of claim detection in video transcripts. ViClaim offers a robust foundation for advancing misinformation detection in video-based communication, addressing a critical gap in multimodal analysis.
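For reference, the macro F1 reported above is the unweighted mean of the per-class F1 scores, so each of the three categories counts equally regardless of its frequency; a minimal sketch with invented labels:

```python
# Minimal sketch: macro F1 averages per-class F1 with equal class weight,
# which makes it informative for imbalanced label sets. Labels below are
# invented placeholders standing in for the three ViClaim categories.
from sklearn.metrics import f1_score

labels = ["check-worthy", "non-check-worthy", "opinion"]
y_true = ["check-worthy", "opinion", "opinion", "non-check-worthy"]
y_pred = ["check-worthy", "opinion", "non-check-worthy", "non-check-worthy"]

print(f1_score(y_true, y_pred, average="macro", labels=labels))
```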
Andrianos Michail: Embedding Digitized Historical Articles
Large collections of text digitized through imperfect OCR systems require semantic search models that can perform robustly on noisy input. Such collections are highly heterogeneous, exhibiting varying degrees of OCR quality, spelling conventions, and other inconsistencies, all phenomena that are underrepresented in the training data of standard embedding models, with clear ramifications for their generalization. When tasked with adding semantic search capabilities to the impresso project (a historical newspaper archive), we suspected that these heterogeneous texts might pose challenges we could not leave to chance. To confirm these difficulties, we constructed in-domain and simulated test sets that reveal the performance degradation of models trained on modern text. Fortunately, we find that this performance drop can be mitigated through simple and inexpensive methods that adapt models to the historical and error-prone text domain. Finally, we combine these methods to derive our new OCR-robust models, which will serve the impresso digitized newspaper collection.
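The abstract does not detail how a simulated test set of this kind is built; one common and inexpensive approach, sketched here under assumed confusion pairs and noise rates, is to inject OCR-like character errors into modern text:

```python
# Hypothetical sketch: simulating OCR-like noise via common character
# confusions applied at a given rate. The confusion table and rate are
# assumptions for illustration, not the project's actual setup.
import random

OCR_CONFUSIONS = {"e": "c", "l": "1", "o": "0", "m": "rn", "u": "ii"}

def add_ocr_noise(text: str, rate: float = 0.05, seed: int = 0) -> str:
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch.lower() in OCR_CONFUSIONS and rng.random() < rate:
            out.append(OCR_CONFUSIONS[ch.lower()])
        else:
            out.append(ch)
    return "".join(out)

print(add_ocr_noise("The parliament assembled in the morning", rate=0.3))
```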
14 Oct 2025
Chiara Tschirner: On the connection between naming speed, visual search and reading comprehension
Rapid automatized naming (RAN) is one of the most established predictors of reading ability, including reading comprehension, due to its general similarity to the reading process. However, this broad scope makes it difficult to pinpoint specific issues in reading development. Previous research investigating the components of RAN and their connection to reading suggests that visual scanning could be one of the main properties of RAN that predicts reading ability. Visual search is a task in which the participant has to find a target symbol among distractors by efficiently shifting overt attention, i.e., by visual scanning. There are studies showing that performance in this task predicts reading ability, though most of them have focused on reading fluency rather than reading comprehension. In this work, we investigate the effects of visual search performance and RAN on two levels of reading comprehension, namely the word level and the sentence level. We use Bayesian linear mixed models to estimate these effects and separately calculate Pearson correlation coefficients to further shed light on the relevance of visual scanning for reading comprehension.
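A hedged sketch of the kind of modeling setup described, with simulated data and assumed variable names (the bambi library provides lme4-style formulas for Bayesian mixed models; this is not the study's actual model or data):

```python
# Hedged sketch: a Bayesian linear mixed model plus a complementary Pearson
# correlation. All variable names and the simulated data are assumptions.
import numpy as np
import pandas as pd
import bambi as bmb
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "subject": rng.integers(0, 20, n).astype(str),
    "ran": rng.normal(size=n),        # rapid automatized naming score
    "vsearch": rng.normal(size=n),    # visual search performance
})
df["comprehension"] = 0.4 * df["ran"] + 0.3 * df["vsearch"] + rng.normal(scale=0.5, size=n)

# fixed effects for RAN and visual search, random intercepts per subject
model = bmb.Model("comprehension ~ ran + vsearch + (1|subject)", df)
idata = model.fit(draws=500, chains=2)

# separate Pearson correlation, as mentioned in the abstract
r, p = pearsonr(df["vsearch"], df["comprehension"])
print(f"r = {r:.2f}, p = {p:.3f}")
```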
Anna Bondar: Cognitive MoE Routing for Aligning Computation with Human Reading Behavior
Mixture-of-Experts (MoE) has emerged as an efficient architecture for training LLMs, whose core mechanism is a router that decides which experts process which tokens. We aim to enhance this routing by integrating human cognitive signals into it. To this end, we introduce cognitive routing: a gating strategy that uses signals of human text processing, derived from eye-tracking data, to decide which experts handle each token, aligning computation with human reading behaviour. Our router conditions not only on text features but also on the similarity between token-level representations and learned human-reading embeddings when selecting experts. In doing so, we aim to internalise human processing information within the routing mechanism and improve its effectiveness.
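The talk does not spell out the router's architecture; as a hedged sketch of the general idea, expert logits could mix a standard learned gate with token-to-reading-embedding similarity. All dimensions, the mixing weight, and the top-k value below are assumptions:

```python
# Hypothetical sketch of a cognitively informed MoE router: expert logits
# combine a text-based gate with the cosine similarity between token states
# and learned "human-reading" embeddings. Not the talk's actual design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CognitiveRouter(nn.Module):
    def __init__(self, d_model=512, n_experts=8, top_k=2, alpha=0.5):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)           # text-based gating
        self.reading_emb = nn.Parameter(torch.randn(n_experts, d_model))
        self.top_k, self.alpha = top_k, alpha

    def forward(self, hidden):                              # hidden: (batch, seq, d)
        text_logits = self.gate(hidden)                     # (b, s, E)
        # similarity between token states and per-expert reading embeddings
        sim = F.normalize(hidden, dim=-1) @ F.normalize(self.reading_emb, dim=-1).T
        logits = (1 - self.alpha) * text_logits + self.alpha * sim
        weights, experts = logits.topk(self.top_k, dim=-1)  # route to top-k experts
        return F.softmax(weights, dim=-1), experts

router = CognitiveRouter()
w, e = router(torch.randn(2, 16, 512))
print(w.shape, e.shape)   # (2, 16, 2) routing weights and expert indices
```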
30 Sept 2025
Cui Ding: When Half a Word Is Enough: Bottom-up Information Processing in English and Chinese
Contemporary theories model language processing as integrating both top-down expectations and bottom-up inputs. One major prediction of such models is that the quality of the bottom-up inputs modulates ease of processing: noisy inputs should lead to difficult and effortful comprehension. We test this prediction in the domain of reading. First, we propose an information-theoretic operationalization of the “quality” of bottom-up information as the mutual information (MI) between visual information and word identity, and we formalize this prediction in a mathematical model of reading as Bayesian update. Second, we test our operationalization by comparing participants’ reading times on words whose information quality has been reduced, by occluding their top or bottom half, with reading times on full words. We collect data in English and Chinese. We then use multimodal language models to estimate the mutual information between visual inputs and words, and use these estimates to quantify the specific effect of reduced information quality on reading times. Finally, we compare how information is distributed across visual forms. In both English and Chinese, the upper half of a word contains more information about its identity than the lower half. However, the asymmetry is more pronounced in English, a pattern which is reflected in the reading times.
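In symbols (a sketch in our own notation of the quantities the abstract names), the operationalization and the Bayesian reading model amount to:

```latex
% "Quality" of the bottom-up input: mutual information between the
% visual signal V and the word identity W
I(V; W) = \sum_{v, w} p(v, w) \, \log \frac{p(v, w)}{p(v)\, p(w)}

% Reading as Bayesian update: a top-down prior over words p(w) is
% combined with the likelihood of the (possibly occluded) visual input
p(w \mid v) \propto p(v \mid w) \, p(w)
```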
Negar Foroutan: Responsibly Building Multilingual Language Models for Hundreds of Languages
Large Language Models (LLMs) have reshaped artificial intelligence by enabling systems that can understand, generate, and reason with human language at an unprecedented scale. However, most development has focused on English and a small set of high-resource languages, raising important questions about inclusivity, fairness, and global reach. Multilingual Large Language Models (MLLMs) extend these capabilities across hundreds of languages, but they face significant challenges. Performance in low-resource languages is constrained by data scarcity, tokenization methods often privilege dominant languages, and existing benchmarks can obscure weaknesses outside high-resource contexts. In this presentation, I will examine these challenges and outline potential solutions. I will discuss approaches to overcoming data limitations, such as contrastive learning for language identification and strategies for more balanced multilingual pretraining. I will also highlight methods for developing parity-aware tokenization and for evaluating MLLMs more effectively, including assessments that account for regional and cultural knowledge.
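One of the issues named above, tokenization that privileges dominant languages, is straightforward to probe. Here is a hedged sketch that compares subword fertility (tokens per whitespace-separated word) across languages, with the model name and sentences as illustrative assumptions:

```python
# Hedged sketch: probing tokenizer parity by comparing subword fertility
# across languages. The model and the sample sentences are illustrative
# assumptions, not the talk's actual evaluation setup.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")

samples = {
    "English": "The weather is very nice today.",
    "German": "Das Wetter ist heute sehr schön.",
    "Swahili": "Hali ya hewa ni nzuri sana leo.",
}

for lang, text in samples.items():
    n_tokens = len(tok.tokenize(text))
    n_words = len(text.split())
    # higher fertility means the language pays more tokens per word
    print(f"{lang}: {n_tokens / n_words:.2f} tokens/word")
```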
16 Sept 2025
cancelled