Time & Location: every 2 weeks on Tuesdays from 10:15 am to 12:00 pm in room AND 3.46.
Please note that the room has changed from previous semesters.
Online participation via the MS Teams Team CL Colloquium is also possible.
The creation of suitable reading comprehension materials for reading skill assessment tests in various frameworks (such as SAT, PISA, TOEFL, or our own group's MultiplEYE project) is both time- and resource-intensive. It often involves multiple iterations of question creation, quality control, and field testing to reach a satisfactory result. In this talk, I will, on the one hand, explore some ways in which LLMs could be used in the creation and evaluation of reading comprehension questions in the context of reading skill assessment tests. On the other hand, I will outline what a model specialized for this task could look like, and how it could be integrated into the reading comprehension question creation process in order to speed up the process and reduce its resource intensity.
The MultiplEYE COST action officially started about a year ago. Its goal is to enable a multilingual eye-tracking-while-reading data collection that can serve as a basis for studying human language processing in psycholinguistic research, and for evaluating and improving machine language processing in machine learning research. I will present intermediate results and challenges of the action. These range from the creation of a parallel stimulus corpus in more than 20 languages, including carefully created comprehension questions, to the development of software tools for the experimental presentation and the preprocessing pipeline of the eye-tracking data. One of the big challenges is the coordination between languages and labs, mostly across Europe but also internationally.
It is important to investigate the behavior of machine translation (MT) metrics when facing different error types, particularly accuracy errors, as those can lead to dangerous outcomes, e.g., in legal or medical contexts. Last year, we developed ACES, a Translation Accuracy Challenge Set with 68 phenomena, ranging from simple word/character-level perturbations to more complex errors. In this talk, I will outline how the scores assigned by a wide range of MT metrics when evaluated on ACES can be standardized onto a common scale. By doing so, we can compare the scores given to both correct and incorrect translations, revealing the metrics' sensitivity to various types of accuracy errors. As part of our research into the behavior of MT metrics across various phenomena, I will also discuss our work on annotating error spans in ACES. These spans can be used to develop more interpretable MT metrics that predict error spans rather than a single sentence-level score.
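As a rough illustration of what putting metric scores on a common scale could look like, here is a minimal sketch using invented scores and simple z-standardization (the actual procedure used for ACES may differ):

```python
# Minimal sketch (hypothetical data): standardizing scores from two MT
# metrics onto a common scale so their sensitivity to errors is comparable.
from statistics import mean, stdev

def zscore(scores):
    """Map raw metric scores to zero mean and unit variance."""
    mu, sd = mean(scores), stdev(scores)
    return [(s - mu) / sd for s in scores]

# Raw scores assigned by two metrics to the same pool of correct and
# incorrect translations (values invented for illustration).
metric_a = [0.91, 0.88, 0.35, 0.40]   # e.g. a 0-1 scale
metric_b = [72.0, 69.5, 55.0, 58.0]   # e.g. a 0-100 scale

za, zb = zscore(metric_a), zscore(metric_b)
# After standardization, the gap between correct (first two) and
# incorrect (last two) translations can be compared across metrics.
gap_a = mean(za[:2]) - mean(za[2:])
gap_b = mean(zb[:2]) - mean(zb[2:])
print(round(gap_a, 2), round(gap_b, 2))
```

On their raw scales the two metrics' score gaps are incomparable; after z-standardization both gaps are expressed in standard deviations and can be compared directly.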
Preclinical neuroscientists currently lack a comprehensive resource that can assist them in making well-informed decisions when planning animal experiments. The drug development process is hindered because a considerable amount of potentially relevant evidence is scattered throughout the literature without systematic curation. The goal of my PhD project is to provide scientists with centralized access to this information, enabling them to optimize their research planning and ultimately reduce the number of experimental animals needed. In this talk, I will outline the methods that we are planning to use to achieve this objective. Furthermore, I will present ongoing work on Named Entity Recognition of drug and disease names in clinical trial registries.
LLMs have been shown to be extremely useful generic assistants and can be applied directly to an astonishing array of tasks. But in an enterprise setting, one of the most important and useful tasks is to answer questions grounded in company-specific knowledge. This requires LLMs to be customised or updated with external knowledge, while still retaining the useful general-purpose abilities that we've come to appreciate from applications like ChatGPT. Furthermore, existing open-weight LLMs are predominantly trained on English, and their multilingual abilities are largely under-explored, or at least undocumented. In this talk, I will present results from a recent internship project in which we investigated custom "chat" LLMs that could potentially serve German-speaking companies in a Swiss setting. Specifically, I will discuss ways of integrating company-specific knowledge and how we can adapt publicly available, English-centric LLMs for a German-language use case.
Mouse Tracking for Reading (MoTR) is a new naturalistic incremental processing measurement tool that simulates eye-tracking. MoTR runs in the browser, enabling cheaper data collection, and collection in places where no eye-tracking equipment is available. In a MoTR trial, participants are presented with text that is blurred, except for a small in-focus region around the tip of the mouse. Participants move the mouse over the text, bringing individual words into focus in order to read. Mouse movement is recorded and can be analyzed similarly to eye-tracking data. We implement MoTR experiments in Magpie and validate the method in two suites of experiments. First, we record MoTR data for the Provo Corpus, for which eye-tracking data exists. We find strong correlations between eye-tracking and MoTR reading times (RTs), ranging from 0.67 to 0.78. In an analysis similar to Smith and Levy (2013), we find a linear effect of by-word surprisal (estimated from GPT-2) on MoTR RTs. Second, we conduct a cross-methodological replication of three experiments from Witzel et al. (2012) and Boyce et al. (2020) that test preference for high vs. low attachment. MoTR RTs replicate previous self-paced reading results and, as a novel contribution, reveal how regressions are implicated in the processing of these phenomena.
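The surprisal analysis can be pictured with a toy sketch: fitting a linear regression of per-word reading times on per-word surprisal. All numbers below are invented for illustration; in the actual study, surprisal is estimated from GPT-2.

```python
# Sketch (hypothetical data): linear effect of by-word surprisal on
# reading times, fitted with ordinary least squares.
def ols(x, y):
    """Ordinary least squares fit y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

surprisal = [2.1, 4.5, 1.3, 6.0, 3.2]     # bits, invented
reading_time = [210, 260, 195, 300, 235]  # ms, invented

intercept, slope = ols(surprisal, reading_time)
# A positive slope means higher-surprisal words take longer to read.
print(f"slope = {slope:.1f} ms/bit")
```

A per-word surprisal effect of this form is what the Smith and Levy (2013)-style analysis tests, with MoTR RTs in place of eye-tracking measures.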
The purpose of this talk is to present the SNF project ProPoSaL (Prototypes and Parts-of-Speech across Languages), which I have been part of since joining this department. In brief, the goal of the project is to investigate the existence of adjectives as a prototypical category across languages from the perspectives of NLP, neurolinguistics, and theoretical linguistics. As far as the NLP portion of the project is concerned, the broad idea is to analyze word embeddings in order to verify whether the (non-)existence of adjectives in a certain language, as hypothesized by theoretical linguistics, manifests itself in a given model. I will first elaborate on the project before proceeding to illustrate the part pertaining to NLP and the relevant methods, which for my role in the project include topological data analysis (TDA). Specifically, I will give a hands-on overview of the main concepts and tools of TDA that I am using in the project.
The way we represent language has a significant impact on how well we perform various tasks. There are three primary methods: bag of words (BOW), token sequences, and graph-based representations. When we compare these approaches, the graph-based representation offers a richer perspective on the relationships between elements in the text. It reflects the inherently compositional and hierarchical structure of language, allowing more prior linguistic knowledge to be injected. In this talk, I will present the outcomes of applying graph-based representations in the context of semantic textual similarity and monolingual alignment tasks. Specifically, I will discuss ways of transforming text into graphs, multiple negatives ranking learning in the setting of monolingual alignment, and different explainability techniques for graph neural networks.
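As a minimal illustration of the text-to-graph idea, here is a sketch that links adjacent tokens into an undirected graph. This adjacency scheme is a stand-in for the richer edges (e.g., dependency relations) a real transformation would use; the tokens are invented.

```python
# Toy text-to-graph transformation: nodes are tokens, edges link
# neighbouring tokens (a placeholder for syntactic/semantic edges).
from collections import defaultdict

def sequence_to_graph(tokens):
    """Return an undirected adjacency list linking neighbouring tokens."""
    graph = defaultdict(set)
    for left, right in zip(tokens, tokens[1:]):
        graph[left].add(right)
        graph[right].add(left)
    return graph

g = sequence_to_graph(["the", "cat", "sat", "on", "the", "mat"])
# Repeated tokens share a node, so "the" accumulates several neighbours.
print(sorted(g["the"]))
```

Note how the graph already encodes more structure than a bag of words: repeated tokens collapse into one node whose neighbourhood summarizes all their contexts.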
Interpretability research on pre-trained neural language models and, more recently, large language models (LMs) has focused on a wide range of aspects, including linguistic abilities and different types of biases, such as gender and political biases. However, to date, it remains an open question whether LMs also exhibit cognitive biases, or, in other words, whether an LM is biased towards a specific cognitive profile (e.g., "high working memory").
We propose to investigate this question by deploying pre-trained generative LMs (mono- and multilingual; of different sizes) to assess the correlation between human processing effort, using eye-tracking measures as a proxy, and various metrics derived from scores such as token probabilities, surprisal, and entropy. To this end, we will use the Individual Differences Corpus (InDiCo; Haller et al., 2023), which provides scores from a large psychometric assessment for a range of cognitive capacities such as verbal and non-verbal working memory, cognitive control, reading fluency, and intelligence. We use the psychometric scores to subset the data into groups according to cognitive capabilities (e.g., "high working memory" vs. "low working memory"), and assess the above-mentioned correlations for each group separately. Comparing them will allow us to draw conclusions about the type of reader a given language model represents, i.e., to set up a profile for a given pre-trained LM in terms of the cognitive capabilities it emulates. In a second step, we will test whether fine-tuning allows us to bias LMs to exhibit certain cognitive traits. This might be desirable for a given downstream task tailored towards specific groups, or when attempting to estimate surprisal for individuals belonging to a specific group in psycholinguistic analyses.
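The group-wise procedure can be sketched as follows, with entirely invented psychometric scores and reading data: median-split readers on a working-memory score, then compute the surprisal-reading-time correlation separately per group.

```python
# Sketch (hypothetical data): median split on a psychometric score,
# then per-group Pearson correlation of surprisal and reading time.
from statistics import median

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Per-reader data: a working-memory score plus pooled
# (surprisal, reading-time) observations. Values are invented.
readers = [
    (85, [(2.0, 200), (4.0, 230), (6.0, 255)]),
    (90, [(2.0, 195), (4.0, 228), (6.0, 250)]),
    (55, [(2.0, 210), (4.0, 270), (6.0, 335)]),
    (60, [(2.0, 205), (4.0, 268), (6.0, 330)]),
]

cut = median(wm for wm, _ in readers)

def group_corr(keep):
    """Pool observations of the selected readers and correlate."""
    pairs = [p for wm, obs in readers if keep(wm) for p in obs]
    xs, ys = zip(*pairs)
    return pearson(xs, ys)

high_wm = group_corr(lambda wm: wm >= cut)
low_wm = group_corr(lambda wm: wm < cut)
print(round(high_wm, 3), round(low_wm, 3))
```

Comparing such group-wise statistics against an LM's surprisal estimates is the kind of analysis that would reveal which cognitive profile the model's predictions track best.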
Narratives are everywhere and are an important aspect of society; thus, they tell us a lot about the world we live in. Narratives also often contain certain characters, such as heroes, villains, and victims. In this talk, we demonstrate how to use GPT-4 to extract heroes, villains, and victims from political text, and we attempt to explain what the extracted characters can tell us about the world we live in.
I'll discuss work in progress based on https://aclanthology.org/2022.wnu-1.6/ and am curious to hear your feedback!
Is it possible to distinguish human- and machine-translated sentences solely by examining their word alignments with the source text? Quite often, the answer is yes! Previous research has indicated that machine translation tends to replicate the syntax of the source language, whereas human translators reposition words and even split or merge sentences. However, such syntactic mirroring, or syntactic literality, has also been observed in human translations. Furthermore, alternative syntactic structures can become inaccessible to human post-editors (Carl and Schaeffer, 2017), further contributing to the standardization of syntax in language. In my presentation, I will introduce some ideas for the development of a new metric to assess syntactic variability in translation, which I call the Syntactic Creativity Index. As this work is still in its early stages, I will only share some preliminary results and discuss potential ways forward.
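One plausible ingredient of such a metric is the number of crossing links in a source-target word alignment, as a proxy for how much a translation reorders the source syntax. The sketch below uses invented alignments and is not the proposed Syntactic Creativity Index itself:

```python
# Toy reordering measure: count crossing pairs in a word alignment,
# given as (source_index, target_index) links. Data is invented.
def crossings(alignment):
    """Count pairs of alignment links that cross each other."""
    count = 0
    for i in range(len(alignment)):
        for j in range(i + 1, len(alignment)):
            (s1, t1), (s2, t2) = alignment[i], alignment[j]
            if (s1 - s2) * (t1 - t2) < 0:  # opposite order on the two sides
                count += 1
    return count

monotone = [(0, 0), (1, 1), (2, 2)]   # mirrors the source word order
reordered = [(0, 2), (1, 0), (2, 1)]  # words repositioned in the target
print(crossings(monotone), crossings(reordered))
```

A translation that mirrors the source syntax yields zero crossings, while human-style repositioning produces more; a variability index could build on counts of this kind, normalized by sentence length.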
Vision-Language Models (VLMs) have made advances in generating 2D images and conducting zero-shot inference on new and existing benchmarks. Pretraining on large-scale collections of image-text pairs has underpinned the performance of these systems. In contrast, limitations on data availability and the diversity of requirements have constrained the number of large-scale vision-language systems designed for applications where the visual inputs are 3D. The ability to align across modalities has motivated building on the abilities of large VLMs trained on 2D images, but these approaches face challenges in the spatial shift introduced by the additional dimension and in the compositional requirements of inferring over scenes. This presentation will detail cross-modal tasks designed to build on the generative and discriminative abilities of VLMs in settings where the visual input is a 3D scene.