

Department of Computational Linguistics Digital Linguistics

Research spotlight: Synthetic reading data

Last week, Lena S. Bolliger presented a paper at EMNLP 2023 introducing ScanDL, a novel approach that uses a diffusion model to generate synthetic scanpaths on text. ScanDL sets a new state of the art in simulating human reading behaviour.

Why would we even care about simulating human reading data?
Studying how people read gives us valuable insights into human language processing. When we read, our eyes fixate on important words and skip words that are not needed to understand the text. Language models don't naturally know which parts of a text are important for a human reader; nonetheless, we'd like them to achieve human-level performance on language tasks! One of the goals of our group is therefore to explore how human cognitive data can make language models more human-like. The problem is that collecting reading data is not only time-consuming but also expensive, so available datasets are limited in size. Generating synthetic reading data that mimics how humans read a text is therefore a promising way to leverage cognitive data for NLP research.
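To make the idea concrete, here is a toy sketch of what a scanpath looks like as data: an ordered sequence of fixations, each pairing a word index with a fixation duration. The sentence, the scanpath values, and the helper functions are illustrative assumptions, not ScanDL's actual representation.

```python
# Toy illustration of a scanpath: an ordered list of (word_index, duration_ms)
# fixations. (Illustrative only; not ScanDL's internal representation.)

words = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]

# Hypothetical reading behaviour: some words are skipped, and the reader
# regresses from "jumps" (index 4) back to "fox" (index 3).
scanpath = [(1, 220), (3, 180), (4, 250), (3, 90), (7, 210), (8, 240)]

def skipped_words(words, scanpath):
    """Return the words that were never fixated."""
    fixated = {i for i, _ in scanpath}
    return [w for i, w in enumerate(words) if i not in fixated]

def total_fixation_time(scanpath, word_index):
    """Sum fixation durations (ms) on a given word across all visits."""
    return sum(d for i, d in scanpath if i == word_index)

print(skipped_words(words, scanpath))    # words the reader never looked at
print(total_fixation_time(scanpath, 3))  # "fox": 180 + 90 = 270 ms
```

Ordering matters: the regression back to index 3 is information that a simple per-word count would lose, which is why models like ScanDL generate full sequences rather than aggregate statistics.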

How can we make use of synthetic reading data?
For further insights, see our second paper accepted at EMNLP 2023, by Shuwen Deng, which explores practical applications of generated reading data. It shows that fine-tuning language models on synthetic gaze data can improve their performance on NLU downstream tasks such as sentiment classification, demonstrating how synthetic reading data can contribute to improving large language models.
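One way such augmentation can work, sketched very loosely: derive per-word gaze features from a (synthetic) scanpath and concatenate them with the word representations before classification. The feature choice (total fixation time, fixation count) and the tiny 2-dimensional "embeddings" below are assumptions for illustration, not the paper's actual setup.

```python
# Hypothetical sketch: augmenting word representations with gaze features
# derived from a synthetic scanpath. Features and dimensions are
# illustrative assumptions, not the architecture used in the paper.

def gaze_features(num_words, scanpath):
    """Per-word [total_fixation_ms, fixation_count] from a scanpath
    given as (word_index, duration_ms) pairs."""
    feats = [[0.0, 0.0] for _ in range(num_words)]
    for idx, dur in scanpath:
        feats[idx][0] += dur
        feats[idx][1] += 1
    return feats

def augment(embeddings, feats):
    """Concatenate each word embedding with its gaze feature vector."""
    return [e + f for e, f in zip(embeddings, feats)]

# Toy 2-dimensional "embeddings" for a 3-word input.
embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
scanpath = [(0, 200), (2, 150), (0, 80)]  # word 1 is skipped

augmented = augment(embeddings, gaze_features(3, scanpath))
print(augmented)  # each word now carries embedding + gaze features
```

The intuition is that the gaze features tell the downstream classifier which words a human reader would dwell on, information that the text alone does not provide.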

ScanDL: A Diffusion Model for Generating Synthetic Scanpaths on Texts, EMNLP 2023
Lena S. Bolliger, David R. Reich, Patrick Haller, Deborah N. Jakobi, Paul Prasse, Lena A. Jäger
[ArXiv preprint | Video | Poster | bib]

Pre-Trained Language Models Augmented with Synthetic Scanpaths for Natural Language Understanding, EMNLP 2023
Shuwen Deng, Paul Prasse, David R. Reich, Tobias Scheffer, Lena A. Jäger