
UZH at Swiss NLP Expo 2025

Unpacking our puzzle

The puzzle highlights the importance of subword tokenization in NLP.

  • Zür + cher: NLP systems compose rare words out of subwords. For this, they use a subword tokenization algorithm such as BPE.
  • nöd + betg: NLP systems learn representations end-to-end across languages. Even though “nöd” is Swiss German and “betg” is Romansh, an NLP system will recognize that the two words have the same meaning, given enough training data.
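
How subwords are learned and then used to compose rare words can be sketched in a few lines of Python. This is a simplified version of the BPE algorithm; the toy corpus and the number of merges are our own illustrative choices, not real training data:

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merge rules from a word-frequency dict.

    Words start as character sequences; each step merges the most
    frequent adjacent symbol pair into one new symbol.
    """
    vocab = {tuple(word): freq for word, freq in corpus.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

def segment(word, merges):
    """Split an unseen word into subwords by replaying the merges in order."""
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

# Toy corpus (word -> frequency); "Zürcher" is deliberately absent,
# so it must be composed out of subwords learned from other words.
merges = learn_bpe({"Zürich": 4, "Bücher": 3, "Becher": 2}, num_merges=6)
print(segment("Zürcher", merges))
```

The key property is that `segment` never fails on an unseen word: in the worst case it falls back to single characters, so a fixed, finite vocabulary can cover an open-ended set of words.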

Visualizing tokenization in current LLMs

Demo: https://hf.co/spaces/ZurichNLP/subword-tokenization

  • Out-of-domain words: ChatGPT needs to compose “Zürcher”, “nöd” and “betg” out of several subwords. Generally, the more subwords a word requires, the higher the cost and latency.
  • Tokenization disparity: ChatGPT needs many more subwords to compose a Romansh sentence than a German sentence.
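
The disparity can be reproduced with a toy experiment. The sketch below uses greedy longest-match segmentation (a simplification of how real BPE tokenizers replay learned merges), and the tiny German-skewed vocabulary is our own illustrative choice, not a real model's:

```python
def greedy_tokenize(word, vocab):
    """Greedy longest-match segmentation: at each position take the longest
    vocabulary entry that matches, falling back to a single character."""
    tokens, i = [], 0
    while i < len(word):
        for length in range(len(word) - i, 0, -1):
            piece = word[i:i + length]
            if piece in vocab or length == 1:
                tokens.append(piece)
                i += length
                break
    return tokens

# Hypothetical vocabulary that covers German but not Romansh, mimicking
# how LLM vocabularies over-represent high-resource languages.
vocab = {"das", "ist", "nicht"}

print(greedy_tokenize("nicht", vocab))  # -> ['nicht']: 1 subword
print(greedy_tokenize("betg", vocab))   # -> ['b', 'e', 't', 'g']: 4 subwords
```

The same sentence length in characters can thus translate into a very different number of subwords, and since LLM pricing and latency scale with subword counts, speakers of under-represented languages pay more per sentence.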

Our research on subword tokenization for NLP

Ever since one of our researchers – Rico Sennrich – proposed using subword tokenization with BPE for NLP (Sennrich et al., 2016), our department has contributed important research on subword tokenization in NLP:

  • As seen in the demo, SwissBERT is an encoder model with a vocabulary tailored to the four Swiss languages. “Zürcher” and “betg” are in the vocabulary of SwissBERT. There is also a Swiss German variant of SwissBERT that – of course – contains the token “nöd”.
  • Amrhein & Sennrich (2021) investigated the limits of tokenization with non-concatenative morphologies.
  • Aepli & Sennrich (2022) improved the robustness of Swiss German NLP with character-level noise.
  • Pelloni et al. (2022) analyzed subwords as a tool for linguistic typology.
  • Jiang et al. (2023) used subword tokenization to align spoken languages to sign languages represented in SignWriting.

Stay connected!
