Multilingual Latent Tokenization
Supervisor: Prof. Dr. Rico Sennrich
Summary
Latent tokenizers such as the Byte Latent Transformer have the potential to become widespread alternatives to fixed-vocabulary tokenizers such as BPE. However, while multilingual inequity has been discussed as a limitation of traditional tokenizers, it is currently unclear how easily cross-lingual fairness can be achieved with latent tokenizers. In this project, you will apply latent tokenization to a multilingual dataset and explore methods to increase multilingual equity.
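To make the inequity concrete: with a fixed-vocabulary tokenizer, a sentence in a language underrepresented in the tokenizer's training data is typically split into many more tokens than its English counterpart, so the same content costs more compute and context length. The sketch below is an illustration, not part of the project itself; it assumes the Hugging Face transformers library, uses the English-centric GPT-2 BPE tokenizer, and the parallel sentences are hypothetical examples.

```python
# Minimal sketch of measuring per-language tokenization "fertility"
# (tokens per character) with a fixed-vocabulary BPE tokenizer.
# Assumes: pip install transformers
from transformers import AutoTokenizer

# GPT-2 uses an English-centric byte-level BPE vocabulary.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Hypothetical parallel sentences with (roughly) the same meaning.
parallel = {
    "English": "Tokenization should treat all languages fairly.",
    "German": "Die Tokenisierung sollte alle Sprachen fair behandeln.",
    "Tamil": "டோக்கனாக்கம் அனைத்து மொழிகளையும் நியாயமாக நடத்த வேண்டும்.",
}

for lang, sentence in parallel.items():
    ids = tokenizer.encode(sentence)
    # Higher fertility means the language pays more tokens for the
    # same content, one common operationalization of tokenizer inequity.
    fertility = len(ids) / len(sentence)
    print(f"{lang:8s} tokens={len(ids):3d} chars={len(sentence):3d} "
          f"fertility={fertility:.2f}")
```

Latent tokenizers dispense with a fixed vocabulary, but whether this actually closes such fertility gaps across languages is precisely the question this project investigates.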
Requirements
- Background in machine learning
- Python