Vision-Language Multilingual NLU

Supervisor: Farhad Nooralahzadeh


Vision-and-language pre-training (VLP) models (e.g., ViLBERT, CLIP) have driven substantial progress in joint multimodal representation learning. To extend this success to non-English languages, recent work (e.g., mUNITER, M3P, UC2) has aimed to learn universal representations that map objects from different modalities, and text in different languages, into a shared semantic space.

Recent benchmarks spanning multiple tasks and languages (e.g., IGLUE [1]) show a large gap between the monolingual performance and the (zero-shot) cross-lingual transfer performance of current multilingual VLP models, which motivates further work in this area.

One likely cause of this gap is catastrophic forgetting and interference between languages. In multilingual NLP tasks, these problems have recently been addressed by applying the Lottery Ticket Hypothesis and sparse fine-tuning techniques [2].

We would like to explore these techniques in multilingual vision-language NLU tasks. Specifically, we would like to answer the following questions:

  1. Are there winning tickets for cross-lingual transfer on multilingual vision-language NLU tasks?

  2. What is the impact of sparse fine-tuning on the performance of multilingual vision-language models in cross-lingual vision-language NLU tasks?

  3. Does the Lottery Ticket Hypothesis provide a model- and task-agnostic framework for zero- and low-resource vision-language NLU tasks?
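
To make the idea behind these questions concrete, the following is a minimal sketch of lottery-ticket-style sparse fine-tuning in the spirit of [2], written in plain Python on a toy linear model rather than on a VLP model. The function names (`predict`, `sgd`, `lottery_ticket_sparse_finetune`), the toy data, and all hyperparameters are illustrative assumptions and are not taken from any existing codebase. The sketch first fine-tunes all parameters, keeps the k parameters that changed the most, then rewinds to the pre-trained values and fine-tunes only those k parameters.

```python
import random


def predict(w, x):
    """Linear model: dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, x))


def sgd(w, data, mask=None, lr=0.05, epochs=50):
    """Squared-error SGD. Parameters whose mask entry is False stay frozen."""
    w = list(w)  # work on a copy so the pre-trained weights are untouched
    for _ in range(epochs):
        for x, y in data:
            err = predict(w, x) - y
            for i, xi in enumerate(x):
                if mask is None or mask[i]:
                    w[i] -= lr * 2 * err * xi
    return w


def lottery_ticket_sparse_finetune(pretrained, data, k):
    # Phase 1: dense fine-tuning, used only to rank the parameters.
    full = sgd(pretrained, data)
    change = [abs(f - p) for f, p in zip(full, pretrained)]
    top = sorted(range(len(change)), key=change.__getitem__, reverse=True)[:k]
    mask = [i in top for i in range(len(pretrained))]

    # Phase 2: rewind to the pre-trained values, train only the selected subset.
    sparse = sgd(pretrained, data, mask=mask)

    # The result is a sparse difference vector relative to the pre-trained model.
    diff = [s - p for s, p in zip(sparse, pretrained)]
    return diff, mask


# Toy setup (assumption): the "task" differs from the pre-trained model
# in only two of eight parameters.
random.seed(0)
pretrained = [random.uniform(-1, 1) for _ in range(8)]
target = list(pretrained)
target[2] += 1.5
target[5] -= 1.0

data = []
for _ in range(40):
    x = [random.uniform(-1, 1) for _ in range(8)]
    data.append((x, predict(target, x)))

diff, mask = lottery_ticket_sparse_finetune(pretrained, data, k=2)
print("selected parameters:", [i for i, m in enumerate(mask) if m])
print("sparse difference vector:", [round(d, 3) for d in diff])
```

Because the output is a difference vector that is zero outside the selected subset, several such vectors (for example one per language and one per task) can in principle be added to the same pre-trained parameters, which is the composition property explored in [2]. In a PyTorch setting the same masking would typically be applied to the gradients of the model's parameter tensors.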



  • Programming knowledge in Python

  • Familiarity with the PyTorch framework

  • Foundational knowledge in the field of machine learning


  1. IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages, Emanuele Bugliarello, Fangyu Liu, Jonas Pfeiffer, Siva Reddy, Desmond Elliott, Edoardo Maria Ponti, Ivan Vulić.

  2. Composable Sparse Fine-Tuning for Cross-Lingual Transfer, Alan Ansell, Edoardo Maria Ponti, Anna Korhonen, Ivan Vulić.