Evaluating Embedding Models In The Historic Domain

Supervisor(s): Dr. Juri Opitz (contact point) & Andrianos Michail & Dr. Simon Clematide

Summary

Semantic search performance in historical document collections may differ from performance on contemporary texts.

Embedding models are important NLP tools:

  • They produce a “similarity” score for two texts (→ the backbone of document retrieval; see the sketch after this list)
  • The “accuracy” of these models is evaluated on large-scale benchmarks
  • How trustworthy are such large-scale evaluations for the historic domain?
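
To make the first point concrete, here is a minimal sketch of how an embedding model scores the similarity of two texts, using the sentence-transformers library. The model name and the example sentences are illustrative assumptions, not choices prescribed by the project:

    # Minimal sketch: score the similarity of two texts with an embedding
    # model via sentence-transformers. The model name and the sentences
    # are illustrative assumptions.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    texts = [
        "The parliament debated the new railway act.",
        "Lawmakers discussed legislation on the national railways.",
    ]

    # Encode both texts into dense vectors and compare them.
    embeddings = model.encode(texts, convert_to_tensor=True)
    similarity = util.cos_sim(embeddings[0], embeddings[1])
    print(f"Cosine similarity: {similarity.item():.3f}")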

To test this, we want to:

  • Investigate in-domain evaluations of embedding models on historic newspaper texts
  • E.g., matching newspaper titles against newspaper texts (see the sketch after this list)
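
As a concrete illustration of such a task, the following hedged sketch evaluates title-to-text retrieval: each title should retrieve its own article via embedding similarity. The model name and the two toy title/text pairs are placeholder assumptions; a real evaluation would use historic newspaper data (e.g., from Impresso):

    # Sketch of the proposed in-domain task: each newspaper title should
    # retrieve its own article via embedding similarity. The model name
    # and the toy data are placeholder assumptions.
    import torch
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    titles = ["Storm damages harbour", "Election results announced"]
    texts = [
        "Heavy winds last night wrecked several ships in the harbour.",
        "The final count of votes was published by the authorities today.",
    ]

    title_emb = model.encode(titles, convert_to_tensor=True)
    text_emb = model.encode(texts, convert_to_tensor=True)

    # Rank all texts for every title by cosine similarity; the matching
    # text shares the title's index, so accuracy@1 checks the top hit.
    scores = util.cos_sim(title_emb, text_emb)  # shape: (n_titles, n_texts)
    predictions = scores.argmax(dim=1)
    gold = torch.arange(len(titles), device=predictions.device)
    accuracy = (predictions == gold).float().mean()
    print(f"Accuracy@1: {accuracy.item():.2f}")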

Will the best models from the benchmarks also be the best for in-domain tasks?

Results will have implications for the reliability of benchmarks and for the recommended use of embedding models in a real-world project (e.g., Impresso)

Requirements

  • Deep Learning
  • Python/PyTorch