Text Reuse Detection (BA thesis)

Supervisors: Simon Clematide/Ann-Sophie Gnehm 

Introduction: The Swiss Job Market Monitor extracts information from job advertisements to monitor and analyze trends in the Swiss job market. For the NRP77 “Digital Transformation”, we analyze how digitalization is changing the task and skill profiles of workers. Our multilingual data consists of print and online job advertisements in German, French, English and Italian, and covers the period from 1950 to today.

Aim and Purpose: The style of job advertisements is often formulaic, and typical phrases, sentences or whole paragraphs can be found in many different job ads, e.g. «Wir sind ein führendes Unternehmen im Bereich …» ("We are a leading company in the field of …") or «Für unser kleines, dynamisches Team suchen wir per sofort …» ("For our small, dynamic team we are looking for, effective immediately, …"). Nonetheless, the language changes over time and varies across industries and professions, and between print and online job ads. The aim of this project is to automatically detect and analyze text reuse in our job ad corpus. Tasks include:

  • Application of existing text reuse detection tools (e.g. passim) to our multilingual corpus 
  • Analysis of text reuse at the paragraph, sentence and phrase level, including its development over time and its distribution across industries and other metadata (e.g. profession, text zones) 
  • Creation of a phrasebook for frequent formulaic utterances 
  • Visualization of phrasebook results in a simple (web) application 
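
To make the phrase-level analysis over time more concrete, the following minimal sketch counts in how many ads of each decade a word n-gram occurs. It is a naive stand-in for illustration only and not how passim works; the record fields `year` and `text`, the function names, and the toy ads are all hypothetical.

```python
from collections import Counter, defaultdict


def word_ngrams(text, n):
    """Return the set of lowercased word n-grams in a text (whitespace tokens)."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def reuse_by_decade(ads, n=3, min_ads=2):
    """Count, per decade, in how many ads each word n-gram occurs.

    Only n-grams found in at least `min_ads` ads of a decade are kept.
    """
    counts = defaultdict(Counter)
    for ad in ads:
        decade = ad["year"] // 10 * 10
        # Using a set means each ad contributes at most once per n-gram.
        counts[decade].update(word_ngrams(ad["text"], n))
    return {
        decade: {gram: c for gram, c in counter.items() if c >= min_ads}
        for decade, counter in counts.items()
    }


# Toy, made-up ads (not taken from the corpus).
ads = [
    {"year": 1995, "text": "Wir suchen per sofort eine Sekretärin"},
    {"year": 1998, "text": "Wir suchen per sofort einen Koch"},
    {"year": 2015, "text": "Join our dynamic team"},
]
result = reuse_by_decade(ads)
```

This sketch only finds exact matches and does not strip punctuation; a tool such as passim is needed to align passages that differ slightly.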

This project can be carried out as a programming project or as a BA thesis, and the scope of the tasks (mono- vs. multilingual analysis) can be discussed.

Procedure: The first part of the work consists of finding reasonable parameter settings for the passim tool (n-gram size, edit distance). The second part consists of analyzing passim’s JSON output (word n-grams) and compiling a phrasebook. The last part consists of providing a web application (e.g. based on Meilisearch or any other retrieval engine) where users can search for reused text elements, taking into account metadata search facets and textual facets (length of the reused passage, content words).
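
The second step could look roughly like the following sketch, which groups reused passages by cluster and derives one phrasebook entry per cluster. It assumes JSON-lines input in which every record carries a `cluster` id, the passage `text`, and a `year`; these field names are assumptions and must be checked against the actual passim output and the corpus metadata. The function names and the sample records are hypothetical.

```python
import json
from collections import Counter, defaultdict


def load_records(lines):
    """Parse JSON-lines input, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def compile_phrasebook(records, min_size=2):
    """Group passages by cluster and build one phrasebook entry per cluster."""
    clusters = defaultdict(list)
    for rec in records:
        clusters[rec["cluster"]].append(rec)

    entries = []
    for cluster_id, members in clusters.items():
        if len(members) < min_size:
            continue  # a passage seen only once is not reuse
        # Use the most frequent whitespace-normalized passage as the entry text.
        texts = Counter(" ".join(m["text"].split()) for m in members)
        phrase, _ = texts.most_common(1)[0]
        years = [m["year"] for m in members]
        entries.append({
            "id": cluster_id,
            "phrase": phrase,
            "occurrences": len(members),
            "length_tokens": len(phrase.split()),
            "first_year": min(years),
            "last_year": max(years),
        })
    # Most frequently reused passages first.
    return sorted(entries, key=lambda e: e["occurrences"], reverse=True)


# Made-up sample in the assumed layout (not real passim output).
sample = """
{"cluster": 1, "year": 1987, "text": "Wir sind ein führendes Unternehmen"}
{"cluster": 1, "year": 2003, "text": "Wir sind ein  führendes Unternehmen"}
{"cluster": 1, "year": 1995, "text": "Wir sind ein führendes Unternehmen"}
{"cluster": 2, "year": 2010, "text": "per sofort gesucht"}
{"cluster": 2, "year": 2012, "text": "per sofort gesucht"}
{"cluster": 3, "year": 1990, "text": "Einzelfall"}
""".splitlines()

phrasebook = compile_phrasebook(load_records(sample))
```

Entries of this shape could then be loaded into a retrieval engine, with `occurrences`, `length_tokens` and the year fields serving as candidate search facets.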

Requirements: 

  • Flair for data science analyses and web applications 
  • Programming skills in Python