

Department of Computational Linguistics

Automatic Classification of Large Text Collections

Supervisor: Gerold Schneider


The aim of this programming project or bachelor thesis is to classify large numbers of newspaper articles automatically. The project is conducted in cooperation with a Swiss media database, which offers interfaces that deliver documents in XML or JSON format. Open-source tools such as WEKA, RapidMiner, Date or others are to be examined.

Besides testing various algorithms and methods, the work focuses on evaluating them. The media database provides a comprehensive gold standard for this purpose. Depending on the time available, the standard solutions can also be adapted to the task.
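Evaluation against a gold standard usually means comparing the category a classifier predicts for each article with the category recorded in the gold standard. The following is a minimal sketch of that comparison, computing per-class precision, recall and F1 using only the Python standard library. The function names `evaluate` and `macro_f1` and the category labels are assumptions made for illustration; they are not taken from the media database or from any of the tools mentioned above.

```python
from collections import Counter


def evaluate(gold, predicted):
    """Return per-class precision, recall and F1 for two parallel label lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1   # correct prediction for class g
        else:
            fp[p] += 1   # p was predicted but is wrong
            fn[g] += 1   # g was the right class but was missed

    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        n_pred = tp[label] + fp[label]
        n_gold = tp[label] + fn[label]
        precision = tp[label] / n_pred if n_pred else 0.0
        recall = tp[label] / n_gold if n_gold else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-class F1 values."""
    if not scores:
        return 0.0
    return sum(s["f1"] for s in scores.values()) / len(scores)


# Hypothetical labels, invented for this example only.
gold = ["sports", "politics", "sports", "economy", "politics", "sports"]
predicted = ["sports", "politics", "economy", "economy", "sports", "sports"]

scores = evaluate(gold, predicted)
for label, s in scores.items():
    print(f"{label}: P={s['precision']:.2f} R={s['recall']:.2f} F1={s['f1']:.2f}")
print(f"macro F1: {macro_f1(scores):.2f}")
```

Macro-averaging treats every category equally regardless of how many articles it contains, which can matter when the categories in a newspaper collection are unevenly sized. Whether macro or micro averaging suits this project better is an open choice.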

Aim and Purpose

  • Testing open-source tools for document classification
  • Evaluation
  • Adjustments, if necessary
  • If the task is successfully completed, it may lead to the development of a new solution with our industrial partner.


The task requires programming skills in Python, Perl or Java. Experience with XML and with tools like WEKA is an advantage, but not mandatory. An interest in automatic media content analysis is also recommended.
