Automatic Classification of Large Text Collections

Supervisor: Gerold Schneider

Introduction

The aim of this programming project or bachelor thesis is to classify large numbers of newspaper articles automatically. The project is conducted in cooperation with a Swiss media database, whose interfaces provide documents in XML or JSON format. Open-source tools such as WEKA, RapidMiner, Date or others are to be examined.
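To make the classification task concrete, here is a minimal sketch of a multinomial Naive Bayes classifier over JSON records, written with the Python standard library only. It is an illustration, not the method prescribed for the project: the field names `text` and `label`, the sample articles, and the categories are all assumptions, since the actual format of the media database interfaces is not described here.

```python
import json
import math
from collections import Counter, defaultdict

# Hypothetical JSON records; the real interface's field names are not specified.
RAW = """[
  {"text": "parliament votes on new tax law", "label": "politics"},
  {"text": "minister announces election reform", "label": "politics"},
  {"text": "team wins the cup final", "label": "sport"},
  {"text": "striker scores in the final match", "label": "sport"}
]"""


def tokenize(text):
    """Lowercase and split on whitespace (deliberately simplistic)."""
    return text.lower().split()


def train(docs):
    """Count documents per label and tokens per label."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc in docs:
        label_counts[doc["label"]] += 1
        for tok in tokenize(doc["text"]):
            word_counts[doc["label"]][tok] += 1
            vocab.add(tok)
    return label_counts, word_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability (add-one smoothing)."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # tokens never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(json.loads(RAW))
print(classify("the team scores", model))
print(classify("new election law", model))
```

A toolkit such as WEKA offers far more robust implementations; the sketch only shows what such a tool does in principle with labelled articles.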

Besides testing various algorithms and methods, the work focuses on evaluating them. The media database makes a comprehensive gold standard available for this purpose. Depending on the time available, the standard solutions can also be adapted to the task.
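Evaluation against a gold standard usually means comparing predicted labels with the gold labels and reporting precision, recall and F1 per category. The following is a minimal sketch in Python; the label names and the choice of macro-averaged F1 as a summary figure are assumptions made for illustration, not requirements of the project.

```python
from collections import Counter


def per_class_scores(gold, predicted):
    """Return {label: (precision, recall, f1)} for single-label classification."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # gold label was missed
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        n_pred = tp[label] + fp[label]
        n_gold = tp[label] + fn[label]
        precision = tp[label] / n_pred if n_pred else 0.0
        recall = tp[label] / n_gold if n_gold else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-class F1 values."""
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


# Hypothetical gold standard and system output for five articles.
gold = ["politics", "politics", "sport", "sport", "economy"]
pred = ["politics", "sport", "sport", "sport", "economy"]

scores = per_class_scores(gold, pred)
for label, (p, r, f) in scores.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"macro F1: {macro_f1(scores):.2f}")
```

Macro averaging weights every category equally, which can matter when some categories in a newspaper collection are much rarer than others.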

Short description

Aim and Purpose

  • Testing open-source tools for document classification
  • Evaluation
  • Adjustments, if necessary
  • If the task is successfully completed, it may lead to the development of a new solution with our industrial partner.

Requirements

The task requires programming skills in Python, Perl or Java. Experience with XML and with tools like WEKA is an advantage but not mandatory. An interest in automatic media content analysis is also recommended.

Literature