Since September 2017, I have been collaborating on the impresso project. impresso stands for Integrated Monitoring of Historical Press Corpora. During the three-year project phase, which is financed by an SNSF Sinergia grant, the DHLAB at EPFL, the C2DH at the University of Luxembourg, and our institute will work on text mining of historical newspapers.
My main contribution to this undertaking, which will (hopefully) end in a dissertation, will comprise lexical semantic indexing of texts as well as topic modeling of historical newspaper articles. Since our newspaper collection covers multiple languages, and only some of the data is available as parallel corpora, my focus will lie on cross-lingual topic modeling. More concretely, I will concentrate on transfer learning, that is, making knowledge gained from topic models in one language available in other languages.
Right now I am ...
... working on improving OCR for historical newspaper texts.
... working on a paper surveying the state of the art in natural language processing for historical newspapers.
... getting text out of 200 years of NZZ newspapers.
... working on topic models for the federal gazette, a parallel corpus. I am also testing whether clustering word embeddings can yield topic-like distributions.
If you want to know more about what is going on right now, you might be interested in my blog.
... there is nothing yet.