Answer Extraction and Text Mining
In Answer Extraction we try to design systems that can find explicit, literal answers to queries phrased in everyday natural language, searching in large-volume texts (such as manuals for software or aircraft, which often run to tens of thousands of pages). Answers may be entire sentences in the documents or mere snippets of sentences.
The main difference from IR systems such as Google is that the system does not return entire documents retrieved on the basis of certain key terms, but concrete answers to concrete questions. For this to work, the system must determine the meaning of both questions and texts and represent it in a suitable logic-based language.
In technical terms, the system must perform a maximally complete syntactic analysis of the linguistic input, followed by a translation of the syntactic structures into Logical Forms. Of equal importance is a thorough knowledge of the domain terminology. Without this the system will miss out on a large number of answers in the texts.
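The idea of matching a question against a text at the level of Logical Forms can be illustrated with a minimal sketch. It assumes that the syntactic analysis and the translation into Logical Forms have already taken place; the tuples below are hand-written stand-ins, and the predicate names, the example sentences and the functions `match` and `answer` are hypothetical, not the representation used by any actual system.

```python
# Hand-written stand-ins for Logical Forms: (predicate, agent, object).
# In a real system these would be produced by parsing the manual text.
FACTS = [
    ("The cp command copies files.", ("copy", "cp", "file")),
    ("The rm command removes files.", ("remove", "rm", "file")),
]


def match(query, fact):
    """Return variable bindings if the query matches the fact, else None.

    Terms starting with '?' are variables; all other terms must be equal.
    """
    if len(query) != len(fact):
        return None
    bindings = {}
    for q_term, f_term in zip(query, fact):
        if q_term.startswith("?"):
            # A variable must bind to the same value everywhere it occurs.
            if bindings.setdefault(q_term, f_term) != f_term:
                return None
        elif q_term != f_term:
            return None
    return bindings


def answer(query, facts):
    """Return every (sentence, bindings) pair whose Logical Form matches."""
    results = []
    for sentence, fact in facts:
        bindings = match(query, fact)
        if bindings is not None:
            results.append((sentence, bindings))
    return results


# "Which command copies files?" rendered by hand as a query with a variable.
print(answer(("copy", "?x", "file"), FACTS))
```

The result is the sentence that answers the question, together with the value bound to the variable, rather than a whole document. The sketch also hints at why domain terminology matters: if "cp command" were not recognised as a single term during analysis, the fact would never be produced and the answer would be missed.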
In Text Mining we try to find hidden regularities and dependencies in large collections of texts. Today's methods basically apply statistical techniques to the textual surface, i.e. they determine statistical values for occurrences of isolated content words. However, this approach makes such methods less suitable for finding dependencies between complex phenomena, such as entire events, in the texts. In order to identify this type of entity we need to use more of the linguistic information contained in the texts. It is, in particular, essential to perform a syntactic analysis of the texts.
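The limitation of purely surface-based statistics can be seen in a small sketch. It assumes a deliberately naive setup: whitespace tokenisation, a tiny hand-picked stopword list, and sentence-level co-occurrence counts. The function name `cooccurrence_counts` and the example sentences are illustrative only.

```python
from collections import Counter
from itertools import combinations

# Tiny illustrative stopword list; a real one would be much larger.
STOPWORDS = {"the", "a", "of", "by", "is", "in", "and"}


def cooccurrence_counts(sentences):
    """Count how often two content words occur in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        # Unique content words, sorted so each pair has one canonical order.
        words = sorted({w for w in sentence.lower().split() if w not in STOPWORDS})
        counts.update(combinations(words, 2))
    return counts


sentences = [
    "protein X activates gene Y",
    "gene Y is activated by protein X",
]
counts = cooccurrence_counts(sentences)
print(counts[("gene", "protein")])
print(counts[("x", "y")])
```

The counts show that "x" and "y" occur together in both sentences, but not which of the two is the activator. That information lies in the syntactic structure, which isolated word statistics discard.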
A specific form of Text Mining is Literature-Based Discovery (LBD). In LBD we try to find, in a number of different texts, descriptions of local functional dependencies which, in combination, can lead to hypotheses about novel large-scale functional dependencies. In biomedicine, for instance in genetics, such methods are used to identify new pathways in the expression of genes.
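The combination step can be sketched as follows, under the simplifying assumption that the local dependencies have already been extracted from the texts as directed pairs (source, target). The function `propose_hypotheses` and the placeholder entity names are hypothetical; the sketch only chains two known dependencies and proposes the indirect link when no text states it directly.

```python
def propose_hypotheses(known_pairs):
    """Propose indirect dependencies A -> C from known A -> B and B -> C.

    Returns sorted (a, b, c) triples where a -> c is not already known.
    """
    known = set(known_pairs)
    targets_by_source = {}
    for source, target in known:
        targets_by_source.setdefault(source, set()).add(target)

    hypotheses = set()
    for a, b in known:
        for c in targets_by_source.get(b, ()):
            # Skip trivial cycles and links that are already documented.
            if c != a and (a, c) not in known:
                hypotheses.add((a, b, c))
    return sorted(hypotheses)


# Placeholder entities; each pair stands for a dependency found in one text.
pairs = [("A", "B"), ("B", "C"), ("A", "D")]
print(propose_hypotheses(pairs))
```

Each proposed triple is only a candidate hypothesis: it records that A may influence C via B and still needs to be assessed by a domain expert.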
One important task in Text Mining is relation extraction. In relation extraction we try to identify the core propositions in individual sentences irrespective of their linguistic manifestation (X activates Y, Y is activated by X, activation of Y by X, XY activation etc.). This analysis allows us, despite its rather superficial character, to define in a much more precise manner various search operations that would be impossible to express on the basis of isolated keywords.
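A toy sketch can show what it means to map different surface forms onto one core proposition. It uses regular expressions over short phrases as a stand-in for real syntactic analysis, which is a strong simplification, and it covers only three of the variants listed above; the compound nominal form is left out because a pattern cannot tell which element is the agent. The names `PATTERNS` and `extract_activation` are illustrative.

```python
import re

# Each entry: (pattern, (group number of agent, group number of target)).
PATTERNS = [
    (re.compile(r"^(\w+) activates (\w+)$"), (1, 2)),           # X activates Y
    (re.compile(r"^(\w+) is activated by (\w+)$"), (2, 1)),     # Y is activated by X
    (re.compile(r"^activation of (\w+) by (\w+)$"), (2, 1)),    # activation of Y by X
]


def extract_activation(phrase):
    """Return ("activate", agent, target) for a known pattern, else None."""
    text = phrase.strip()
    for pattern, (agent_group, target_group) in PATTERNS:
        found = pattern.match(text)
        if found:
            return ("activate", found.group(agent_group), found.group(target_group))
    return None


for phrase in ["X activates Y", "Y is activated by X", "activation of Y by X"]:
    print(extract_activation(phrase))
```

All three phrases yield the same triple, so a search for "what does X activate?" can be expressed once over the normalised relation instead of once per wording.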
Within the scope of the OntoGene project, we collaborate with the NITAS/TMS group (Text Mining Services) at Novartis Pharma AG on the development of novel text mining techniques for the biomedical domain.
In cooperation with Finnova, a leading Swiss provider of banking software, we are investigating applications of Information Extraction technologies in the area of software specifications. In particular we plan to analyze questions and requests from internal and external users and identify relevant segments of internal documentation. The extracted information is intended to support human experts in formulating satisfactory replies to the original queries.
The EU project "Multilingual Annotation of Named Entities and Terminology Resources Acquisition" (Mantra) is carried out by several academic and commercial partners. The goal of the project is the generation of multilingual biomedical terminologies from public corpora. The project partners will set up a community challenge leading to the annotation of the multilingual corpora. The identified entities will be integrated into the multilingual terminologies.