OFAI

AUREX/W: Automated retrieval and extraction of web contents

In today's information society, an ever-increasing number of companies, organizations, and individuals offer a plethora of information. Accordingly, information retrieval is a task of growing importance. Currently, however, most of this information is searched in a rather trivial way, using keywords or phrases entered into standard search engines.

Companies, however, usually search not for documents as such but for subject-specific information. Given the current state of the art, these inquiries are mostly carried out by hand, with search engines used only for preselection.

Automatic information extraction aims to automate these activities for large amounts of text. This is currently possible for quite specific retrieval tasks, such as stock information, where systems can retrieve relevant data even from texts that do not contain predefined search terms. Other methods rely on knowledge of the exact structure of the web pages to be searched. Unfortunately, large amounts of information on the web are available only in relatively unstructured form.

The goal of AUREX/W is to investigate the development of tools for information extraction from unstructured websites, thus making information extraction methods applicable to a significantly larger portion of the web in a largely task-independent way. AUREX/W aims to develop tools that are generic and easy for end users to use, and to provide a working toolbox for certain common subtasks of the information extraction process. The project focuses primarily on extraction from German-language websites.


Duration: 2006
Sponsor: WWTF (City of Vienna)
Researchers: Johann Petrak, Friedrich Neubarth
Partner: Web Integration IT Services


AUREX/W uses the GATE framework. Software from the project that is publicly available under the GPL can be found here.