Xcollect: a generic “user-oriented” approach to retrieve data from biological sources on the web



INTRODUCTION

Making the best use of the mass of biological information stored in numerous, heterogeneous public data sources is the next challenge in bioinformatics. Resource providers offer various functionalities for databank retrieval and analysis. Integrated systems offer unified access to heterogeneous sources and resources, and mediation architectures allow automatic processing of complex queries in certain case studies. However, most existing solutions (reviewed in [1]) lack the flexibility needed to handle an arbitrary biological question.

The Xcollect project is based on the distinction between two levels of problem analysis: first, the design of a retrieval scenario involving relevant sources; second, the enactment of this scenario to collect and integrate the desired data [2]. At the first level, most users wish to keep control of scenario design, so that their personal preferences and expertise about sources can be taken into account. A generic model has been produced that allows users to describe their scenarios, so that the second level of the problem can be automated. The Xcollect application supports both the description of such scenarios and their execution. It is a Java application composed of two modules: the configuration module and the execution module.

Automating the data collection process makes it possible to keep up with frequent changes in source contents, since the data can be refreshed in a time-saving manner.


THE GENERIC SCENARIO MODEL AND THE CONFIGURATION MODULE

The Xcollect data retrieval process is based on a generic scenario model. In this model, a scenario is a succession of steps described by the XML Xcollect scenario_DTD. For each step, the following information is specified:

(1) source name and location,

(2) input formal name and value (inputs include parameters for query construction, as well as relevant data retrieved at any previous step),

(3) output formal name and type,

(4) patterns necessary to extract the useful data from the returned document (e.g. regular expressions).
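As an illustration, the four kinds of information listed above could be held in a structure like the one below. This is only a minimal sketch in Python (Xcollect itself is a Java application), and every name in it is hypothetical: the `ScenarioStep` class, its field names, the example source, URL, gene symbol and pattern are assumptions made for the example and are not taken from the scenario_DTD.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ScenarioStep:
    """One step of a retrieval scenario (hypothetical field names)."""
    source_name: str                                # (1) source name
    source_url: str                                 # (1) source location
    inputs: dict = field(default_factory=dict)      # (2) input formal name -> value
    outputs: dict = field(default_factory=dict)     # (3) output formal name -> type
    patterns: dict = field(default_factory=dict)    # (4) output formal name -> regex


def extract_outputs(step, document):
    """Apply each extraction pattern of a step to the returned document."""
    results = {}
    for name, pattern in step.patterns.items():
        # Each pattern has one capturing group, so findall returns
        # the list of captured strings.
        results[name] = re.findall(pattern, document)
    return results


# Hypothetical step and returned document, for illustration only.
step = ScenarioStep(
    source_name="ExampleGeneDB",
    source_url="http://example.org/search",
    inputs={"gene_symbol": "BRCA1"},
    outputs={"accession": "string"},
    patterns={"accession": r"Accession:\s*(\w+)"},
)
page = "Accession: AB000001\nAccession: AB000002\n"
print(extract_outputs(step, page))
```

Note that a single pattern may match several times in one document, so each output is kept as a list of values.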

 

The Xcollect configuration module offers an interface through which users manually enter all the information specifying their scenario. The entered data are then stored in an XML scenario document conforming to the generic scenario_DTD.
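To give an idea of what storing a scenario as XML can look like, the sketch below serializes a list of step descriptions with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`scenario`, `step`, `source`, `input`, `output`, `pattern`) and the example values are assumptions chosen for readability; the actual vocabulary is defined by the scenario_DTD, which is not reproduced here.

```python
import xml.etree.ElementTree as ET


def build_scenario_xml(steps):
    """Serialize step descriptions to an XML string (hypothetical vocabulary)."""
    root = ET.Element("scenario")
    for number, step in enumerate(steps, start=1):
        step_el = ET.SubElement(root, "step", number=str(number))
        # (1) source name and location
        ET.SubElement(step_el, "source",
                      name=step["source_name"], location=step["source_url"])
        # (2) input formal names and values
        for name, value in step["inputs"].items():
            ET.SubElement(step_el, "input", name=name).text = value
        # (3) output formal names and types, with (4) their extraction patterns
        for name, type_name in step["outputs"].items():
            ET.SubElement(step_el, "output", name=name, type=type_name,
                          pattern=step["patterns"][name])
    return ET.tostring(root, encoding="unicode")


# Hypothetical one-step scenario, for illustration only.
example_steps = [{
    "source_name": "ExampleGeneDB",
    "source_url": "http://example.org/search",
    "inputs": {"gene_symbol": "BRCA1"},
    "outputs": {"accession": "string"},
    "patterns": {"accession": r"Accession:\s*(\w+)"},
}]
scenario_xml = build_scenario_xml(example_steps)
print(scenario_xml)
```

A real implementation would also validate the resulting document against the DTD; the standard library module used here does not perform DTD validation.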


THE EXECUTION MODULE

The Xcollect execution module takes the XML scenario document as input, executes each step of the scenario and returns an XML document containing the retrieved data, structured according to the generic Xcollect session_DTD. Structuring the retrieved data also requires a model. In the absence of any existing standard solution, a simple generic session_DTD has been written on the basis of the scenario_DTD. It describes the steps of the scenario with their respective input and output data. Depending on the intended use of the data, appropriate XSL transformations should allow easy conversion of this generic representation into more human-readable documents.
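The execution loop described above can be sketched as follows. This is not the Xcollect implementation (which is written in Java) but a small Python illustration under stated assumptions: the XML vocabulary, the `from` attribute used to feed an earlier output into a later input, the `run_scenario` and `fake_fetch` functions, and the example sources are all hypothetical. The `fetch` argument stands in for the actual query to a web source, so the example runs without network access.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical two-step scenario; the second step reuses the first step's output.
SCENARIO_XML = r"""
<scenario>
  <step source="SourceA" location="http://example.org/a">
    <input name="gene" value="BRCA1"/>
    <output name="accession" pattern="Accession: (\w+)"/>
  </step>
  <step source="SourceB" location="http://example.org/b">
    <input name="id" from="accession"/>
    <output name="length" pattern="Length: (\d+)"/>
  </step>
</scenario>
"""


def run_scenario(scenario_xml, fetch):
    """Execute each step in order and return a session element."""
    scenario = ET.fromstring(scenario_xml.strip())
    session = ET.Element("session")
    collected = {}  # output formal name -> values retrieved at earlier steps
    for step in scenario.findall("step"):
        step_out = ET.SubElement(session, "step", source=step.get("source"))
        params = {}
        for inp in step.findall("input"):
            if inp.get("from") is not None:
                # Simplification: only the first earlier value is reused.
                values = collected.get(inp.get("from"), [])
                value = values[0] if values else ""
            else:
                value = inp.get("value", "")
            params[inp.get("name")] = value
            ET.SubElement(step_out, "input", name=inp.get("name")).text = value
        document = fetch(step.get("location"), params)
        for out in step.findall("output"):
            values = re.findall(out.get("pattern"), document)
            collected[out.get("name")] = values
            for found in values:
                ET.SubElement(step_out, "output", name=out.get("name")).text = found
    return session


def fake_fetch(location, params):
    """Stand-in for a web query; returns canned documents."""
    if location == "http://example.org/a":
        return "Accession: AB000001"
    return "Entry %s Length: 1234" % params["id"]


session = run_scenario(SCENARIO_XML, fake_fetch)
print(ET.tostring(session, encoding="unicode"))
```

The resulting session element records, for each step, the input values actually used and the output values extracted, mirroring the description of the session_DTD given above. Handling several answers per step, rather than only the first one, would require a more elaborate strategy than this sketch shows.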


AVAILABILITY

The Xcollect scenarios that have already been tested successfully are currently deployed as web services.

These are available for testing on our Xcollect Web Service browser at http://crick.loria.fr:8080/ws_browser/index.jsp


PUBLICATIONS

Devignes MD and Smaïl M (2004). Integration of Biological Data From Web Resources: Management of Multiple Answers Through Metadata Retrieval. Short paper, ISMB-ECCB, Glasgow, 31 July - 4 August 2004.

 

Devignes MD, Schaaff A and Smaïl M (2002). Collecte et intégration de données biologiques hétérogènes sur le Web – Xmap : application dans le domaine de la cartographie du génome humain. Revue des sciences et technologies de l’information (RSTI) – Série Ingénierie des systèmes d’information (ISI) 7: 45-61.