Bioscape Data Sources

From irefindex
Revision as of 13:39, 14 July 2010 by PaulBoddie (talk | contribs) (Added status note.)
Note: This documentation covers an unreleased product and is for internal use only.

Bioscape data source packages share a number of common structural features which should facilitate the integration of new data types and sources. See the "Supported Data Sources" document for details of specific data sources supported by Bioscape.

Data Source Modules

Each Python package for a data source contains a number of modules (individual files, each providing a collection of facilities). Most such packages contain modules with the following names:

  • download - the retrieval of data from the location where it is published
  • extract - the extraction of data from the Bioscape database for subsequent external processing
  • parse - the processing of data into a form suitable for Bioscape

The bioscape.sources.common package (in the bsadmin distribution) contains base classes which data source implementations can reuse, providing commonly needed conveniences. In addition, for textual data sources, the bsindex.sources.text package (in the bsindex distribution) contains a number of modules which also provide reusable classes:

  • citations - base class functionality for parsers
  • parse - file writing and configuration assistance

Overview of Data Processing

The following diagram summarises data processing activities in Bioscape:

  • Incoming data: Download, then Parse
  • Database: Storage
  • Outgoing data for external use: Export, Extraction

Data Source Acquisition

The bioscape.sources module (in the bsadmin distribution) and the bsindex.sources module (in the bsindex distribution) provide a unified API for the acquisition of a data source with a particular name and access to the principal data source operations. Generally, if data source packages support the principal modules (listed above), programs will be able to access the implemented operations through this generic interface.
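
The idea of acquiring a data source by name can be sketched as follows. This is a minimal illustration and not the actual bioscape.sources API: the get_operation function, the demo_sources package and the example source are all hypothetical names, and the stand-in modules are registered only so that the sketch runs without Bioscape installed.

```python
import importlib
import sys
import types


def get_operation(root, source_name, module_name, function_name):
    """Return a principal operation of a named data source.

    Imports <root>.<source_name>.<module_name> and looks up function_name.
    (Hypothetical helper; the real lookup may work differently.)
    """
    qualified = f"{root}.{source_name}.{module_name}"
    try:
        module = importlib.import_module(qualified)
    except ImportError as exc:
        raise LookupError(
            f"no {module_name} module for source {source_name!r}") from exc
    operation = getattr(module, function_name, None)
    if not callable(operation):
        raise LookupError(f"{qualified} does not expose {function_name}")
    return operation


def _register_demo_source():
    """Register stand-in modules under the hypothetical demo_sources package."""
    for name in ("demo_sources", "demo_sources.example"):
        package = types.ModuleType(name)
        package.__path__ = []  # an (empty) __path__ marks the module as a package
        sys.modules[name] = package
    download = types.ModuleType("demo_sources.example.download")
    download.fetch_files = lambda: ["example.dat"]
    sys.modules[download.__name__] = download


_register_demo_source()
fetch_files = get_operation("demo_sources", "example", "download", "fetch_files")
print(fetch_files())
```

A lookup of this kind only works if every data source package uses the same module and function names, which is why the principal modules follow a fixed naming scheme.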

Examples of accessing data sources can be found in the scripts in each distribution's scripts directory, including the following:

  • bioscape_download.py and bsindex_download.py - access to the downloading functionality of the appropriate data source through the fetch_files function
  • bioscape_parse.py and bsindex_index.py - access to the parsing functionality of the appropriate data source through the parse_files function

Note that the underlying infrastructure in the bioscape.sources module is augmented by that in the bsindex.sources module, thus supporting specific features in the common operations.

Downloading Data

The download module in each data source package has the responsibility of accessing remote data repositories through the most appropriate means and retrieving available resources. Each such module should expose a function called fetch_files in order to support the generic data source acquisition mechanism (described above).

The means of downloading data depends on the nature of the remote repository. PubMed abstracts are acquired by FTP and involve tests against checksums to ensure the integrity of the downloaded data; Entrez Gene data files are acquired by FTP with no such integrity tests employed. Other remote repositories could be accessed via HTTP or other mechanisms.
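
A checksum test of the kind mentioned above might look like the following sketch. It assumes MD5 checksums published as hexadecimal strings, which is an assumption rather than something this document states; md5_of_file and verify_download are hypothetical names, and a temporary file stands in for a real download.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=65536):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file's digest matches the published checksum."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Demonstration with a temporary file standing in for downloaded data.
with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "sample.dat")
    payload = b"stand-in for downloaded data"
    with open(path, "wb") as handle:
        handle.write(payload)
    published = hashlib.md5(payload).hexdigest()
    intact = verify_download(path, published)
    corrupted = verify_download(path, "0" * 32)

print(intact, corrupted)
```

Reading in chunks keeps memory use constant, which matters for large archive files.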

Parsing Data

The parse module in each data source package should provide support for the processing of downloaded data and its conversion into an appropriate format for Bioscape. Each such module should expose a function called parse_files which accepts the details of files to be parsed and returns information about the files which were subsequently processed. Much of the general infrastructure for parsing is provided by a data source "configuration", as described below.
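
The contract described above (file details in, information about processed files out) can be sketched as follows. The signature, the return format and the count_records parser are assumptions made for illustration; the real parse_files functions may accept and return different structures.

```python
import os
import tempfile


def count_records(handle):
    """Hypothetical parser: treat each non-blank line as one record."""
    return sum(1 for line in handle if line.strip())


def parse_files(filenames, parser=count_records):
    """Parse the given files and report those that were actually processed.

    Returns a list of (filename, record_count) pairs. Missing files are
    skipped, so the result may be shorter than the input.
    """
    processed = []
    for filename in filenames:
        if not os.path.isfile(filename):
            continue
        with open(filename, encoding="utf-8") as handle:
            processed.append((filename, parser(handle)))
    return processed


# Demonstration: one file exists, the other does not.
with tempfile.TemporaryDirectory() as directory:
    present = os.path.join(directory, "records.txt")
    with open(present, "w", encoding="utf-8") as handle:
        handle.write("first\n\nsecond\n")
    missing = os.path.join(directory, "absent.txt")
    report = parse_files([present, missing])

print([(os.path.basename(name), count) for name, count in report])
```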

Data Source Configuration

A data source configuration is an abstraction over a data directory containing downloaded files, some of which are regarded as processable using designated parsing tools. Output is written by an "updater" object, which produces files suitable for import into the database. For textual data sources, indexing is performed in addition to parsing, using appropriate tools to produce index resources which may then be searched.

The bioscape.sources.common.SourceConfig class (in the bsadmin distribution) is widely used to support the parse_files operation, with each data source defining its own specific parser, and with any specialised preprocessing being added to a source's own derivative of the common SourceConfig class. For example, when processing Entrez Gene data types, preprocessing is typically needed for the following:

  • For ASN.1 data, the gene2xml tool must be run so that an XML parser can be used on the result
  • For tab-delimited data, various UNIX tools are used to filter the raw downloaded files and to eliminate duplicate entries
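
The pattern of adding preprocessing in a derivative of a common configuration class can be sketched as follows. The SourceConfig class here is a simplified, hypothetical stand-in and not the real bioscape.sources.common.SourceConfig; its method names are assumptions, and the filtering is done in Python rather than with the UNIX tools mentioned above. Treating lines starting with "#" as headers is also an assumption.

```python
import csv


class SourceConfig:
    """Hypothetical stand-in for a common configuration base class."""

    def preprocess(self, lines):
        """Default: pass the raw lines through unchanged."""
        return lines

    def parse(self, lines):
        """Preprocess the lines, then split each into tab-delimited fields."""
        return list(csv.reader(self.preprocess(lines), delimiter="\t"))


class TabDelimitedConfig(SourceConfig):
    """Derivative adding source-specific preprocessing."""

    def preprocess(self, lines):
        """Drop comment/blank lines and duplicate entries, keeping order."""
        seen = set()
        for line in lines:
            if line.startswith("#") or not line.strip():
                continue
            if line in seen:
                continue
            seen.add(line)
            yield line


# Sample input with a header line and one duplicate entry.
raw = [
    "#tax_id\tGeneID\tSymbol\n",
    "9606\t1\tA1BG\n",
    "9606\t1\tA1BG\n",
    "9606\t2\tA2M\n",
]
rows = TabDelimitedConfig().parse(raw)
print(rows)
```

Keeping the preprocessing in an overridable method leaves the shared parsing logic in the base class untouched.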

Exporting and Extracting Data

Data can be retrieved from the Bioscape database by either exporting or extracting such data. The key difference between exporting and extracting is as follows:

  • The export of data can be achieved by issuing a query and merely exporting the results to a file without further postprocessing
  • The extraction of data requires additional postprocessing, the code for which is located in the extract module of the appropriate data source package

An example of postprocessing is that performed when extracting keywords from Entrez Gene records: although a database query is issued to obtain gene summary fields, the tokenisation process performed on these fields can only be done in a program outside the database system. Consequently, this activity is supported by an extraction mechanism as opposed to a more straightforward export mechanism.
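
The difference between the two mechanisms can be sketched with an in-memory SQLite database. Everything here is illustrative: the gene_summary table, its columns, the sample summary text and the simple tokeniser are made up for the example and do not reflect the Bioscape schema or its actual tokenisation process.

```python
import csv
import io
import re
import sqlite3


def export_rows(connection, query, out):
    """Export: write query results to a file with no further processing."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(connection.execute(query))


def extract_keywords(connection, query, out):
    """Extraction: postprocess each summary outside the database."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for gene_id, summary in connection.execute(query):
        # Naive tokenisation: lower-case alphanumeric runs, de-duplicated.
        tokens = sorted(set(re.findall(r"[a-z0-9]+", summary.lower())))
        for token in tokens:
            writer.writerow([gene_id, token])


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE gene_summary (gene_id INTEGER, summary TEXT)")
connection.execute(
    "INSERT INTO gene_summary VALUES (?, ?)",
    (1, "Plasma glycoprotein; plasma protein"),
)

query = "SELECT gene_id, summary FROM gene_summary ORDER BY gene_id"
exported = io.StringIO()
extracted = io.StringIO()
export_rows(connection, query, exported)
extract_keywords(connection, query, extracted)
connection.close()

print(exported.getvalue(), end="")
print(extracted.getvalue(), end="")
```

Both functions issue the same query; only the extraction path runs additional code on the results before writing them out.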