Bioscape Data Sources

From irefindex
Revision as of 17:20, 1 October 2009 by PaulBoddie (talk | contribs) (Initial version of data sources documentation.)

Bioscape data source packages share a number of common structural features which should facilitate the integration of new data types and sources.

Data Source Modules

Each Python package for a data source contains a number of modules - individual files which provide a collection of facilities - and most packages of this nature will contain modules with the following names:

  • download - the retrieval of data from the location where it is published
  • extract - the extraction of data from the Bioscape database for subsequent external processing
  • parse - the processing of data into a form suitable for Bioscape

The bioscape.sources.common package (in the bsadmin distribution) contains base classes which data source implementations can reuse, and which provide commonly needed conveniences. In addition, for textual data sources, the bsindex.sources.text package (in the bsindex distribution) contains a number of modules which also provide reusable classes:

  • citations - base class functionality for parsers
  • parse - file writing and configuration assistance

Data Source Acquisition

The bioscape.sources module (in the bsadmin distribution) and the bsindex.sources module (in the bsindex distribution) provide a unified API for the acquisition of a data source with a particular name and access to the principal data source operations. Generally, if data source packages support the principal modules (listed above), programs will be able to access the implemented operations through this generic interface.

Examples of accessing data sources can be found in the scripts in each distribution's scripts directory, including the following:

  • bioscape_download.py and bsindex_download.py - access to the downloading functionality of the appropriate data source through the fetch_files function
  • bioscape_parse.py and bsindex_index.py - access to the parsing functionality of the appropriate data source through the parse_files function

Note that the bsindex.sources module builds on the underlying infrastructure in the bioscape.sources module, adding support for bsindex-specific features to the common operations.
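The idea of acquiring a data source by name and reaching its operations through a generic interface can be sketched as follows. This is only an illustration of the pattern, not the actual bioscape.sources API: the get_operation helper, the demo_sources package and the example source are all hypothetical names, and the stand-in package exists only so that the sketch can be run without either distribution installed.

```python
import importlib
import sys
import types


def get_operation(package, source, module, function):
    """Import <package>.<source>.<module> and return the named function.

    Hypothetical helper: the real lookup in bioscape.sources may differ.
    """
    mod = importlib.import_module("%s.%s.%s" % (package, source, module))
    return getattr(mod, function)


def _install_demo_source():
    """Register a stand-in data source package (hypothetical names)."""
    root = types.ModuleType("demo_sources")
    root.__path__ = []  # an empty __path__ marks the module as a package
    source = types.ModuleType("demo_sources.example")
    source.__path__ = []
    download = types.ModuleType("demo_sources.example.download")
    # The argument list of fetch_files is an assumption made for this sketch.
    download.fetch_files = lambda data_dir: ["%s/example.dat" % data_dir]
    for m in (root, source, download):
        sys.modules[m.__name__] = m


_install_demo_source()

# A program only needs the source name and the operation it wants.
fetch_files = get_operation("demo_sources", "example", "download", "fetch_files")
fetched = fetch_files("/tmp/data")
```

Because the lookup depends only on module and function names, a new data source package becomes available to such programs as soon as it provides the principal modules.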

Downloading Data

The download module in each data source package has the responsibility of accessing remote data repositories through the most appropriate means and retrieving available resources. Each such module should expose a function called fetch_files in order to support the generic data source acquisition mechanism (described above).

The means of downloading data depends on the nature of the remote repository. PubMed abstracts are acquired by FTP and are tested against checksums to ensure the integrity of the downloaded data; Entrez Gene data files are acquired by FTP with no such integrity tests. Other remote repositories could be accessed via HTTP or other mechanisms.
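The integrity test mentioned above can be sketched as a comparison between a locally computed digest and a published one. This is a minimal sketch which assumes MD5 checksums; the file_md5 and verify_download names are hypothetical and are not taken from the Bioscape download modules, and the "published" value is computed locally here only to make the example self-contained.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file's digest matches the published checksum."""
    return file_md5(path) == expected_md5.strip().lower()


# Demonstration with a temporary file standing in for a downloaded resource.
payload = b"example payload\n"
published = hashlib.md5(payload).hexdigest()  # stands in for a published checksum

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "download.dat")
    with open(path, "wb") as f:
        f.write(payload)
    intact = verify_download(path, published)

    # Simulate a damaged transfer by appending stray bytes.
    with open(path, "ab") as f:
        f.write(b"corruption")
    corrupted = verify_download(path, published)
```

A download module following this approach would typically retry or report a failure when the comparison does not succeed, rather than passing the file on to the parser.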

Parsing Data

The parse module in each data source package should provide support for the processing of downloaded data and its conversion into an appropriate format for Bioscape. Each such module should expose a function called parse_files which accepts the details of files to be parsed and returns information about the files which were subsequently processed. Much of the general infrastructure for parsing is provided by a data source "configuration", as described below.
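The contract described above, where parse_files accepts file details and reports which files were processed, might look roughly like this. The signature shown is an assumption made for illustration: the real function relies on a data source configuration rather than a parse_one callback, and the file names are invented.

```python
import os
import tempfile


def parse_files(filenames, parse_one):
    """Parse each available file and return the names actually processed.

    Hypothetical signature: parse_one is called with an open text file.
    """
    processed = []
    for filename in filenames:
        if not os.path.isfile(filename):
            continue  # skip files which were never downloaded
        with open(filename, encoding="utf-8") as f:
            parse_one(f)
        processed.append(filename)
    return processed


# Demonstration: one file exists, the other does not.
records = []

with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "present.txt")
    missing = os.path.join(tmp, "missing.txt")
    with open(present, "w", encoding="utf-8") as f:
        f.write("first\nsecond\n")

    processed = parse_files(
        [present, missing],
        lambda f: records.extend(line.strip() for line in f),
    )
    processed_names = [os.path.basename(name) for name in processed]
```

Returning the processed names lets the calling script report exactly what was handled and detect files which were requested but unavailable.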

Data Source Configuration

A data source configuration is an abstraction built around a data directory containing downloaded files. Some of these files are regarded as processable using designated parsing tools, and the output is written by an "updater" object which produces files suitable for import into the database. For textual data sources, indexing is performed in addition to parsing, using appropriate tools to produce index resources which may then be searched.

The bioscape.sources.common.SourceConfig class (in the bsadmin distribution) is widely used to support the parse_files operation, with each data source defining its own specific parser, and with any specialised preprocessing being added to a source's own derivative of the common SourceConfig class. For example, when processing Entrez Gene data types, preprocessing is typically needed for the following:

  • For ASN.1 data, the gene2xml tool must be run so that an XML parser can be used on the result
  • For tab-delimited data, various UNIX tools are used to filter the raw downloaded files and to eliminate duplicate entries
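The preprocessing of tab-delimited data can be illustrated in Python, although the source itself uses UNIX tools for this step, so the following is a substitute technique and not the actual implementation. The filter_unique_rows function is a hypothetical name, and the sample rows are illustrative only.

```python
def filter_unique_rows(lines, keep):
    """Yield each distinct tab-delimited row accepted by keep, in input order."""
    seen = set()
    for line in lines:
        row = tuple(line.rstrip("\n").split("\t"))
        if not keep(row) or row in seen:
            continue
        seen.add(row)
        yield row


# Illustrative input: a header line, a duplicated row and a distinct row.
raw = [
    "#tax_id\tGeneID\tSymbol\n",
    "9606\t1\tA1BG\n",
    "9606\t1\tA1BG\n",
    "9606\t2\tA2M\n",
]

rows = list(filter_unique_rows(raw, keep=lambda row: not row[0].startswith("#")))
```

In a source-specific derivative of SourceConfig, a step of this kind would run before the parser sees the files, so that the parser can assume clean, duplicate-free input.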