Difference between revisions of "Bioscape Development"

From irefindex
(→‎Adding New Search Result Types: Revised version details.)
(→‎Adding New Data Sources and Types: Revised version information.)
Line 45: Line 45:
  <ol>
- <li>Add a new module (see "Adding New Modules" above). For example...
+ <li>Add a new data source module. For example, for a "pure data" source (in <tt>bsadmin</tt>) involving only the database:

-   <pre>bioscape.modules.chebi</pre></li>
+   <pre>bioscape.sources.chebi</pre>

- <li>Define modules which retrieve data from sources. For example, a module which uses FTP to download files and to place them in a special downloads directory. For example...
+ For a "data plus indexed text" source (in <tt>bsindex</tt>):

-   <pre>bioscape.modules.chebi.chebiftp</pre></li>
+   <pre>bsindex.sources.pmcweb</pre>

- <li>Define modules which parse the downloaded data, if necessary, producing import data files. For example...
+ This involves the usual creation of a Python package at the appropriate place in the directory hierarchy and with an <tt>__init__.py</tt> file to indicate that a package (or subpackage) is present.</li>

-   <pre>bioscape.modules.chebi.chebiparse</pre></li>
+ <li>Define a module which retrieves data from the actual source:

- <li>Add templates to implement the database schema for the data type, along with templates which support the import and update of such data. For example...
+   <pre>bioscape.sources.chebi.download</pre></li>

-   <pre>
-   chebi-pgsql.sql.in
-   drop-chebi-pgsql.sql.in
-   chebi-constraints-pgsql.sql.in
-   drop-chebi-constraints-pgsql.sql.in
-   import-chebi-pgsql.sql.in
-   update-chebi-pgsql.sql.in</pre></li>
+ <li>Define a module which parses the downloaded data:
+
+   <pre>bioscape.sources.chebi.parse</pre></li>

- <li>Add a bulk import module for the data type. For example...
+ <li>Add templates to implement the database schema for the data type, along with templates which support the import and update of such data. For example, within the <tt>bioscape/sql/chebi</tt> directory:

-   <pre>bioscape.modules.chebi.bulk</pre></li>
+   <pre>
+   init.sql.in
+   drop.sql.in
+   init-constraints.sql.in
+   drop-constraints.sql.in
+   import.sql.in</pre></li>

  <li>Define configuration settings for the locations and details used in the above modules. For example...
Line 76: Line 77:
    chebi_ftp_address
    chebi_data_directory</pre></li>

- <li>Add scripts which download and process data and import such data. For example...
-
-   <pre>
-   scripts/bioscape_get_chebi.py
-   scripts/bioscape_import_chebi.py</pre></li>
-
  <li>Add functions to the <tt>scripts/bioscape_quickstart.py</tt> script to support the new data type.</li>

Revision as of 17:33, 21 July 2009


Please note that this documentation covers an unreleased product and is for internal use only.


Bioscape Development

This document describes a selection of different development tasks undertaken when improving Bioscape.

Adding New Scoring Methods

The following steps should be sufficient to define and make available a new scoring method.

  1. Add a new entry to the bioscape/sources/score/Resources/methods.txt file in the bsadmin distribution.
  2. Add new templates for scoring to the appropriate subdirectory of the bioscape/sql directory. For example, for a result score, add importdb-N.sql.in to the bioscape/sql/resultscore directory for a method called N.
  3. Create a new record in the text_method table. In PostgreSQL this can be done using a COPY command together with a file containing new lines from the methods.txt file.
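Step 3 can be prepared with a small helper which selects only those lines of methods.txt that are not yet in the database and writes them to a file for COPY. The following is a minimal sketch, not part of Bioscape: it assumes that methods.txt holds one method per line with tab-separated columns and the method identifier in the first column, and the function names are purely illustrative.

```python
def new_method_lines(lines, known_ids):
    """Return the lines whose first tab-separated field is not in known_ids.

    Assumption: each non-empty line describes one method and starts with
    its identifier; the real layout of methods.txt may differ.
    """
    selected = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        method_id = line.split("\t", 1)[0]
        if method_id not in known_ids:
            selected.append(line)
    return selected


def write_copy_file(lines, path):
    """Write the selected lines, one per line, to a file which COPY can read."""
    with open(path, "w", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")
```

The resulting file could then be loaded in PostgreSQL with a statement of the form COPY text_method FROM 'filename', provided that the column order in the file matches the table.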

Adding New Search Result Types

Defining new kinds of search results involves a number of modifications:

  1. Add a constant in bioscape.constants (found in bsadmin) for the new search result type if appropriate. For example...
    text_termtype_gene_ontology_term = 13

    ...for predefined search result types, or...

    text_termid_chromosome = -3
    ...for speculative search result types.
  2. Add infrastructure to acquire and to emit the results, such as classes in modules within the bsindex.search package (found in bsindex). Such classes may include a phrase class in bsindex.search.phrases, if the data of interest is found in an unconventional way.
  3. Add convenience functions to the bsindex.search package.
  4. Modify the bsindex_quickstart.py script (in bsindex) to configure the export of an appropriate search cache for the new result type, or to create a new search cache for the new type.
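When adding a constant in step 1, it may be worth checking that the chosen value is not already used by another constant of the same kind. The following sketch shows one way to do this. The dictionary stands in for the contents of bioscape.constants, the name text_termtype_new_result_type is hypothetical, and the assumption that values must be unique within a prefix is ours rather than something the module is known to enforce.

```python
def find_value_collisions(constants, prefix):
    """Group constant names under a prefix which share the same integer value.

    constants maps names to values, for example vars(some_module).
    Returns a dictionary mapping each duplicated value to the sorted names.
    """
    by_value = {}
    for name, value in constants.items():
        if name.startswith(prefix) and isinstance(value, int):
            by_value.setdefault(value, []).append(name)
    return {v: sorted(names) for v, names in by_value.items() if len(names) > 1}


# Illustrative stand-in for the contents of bioscape.constants.
example_constants = {
    "text_termtype_gene_ontology_term": 13,
    "text_termtype_new_result_type": 13,  # hypothetical, clashes on purpose
    "text_termid_chromosome": -3,
}

print(find_value_collisions(example_constants, "text_termtype_"))
```

With the real module, the same function could be applied to vars(bioscape.constants) before committing a new constant.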

Adding New Data Sources and Types

Defining a new data source or data type involves a number of modifications:

  1. Add a new data source module. For example, for a "pure data" source (in bsadmin) involving only the database:
    bioscape.sources.chebi

    For a "data plus indexed text" source (in bsindex):

    bsindex.sources.pmcweb
    This involves the usual creation of a Python package at the appropriate place in the directory hierarchy and with an __init__.py file to indicate that a package (or subpackage) is present.
  2. Define a module which retrieves data from the actual source:
    bioscape.sources.chebi.download
  3. Define a module which parses the downloaded data:
    bioscape.sources.chebi.parse
  4. Add templates to implement the database schema for the data type, along with templates which support the import and update of such data. For example, within the bioscape/sql/chebi directory:
      init.sql.in
      drop.sql.in
      init-constraints.sql.in
      drop-constraints.sql.in
      import.sql.in
  5. Define configuration settings for the locations and details used in the above modules. For example...
      chebi_ftp_address
      chebi_data_directory
  6. Add functions to the scripts/bioscape_quickstart.py script to support the new data type.
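The file layout described in steps 1 to 4 can be created with a short script. This is only a sketch under assumptions: the function scaffold_source is hypothetical, it creates empty placeholder files, and it expects the bioscape and bioscape/sources packages to exist already (it does not create their __init__.py files).

```python
from pathlib import Path

# Template names taken from step 4 above.
SQL_TEMPLATES = [
    "init.sql.in",
    "drop.sql.in",
    "init-constraints.sql.in",
    "drop-constraints.sql.in",
    "import.sql.in",
]


def scaffold_source(root, name):
    """Create empty package and template files for a new "pure data" source.

    root is the top of the distribution and name is the source name,
    for example "chebi". Existing files are left untouched.
    """
    root = Path(root)

    # Python package: bioscape/sources/<name> with download and parse modules.
    package = root / "bioscape" / "sources" / name
    package.mkdir(parents=True, exist_ok=True)
    for module in ("__init__.py", "download.py", "parse.py"):
        (package / module).touch()

    # SQL templates: bioscape/sql/<name>.
    sql_dir = root / "bioscape" / "sql" / name
    sql_dir.mkdir(parents=True, exist_ok=True)
    for template in SQL_TEMPLATES:
        (sql_dir / template).touch()

    return package, sql_dir
```

Calling scaffold_source(".", "chebi") from the root of the distribution would produce the chebi layout used as the example above; the configuration settings and quickstart functions still have to be added by hand.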

Adding New Word Lists for Searching

New lists of words to be searched when finding contextual information can be added as follows:

  1. Define a list of words. This list may be used directly by the finder implementations, or it may be imported into the database and combined with other information.
  2. Where the list is imported into the database, write database templates to define the tables involved, along with a template to import the data. For example...
      adjectives-pgsql.sql.in
      drop-adjectives-pgsql.sql.in
      import-adjectives-pgsql.sql.in
  3. If necessary, add a bulk import function for the list. For simple lists, this step is not necessary.
  4. Where a database table is involved, potentially in combination with other tables, define a database reader class in the bioscape.modules.text.finders.database module, along with a function returning instances of that class.
  5. Update the scripts to include the new data source:
      scripts/bioscape_search_cache.py
      scripts/bioscape_search_text.py (potentially only messages and comments)
  6. Add a constant indicating the type of contextual information to the bioscape.constants module and to the translations.xml file provided for the Web application.
  7. Define a finder class in the bioscape.modules.text.finders.lucene module to search for the data provided by the new source, using suitable mix-in classes and the newly defined constant for the data source.
  8. Update the bioscape.modules.text.finders module so that the appropriate finder class is identified when the new context type is supplied to the get_context_term_finder function.
  9. Add functions to the scripts/bioscape_quickstart.py script to support the new data source.
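The lookup described in step 8 can be pictured as a mapping from context type constants to finder classes. The sketch below illustrates that idea only: the constants, the two finder classes and the mapping are hypothetical stand-ins, and the real get_context_term_finder function may have a different signature and behaviour.

```python
# Hypothetical context type constants; real values belong in bioscape.constants.
context_adjective = 1
context_verb = 2


class AdjectiveFinder:
    """Placeholder for a finder class defined in the lucene finders module."""

    def __init__(self, context_type):
        self.context_type = context_type


class VerbFinder:
    """Second placeholder finder, included to show the dispatch."""

    def __init__(self, context_type):
        self.context_type = context_type


# Adding a new word list means adding one entry to this mapping.
_finder_classes = {
    context_adjective: AdjectiveFinder,
    context_verb: VerbFinder,
}


def get_context_term_finder(context_type):
    """Return a finder instance for the given context type."""
    try:
        cls = _finder_classes[context_type]
    except KeyError:
        raise ValueError("no finder for context type %r" % (context_type,))
    return cls(context_type)
```

Keeping the mapping in one place means that an unsupported context type fails immediately with a clear error instead of silently returning nothing.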

Database Constants

Some constant values stored in the database are referenced explicitly in various parts of the software. For such values, it is most convenient to define them in the bioscape.constants module and to reference them in the database templates used to initialise and populate the database.

Other kinds of values may not be referenced in the source code in this way, and may also belong to data sets which may change over time (thus being only the initial values in a data set, rather than true constants). Such values should instead be defined in files which are used to import data into the database.
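To illustrate the first approach, the sketch below fills a database template with values taken from a constants mapping. It rests on several assumptions: the placeholder syntax of the .sql.in templates is not described here, so the $name syntax of Python's string.Template is used as a stand-in; the table and column names are invented for the example; and render_template is a hypothetical helper, not a Bioscape function.

```python
from string import Template

# Stand-in for values defined in bioscape.constants.
constants = {"text_termtype_gene_ontology_term": 13}

# Hypothetical template text; real templates live under bioscape/sql.
template_text = (
    "INSERT INTO example_table (termtype) "
    "VALUES ($text_termtype_gene_ontology_term);"
)


def render_template(text, values):
    """Substitute constant values into template text.

    Template.substitute raises KeyError when a placeholder has no value,
    so a constant missing from the mapping is reported rather than ignored.
    """
    return Template(text).substitute(values)


print(render_template(template_text, constants))
```

Defining the value once and substituting it into the templates keeps the source code and the database contents in agreement.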

Generating API Documentation

The tools directory contains a program which generates API documentation and places it in an apidocs directory at the root of the distribution:

python tools/apidocs.py

The generated documentation is principally useful as a reference to the API, rather than as a resource illustrating the architecture of the system or as a guide to writing new components.