Bioscape Development

From irefindex

Latest revision as of 13:48, 14 July 2010

Note: This documentation covers an unreleased product and is for internal use only.

Bioscape Development

This document describes a selection of development tasks that arise when extending Bioscape.

Adding New Scoring Methods

The following steps should be sufficient to define and make available a new scoring method.

  1. Add a new entry to the bioscape/sources/score/Resources/methods.txt file in the bsadmin distribution.
  2. Add new templates for scoring to the appropriate subdirectory of the bioscape/sql directory. For example, for a result score, add importdb-N.sql.in to the bioscape/sql/resultscore directory for a method called N.
  3. Create a new record in the text_method table. In PostgreSQL this can be done using a COPY command together with a file containing new lines from the methods.txt file.
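
Step 3 can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, methods.txt is assumed to be tab-separated with the method name in the first column, and the file path is made up. The sketch picks out the lines for methods that are not yet known and builds a COPY statement that would load a file containing them.

```python
def select_new_lines(lines, existing_names):
    """Return the lines whose first tab-separated field is not already known."""
    new_lines = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Assumption: the method name is the first tab-separated column.
        name = line.split("\t")[0]
        if name not in existing_names:
            new_lines.append(line)
    return new_lines


def copy_statement(table, path):
    """Build a PostgreSQL COPY statement reading tab-separated rows from a file."""
    # COPY ... FROM 'file' reads a file on the database server; the psql
    # \copy command is the client-side equivalent.
    return "COPY %s FROM '%s';" % (table, path.replace("'", "''"))


if __name__ == "__main__":
    # Hypothetical methods.txt contents and set of already imported methods.
    lines = ["simple\tSimple scoring\n", "N\tA new method\n"]
    print(select_new_lines(lines, {"simple"}))
    print(copy_statement("text_method", "/tmp/new_methods.txt"))
```

The selected lines would be written to the named file before the statement is run, for example through psql.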

Adding New Search Result Types

Defining new kinds of search results involves a number of modifications:

  1. Add a constant in bioscape.constants (found in bsadmin) for the new search result type if appropriate. For example, for predefined search result types:

    text_termtype_gene_ontology_term = 13

    Or, for speculative search result types:

    text_termid_chromosome = -3
  2. Add infrastructure to acquire and to emit the results, such as classes in modules within the bsindex.search package (found in bsindex). Such classes may include a phrase class in bsindex.search.phrases and a policy class in bsindex.search.policies, if the data of interest is found in an unconventional way.
  3. Add convenience functions to the bsindex.search package.
  4. Modify the bsindex_quickstart.py script (in bsindex) to configure the export of an appropriate search cache for the new result type, or to create a new search cache for the new type.
  5. The database import template should not normally need modifying but can be found in the bioscape/sql/search directory (in the bsadmin distribution).
  6. Add a translation for the constant in the bsweb/Resources/translations.xml file (in the bsweb distribution).
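
When choosing a value for a new constant in step 1, it may help to check that the value is not already taken. The following sketch is not part of Bioscape: the function name and the clashing example constant are hypothetical, and only the two names and values quoted in step 1 come from this document.

```python
def find_collisions(constants, prefix):
    """Map each value used more than once to the sorted names sharing it."""
    by_value = {}
    for name, value in constants.items():
        # Only compare constants belonging to the same family (same prefix).
        if name.startswith(prefix):
            by_value.setdefault(value, []).append(name)
    return {value: sorted(names)
            for value, names in by_value.items() if len(names) > 1}


if __name__ == "__main__":
    constants = {
        "text_termtype_gene_ontology_term": 13,
        "text_termtype_example_new_type": 13,  # hypothetical clash
        "text_termid_chromosome": -3,
    }
    print(find_collisions(constants, "text_termtype_"))
```

Assuming the constants are plain module-level integers, the mapping passed in could be obtained from the real module with something like vars(bioscape.constants).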

Adding New Data Sources and Types

Defining new data sources and data types involves a number of modifications:

  1. Add a new data source module. For example, for a "pure data" source (in bsadmin) involving only the database:
    bioscape.sources.chebi

    For a "data plus indexed text" source (in bsindex):

    bsindex.sources.pmcweb
    This involves creating a Python package at the appropriate place in the directory hierarchy, with an __init__.py file to indicate that a package (or subpackage) is present.
  2. Define a module which retrieves data from the actual source:
    bioscape.sources.chebi.download
  3. Define a module which parses the downloaded data:
    bioscape.sources.chebi.parse
  4. Add templates to implement the database schema for the data type, along with templates which support the import and update of such data. For example, within the bioscape/sql/chebi directory:
      init.sql.in
      drop.sql.in
      init-constraints.sql.in
      drop-constraints.sql.in
      import.sql.in
  5. Define configuration settings for the locations and details used in the above modules. For example...
      chebi_ftp_address
      chebi_data_directory
  6. Add functions to the scripts/bioscape_quickstart.py script to support the new data type.

See the Bioscape Data Sources document for more information about the structure of data sources.
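
Steps 1 to 3 above amount to creating a package with a download module and a parse module. The following sketch shows one way to lay out such a skeleton. The function name is hypothetical, the generated modules are empty placeholders, and the example writes into a temporary directory rather than a real bsadmin checkout.

```python
import os
import tempfile


def create_source_package(root, package, modules=("download", "parse")):
    """Create package directories with __init__.py files plus placeholder modules."""
    path = root
    for part in package.split("."):
        path = os.path.join(path, part)
        os.makedirs(path, exist_ok=True)
        # Each level needs an __init__.py file to be importable as a package.
        init = os.path.join(path, "__init__.py")
        if not os.path.exists(init):
            open(init, "w").close()
    created = []
    for module in modules:
        filename = os.path.join(path, module + ".py")
        # Existing files are left untouched.
        if not os.path.exists(filename):
            with open(filename, "w") as f:
                f.write('"""Placeholder for the %s step."""\n' % module)
        created.append(filename)
    return created


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    for filename in create_source_package(root, "bioscape.sources.chebi"):
        print(os.path.relpath(filename, root))
```

The SQL templates and configuration settings from steps 4 and 5 still need to be added by hand.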

Adding New Word Lists for Searching

New lists of words to be searched for when finding contextual information can be added as follows:

  1. Define a list of words in the Resources subdirectory of the bioscape.sources.bioentities package.
  2. Write a database template to import the data into the appropriate tables. For example, in the bioscape/sql/bioentities directory:
      import-adjectives-pgsql.sql.in
    This will make the new search terms available.
  3. The search result type can then be added as described above.

Database Constants

Some constant values stored in the database are referenced explicitly in various parts of the software. For such values, it is most convenient to define them in the bioscape.constants module and to reference them in the database templates used to initialise and populate the database.
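
The idea of defining a value once and referencing it from a database template can be sketched as follows. This is an illustration only: the placeholder syntax of the real .sql.in templates is not described here, so the sketch assumes $name placeholders as understood by Python's string.Template, and the table and column names are invented.

```python
from string import Template

# Stand-in for values that would be defined in bioscape.constants.
CONSTANTS = {"text_termtype_gene_ontology_term": 13}

# Hypothetical template text; the real templates may look quite different.
TEMPLATE = (
    "INSERT INTO example_termtype (termtype, name) "
    "VALUES ($text_termtype_gene_ontology_term, 'GO term');"
)


def expand(template_text, constants):
    """Replace $name placeholders with constant values; raise KeyError if one is unknown."""
    return Template(template_text).substitute(constants)


if __name__ == "__main__":
    print(expand(TEMPLATE, CONSTANTS))
```

Using substitute rather than safe_substitute means that a template referring to an undefined constant fails immediately instead of producing incomplete SQL.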

Other kinds of values may not be referenced in the source code in this way, and may also belong to data sets which may change over time (thus being only the initial values in a data set, rather than true constants). Such values should instead be defined in files which are used to import data into the database.