Difference between revisions of "Bioscape Workflow"

From irefindex
Latest revision as of 13:38, 14 July 2010

Note: This documentation covers an unreleased product and is for internal use only.

The preparation of a working Bioscape system involves a number of activities in a workflow or schedule. These activities are performed in the following general order (with annotations referring to functions in the bsindex_quickstart.py script, found in the scripts directory of the bsindex distribution):

  1. Initialise basic resources (quickstart, init_database).
  2. Import essential data and initialise data sources (update_sources, update_source).
  3. Update derived information, such as lexicon tables and scores for essential data (update_derived_sources).
  4. Import textual data and initialise textual data sources (update_text_source).
  5. Update text search results and related information such as result scores (update_text).
  6. Initialise the Web database in order to present a coherent view of the system (init_web_database).
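
The ordering above can be sketched in Python as follows. This is only an illustration: the function names are those cited for bsindex_quickstart.py, but their real signatures are not documented here, so the sketch calls hypothetical zero-argument stand-ins supplied by the caller rather than the script's own functions.

```python
# Workflow steps in the documented order. The function names are the ones
# cited for bsindex_quickstart.py; how they are really invoked is not shown
# in this documentation, so nothing here imports that script.
WORKFLOW = [
    ("Initialise basic resources", ["quickstart", "init_database"]),
    ("Import essential data and initialise data sources",
     ["update_sources", "update_source"]),
    ("Update derived information", ["update_derived_sources"]),
    ("Import textual data and initialise textual data sources",
     ["update_text_source"]),
    ("Update text search results and related information", ["update_text"]),
    ("Initialise the Web database", ["init_web_database"]),
]


def run_workflow(actions, log=print):
    """Run one action per function name, step by step.

    `actions` maps each function name to a zero-argument callable. These
    callables are hypothetical stand-ins (an assumption of this sketch).
    Returns the names in the order they were called.
    """
    performed = []
    for number, (description, names) in enumerate(WORKFLOW, start=1):
        log(f"{number}. {description}")
        for name in names:
            actions[name]()
            performed.append(name)
    return performed


if __name__ == "__main__":
    # Placeholder actions that do nothing, just to show the ordering.
    all_names = [name for _, group in WORKFLOW for name in group]
    run_workflow({name: (lambda: None) for name in all_names})
```

Keeping the steps in a single ordered list makes the dependency between stages explicit: a later stage is never started before the earlier ones have been called.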

The bioscape/sql directory (in the bsadmin distribution) provides a reasonable overview of the different activities, containing activity-specific directories which each contain templates for manipulating the database. The activities involved include the following:

  1. Basic resource initialisation: dictionaries, score
  2. Data source initialisation: chebi, gene, go, taxonomy
  3. Derived data preparation: bioentities, searchscore, termscore
  4. Textual data source initialisation: text
  5. Search results and related data preparation: search, sentencescore, results, resultscore, evidence, evidencescore
  6. Web data preparation: web-bioentities, web-index, web-search, web-results, web-evidence, web-resultscore
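
As a rough aid to navigating the templates, the grouping above can be written down as a small Python mapping. The sketch assumes that each activity name listed is also the name of a directory directly under bioscape/sql; that layout is inferred from the description and should be checked against the bsadmin distribution.

```python
from pathlib import PurePosixPath

# Root of the SQL templates, as named in the text (bsadmin distribution).
SQL_ROOT = PurePosixPath("bioscape/sql")

# Activity groups and the activity names listed for each. Treating each name
# as a directory name under SQL_ROOT is an assumption of this sketch.
ACTIVITY_DIRECTORIES = {
    "Basic resource initialisation": ["dictionaries", "score"],
    "Data source initialisation": ["chebi", "gene", "go", "taxonomy"],
    "Derived data preparation": ["bioentities", "searchscore", "termscore"],
    "Textual data source initialisation": ["text"],
    "Search results and related data preparation": [
        "search", "sentencescore", "results", "resultscore",
        "evidence", "evidencescore",
    ],
    "Web data preparation": [
        "web-bioentities", "web-index", "web-search", "web-results",
        "web-evidence", "web-resultscore",
    ],
}


def template_directories(root=SQL_ROOT):
    """Return the assumed template directories in workflow order."""
    return [
        root / name
        for names in ACTIVITY_DIRECTORIES.values()
        for name in names
    ]


if __name__ == "__main__":
    for directory in template_directories():
        print(directory)
```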

Labelling Data

Throughout the Bioscape schema, the notion of a generation is used to label a particular version or batch of data. Such labels are applied to source data as it is imported. Derived data receives its own, separate generation labels: although a batch of derived data may be derived entirely from one batch of source data, several derived batches can be produced from the same source data, for example by employing different settings to define the nature of the derived data.

Such a labelling policy is most obvious when dealing with textual data:

  1. Text is indexed and its details recorded in the database, labelled using a text generation (and recorded in the text_generations table).
  2. Searches are performed on text and their results are recorded, labelled using a search generation (and recorded in the text_search_generations table). Since the search parameters can be changed, potentially many search generations can exist for a single text generation.
  3. Concrete bioentity suggestions are then compiled for each search result, labelled using a result generation (and recorded in the text_result_generations table). Since the parameters for this activity can be changed, potentially many result generations can exist for a single search generation.
  4. Evidence suggestions are then compiled for pairs of concrete bioentity results, labelled using an evidence generation (and recorded in the text_evidence_generations table). Since the parameters for this activity can be changed, potentially many evidence generations can exist for a single result generation.
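
The one-to-many relationships between the four generation tables can be sketched with an in-memory SQLite database. Only the table names come from the text above; the column names, the foreign keys and the use of SQLite are assumptions made for illustration and do not describe the real Bioscape schema. The sample rows mirror the example given in this section.

```python
import sqlite3

# Table names are taken from the text; every column is hypothetical.
SCHEMA = """
CREATE TABLE text_generations (
    generation INTEGER PRIMARY KEY
);
CREATE TABLE text_search_generations (
    generation INTEGER PRIMARY KEY,
    text_generation INTEGER NOT NULL
        REFERENCES text_generations (generation)
);
CREATE TABLE text_result_generations (
    generation INTEGER PRIMARY KEY,
    search_generation INTEGER NOT NULL
        REFERENCES text_search_generations (generation)
);
CREATE TABLE text_evidence_generations (
    generation INTEGER PRIMARY KEY,
    result_generation INTEGER NOT NULL
        REFERENCES text_result_generations (generation)
);
"""


def build_example():
    """Create the sketch schema and load rows matching the example."""
    connection = sqlite3.connect(":memory:")
    connection.executescript(SCHEMA)
    connection.executemany(
        "INSERT INTO text_generations VALUES (?)", [(1,), (2,)])
    connection.executemany(
        "INSERT INTO text_search_generations VALUES (?, ?)",
        [(1, 1), (2, 1), (3, 1), (4, 2)])
    connection.executemany(
        "INSERT INTO text_result_generations VALUES (?, ?)",
        [(1, 1), (2, 2), (3, 3)])
    connection.executemany(
        "INSERT INTO text_evidence_generations VALUES (?, ?)",
        [(1, 2), (2, 3)])
    return connection


def search_generations_for(connection, text_generation):
    """List the search generations derived from one text generation."""
    rows = connection.execute(
        "SELECT generation FROM text_search_generations "
        "WHERE text_generation = ? ORDER BY generation",
        (text_generation,))
    return [row[0] for row in rows]


def lineage(connection, evidence_generation):
    """Trace an evidence generation back to (result, search, text)."""
    return connection.execute(
        "SELECT e.result_generation, r.search_generation, s.text_generation "
        "FROM text_evidence_generations e "
        "JOIN text_result_generations r ON r.generation = e.result_generation "
        "JOIN text_search_generations s ON s.generation = r.search_generation "
        "WHERE e.generation = ?",
        (evidence_generation,)).fetchone()


if __name__ == "__main__":
    db = build_example()
    print(search_generations_for(db, 1))
    print(lineage(db, 1))
```

Because each derived table refers only to its immediate parent, any batch of derived data can be traced back to the text generation it ultimately came from by following the chain of references.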

For example:

  Indexed text (text generation 1)
    Search results (search generation 1)
      Bioentity results (result generation 1)
    Search results (search generation 2)
      Bioentity results (result generation 2)
        Evidence results (evidence generation 1)
    Search results (search generation 3)
      Bioentity results (result generation 3)
        Evidence results (evidence generation 2)
  Indexed text (text generation 2)
    Search results (search generation 4)

The consequence is a tree-like structure of data: each batch of original data supports a number of branches of derived data, each of which supports further branches, and so on.
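
The tree-like structure can also be modelled directly in memory. The following Python sketch is not part of Bioscape: the Generation class and its methods are hypothetical names chosen here, and the data reproduces the first text generation from the example.

```python
from dataclasses import dataclass, field


@dataclass
class Generation:
    """One labelled batch of data and the batches derived from it."""
    kind: str      # "text", "search", "result" or "evidence"
    number: int    # the generation label
    children: list = field(default_factory=list)

    def derive(self, kind, number):
        """Attach a derived generation to this one and return it."""
        child = Generation(kind, number)
        self.children.append(child)
        return child


def render(node, depth=0):
    """Return the tree as indented lines, one per generation."""
    lines = ["  " * depth + f"{node.kind} generation {node.number}"]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


def example_tree():
    """Build the tree for text generation 1 from the example."""
    text = Generation("text", 1)
    text.derive("search", 1).derive("result", 1)
    text.derive("search", 2).derive("result", 2).derive("evidence", 1)
    text.derive("search", 3).derive("result", 3).derive("evidence", 2)
    return text


if __name__ == "__main__":
    print("\n".join(render(example_tree())))
```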