Bioscape Workflow

From irefindex
Revision as of 12:29, 23 November 2009 by PaulBoddie (Added links to methods/scoring documentation.)

The preparation of a working Bioscape system involves a number of activities in a workflow or schedule. These activities are performed in the following general order (with annotations referring to functions in the bsindex_quickstart.py script, found in the scripts directory of the bsindex distribution):

  1. Initialise basic resources (quickstart, init_database).
  2. Import essential data and initialise data sources (update_sources, update_source).
  3. Update derived information, such as lexicon tables and scores for essential data (update_derived_sources).
  4. Import textual data and initialise textual data sources (update_text_source).
  5. Update text search results and related information such as result scores (update_text).
  6. Initialise the Web database in order to present a coherent view of the system (init_web_database).
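
The ordering above can be sketched as a small driver. This is a minimal sketch and not the contents of bsindex_quickstart.py: the function names are taken from the list above, but their real signatures are not described here, so the sketch accepts a hypothetical `actions` mapping of caller-supplied, zero-argument callables instead of importing the script.

```python
# Workflow steps in the order given above. The function names come from the
# annotations in the list; how the real script invokes them is an assumption.
WORKFLOW = [
    ("Initialise basic resources", ["quickstart", "init_database"]),
    ("Import essential data", ["update_sources", "update_source"]),
    ("Update derived information", ["update_derived_sources"]),
    ("Import textual data", ["update_text_source"]),
    ("Update text search results", ["update_text"]),
    ("Initialise the Web database", ["init_web_database"]),
]


def run_workflow(actions, steps=WORKFLOW):
    """Call the supplied actions in workflow order.

    `actions` is a hypothetical mapping from function name to a
    zero-argument callable. Names missing from the mapping are skipped.
    Returns the names that were actually called, in order.
    """
    performed = []
    for _description, names in steps:
        for name in names:
            action = actions.get(name)
            if action is None:
                continue  # nothing supplied for this function name
            action()
            performed.append(name)
    return performed
```

Because the order lives in the `WORKFLOW` table and not in the callers, a later activity cannot run before an earlier one even if the mapping is built in a different order.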

The bioscape/sql directory (in the bsadmin distribution) gives a reasonable overview of the different activities: it contains activity-specific directories, each holding templates for manipulating the database. The directories are grouped by activity as follows:

  1. Basic resource initialisation: dictionaries, score
  2. Data source initialisation: chebi, gene, go, taxonomy
  3. Derived data preparation: bioentities, searchscore, termscore
  4. Textual data source initialisation: text
  5. Search results and related data preparation: search, sentencescore, results, resultscore, evidence, evidencescore
  6. Web data preparation: web-bioentities, web-index, web-search, web-results, web-evidence, web-resultscore
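
The grouping above can be written down as a lookup table. The following is a rough sketch only: it assumes that the names in the list are literally the directory names under bioscape/sql, and the `template_directories` helper is hypothetical, not part of the bsadmin distribution.

```python
from pathlib import Path

# Activity -> template directories, copied from the list above.
# Insertion order is preserved, so iteration follows the workflow order.
ACTIVITY_DIRECTORIES = {
    "Basic resource initialisation": ["dictionaries", "score"],
    "Data source initialisation": ["chebi", "gene", "go", "taxonomy"],
    "Derived data preparation": ["bioentities", "searchscore", "termscore"],
    "Textual data source initialisation": ["text"],
    "Search results and related data preparation": [
        "search", "sentencescore", "results",
        "resultscore", "evidence", "evidencescore",
    ],
    "Web data preparation": [
        "web-bioentities", "web-index", "web-search",
        "web-results", "web-evidence", "web-resultscore",
    ],
}


def template_directories(sql_root="bioscape/sql"):
    """Return the template directory paths in workflow order.

    Hypothetical helper: it only builds paths and does not check
    that they exist on disk.
    """
    root = Path(sql_root)
    return [
        root / name
        for names in ACTIVITY_DIRECTORIES.values()
        for name in names
    ]
```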

Labelling Data

Throughout the Bioscape schema, the notion of a generation is used to label a particular version or batch of data. Such labels are applied to source data as it is imported. When derived data is prepared, separate generation labels are applied: although a batch of derived data may be derived entirely from one batch of source data, several derived batches can be produced from the same source data, perhaps using different settings to define the nature of the derived data.

Such a labelling policy is most obvious when dealing with textual data:

  1. Text is indexed and its details recorded in the database, labelled using a text generation (and recorded in the text_generations table).
  2. Searches are performed on text and their results are recorded, labelled using a search generation (and recorded in the text_search_generations table). Since the search parameters can be changed, potentially many search generations can exist for a single text generation.
  3. Concrete bioentity suggestions are then compiled for each search result, labelled using a result generation (and recorded in the text_result_generations table). Since the parameters for this activity can be changed, potentially many result generations can exist for a single search generation.
  4. Evidence suggestions are then compiled for pairs of concrete bioentity results, labelled using an evidence generation (and recorded in the text_evidence_generations table). Since the parameters for this activity can be changed, potentially many evidence generations can exist for a single result generation.
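
The chain of labels described in these four steps can be modelled as records that each point to their parent generation. This is an illustrative sketch, not the Bioscape schema: the `Generation` class, its fields and the `PARENT_KIND` table are assumptions, and the actual columns of the text_generations and related tables are not described here.

```python
from dataclasses import dataclass
from typing import Optional

# Which kind of generation each kind is derived from, per the steps above.
PARENT_KIND = {
    "text": None,
    "search": "text",
    "result": "search",
    "evidence": "result",
}


@dataclass(frozen=True)
class Generation:
    """Hypothetical in-memory stand-in for a generation label."""

    kind: str
    number: int
    parent: Optional["Generation"] = None

    def __post_init__(self):
        # Reject chains that skip a level, e.g. evidence derived from text.
        expected = PARENT_KIND[self.kind]
        actual = self.parent.kind if self.parent else None
        if expected != actual:
            raise ValueError(
                f"{self.kind} generation needs a {expected} parent, got {actual}"
            )

    def lineage(self):
        """Return the labels from the text generation down to this one."""
        chain = []
        node = self
        while node is not None:
            chain.append(f"{node.kind} generation {node.number}")
            node = node.parent
        return list(reversed(chain))
```

Several child generations may share one parent, which matches the statement that many search generations can exist for a single text generation, and so on down the chain.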

For example:

Indexed text
  text generation 1
    Search results
      search generation 1
        Bioentity results
          result generation 1
    Search results
      search generation 2
        Bioentity results
          result generation 2
            Evidence results
              evidence generation 1
    Search results
      search generation 3
        Bioentity results
          result generation 3
            Evidence results
              evidence generation 2
Indexed text
  text generation 2
    Search results
      search generation 4

The result is a tree-like structure of data: each batch of original data supports a number of branches of derived data, each of which supports further branches in turn, and so on.
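
The tree can also be expressed as a nested mapping and walked recursively. The sketch below encodes the example above, assuming that each generation in the example is derived from the nearest preceding generation of the level above it; the `render` and `paths` helpers are hypothetical and only illustrate the shape of the data.

```python
# Each generation label maps to the generations derived from it.
# The nesting is an assumed reading of the example above.
EXAMPLE = {
    "text generation 1": {
        "search generation 1": {"result generation 1": {}},
        "search generation 2": {
            "result generation 2": {"evidence generation 1": {}},
        },
        "search generation 3": {
            "result generation 3": {"evidence generation 2": {}},
        },
    },
    "text generation 2": {"search generation 4": {}},
}


def render(tree, depth=0):
    """Return the tree as indented lines, two spaces per level."""
    lines = []
    for label, children in tree.items():
        lines.append("  " * depth + label)
        lines.extend(render(children, depth + 1))
    return lines


def paths(tree, prefix=()):
    """Yield every path from a text generation to a leaf generation."""
    for label, children in tree.items():
        path = prefix + (label,)
        if children:
            yield from paths(children, path)
        else:
            yield path
```

Each path produced by `paths` corresponds to one branch of the tree, that is, one complete chain of derivation from a batch of indexed text to the most derived data available for it.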