Difference between revisions of "Bioscape Workflow"

Revision as of 12:29, 23 November 2009

Preparing a working Bioscape system involves a sequence of activities. These are performed in the following general order (the names in brackets refer to functions in the bsindex_quickstart.py script, found in the scripts directory of the bsindex distribution):

Initialise basic resources (quickstart, init_database).
Import essential data and initialise data sources (update_sources, update_source).
Update derived information such as lexicon tables, scores for essential data (update_derived_sources).
Import textual data and initialise textual data sources (update_text_source).
Update text search results and related information such as result scores (update_text).
Initialise the Web database in order to present a coherent view of the system (init_web_database).
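
The ordering above can be captured as plain data, which may be useful when checking that one step is not run before another it depends on. The following is only a sketch: the stage descriptions and function names are taken from the list, but `WORKFLOW`, `stage_of` and `must_precede` are hypothetical helpers, and nothing is assumed about how the functions in bsindex_quickstart.py are actually invoked.

```python
# Workflow stages in order, each paired with the function names the list
# above associates with it. Arguments and return values of those functions
# are not described on this page, so they are treated as names only.
WORKFLOW = [
    ("Initialise basic resources", ["quickstart", "init_database"]),
    ("Import essential data and initialise data sources", ["update_sources", "update_source"]),
    ("Update derived information", ["update_derived_sources"]),
    ("Import textual data and initialise textual data sources", ["update_text_source"]),
    ("Update text search results and related information", ["update_text"]),
    ("Initialise the Web database", ["init_web_database"]),
]


def stage_of(function_name):
    """Return the 1-based position of the stage that mentions function_name."""
    for position, (_, functions) in enumerate(WORKFLOW, start=1):
        if function_name in functions:
            return position
    raise KeyError(function_name)


def must_precede(first, second):
    """True if the stage for `first` comes strictly before the stage for `second`."""
    return stage_of(first) < stage_of(second)
```

For instance, `must_precede("update_text_source", "update_text")` holds because text must be imported before it can be searched.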

The bioscape/sql directory (in the bsadmin distribution) provides a reasonable overview of the different activities, containing activity-specific directories which each contain templates for manipulating the database. The activities involved include the following:

Basic resource initialisation: dictionaries, score
Data source initialisation: chebi, gene, go, taxonomy
Derived data preparation: bioentities, searchscore, termscore
Textual data source initialisation: text
Search results and related data preparation: search, sentencescore, results, resultscore, evidence, evidencescore
Web data preparation: web-bioentities, web-index, web-search, web-results, web-evidence, web-resultscore
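
As a rough illustration, the grouping above can be written down as a mapping from activity to template directories. This sketch assumes that each listed name is a directory directly beneath bioscape/sql, which the text suggests but does not spell out; `ACTIVITY_DIRECTORIES` and `template_directories` are hypothetical names introduced here.

```python
from pathlib import PurePosixPath

# Root of the SQL templates, as named in the text above.
SQL_ROOT = PurePosixPath("bioscape/sql")

# Activity -> directory names, copied from the list above.
# Assumption: each name is a directory directly under SQL_ROOT.
ACTIVITY_DIRECTORIES = {
    "Basic resource initialisation": ["dictionaries", "score"],
    "Data source initialisation": ["chebi", "gene", "go", "taxonomy"],
    "Derived data preparation": ["bioentities", "searchscore", "termscore"],
    "Textual data source initialisation": ["text"],
    "Search results and related data preparation": [
        "search", "sentencescore", "results",
        "resultscore", "evidence", "evidencescore",
    ],
    "Web data preparation": [
        "web-bioentities", "web-index", "web-search",
        "web-results", "web-evidence", "web-resultscore",
    ],
}


def template_directories(activity):
    """Return the presumed template directory paths for one activity."""
    return [SQL_ROOT / name for name in ACTIVITY_DIRECTORIES[activity]]
```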

Labelling Data

Throughout the Bioscape schema, the notion of a generation is used to label a particular version or batch of data. Such labels are applied to source data as it is imported. Derived data receives its own, separate generation labels: although a batch of derived data may be derived entirely from one batch of source data, several derived batches can be produced from the same source data, perhaps using different settings to define the nature of the derived data.

Such a labelling policy is most obvious when dealing with textual data:

Text is indexed and its details recorded in the database, labelled using a text generation (and recorded in the text_generations table).
Searches are performed on text and their results are recorded, labelled using a search generation (and recorded in the text_search_generations table). Since the search parameters can be changed, potentially many search generations can exist for a single text generation.
Concrete bioentity suggestions are then compiled for each search result, labelled using a result generation (and recorded in the text_result_generations table). Since the parameters for this activity can be changed, potentially many result generations can exist for a single search generation.
Evidence suggestions are then compiled for pairs of concrete bioentity results, labelled using an evidence generation (and recorded in the text_evidence_generations table). Since the parameters for this activity can be changed, potentially many evidence generations can exist for a single result generation.

For example:

Indexed text: text generation 1
	Search results: search generation 1
		Bioentity results: result generation 1
	Search results: search generation 2
		Bioentity results: result generation 2
			Evidence results: evidence generation 1
	Search results: search generation 3
		Bioentity results: result generation 3
			Evidence results: evidence generation 2
Indexed text: text generation 2
	Search results: search generation 4

The consequence is a tree-like structure of data: each batch of original data supports a number of branches of derived data, each of which supports a number of further branches, and so on.
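
The tree of generations can be modelled in a few lines of Python. This is a minimal sketch for illustration only, not the Bioscape schema: the `Generation` class, the `CHILD_KIND` table and the in-memory representation are all assumptions made here, whereas the real labels live in the database tables named above. The example data reproduces the generations shown in the example tree.

```python
from dataclasses import dataclass, field

# Which kind of generation may be derived from which parent kind,
# following the text -> search -> result -> evidence chain described above.
CHILD_KIND = {"text": "search", "search": "result", "result": "evidence"}


@dataclass
class Generation:
    """One labelled batch of data (hypothetical in-memory model)."""
    kind: str      # "text", "search", "result" or "evidence"
    number: int    # the generation label
    children: list = field(default_factory=list)

    def derive(self, kind, number):
        """Attach a derived generation, rejecting kinds that skip a level."""
        if CHILD_KIND.get(self.kind) != kind:
            raise ValueError(f"cannot derive {kind} from {self.kind}")
        child = Generation(kind, number)
        self.children.append(child)
        return child

    def walk(self, depth=0):
        """Yield (depth, generation) pairs in depth-first order."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


# Rebuild the example tree shown above.
text1 = Generation("text", 1)
text1.derive("search", 1).derive("result", 1)
text1.derive("search", 2).derive("result", 2).derive("evidence", 1)
text1.derive("search", 3).derive("result", 3).derive("evidence", 2)
text2 = Generation("text", 2)
text2.derive("search", 4)

if __name__ == "__main__":
    for root in (text1, text2):
        for depth, gen in root.walk():
            print("\t" * depth + f"{gen.kind} generation {gen.number}")
```

Keeping the parent link explicit means that several search generations can hang from one text generation without any of them needing to know about its siblings, which mirrors the "potentially many" wording used for each level.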

@@ Line 3: / Line 3: @@
 # Initialise basic resources (<tt>quickstart</tt>, <tt>init_database</tt>).
 # Import essential data and initialise data sources (<tt>update_sources</tt>, <tt>update_source</tt>).
-# Update derived information such as lexicon tables, scores for essential data (<tt>update_derived_sources</tt>).
+# Update derived information such as lexicon tables, [[Bioscape Methods|scores for essential data]] (<tt>update_derived_sources</tt>).
 # Import textual data and initialise textual data sources (<tt>update_text_source</tt>).
-# Update text search results and related information such as result scores (<tt>update_text</tt>).
+# Update text search results and related information such as [[Bioscape Methods|result scores]] (<tt>update_text</tt>).
 # Initialise the Web database in order to present a coherent view of the system (<tt>init_web_database</tt>).
