Difference between revisions of "Bioscape Result Assessment"

From irefindex
(New page: The suggestions produced by Bioscape's search activities can be assessed subject to the availability of "gold standard" data which confirms whether each particular result can be regarded a...)
 
(Added BioCreative import details.)

Revision as of 19:17, 10 March 2010

The suggestions produced by Bioscape's search activities can be assessed subject to the availability of "gold standard" data which confirms whether each particular result can be regarded as genuine.

BioCreative 2 Gene Normalisation

In the bsindex distribution, a script is available to export filtered results from Bioscape for assessment against the BioCreative gold standard:

python scripts/bsindex_export_bc2gn_results.py --bionames <generation> --results <generation> --methods human_gene --min-score 1 --output <output>

Once result data is available, this data can be scored through comparison to the gold standard file:

python scripts/bsindex_score_bc2gn_results.py gold <output>

A number of options to the scoring script help compare different sets of results:

python scripts/bsindex_score_bc2gn_results.py gold <output files> --pretty

The --pretty option provides a table with the following columns:

  1. Output filename
  2. Number of true positive results
  3. Number of false positive results
  4. Number of false negative results
  5. Precision
  6. Recall
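The last two columns can be derived from the three counts. The following is a minimal sketch, assuming the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN)). The function name and the handling of empty denominators are illustrative assumptions, not taken from the bsindex scripts.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true positive, false positive
    and false negative counts."""
    # Assumption: report 0.0 when a denominator is zero rather than failing.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


if __name__ == "__main__":
    # 60 correct suggestions, 20 incorrect ones, 40 gold standard genes missed.
    print(precision_recall(60, 20, 40))
```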

Combining the output of this script with other Unix commands can be convenient:

python scripts/bsindex_score_bc2gn_results.py gold <output files> --pretty | sort -n -k 5

The above combination should sort the entries on the precision column in order of increasing precision.
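Where the sorting is more convenient to do in Python, the same ordering might be sketched as follows. This assumes that each row of the --pretty table is a whitespace-separated line with the six columns listed above. The sample rows are invented for illustration and are not real results.

```python
def sort_by_precision(lines):
    """Sort whitespace-separated score rows by the fifth column
    (precision), in increasing order."""
    rows = [line.split() for line in lines if line.strip()]
    return sorted(rows, key=lambda row: float(row[4]))


if __name__ == "__main__":
    # Hypothetical rows: filename, TP, FP, FN, precision, recall.
    sample = [
        "run_a.txt 60 20 40 0.75 0.60",
        "run_b.txt 50 50 50 0.50 0.50",
    ]
    for row in sort_by_precision(sample):
        print(" ".join(row))
```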

Comparing BioCreative Results and Bioscape Results

In order to compare results from BioCreative and Bioscape in the Web interface, the gold standard data must be imported; this involves the following processes:

  1. Import of the gene identifiers and names referenced in the gold standard data file.
  2. Text searching using these names in the appropriate documents, so that regions of text may be shown to provide results.
  3. Propagation of region and gene name information in order to produce specific gene references.
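The three steps above can be illustrated with a small sketch. It is not the Bioscape import code: the dictionary of names, the matching rule (case-insensitive, whole-word) and the function name are all assumptions made for the example.

```python
import re


def find_gene_regions(text, gene_names):
    """Return a sorted list of (start, end, gene_id) for each match.

    gene_names maps a gene identifier to a list of names, standing in
    for the names imported from the gold standard file (step 1).
    """
    regions = []
    for gene_id, names in gene_names.items():
        for name in names:
            # Step 2: search for the name, not embedded in a longer word.
            pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                # Step 3: attach the gene identifier to the matched region.
                regions.append((match.start(), match.end(), gene_id))
    return sorted(regions)


if __name__ == "__main__":
    text = "Mutations in BRCA1 and p53 were studied; BRCA1 was most affected."
    names = {"672": ["BRCA1"], "7157": ["p53", "TP53"]}  # example values
    for start, end, gene_id in find_gene_regions(text, names):
        print(start, end, gene_id, text[start:end])
```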

With this information available to Bioscape, it becomes possible to see each result set in the same document and to perform further analysis on the accuracy of Bioscape results.

Isolating Correct and Incorrect Bioscape Results

Using BioCreative results, it is possible to take a selection of Bioscape results and to assess them according to a number of criteria:

  • Correctness: whether each Bioscape result is correct or not - this can already be assessed using the export and scoring scripts described above, but only at the document level.
  • Correspondence: whether each BioCreative result corresponds to any Bioscape results - although this can be done using the scripts at the document level, it now becomes possible to consider the correspondence at the mention level.
  • Ambiguity: the ambiguity of Bioscape suggestions for each BioCreative result - many Bioscape suggestions for one result indicate ambiguity, whereas a single suggestion indicates an unambiguous match.
  • Whether Bioscape results appear in places not associated with BioCreative results, and whether these happen to correspond to BioCreative suggestions for the document concerned.
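The correspondence and ambiguity criteria might be measured at the mention level along the following lines. This is a sketch under assumptions: both result sets are represented as (start, end, gene_id) tuples, and a suggestion is taken to correspond to a mention when the two regions overlap.

```python
def suggestions_by_mention(gold_mentions, suggestions):
    """Map each gold mention region to the set of gene ids suggested
    by overlapping results."""
    found = {}
    for m_start, m_end, _ in gold_mentions:
        genes = set()
        for s_start, s_end, s_gene in suggestions:
            # Two half-open regions overlap when each starts before
            # the other ends.
            if s_start < m_end and m_start < s_end:
                genes.add(s_gene)
        found[(m_start, m_end)] = genes
    return found


if __name__ == "__main__":
    gold = [(13, 18, "672"), (23, 26, "7157")]
    suggested = [(13, 18, "672"), (13, 18, "675"), (50, 55, "7157")]
    for region, genes in suggestions_by_mention(gold, suggested).items():
        # No genes: no correspondence; one: unambiguous; several: ambiguous.
        print(region, len(genes), sorted(genes))
```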

Thus, each Bioscape result can be classified as follows:

{| border="1" cellspacing="0" cellpadding="5" style="margin: 2em"
! width="40%" | Class
! width="20%" | At known location
! width="20%" | Predicts correct gene at location
! width="20%" | Predicts correct gene for document
|-
| True positive at "true" BioCreative mention location
| Yes
| Yes
| Yes
|-
| False positive at "true" BioCreative mention location
| Yes
| No (may co-exist with correct suggestion)
| No
|-
| True positive at wrong "true" BioCreative mention location
| Yes
| No (may co-exist with correct suggestion)
| Yes
|-
| True positive at "false" unknown-to-BioCreative mention location
| No
| No
| Yes
|-
| False positive at "false" unknown-to-BioCreative mention location
| No
| No
| No
|}
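The classification above can be expressed as a small decision function. This is a sketch only: the (start, end, gene_id) representation, the overlap test and the class labels are assumptions made for illustration, not the Bioscape implementation.

```python
def classify(suggestion, gold_mentions, document_genes):
    """Assign a Bioscape suggestion to one of the five classes."""
    start, end, gene = suggestion
    # Gene ids of the gold mentions overlapping the suggestion's region.
    genes_here = [g for (s, e, g) in gold_mentions if start < e and s < end]
    at_known_location = bool(genes_here)
    correct_at_location = gene in genes_here
    correct_for_document = gene in document_genes

    if at_known_location:
        if correct_at_location:
            return "true positive at true location"
        if correct_for_document:
            return "true positive at wrong location"
        return "false positive at true location"
    if correct_for_document:
        return "true positive at unknown location"
    return "false positive at unknown location"


if __name__ == "__main__":
    gold = [(13, 18, "672"), (23, 26, "7157")]
    doc_genes = {"672", "7157"}
    for s in [(13, 18, "672"), (13, 18, "7157"), (13, 18, "999"),
              (50, 55, "672"), (50, 55, "999")]:
        print(s, "->", classify(s, gold, doc_genes))
```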

Another way of expressing these result categories is as follows:

{| border="1" cellspacing="0" cellpadding="5" style="margin: 2em"
! width="25%" |
! width="25%" | At "true" known location
! width="25%" | At "wrong" known location
! width="25%" | At "false" unknown location
|-
! True positive
| Bioscape suggestion matches
| colspan="2" | Bioscape suggestion matches a suggestion for the document ("accidental" true positive)
|-
! False positive
| colspan="2" | Bioscape suggestion does not match (and is inappropriate for the document)
| Bioscape suggestion neither appears at a recognised place nor is appropriate for the document
|}