iRefIndex Testing 7.0
From irefindex
Latest revision as of 10:21, 14 December 2010
The testing procedure for iRefIndex
Cross check with output of element counter
Program to use: biotek.uio.no.XML.Element_Counter (SaxValidator package)
- For each interaction source, the </interactor> count should match the UID count in int_object (select (select name from int_db where int_db.id=source) as intSource, count(uid) from int_object group by source;).
- For each interaction source, the </interaction> count should match the UID count in int_source (select (select name from int_db where int_db.id=source) as intSource, count(uid) from int_source group by source;).
- When </interactor> cannot be used to count distinct objects (when it occurs as part of an interaction and is repeated in the interactorList), some other suitable element has to be used (e.g. </participant>).
- Why count the complete closing tags in the above cases (e.g. </interactor> or </interaction>, instead of <interaction> or </interaction)? The reason is that interaction elements may have attributes, so the opening tag is not a fixed string, and other element names starting with "interaction" make a prefix match ambiguous. This program uses text matching (to be independent of any XML parsing).
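The reasoning about closing tags can be illustrated with a minimal sketch. It is not the Element_Counter program itself. The function name `count_closing_tags` and the sample XML fragment are hypothetical and serve only to show why a plain text match on the full closing tag is unambiguous while a match on the start of an opening tag is not.

```python
def count_closing_tags(xml_text: str, element: str) -> int:
    """Count occurrences of the complete closing tag </element> by text matching."""
    return xml_text.count(f"</{element}>")


# Hypothetical sample: two interactors inside an interactorList.
sample = (
    "<interactorList>"
    '<interactor id="1"><names/></interactor>'
    '<interactor id="2"><names/></interactor>'
    "</interactorList>"
)

# A prefix match on the opening tag also hits <interactorList>.
opening_matches = sample.count("<interactor")
# The complete closing tag only matches the two interactor elements.
closing_matches = count_closing_tags(sample, "interactor")

print(opening_matches, closing_matches)
```

The closing-tag count is the figure to compare with the count(uid) values returned by the queries above.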
Check SEGUID. Check one record each to verify that the process worked.
Test SEGUID updating process
- SQL query: select orid, count(distinct rog) as rog_C from seguid where orid<0 group by orid;
orid | Record_count |
-30 | 16983 |
-26 | 2 |
-24 | 78 |
-23 | 14 |
-22 | 1043258 |
-21 | 669761 |
-12 | 2679 |
-11 | 1665 |
-8 | 6525 |
-7 | 6547 |
-6 | 5235 |
-5 | 50305 |
-3 | 10853842 |
-2 | 11972291 |
- All entries with orid<0 are altered during the update. All entries with orid>=0 are original entries from the seguid annotation file.
ORID | Description |
-30 | This is an iRefIndex Complex (RIGID used as ROGID), included in a previous process |
-26 | Is a dead OLN acc yeast_acc mapped using UniProt cross reference |
-25 | Is a dead SGD acc yeast_acc mapped using UniProt cross reference |
-24 | Is a dead fly_acc mapped using UniProt cross reference |
-23 | Is a dead PDB |
-22 | Is a dead RefSeq |
-21 | Is a dead UniProtKB |
-12 | Added to SEGUID from original sequence record (N-Scores) in a previous process |
-11 | Added to SEGUID using Eutils in a previous process |
-8 | Is a live OLN acc yeast_acc mapped using UniProt cross reference |
-7 | Is a live SGD acc yeast_acc mapped using UniProt cross reference |
-6 | Is a live fly_acc mapped using UniProt cross reference |
-5 | Is a live PDB |
-3 | Is a live RefSeq |
-2 | Is a live UniProtKB |
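As a rough sketch, the orid query can be run and its result compared against the documented ORID codes so that an unexpected code stands out. The example uses Python's built-in sqlite3 module with a toy seguid table in place of the real MySQL database. The column types, the toy rows and the helper names `rog_counts_by_orid` and `undocumented_orids` are assumptions made for illustration.

```python
import sqlite3

# ORID codes listed in the description table above.
KNOWN_ORIDS = {-30, -26, -25, -24, -23, -22, -21, -12, -11, -8, -7, -6, -5, -3, -2}


def rog_counts_by_orid(conn):
    """Return {orid: number of distinct ROGs} for entries altered during update."""
    cursor = conn.execute(
        "select orid, count(distinct rog) as rog_C "
        "from seguid where orid < 0 group by orid"
    )
    return dict(cursor.fetchall())


def undocumented_orids(counts):
    """Return negative orid values that have no documented description."""
    return sorted(set(counts) - KNOWN_ORIDS)


# Toy data (hypothetical): (rog, orid) pairs.
conn = sqlite3.connect(":memory:")
conn.execute("create table seguid (rog integer, orid integer)")
conn.executemany(
    "insert into seguid values (?, ?)",
    [(1, -2), (2, -2), (2, -2), (3, -3), (4, 0), (5, -99)],
)

counts = rog_counts_by_orid(conn)
print(counts)
print(undocumented_orids(counts))
```

A documented code that is absent from the result is not necessarily an error, but it is worth a look when the previous release contained it.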
UID overlap testing
After parsing, it is important to make sure that UIDs do not overlap between tables. The following queries should each return an empty set:
- select * from int_object where int_object.uid in (select uid from int_source)
- select * from int_object where int_object.uid in (select uid from int_experiment)
- select * from int_source where int_source.uid in (select uid from int_experiment)
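The three overlap queries can be wrapped in a small script that reports which of them, if any, returned rows. This is a sketch only: it uses Python's sqlite3 module with toy tables that contain nothing but a uid column, whereas the real tables live in MySQL and have more columns. The helper name `failed_overlap_checks` is hypothetical.

```python
import sqlite3

OVERLAP_QUERIES = [
    "select * from int_object where int_object.uid in (select uid from int_source)",
    "select * from int_object where int_object.uid in (select uid from int_experiment)",
    "select * from int_source where int_source.uid in (select uid from int_experiment)",
]


def failed_overlap_checks(conn):
    """Return the queries that produced at least one row (empty list = no overlap)."""
    return [q for q in OVERLAP_QUERIES if conn.execute(q).fetchone() is not None]


# Toy tables (hypothetical schema: uid column only).
conn = sqlite3.connect(":memory:")
for table in ("int_object", "int_source", "int_experiment"):
    conn.execute(f"create table {table} (uid integer)")
conn.executemany("insert into int_object values (?)", [(1,), (2,)])
conn.executemany("insert into int_source values (?)", [(3,), (4,)])
conn.executemany("insert into int_experiment values (?)", [(5,)])

clean_result = failed_overlap_checks(conn)

# Introduce an overlap between int_source and int_experiment.
conn.execute("insert into int_experiment values (3)")
overlap_result = failed_overlap_checks(conn)

print(clean_result, len(overlap_result))
```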
Check five records each from all data sources
- Check with the file
- Check with the website if available
The method is to find the UID range for the source from the int_source2object table, e.g. for IntAct:
- select max(sourceid) as max_id, min(sourceid) as min_id from int_source2object where source=5;
- List the first interaction (min_id), the last (max_id) and a few from the middle.
- To list node attributes, use the int_xref and int_name tables with the objectid,
e.g. select * from int_xref where int_xref.uid = <the objectid from int_source2object table>
- To get the interaction attributes, use the int_xref and int_name tables with the sourceid.
- To get experiment attributes, first get the experiment uid from the int_experiment table using the sourceid from the int_source2object table. Then, for the attributes, use the int_xref and int_name tables with the uid from the int_experiment table.
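The record selection described above could be scripted roughly as follows. The sketch uses Python's sqlite3 module and a toy int_source2object table with only the three columns named in the text. The toy rows, the helper name `sample_sourceids` and the choice of a single middle record (the first sourceid at or above the midpoint of the range) are assumptions, not part of the documented procedure.

```python
import sqlite3


def sample_sourceids(conn, source):
    """Return the first, one middle and the last sourceid recorded for a source."""
    min_id, max_id = conn.execute(
        "select min(sourceid), max(sourceid) from int_source2object where source = ?",
        (source,),
    ).fetchone()
    if min_id is None:  # the source has no rows
        return []
    midpoint = (min_id + max_id) // 2
    middle = conn.execute(
        "select min(sourceid) from int_source2object "
        "where source = ? and sourceid >= ?",
        (source, midpoint),
    ).fetchone()[0]
    return sorted({min_id, middle, max_id})


# Toy data (hypothetical): (sourceid, objectid, source).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table int_source2object (sourceid integer, objectid integer, source integer)"
)
conn.executemany(
    "insert into int_source2object values (?, ?, ?)",
    [(10, 100, 5), (10, 101, 5), (11, 102, 5), (15, 103, 5),
     (20, 104, 5), (30, 105, 5), (40, 106, 7)],
)

picked = sample_sourceids(conn, 5)
# Interactors (objectid) of the first picked interaction, for the int_xref lookup.
objectids = [
    row[0]
    for row in conn.execute(
        "select objectid from int_source2object where sourceid = ? order by objectid",
        (picked[0],),
    )
]
print(picked, objectids)
```

Each picked sourceid and objectid can then be used in the int_xref and int_name lookups and compared with the source file or website.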
Check the legacy data-sources
These are data sources where the source data has not changed.
- Verify the reasons for any differences in numbers.
Check SQL tables
The following tables should be checked for:
- The expected number of rows and columns
- Null values (these include MySQL null, 0, -1, -8 and -10). Sometimes null values are allowed; the aim here is to verify that there is no systematic error.
- Reserved characters in PSI-MI Tab and XML.
- Problems in character encoding.
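One way to look for systematic null values is to count, per column, how many rows hold MySQL null or one of the listed placeholder values. The sketch below does this with Python's sqlite3 module on a toy table. The table `demo`, its columns and the helper name `null_like_counts` are hypothetical, and the column listing uses SQLite's `pragma table_info`, which would have to be replaced by a MySQL equivalent on the real database.

```python
import sqlite3


def null_like_counts(conn, table):
    """Return {column: (null_like_rows, total_rows)} for a trusted table name."""
    columns = [row[1] for row in conn.execute(f"pragma table_info({table})")]
    total = conn.execute(f"select count(*) from {table}").fetchone()[0]
    counts = {}
    for col in columns:
        null_like = conn.execute(
            f"select count(*) from {table} "
            f"where {col} is null or {col} in (0, -1, -8, -10)"
        ).fetchone()[0]
        counts[col] = (null_like, total)
    return counts


# Toy table (hypothetical columns).
conn = sqlite3.connect(":memory:")
conn.execute("create table demo (uid integer, taxid integer, name text)")
conn.executemany(
    "insert into demo values (?, ?, ?)",
    [(1, 9606, "a"), (2, -1, None), (3, 0, "b"), (4, None, "c")],
)

report = null_like_counts(conn, "demo")
print(report)
```

A column where the first number equals the second holds nothing but null-like values, which suggests a systematic problem rather than an occasional allowed null.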
Table name | Check | What to expect |
acc_multiples | NO | |
addeds | NO | |
arbitrary | NO | |
colon_patch | NO | |
colon_patch_bk | NO | |
config | NO | |
cy_edgeatrib | YES | This is a denormalized table with all interaction attributes. It is used when making the RIGID-centric TAB file and when making iRefScape data. Blank values are "-". No fields should contain NULL. Check for columns with only "-" as value. |
cy_nodeatrib | YES | This is a denormalized table with all interactor attributes. It is used when making the ROGID-centric TAB file and when making iRefScape data. Blank values are "-". No fields should contain NULL. Check for columns with only "-" as value. Only ROGIDs used in interactions will appear here. |
equa_score_multiple | NO | |
equa_score_multiple_reset | NO | |
eutils | YES | This table contains deprecated protein sequences (removed from current RefSeq, UniProt or other databases archived by Entrez). The row count in this table should not be significantly different from the previous release. The SEGUID column should be checked to make sure the Eutils web service client has performed as expected. This is also a good point to check the "E" scores in the int_xref_mod table. Also cross-check with the SEGUID table (entries here should also appear in the SEGUID table if they have a valid SEGUID). |
gene_acc | NO | |
gene2refseq | YES | This table has information on the protein products of each gene. This is the primary table for RGG (Redundant Gene Group) assignment. Check 5 records against the NCBI web site. Select 5 RGGs and check that the assignment is performed correctly. Check for entries that do not appear in the SEGUID table. Check for consistent null or blank values. |
geneinfo | YES | This table provides information about Entrez Gene records. Check for consistent null values in columns. Check 5 rows against the NCBI web site. Labels for genes are extracted from this table. |
int_category | YES | |
int_db | YES | |
int_deleted | YES | |
int_experiment | YES | |
int_generation | YES | |
int_name | YES | |
int_object | YES | |
int_objecttype | YES | |
int_participants | YES | |
int_proteinUIDs | YES | |
int_recordtype | YES | |
int_seguerror | YES | |
int_sequence | YES | |
int_source | YES | |
int_source2object | YES | |
int_xref | YES | |
int_xref_mod | YES | |
intacc2rig | YES | |
ipi2seq | YES | |
ipi2xref | YES | |
maxvals | YES | |
none_prots | YES | |
pdb | YES | |
pdb_mmdb | YES | |
pluses | YES | |
pmid2int | YES | |
pmid2rig | YES | |
PPI_sourceid | YES | |
ref_main | YES | |
ref_xref | YES | |
refseq | YES | |
rig2rigid | YES | |
rig2rog | YES | |
risg2risgid | YES | |
rog_found | YES | |
rog_mult | YES | |
rog_multiple | YES | |
rog_reset | YES | |
rog2rig | YES | |
rog2rogid | YES | |
score_multiple | YES | |
segu2seq | YES | |
seguid | YES | |
seguid_aded | YES | |
seguid_complex | YES | |
seguid_gbnk | YES | |
seguid_pdbd | YES | |
seguid_refs | YES | |
seguid_remv | YES | |
seguid_rest | YES | |
seguid_unip | YES | |
sha_seguid | YES | |
sha_seguid_redund | YES | |
summary_rig | YES | |
summary_rog | YES | |
summary_score | YES | |
tmp_orphaned_interaction_experiment | YES | |
tmp_orphaned_interaction_name | YES | |
tmp_orphaned_interaction_source | YES | |
tmp_orphaned_interaction_xref | YES | |
tmp_orphaned_interactions | YES | |
tmp_orphaned_interactor_name | YES | |
tmp_orphaned_interactor_object | YES | |
tmp_orphaned_interactor_xref | YES | |
tmp_orphaned_interactors | YES | |
uid2rig | YES | |
uid2rog | YES | |
uniprot_fly_acc | YES | |
uniprot_isoforms | YES | |
uniprot_main | YES | |
uniprot_ref | YES | |
uniprot_sequence | YES | |
uniprot_yeast_acc | YES | |
unique_rigids | YES | |
unique_rogs | YES | |
used_rogs | YES | |
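For the denormalized tables cy_edgeatrib and cy_nodeatrib, the two checks named in the table (no NULL fields, and no column holding only "-") can be sketched as below. The example runs on a toy cy_edgeatrib table in SQLite. The column names `rigid`, `method` and `note` and the helper name `dash_and_null_columns` are hypothetical, since the real column layout is not given here.

```python
import sqlite3


def dash_and_null_columns(conn, table):
    """Return (columns holding only '-', columns holding at least one NULL)."""
    columns = [row[1] for row in conn.execute(f"pragma table_info({table})")]
    total = conn.execute(f"select count(*) from {table}").fetchone()[0]
    dash_only, with_null = [], []
    for col in columns:
        dashes = conn.execute(
            f"select count(*) from {table} where {col} = '-'"
        ).fetchone()[0]
        nulls = conn.execute(
            f"select count(*) from {table} where {col} is null"
        ).fetchone()[0]
        if total > 0 and dashes == total:
            dash_only.append(col)
        if nulls > 0:
            with_null.append(col)
    return dash_only, with_null


# Toy table (hypothetical columns).
conn = sqlite3.connect(":memory:")
conn.execute("create table cy_edgeatrib (rigid text, method text, note text)")
conn.executemany(
    "insert into cy_edgeatrib values (?, ?, ?)",
    [("r1", "-", "x"), ("r2", "-", None)],
)

dash_only, with_null = dash_and_null_columns(conn, "cy_edgeatrib")
print(dash_only, with_null)
```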
Categories
The int_category table defines the categories used in the int_xref, int_name and int_participants tables.
Category reference number | Tag | Description |
0 | interaction_primaryref | NA |
1 | interaction_secondaryref | NA |
2 | object_primaryref | NA |
3 | object_secondaryref | NA |
4 | bib_primaryref | NA |
5 | bib_secondaryref | NA |
6 | int_ditection_primaryref | NA |
7 | int_ditection_secondaryref | NA |
8 | parti_dentifi_primaryref | NA |
9 | parti_dentifi_secondaryref | NA |
10 | interaction_shortlbl | NA |
11 | interaction_alias | NA |
12 | interaction_full | NA |
13 | object_shortlbl | NA |
14 | object_alias | NA |
15 | object_full | NA |
16 | experim_primaryref | NA |
17 | experim_secondaryref | NA |
18 | experim_shortlbl | NA |
19 | experim_alias | NA |
20 | experim_full | NA |
21 | parti_dentifi_shortlbl | NA |
22 | parti_dentifi_alias | NA |
23 | parti_dentifi_full | NA |
24 | int_ditection_shortlbl | NA |
25 | int_ditection_alias | NA |
26 | int_ditection_fullnm | NA |
27 | int_type_sh | NA |
28 | int_type_al | NA |
29 | int_type_fl | NA |
30 | object_giref | NA |
31 | object_revers_primary_using_primary | NA |
32 | uniprot_secondary | NA |
33 | object_revers_primary_using_secondary | NA |
34 | primary_uniprimary_fetch_dif_taxon | NA |
36 | Interactor_ref_biological_role | NA |
37 | Interactor_ref_Experimental_role | NA |
38 | Experiment_Host_organism | NA |
39 | Depricated_1 | NA |
40 | int_type_primaryref | NA |
41 | int_type_secondaryref | NA |
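Since int_category defines the categories, one possible consistency check is to look for category numbers that are used in another table but not defined in int_category. The sketch below assumes that int_category has an `id` column and that the referring table stores the number in a column named `category`. Both column names are guesses, as are the toy rows and the helper name `undefined_categories`.

```python
import sqlite3


def undefined_categories(conn, table, column="category"):
    """Return category numbers used in `table` that int_category does not define."""
    query = (
        f"select distinct {column} from {table} "
        f"where {column} not in (select id from int_category) "
        f"order by {column}"
    )
    return [row[0] for row in conn.execute(query)]


# Toy tables (hypothetical schema and rows).
conn = sqlite3.connect(":memory:")
conn.execute("create table int_category (id integer, tag text)")
conn.execute("create table int_xref (uid integer, category integer)")
conn.executemany(
    "insert into int_category values (?, ?)",
    [(0, "interaction_primaryref"), (1, "interaction_secondaryref"),
     (2, "object_primaryref")],
)
conn.executemany(
    "insert into int_xref values (?, ?)",
    [(1, 0), (2, 2), (3, 35), (4, 35)],
)

missing = undefined_categories(conn, "int_xref")
print(missing)
```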
Follow this link for a listing of all iRefIndex related pages (archived and current).