Difference between revisions of "Statistics iRefIndex 14.0"

From irefindex
 

Latest revision as of 14:08, 21 April 2015


These statistics apply to the extended version of iRefIndex. See the iRefIndex_Release_Notes for details.



== Interactions available from major taxonomies (corrected) ==

The taxon of each protein interactor has been corrected to match the taxon given in its protein sequence record, regardless of the taxon listed in the interaction record. See PMID 18823568 for details.

{| cellspacing="0" cellpadding="5"
! NCBI taxonomy identifier !! Scientific name !! Number of interactions
|-
| 9606 || Homo sapiens || 472494
|-
| 559292 || Saccharomyces cerevisiae S288c || 122323
|-
| 7227 || Drosophila melanogaster || 60888
|-
| 10090 || Mus musculus || 35318
|-
| 3702 || Arabidopsis thaliana || 24946
|-
| 6239 || Caenorhabditis elegans || 17843
|-
| 83333 || Escherichia coli K-12 || 16450
|-
| 192222 || Campylobacter jejuni subsp. jejuni NCTC 11168 = ATCC 700819 || 11971
|-
| 10116 || Rattus norvegicus || 9679
|-
| 284812 || Schizosaccharomyces pombe 972h- || 9387
|-
| 381518 || Influenza A virus (A/Wilson-Smith/1933(H1N1)) || 4087
|-
| 632 || Yersinia pestis || 3956
|-
| 243276 || Treponema pallidum subsp. pallidum str. Nichols || 3642
|-
| 1111708 || Synechocystis sp. PCC 6803 substr. Kazusa || 3232
|}
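
The taxon correction described above can be sketched as follows. This is a minimal illustration, not the iRefIndex implementation: the `Interactor` class, the `sequence_taxa` lookup, the accession `ACC_EXAMPLE`, and the fallback to the record's taxon when no sequence record is known are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Interactor:
    accession: str      # protein reference from the interaction record (hypothetical)
    record_taxon: int   # NCBI taxonomy identifier listed in the interaction record


def corrected_taxon(interactor, sequence_taxa):
    """Prefer the taxon of the protein sequence record over the record's taxon.

    sequence_taxa maps accession -> taxon from the sequence record.
    Falling back to the record's taxon is an assumption of this sketch.
    """
    return sequence_taxa.get(interactor.accession, interactor.record_taxon)


# Hypothetical case: the interaction record lists 4932 (Saccharomyces cerevisiae),
# while the sequence record carries the strain-level taxon 559292.
sequence_taxa = {"ACC_EXAMPLE": 559292}
protein = Interactor(accession="ACC_EXAMPLE", record_taxon=4932)
taxon = corrected_taxon(protein, sequence_taxa)
```

Under this scheme an interaction is counted under the taxon of the sequence record, which is why strain-level identifiers such as 559292 appear in the table.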

== Summary of mapping interaction records to RIGs (redundant interaction groups) ==

'''Source''': Interaction data source. '''Total records''': Total number of interaction records found in the source. '''Protein-related interactions''': Total number of interactions involving only protein interactors. '''PPI assigned to RIGID''': Number of interactions in which all protein interactors were assigned to a ROG; the '''%''' column that follows expresses this as a percentage of column 3. '''Unique RIGIDs''': Number of unique protein interactions and complexes (RIGIDs) found in the data source; the '''%''' column that follows expresses this as a percentage of column 4. For a description of the term RIG, see [[README_MITAB2.6_for_iRefIndex#Understanding_the_iRefIndex_MITAB_format]] and the original paper, PMID 18823568.

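{| cellspacing="0" cellpadding="5"
! Source !! Total records !! Protein-related interactions !! PPI assigned to RIGID !! % !! Unique RIGIDs !! %
|-
| BHF_UCL || 928 || 915 || 915 || 100.00 || 518 || 56.61
|-
| BIND || 157736 || 91309 || 90816 || 99.46 || 62858 || 69.21
|-
| BIND_TRANSLATION || 192923 || 84138 || 81773 || 97.19 || 60720 || 74.25
|-
| BIOGRID || 790004 || 493818 || 491294 || 99.49 || 324083 || 65.97
|-
| CORUM || 2844 || 2844 || 2844 || 100.00 || 2607 || 91.67
|-
| DIP || 78781 || 77225 || 77052 || 99.78 || 74638 || 96.87
|-
| HPIDB || 1458 || 1405 || 1405 || 100.00 || 725 || 51.60
|-
| HPRD || 83022 || 83022 || 82983 || 99.95 || 40536 || 48.85
|-
| I2D_IMEX || 892 || 891 || 891 || 100.00 || 434 || 48.71
|-
| INNATEDB || 17496 || 17496 || 7111 || 40.64 || 4932 || 69.36
|-
| INTACT || 344906 || 327730 || 327637 || 99.97 || 224568 || 68.54
|-
| INTCOMPLEX || 1100 || 982 || 982 || 100.00 || 968 || 98.57
|-
| MATRIXDB || 596 || 575 || 575 || 100.00 || 324 || 56.35
|-
| MBINFO || 542 || 521 || 521 || 100.00 || 330 || 63.34
|-
| MOLCON || 377 || 375 || 375 || 100.00 || 212 || 56.53
|-
| MPACT || 16504 || 16504 || 16373 || 99.21 || 13398 || 81.83
|-
| MPIDB || 1505 || 1504 || 1504 || 100.00 || 954 || 63.43
|-
| MPPI || 1814 || 1758 || 1578 || 89.76 || 776 || 49.18
|-
| OPHID || 73257 || 73257 || 73257 || 100.00 || 47464 || 64.79
|-
| REACTOME || 141996 || 141996 || 141993 || 100.00 || 141818 || 99.88
|-
| SPIKE || 29686 || 29686 || 28323 || 95.41 || 27824 || 98.24
|-
| UNIPROTPP || 8952 || 8890 || 8890 || 100.00 || 5049 || 56.79
|-
| VIRUSHOST || 45540 || 45540 || 45539 || 100.00 || 45538 || 100.00
|-
| (All) || 1992859 || 1502381 || 1484631 || 98.82 || 797994 || 53.75
|}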
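
As a worked example of how the two percentage columns relate to the counts, the following sketch recomputes them for the BIND row. The helper name `pct` and the dictionary keys are illustrative choices, and rounding to two decimal places is an assumption inferred from the values shown in the table.

```python
def pct(part, whole):
    """Express part as a percentage of whole, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


# Counts copied from the BIND row of the table above.
bind = {
    "total_records": 157736,
    "protein_related": 91309,   # column 3
    "assigned_to_rigid": 90816, # column 4
    "unique_rigids": 62858,
}

# First % column: interactions assigned to a RIGID, relative to column 3.
assigned_pct = pct(bind["assigned_to_rigid"], bind["protein_related"])

# Second % column: unique RIGIDs, relative to column 4.
unique_pct = pct(bind["unique_rigids"], bind["assigned_to_rigid"])
```

The second percentage is a rough measure of redundancy within a source: the lower it is, the more often the same interaction is reported by several records.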

== Assignment of protein interactors to ROGs (redundant object group) ==

'''Source''': Interaction data source (see methods). '''Protein interactors''': Total number of interactors found in all interaction records. '''Assigned''': Number of proteins assigned unambiguously to a ROG. Assignments listed in columns 5 and 6 are not included here. '''%''': Column 3 expressed as a percentage of column 2. '''Arbitrary''': Total number of ROG assignments that were ambiguous and were resolved with an arbitrary method (see ROG scores with 'L'). '''Matching sequence''': Total number of assignments made where a sequence in the interaction record matched a known sequence. '''Unassigned''': Total number of protein interactors that could not be assigned to a ROG. '''Unique proteins''': Total number of unique proteins (ROGs). For a description of the term ROG, see [[README_MITAB2.6_for_iRefIndex#Understanding_the_iRefIndex_MITAB_format]] and the original paper, PMID 18823568.

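{| cellspacing="0" cellpadding="5"
! Source !! Protein interactors !! Assigned !! % !! Arbitrary !! Matching sequence !! New or obsolete sequence !! Unassigned !! Unique proteins
|-
| BHF_UCL || 2060 || 2060 || 100.00 || 0 || 0 || 0 || 0 || 494
|-
| BIND || 252251 || 251706 || 99.78 || 0 || 0 || 0 || 545 || 37441
|-
| BIND_TRANSLATION || 257681 || 251597 || 97.64 || 40883 || 0 || 0 || 6084 || 36124
|-
| BIOGRID || 53047 || 52116 || 98.24 || 11433 || 0 || 0 || 931 || 51873
|-
| CORUM || 12916 || 12916 || 100.00 || 7 || 0 || 0 || 0 || 4363
|-
| DIP || 26633 || 26551 || 99.69 || 2084 || 0 || 0 || 82 || 25804
|-
| HPIDB || 3221 || 3221 || 100.00 || 0 || 0 || 0 || 0 || 782
|-
| HPRD || 123812 || 123812 || 100.00 || 13563 || 95615 || 130 || 0 || 9841
|-
| I2D_IMEX || 1932 || 1932 || 100.00 || 0 || 0 || 0 || 0 || 448
|-
| INNATEDB || 40104 || 24918 || 62.13 || 0 || 0 || 0 || 15186 || 3619
|-
| INTACT || 265428 || 265292 || 99.95 || 115 || 39 || 74 || 136 || 73955
|-
| INTCOMPLEX || 3256 || 3256 || 100.00 || 0 || 0 || 0 || 0 || 2194
|-
| MATRIXDB || 1171 || 1171 || 100.00 || 5 || 0 || 0 || 0 || 231
|-
| MBINFO || 1134 || 1134 || 100.00 || 0 || 0 || 0 || 0 || 273
|-
| MOLCON || 862 || 862 || 100.00 || 0 || 0 || 0 || 0 || 275
|-
| MPACT || 40349 || 40199 || 99.63 || 0 || 0 || 0 || 150 || 4995
|-
| MPIDB || 3238 || 3238 || 100.00 || 0 || 0 || 0 || 0 || 995
|-
| MPPI || 3568 || 3361 || 94.20 || 16 || 0 || 0 || 207 || 833
|-
| OPHID || 146514 || 146514 || 100.00 || 405 || 20 || 1014 || 0 || 9476
|-
| REACTOME || 283992 || 283988 || 100.00 || 19 || 0 || 0 || 4 || 6013
|-
| SPIKE || 65934 || 64561 || 97.92 || 967 || 0 || 0 || 1373 || 8811
|-
| UNIPROTPP || 21185 || 21185 || 100.00 || 1 || 0 || 0 || 0 || 4642
|-
| VIRUSHOST || 94874 || 94873 || 100.00 || 22 || 0 || 0 || 1 || 10283
|-
| (All) || 1705162 || 1680463 || 98.55 || 69520 || 95674 || 1218 || 24699 || 122677
|}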
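
A quick way to sanity-check a row of this table is sketched below. In every row shown, the Assigned and Unassigned counts add up to the Protein interactors count, and the % column is Assigned divided by Protein interactors. The function name `check_row` and the two-decimal rounding are assumptions made for this illustration.

```python
def check_row(interactors, assigned, unassigned):
    """Return the assigned percentage after checking the counts are consistent."""
    if assigned + unassigned != interactors:
        raise ValueError("assigned + unassigned does not equal protein interactors")
    return round(100.0 * assigned / interactors, 2)


# Counts copied from the INNATEDB and BIND_TRANSLATION rows above.
innatedb_pct = check_row(interactors=40104, assigned=24918, unassigned=15186)
bind_translation_pct = check_row(interactors=257681, assigned=251597, unassigned=6084)
```

INNATEDB stands out in this check: more than a third of its protein interactors could not be assigned to a ROG, which matches its low assignment rate in the RIG summary.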

== Mapping score summary ==

See below for definitions of the mapping score codes.

BHF_UCL BIND BIND_TRANSLATION BIOGRID CORUM DIP HPIDB HPRD I2D_IMEX INNATEDB INTACT INTCOMPLEX MATRIXDB MBINFO MOLCON MPACT MPIDB MPPI OPHID REACTOME SPIKE UNIPROTPP VIRUSHOST
P 2060 173353 33822 12875 3221 1932 24918 264204 3256 1166 1134 862 3238 283963 55189 21180 94851
P+IN 6
P+L 19764 746 2 22
P+N 64
P+X 3 2
PD 116272 2996 124085
PD+IN 2
PD+LQ 10197
PD+N 1014
PD+X 10
PD+XQ 26
PDIQ 732
PDQ 30573
PGD 613 2079 306
PGD+L 6300 10659 6 962
PGD+X 13
PI 418
PT 2084 2579 1 30579
PT+L 541 1
PTD 84164 1 44 114
PTD+LQ 4022
PTDIQ 13
PTDQ 2492
PTGD 17 1
PTGD+L 21 2
PTI 16
PTM 3
PU 16 34 396 6 8099 4
PU+L 17 7 76 5 19 5
PU+O 23
PU+X 610 2
PUD 7 143 17341
PUD+L 13 265
PUD+O 20
PUD+X 60 162 3526
PUT 4 15 2527
PUT+L 21 30 1
PUT+O 16
PUTD 4 9
PUTD+L 3 140
PV 7
PV+L 1
S 146 831 12454 115 1
S+L 25 1560 634
S+N 2
S+O 275
S+X 263
SD 1338 4690 3119
SD+L 215 327
SD+N 130
SD+O 11114
SD+X 1173
SGD 680
SGD+L 2124
SGD+O 15462
SI 45114
ST 4557 112 7093
ST+L 243 3767
ST+O 852
STD 18 702 8455
STD+L 5 645
STD+O 28208
STGD 2023
STGD+L 6026
STGD+O 39571
STI 6075
SU 32
SUD 47
SUD+L 33 25
SUD+O 2
SUD+X 568
SUTD 13
SUTD+L 28 15
SUTD+O 131
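
Each row label in the summary above is a string of single-character codes, so a label such as PTGD+L can be read by looking up each character in turn. The sketch below shows one way to do that. The short descriptions are paraphrases of the entries under "Mapping score code definitions", and the `FEATURES` table and `decode_score` function are illustrative names, not part of iRefIndex.

```python
# Paraphrased from the mapping score code definitions on this page.
FEATURES = {
    "D": "source database differs from the one expected for the accession",
    "E": "retired identifier resolved through NCBI eUtils",
    "G": "reference was an EntrezGene identifier",
    "L": "ambiguity resolved by ranking, then by longest sequence",
    "M": "reference was a typographical modification of a known accession",
    "+": "more than one assignment was possible",
    "N": "new SEGUID entry created from the interaction record",
    "O": "chosen assignment matches the original sequence's SEGUID",
    "I": "reference was an NCBI GenInfo Identifier",
    "U": "secondary UniProt accession updated to a primary accession",
    "T": "taxonomy identifier differed from the sequence record",
    "V": "version information in the reference was ignored",
    "Q": "reference was of type 'see-also'",
    "P": "primary reference was used",
    "S": "a secondary reference was used",
    "Y": "accession removed from RefSeq or UniProt after the beta3 build",
    "X": "chosen assignment matches the record's taxonomy identifier",
}


def decode_score(score):
    """Split a mapping score string into (code, description) pairs."""
    unknown = [c for c in score if c not in FEATURES]
    if unknown:
        raise ValueError(f"unknown score characters: {unknown}")
    return [(c, FEATURES[c]) for c in score]


for code, meaning in decode_score("PTGD+L"):
    print(code, "-", meaning)
```

Read this way, PTGD+L describes a primary reference that was an EntrezGene identifier with tolerated taxonomy and database differences, where several assignments were possible and the ranking rule picked one.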

== Mapping score code definitions ==

{|
| align="center" style="background:#f0f0f0;"|'''Character'''||align="center" style="background:#f0f0f0;"|'''Description of feature (when the value is 1)'''
|-
| D||The source database (D) listed in the interaction record differs from what is expected for the given protein accession. In specific cases, this difference is tolerated and the assignment is made.
|-
| E||The protein reference was a retired NCBI identifier or a UniProt identifier. NCBI's eUtils (E) were used to retrieve the current accession and/or sequence. For identifiers that still had no sequence after the eUtils lookup, sequence information was obtained from UniProt.
|-
| G||The interaction record's reference for the protein was an EntrezGene (G) identifier. The corresponding products of the gene were used to make the assignment.
|-
| L||More than one assignment is possible (see +), for example when a gene identifier has several isoforms. In such a situation, references are picked using a ranking system (first look for RefSeq, then UniProt). If ambiguity remains after this ranking, the reference with the longest sequence is selected. Note that this score class definition differs from the originally published one.
|-
| M||The protein reference listed by the interaction record was a typographical modification (M) of a known accession. In specific cases, this variation is tolerated and the assignment is made.
|-
| +||More than one assignment is possible (+). This case may arise in one of three ways. 1) The reference supplied by the interaction record requires updating but more than one possibility exists. For example, Q7XJL8 was found to be a secondary accession in three separate UniProt records (Q3EBZ2, Q6DR20, and Q8GWA9). 2) The secondary references supplied by the interaction record point to more than one unique protein sequence. 3) An EntrezGene identifier is provided in the interaction record as a protein reference, and this identifier points to more than one protein product. An attempt is made to resolve this ambiguity as indicated by ROG score features O, X or L (see the corresponding entries).
|-
| N||The protein reference, taxonomy identifier and sequence for the protein as provided in the interaction record are used to make a new entry in the SEGUID table. The protein interactor is assigned the newly (N) generated ROG identifier.
|-
| O||More than one assignment is possible (see +). The assignment chosen has a SEGUID that is identical to the SEGUID of the original (O) sequence provided in the interaction record.
|-
| I||The protein reference used was an NCBI GenInfo Identifier (I).
|-
| U||The protein reference listed in the interaction record and used to make the assignment was a secondary UniProt accession and was updated (U) to a primary UniProt accession in order to make the assignment.
|-
| T||The taxonomy (T) identifier for the protein (as supplied by the interaction record) differed from what was found in the protein sequence record. This discrepancy was tolerated and the assignment was made.
|-
| V||The protein reference listed by the interaction record contained version (V) information that was ignored. For example, RefSeq accession.version NP_012420.1 was listed but treated as RefSeq accession NP_012420.
|-
| Q||The protein reference used to make the assignment was of the type 'see-also'. See PSI-MI Path: entrySet/entry/interactorList/interactor/xref/primaryRef/refType = 'see-also'.
|-
| P||The interaction record's primary (P) reference for the protein was used to make the assignment.
|-
| S||One of the interaction record's secondary (S) references for the protein was used to make the assignment.
|-
| Y||The protein reference was an accession that was removed from RefSeq or UniProt after the beta3 build of iRefIndex (March 9th, 2009).
|-
| X||More than one assignment is possible (see +). The assignment chosen has the same taxonomy (X) identifier as listed in the interaction record.
|}
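
Two of the rules above lend themselves to a short illustration: the ranking used for score feature L (RefSeq first, then UniProt, then the longest sequence) and the version stripping described for V. The sketch below is one possible reading of those rules, not the iRefIndex code. The `Candidate` class, the database labels `"refseq"` and `"uniprot"`, the handling of other databases, and the example accessions and sequences are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    database: str   # "refseq", "uniprot", or another label (labels are assumptions)
    accession: str
    sequence: str


# Lower rank is preferred; unlisted databases sort after both (an assumption).
DB_RANK = {"refseq": 0, "uniprot": 1}


def pick_reference(candidates):
    """Apply the L rule: prefer RefSeq, then UniProt, then the longest sequence."""
    return min(
        candidates,
        key=lambda c: (DB_RANK.get(c.database, len(DB_RANK)), -len(c.sequence)),
    )


def strip_version(accession):
    """Apply the V rule: drop a trailing numeric version such as '.1'."""
    base, dot, version = accession.rpartition(".")
    return base if dot and version.isdigit() else accession


# Hypothetical isoforms of one gene; sequences are placeholders.
isoforms = [
    Candidate("uniprot", "UNIPROT_EXAMPLE", "M" * 500),
    Candidate("refseq", "REFSEQ_SHORT", "M" * 300),
    Candidate("refseq", "REFSEQ_LONG", "M" * 420),
]
chosen = pick_reference(isoforms)
```

Here the longer UniProt entry loses to the RefSeq entries because database rank is compared before sequence length, and `REFSEQ_LONG` wins among the RefSeq candidates. If two candidates still tie, `min` returns the first one, which is a choice of this sketch rather than a documented rule.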