Difference between revisions of "README iRefIndex Feedback 2.0"

Line 1: Line 1:
Last edited: October 13th, 2008
+
Last edited: January 17th, 2009
 +
Applies to iRefIndex release: 2.0 beta.
  
Applies to iRefIndex release: 1.1 beta.
+
Release date: January 13th, 2009
 
 
Release date: July 11th, 2008
 
  
 
Authors: Ian Donaldson and Sabry Razick
 
Authors: Ian Donaldson and Sabry Razick
Line 14: Line 13:
 
== Description ==
 
== Description ==
 
   
 
   
This file describes the contents of the xxx/feedback directory and the  
+
This file describes the contents of the ftp://ftp.no.embnet.org/irefindex/feedback/ directory and the  
 
format of the tab-delimited text files contained within.  
 
format of the tab-delimited text files contained within.  
  
 
== Directory contents ==
 
== Directory contents ==
 +
{| {{table}}
 +
| align="center" style="background:#f0f0f0;"|'''File name'''
 +
| align="center" style="background:#f0f0f0;"|'''Description of file'''
 +
|-
 +
| README||Pointer to this page
 +
|-
 +
| db_name_feedback_v.v.txt.zip||feedback for some database (db_name) for irefindex version v.v
 +
|-
 +
| db_name_not_mappedv.v.txt.zip||accessions provided by some database (db_name) that were not found for irefindex version v.v
 +
|-
 +
|}
  
README this file
 
xxxx.feedback.y.y.txt.zip feedback for some database (xxxx) for release y.y
 
  
  
Line 33: Line 41:
 
== License ==
 
== License ==
 
   
 
   
This directory is private and only released to invited source databases.
+
This directory is intended for the source databases incorporated into iRefIndex.
 
These data are released under the Creative Commons Attribution License http://creativecommons.org/licenses/by/2.5/. This means that you are free to use, modify and redistribute these data for personal or commercial use so long as you provide appropriate credit.  See next section.
 
These data are released under the Creative Commons Attribution License http://creativecommons.org/licenses/by/2.5/. This means that you are free to use, modify and redistribute these data for personal or commercial use so long as you provide appropriate credit.  See next section.
  
Copyright © 2008 Ian Donaldson
+
Copyright © 2009 Ian Donaldson
  
 
== Citation ==
 
== Citation ==
Line 48: Line 56:
 
FITNESS FOR A PARTICULAR PURPOSE.
 
FITNESS FOR A PARTICULAR PURPOSE.
  
== Understanding the Feedback file ==
+
== Understanding the Feedback files ==
  
Insert explanatory text here
+
The feedback file consists of 15 columns. 
 +
Each line in this file represents
  
{| {{table}}
+
  1. a reference to a protein interactor found in some source db record and
| align="center" style="background:#f0f0f0;"|'''Score'''
+
  2. the iRefIndex mapping to a current protein sequence record.
| align="center" style="background:#f0f0f0;"|'''Description of feature'''
+
 
|-
+
Columns 1 - 3 point to an interaction record where the protein reference was found.
| P||The interaction record's primary (P) reference for the protein was used to make the assignment.
+
 
|-
+
Columns 4 - 6 describe the ***primary*** reference for the protein as listed in the interaction record.
| ||
+
 
|-
+
Columns 7 - 9 describe the protein reference in the interaction record that was ***used*** by the iRefIndex process to locate a current protein sequence record. In most cases, this will be the same as columns 4 - 6; it differs when the primary reference could not be found and one of the secondary references supplied in the interaction record was used instead.
| D||The source database (D) listed in the interaction record is different than what is expected for the given accession for the protein. In specific cases, this difference is tolerated and the assignment is made.
+
 
|-
+
Columns 10 - 12 describe the protein reference that the ***used*** reference was ***mapped*** to. In most cases, this will be the same as the reference that was used to do the mapping (columns 7 - 9) unless the used reference had to be updated or was an entrez gene id.
| ||
+
 
|-
+
Column 13 lists the rogid (see PMID 18823568) of the mapped protein.  
| T||The taxonomy (T) identifier for the protein (as supplied by the interaction record) differed from what was found in the protein sequence record. This discrepancy was tolerated and the assignment was made.
+
 
|-
+
Column 14 lists the rogscore (see PMID 18823568 and below) that describes operations performed during the mapping (such as updating identifiers or converting a gene identifier to a protein accession).
| ||
+
 
|-
+
Column 15 lists the score_type (see PMID 18823568 and below). Rogscores (column 14) are grouped into one of six different score_types that indicate the severity of the operations required to perform the mapping.  So, for instance, a score_type of one indicates a non-problematic mapping whereas a score_type of six indicates that the protein reference supplied in the interaction record could not be found and the rog assignment was based on the sequence of the protein provided in the interaction record.
| M||The protein reference listed by the interaction record was a typographical modification (M) of a known accession. In specific cases, this variation is tolerated and the assignment is made.
+
 
|-
+
A number of protein references could not be found in our current database of proteins and no sequence was provided in the interaction record.  These protein references are listed in the dbname_not_mappedv.v.txt.zip file for each database. See [[#Description_of_Not_found_file|Description of Not found file]] below.
| ||
 
|-
 
| V||The protein reference listed by the interaction record contained version (V) information that was ignored. For example, RefSeq accession.version NP_012420.1 was listed but treated as RefSeq accession NP_012420.
 
|-
 
| ||
 
|-
 
| Q||The protein reference used to make the assignment was of the type "see-also". See PSI-MI Path: entrySet/entry/interactorList/interactor/xref/primaryRef/refType = "see-also".
 
|-
 
| ||
 
|-
 
| U||The protein reference listed in the interaction record and used to make the assignment was a secondary UniProt accession and was updated (U) to a primary UniProt accession in order to make the assignment.
 
|-
 
| ||
 
|-
 
| E||The protein reference was a retired NCBI Identifier. NCBI's eUtils (E) were used to retrieve the current accession and/or sequence.
 
|-
 
| ||
 
|-
 
| I||The protein reference used was an NCBI GenInfo Identifier (I).
 
|-
 
| ||
 
|-
 
| G||The interaction record's reference for the protein was an EntrezGene (G) identifier. The corresponding products of the gene were used to make the assignment.
 
|-
 
| ||
 
|-
 
| S||One of the interaction record's secondary (S) references for the protein was used to make the assignment.
 
|-
 
| ||
 
|-
 
| +||More than one possible assignment is possible (+). This case may arise in one of three ways. 1) The reference supplied by the interaction record requires updating but more than one possibility exists. For example, Q7XJL8 was found to be a secondary accession in three separate UniProt records (Q3EBZ2, Q6DR20, and Q8GWA9). 2) The secondary references supplied by the interaction record point to more than one unique protein sequence. 3) An EntrezGene identifier is provided in the interaction record as a protein reference. This identifier points to more than one protein product. An attempt is made to resolve this ambiguity as indicated by ROG score features O, X or L (see below).
 
|-
 
| ||
 
|-
 
| O||More than one possible assignment is possible (see + above). The assignment chosen has a SEGUID that is identical to the SEGUID of the original (O) sequence provided in the interaction record.
 
|-
 
| ||
 
|-
 
| X||More than one possible assignment is possible (see + above). The assignment chosen has the same taxonomy (X) identifier as listed in the interaction record.
 
|-
 
| ||
 
|-
 
| L||More than one possible assignment is possible (see + above). The assignment with the largest (L) SEGUID is arbitrarily chosen (see Methods).
 
|-
 
| ||
 
|-
 
| N||The protein reference, taxonomy identifier and sequence for the protein as provided in the interaction record are used to make a new entry in the SEGUID table. The protein interactor is assigned the newly (N) generated ROG identifier.
 
|-
 
| ||
 
|-
 
|
 
|}
 
  
 
== Description of Feedback file ==
 
== Description of Feedback file ==
Line 131: Line 88:
  
 
=== Column number: 1 ===
 
=== Column number: 1 ===
 
{|
 
|Column name: ||int_acc
 
|-
 
|Column type: ||string
 
|-
 
|Description: ||accession for interaction record
 
|-
 
|Example: || intact
 
|}
 
 
'''Notes'''
 
 
 
=== Column number: 2 ===
 
 
 
{|
 
{|
 
|Column name: ||int_db
 
|Column name: ||int_db
Line 180: Line 121:
 
|}
 
|}
  
=== Column number: 3 ===
+
=== Column number: 2 ===
  
 
{|
 
{|
|Column name: ||primary_acc
+
|Column name: ||int_acc
 +
|-
 +
|Column type: ||string
 +
|-
 +
|Description: ||accession for interaction record
 +
|-
 +
|Example: || intact
 +
|}
 +
 
 +
'''Notes'''
 +
 
 +
 
 +
=== Column number: 3 === 
 +
{|
 +
|Column name: ||source_file
 
|-
 
|-
 
|Column type: ||string
 
|Column type: ||string
 
|-
 
|-
|Description: ||An accession for a protein interactor in some database as supplied in the interaction record (see columns 1-2)
+
|Description: ||source file of interaction record
 
|-
 
|-
|Example: ||Q9Y6Q9
+
|Example: || pmid_2006_14691232.xml
 
|}
 
|}
  
Line 202: Line 157:
 
|Column type: ||string
 
|Column type: ||string
 
|-
 
|-
|Description: ||source db for accession listed in column 4
+
|Description: ||source db for accession listed in column 5
 
|-
 
|-
 
|Example: ||uniprotkb
 
|Example: ||uniprotkb
Line 214: Line 169:
  
 
{|
 
{|
|Column name: ||primary_taxid
+
|Column name: ||primary_acc
 
|-
 
|-
|Column type: ||integer
+
|Column type: ||string
 
|-
 
|-
|Description: ||taxonomy of protein interactor as listed in the source interaction record
+
|Description: ||An accession for a protein interactor in some database as supplied in the interaction record (see columns 1-2)
 
|-
 
|-
|Example: ||9606
+
|Example: ||Q9Y6Q9
 
|}
 
|}
  
Line 229: Line 184:
  
 
{|
 
{|
|Column name: ||used_acc
+
|Column name: ||primary_taxid
 
|-
 
|-
|Column type: ||string
+
|Column type: ||integer
 
|-
 
|-
|Description: ||An accession for a protein interactor in some database as supplied in the interaction record (see columns 1-2)
+
|Description: ||taxonomy of protein interactor as listed in the source interaction record
 
|-
 
|-
|Example: ||Q9Y6Q9
+
|Example: ||9606
 
|}
 
|}
  
 
'''Notes'''
 
'''Notes'''
 +
  
  
Line 255: Line 211:
 
'''Notes'''
 
'''Notes'''
  
This is the primary protein sequence database referenced in the interaction record.
+
This is the protein sequence database referenced in the interaction record that was used to perform the mapping (columns 10-12).
 +
 
  
 
=== Column number: 8 ===
 
=== Column number: 8 ===
 +
 +
{|
 +
|Column name: ||used_acc
 +
|-
 +
|Column type: ||string
 +
|-
 +
|Description: ||An accession for a protein interactor in some database as supplied in the interaction record (see columns 1-2)
 +
|-
 +
|Example: ||Q9Y6Q9
 +
|}
 +
 +
'''Notes'''
 +
 +
=== Column number: 9 ===
  
 
{|
 
{|
Line 271: Line 242:
 
'''Notes'''
 
'''Notes'''
  
=== Column number: 9 ===
+
 
 +
=== Column number: 10 ===
  
 
{|
 
{|
|Column name: ||mapped_acc
+
|Column name: ||mapped_db
 
|-
 
|-
 
|Column type: ||string
 
|Column type: ||string
 
|-
 
|-
|Description: ||the accession that this interactor was mapped to by iRefIndex
+
|Description: ||the source protein db that this interactor was mapped to by iRefIndex
 
|-
 
|-
|Example: ||Q9Y6Q9
+
|Example: ||uniprot
 
|}
 
|}
  
 
'''Notes'''
 
'''Notes'''
This will most likely be the same accession as listed in column 4 unless:
+
 
 +
This will most likely be the same as the db listed in column 7 unless:
 
{||class="wikitable" style="text-align:left" border="1" cellpadding="5"
 
{||class="wikitable" style="text-align:left" border="1" cellpadding="5"
 
|reason||example||see scores with...
 
|reason||example||see scores with...
 
|-
 
|-
|a modified version of the accession has been used||NP_0001 in place of NP 0001||M
+
|the db name is not valid or is a variation of a cv db||uniprot in place of "protein database"||D
|-
 
|an updated version of the accession has been used||xxx in place of xxx||U or E
 
 
|}
 
|}
  
=== Column number: 10 ===
+
 
 +
=== Column number: 11 ===
  
 
{|
 
{|
|Column name: ||mapped_db
+
|Column name: ||mapped_acc
 
|-
 
|-
 
|Column type: ||string
 
|Column type: ||string
 
|-
 
|-
|Description: ||the source protein db that this interactor was mapped to by iRefIndex
+
|Description: ||the accession that this interactor was mapped to by iRefIndex
 
|-
 
|-
|Example: ||uniprot
+
|Example: ||Q9Y6Q9
 
|}
 
|}
  
 
'''Notes'''
 
'''Notes'''
 
+
This will most likely be the same accession as listed in column 8 unless:
This will most likely be the same as the db listed in column 6 unless:
 
 
{||class="wikitable" style="text-align:left" border="1" cellpadding="5"
 
{||class="wikitable" style="text-align:left" border="1" cellpadding="5"
 
|reason||example||see scores with...
 
|reason||example||see scores with...
 
|-
 
|-
|the db name is not valid or is a variation of a cv db||uniprot in place of "protein database"||D
+
|a modified version of the accession has been used||NP_0001 in place of NP 0001||M
 +
|-
 +
|an updated version of the accession has been used||xxx in place of xxx||U or E
 
|}
 
|}
  
=== Column number: 11 ===
+
 
 +
=== Column number: 12 ===
  
 
{|
 
{|
Line 321: Line 295:
 
|Column type: ||integer
 
|Column type: ||integer
 
|-
 
|-
|Description: ||Taxonomy identifier for interactor as found in source protein db for record specified in column 8 and 9.
+
|Description: ||Taxonomy identifier for interactor as found in source protein db for record specified in columns 10 and 11.
 
|-
 
|-
 
|Example: ||9606
 
|Example: ||9606
Line 328: Line 302:
 
'''Notes'''
 
'''Notes'''
  
This will most likely be the same as the taxid listed in column 7 unless:
+
This will most likely be the same as the taxid listed in columns 6 and 9 unless:
 
{||class="wikitable" style="text-align:left" border="1" cellpadding="5"
 
{||class="wikitable" style="text-align:left" border="1" cellpadding="5"
 
|reason||example||see scores with...
 
|reason||example||see scores with...
 
|-
 
|-
|the listed taxid is different from that found in the mapped record||xxx in place of xxx||T
+
|the listed taxid is different from that found in the mapped record||9606 (human) in place of 40674 (mammalia)||T
 
|}
 
|}
  
=== Column number: 12 ===
+
=== Column number: 13 ===
  
 
{|
 
{|
Line 350: Line 324:
 
See iRefIndex paper. PMID 18823568.  
 
See iRefIndex paper. PMID 18823568.  
  
=== Column number: 13 ===
+
=== Column number: 14 ===
  
 
{|
 
{|
Line 366: Line 340:
 
See iRefIndex paper. PMID 18823568.  
 
See iRefIndex paper. PMID 18823568.  
  
=== Column number: 14 ===
+
The table below is a legend of the characters used in the rogscore (column 14).
  
{|
+
{| {{table}}
|Column name: ||score_type
+
| align="center" style="background:#f0f0f0;"|'''Score'''
 +
| align="center" style="background:#f0f0f0;"|'''Description of feature'''
 +
|-
 +
| P||The interaction record's primary (P) reference for the protein was used to make the assignment.
 +
|-
 +
| ||
 +
|-
 +
| D||The source database (D) listed in the interaction record is different than what is expected for the given accession for the protein. In specific cases, this difference is tolerated and the assignment is made.
 +
|-
 +
| ||
 +
|-
 +
| T||The taxonomy (T) identifier for the protein (as supplied by the interaction record) differed from what was found in the protein sequence record. This discrepancy was tolerated and the assignment was made.
 +
|-
 +
| ||
 +
|-
 +
| M||The protein reference listed by the interaction record was a typographical modification (M) of a known accession. In specific cases, this variation is tolerated and the assignment is made.
 +
|-
 +
| ||
 +
|-
 +
| V||The protein reference listed by the interaction record contained version (V) information that was ignored. For example, RefSeq accession.version NP_012420.1 was listed but treated as RefSeq accession NP_012420.
 +
|-
 +
| ||
 +
|-
 +
| Q||The protein reference used to make the assignment was of the type "see-also". See PSI-MI Path: entrySet/entry/interactorList/interactor/xref/primaryRef/refType = "see-also".
 +
|-
 +
| ||
 +
|-
 +
| U||The protein reference listed in the interaction record and used to make the assignment was a secondary UniProt accession and was updated (U) to a primary UniProt accession in order to make the assignment.
 +
|-
 +
| ||
 +
|-
 +
| E||The protein reference was a retired NCBI Identifier. NCBI's eUtils (E) were used to retrieve the current accession and/or sequence.
 
|-
 
|-
|Column type: ||integer
+
| ||
 
|-
 
|-
|Description: ||assignment score type 1-6
+
| I||The protein reference used was an NCBI GenInfo Identifier (I).
 
|-
 
|-
|Example: ||1
+
| ||
|}
 
 
 
'''Notes'''
 
 
 
See iRefIndex paper. PMID 18823568. Table 4 column 1
 
 
 
 
 
 
 
== Not found file==
 
Each row in this table represents an interactor reference which we were unable to map to
 
a sequence.
 
 
 
=== Column number: 1 ===
 
 
 
{|
 
|Column name: ||int_acc
 
 
|-
 
|-
|Column type: ||string
+
| G||The interaction record's reference for the protein was an EntrezGene (G) identifier. The corresponding products of the gene were used to make the assignment.
 
|-
 
|-
|Description: ||accession for interaction record
+
| ||
 
|-
 
|-
|Example: || intact
+
| S||One of the interaction record's secondary (S) references for the protein was used to make the assignment.
|}
 
 
 
'''Notes'''
 
 
 
 
 
=== Column number: 2 ===
 
 
 
{|
 
|Column name: ||int_db
 
 
|-
 
|-
|Column type: ||string
+
| ||
 
|-
 
|-
|Description: ||name of interaction db
+
| +||More than one possible assignment is possible (+). This case may arise in one of three ways. 1) The reference supplied by the interaction record requires updating but more than one possibility exists. For example, Q7XJL8 was found to be a secondary accession in three separate UniProt records (Q3EBZ2, Q6DR20, and Q8GWA9). 2) The secondary references supplied by the interaction record point to more than one unique protein sequence. 3) An EntrezGene identifier is provided in the interaction record as a protein reference. This identifier points to more than one protein product. An attempt is made to resolve this ambiguity as indicated by ROG score features O, X or L (see below).
 
|-
 
|-
|Example: || intact
+
| ||
|}
 
 
 
'''Notes'''
 
Possible values in this field are:
 
{||class="wikitable" style="text-align:left" border="1" cellpadding="5"
 
| bind ||biomolecular interaction network db
 
 
|-
 
|-
| biogrid||the biogrid db
+
| O||More than one possible assignment is possible (see + above). The assignment chosen has a SEGUID that is identical to the SEGUID of the original (O) sequence provided in the interaction record.
 
|-
 
|-
| dip||db of interacting proteins
+
| ||
 
|-
 
|-
| hprd||human protein reference db
+
| X||More than one possible assignment is possible (see + above). The assignment chosen has the same taxonomy (X) identifier as listed in the interaction record.
 
|-
 
|-
| intact||ebi interaction db
+
| ||
 
|-
 
|-
| mint||molecular interaction db
+
| L||More than one possible assignment is possible (see + above). The assignment with the largest (L) SEGUID is arbitrarily chosen (see Methods).
 
|-
 
|-
| mpact||mips yeast protein interaction db
+
| ||
 
|-
 
|-
| mppi||mips mammalian protein interaction db
+
| N||The protein reference, taxonomy identifier and sequence for the protein as provided in the interaction record are used to make a new entry in the SEGUID table. The protein interactor is assigned the newly (N) generated ROG identifier.
 
|-
 
|-
| ophid||online predicted human interaction db
+
| ||
 
|-
 
|-
 +
|
 
|}
 
|}
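
The legend above can be turned into a small lookup for decoding a rogscore string. The following is a minimal sketch, assuming that every character of the rogscore is exactly one feature code from the legend; the names ROGSCORE_FEATURES and decode_rogscore are hypothetical and are not part of iRefIndex, and the descriptions are shortened paraphrases of the legend.

```python
# Hypothetical lookup built from the rogscore legend in this README.
ROGSCORE_FEATURES = {
    "P": "primary reference used",
    "D": "source database differs from the expected one (tolerated)",
    "T": "taxonomy identifier differs from the sequence record (tolerated)",
    "M": "typographical modification of a known accession",
    "V": "version information ignored",
    "Q": "reference of type see-also used",
    "U": "secondary UniProt accession updated to a primary accession",
    "E": "retired NCBI identifier resolved with eUtils",
    "I": "NCBI GenInfo Identifier used",
    "G": "EntrezGene identifier; gene products used",
    "S": "secondary reference used",
    "+": "more than one possible assignment",
    "O": "ambiguity resolved by the SEGUID of the original sequence",
    "X": "ambiguity resolved by the taxonomy identifier",
    "L": "ambiguity resolved by choosing the largest SEGUID",
    "N": "new SEGUID entry created from the provided sequence",
}


def decode_rogscore(rogscore):
    """Return (code, description) pairs for each character of a rogscore.

    Assumption: one character per feature; unknown characters are rejected.
    """
    unknown = [code for code in rogscore if code not in ROGSCORE_FEATURES]
    if unknown:
        raise ValueError("unknown rogscore characters: %r" % unknown)
    return [(code, ROGSCORE_FEATURES[code]) for code in rogscore]


for code, description in decode_rogscore("PUTO+"):
    print(code, description)
```

The example string PUTO+ is the one given for column 14; the loop prints one line per feature code in the order the codes appear.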
  
=== Column number: 3 ===
+
=== Column number: 15 ===
  
 
{|
 
{|
|Column name: ||unfound_acc
+
|Column name: ||score_type
 
|-
 
|-
|Column type: ||string
+
|Column type: ||integer
 
|-
 
|-
|Description: ||An accession for a protein interactor in some database as supplied in the interaction record (see columns 1-2)
+
|Description: ||assignment score type 1-6
 
|-
 
|-
|Example: ||Q9Y6Q9
+
|Example: ||1
 
|}
 
|}
  
 
'''Notes'''
 
'''Notes'''
  
 +
See iRefIndex paper. PMID 18823568. Table 4 column 1.  Descriptions of score types are repeated here for convenience.
  
=== Column number: 4 ===
+
{| {{table}}
 
+
| align="center" style="background:#f0f0f0;"|'''Score type'''
{|
+
| align="center" style="background:#f0f0f0;"|'''Description'''
|Column name: ||unfound_db
+
|-
 +
| 1 ||Type 1 assignments were least problematic. In all cases, an unambiguous assignment to a ROG was possible using either a primary or secondary reference (P, S). In a few cases, version information was ignored (V), the source database (D) was relaxed or minor modifications (M) to the accession were allowed in order to find the corresponding entry in our SEGUID table.
 +
|-
 +
| 2 ||Type 2 assignments required that the accession provided by the source database be updated using either UniProt secondary accessions (U) or NCBI eUtils (E). In all cases, an unambiguous assignment was made. In a few cases, the sequence provided by the interaction database was required to accomplish this (score PUO+).
 +
|-
 +
| 3 ||Type 3 assignments involved references where the taxonomy identifier provided by the interaction database was different than the 'true' taxon provided by the source sequence record.
 
|-
 
|-
|Column type: ||string
+
| 4 ||Type 4 assignments represent those rare cases where both an update to an accession was required and the true taxonomy identifier was different than expected.
 
|-
 
|-
|Description: ||source db for accession listed in column 4
+
| 5 ||Type 5 assignments involved references that could be mapped to a number of different proteins (see + in assignment score) and that could not be resolved using sequence data provided in the record. In some cases, this was resolved by choosing the ROG that had the expected taxonomy identifier (X) or by the arbitrary method of choosing the assignment with the largest SEGUID (L) according to its ASCII value.
 
|-
 
|-
|Example: ||uniprotkb
+
| 6 ||Finally, type 6 assignments involved interactors for which no matching reference or sequence existed in our SEGUID table. The protein sequence provided by the interaction record (or retrieved from archival sources) was used to construct a new (N) SEGUID entry. This served to group together any other interactors that might have the same sequence in the current build of the index. This is a stop-gap measure and new SEGUID entries are discarded from one build of the database to another.
 
|}
 
|}
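
A quick way to see how much of a feedback file is problematic is to tally column 15. The following is a minimal sketch under stated assumptions: the score_type values have already been read from the file as strings, and the names SCORE_TYPE_SUMMARY and count_score_types are hypothetical. The summaries are shortened paraphrases of the table above.

```python
from collections import Counter

# Hypothetical one-line summaries of the six score types described above.
SCORE_TYPE_SUMMARY = {
    1: "unambiguous assignment from a primary or secondary reference",
    2: "accession updated (UniProt secondary accession or NCBI eUtils)",
    3: "taxonomy identifier differed from the source sequence record",
    4: "accession update and taxonomy difference together",
    5: "ambiguous reference resolved by taxonomy or largest SEGUID",
    6: "new SEGUID entry built from the provided sequence",
}


def count_score_types(score_types):
    """Count how often each score_type (column 15) occurs.

    Values outside 1-6 raise ValueError, since the README documents
    only six score types.
    """
    counts = Counter(int(value) for value in score_types)
    unexpected = sorted(set(counts) - set(SCORE_TYPE_SUMMARY))
    if unexpected:
        raise ValueError("unexpected score_type values: %r" % unexpected)
    return counts


counts = count_score_types(["1", "1", "2", "6"])
for score_type in sorted(counts):
    print(score_type, counts[score_type], SCORE_TYPE_SUMMARY[score_type])
```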
  
'''Notes'''
 
  
This is the primary protein sequence database referenced in the interaction record.
+
== Description of Not found file==
 +
 
 +
Protein references that were "not found" by the irefindex procedure are listed in these files.
 +
Each source database has a file named after itself:
 +
 
 +
dbname_not_mappedv.v.txt.zip
 +
 
 +
where dbname is the name of the database and v.v is the iRefIndex release number. For example,
 +
 
 +
bind_not_mapped2.0.txt.zip
 +
 
 +
These files follow the same general format as the feedback files (above) except:
 +
Only the first six columns are informative.  These columns point to a protein reference in some interaction record that could not be mapped.
 +
See the notes above for columns 1 - 6.
 +
 
 +
Columns 7 - 9 (used db, used accession and used taxon id) will be non-informative.
 +
 
 +
Columns 10 - 12 (mapped db, mapped accession and mapped taxon id) will be non-informative.
 +
 
 +
Columns 13 - 15 (rogid, rogscore and score type) will be non-informative.
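
The naming rule and the six informative columns can be sketched as follows. This is an illustration only: it assumes the zip archive has already been extracted and opened as a text stream, and the function names not_mapped_filename and read_not_mapped are hypothetical.

```python
import io

# The six informative columns of a not-found file, as named in this README.
INFORMATIVE_COLUMNS = [
    "int_db", "int_acc", "source_file",
    "primary_db", "primary_acc", "primary_taxid",
]


def not_mapped_filename(db_name, version):
    """Build the file name, e.g. bind_not_mapped2.0.txt.zip."""
    return "%s_not_mapped%s.txt.zip" % (db_name, version)


def read_not_mapped(handle):
    """Yield the first six columns of each tab-delimited line as a dict.

    Columns 7-15 are ignored because they are non-informative here.
    Lines with fewer than six fields are skipped (an assumption).
    """
    for line in handle:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < len(INFORMATIVE_COLUMNS):
            continue
        yield dict(zip(INFORMATIVE_COLUMNS, fields[:6]))


# INT_ACC_PLACEHOLDER is a made-up value used only for illustration.
example = io.StringIO(
    "bind\tINT_ACC_PLACEHOLDER\tpmid_2006_14691232.xml\t"
    "uniprotkb\tQ9Y6Q9\t9606\t-\t-\t-\t-\t-\t-\t-\t-\t-\n"
)
print(not_mapped_filename("bind", "2.0"))
print(list(read_not_mapped(example)))
```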
 +
 
 +
 
 +
== Contact ==
  
=== Column number: 5 ===
+
Questions regarding these files may be sent to ian.donaldson at bio.uio.no.
 +
Comments or questions that may be of general interest may be posted to [http://groups.google.com/group/irefindex?hl=en irefindex google group].
  
{|
 
|Column name: ||unfound_taxid
 
|-
 
|Column type: ||integer
 
|-
 
|Description: ||taxonomy of protein interactor as listed in the source interaction record
 
|-
 
|Example: ||9606
 
|}
 
  
'''Notes'''
+
[[Category:iRefIndex]]

Latest revision as of 20:14, 26 April 2009

Last edited: January 17th, 2009

Applies to iRefIndex release: 2.0 beta.

Release date: January 13th, 2009

Authors: Ian Donaldson and Sabry Razick

Database: iRefIndex (http://irefindex.uio.no)

Organization: Biotechnology Centre of Oslo, University of Oslo (http://www.biotek.uio.no/)

Description

This file describes the contents of the ftp://ftp.no.embnet.org/irefindex/feedback/ directory and the format of the tab-delimited text files contained within.

Directory contents

File name | Description of file
README | Pointer to this page
db_name_feedback_v.v.txt.zip | feedback for some database (db_name) for irefindex version v.v
db_name_not_mappedv.v.txt.zip | accessions provided by some database (db_name) that were not found for irefindex version v.v


Changes from last version

None. First release of this file.

Known Issues

None.

License

This directory is intended for the source databases incorporated into iRefIndex. These data are released under the Creative Commons Attribution License http://creativecommons.org/licenses/by/2.5/. This means that you are free to use, modify and redistribute these data for personal or commercial use so long as you provide appropriate credit. See the Citation section.

Copyright © 2009 Ian Donaldson

Citation

Razick, S., Magklaras, G., Donaldson, IM. iRefIndex: A consolidated protein interaction database with provenance. BMC Bioinformatics. 2008. 9(1):405. PMID 18823568.

Disclaimer

Data is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Understanding the Feedback files

The feedback file consists of 15 columns. Each line in this file represents

  1. a reference to a protein interactor found in some source db record and
  2. the iRefIndex mapping to a current protein sequence record. 

Columns 1 - 3 point to an interaction record where the protein reference was found.

Columns 4 - 6 describe the ***primary*** reference for the protein as listed in the interaction record.

Columns 7 - 9 describe the protein reference in the interaction record that was ***used*** by the iRefIndex process to locate a current protein sequence record. In most cases, this will be the same as columns 4 - 6; it differs when the primary reference could not be found and one of the secondary references supplied in the interaction record was used instead.

Columns 10 - 12 describe the protein reference that the ***used*** reference was ***mapped*** to. In most cases, this will be the same as the reference that was used to do the mapping (columns 7 - 9) unless the used reference had to be updated or was an entrez gene id.

Column 13 lists the rogid (see PMID 18823568) of the mapped protein.

Column 14 lists the rogscore (see PMID 18823568 and below) that describes operations performed during the mapping (such as updating identifiers or converting a gene identifier to a protein accession).

Column 15 lists the score_type (see PMID 18823568 and below). Rogscores (column 14) are grouped into one of six different score_types that indicate the severity of the operations required to perform the mapping. So, for instance, a score_type of one indicates a non-problematic mapping whereas a score_type of six indicates that the protein reference supplied in the interaction record could not be found and the rog assignment was based on the sequence of the protein provided in the interaction record.

A number of protein references could not be found in our current database of proteins and no sequence was provided in the interaction record. These protein references are listed in the dbname_not_mappedv.v.txt.zip file for each database. See "Description of Not found file" below.
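
The column layout described above can be read with a few lines of code. The following is a minimal sketch, assuming the file has already been unzipped, is tab-delimited without quoting, and has no header row; the function name read_feedback is hypothetical, and the column names are the ones documented in this README.

```python
import csv
import io

# Column names for columns 1-15, in order, as documented in this README.
FEEDBACK_COLUMNS = [
    "int_db", "int_acc", "source_file",              # interaction record
    "primary_db", "primary_acc", "primary_taxid",    # primary reference
    "used_db", "used_acc", "used_taxid",             # reference actually used
    "mapped_db", "mapped_acc", "mapped_taxid",       # reference mapped to
    "rogid", "rogscore", "score_type",               # iRefIndex assignment
]


def read_feedback(handle):
    """Yield one dict per feedback line, keyed by column name.

    Lines that do not have exactly 15 fields are skipped (an assumption;
    stricter callers may prefer to raise an error instead).
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) != len(FEEDBACK_COLUMNS):
            continue
        yield dict(zip(FEEDBACK_COLUMNS, fields))


# INT_ACC_PLACEHOLDER is a made-up value used only for illustration.
SAMPLE_LINE = "\t".join([
    "intact", "INT_ACC_PLACEHOLDER", "pmid_2006_14691232.xml",
    "uniprotkb", "Q9Y6Q9", "9606",
    "uniprotkb", "Q9Y6Q9", "9606",
    "uniprot", "Q9Y6Q9", "9606",
    "HWcRyNPgZ0dLD9cb5iuiarsGG8E9606", "P", "1",
]) + "\n"

for row in read_feedback(io.StringIO(SAMPLE_LINE)):
    print(row["primary_acc"], "->", row["mapped_acc"], row["rogscore"])
```

All values are kept as strings; converting the taxid and score_type columns to integers is left to the caller.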

Description of Feedback file

Each line in this file represents

  1. a protein interactor found in some source db record and
  2. the iRefIndex mapping to a current protein sequence record.

Column number: 1

Column name: int_db
Column type: string
Description: name of interaction db
Example: intact

Notes

Possible values in this field are:

bind | biomolecular interaction network db
biogrid | the biogrid db
dip | db of interacting proteins
hprd | human protein reference db
intact | ebi interaction db
mint | molecular interaction db
mpact | mips yeast protein interaction db
mppi | mips mammalian protein interaction db
ophid | online predicted human interaction db
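
When processing feedback files from several sources, it can help to check column 1 against the list of documented values. The following is a minimal sketch; the names KNOWN_INT_DBS and is_known_int_db are hypothetical, and the case-insensitive comparison is an assumption rather than documented behaviour.

```python
# Documented int_db values (column 1) and their descriptions.
KNOWN_INT_DBS = {
    "bind": "biomolecular interaction network db",
    "biogrid": "the biogrid db",
    "dip": "db of interacting proteins",
    "hprd": "human protein reference db",
    "intact": "ebi interaction db",
    "mint": "molecular interaction db",
    "mpact": "mips yeast protein interaction db",
    "mppi": "mips mammalian protein interaction db",
    "ophid": "online predicted human interaction db",
}


def is_known_int_db(name):
    """Return True if name is one of the documented int_db values."""
    return name.strip().lower() in KNOWN_INT_DBS


print(is_known_int_db("intact"))
print(is_known_int_db("unknown_db"))
```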

Column number: 2

Column name: int_acc
Column type: string
Description: accession for interaction record
Example: intact

Notes


Column number: 3

Column name: source_file
Column type: string
Description: source file of interaction record
Example: pmid_2006_14691232.xml

Notes


Column number: 4

Column name: primary_db
Column type: string
Description: source db for accession listed in column 5
Example: uniprotkb

Notes

This is the primary protein sequence database referenced in the interaction record.

Column number: 5

Column name: primary_acc
Column type: string
Description: An accession for a protein interactor in some database as supplied in the interaction record (see columns 1-2)
Example: Q9Y6Q9

Notes


Column number: 6

Column name: primary_taxid
Column type: integer
Description: taxonomy of protein interactor as listed in the source interaction record
Example: 9606

Notes


Column number: 7

Column name: used_db
Column type: string
Description: source db for accession listed in column 8
Example: uniprotkb

Notes

This is the protein sequence database referenced in the interaction record that was used to perform the mapping (columns 10-12).


Column number: 8

Column name: used_acc
Column type: string
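Description: The accession for a protein interactor, as supplied in the interaction record (see columns 1-2), that was used to perform the mapping. This is the primary accession (column 5) unless a secondary reference had to be used.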
Example: Q9Y6Q9

Notes

Column number: 9

Column name: used_taxid
Column type: integer
Description: taxonomy of protein interactor as listed in the source interaction record
Example: 9606

Notes


Column number: 10

Column name: mapped_db
Column type: string
Description: the source protein db that this interactor was mapped to by iRefIndex
Example: uniprot

Notes

This will most likely be the same as the db listed in column 7 unless:

reason | example | see scores with...
the db name is not valid or is a variation of a cv db | uniprot in place of "protein database" | D


Column number: 11

Column name: mapped_acc
Column type: string
Description: the accession that this interactor was mapped to by iRefIndex
Example: Q9Y6Q9

Notes

This will most likely be the same accession as listed in column 8 unless:

reason | example | see scores with...
a modified version of the accession has been used | NP_0001 in place of NP 0001 | M
an updated version of the accession has been used | xxx in place of xxx | U or E


Column number: 12

Column name: mapped_taxid
Column type: integer
Description: Taxonomy identifier for interactor as found in source protein db for record specified in columns 10 and 11.
Example: 9606

Notes

This will most likely be the same as the taxid listed in columns 6 and 9 unless:

reason | example | see scores with...
the listed taxid is different from that found in the mapped record | 9606 (human) in place of 40674 (mammalia) | T
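
The relationship between the taxid columns and the T feature can be checked row by row. The following is a minimal sketch, assuming a row is a dict keyed by the column names in this README and that a taxonomy difference between columns 6 and 12 should coincide with a T in the rogscore (column 14); the function names are hypothetical.

```python
def taxid_changed(row):
    """True when the mapped taxid (column 12) differs from the primary taxid (column 6)."""
    return int(row["primary_taxid"]) != int(row["mapped_taxid"])


def t_flag_consistent(row):
    """True when the presence of T in the rogscore matches the taxid comparison.

    This expectation is an assumption drawn from the notes for column 12,
    not a documented guarantee.
    """
    return taxid_changed(row) == ("T" in row["rogscore"])


# Illustrative row based on the example in the notes: 40674 (mammalia)
# listed in the interaction record, 9606 (human) in the mapped record.
row = {"primary_taxid": "40674", "mapped_taxid": "9606", "rogscore": "PT"}
print(taxid_changed(row), t_flag_consistent(row))
```

Rows for which t_flag_consistent returns False are candidates for a closer look, for example when a secondary reference with a different taxid was used.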

Column number: 13

Column name: rogid
Column type: string
Description: rogid of the interactor assigned by iRefIndex
Example: HWcRyNPgZ0dLD9cb5iuiarsGG8E9606

Notes

See iRefIndex paper. PMID 18823568.
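
The example rogid can be taken apart as follows. This is a minimal sketch, assuming the layout described in the iRefIndex paper: a 27-character SEGUID (a base64-encoded SHA-1 digest of the sequence with the trailing "=" removed) followed by the taxonomy identifier. The function name split_rogid is hypothetical.

```python
SEGUID_LENGTH = 27  # base64 of a 20-byte SHA-1 digest, padding removed


def split_rogid(rogid):
    """Split a rogid into its (SEGUID, taxonomy identifier) parts.

    Assumes the first 27 characters are the SEGUID and the rest is a
    numeric taxonomy identifier; anything else raises ValueError.
    """
    seguid, taxid = rogid[:SEGUID_LENGTH], rogid[SEGUID_LENGTH:]
    if len(seguid) != SEGUID_LENGTH or not taxid.isdigit():
        raise ValueError("unexpected rogid format: %r" % rogid)
    return seguid, int(taxid)


seguid, taxid = split_rogid("HWcRyNPgZ0dLD9cb5iuiarsGG8E9606")
print(seguid, taxid)
```

For the documented example this separates the SEGUID HWcRyNPgZ0dLD9cb5iuiarsGG8E from the taxonomy identifier 9606, which matches the mapped_taxid example in column 12.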

Column number: 14

Column name: rogscore
Column type: string
Description: description of assignment score for this interactor
Example: PUTO+

Notes

See iRefIndex paper. PMID 18823568.

The table below is a legend of the characters used in the rogscore (column 14).

Score Description of feature
P The interaction record's primary (P) reference for the protein was used to make the assignment.
D The source database (D) listed in the interaction record is different than what is expected for the given accession for the protein. In specific cases, this difference is tolerated and the assignment is made.
T The taxonomy (T) identifier for the protein (as supplied by the interaction record) differed from what was found in the protein sequence record. This discrepancy was tolerated and the assignment was made.
M The protein reference listed by the interaction record was a typographical modification (M) of a known accession. In specific cases, this variation is tolerated and the assignment is made.
V The protein reference listed by the interaction record contained version (V) information that was ignored. For example, RefSeq accession.version NP_012420.1 was listed but treated as RefSeq accession NP_012420.
Q The protein reference used to make the assignment was of the type "see-also". See PSI-MI Path: entrySet/entry/interactorList/interactor/xref/primaryRef/refType = "see-also".
U The protein reference listed in the interaction record and used to make the assignment was a secondary UniProt accession and was updated (U) to a primary UniProt accession in order to make the assignment.
E The protein reference was a retired NCBI Identifier. NCBI's eUtils (E) were used to retrieve the current accession and/or sequence.
I The protein reference used was an NCBI GenInfo Identifier (I).
G The interaction record's reference for the protein was an EntrezGene (G) identifier. The corresponding products of the gene were used to make the assignment.
S One of the interaction record's secondary (S) references for the protein was used to make the assignment.
+ More than one assignment is possible (+). This case may arise in one of three ways. 1) The reference supplied by the interaction record requires updating but more than one possibility exists. For example, Q7XJL8 was found to be a secondary accession in three separate UniProt records (Q3EBZ2, Q6DR20, and Q8GWA9). 2) The secondary references supplied by the interaction record point to more than one unique protein sequence. 3) An EntrezGene identifier is provided in the interaction record as a protein reference. This identifier points to more than one protein product. An attempt is made to resolve this ambiguity as indicated by ROG score features O, X or L (see below).
O More than one assignment is possible (see + above). The assignment chosen has a SEGUID that is identical to the SEGUID of the original (O) sequence provided in the interaction record.
X More than one assignment is possible (see + above). The assignment chosen has the same taxonomy (X) identifier as listed in the interaction record.
L More than one assignment is possible (see + above). The assignment with the largest (L) SEGUID is arbitrarily chosen (see Methods).
N The protein reference, taxonomy identifier and sequence for the protein as provided in the interaction record are used to make a new entry in the SEGUID table. The protein interactor is assigned the newly (N) generated ROG identifier.
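
The legend above can be applied to a rogscore string one character at a time. The following Python sketch is illustrative only: the dictionary holds shortened paraphrases of the legend, and the names `ROGSCORE_FEATURES` and `decode_rogscore` are hypothetical.

```python
# Shortened paraphrases of the legend for column 14 (rogscore).
ROGSCORE_FEATURES = {
    "P": "primary reference used",
    "D": "source database differed from expected and was tolerated",
    "T": "taxonomy identifier differed from the sequence record",
    "M": "typographical modification of a known accession",
    "V": "version information ignored",
    "Q": "reference of type see-also used",
    "U": "secondary UniProt accession updated to primary",
    "E": "retired NCBI identifier resolved with eUtils",
    "I": "NCBI GenInfo Identifier used",
    "G": "EntrezGene identifier used",
    "S": "secondary reference used",
    "+": "more than one assignment possible",
    "O": "ambiguity resolved by SEGUID of original sequence",
    "X": "ambiguity resolved by taxonomy identifier",
    "L": "ambiguity resolved by largest SEGUID",
    "N": "new SEGUID entry created",
}


def decode_rogscore(rogscore):
    """Return (character, description) pairs for each feature in a rogscore."""
    return [(ch, ROGSCORE_FEATURES.get(ch, "unknown feature")) for ch in rogscore]


for code, description in decode_rogscore("PUTO+"):
    print(code, description)
```

Characters that are not in the legend are reported as "unknown feature" rather than raising an error, which keeps the sketch usable if a later release adds new codes.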

Column number: 15

Column name: score_type
Column type: integer
Description: assignment score type 1-6
Example: 1

Notes

See iRefIndex paper. PMID 18823568. Table 4 column 1. Descriptions of score types are repeated here for convenience.

Score type Description
1 Type 1 assignments were least problematic. In all cases, an unambiguous assignment to a ROG was possible using either a primary or secondary reference (P, S). In a few cases, version information was ignored (V), the source database (D) was relaxed or minor modifications (M) to the accession were allowed in order to find the corresponding entry in our SEGUID table.
2 Type 2 assignments required that the accession provided by the source database be updated using either UniProt secondary accessions (U) or NCBI eUtils (E). In all cases, an unambiguous assignment was made. In a few cases, the sequence provided by the interaction database was required to accomplish this (score PUO+).
3 Type 3 assignments involved references where the taxonomy identifier provided by the interaction database was different than the 'true' taxon provided by the source sequence record.
4 Type 4 assignments represent those rare cases where both an update to an accession was required and the true taxonomy identifier was different than expected.
5 Type 5 assignments involved references that could be mapped to a number of different proteins (see + in assignment score) and that could not be resolved using sequence data provided in the record. In some cases, this was resolved by choosing the ROG that had the expected taxonomy identifier (X) or by the arbitrary method of choosing the assignment with the largest SEGUID (L) according to its ASCII value.
6 Finally, type 6 assignments involved interactors for which no matching reference or sequence existed in our SEGUID table. The protein sequence provided by the interaction record (or retrieved from archival sources) was used to construct a new (N) SEGUID entry. This served to group together any other interactors that might have the same sequence in the current build of the index. This is a stop-gap measure and new SEGUID entries are discarded from one build of the database to another.
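
A source database may want to see how its records are distributed over the six score types. The Python sketch below tallies column 15 over the lines of an unzipped feedback file. It assumes tab-delimited rows of at least 15 columns and skips any row whose last relevant field is not an integer (for example a header row, if one is present); the function name `count_score_types` is hypothetical.

```python
import csv
from collections import Counter


def count_score_types(lines):
    """Count occurrences of each score_type (column 15) in feedback rows."""
    counts = Counter()
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        # Column 15 is row[14]; skip headers and malformed rows.
        if len(row) < 15 or not row[14].strip().isdigit():
            continue
        counts[int(row[14])] += 1
    return counts


# Placeholder rows: only column 15 carries a meaningful value here.
sample = [
    "\t".join(["-"] * 14 + ["1"]),
    "\t".join(["-"] * 14 + ["5"]),
    "\t".join(["-"] * 14 + ["1"]),
]
print(count_score_types(sample))
```

With a real file, an open text file handle can be passed in place of the sample list, since `csv.reader` accepts any iterable of lines.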


Description of Not found file

Protein references that were "not found" by the iRefIndex procedure are listed in these files. Each source database has a file named after itself:

dbname_not_mappedv.v.txt.zip

where dbname is the name of the database and v.v is the iRefIndex release number. For example,

bind_not_mapped2.0.txt.zip

These files follow the same general format as the feedback files (above), except that only the first six columns are informative. These columns point to a protein reference in some interaction record that could not be mapped. See the notes above for columns 1 - 6.

Columns 7 - 9 (used db, used accession and used taxon id) will be non-informative.

Columns 10 - 12 (mapped db, mapped accession and mapped taxon id) will be non-informative.

Columns 13 - 15 (rogid, rogscore and score type) will be non-informative.
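
Reading a not-found file can be sketched in Python as follows. This is a minimal illustration under several assumptions: the zip archive contains a single tab-delimited text file, the text is UTF-8 encoded, and only the first six columns are kept. The function names `not_mapped_filename` and `read_not_mapped` are hypothetical.

```python
import io
import zipfile


def not_mapped_filename(db_name, release):
    """Build a file name following the dbname_not_mappedv.v.txt.zip pattern."""
    return f"{db_name}_not_mapped{release}.txt.zip"


def read_not_mapped(zip_path_or_file):
    """Yield the first six columns of every row in a not-found archive."""
    with zipfile.ZipFile(zip_path_or_file) as archive:
        member = archive.namelist()[0]  # assumption: one text file per archive
        with archive.open(member) as handle:
            for line in io.TextIOWrapper(handle, encoding="utf-8"):
                fields = line.rstrip("\r\n").split("\t")
                yield fields[:6]  # columns 7 - 15 are non-informative


print(not_mapped_filename("bind", "2.0"))
```

For example, `for columns in read_not_mapped(not_mapped_filename("bind", "2.0")): ...` would iterate over the unmapped references once the file has been downloaded to the working directory.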


Contact

Questions regarding these files may be sent to ian.donaldson at bio.uio.no. Comments or questions that may be of general interest may be posted to the iRefIndex Google group.