Difference between revisions of "Protein identifier mapping"

From irefindex
Revision as of 09:59, 12 January 2011

Last edited: 2011-01-12


We have made a file which provides mappings between iRefIndex identifiers and popular external identifiers. The current file contains all UniProt and RefSeq identifiers (please refer to http://irefindex.uio.no/wiki/Sources_iRefIndex_7.0 for version information) and other identifiers in selected cases. Identifiers from other databases are provided for an iRefIndex identifier only when it has no UniProt or RefSeq identifier that maps to an identical sequence.

File download location: ftp://ftp.no.embnet.org/irefindex/data/current/Mappingfiles/

The column descriptions:

{|
! Column number !! Column name !! Description
|-
| 1||db||Source of the external identifier (e.g. UniProt, RefSeq)
|-
| 2||acc||The external identifier (e.g. Q4U9M9)
|-
| 3||entrezGeneid||Entrez Gene ID. This is provided '''only''' for RefSeq identifiers; for other identifiers the value of this field is -1. See note 1.
|-
| 4||irogid||Integer version of the redundant object group identifier (rogid) (e.g. 3156116; current maximum value=14005379; this is a MySQL int(11) field).
|-
| 5||rogid||String version of the redundant object group identifier (64 bit version of the hash digest of the primary amino acid sequence with the NCBI taxonomy identifier appended at the end). See note 2.
|-
| 6||icrogid||Integer version of the canonical redundant object group identifier (crogid) (a selected irogid that represents the canonical group). See note 3.
|-
| 7||crogid||String version of the canonical redundant object group identifier (a selected rogid that represents the canonical group). See note 3.
|}
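The column layout above can be read into typed records with a few lines of Python. This is only a minimal sketch under stated assumptions: the page does not say how the file is delimited or whether it has a header row, so the tab delimiter, the header-skipping rule, the names `MappingRow` and `parse_mapping`, and the sample rogid string are all hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class MappingRow(NamedTuple):
    """One line of the mapping file, in the column order of the table."""
    db: str
    acc: str
    entrez_gene_id: Optional[int]  # None when the file holds -1
    irogid: int
    rogid: str
    icrogid: int
    crogid: str


def parse_mapping(lines: Iterable[str], delimiter: str = "\t") -> Iterator[MappingRow]:
    """Yield one MappingRow per data line.

    Assumption: fields are tab-separated. Lines that do not have seven
    fields, or whose irogid field is not an integer (e.g. a header row),
    are skipped.
    """
    for line in lines:
        fields = line.rstrip("\r\n").split(delimiter)
        if len(fields) != 7 or not fields[3].isdigit():
            continue
        db, acc, gene, irogid, rogid, icrogid, crogid = fields
        gene_id = int(gene)
        yield MappingRow(
            db=db,
            acc=acc,
            entrez_gene_id=None if gene_id == -1 else gene_id,
            irogid=int(irogid),
            rogid=rogid,
            icrogid=int(icrogid),
            crogid=crogid,
        )


# Made-up sample: the accession and irogid come from the table's examples,
# the rogid/crogid strings are placeholders, not real identifiers.
sample = [
    "db\tacc\tentrezGeneid\tirogid\trogid\ticrogid\tcrogid\n",
    "UniProt\tQ4U9M9\t-1\t3156116\tPLACEHOLDERROGID\t3156116\tPLACEHOLDERROGID\n",
]
rows = list(parse_mapping(sample))
```

To read a downloaded file, pass an open file handle in place of `sample`, for example `parse_mapping(open(path))`, where `path` is wherever the file was saved.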

(1) Some protein sequence records can be mapped to an Entrez Gene record but will not have an entry in this column because they are not RefSeq records. In these cases, use the irogid (or icrogid) to retrieve all other entries in this table with identical sequences (or belonging to the same canonical group); one of these may have an Entrez Gene ID in this column.

(2) Please see http://www.ncbi.nlm.nih.gov/pubmed/18823568 for the algorithm describing how to generate this key from a protein sequence.

(3) Please refer to the following page for details on the canonicalization process: http://irefindex.uio.no/wiki/Canonicalization
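The lookup described in note 1 could be sketched as follows. This is an illustration, not part of the iRefIndex distribution: the function names are hypothetical, rows are assumed to be already split into seven string fields in the column order of the table, and the example rows are invented.

```python
from collections import defaultdict

# 0-based positions of the columns used here (column order from the table).
ACC, ENTREZ, IROGID, ICROGID = 1, 2, 3, 5


def gene_ids_by_group(rows, group_column=IROGID):
    """Collect the Entrez Gene IDs found in each group.

    group_column is IROGID for identical sequences or ICROGID for the
    canonical group. Rows whose entrezGeneid is -1 contribute nothing.
    """
    groups = defaultdict(set)
    for row in rows:
        gene_id = int(row[ENTREZ])
        if gene_id != -1:
            groups[row[group_column]].add(gene_id)
    return groups


def lookup_gene_ids(rows, acc, group_column=IROGID):
    """Return Entrez Gene IDs reachable from an accession via its group."""
    rows = list(rows)
    groups = gene_ids_by_group(rows, group_column)
    found = set()
    for row in rows:
        if row[ACC] == acc:
            found |= groups.get(row[group_column], set())
    return found


# Invented rows: db, acc, entrezGeneid, irogid, rogid, icrogid, crogid.
example_rows = [
    ["UniProt", "P00001", "-1", "10", "rogA", "10", "rogA"],
    ["RefSeq", "NP_000001", "1234", "10", "rogA", "10", "rogA"],
    ["UniProt", "P00002", "-1", "11", "rogB", "10", "rogA"],
]

same_sequence = lookup_gene_ids(example_rows, "P00001")
canonical = lookup_gene_ids(example_rows, "P00002", group_column=ICROGID)
no_match = lookup_gene_ids(example_rows, "P00002")
```

In this invented data, P00001 shares an irogid with a RefSeq record and so picks up its Entrez Gene ID directly, while P00002 only reaches it through the canonical group (icrogid).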