Canonicalization

What is canonicalization and how does it alter iRefIndex and iRefWeb data?

Canonicalization refers to the process of mapping protein interactors to a single canonical representative of a family of proteins that are all products of the same gene or related genes (related gene group).

Canonicalization is a post-processing step in the iRefIndex build process in which all ROGIDs (interactors) are mapped to canonical ROGIDs. The net effect of the process is to minimize the number of canonical proteins by using information from both UniProt and EntrezGene.

Data in iRefWeb will undergo this canonicalization procedure as of version 6.0. iRefIndex MITAB files (available for download from the FTP server) and MITAB data retrieved from our PSICQUIC web services have NOT undergone this post-processing step. However, canonical data will eventually be made available in these MITAB files to allow users to group interactors (and interactions) according to their canonical representatives.

Summary of the canonicalization method.

Each EntrezGene record is associated with a list of zero or more distinct protein products (as indicated by the ROGIDs for these proteins). EntrezGene identifiers were grouped together into related gene groups (RGGs) if they shared at least one identical protein product. Each RGG therefore has an initial list of distinct protein products, each encoded by at least one of its member genes and represented by a set of RefSeq protein records. This initial list was expanded to include (1) UniProt proteins that were isoforms of one of the proteins already in the list and/or (2) UniProt proteins that cross-referenced one of the EntrezGene identifiers in the RGG.

From this expanded list of proteins, one distinct protein was chosen as the canonical representative for the entire list. If one of the proteins was a canonical sequence (as defined by UniProt: see http://www.uniprot.org/faq/30), it was chosen as the canonical form. If two or more such proteins existed, the longest was chosen. If no canonical UniProt sequence existed, the longest protein sequence associated with the RGG was chosen.
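The grouping and selection steps described above can be sketched in a few lines of Python. This is only an illustration, not the iRefIndex build code: the function names, the input shapes and the identifiers in the example are hypothetical. It also makes two assumptions that the summary does not state: genes are grouped transitively (if A shares a product with B and B shares one with C, all three end up in one RGG), and ties on length are broken by input order.

```python
from collections import defaultdict


def build_rggs(gene_products):
    """Group gene ids into related gene groups (RGGs).

    gene_products maps a gene id to the set of its distinct protein
    product ids. Genes that share a product, directly or through a
    chain of other genes (an assumption), land in the same group.
    """
    parent = {gene: gene for gene in gene_products}

    def find(gene):
        # Follow parent links to the group root, shortening the path.
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    first_gene_for = {}
    for gene, products in gene_products.items():
        for product in products:
            if product in first_gene_for:
                # Shared product: merge the two genes' groups.
                parent[find(gene)] = find(first_gene_for[product])
            else:
                first_gene_for[product] = gene

    groups = defaultdict(set)
    for gene in gene_products:
        groups[find(gene)].add(gene)
    return sorted(sorted(members) for members in groups.values())


def choose_canonical(proteins):
    """Pick the representative from one expanded RGG protein list.

    proteins is a list of dicts with the hypothetical keys "id",
    "length" and "uniprot_canonical". UniProt canonical sequences are
    preferred; within the preferred pool the longest sequence wins.
    """
    canonical = [p for p in proteins if p["uniprot_canonical"]]
    pool = canonical or proteins
    return max(pool, key=lambda p: p["length"])["id"]


# Hypothetical example data.
genes = {"g1": {"pA", "pB"}, "g2": {"pB"}, "g3": {"pC"}, "g4": set()}
print(build_rggs(genes))  # [['g1', 'g2'], ['g3'], ['g4']]

candidates = [
    {"id": "p1", "length": 300, "uniprot_canonical": False},
    {"id": "p2", "length": 250, "uniprot_canonical": True},
    {"id": "p3", "length": 280, "uniprot_canonical": True},
]
print(choose_canonical(candidates))  # p3
```

In the example, p1 is the longest protein overall but is not chosen, because two UniProt canonical sequences exist and the longer of those (p3) takes precedence.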

Background on representation of protein isoforms by NCBI versus UniProt

UniProt

A UniProt record (e.g. accession O43497) contains a protein sequence. Some proteins may have related isoforms arising from alternative splicing of the mRNA or from alternative promoter usage. These isoform proteins may not have their own UniProt records and accessions; instead, they are described inside the canonical sequence record using a modification of the accession (like O43497-2), along with a description of how the canonical sequence can be modified to arrive at the isoform sequence. Accession identifiers like O43497 are referred to as "canonical" accessions and those like O43497-2 are referred to as isoform accessions. A canonical sequence will also have a corresponding isoform accession (like O43497-1), but it is not always the -1 variant of the canonical accession (i.e., it could be O43497-4). The protein isoform that is chosen to be the canonical form is selected according to the following criteria:

1. It is the most prevalent.

2. It is the most similar to orthologous sequences found in other species.

3. By virtue of its length or amino acid composition, it allows the clearest description of domains, isoforms, polymorphisms, post-translational modifications, etc.

4. In the absence of any other information, the longest sequence is chosen.
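The accession convention described before this list (O43497 for the entry, O43497-2 for an isoform) can be handled with a small helper. The following is a minimal sketch with hypothetical function names. The pattern is a deliberate simplification (six or ten uppercase alphanumeric characters, optionally followed by a hyphen and a number) and not UniProt's official accession expression.

```python
import re
from collections import defaultdict

# Simplified, assumed pattern: 6- or 10-character accession, optional -N.
ACCESSION_RE = re.compile(r"^([A-Z0-9]{6}(?:[A-Z0-9]{4})?)(?:-(\d+))?$")


def split_accession(accession):
    """Return (entry_accession, isoform_number or None)."""
    match = ACCESSION_RE.match(accession)
    if match is None:
        raise ValueError(f"not a recognised accession: {accession!r}")
    base, isoform = match.groups()
    return base, int(isoform) if isoform else None


def group_by_entry(accessions):
    """Map each entry accession to the sorted isoform numbers seen."""
    grouped = defaultdict(set)
    for accession in accessions:
        base, isoform = split_accession(accession)
        if isoform is not None:
            grouped[base].add(isoform)
        else:
            grouped[base]  # register the entry even without isoforms
    return {base: sorted(numbers) for base, numbers in grouped.items()}


print(split_accession("O43497-2"))  # ('O43497', 2)
print(split_accession("O43497"))    # ('O43497', None)
print(group_by_entry(["O43497", "O43497-2", "O43497-4"]))
# {'O43497': [2, 4]}
```

Because the canonical sequence is not always the -1 isoform, the isoform number alone cannot be used to decide which sequence is canonical; that information has to come from the UniProt record itself.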


In addition, a UniProt canonical group is not limited to products of alternative splicing. It contains isoforms generated by:

1. Alternative splicing

2. Alternative promoter usage

3. Alternative translation initiation


Additional information is available at http://www.uniprot.org/faq/30


It is also important to note that not all groups of related proteins are represented using this method. A group of proteins that are all products of the same gene may be represented by separate UniProt entries. This appears in part to depend on what stage of curation the group of proteins is in.

NCBI's RefSeq and EntrezGene

An Entrez Gene record (e.g. 4105) contains multiple cross-references to records that describe regions of DNA and the protein products that they encode. In this example, Entrez GeneId 4105 points to two RefSeq protein records (accessions NP_000964.1 and NP_150644.1) and these records in turn refer back to GeneId 4105.

It is important to note that RefSeq protein records represent protein products of specific mRNAs. Therefore, if two genes (two distinct regions of DNA) encode identical protein sequences, each will have its own RefSeq protein record. For example, EntrezGene 1104 and 751867 each code for three mRNAs, and there are six corresponding RefSeq protein records. However, there are only three distinct protein sequences because four of the mRNAs encode identical protein sequences (one from 1104 and all three from 751867: NP_001260.1, NP_001041662.1, NP_001041663.1 and NP_001041664.1). These four records describe four different proteins (according to NCBI's model) because each is encoded by a separate mRNA. In this specific instance, Entrez GeneID 1104 refers to the RCC1 gene, and 751867 refers to a similar region of the chromosome that is extended upstream and encodes the SNHG3-RCC1 readthrough transcript.
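The difference between counting records and counting distinct sequences can be illustrated by grouping records on a digest of their sequence. This is a sketch only: the record identifiers and sequences below are made-up placeholders (not the real RefSeq data for genes 1104 and 751867), and SHA-256 is used as a generic stand-in rather than the scheme iRefIndex uses to compute ROGIDs.

```python
import hashlib
from collections import defaultdict


def group_by_sequence(records):
    """Group (record_id, gene_id, sequence) tuples by identical sequence.

    Returns a dict mapping a sequence digest to the list of
    (record_id, gene_id) pairs that share that sequence.
    """
    groups = defaultdict(list)
    for record_id, gene_id, sequence in records:
        digest = hashlib.sha256(sequence.upper().encode("ascii")).hexdigest()
        groups[digest].append((record_id, gene_id))
    return dict(groups)


# Placeholder data shaped like the example in the text: two genes,
# three records each, four of the six records sharing one sequence.
records = [
    ("x1", "geneX", "MKTAYIAK"),
    ("x2", "geneX", "MKTAYIAKQR"),
    ("x3", "geneX", "MSTAYIAK"),
    ("y1", "geneY", "MKTAYIAK"),
    ("y2", "geneY", "MKTAYIAK"),
    ("y3", "geneY", "MKTAYIAK"),
]

groups = group_by_sequence(records)
print(len(records), "records,", len(groups), "distinct sequences")
for members in groups.values():
    print([record_id for record_id, _ in members])
```

Under NCBI's model all six records remain separate proteins; grouping by sequence collapses them to three, which is the view that a sequence-based identifier takes.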

There are three important differences between NCBI and UniProt to note here. First, the UniProt model of a protein does not create separate protein records for identical sequences just because they have separate mRNA or DNA origins. Second, NCBI groups related proteins together using GeneIds but will create separate records to reflect the different DNA origins of mRNA transcripts. Third, Entrez Gene does not designate any protein product as canonical: each protein product from each gene is assigned its own RefSeq record.