README iRefIndex MITAB 5.0
Last edited: 10th, August 2009
Applies to iRefIndex release: 5.0
Release date: 10th, August 2009
Download location: ftp://ftp.no.embnet.org/irefindex/data/archive/release_5.0/
Authors: Ian Donaldson and Sabry Razick (with changes from Paul Boddie)
Database: iRefIndex (http://irefindex.uio.no)
Organization: Biotechnology Centre of Oslo, University of Oslo (http://www.biotek.uio.no/)
Note: this distribution includes only those data that may be freely distributed under the copyright license of the source database. See Description below.
Contents
- 1 Description
- 2 Directory contents
- 3 Changes from last version
- 4 Known Issues
- 5 Understanding the iRefIndex MITAB format
- 6 License
- 7 Citation
- 8 Disclaimer
- 9 Description of PSI-MITAB2.5 file
- 9.1 Column number: 1
- 9.2 Column number: 2
- 9.3 Column number: 3
- 9.4 Column number: 4
- 9.5 Column number: 5
- 9.6 Column number: 6
- 9.7 Column number: 7
- 9.8 Column number: 8
- 9.9 Column number: 9
- 9.10 Column number: 10
- 9.11 Column number: 11
- 9.12 Column number: 12
- 9.13 Column number: 13
- 9.14 Column number: 14
- 9.15 Column number: 15
- 9.16 Column number: 16
- 9.17 Column number: 17
- 9.18 Column number: 18
- 9.19 Column number: 19
- 9.20 Column number: 20
- 9.21 Column number: 21
- 9.22 Column number: 22
- 9.23 Column number: 23
- 9.24 Column number: 24
Description
This file describes the contents of the irefindex/current/data directory and the format of the tab-delimited text files contained within. Each index file follows the PSI-MITAB2.5 format with additional columns for annotating edges and nodes. Each line in PSI-MITAB2.5 format represents a group of interaction records that all describe the same protein-protein interaction or the membership of a protein in a complex. Assignment of source interaction records to these redundant groups is described at http://irefindex.uio.no. The PSI-MITAB2.5 format and the additional columns are described at the end of this file.
Details on the build process are available from the publication PMID 18823568.
iRefIndex data distributed on the FTP site includes only those data that may be freely distributed under the copyright license of the source database. This includes data from BIND, BioGRID, IntAct, MINT, MPPI and OPHID.
iRefIndex also integrates data from CORUM, DIP, HPRD and MPact. These data are not distributed publicly. These data may be made available to academic users under a collaborative agreement.
Contact ian.donaldson at biotek.uio.no if you are interested in using the iRefIndex database or would like your database included in the public release of the index.
Directory contents
README | pointer to this file at http://irefindex.uio.no/wiki/README_iRefIndex_MITAB_5.0 |
Sources | pointer to data files for this release at http://irefindex.uio.no/wiki/Sources_iRefIndex_5.0 |
Statistics | pointer to statistics for this release at http://irefindex.uio.no/wiki/Statistics_iRefIndex_5.0 |
xxxx.mitab.mmddyyyy.txt.zip | individual indices in PSI-MITAB2.5 format |
iRefIndex data is distributed as a set of tab-delimited text files with names of the form xxxx.mitab.mmddyyyy.txt.zip where mmddyyyy represents the file's creation date.
The complete index is available as All.mitab.mmddyyyy.txt.zip .
Taxon specific data sets are also available for:
Taxon | Taxon Id |
Homo sapiens | 9606 (human) |
Mus musculus | 10090 (mouse) |
Rattus norvegicus | 10116 (brown rat) |
Caenorhabditis elegans | 6239 (nematode) |
Drosophila melanogaster | 7227 (fruit fly) |
Saccharomyces cerevisiae | 4932 (baker's yeast) |
Escherichia coli | 562 (E. coli) |
Other | other |
All | all |
Taxon specific subsets of the data are named xxxx.mitab.mmddyyyy.txt.zip where xxxx is the taxonomy identifier of at least one of the interactors according to either the source interaction database or the sequence database record. Each zip compressed file contains a single text file with the corresponding name xxxx.mitab.mmddyyyy.txt.
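As a minimal sketch of how one of these archives might be read, the following Python function opens a zip file, takes the single text file inside it and yields each line as a list of tab-separated column values. The function name read_mitab_zip is hypothetical. The UTF-8 encoding and the skipping of lines that start with "#" (a possible header line) are assumptions, not something stated in this README.

```python
import csv
import io
import zipfile


def read_mitab_zip(zip_source):
    """Yield each data row of the single text file in the archive as a list of strings.

    zip_source may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        # Each archive is described as containing exactly one text file.
        member = archive.namelist()[0]
        with archive.open(member) as raw:
            # Assumption: the text is UTF-8 (or plain ASCII).
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.reader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
            for row in reader:
                # Assumption: a line starting with "#" is a header, not data.
                if row and not row[0].startswith("#"):
                    yield row
```

Rows come back as plain lists, so column n of this README is row[n - 1] in Python.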
In some cases, interactors may belong to other taxa, for example if a virus-host interaction is being represented or if a protein from another organism has been used to model a protein in the specified organism.
Taxonomy identifiers are provided in the data sets allowing these exceptions to be identified. The taxonomy identifiers listed are derived from the source protein sequence record. In some cases, this taxonomy identifier will be a child of the taxon listed in the file's title; for example, Escherichia coli K12 (taxonomy identifier 83333) will appear in the Escherichia coli (taxonomy identifier 562) file.
A description of the NCBI taxon identifiers is available at the following location:
http://www.ncbi.nlm.nih.gov/sites/entrez?db=taxonomy
The above data taxon division scheme leads to duplications; for instance, an interaction present in the mouse index could also appear in the human index if the interaction record lists protein sequence records from both human and mouse. The All.mitab.mmddyyyy file is a complete and non-redundant listing.
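If several taxon-specific files are combined, the duplication described above can be removed again. The sketch below keeps the first occurrence of each line, keyed on uidA, uidB and the RIGID (columns 1, 2 and 20). The function name merge_without_duplicates is hypothetical, and the choice of key is an assumption; it presumes that a given combination of these three values identifies one line.

```python
def merge_without_duplicates(*row_iterables):
    """Yield rows from several parsed files, skipping repeated (uidA, uidB, rigid) keys."""
    seen = set()
    for rows in row_iterables:
        for row in rows:
            # Columns 1, 2 and 20 of the README are indices 0, 1 and 19 here.
            key = (row[0], row[1], row[19])
            if key not in seen:
                seen.add(key)
                yield row
```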
The data format and divisions provided in this initial release were chosen in the hopes that they would be immediately useful to the largest possible set of users. Other formats and divisions are possible and we welcome your input on future releases.
Changes from last version
Mapping to MI (PSIMI CV identifier) improved. Colon in grid identifiers corrected (earlier MI:0463:(grid) now MI:0463(grid))
Known Issues
To be edited
Understanding the iRefIndex MITAB format
iRefIndex is distributed in PSI-MITAB format. This format was originally described in a recent PSI-MI paper (PMID 17925023).
Since this PSI-MITAB format allows for only two interactors to be described on each line, it is best suited for describing binary interaction data (where the original experiment, say yeast two hybrid, gives a binary readout). However, other PSI-MI XML source records will describe interactions involving only one interactor type (dimers or multimers) or they will contain associative interaction data (say from immunoprecipitation experiments where the exact interactions between any pair of interactors are unknown). These cases are problematic for the PSI-MITAB format. This README describes exactly how we use the MITAB format to describe these alternate (non-binary) interaction types.
Each row in the MITAB file represents a **group** of interaction records from primary sources. Each member of this group describes an interaction involving the exact same set of proteins (as defined by their primary sequence and taxon ids).
The natural keys for each interaction record in this group (i.e. the record identifiers from the source database) are listed in column 14. For example:
intact:EBI-761694|intact:EBI-762624|mint:MINT-15283
Our surrogate (primary) key for this group of redundant interaction records (the RIG identifier) is also listed in column 14, as the very first entry. Keys in column 14 are bar (|) delimited. The source db name and record id in each key are separated by a colon.
The RIG identifier is a 27 character key that is derived from the ROGIDs of the interactors involved in the interaction record (see columns 1 and 2). Our RIG identifier is also listed (by itself) in column 20 for convenience. The ROGID is a SHA-1 digest of the protein interactor's primary amino acid sequence concatenated with the NCBI taxon id (see the paper for details).
Sometimes source interaction records in PSI-MI format only list one interactor. These are cases where either 1) an intramolecular interaction is being represented or 2) a multimer (3 or more) of some protein is being represented. These records are difficult to represent in the PSI-MITAB format because PSI-MITAB requires that each row (interaction) list two interactors. The way we handle this is to list the ROG identifier for the single interactor twice (once in each of columns 1 and 2) of the MITAB. The RIG identifier for these interactions will be the SHA-1 digest of the interactor’s ROG id (see column 20). These interactions are marked by a Y in column 21 (see the description of column 21 below).
Note that column 21 may also contain a C. This indicates that the MITAB entry describes membership of a protein in some complex. These entries correspond to PSI-MI records where more than two interactors are listed (associative interaction data). In these cases, the first column holds the ROG identifier of the complex and the second column contains the ROG id of the protein. We chose this method of representation so you can distinguish true binary interactions from complex membership. Other people might take an interaction record with multiple interactors and make a list of binary interactions (based on the spoke or matrix model) and then list these binary interactions in the MITAB. This would be wrong. It implies interactions that may not exist and hides the fact that the entry is not a binary interaction but a binary representation of a complex-interaction.
As an example, let’s say that a source interaction record contained interactors A, B and C found by affinity purification and mass-spec. We would calculate the RIG id as being X (a SHA-1 digest of A, B and C ROG ids concatenated together).
Then we would represent the complex in the MITAB file using three lines: X-A, X-B, and X-C. All three entries would have the same string in column 1 (the RIG id for the complex). All three entries would also have the same string in column 20 (again, the RIG id for the complex).
This allows you to reconstruct the members of the original interaction record that describes a complex of proteins (say from an affinity purification experiment). From there, you can choose to make a spoke or matrix model by yourself if you want.
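The reconstruction described above can be sketched in Python as follows. Rows are assumed to be lists of column strings (column n of this README at index n - 1). The function names are hypothetical, and the bait passed to spoke_model has to be chosen by the user, since the MITAB line itself does not say which member was the bait.

```python
from collections import defaultdict
from itertools import combinations

# Zero-based positions of columns 1 (uidA), 2 (uidB) and 21 (edgetype).
UID_A, UID_B, EDGE_TYPE = 0, 1, 20


def complex_members(rows):
    """Map each complex identifier (column 1) to the set of its member proteins (column 2)."""
    members = defaultdict(set)
    for row in rows:
        if row[EDGE_TYPE] == "C":
            members[row[UID_A]].add(row[UID_B])
    return dict(members)


def matrix_model(proteins):
    """Matrix expansion: every unordered pair of complex members."""
    return list(combinations(sorted(proteins), 2))


def spoke_model(bait, proteins):
    """Spoke expansion: the chosen bait paired with every other member."""
    return [(bait, prey) for prey in sorted(proteins) if prey != bait]
```

Keep in mind that both expansions imply pairwise interactions that may not exist, which is why they are left to the user rather than stored in the file.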
For binary interaction data, column 21 will contain an X. Two protein interactor ROGIDs will be listed in columns 1 and 2.
License
Data released on this public ftp site are released under the Creative Commons Attribution License http://creativecommons.org/licenses/by/2.5/. This means that you are free to use, modify and redistribute these data for personal or commercial use so long as you provide appropriate credit. See next section.
Copyright © 2008, 2009 Ian Donaldson
Citation
Credit should include citing the iRefIndex paper (PMID 18823568) and any of the source databases upon which this resource is based. See http://irefindex.uio.no for appropriate citations.
Disclaimer
Data is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Description of PSI-MITAB2.5 file
Each line in this file represents either
- an interaction between two proteins (binary interaction) or
- the membership of a protein in some complex (complex membership) or
- an interaction that involves only one protein type (multimer or self-interaction).
See column 21 for more details.
Column number: 1
Column name: | uidA |
Column type: | String |
Description: | Unique identifier for interactor A |
Example: | irefindex:hhZYhMtr5JC1lGIKtR1wxHAd3JY83333 |
Notes
If this line (entry) describes a binary interaction between two proteins, then the protein with the 'ASCIIbetically' (ASCII value sort order) larger ROGID is listed first as uidA. If this entry describes the membership of a protein in a complex, then the ROGID of the complex is always listed first as uidA and the protein's ROGID is listed as uidB (column 2). If this entry describes an interaction involving only one protein type, then the ROGID of that protein is listed both as uidA and uidB.
The ROGID (redundant object group identifier) for proteins consists of the SEGUID for the protein concatenated with the taxon identifier for the protein. For complex nodes, the ROGID is calculated as the SHA-1 digest of the ROGIDs of all the protein participants (after first ordering them by ASCII-based lexicographical sorting in ascending order and concatenating them). See the iRefIndex paper for details. The SEGUID is always 27 characters long, so the ROGID for a protein will be composed of 27 characters concatenated with a taxon identifier.
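The following Python sketch illustrates how a protein ROGID could be built and taken apart again. It assumes the commonly used SEGUID definition (the base64-encoded SHA-1 digest of the upper-cased sequence with the trailing "=" padding removed, which gives 27 characters); consult the iRefIndex paper for the authoritative procedure. All function names are hypothetical.

```python
import base64
import hashlib


def seguid(sequence):
    """Assumed SEGUID: base64 SHA-1 digest of the upper-cased sequence, '=' padding removed."""
    digest = hashlib.sha1(sequence.upper().encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


def protein_rogid(sequence, taxon_id):
    """SEGUID concatenated with the NCBI taxon identifier."""
    return seguid(sequence) + str(taxon_id)


def split_protein_rogid(uid):
    """Split a uidA/uidB value for a protein into (SEGUID, taxon identifier).

    An optional 'irefindex:' prefix is removed first. This is only meaningful
    for protein nodes; a complex ROGID has no taxon identifier appended.
    """
    rogid = uid.split(":", 1)[-1]
    return rogid[:27], rogid[27:]
```

For the column 1 example above, split_protein_rogid returns the 27-character SEGUID hhZYhMtr5JC1lGIKtR1wxHAd3JY and the taxon identifier 83333.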
Column number: 2
Column name: | uidB |
Column type: | String |
Description: | Unique identifier for interactor B |
Example: | irefindex:ImnYkXur2U4xVdz5PVvprq8Zgd483333 |
Notes
See notes for column 1.
Column number: 3
Column name: | altA |
Column type: | a|b: pipe-delimited set of strings |
Description: | Alternative identifiers for interactor A |
Example: | uniprotkb:P23367|refseq:NP_418591|entrezgene/locuslink:948691 |
Notes
Each pipe-delimited entry is a database_name:accession pair delimited by a colon. Database names are taken from the MI controlled vocabulary (http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=MI). Database references listed in this column may include the following:
- uniprotkb
- The accessions this protein is known by in UniProt (http://www.uniprot.org/). More information regarding this protein can be retrieved using this accession from UniProt. See the AC line in the flat file (http://au.expasy.org/sprot/userman.html#AC_line). UniProt accessions are mapped to nodes using an exact match to the ROGID. If the node's ROGID maps to a specific isoform of the protein, then the UniProt accession for the isoform is given.
- refseq
- If a protein accession exists in the RefSeq database (http://www.ncbi.nlm.nih.gov/RefSeq/), that reference is indicated here. More information about this protein can be obtained from RefSeq using this accession. RefSeq accessions are mapped to nodes using an exact match between the node's ROGID and the ROGID for the most recent version of the RefSeq accession.
- entrezgene/locuslink
- NCBI gene Identifiers for the gene encoding this protein. See ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2refseq column GeneID given protein's accession.version
- other
- If none of the three identifier types are available then other databasename:accession pairs will be listed. These database names may not follow the MI controlled vocabulary.
Example:
emb:CAA44868.1|gb:AAA23715.1|gb:AAB02995.1|emb:CAA56736.1|uniprot:P24991
- irefindex
- If the node represents a complex, then the ROGID for the complex will be listed here, such as the following:
irefindex:xBr9cTXgzPLNxsaKiYyHcoEm/DM
Note that this column value may contain duplicate identifiers.
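A minimal Python sketch for splitting such a column value into (database name, accession) pairs is given below. It removes the duplicates mentioned above while keeping the original order. The function name parse_db_pairs is hypothetical, and skipping the values "-" and "NA" is an assumption about how missing data should be treated. Each entry is split on its first colon only, so an accession that itself contains a colon stays intact.

```python
def parse_db_pairs(field):
    """Split 'db:acc|db:acc' into an ordered, de-duplicated list of (db, accession) tuples."""
    pairs = []
    seen = set()
    for entry in field.split("|"):
        # Assumption: empty entries, "-" and "NA" carry no identifier.
        if not entry or entry in ("-", "NA"):
            continue
        database, _, accession = entry.partition(":")
        pair = (database, accession)
        if pair not in seen:
            seen.add(pair)
            pairs.append(pair)
    return pairs
```

The same function can be applied to the other columns that hold databaseName:identifier pairs, such as columns 4, 5, 6, 9 and 14.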
Column number: 4
Column name: | altB |
Column type: | a|b: pipe-delimited set of strings |
Description: | Alternative identifiers for interactor B |
Example: | uniprotkb:P06722|refseq:NP_417308|entrezgene/locuslink:947299 |
Notes
See notes for column 3.
Column number: 5
Column name: | aliasA |
Column type: | a|b: pipe-delimited set of strings |
Description: | Aliases for interactor A |
Example: | uniprotkb:MUTL_ECOLI|entrezgene/locuslink:mutL |
Notes
Each pipe-delimited entry is a databasename:alias pair delimited by a colon. Database names are taken from the PSI-MI controlled vocabulary (http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=MI). Database names and sources listed in this column may include the following:
- uniprotkb:entry name
- the entry name given by UniProt. See Entry name in the ID line of the flat file. http://au.expasy.org/sprot/userman.html#ID_line
- entrezgene/locuslink:symbol
- the NCBI gene symbol for the gene encoding this protein. See ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info column Symbol given GeneID
- irefindex:complex
- If the node is a complex then irefindex:complex will be listed here.
- NA
- NA may be listed here if aliases are Not Available
Column number: 6
Column name: | aliasB |
Column type: | a|b: pipe-delimited set of strings |
Description: | Aliases for interactor B |
Example: | uniprotkb:MUTH_ECOLI|entrezgene/locuslink:mutH |
Notes
See notes for column 5.
Column number: 7
Column name: | Method |
Column type: | a|b: pipe-delimited set of strings |
Description: | Interaction detection methods |
Example: | MI:0000(2 hybrid|affinity chrom|adenylate cyclase|two hybrid) |
Notes
This is a non-redundant list of the interaction detection method short labels found in interaction records. Path for PSI-MI 2.5:
entrySet/entry/experimentList/experimentDescription/interactionDetectionMethod/names/shortLabel/
When available, the PSI-MI controlled vocabulary term for the method will be provided such as MI:0399(2h fragment pooling). Otherwise, MI:0000 will appear before the list of pipe-delimited shortLabels.
NA or -1 may appear in place of a recognised shortLabel.
For example:
MI:0000(-1)
MI:0000(NA)
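Values of the form MI:nnnn(label), possibly with several pipe-delimited short labels inside one pair of parentheses, can be unpacked with a regular expression. The Python sketch below is one way to do it; the function name parse_mi_terms is hypothetical, and the pattern assumes that short labels never contain parentheses themselves.

```python
import re

# Assumption: labels inside the parentheses do not contain "(" or ")".
MI_TERM = re.compile(r"(MI:\d{4})\(([^()]*)\)")


def parse_mi_terms(field):
    """Return a list of (MI identifier, short label) tuples found in a column value."""
    terms = []
    for match in MI_TERM.finditer(field):
        identifier = match.group(1)
        # Several short labels may share one identifier, separated by pipes.
        for label in match.group(2).split("|"):
            terms.append((identifier, label))
    return terms
```

The same pattern also fits the values shown for columns 12, 13, 18 and 19.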
Column number: 8
Column name: | author |
Column type: | a|b: pipe-delimited set of strings |
Description: | |
Example: | hall-1999-1|hall-1999-2|mansour-2001-1|mansour-2001-2|hall-1999 |
Notes
According to MITAB2.5 format, this column should contain a pipe-delimited list of the first-author surnames of the publications in which the interaction has been shown.
Note that this column value may contain duplicate identifiers.
Column number: 9
Column name: | pmids |
Column type: | a|b: pipe-delimited set of strings |
Description: | PubMed Identifiers |
Example: | pubmed:9880500|pubmed:11585365 |
Notes
This is a non-redundant list of PubMed identifiers pointing to literature that supports the interaction. According to MITAB2.5 format, this column should contain a pipe delimited set of databaseName:identifier pairs such as pubmed:12345. The source database name is always pubmed.
The special value - may appear in place of the identifiers.
Column number: 10
Column name: | taxa |
Column type: | string |
Description: | Taxonomy identifier for interactor A |
Example: | taxid:83333 |
Notes
The NCBI taxonomy identifier listed here is that of the sequence record for the interactor and may be different than what is listed in the interaction record. See the methods section for more details. See the NCBI taxonomy database at http://www.ncbi.nlm.nih.gov/sites/entrez?db=taxonomy . According to MITAB2.5 format, this column should contain a pipe delimited set of databaseName:identifier pairs such as taxid:12345. The source database name has been listed as taxid since it is always NCBI's taxonomy database. The value in this column will be NA if the interactor is a complex.
Column number: 11
Column name: | taxb |
Column type: | string |
Description: | Taxonomy identifier for interactor B |
Example: | taxid:83333 |
Notes
See notes for column 10.
Column number: 12
Column name: | interactionType |
Column type: | a|b: pipe-delimited set of strings |
Description: | Interaction Type from controlled vocabulary or short label |
Example: | MI:0218(physical interaction) |
Notes
When available in the interaction record, the interaction type is taken from the PSI-MI controlled vocabulary and represented as...
database:identifier(interaction type)
Otherwise, the short label is taken from the following path for PSI-MI 2.5:
entrySet/entry/interactionList/interaction/interactionType/names/shortLabel
If the MI controlled vocabulary identifier is unavailable then MI:0000 is listed.
NA may be listed here if the interaction type is not available (meaning that we could not find the interaction type in the record provided by the source database).
Column number: 13
Column name: | sourcedb |
Column type: | a|b: pipe-delimited set of strings |
Description: | Source databases containing this interaction |
Example: | MI:0469(intact)|MI:0471(mint) |
Notes
Taken from the PSI-MI controlled vocabulary and represented as...
database:identifier(sourceName)
Column number: 14
Column name: | interactionIdentifiers |
Column type: | a|b: pipe-delimited set of strings |
Description: | source interaction database and accession |
Example: | intact:EBI-761694|intact:EBI-762624|mint:MINT-15283 |
Notes
Each reference is presented as a databaseName:identifier pair.
The identifier given for irefindex is the RIGID. The RIGID (for redundant interaction group identifier) consists of the two node identifiers (see columns 1 and 2) concatenated and then digested with the SHA-1 algorithm. Before concatenation, the two node identifiers are ordered by ASCII-based lexicographical sorting in ascending order (see the notes for column 20). See the iRefIndex paper for details. The RIGID points to a set of redundant protein-protein interactions that involve the same set of proteins with the exact same primary sequences. The source databaseNames that appear in this column are taken from the PSI-MI controlled vocabulary at http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=MI where possible.
Interaction record identifiers are not available for mppi and ophid, so these entries will appear as mppi:found and ophid:found.
Sometimes, FOUND will appear as in:
intact:EBI-861910|intact:FOUND:1
Column number: 15
Column name: | confidence |
Column type: | a|b: pipe-delimited set of strings |
Description: | Confidence scores |
Example: | lpr:0|hpr:12 |
Notes
Each reference is presented as a scoreName:score pair. Three confidence scores are provided: lpr, hpr and np.
PubMed Identifiers (PMIDs) point to literature references that support an interaction. A PMID may be used to support more than one interaction.
The lpr score (lowest pmid re-use) is the lowest number of distinct interactions (RIGIDs: see column 14) that any one PMID (supporting the interaction in this row) is used to support. A value of one indicates that at least one of the PMIDs supporting this interaction has never been used to support any other interaction. This likely indicates that only one interaction was described by that reference and that the present interaction is not derived from high throughput methods.
The hpr score (highest pmid re-use) is the highest number of interactions (RIGIDs: see column 14) that any one PMID (supporting the interaction in this row) is used to support. A high value (e.g. greater than 50) indicates that one PMID describes at least 50 other interactions and it is more likely that high-throughput methods were used.
The np score (number pmids) is the total number of unique PMIDs used to support the interaction described in this row.
The special value - may appear in the score field, indicating the absence of a score value.
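The scores can be read into a dictionary and used as a rough filter, as in the Python sketch below. The function names are hypothetical. Treating score values as integers is an assumption based on the example above, and the threshold of 50 simply reuses the illustrative figure from the hpr description; it is not a recommended cut-off.

```python
def parse_confidence(field):
    """Turn 'lpr:0|hpr:12' into {'lpr': 0, 'hpr': 12}; a score of '-' becomes None."""
    scores = {}
    for entry in field.split("|"):
        name, separator, value = entry.partition(":")
        if not separator:
            # Entry without a scoreName:score structure, e.g. a lone "-".
            continue
        # Assumption: score values are integers.
        scores[name] = None if value == "-" else int(value)
    return scores


def looks_high_throughput(scores, threshold=50):
    """True if some supporting PMID is re-used for more than `threshold` interactions."""
    hpr = scores.get("hpr")
    return hpr is not None and hpr > threshold
```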
COLUMNS PAST THIS POINT ARE NOT DEFINED BY THE PSI-MITAB2.5 STANDARD. THESE COLUMNS MAY CHANGE FROM ONE RELEASE TO ANOTHER
Column number: 16
Column name: | entrezGeneA |
Column type: | pipe delimited list of integers or a string |
Description: | EntrezGene identifier (possibly a pipe-delimited list) for interactor A |
Example: | 947299 |
Notes
If an EntrezGene identifier is not found for the interactor, then a ROGID will appear in this column (see notes to column 1).
If the interactor is a node representing a complex, then the ROGID for the complex will appear here.
Column number: 17
Column name: | entrezGeneB |
Column type: | pipe delimited list of integers or a string |
Description: | EntrezGene identifier (possibly a pipe-delimited list) for interactor B |
Example: | 948691 |
Notes
See notes for column 16.
Column number: 18
Column name: | atype |
Column type: | string |
Description: | Is interactor A a protein or a complex? |
Example: | MI:0326(protein) |
Notes
This will always be one of...
MI:0326(protein)
MI:0315(protein complex)
Column number: 19
Column name: | btype |
Column type: | string |
Description: | Is interactor B a protein or a complex? |
Example: | MI:0326(protein) |
Notes
See column 18.
Column number: 20
Column name: | rigid |
Column type: | string |
Description: | Redundant interaction group identifier |
Example: | 3ERiFkUFsm7ZUHIRJTx8ZlHILRA |
Notes
The RIGID (for redundant interaction group identifier) consists of the ROG identifiers for each of the protein participants (see notes above) ordered by ASCII-based lexicographic sorting in ascending order, concatenated and then digested with the SHA-1 algorithm. See the iRefIndex paper for details. This identifier points to a set of redundant protein-protein interactions that involve the same set of proteins with the exact same primary sequences.
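The calculation described above can be sketched in Python as follows. The 27-character example suggests that the SHA-1 digest is base64-encoded with the trailing "=" removed, as for the SEGUID, but that encoding is an assumption here, as is the use of ROGIDs without the irefindex: prefix and without any separator between them. The function name rigid_from_rogids is hypothetical; see the iRefIndex paper for the authoritative procedure.

```python
import base64
import hashlib


def rigid_from_rogids(rogids):
    """Sort the participant ROGIDs, concatenate them and digest the result with SHA-1."""
    # sorted() on ASCII-only strings orders them by ASCII value, ascending.
    joined = "".join(sorted(rogids))
    digest = hashlib.sha1(joined.encode("ascii")).digest()
    # Assumption: base64 encoding with the "=" padding stripped (27 characters).
    return base64.b64encode(digest).decode("ascii").rstrip("=")
```

Because the ROGIDs are sorted first, the result does not depend on the order in which the participants are supplied.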
Column number: 21
Column name: | edgetype |
Column type: | Character |
Description: | Does the edge represent a binary interaction (X), complex membership (C), or a multimer (Y)? |
Example: | X |
Notes
Edges can be labelled as either X, C or Y:
- X
- a binary interaction with two protein participants
- C
- denotes that a protein is part of some complex of proteins. One of the nodes is a protein. The other node represents a complex of proteins (see columns 18-19). The edge represents the idea that the protein is a member of the complex.
- Y
- for dimers and polymers. In case of dimers and polymers when the number of subunits is not described in the original interaction record, the edge is labelled by a Y. Interactor A (column 1) will be identical to the Interactor B (column 2). The graphical representation of this will appear as a single node connected to itself (loop). The actual number of self-interacting subunits may be 2 (dimer) or more (say 5 for a pentamer). Refer to the original interaction record for more details and see column 22.
Column 1, Column 21 and Column 2 are all concatenated into a single space-delimited identifier to create the interaction identifier when this file is imported into Cytoscape; for example:
qE03bpSsQiJouJbn6ISt8DnD/pA9606 (X) E4tMCoGbaqAfjU8QXbQzbAb8fGQ9606
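A small Python sketch for building the same kind of identifier from a parsed row is shown below. The function name cytoscape_edge_id is hypothetical. Removing the irefindex: prefix from columns 1 and 2 is an assumption made so that the output matches the form of the example above.

```python
def cytoscape_edge_id(row):
    """Build 'uidA (edgetype) uidB' from columns 1, 21 and 2 of a parsed row."""
    # Assumption: the database prefix (e.g. "irefindex:") is dropped.
    uid_a = row[0].split(":", 1)[-1]
    uid_b = row[1].split(":", 1)[-1]
    edge_type = row[20]
    return f"{uid_a} ({edge_type}) {uid_b}"
```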
Column number: 22
Column name: | numParticipants |
Column type: | Integer |
Description: | Number of participants in the interaction |
Example: | 2 |
Notes
- For edges labelled X (see column 21) this value will be two.
- For edges labelled C, this value will be equivalent to the number of protein subunits included in the complex (represented by the complex node).
- For interactions labelled Y, this value will either be the number of self-interacting subunits (if present in the original interaction record) or 1 where the exact number of subunits is unknown or unspecified.
Column number: 23
Column name: | ROGA |
Column type: | Integer |
Description: | Integer representation of uidA |
Example: | 617653 |
Column number: 24
Column name: | ROGB |
Column type: | Integer |
Description: | Integer representation of uidB |
Example: | 4052696 |