Difference between revisions of "Sources and Issues Next Release"
Latest revision as of 11:47, 26 February 2014
Note: This is a planning template for the next release. It does not correspond to a released product. See http://irefindex.org/ for the most recent release and related documentation. This page can be used to create the sources page. Check for xxx before copying and pasting to the appropriate sources page for the new release. Do not edit xxx in this page. Leave this page as a template. After making a new release page, update the general Sources for iRefIndex redirect page.
Last edited: 2014-02-26
Applies to iRefIndex release: xxx
Release date: xxx
Authors: Ian Donaldson
Database: iRefIndex (http://irefindex.org)
Organization: http://irefindex.org
Description: This file lists interaction and protein sequence related resources used for the current build of the iRefIndex. Statistics for the iRefIndex are available and include a breakdown of interactors and interactions from each data source.
- For statistics on the full public dataset, please refer to: http://irefindex.uio.no/wiki/Statistics_iRefIndex_xxx
- For statistics on the public dataset (as distributed on the FTP site), please refer to: http://irefindex.uio.no/wiki/Statistics_iRefIndex_free_xxx
Issues
Deprecated taxids appear in export for iRefWeb
See list from Yuri. Examples are 273, 510, 515, 591, 592, 601, 602, 677, 887, 1139, 1156, 1312...133899, 137208, 144556, 150147, 160268, 163106, 163653, 196590, 216593.
These appear in the 'interactor' and 'interaction_interactor_assignment' tables, but are not in the new taxonomy tables 'taxonomy_scientific' and 'names'. A random selection of these does not appear in the MITAB files, so the fault likely lies in the export script for iRefWeb.
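A check along the following lines could help confirm which taxids are affected. This is a minimal sketch, assuming the taxids used by the interactor tables and the taxids in the current taxonomy tables have already been loaded into Python sets; the function name and the sample inputs are hypothetical.

```python
def find_deprecated_taxids(used_taxids, current_taxids):
    """Return, in sorted order, the taxids that are used but absent from the current taxonomy."""
    return sorted(set(used_taxids) - set(current_taxids))


# Hypothetical inputs: 273 and 510 are taken from the list above; 9606 is human.
used = {273, 510, 9606}
current = {9606}
print(find_deprecated_taxids(used, current))
```

Running the same comparison against the taxids found in the MITAB files would show whether the problem is limited to the export.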
BioGRID interaction record ids (pre-build issue)
Capture BioGRID interaction record ids so iRefWeb can link out to BioGRID.
The only interaction ids available in the BioGRID files are already being used and are also present in iRefWeb, such as...
<primaryRef db="grid" id="103" refType="identity" refTypeAc="MI:0356" dbAc="MI:0463" />
See Bugzilla:250.
MITAB/iRefScape canonicalization
Change this to choose the canonical sequence rather than the longest sequence (mapping score L). For example, for GeneID 84148 and 512564 the current choice unnecessarily separates BioGRID interaction data from interaction data from other databases.
See Bugzilla:255.
PDB identifiers
In previous releases, we replaced the pipe character (|) in PDB identifiers with an underscore character (_). In this release, this is only done when there are multiple database:accession entries in a field; otherwise the pipe character (|) is kept as part of the PDB identifier. This is a regression and will be corrected in a future release.
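The intended replacement could be sketched as follows. This assumes that a PDB identifier has the form pdb:<4-character id>|<1-character chain>; that form, the function name and the sample values are assumptions made for illustration and are not taken from the build code.

```python
import re

# Assumption: a PDB identifier looks like pdb:1ABC|A (4-character id, 1-character chain).
# The lookahead ensures the chain is followed by a field separator or the end of the field.
_PDB_WITH_CHAIN = re.compile(r"(pdb:[0-9A-Za-z]{4})\|([0-9A-Za-z])(?=\||$)")


def normalize_pdb_identifiers(field):
    """Replace the pipe inside each PDB identifier with an underscore."""
    return _PDB_WITH_CHAIN.sub(r"\1_\2", field)


print(normalize_pdb_identifiers("pdb:1ABC|A"))
print(normalize_pdb_identifiers("pdb:1ABC|A|uniprotkb:P12345"))
```

Applying the same rule whether or not the field holds multiple entries would remove the inconsistency described above.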
IMEX identifiers
IMEx identifiers should be present in column 52 but appear to be missing. This is a regression and will be corrected in a future release. There are 6004 lines in release 10 with imex:... This number needs to be cross-checked before the next release. This is still an issue as of release 13.
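The cross-check could be done with a small script like the one below. It is a minimal sketch, assuming tab-separated MITAB lines and header lines that start with '#'; the function name and the sample rows (including the identifier imex:IM-1) are hypothetical.

```python
def count_imex_lines(lines, column=52):
    """Count MITAB lines whose given column (1-based) contains an imex: identifier."""
    count = 0
    for line in lines:
        if line.startswith("#"):
            continue  # assumption: header lines start with '#'
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= column and "imex:" in fields[column - 1]:
            count += 1
    return count


# Hypothetical rows with 54 columns each; only the first has an IMEx id in column 52.
with_imex = ["-"] * 54
with_imex[51] = "imex:IM-1"
without_imex = ["-"] * 54
rows = ["\t".join(with_imex), "\t".join(without_imex)]
print(count_imex_lines(rows))
```

Comparing this count between releases would show whether the identifiers have been restored.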
Compatibility with Java PSI parser needs to be improved
This concerns the Java parser from psimi: https://code.google.com/p/psimi/downloads/detail?name=psimitab-1.8.3-distribution.zip. There are at least a few examples where the files do not follow the specification:
- Reserved characters are not quoted. For instance, in the file for human:
taxid:11706(HIV-1 M:B_HXB2R) taxid:10299(Herpes simplex virus (type 1 / strain 17)) go:GO:0005783|rigid:d//bz+DaMrbuxGA3i1Xe4hqlrXI|edgetype:X
To conform to the standard, a controlled vocabulary term should look like this:
psi-mi:"MI:0496"(bait)
- Empty columns need to be consistently filled with '-'. For example, column 15 in the human file.
- Dates should be represented as yyyy/mm/dd but currently appear as yyyy-mm-dd.
Thanks to Thomas Schmitt for pointing out these problems.
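The three formatting points above can be sketched as small helper functions. This is an illustration under stated assumptions, not the export code: the function names are hypothetical, and the quoting helper always wraps the identifier in double quotes rather than testing for reserved characters first.

```python
import re

_ISO_DATE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})$")


def format_cv_term(database, identifier, name):
    """Format a controlled vocabulary term with the identifier in double quotes."""
    return f'{database}:"{identifier}"({name})'


def fill_empty(field):
    """Return '-' for an empty or whitespace-only column, else the field unchanged."""
    return field if field.strip() else "-"


def normalize_date(field):
    """Convert yyyy-mm-dd to yyyy/mm/dd; leave anything else unchanged."""
    match = _ISO_DATE.match(field)
    return "/".join(match.groups()) if match else field


print(format_cv_term("psi-mi", "MI:0496", "bait"))
print(fill_empty(""))
print(normalize_date("2011-10-04"))
```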
Various issues reported by Andrei Turinsky
There are a few remaining issues, as follows:
6630 interactors have obsolete Entrez Gene IDs shown (of which 3449 in Human).
2696 RefSeq IDs (of which 935 are Human) have no Entrez ID shown in the MITAB, but such ID is actually known from NCBI maps - not sure whether your canonicalization process may recover these IDs. Only one such case remains for UniProt IDs (uniprotkb:P0CE96 a.k.a. YL156_YEAST actually has known genes 850851, 850856, 850858 but none of these are shown) - so the rest of UniProts have been resolved, which is great.
A minor thing: 6 interactors are shown with two different taxons due to different strains of either E. coli or yeast.
A minor thing: 5 interactors are sometimes shown with no taxon at all (4 human and 1 from A. fulgidus).
A noticeable percentage of PubMed IDs have been lost for some of the source DBs, with InnateDB and DIP having lost hundreds of publications. Human annotations are especially affected (e.g. Human DIP lost 266 PubMed IDs, or 16%; Human InnateDB lost 347, or 13%).
There are 17 obsolete PSI-MI IDs that appear in the MITAB files, of which 15 are detection methods and 2 interaction types (MI:0191 "aggregation" and MI:0218 "physical interaction"). Their listing is attached. Of these, MI:0229 is actually still valid (it's an alt_id) but should be replaced with MI:0809 -- see the last line in the attached list.
Also, the detection method id "MI:0044" is not valid -- could be a typo? (it's not listed in the attached file). In column 13, for mpi-imex and mpi-lit, should the code "MI:0000" be changed to MI:0903? In column 14, CORUM records are referenced by their publication ID, not by their complex ID.
Outdated Entrez Gene identifier
Some users have reported that retired Entrez Gene identifiers have appeared in release 10 that were correctly updated in release 9.
Examples
release 9:
uniprotkb:P21675|refseq:NP_620278|entrezgene/locuslink:6872|rogid:P0LoULOvon+Wp2G17lBlqn3Fo4E9606|irogid:3476704
release 10:
entrezgene/locuslink:100287968|entrezgene/locuslink:100291704|entrezgene/locuslink:1863|rogid:P0LoULOvon+Wp2G17lBlqn3Fo4E9606|irogid:3476704
Release 9 used the correct id for TAF1, 6872. In release 10, three outdated Entrez Gene ids are used instead, each of which states in its record: "This record was replaced with Gene ID: 6872".
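One way to catch this before export is to pass every gene id through a table of replacements. The sketch below assumes such a table is available as a Python dict (it might, for example, be built from NCBI's gene history data, but that is an assumption); the function name is hypothetical, and the mapping simply restates the TAF1 example above.

```python
def update_gene_ids(gene_ids, replaced_by):
    """Map retired gene ids to their current ids, dropping duplicates and keeping order."""
    updated = []
    for gene_id in gene_ids:
        current = replaced_by.get(gene_id, gene_id)
        if current not in updated:
            updated.append(current)
    return updated


# Mapping restating the example above: all three retired ids were replaced by 6872.
replaced_by = {"100287968": "6872", "100291704": "6872", "1863": "6872"}
print(update_gene_ids(["100287968", "100291704", "1863"], replaced_by))
```

Ids that have no entry in the table are passed through unchanged.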
Build issues
Source | Format | Location | Version (date) |
BIND | Tab-delimited text file. | ftp://ftp.bind.ca/pub/BIND/data/bindflatfiles/bindindex/ (no longer available - see below). Files: 20050525.complex2refs.txt, 20050525.ints.txt, 20050525.refs.txt, 20050525.complexes.txt, 20050525.labels.txt, 20050525.complex2subunits.txt. These files are no longer available via FTP but are available from the authors. BIND archival content is now managed by Thomson Scientific. See http://bond.unleashedinformatics.com/ and http://bond.unleashedinformatics.com/downloads/data/BIND/. For historical purposes, a snapshot of the Blueprint web site may be viewed via the Internet Archive at http://web.archive.org/web/20050204013426/www.blueprint.org/index.html | 2005-05-25 |
BIND Translation | PSI-MI 2.5 | http://download.baderlab.org/BINDTranslation/release1_0/BINDTranslation_v1_xml_AllSpecies.tar.gz | Version 1.0 (2010-12-15) |
BioGRID | PSI-MI 2.5 | http://thebiogrid.org/downloads/archives/Release%20Archive/BIOGRID-3.1.81/BIOGRID-ALL-3.1.81.psi25.zip | Version 3.1.81 (2011-10-01) |
CORUM | PSI-MI 2.5 | http://mips.gsf.de/genre/proj/corum/index.html http://mips.gsf.de/genre/export/sites/default/corum/allComplexes.psimi.zip | 2009-12-02 |
DIP | PSI-MI 2.5 | http://dip.doe-mbi.ucla.edu/dip/Download.cgi | 2010-10-10 |
HPRD | PSI-MI 2.5 | http://www.hprd.org/download HPRD_PSIMI_041310.tar.gz | Release 9 (2010-04-13) |
IntAct | PSI-MI 2.5 | ftp://ftp.ebi.ac.uk/pub/databases/intact/2011-09-29/psi25/pmidMIF25.zip | 2011-09-29 |
MINT | PSI-MI 2.5 | ftp://mint.bio.uniroma2.it/pub/release/psi/current/psi25/pmid/ | 2010-12-21 |
MPACT | PSI-MI 2.5 | ftp://ftpmips.gsf.de/yeast/PPI/mpact-complete.psi25.xml.gz | 2008-01-10 |
MPPI | PSI-MI 1.0 | http://mips.gsf.de/proj/ppi/data/mppi.gz | 2004-06-01 (from archive) |
OPHID | PSI-MI 1.0 | http://ophid.utoronto.ca/ophid/downloads.html (This service no longer available, please refer to http://ophid.utoronto.ca/ophidv2.201/) | 2006-07-07 |
New for this release | |||
InnateDB | PSI-MI 2.5 | http://www.innatedb.com/download.jsp Curated InnateDB Data | 2011-03-06 |
MPIDB | MITAB format file | http://www.jcvi.org/mpidb (information) http://www.jcvi.org/mpidb/download.php (general downloads) | Downloaded on 2011-10-03 |
MatrixDB | PSI-MI 2.5 | http://matrixdb.ibcp.fr/ MatrixDB_20100826.xml.zip | 2010-08-26 (timestamp) |
Source | Format | Location | Version (date) |
SEGUID | Tab-delimited text | ftp://bioinformatics.anl.gov/seguid/ seguidannotation | 2007-07-24 (timestamp) |
UniProt | Text | http://www.uniprot.org/downloads UniProtKB/Swiss-Prot (uniprot_sprot.dat.gz) | UniProt Knowledgebase Release 2011_09 (2011-09-21) (Downloaded on 2011-10-04): UniProtKB/Swiss-Prot UniProtKB/TrEMBL (from ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/reldate.txt) |
UniProt | Text | http://www.uniprot.org/downloads UniProtKB/TrEMBL (uniprot_trembl.dat.gz) | |
UniProt, IsoForms | FASTA | http://www.uniprot.org/downloads uniprot_sprot_varsplic.fasta.gz | |
UniProt, SGD | Tab-delimited text file. | http://www.expasy.org/cgi-bin/lists?yeast.txt Yeast (Saccharomyces cerevisiae): entries, gene names and cross-references to SGD | |
UniProt, FLY | Tab-delimited text file. | http://www.expasy.org/cgi-bin/lists?fly.txt Drosophila: entries, gene names and cross-references to FlyBase. | |
NCBI, RefSeq | GenPept | ftp://ftp.ncbi.nih.gov/refseq/release/complete see *.protein.gpff.gz files | Release 49 (2011-09-09) (Downloaded on 2011-10-04) (from http://www.ncbi.nlm.nih.gov/refseq/) |
NCBI, MMDB/PDB | Tab-delimited text | ftp://ftp.ncbi.nih.gov/mmdb/pdbeast/table | (Downloaded on 2011-10-04) |
NCBI, PDB sequences | FASTA | ftp://ftp.ncbi.nih.gov/blast/db/FASTA/pdbaa.gz | (Downloaded on 2011-10-03) |
NCBI Gene2Refseq | Tab-delimited text | ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/ gene2refseq.gz | (Downloaded on 2011-10-04) |
All iRefIndex Pages
Follow this link for a listing of all iRefIndex related pages (archived and current).