Sources iRefIndex 9.0/Issues

Revision as of 14:38, 7 November 2011 by PaulBoddie (talk | contribs) (Added issues as a separate page.)

This page documents the issues for iRefIndex 9.0 as experienced during the build process.

General Issues

Yeast taxon id changes

See http://www.uniprot.org/news/2011/05/03/release

Correction of taxids was already introduced to iRefIndex before the 9.0 build, but additional measures were required.

Internal link: Bugzilla:247

New databases

InnateDB, MatrixDB and MPIDB were added as new source databases.

BioGRID interaction record ids (pre-build issue)

Capture BioGRID interaction record ids so that iRefWeb can link out to BioGRID.

The only interaction ids available from the BioGRID files are already being used and are also present in iRefWeb, such as...

<primaryRef db="grid" id="103" refType="identity" refTypeAc="MI:0356" dbAc="MI:0463" />
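As a minimal sketch of how such an id could be captured, the following Python reads the id attribute from a primaryRef element like the one above. The function name biogrid_interaction_id is hypothetical and not part of the iRefIndex sources; the check on db and refType is an assumption based only on the example shown.

```python
import xml.etree.ElementTree as ET


def biogrid_interaction_id(primary_ref_xml):
    """Return the id of a BioGRID identity reference, or None.

    Hypothetical helper: it only inspects attributes, so it does not
    depend on the XML namespace of the surrounding PSI-MI document.
    """
    element = ET.fromstring(primary_ref_xml)
    # Assumption: BioGRID records use db="grid" with refType="identity",
    # as in the example element on this page.
    if element.get("db") == "grid" and element.get("refType") == "identity":
        return element.get("id")
    return None


example = ('<primaryRef db="grid" id="103" refType="identity" '
           'refTypeAc="MI:0356" dbAc="MI:0463" />')
print(biogrid_interaction_id(example))
```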

Internal link: Bugzilla:250

RIGID recalculation (pre-build issue)

Modify existing RIGID table or lose continuity of iRIGIDs with last release.

A program has been added to the iRefIndex sources which produces a mapping from correct to "legacy" RIGIDs. The decision was made to assign new iRIGIDs to the new RIGIDs.

Internal link: bug #242

Taxon specific MITAB files (post-processing issue)

Taxon-specific files should contain an interaction ONLY if one or both of its interactor taxa (taxa, taxb) match the appropriate taxon, regardless of what the source database said the interaction taxon was. Change README. For example, see PMID...

http://wodaklab.org/iRefWeb/pubReport/detail?pubmed=12565857+

A "mouse" interaction from HPRD lists only human interactors: the paper is about mouse, and the interaction has been transferred to human without this being noted. As a result, this human interaction ends up in the mouse MITAB file (because HPRD says it was mouse). BioGRID correctly curates the paper as being about mouse.

Internal link: Bugzilla:248
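
The rule described in this section could be sketched in Python as follows. This is an illustration only, not the iRefIndex post-processing code: the function names are hypothetical, and the column positions assume that taxa and taxb are the 10th and 11th tab-separated columns, as in PSI-MI TAB 2.5, with values such as taxid:9606(Homo sapiens). Lines starting with # are assumed to be header lines and are passed through.

```python
import re

TAXID_RE = re.compile(r"taxid:(-?\d+)")

# Assumption: 0-based positions of the taxa and taxb columns (PSI-MI TAB 2.5).
TAXA_COL = 9
TAXB_COL = 10


def taxids(field):
    """Return the set of taxonomy ids found in one MITAB taxon field."""
    return set(TAXID_RE.findall(field))


def belongs_to_taxon(line, taxon):
    """True if at least one interactor of the MITAB line has the taxon."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) <= TAXB_COL:
        return False
    return (taxon in taxids(columns[TAXA_COL])
            or taxon in taxids(columns[TAXB_COL]))


def filter_by_taxon(lines, taxon):
    """Yield header lines and the interaction lines involving the taxon."""
    for line in lines:
        if line.startswith("#") or belongs_to_taxon(line, taxon):
            yield line
```

Because only the interactor columns are consulted, the taxon that the source database assigned to the interaction itself plays no part in the decision.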

CORUM methods (code change implemented)

Ensure that all CORUM methods (with MI terms) are parsed.

This was partially fixed in the parser, but additional measures were required to prevent PubMed identifiers from being written over the method information.

Internal link: Bugzilla:249

Repeated lines (post-processing issue)

Some lines are repeated many times. These appear to arise from the BIND 3DBP division (see, for example, lines 5, 13, 117 and 125 in the Ecoli MITAB file, and others arising from BIND ID 92720, which has 44 pieces of experimental evidence and 5 PMIDs) because the accessions for the different experimental forms are not present in MITAB. See Antonio and bug #245. This could be handled as a post-processing step on MITAB that takes the unique set of all MITAB lines.
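
A minimal Python sketch of such a post-processing step is shown here. It is an assumption about how the step might look, not the implemented solution, and the function name unique_lines is hypothetical. It keeps the first occurrence of each line and preserves the original order.

```python
def unique_lines(lines):
    """Yield each distinct line once, in first-occurrence order."""
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line


# Example: repeated MITAB lines collapse to a single copy.
sample = ["a\tb\n", "c\td\n", "a\tb\n", "a\tb\n"]
print(list(unique_lines(sample)))
```

This holds every distinct line in memory, which may matter for the largest MITAB files; storing a hash of each line instead would reduce memory use at the cost of a very small collision risk.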

MITAB/iRefScape canonicalization (post-processing issue)

Change this to choose the canonical sequence rather than the longest sequence (mapping score L). For example, with GeneID 84148 and 512564, the current method unnecessarily separates BioGRID interaction data from interaction data from other databases.

Decided not to change the L method. Instead:

Resolve by distributing non-canonicalized data as before AND a canonicalized MITAB file with complete provenance information. The canonicalized file will become the main MITAB file we release and will support PSICQUIC services; the non-canonicalized version will be dropped in future releases. Also, canonicalize iRefScape data and include provenance data for interactors in the edge attribute viewer.

Required review of current MITAB file format by Ian.

Build issues

Two BIND Translation files use non-ASCII byte values that are not part of valid UTF-8 byte sequences, but do not declare an encoding explicitly:

  • taxid10090_PSIMI25.xml
  • taxid9606_PSIMI25.xml
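
Files with this problem can be located with a check along the following lines. This is a sketch only: the function names are hypothetical, and it merely reports where decoding as UTF-8 first fails. It does not guess which encoding the files actually use.

```python
def first_invalid_utf8_offset(data):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


def check_file(path):
    """Read a file as bytes and report its first invalid UTF-8 offset."""
    with open(path, "rb") as handle:
        return first_invalid_utf8_offset(handle.read())
```

For example, check_file("taxid9606_PSIMI25.xml") would return None for a file that is valid UTF-8 and a byte offset otherwise.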

MPIDB

The MPIDB data files are non-standard in various respects and require some special measures to structure the data for iRefIndex use. See iRefIndex MITAB Mapping for details of the way iRefIndex should retain MITAB-originating data.

InnateDB

InnateDB also includes data from other sources. The download page has a link for curated InnateDB data, and we should find out whether this is a collection of all data or only data curated by InnateDB. Paul has made a parser for the PSI XML, and this data will be from 2011-03-06. They say, however, that they update the MITAB version every week.

MatrixDB

Interactors include non-proteins and protein fragments, not only proteins. This database must be tested before homogenizing.