iRefIndex MITAB2.5 Parser

From irefindex

Latest revision as of 14:51, 19 September 2011

Note

This page describes previous versions of the MITAB parser which could only process MITAB2.5 format files.

  • See iRefIndex MITAB2.6 Parser for information about the updated parser which can handle both MITAB2.5 and MITAB2.6 format files.

A tool has been developed to parse the MITAB files produced in the iRefIndex Build Process. Currently, the tool is capable of parsing the MITAB format described on the page README_iRefIndex_MITAB_7.0.

Obtaining the MITAB Parser

The parser and associated resources can be obtained from this location:

https://hfaistos.uio.no/cgi-bin/viewvc.cgi/mitab/

Using CVS with the appropriate CVSROOT setting, run the following command:

cvs co mitab

The CVSROOT environment variable should be set to the following for this to work:

export CVSROOT=:ext:<username>@hfaistos.uio.no:/mn/hfaistos/storage/cvsroot

(The <username> should be replaced with your actual username.)

Prerequisites

The following programs are required to use the parser:

  • Python (tested with 2.3.5)
  • PostgreSQL (tested with 8.1.9)

Running the Parser

Given a directory for the iRefIndex output files such as...

/home/irefindex/output

...run the parser as follows:

python parse_mitab.py /home/irefindex/output/All.mitab.03042009.txt

Change the date in the above filename to match the name of the corresponding file in your own output directory.
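
Because the dated filename differs between builds, it can help to list the candidate files before running the parser. The following is a minimal sketch, not part of the parser distribution: the helper name find_mitab_files is hypothetical, and the All.mitab.*.txt pattern is only inferred from the example filename above. It is written for a current Python rather than the 2.3.5 release the parser was tested with.

```python
import glob
import os


def find_mitab_files(output_dir):
    """Return paths matching All.mitab.<date>.txt in output_dir, sorted by name.

    The filename pattern is an assumption based on the example above.
    """
    pattern = os.path.join(output_dir, "All.mitab.*.txt")
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    # Print each candidate so that one can be passed to parse_mitab.py.
    for path in find_mitab_files("/home/irefindex/output"):
        print(path)
```

The result is sorted by name only. Since the date in the example appears as eight digits without separators, name order is not necessarily chronological order, so the right file should still be chosen by hand.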

Creating the Database

A database can be created using the usual PostgreSQL tools:

createdb -E unicode mitab_irefindex

This database is initialised as follows:

psql -f init_mitab.sql mitab_irefindex

Should the database tables need to be dropped (perhaps in case of problems with the import), the following command can be used:

psql -f drop_mitab.sql mitab_irefindex

Populating the Database

The database is populated as follows:

python import_mitab.py mitab_irefindex

As a result, a number of tables representing the structure of the data should be available in the database. Applications built to use this data may need to create indexes on these tables to make querying more efficient.
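
As an illustration of the indexing step, the sketch below builds plain PostgreSQL CREATE INDEX statements as strings. The helper create_index_sql and the index naming scheme are hypothetical. The table name mitab_interactions comes from this page, but the column name rigid is an assumption: check the actual column names in init_mitab.sql before using any generated statement.

```python
def create_index_sql(table, column):
    """Build a CREATE INDEX statement for one column of one table.

    The "<table>_<column>_idx" naming scheme is an arbitrary choice.
    """
    index_name = "%s_%s_idx" % (table, column)
    return "CREATE INDEX %s ON %s (%s);" % (index_name, table, column)


if __name__ == "__main__":
    # "rigid" is an assumed column name, used here for illustration only.
    print(create_index_sql("mitab_interactions", "rigid"))
```

The generated statements could be saved to a file and run with psql -f in the same way as init_mitab.sql.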

Notes on the Populated Database

The schema used by the populated database attempts to model the data as effectively as possible using a number of tables:

Entity type  Tables                         Table purpose                                           Notable source columns
-----------  -----------------------------  ------------------------------------------------------  ----------------------
Interaction  mitab_interactions             Model each interaction referencing interactors          rigid, edgetype, numParticipants
Interaction  mitab_sources                  Represent sources for each interaction                  sourcedb
Interaction  mitab_interaction_type_names   Represent interaction types for each interaction        interactionType
Interaction  mitab_interaction_identifiers  Represent interaction identifiers for each interaction  interactionIdentifiers
Interaction  mitab_confidence               Represent confidence scores for each interaction        confidence
Experiment   mitab_method_names             Represent detection methods for each interaction        method
Experiment   mitab_authors                  Represent publication authors for each interaction      author
Experiment   mitab_pubmed                   Represent publication identifiers for each interaction  pmids
Interactor   mitab_interactions             Model each interaction referencing interactors          uidA, uidB, taxA, taxB, entrezgeneA, entrezgeneB, atype, btype, ROGA, ROGB
Interactor   mitab_aliases                  Represent aliases for each interactor                   alias
Interactor   mitab_alternatives             Represent alternative identifiers for each interactor   altA, altB

Some changes in representation occur when creating the database:

  • Prefixed values are generally split to expose the prefix and identifier, name or value following it.
    • The irefindex: prefix is omitted from rigid columns and uid-related columns in mitab_aliases and mitab_alternatives.
    • Source identifiers are split with the prefix (such as intact:) used to make a dbname column with the actual identifier stored in its own column (such as alias or alt).
  • Information in the Entrez Gene-related columns is excluded if it is not a genuine gene identifier.
  • The "empty value" (-) should never appear as an identifier, and where such a value is used in a list, that element should be excluded. This is pertinent in the case of vocabulary terms where MI:0000 might be used together with an empty list of identifiers or names as an "empty collection" indicator.
  • Duplicate values in lists are generally discarded.
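
The list handling described above can be sketched as follows. This is an illustrative reconstruction rather than the code in import_mitab.py: the function names split_prefixed and parse_list are hypothetical, and the sketch assumes that list elements are separated by "|" as in MITAB, and that the prefix ends at the first colon. It uses str.partition, so it needs a newer Python than the 2.3.5 release the parser was tested with.

```python
EMPTY = "-"  # the MITAB "empty value"


def split_prefixed(value):
    """Split 'prefix:identifier' at the first colon into (prefix, identifier).

    Values without a colon are returned with a prefix of None.
    """
    prefix, sep, identifier = value.partition(":")
    if not sep:
        return None, value
    return prefix, identifier


def parse_list(field):
    """Split a '|'-separated field into (prefix, identifier) pairs.

    Empty elements, the '-' empty value and duplicates are dropped,
    and the order of first appearance is kept.
    """
    seen = set()
    result = []
    for item in field.split("|"):
        if not item or item == EMPTY or item in seen:
            continue
        seen.add(item)
        result.append(split_prefixed(item))
    return result
```

For example, parse_list("intact:EBI-1|intact:EBI-1|-") yields a single pair, ("intact", "EBI-1"), whose first element corresponds to the dbname column described above. Dropping the irefindex: prefix and filtering the Entrez Gene columns are not covered by this sketch.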

Further work may include the introduction of a separate interactor table, collecting taxonomy and gene information for each interactor.
