Difference between revisions of "iRefIndex MITAB2.6 Parser"

From irefindex
(Attempted to make a page corresponding to the revised MITAB format.)
 
m (Fixed the snapshot label.)
 
(7 intermediate revisions by the same user not shown)
Line 1: Line 1:
A tool has been developed to parse the MITAB files produced in the [[iRefIndex Build Process]]. Currently, the tool is capable of parsing the MITAB format described on the page [[README_iRefIndex_MITAB2.6_7.0]].
+
A tool has been developed to parse the MITAB files produced in the [[iRefIndex Build Process]]. Currently, the tool is capable of parsing the MITAB format described on the page [[README MITAB2.6 for iRefIndex]].
  
 
== Obtaining the MITAB Parser ==
 
== Obtaining the MITAB Parser ==
  
The parser and associated resources can be obtained from this location:
+
The parser and associated resources are available for download here:
  
https://hfaistos.uio.no/cgi-bin/viewvc.cgi/mitab/
+
* Snapshot 2011-11-29: [http://irefindex.uio.no/hg/mitab/archive/8b936902f616.tar.bz2 tar.bz2 archive], [http://irefindex.uio.no/hg/mitab/archive/8b936902f616.tar.gz tar.gz archive], [http://irefindex.uio.no/hg/mitab/archive/8b936902f616.zip zip archive]
 
+
* [http://irefindex.uio.no/hg/mitab/ mitab repository home]
Using CVS with the appropriate <tt>CVSROOT</tt> setting, run the following command:
 
 
 
  <pre>cvs co mitab</pre>
 
 
 
The <tt>CVSROOT</tt> environment variable should be set to the following for this to work:
 
 
 
  <pre>export CVSROOT=:ext:<username>@hfaistos.uio.no:/mn/hfaistos/storage/cvsroot</pre>
 
 
 
(The <tt><username></tt> should be replaced with your actual username.)
 
  
 
== Prerequisites ==
 
== Prerequisites ==
Line 21: Line 12:
 
The following programs are required to use the parser:
 
The following programs are required to use the parser:
  
* [http://www.python.org/ Python] (tested with 2.3.5)
+
* [http://www.python.org/ Python] (tested with 2.5.4)
* [http://www.postgresql.org/ PostgreSQL] (tested with 8.1.9)
+
* [http://www.postgresql.org/ PostgreSQL] (tested with 8.1.17, 9.0.4)
  
 
== Running the Parser ==
 
== Running the Parser ==
Line 54: Line 45:
 
The database is populated as follows:
 
The database is populated as follows:
  
<pre>python import_mitab.py mitab_irefindex</pre>
+
<pre>python database_action.py mitab_irefindex import_mitab.sql</pre>
  
 
As a result, a number of tables representing the structure of the data should be available in the database. For applications built to use this data, indexes may need creating in order to make querying more efficient.
 
As a result, a number of tables representing the structure of the data should be available in the database. For applications built to use this data, indexes may need creating in order to make querying more efficient.
Line 69: Line 60:
 
! Source columns (if different or converted)
 
! Source columns (if different or converted)
 
|-
 
|-
| rowspan="5" | Interaction
+
| rowspan="6" | Interaction
 
| mitab_interactions
 
| mitab_interactions
 
| Model each interaction referencing interactors
 
| Model each interaction referencing interactors
| rigid, ''intType'', edgetype, numParticipants, ''crigid''
+
| rigid, edgetype, numParticipants, crigid
 
|
 
|
 
|-
 
|-
Line 94: Line 85:
 
| rigid, ''type'', ''confidence''
 
| rigid, ''type'', ''confidence''
 
| confidence
 
| confidence
 +
|-
 +
| mitab_interaction_rigs
 +
| Represent alternative integer identifiers for each interaction
 +
| ''uid'', ''rig''
 +
| rigid, irigid
 +
|-
 +
| Canonical interaction
 +
| mitab_canonical_interaction_rigs
 +
| Represent alternative integer identifiers for each canonical interaction
 +
| ''uid'', ''rig''
 +
| crigid, icrigid
 
|-
 
|-
 
| rowspan="3" | Experiment
 
| rowspan="3" | Experiment
Line 114: Line 116:
 
| mitab_interactions
 
| mitab_interactions
 
| Model each interaction referencing interactors
 
| Model each interaction referencing interactors
| uidA, uidB, ''intType'', taxA, taxB, atype, btype
+
| uidA, uidB, taxA, taxB, atype, btype, crogidA, crogidB
 
|
 
|
 
|-
 
|-
 
| mitab_aliases
 
| mitab_aliases
 
| Represent aliases for each interactor
 
| Represent aliases for each interactor
| ''uid'', ''intType'', ''dbname'', ''alias''
+
| ''uid'', ''dbname'', ''alias''
 
| uidA or uidB, aliasA or aliasB
 
| uidA or uidB, aliasA or aliasB
 
|-
 
|-
 
| mitab_alternatives
 
| mitab_alternatives
 
| Represent alternative identifiers for each interactor
 
| Represent alternative identifiers for each interactor
| ''uid'', ''intType'', ''dbname'', ''alt''
+
| ''uid'', ''dbname'', ''alt''
 
| uidA or uidB, altA or altB
 
| uidA or uidB, altA or altB
 
|-
 
|-
 
| mitab_interactor_rogs
 
| mitab_interactor_rogs
 
| Represent alternative integer identifiers for each interactor
 
| Represent alternative integer identifiers for each interactor
| ''uid'', ''intType'', ''rog''
+
| ''uid'', ''rog''
 
| uidA or uidB, irogA or irogB
 
| uidA or uidB, irogA or irogB
 +
|-
 +
| Canonical interactor
 +
| mitab_canonical_interactor_rogs
 +
| Represent alternative integer identifiers for each canonical interactor
 +
| ''uid'', ''rog''
 +
| crogidA or crogidB, icrogA or icrogB
 
|}
 
|}
  
Line 136: Line 144:
  
 
* Prefixed values are generally split to expose the prefix and identifier, name or value following it.
 
* Prefixed values are generally split to expose the prefix and identifier, name or value following it.
** The various interaction and interactor prefixes (such as <tt>irefindex:</tt>, <tt>rigid:</tt>, <tt>rogid:</tt>, <tt>crigid:</tt> and <tt>crogid:</tt>) are omitted from interaction and interactor columns.
+
** The various interaction and interactor prefixes (such as <tt>irefindex:</tt>, <tt>rigid:</tt>, <tt>rogid:</tt>, <tt>crigid:</tt> and <tt>crogid:</tt>) are omitted from interaction and interactor columns. '''Note''' that for non-iRefIndex data, any prefixes other than these will be retained, although this approach may be revised in future.
 
** Source identifiers are split with the prefix (such as <tt>intact:</tt>) used to make a dbname column with the actual identifier stored in its own column (such as alias or alt).
 
** Source identifiers are split with the prefix (such as <tt>intact:</tt>) used to make a dbname column with the actual identifier stored in its own column (such as alias or alt).
 
* The "empty value" (<tt>-</tt>) should never appear as an identifier, and where such a value is used in a list, that element should be excluded. This is pertinent in the case of vocabulary terms where <tt>MI:0000</tt> might be used together with an empty list of identifiers or names as an "empty collection" indicator.
 
* The "empty value" (<tt>-</tt>) should never appear as an identifier, and where such a value is used in a list, that element should be excluded. This is pertinent in the case of vocabulary terms where <tt>MI:0000</tt> might be used together with an empty list of identifiers or names as an "empty collection" indicator.
 
* Duplicate values in lists are generally discarded.
 
* Duplicate values in lists are generally discarded.
  
Further work may include the introduction of a separate interactor table, collecting related information for each interactor.
+
Further work may include the introduction of a separate interactor table, collecting related information for each interactor. Support for interactor identifiers other than ROG identifiers may be improved, with a new column potentially being introduced to indicate the type of each identifier.
  
 
=== Canonical interactors and interactions ===
 
=== Canonical interactors and interactions ===
  
An ''intType'' column has been introduced into some tables in order to indicate whether an interactor or interaction involves canonical information.
+
The <tt>mitab_interactions</tt> table incorporates the canonical interaction and interactors alongside the specific interaction and interactors. A separate <tt>mitab_canonical_interactor_rogs</tt> table is used to map canonical interactors to integer identifiers, just as <tt>mitab_interactor_rogs</tt> does so for specific interactors.
 
 
* Where an interactor is a specific, observed interactor, ''intType'' will be set to <tt>S</tt>
 
* Where an interactor is a canonical group, ''intType'' will be set to <tt>C</tt>
 
* Where an interaction involves only specific, observed interactors, ''intType'' will be set to <tt>S</tt>, and the ''crigid'' column will refer to the rigid column of the associated canonical interaction
 
* Where an interaction involves canonical groups, ''intType'' will be set to <tt>C</tt>
 
 
 
Thus, the mitab_interactions table effectively has two levels:
 
 
 
* A "parent" level describing interactions between canonical groups, grouping together records in...
 
* A "child" level describing interactions between specific, observed interactors, each referencing a parent record
 
  
 
== All iRefIndex Pages ==
 
== All iRefIndex Pages ==

Latest revision as of 16:52, 29 November 2011

A tool has been developed to parse the MITAB files produced in the iRefIndex Build Process. Currently, the tool is capable of parsing the MITAB format described on the page README MITAB2.6 for iRefIndex.

Obtaining the MITAB Parser

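The parser and associated resources are available for download here:

  • Snapshot 2011-11-29: tar.bz2 archive (http://irefindex.uio.no/hg/mitab/archive/8b936902f616.tar.bz2), tar.gz archive (http://irefindex.uio.no/hg/mitab/archive/8b936902f616.tar.gz), zip archive (http://irefindex.uio.no/hg/mitab/archive/8b936902f616.zip)
  • mitab repository home (http://irefindex.uio.no/hg/mitab/)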

Prerequisites

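The following programs are required to use the parser:

  • Python (http://www.python.org/), tested with 2.5.4
  • PostgreSQL (http://www.postgresql.org/), tested with 8.1.17 and 9.0.4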

Running the Parser

Given a directory for the iRefIndex output files such as...

/home/irefindex/output

...run the parser as follows:

python parse_mitab.py /home/irefindex/output/All.mitab.03042009.txt

Change the date in the above filename to match the name of the corresponding file in your own output directory.
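
Since the date in the filename varies between builds, the file can be located by pattern rather than typed by hand. The following is a minimal sketch, assuming the files follow the All.mitab.<date>.txt naming shown above. The function name find_mitab_files is hypothetical and not part of the parser distribution, and the sketch is written for a current Python 3 rather than the Python 2.5 the parser was tested with.

```python
import glob
import os


def find_mitab_files(output_dir):
    """Return files in output_dir named like All.mitab.<date>.txt, sorted by name."""
    # Assumption: the build writes files using this naming pattern.
    pattern = os.path.join(output_dir, "All.mitab.*.txt")
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    # Print the parser command line for each candidate file found.
    for path in find_mitab_files("/home/irefindex/output"):
        print("python parse_mitab.py " + path)
```

If more than one file is listed, pick the one belonging to the build you intend to load.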

Creating the Database

A database can be created using the usual PostgreSQL tools:

createdb -E unicode mitab_irefindex

This database is initialised as follows:

psql -f init_mitab.sql mitab_irefindex

Should the database tables need to be dropped (perhaps in case of problems with the import), the following command can be used:

psql -f drop_mitab.sql mitab_irefindex
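
The commands above can be collected in one place so that they are always issued in the same order. This is only a sketch: the function database_commands is hypothetical, it builds the command lines shown on this page as argument lists without running them, and it assumes the SQL files are in the current directory.

```python
def database_commands(database="mitab_irefindex", drop=False):
    """Build the command lines from this page as argument lists (nothing is executed)."""
    if drop:
        # Dropping the tables, for example after a failed import.
        return [["psql", "-f", "drop_mitab.sql", database]]
    return [
        ["createdb", "-E", "unicode", database],
        ["psql", "-f", "init_mitab.sql", database],
    ]


for argv in database_commands():
    print(" ".join(argv))
```

Each argument list could be passed to subprocess.check_call to run it, provided the PostgreSQL tools are on the PATH.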

Populating the Database

The database is populated as follows:

python database_action.py mitab_irefindex import_mitab.sql

As a result, a number of tables representing the structure of the data should be available in the database. Applications built to use this data may need to create indexes in order to make querying more efficient.
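
Which indexes are useful depends on the queries an application makes. As an illustration only, the sketch below generates CREATE INDEX statements for a few table and column pairs taken from the schema notes on this page. The choice of columns, the index naming scheme and the function name index_statements are assumptions, not part of the parser distribution.

```python
# Hypothetical selection of columns to index; adjust to the queries actually used.
INDEX_TARGETS = [
    ("mitab_interactions", "rigid"),
    ("mitab_aliases", "uid"),
    ("mitab_alternatives", "uid"),
    ("mitab_interactor_rogs", "uid"),
]


def index_statements(targets):
    """Return one CREATE INDEX statement per (table, column) pair."""
    statements = []
    for table, column in targets:
        # Assumed naming scheme: <table>_<column>_idx
        name = "%s_%s_idx" % (table, column)
        statements.append("CREATE INDEX %s ON %s (%s);" % (name, table, column))
    return statements


for statement in index_statements(INDEX_TARGETS):
    print(statement)
```

The printed statements can be saved to a file and run with psql -f against the populated database.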

Notes on the Populated Database

The schema used by the populated database attempts to model the data as effectively as possible using a number of tables:

{|
! Entity type
! Tables
! Table purpose
! Notable columns
! Source columns (if different or converted)
|-
| rowspan="6" | Interaction
| mitab_interactions
| Model each interaction referencing interactors
| rigid, edgetype, numParticipants, crigid
|
|-
| mitab_sources
| Represent sources for each interaction
| rigid, sourcedb, name
| sourcedb
|-
| mitab_interaction_type_names
| Represent interaction types for each interaction
| rigid, code, name
| interactionType
|-
| mitab_interaction_identifiers
| Represent interaction identifiers for each interaction
| rigid, dbname, uid
| interactionIdentifiers
|-
| mitab_confidence
| Represent confidence scores for each interaction
| rigid, type, confidence
| confidence
|-
| mitab_interaction_rigs
| Represent alternative integer identifiers for each interaction
| uid, rig
| rigid, irigid
|-
| Canonical interaction
| mitab_canonical_interaction_rigs
| Represent alternative integer identifiers for each canonical interaction
| uid, rig
| crigid, icrigid
|-
| rowspan="3" | Experiment
| mitab_method_names
| Represent detection methods for each interaction
| rigid, code, name
| method
|-
| mitab_authors
| Represent publication authors for each interaction
| rigid, author
| author
|-
| mitab_pubmed
| Represent publication identifiers for each interaction
| rigid, pmid
| pmids
|-
| rowspan="4" | Interactor
| mitab_interactions
| Model each interaction referencing interactors
| uidA, uidB, taxA, taxB, atype, btype, crogidA, crogidB
|
|-
| mitab_aliases
| Represent aliases for each interactor
| uid, dbname, alias
| uidA or uidB, aliasA or aliasB
|-
| mitab_alternatives
| Represent alternative identifiers for each interactor
| uid, dbname, alt
| uidA or uidB, altA or altB
|-
| mitab_interactor_rogs
| Represent alternative integer identifiers for each interactor
| uid, rog
| uidA or uidB, irogA or irogB
|-
| Canonical interactor
| mitab_canonical_interactor_rogs
| Represent alternative integer identifiers for each canonical interactor
| uid, rog
| crogidA or crogidB, icrogA or icrogB
|}

Some changes in representation occur when creating the database:

  • Prefixed values are generally split to expose the prefix and identifier, name or value following it.
    • The various interaction and interactor prefixes (such as irefindex:, rigid:, rogid:, crigid: and crogid:) are omitted from interaction and interactor columns. Note that for non-iRefIndex data, any prefixes other than these will be retained, although this approach may be revised in future.
    • Source identifiers are split with the prefix (such as intact:) used to make a dbname column with the actual identifier stored in its own column (such as alias or alt).
  • The "empty value" (-) should never appear as an identifier, and where such a value is used in a list, that element should be excluded. This is pertinent in the case of vocabulary terms where MI:0000 might be used together with an empty list of identifiers or names as an "empty collection" indicator.
  • Duplicate values in lists are generally discarded.
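
The rules above can be illustrated with a short sketch. It is not the parser's own code: the function names are hypothetical, the sample values are made up, and the use of "|" as the list separator is assumed from the usual MITAB convention. It is written for a current Python 3.

```python
# Prefixes omitted from interaction and interactor columns, as listed above.
OMITTED_PREFIXES = ("irefindex", "rigid", "rogid", "crigid", "crogid")
EMPTY = "-"


def split_prefixed(value):
    """Split 'prefix:rest' at the first colon; return (None, value) if there is no colon."""
    prefix, sep, rest = value.partition(":")
    if not sep:
        return None, value
    return prefix, rest


def strip_known_prefix(value):
    """Drop an iRefIndex-style prefix; any other prefix is retained."""
    prefix, rest = split_prefixed(value)
    if prefix in OMITTED_PREFIXES:
        return rest
    return value


def clean_list(field, separator="|"):
    """Split a list field, excluding the empty value and duplicates, keeping first-seen order."""
    seen = set()
    result = []
    for item in field.split(separator):
        if item == EMPTY or item in seen:
            continue
        seen.add(item)
        result.append(item)
    return result


# Source identifier: the prefix becomes dbname, the rest goes in its own column.
print(split_prefixed("intact:EBI-0000"))
# List handling: "-" and the repeated element are dropped.
print(clean_list("intact:EBI-0000|-|intact:EBI-0000|mint:M-1"))
```

Keeping the first occurrence of each value preserves the order in which identifiers appear in the source file, which is a design choice of this sketch rather than documented parser behaviour.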

Further work may include the introduction of a separate interactor table, collecting related information for each interactor. Support for interactor identifiers other than ROG identifiers may be improved, with a new column potentially being introduced to indicate the type of each identifier.

Canonical interactors and interactions

The mitab_interactions table incorporates the canonical interaction and interactors alongside the specific interaction and interactors. A separate mitab_canonical_interactor_rogs table maps canonical interactors to integer identifiers, just as mitab_interactor_rogs does for specific interactors.
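
To show how these tables relate, the sketch below looks up the integer identifiers of both canonical interactors for a given interaction. It is an assumption-laden illustration: it uses an in-memory SQLite database as a stand-in for PostgreSQL, the column types and sample rows are invented, only a subset of the mitab_interactions columns is created, and the function canonical_rogs is hypothetical.

```python
import sqlite3

# In-memory SQLite stand-in for the PostgreSQL database; types and rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mitab_interactions (
    rigid TEXT, uidA TEXT, uidB TEXT, crigid TEXT, crogidA TEXT, crogidB TEXT
);
CREATE TABLE mitab_canonical_interactor_rogs (uid TEXT, rog INTEGER);
INSERT INTO mitab_interactions VALUES ('rig1', 'rogA', 'rogB', 'crig1', 'crogA', 'crogB');
INSERT INTO mitab_canonical_interactor_rogs VALUES ('crogA', 101);
INSERT INTO mitab_canonical_interactor_rogs VALUES ('crogB', 102);
""")


def canonical_rogs(conn, rigid):
    """Return (rog for crogidA, rog for crogidB) for one interaction, or None if absent."""
    return conn.execute(
        """
        SELECT ca.rog, cb.rog
        FROM mitab_interactions AS i
        JOIN mitab_canonical_interactor_rogs AS ca ON ca.uid = i.crogidA
        JOIN mitab_canonical_interactor_rogs AS cb ON cb.uid = i.crogidB
        WHERE i.rigid = ?
        """,
        (rigid,),
    ).fetchone()


print(canonical_rogs(conn, "rig1"))
```

The same join should carry over to PostgreSQL with a driver-specific parameter placeholder in place of the question mark.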

All iRefIndex Pages

Follow this link for a listing of all iRefIndex related pages (archived and current).