iRefIndex Build Process
 
<pre>/home/irefindex/data</pre>
  
Some data sources need special links to be obtained from their administrators via e-mail, and in general there is a distinction between free and proprietary data sources, described as follows:

; Free
: BIND, BioGRID, Gene2Refseq (NCBI), InnateDB, IntAct, MatrixDB, MINT, MMDB/PDB, MPIDB, MPPI, OPHID, RefSeq, UniProt
; Proprietary
: BIND Translation, CORUM, DIP, HPRD, MPact

''I2D, which was considered for iRefIndex 7.0, is currently under review for inclusion in future releases. The status of BIND Translation is currently under review for possible inclusion in the free dataset in future releases.''

The <tt>FTPtransfer</tt> program will download data from the following sources:

* Gene2Refseq
* IntAct
* MINT
* MMDB
* PDB
* RefSeq
* UniProt
 
  
 
== Manual Downloads ==

More information can be found at the following location: [[Sources_iRefIndex]]
 
For each manual download, a subdirectory hierarchy must be created in the main
data directory, named after the source and the download date.

For example, for BIND this directory might be created as follows:
<pre>mkdir -p /home/irefindex/data/BIND/2010-02-08/</pre>
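Since every manually downloaded source follows the same <tt>source/date</tt> layout, the directories can also be created in one pass. The following sketch assumes the source names and the 2010-02-08 date used in the examples in this document; a real run would pass <tt>/home/irefindex/data</tt> as the root.

```shell
# Create the dated subdirectory for each manually downloaded source.
# Usage: create_dirs ROOT DATE
create_dirs() {
    root=$1
    date=$2
    for source in BIND BIND_Translation BioGRID CORUM DIP HPRD I2D \
                  InnateDB MatrixDB MIPS MPACT MPIDB OPHID SEGUID; do
        mkdir -p "$root/$source/$date/"
    done
}

# Example against a scratch root; a real build would pass /home/irefindex/data.
create_dirs "${TMPDIR:-/tmp}/irefindex-data-example" 2010-02-08
ls "${TMPDIR:-/tmp}/irefindex-data-example"
```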
  
 
=== BIND ===

In the main downloaded data directory, create a subdirectory hierarchy as
noted above. For example:

<pre>mkdir -p /home/irefindex/data/BIND/2010-02-08/</pre>

Copy the BIND files, including the following, into the newly created data directory:

<pre>20060525.refs.txt</pre>
=== BIND Translation ===

{{Note|
This source should eventually be incorporated into the automated download functionality.
}}

The location of BIND Translation downloads is as follows:

http://download.baderlab.org/BINDTranslation/

The location of the specific file to be downloaded is the following:

http://download.baderlab.org/BINDTranslation/release1_0/BINDTranslation_v1_xml_AllSpecies.tar.gz

(Note that the specific file varies from release to release - see the sources page for a particular release for more details.)

In the main downloaded data directory, create a subdirectory hierarchy as
noted above. For example:

<pre>mkdir -p /home/irefindex/data/BIND_Translation/2010-02-08/</pre>

Download the file into the newly created data directory and unpack it as follows:

<pre>
cd /home/irefindex/data/BIND_Translation/2010-02-08/
tar zxf BINDTranslation_v1_xml_AllSpecies.tar.gz
</pre>

=== BioGRID ===

The location of BioGRID downloads is as follows:

http://www.thebiogrid.org/downloads.php
 
In the main downloaded data directory, create a subdirectory hierarchy as noted above. For example:

<pre>mkdir -p /home/irefindex/data/BioGRID/2010-02-08/</pre>

Select the <tt>BIOGRID-ALL-X.Y.Z.psi25.zip</tt> file (where <tt>X.Y.Z</tt> should be replaced by the actual release number) and download/copy it to the newly created data directory for BioGRID.

In the data directory for BioGRID, uncompress the downloaded file. For example:

<pre>
cd /home/irefindex/data/BioGRID/2010-02-08/
unzip BIOGRID-ALL-2.0.62.psi25.zip</pre>

=== CORUM ===
 
In the main downloaded data directory, create a subdirectory hierarchy as noted above. For example:

<pre>mkdir -p /home/irefindex/data/CORUM/2010-02-08/</pre>

Copy/download the file referenced above and uncompress it in the data directory for CORUM. For example:

<pre>
cd /home/irefindex/data/CORUM/2010-02-08/
unzip allComplexes.psimi.zip</pre>

==== Important Note ====

The CORUM data needs adjusting to work with the <tt>StaxPSIXML</tt> software. See the [[#Running StaxPSIXML]] section for details.

=== DIP ===
 
In the main downloaded data directory, create a subdirectory hierarchy as
noted above. For example:

<pre>mkdir -p /home/irefindex/data/DIP/2010-02-08/</pre>

Select the <tt>FULL - complete DIP data set</tt> from the <tt>Files</tt> page:

Uncompress the downloaded file in the data directory for DIP. For example:

<pre>
cd /home/irefindex/data/DIP/2010-02-08/
gunzip dip20080708.mif25.gz</pre>

=== HPRD ===
 
In the main downloaded data directory, create a subdirectory hierarchy as
noted above. For example:

<pre>mkdir -p /home/irefindex/data/HPRD/2010-02-08/</pre>

Download the PSI-MI single file (<tt>HPRD_SINGLE_PSIMI_<date>.xml.tar.gz</tt>) to the
newly created data directory and unpack it. For example:

<pre>
cd /home/irefindex/data/HPRD/2010-02-08/
tar zxf HPRD_SINGLE_PSIMI_090107.xml.tar.gz</pre>
=== I2D ===

'''For iRefIndex 7.0, I2D was supposed to replace OPHID, but problems with the source files have excluded I2D from that release.'''

http://ophid.utoronto.ca/ophidv2.201/downloads.jsp

In the main downloaded data directory, create a subdirectory hierarchy as
noted above. For example:

<pre>mkdir -p /home/irefindex/data/I2D/2010-02-08/</pre>

For the <tt>Download Format</tt> in the download request form, specify <tt>PSI-MI 2.5 XML</tt>. Unfortunately, each <tt>Target Organism</tt> must be specified in turn when submitting the form: there is no <tt>ALL</tt> option.

Uncompress each downloaded file. For example:

<pre>
cd /home/irefindex/data/I2D/2010-02-08/
unzip i2d.HUMAN.psi25.zip</pre>
=== InnateDB ===

{{Note|
This source should eventually be incorporated into the automated download functionality.
}}

Select the "Curated InnateDB Data" download from the InnateDB downloads page:

http://www.innatedb.com/download.jsp

In the main downloaded data directory, create a subdirectory hierarchy as
noted above. For example:

<pre>mkdir -p /home/irefindex/data/InnateDB/2010-02-08/</pre>

Uncompress the downloaded file. For example:

<pre>
cd /home/irefindex/data/InnateDB/2010-02-08/
gunzip innatedb_20100716.xml.gz</pre>
=== MatrixDB ===

{{Note|
This source should eventually be incorporated into the automated download functionality.
}}

In the main downloaded data directory, create a subdirectory hierarchy as
noted above. For example:

<pre>mkdir -p /home/irefindex/data/MatrixDB/2011-06-11/</pre>

The data is available from the following site:

http://matrixdb.ibcp.fr/

Selecting the "Download MatrixDB data" link leads to the following page:

http://matrixdb.ibcp.fr/cgi-bin/download

Here, selecting the "PSI-MI XML 2.5" download under "PSI-MI XML or TAB 2.5 MatrixDB literature curation interactions" will result in a file being downloaded; this file should be placed in the newly created directory.

Uncompress the data as follows:

<pre>
cd /home/irefindex/data/MatrixDB/2011-06-11/
unzip MatrixDB_20100826.xml.zip
</pre>
 
=== MIPS ===

{{Note|
This source should eventually be incorporated into the automated download functionality.
}}

In the main downloaded data directory, create a subdirectory hierarchy as
noted above for <tt>MIPS</tt> and <tt>MPACT</tt>. For example:

<pre>
mkdir -p /home/irefindex/data/MIPS/2010-02-08/
mkdir -p /home/irefindex/data/MPACT/2010-02-08/
</pre>

For MPPI, download the <tt>mppi.gz</tt> file from the MIPS site.

For MPact, download the following file:

ftp://ftpmips.gsf.de/yeast/PPI/mpact-complete.psi25.xml.gz

Uncompress the downloaded files in their respective directories. For example:

<pre>
cd /home/irefindex/data/MPACT/2010-02-08/
gunzip mpact-complete.psi25.xml.gz
cd /home/irefindex/data/MIPS/2010-02-08/
gunzip mppi.gz</pre>
=== MPIDB ===

{{Note|
This source should eventually be incorporated into the automated download functionality.
}}

In the main downloaded data directory, create a subdirectory hierarchy as
noted above. For example:

<pre>mkdir -p /home/irefindex/data/MPIDB/2011-06-11/</pre>

For MPI-LIT, download the following resource:

http://www.jcvi.org/mpidb/interaction.php?dbsource=MPI-LIT

For MPI-IMEX, download the following resource:

http://www.jcvi.org/mpidb/interaction.php?dbsource=MPI-IMEX

The downloads should be placed in the MPIDB data directory with the names <tt>MPI-LIT.txt</tt> and <tt>MPI-IMEX.txt</tt>, perhaps using the following example download commands:

<pre>
cd /home/irefindex/data/MPIDB/2011-06-11/
wget -O MPI-LIT.txt 'http://www.jcvi.org/mpidb/interaction.php?dbsource=MPI-LIT'
wget -O MPI-IMEX.txt 'http://www.jcvi.org/mpidb/interaction.php?dbsource=MPI-IMEX'
</pre>
=== OPHID ===

'''From iRefIndex 8.0, I2D replaces OPHID.'''

OPHID is no longer available, so you have to use the local copy of the data:

<pre>/biotek/dias/donaldson3/Sabry/iRefIndex_Backup/BckUp15SEP2008/OPHID/2008MAR16</pre>

In the main downloaded data directory, create a subdirectory hierarchy as
noted above. For example:

<pre>mkdir -p /home/irefindex/data/OPHID/2010-02-08/</pre>

Copy the file <tt>ophid1153236640123.xml</tt> to the newly created data directory.

=== SEGUID ===

Downloading of the SEGUID dataset is described [[#Manual_loading_of_data|below]].

== Build Dependencies ==

To build the software, Apache Ant needs to be available. This software could be retrieved from the Apache site...

http://ant.apache.org/bindownload.cgi

...or from a mirror such as one of the following:

http://mirrorservice.nomedia.no/apache.org//ant/binaries/apache-ant-1.8.2-bin.tar.gz

http://mirrors.powertech.no/www.apache.org/dist//ant/binaries/apache-ant-1.8.2-bin.tar.gz

This software can be extracted as follows:

<pre>tar zxf apache-ant-1.8.2-bin.tar.gz</pre>

This will produce a directory called <tt>apache-ant-1.8.2</tt> containing a directory called <tt>bin</tt>. The outer directory should be recorded in the <tt>ANT_HOME</tt> environment variable, whereas the <tt>bin</tt> directory should be incorporated into the <tt>PATH</tt> environment variable on your system. For example, for <tt>bash</tt>:

<pre>
export ANT_HOME=/home/irefindex/apps/apache-ant-1.8.2
export PATH=${PATH}:${ANT_HOME}/bin
</pre>

It should now be possible to run the <tt>ant</tt> program.
  
 
== Building FTPtransfer ==

<li>Get the program's source code from this location:

  <p>https://hfaistos.uio.no/cgi-bin/viewvc.cgi/bioscape/bioscape/modules/interaction/Sabry/FTPtransfer/</p>

Using CVS with the appropriate <tt>CVSROOT</tt> setting, run the corresponding checkout command.
 
<li>Obtain the program's dependencies. This program uses the Apache commons-net package, and this must be available during compilation. This library could be retrieved from the Apache site...

  <p>http://commons.apache.org/downloads/download_net.cgi</p>

...or from a mirror such as one of the following:

  <p>http://mirrorservice.nomedia.no/apache.org/commons/net/binaries/commons-net-1.4.1.tar.gz</p>
  <p>http://www.powertech.no/apache/dist/commons/net/binaries/commons-net-1.4.1.tar.gz</p>
  </li>

<li>Extract the dependencies:
  
 
  <pre>
mkdir lib
cp commons-net-1.4.1/commons-net-1.4.1.jar lib/</pre>

Alternatively, the external libraries can also be found in the following location:

  <pre>/biotek/dias/donaldson3/iRefIndex/External_libraries</pre></li>

<li>Compile the source code. In order to build the software on a computer which does not have the NetBeans IDE installed, copy the generic build file into the <tt>FTPtransfer</tt> directory:
 
To run the program, invoke the <tt>.jar</tt> file as follows:

<pre>java -Xms256m -Xmx256m -jar build/jar/FTPtransfer.jar log datadir</pre>

The specified <tt>log</tt> argument can be replaced with a suitable location for the program's execution log, whereas the <tt>datadir</tt> argument should be replaced with a suitable location for downloaded data (such as <tt>/home/irefindex/data</tt>).

== Building SHA ==
 
<li>Get the program's source code from this location:

  <p>https://hfaistos.uio.no/cgi-bin/viewvc.cgi/bioscape/bioscape/modules/interaction/Sabry/SHA/</p>

Using CVS with the appropriate <tt>CVSROOT</tt> setting, run the corresponding checkout command.

The <tt>SHA.jar</tt> file will be created in the <tt>dist</tt> directory.
</ol>
== Building SaxValidator ==

The <tt>SaxValidator.jar</tt> file needs to be obtained or built.

<ol>
<li>Get the program's source code from this location:

  <p>https://hfaistos.uio.no/cgi-bin/viewvc.cgi/Parser/SaxValidator/</p>

Using CVS with the appropriate <tt>CVSROOT</tt> setting, run the following command:

  <pre>cvs co Parser/SaxValidator</pre>

The <tt>CVSROOT</tt> environment variable should be set to the following for this to work:

  <pre>export CVSROOT=:ext:<username>@hfaistos.uio.no:/mn/hfaistos/storage/cvsroot</pre>

(The <tt><username></tt> should be replaced with your actual username.)</li>

<li>Compile and create the <tt>.jar</tt> file as follows:

  <pre>ant jar</pre></li>
</ol>

== Running SaxValidator ==

The program used for validation and integrity checks is called <tt>SaxValidator</tt>; when the name was chosen, it was merely a SAX-based validator, but more functionality has since been included:

# Validate XML files against a schema:
#* http://psidev.sourceforge.net/mi/rel25/src/MIF253.xsd
#* http://psidev.sourceforge.net/mi/rel25/src/MIF254.xsd
#* http://psidev.sourceforge.net/mi/xml/src/MIF.xsd
#* http://dip.doe-mbi.ucla.edu/psimi/MIF254.xsd
# XML parser-independent counting of elements (counting the number of </interaction> and </interactor> tags in each file). This gives an indication of what to expect at the end of the parsing.
# Count the number of lines in BIND text files.
# Remove files containing negative interactions.

Run the program as follows:

<pre>java -jar -Xms256m -Xmx256m dist/SaxValidator.jar <data root directory> <date extension> <validate true/false> <count elements true/false></pre>

For example:

<pre>java -jar -Xms256m -Xmx256m dist/SaxValidator.jar /home/irefindex/data /2010-02-08/ true true</pre>

Be sure to include the leading and trailing <tt>/</tt> characters around the date information.
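The parser-independent element counting described above amounts to scanning each file as plain text for closing tags. A minimal Python sketch of the idea (an illustration only, not the SaxValidator code itself):

```python
import os
import tempfile

def count_closing_tags(path, tags=("interaction", "interactor")):
    """Count closing tags as plain text, independently of any XML parser."""
    with open(path, encoding="iso-8859-1") as f:
        text = f.read()
    return {tag: text.count("</%s>" % tag) for tag in tags}

# Small demonstration with a fabricated file.
xml = ("<entrySet><interaction></interaction>"
       "<interactor></interactor><interactor></interactor></entrySet>")
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as f:
    f.write(xml)
print(count_closing_tags(f.name))  # -> {'interaction': 1, 'interactor': 2}
os.unlink(f.name)
```

Because only closing tags are counted, self-closing elements are ignored, so the result is an indication of the element totals rather than an exact parse.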
=== Handling Invalid Files ===

For each data source, invalid files will be moved to a subdirectory of that source's data directory. These subdirectories can be found by using the following Unix command:

<pre>find /home/irefindex/data -name inValid</pre>
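To get a per-source overview of how many files failed validation, the <tt>find</tt> command can be combined with a count. A sketch, assuming the directory layout used in this document:

```shell
# Print each inValid directory under the given root with its file count.
report_invalid() {
    find "$1" -type d -name inValid | while read -r dir; do
        printf '%s: %s\n' "$dir" "$(find "$dir" -type f | wc -l)"
    done
}

# Example against a scratch tree; a real run would use /home/irefindex/data.
root="${TMPDIR:-/tmp}/irefindex-invalid-example"
mkdir -p "$root/HPRD/2010-02-08/inValid"
touch "$root/HPRD/2010-02-08/inValid/HPRD_SINGLE_PSIMI_090107.xml"
report_invalid "$root"
```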
=== Known Issues ===

* Some BIND Translation files do not have an appropriate encoding declaration
* The BioGRID file may generate entity-related errors
* The DIP file omits <tt>id</tt> attributes on <tt>experimentDescription</tt> elements
* The HPRD file omits required elements from <tt>experimentDescription</tt> elements
* MIPS MPACT/MPPI files may be flagged as invalid, but can still be parsed using workarounds in the parsing process

To fix the BIND Translation errors, prepend the following declaration to each incorrect file:

<pre><?xml version="1.0" encoding="iso-8859-1"?></pre>

For example, after saving the above in <tt>declaration.txt</tt>:

<pre>cat declaration.txt /home/irefindex/data/BIND_Translation/2010-02-08/inValid/taxid9606_PSIMI25.xml > /home/irefindex/data/BIND_Translation/2010-02-08/taxid9606_PSIMI25.xml</pre>
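Since several files are typically affected, the declaration can be prepended to every file in the <tt>inValid</tt> directory in one loop. This sketch assumes, as above, that the files fail validation only because the declaration is missing; files that are invalid for other reasons should not be repaired this way.

```shell
# Prepend an XML declaration to each .xml file in an inValid directory that
# lacks one, writing the repaired copy to the target directory.
# Assumption: the files are otherwise well-formed.
fix_declarations() {
    invalid_dir=$1
    target_dir=$2
    for f in "$invalid_dir"/*.xml; do
        [ -e "$f" ] || continue
        if ! head -n 1 "$f" | grep -q '^<?xml'; then
            { printf '<?xml version="1.0" encoding="iso-8859-1"?>\n'
              cat "$f"; } > "$target_dir/$(basename "$f")"
        fi
    done
}

# Example with a scratch file; a real run would use the BIND_Translation paths.
tmp="${TMPDIR:-/tmp}/irefindex-decl-example"
mkdir -p "$tmp/inValid"
printf '<entrySet></entrySet>\n' > "$tmp/inValid/taxid9606_PSIMI25.xml"
fix_declarations "$tmp/inValid" "$tmp"
head -n 1 "$tmp/taxid9606_PSIMI25.xml"
```

The repaired copy is written to the date-specific directory, leaving the original in <tt>inValid</tt> untouched.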
To fix the BioGRID entity errors, run the following script from the <tt>iRef_PSI_XML2RDBMS</tt> directory:

<pre>python tools/fix_biogrid.py <BioGRID data file> <new BioGRID data file></pre>

For example:

<pre>python tools/fix_biogrid.py /home/irefindex/data/BioGRID/2010-02-08/inValid/BIOGRID-ALL-3.1.69.psi25.xml /home/irefindex/data/BioGRID/2010-02-08/BIOGRID-ALL-3.1.69.psi25.xml</pre>

Make sure that only one XML file resides in the date-specific BioGRID data directory. Here, it is assumed that the data file was moved into the <tt>inValid</tt> subdirectory by the validator.

=== Alternatives and Utilities ===

The <tt>xmllint</tt> program provided in the [http://xmlsoft.org/ libxml2] distribution, typically available as standard on GNU/Linux distributions, can be used to check and correct XML files. For example:

<pre>xmllint HPRD/2010-09-14/inValid/HPRD_SINGLE_PSIMI_041210.xml > HPRD_SINGLE_PSIMI_041210-corrected.xml</pre>

This corrects well-formedness issues in the source file, writing the result to the output file.
  
 
== Building BioPSI_Suplimenter ==

<li>Get the program's source code from this location:

  <p>https://hfaistos.uio.no/cgi-bin/viewvc.cgi/bioscape/bioscape/modules/interaction/Sabry/BioPSI_Suplimenter/</p>

Using CVS with the appropriate <tt>CVSROOT</tt> setting, run the corresponding checkout command.

<li>Obtain the program's dependencies. This program uses the <tt>SHA.jar</tt> file created above as well as the MySQL Connector/J library which can be found at the following location:

  <p>http://www.mysql.com/products/connector/j/</p>
  </li>

<li>Extract the dependencies. For example:

  <pre>tar zxf mysql-connector-java-5.1.6.tar.gz</pre>

  <pre>
mkdir lib
cp mysql-connector-java-5.1.6/mysql-connector-java-5.1.6-bin.jar lib/</pre>

The filenames in the above example will need adjusting, depending on the exact version of the library downloaded.

The <tt>SHA.jar</tt> file needs copying from its build location into the <tt>lib</tt> directory.</li>

<li>Compile the source code. In order to build the software on a computer which does not have the NetBeans IDE installed, copy the generic build file into the <tt>BioPSI_Suplimenter</tt> directory:

  <pre>cp Build_files/build.xml .</pre>

It might be necessary to edit the <tt>build.xml</tt> file, changing the particular filename for the <tt>.jar</tt> file whose name begins with <tt>mysql-connector-java</tt>, since this name will change between versions of that library.

Compile and create the <tt>.jar</tt> file as follows:

  <pre>ant jar</pre></li>
 
<pre>
create user 'irefindex'@'%' identified by 'mysecretpassword';
grant all privileges on irefindex.* to 'irefindex'@'%';
</pre>

If difficulties occur granting privileges in this way, try the following statements:

<pre>
grant select, insert, update, delete, create, drop, references, index, alter, create temporary tables, lock tables, execute, create view, show view, create routine, alter routine on <database>.* to '<username>'@'%';
grant process, file on *.* to '<username>'@'%';
</pre>

=== Manual loading of data ===

To obtain the sequences for SEGUIDs that are not retrieved in later stages, the <tt>seguid2sequence</tt> table has to be created as follows.

Where SEGUID identifier consistency is required with a previous database, copy the table from the previous release:

<pre>
create table seguid2sequence as
  select * from olddb.seguid2sequence;
</pre>

Otherwise, perform the following steps:

# Obtain the file <tt>seguidflat</tt> from ftp://bioinformatics.anl.gov/seguid/ or (locally) <tt>/biotek/dias/donaldson3/DATA/SEGUID</tt>
# Use the following SQL commands to load this into a table:

<pre>
create table seguid2sequence (
  seguid char(27) default '0',
  sequence varchar(49152) default 'NA',
  noaa int(11) default -1
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

load data infile '......../seguidflat' into table seguid2sequence FIELDS TERMINATED BY '\t';

update seguid2sequence set noaa=length(replace(sequence,' ',''));
</pre>

After populating the table in either situation, add an index as follows:

<pre>
alter table seguid2sequence add index seguid(seguid);
</pre>
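For reference, a SEGUID is the base64-encoded SHA-1 digest of the uppercase sequence with the trailing <tt>=</tt> padding removed, which is why the <tt>seguid</tt> column above is <tt>char(27)</tt>. A minimal Python sketch of the checksum (the build itself uses the <tt>SHA</tt> program built earlier):

```python
import base64
import hashlib

def seguid(sequence):
    """SEGUID: base64-encoded SHA-1 of the uppercase sequence,
    with the trailing '=' padding character removed."""
    digest = hashlib.sha1(sequence.upper().encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")

# A SHA-1 digest is 20 bytes, so the encoded form is always 27 characters
# once the single '=' padding character is removed.
print(len(seguid("MTEYKLVVVGAGGVGKSALTIQLIQNHFVDE")))  # -> 27
```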
  
 
== Running BioPSI_Suplimenter ==

Please make sure that the [[#Manual_loading_of_data|manual loading of data]] was completed before this step, if appropriate.

Run the program as follows:

<pre>java -jar -Xms256m -Xmx768m build/jar/BioPSI_Suplimenter.jar &</pre>

In the dialogue box that appears, the following details must always be filled out:

<dl>
<dt>Server
<dd>the <tt><host></tt> value specified when creating the database
<dt>Database
<dd>the <tt><database></tt> value specified when creating the database
</dl>

Make sure that the log file will be written to a directory which already exists. For example:

<pre>mkdir /home/irefindex/logs/</pre>

The program will need to be executed a number of times for different activities, and these are described in separate sections below. For each one, select the corresponding menu item in the <tt>Program</tt> field shown in the dialogue.
  
 
=== Create tables ===

{{Note|
Before beginning, check the following:

* Verify that the SQL file is up-to-date (especially if new database names were added during the previous build).
* Make sure that the permissions are appropriate for the user running the program.
}}

The <tt>SQL file</tt> field should refer to the <tt>Create_iRefIndex.sql</tt> file in the SQL
subdirectory hierarchy.

Click the <tt>OK</tt> button to create the tables.

A brief review of the database is recommended to check whether all the tables in the SQL file were successfully created.
=== Clone seguid ===

{{Note|
Before beginning, make sure the database user is privileged to read from the source SEGUID table. The following command can be used to grant privileges to access an earlier database:

<pre>grant select on <database>.* to '<username>'@'%';</pre>

For example:

<pre>grant select on olddb.* to 'irefindex'@'%';</pre>
}}

From iRefIndex 6.0, each ROG is consistent with the previous release. Therefore, the first operation when creating the new SEGUID table is to copy the SEGUID table from the previous release. This is included as an option in BioPSI_Suplimenter. The <tt>seguidannotation</tt> file is no longer parsed; if an updated version of this file is made available by the SEGUID project, it has to be applied as an updating step.

{{Note|
A process for updating from a newer version of the <tt>seguidannotation</tt> file is not currently defined in BioPSI_Suplimenter.
}}

When the "Clone SEGUID" option is selected from the BioPSI_Suplimenter GUI, the <tt>SEGUID file</tt> field specifies the source <tt>seguid</tt> table (so for the database <tt>beta7</tt>, the <tt>seguid</tt> table used would be <tt>beta6.seguid</tt>). The target database selected should not already have a <tt>seguid</tt> table; if it does, an error will be thrown.

==== Free and proprietary releases & clone seguid ====

{{Note|
This section has been retained for historical purposes.
}}

iRefIndex previously supplied two subversions or distributions for every release - free and proprietary - requiring that each ROG not merely be consistent with the previous release, but also be consistent between the free and proprietary versions of the release being made. Thus, the cloning always had to be done using the earlier full/proprietary version as the source and the current full/proprietary version as the target. In other words, the proprietary version was made first and then the free version. Once the proprietary version had been made, the SEGUID table of the free version was made by cloning the current proprietary version's SEGUID table (not the previous version's).

It is recommended to check that the record counts match between source and target and that all the indices are properly made.
  
 
=== Recreate SEGUID ===
 
=== Recreate SEGUID ===
  
{{Note|
This activity needs to be performed as an update step after the Clone SEGUID process has been performed.
}}

The <tt>SEGUID file</tt> field should refer to the <tt>seguidannotation</tt> file in the SEGUID subdirectory hierarchy if a new release has been made available. For example, given the following data directory...
  
 
<pre>/home/irefindex/data</pre>
 
  
...an appropriate value for the <tt>SEGUID file</tt> field might be this:

<pre>/home/irefindex/data/SEGUID/2010-02-08/seguidannotation</pre>
  
Where the Clone SEGUID process has already populated the database with SEGUID information and no new file has been made available, the <tt>SEGUID file</tt> field can be left blank.

For the following fields, indicate the locations of the corresponding files and directories similarly:
  
 
<dl>
<dt>Unip_SP_file
<dd><tt>uniprot_sprot.dat</tt> (from UniProt); for example:
  <pre>/home/irefindex/data/UniProt/2010-02-08/uniprot_sprot.dat</pre>
<dt>Unip_Trm_file
<dd><tt>uniprot_trembl.dat</tt> (from UniProt); for example:
  <pre>/home/irefindex/data/UniProt/2010-02-08/uniprot_trembl.dat</pre>
<dt>unip_Isoform_file
<dd><tt>uniprot_sprot_varsplic.fasta</tt> (from UniProt); for example:
  <pre>/home/irefindex/data/UniProt/2010-02-08/uniprot_sprot_varsplic.fasta</pre>
<dt>Unip_Yeast_file
<dd><tt>yeast.txt</tt> (in the <tt>yeast</tt> directory); for example:
  <pre>/home/irefindex/data/yeast/2010-02-08/yeast.txt</pre>
<dt>Unip_Fly_file
<dd><tt>fly.txt</tt> (in the <tt>fly</tt> directory); for example:
  <pre>/home/irefindex/data/fly/2010-02-08/fly.txt</pre>
<dt>RefSeq DIR
<dd>The specific download directory for RefSeq (a directory is given here instead of individual filenames); for example:
  <pre>/home/irefindex/data/RefSeq/2010-02-08/</pre>
<dt>Fasta 4 PDB
<dd><tt>pdbaa.fasta</tt> (from PDB); for example:
  <pre>/home/irefindex/data/PDB/2010-02-08/pdbaa.fasta</pre>
<dt>Tax Table 4 PDB
<dd><tt>tax.table</tt> (from PDB); for example:
  <pre>/home/irefindex/data/PDB/2010-02-08/tax.table</pre>
<dt>Gene info file
<dd><tt>gene_info.txt</tt> (in the <tt>geneinfo</tt> directory); for example:
  <pre>/home/irefindex/data/geneinfo/2010-02-08/gene_info.txt</pre>
<dt>gene2Refseq
<dd><tt>gene2refseq.txt</tt> (in the <tt>NCBI_Mappings</tt> directory); for example:
  <pre>/home/irefindex/data/NCBI_Mappings/2010-02-08/gene2refseq.txt</pre>
</dl>
  
The <tt>SEGUID table</tt> field should specify the current database. For example:

<pre>beta9.seguid</pre>
  
 
=== Fill Bind info ===
 
 
directory hierarchy. For example, for <tt>Bind Ints file</tt>:
 
  
<pre>/home/irefindex/data/BIND/2010-02-08/20060525.ints.txt</pre>
To conveniently edit all file fields, you can edit the <tt>Base loc</tt> field, inserting the top-level data directory. For example:

<pre>/home/irefindex/data</pre>

In addition, the <tt>Date info</tt> field can also be changed to indicate the common date directory name used by the data sources. For example:

<pre>2010-02-08</pre>

Be sure to check the final values of the file fields themselves before activating the operation.
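The way the <tt>Base loc</tt> and <tt>Date info</tt> values combine with the per-source subdirectories can be sketched in Python. This is purely illustrative: the function name and the field values are assumptions for the sketch, not part of BioPSI_Suplimenter itself.

```python
import os.path

def field_path(base_loc, source, date_info, filename):
    # Combine the top-level data directory ("Base loc"), the source name,
    # the common date directory ("Date info") and the filename.
    return os.path.join(base_loc, source, date_info, filename)

# For example, the "Bind Ints file" field from above:
print(field_path("/home/irefindex/data", "BIND", "2010-02-08", "20060525.ints.txt"))
# -> /home/irefindex/data/BIND/2010-02-08/20060525.ints.txt
```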
== Importing MPIDB Data ==

A distribution called mpidb2mitab has been created for the purpose of parsing and correcting the MPIDB data files, preparing the files for import into iRefIndex. An overview of the complete process is given as follows:

<ol>
<li>Create a database for processing purposes. This currently uses PostgreSQL but could be changed to run within the database being built for iRefIndex.</li>
<li>Parse the MPIDB data files.</li>
<li>Initialise the processing database for MITAB-related data.</li>
<li>Import the MITAB-related data.</li>
<li>Convert the data to iRefIndex-compatible data.</li>
<li>Export the data for presentation to iRefIndex.</li>
<li>Inspect the iRefIndex database to obtain a starting unique identifier (uid) for the import.</li>
<li>Import the iRefIndex-compatible data into iRefIndex, specifying the uid.</li>
</ol>

== Creating taxid2name table ==
  
This table is created by manually loading data.
<ol>
<li>Download the SQL dump from ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz :

<pre>wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz</pre></li>

<li>Create a database table:

<pre>create table taxid2name(
  taxid int default -1,
  name varchar(256) default 'NA',
  unq_name varchar(256) default 'NA',
  cla_name varchar(256) default 'NA'
) ENGINE=InnoDB DEFAULT CHARSET=latin1;</pre></li>

<li>Unpack the data into a directory:

<pre>mkdir taxdump
mv taxdump.tar.gz taxdump
tar zxf taxdump/taxdump.tar.gz -C taxdump</pre></li>

<li>Import the name data:

<pre>load data infile 'taxdump/names.dmp' into table taxid2name FIELDS TERMINATED BY '\|';</pre>

(Note that if <tt>--local-infile</tt> is specified when logging into MySQL, client-side and relative paths can be used with the <tt>load data local infile</tt> command.)</li>

<li>Post-processing:

<pre>update taxid2name set name=(replace(name,'\t',''));
update taxid2name set unq_name=(replace(unq_name,'\t',''));
update taxid2name set cla_name=(replace(cla_name,'\t',''));
alter table taxid2name add index taxid(taxid);
alter table taxid2name add index name(name);
alter table taxid2name add index unq_name(unq_name);</pre></li>
</ol>
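The tab-stripping updates above are needed because fields in <tt>names.dmp</tt> are delimited by <tt>\t|\t</tt> rather than by a bare <tt>|</tt>, so a load that splits on <tt>|</tt> alone leaves stray tab characters around each value. A small Python sketch of the effect (the sample line is illustrative):

```python
# A names.dmp row uses "\t|\t" as its field delimiter and ends with "\t|".
line = "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|"

raw = line.split("|")                       # what a '|'-delimited load sees
clean = [f.replace("\t", "") for f in raw]  # what the update statements do

taxid, name, unq_name, cla_name = clean[:4]
print(taxid, name, cla_name)  # -> 9606 Homo sapiens scientific name
```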
== Building iRef_PSI_XML2RDBMS ==

{{Note|
iRef_PSI_XML2RDBMS replaces StaxPSIXML as the PSI-MI XML parsing component in iRefIndex from release 8.0.
}}

*Before beginning, make sure the validator was run on all the source files. Check the validator's log to locate any anomalies, and check whether the files are in the place where the configuration file will search. Please note that if a file was found to be invalid, it will have been moved to a sub-folder called "invalid".

The <tt>iRef_PSI_XML2RDBMS.jar</tt> file needs to be obtained or built.
  
 
<ol>
<li>Get the program's source code from this location:

   <p>https://hfaistos.uio.no/cgi-bin/viewvc.cgi/bioscape/bioscape/modules/interaction/Sabry/iRef_PSI_XML2RDBMS/</p>
 
  
 
Using CVS with the appropriate <tt>CVSROOT</tt> setting, run the following command:

   <pre>cvs co bioscape/bioscape/modules/interaction/Sabry/iRef_PSI_XML2RDBMS</pre>

The <tt>CVSROOT</tt> environment variable should be set to the following for this to work:

   <pre>export CVSROOT=:ext:<username>@hfaistos.uio.no:/mn/hfaistos/storage/cvsroot</pre>

(The <tt><username></tt> should be replaced with your actual username.)
 
<li>Obtain the program's dependencies. This program uses the MySQL Connector/J library, which can be found at the following location:

   <p>http://www.mysql.com/products/connector/j/</p>

You may choose to refer to the download from the <tt>BioPSI_Suplimenter</tt> build process.</li>
 
   <pre>tar zxf mysql-connector-java-5.1.6.tar.gz</pre>

This will produce a directory called <tt>mysql-connector-java-5.1.6</tt> containing a file called <tt>mysql-connector-java-5.1.6-bin.jar</tt>, which should be placed in the <tt>lib</tt> directory in the <tt>iRef_PSI_XML2RDBMS</tt> directory...

   <pre>
mkdir lib
cp mysql-connector-java-5.1.6/mysql-connector-java-5.1.6-bin.jar lib/</pre>

You may instead choose to copy the library from the <tt>BioPSI_Suplimenter/lib</tt> directory:

   <pre>
mkdir lib
cp ../../BioPSI_Suplimenter/lib/mysql-connector-java-5.1.6-bin.jar lib/</pre>

The filenames in the above examples will need adjusting, depending on the exact version of the library downloaded.</li>

<li>Compile the source code. It might be necessary to edit the <tt>build.xml</tt> file, changing the particular filename for the <tt>.jar</tt> file whose name begins with <tt>mysql-connector-java</tt>, since this name will change between versions of that library.

Compile and create the <tt>.jar</tt> file as follows:

   <pre>ant jar</pre></li>
</ol>
  
== Running iRef_PSI_XML2RDBMS ==
  
 
The software must first be configured using files provided in the <tt>config</tt> directory. This can be done using the <tt>make_config.py</tt> script provided:

<pre>python make_config.py <data_directory> <date_prefix> <log_directory></pre>

For example:

<pre>python make_config.py /home/irefindex/data 2010-02-08 /home/irefindex/logs</pre>

This will produce a new version of the <tt>configFileList.txt</tt> file which should be appropriately configured.

=== Manual Configuration ===

In <tt>configFileList.txt</tt>, the <tt>CONFIG</tt> entries must be edited in order to refer to the locations of each of the individual configuration files. To remove a data source, add a leading <tt>#</tt> character to the appropriate line.
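The commenting convention can also be applied programmatically; the following Python sketch adds the leading <tt>#</tt> for a named source. The entry format shown is hypothetical - only the leading-<tt>#</tt> convention is taken from the text above.

```python
def disable_source(lines, source):
    # Add a leading '#' to any uncommented line mentioning the source.
    result = []
    for line in lines:
        if source in line and not line.startswith("#"):
            line = "#" + line
        result.append(line)
    return result

entries = ["config/config_1_MINT.xml", "config/config_1_OPHID.xml"]
print(disable_source(entries, "OPHID"))
# -> ['config/config_1_MINT.xml', '#config/config_1_OPHID.xml']
```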
Each supplied configuration file has a name of the form <tt>config_X_SOURCE.xml</tt>, where <tt>X</tt> is an arbitrary identifier which helps to distinguish between different configuration versions, and where <tt>SOURCE</tt> is a specific data source name such as one of the following:

<ul>
<li>BIOGRID</li>
<li>DIP</li>
<li>HPRD</li>
<li>I2D</li>
<li>InnateDB</li>
<li>IntAct</li>
<li>MatrixDB</li>
<li>MINT</li>
<li>MIPS</li>
<li>MIPS_MPACT</li>
<li>OPHID</li>
</ul>

In each file (specified in <tt>configFileList.txt</tt>), a number of elements need to be defined within the <tt>locations</tt> element:

<dl>
<dt>logger
<dd>The location of the log file to be created. If a log file already exists, the new information will be appended. In the event that the program throws more than 50000 exceptions, the errors will be continued in new files ordered by a numeric identifier specified at the end of each filename.
 
<dd>This is the location of the PSI-XML files to be parsed. For example:

   <pre>/home/irefindex/data/BioGRID/2010-02-08/textfiles/</pre>

This will directly also store the <tt>lastUpdate.obj</tt> file:
 
<dd>This is the most important part of the parsing. This file defines where to obtain the data from the XML file and which field of the table in the database the data is destined for. The location of this file is typically within the source code distribution, and an absolute path should be specified. For example:

   <pre>/home/irefindex/build/bioscape/bioscape/modules/interaction/Sabry/iRef_PSI_XML2RDBMS/mapper/Map25_INTACT_MINT_BIOGRID.xml</pre>

More information about the mapper is available in <tt>Readme_mapper.txt</tt> within the <tt>StaxPSIXML</tt> directory.

See also the documentation on the topic of [[iRefIndex_Development#Adding_Sources_to_iRefIndex|adding sources to iRefIndex]] for details of writing new mapper configuration files.
</dl>
  
=== Running the Program ===

{{Note|
If running the program again using an existing set of data files, be sure to remove all <tt>lastUpdate.obj</tt> files residing within the various source data directories, or the program will happily ignore the data.
}}

When all configuration files and mapper files are ready, run the program:

<pre>ant run</pre>

This will display a graphical interface requesting information about the location of the configuration file <tt>configFileList.txt</tt> and a suitable log directory, as well as database credentials.
+
== Validating the Results from iRef_PSI_XML2RDBMS ==

It is possible to validate the results of the parsing process by issuing the following queries against the database being prepared:

<pre>
select name, count(uid) from int_source inner join int_db on int_source.source = int_db.id group by name;
select name, count(uid) from int_object inner join int_db on int_object.source = int_db.id group by name;
select name, count(distinct sourceid) from int_source2object inner join int_db on int_source2object.source = int_db.id group by name;
select name, count(distinct objectid) from int_source2object inner join int_db on int_source2object.source = int_db.id group by name;
</pre>
+
This will tabulate the interactions and interactors respectively for each data source. The values can then be compared to the totals written to the files produced by the SaxValidator program; these files can be found in the data directory hierarchy in a location resembling the following:

<pre>/home/irefindex/data/2010-02-08</pre>

Each validated data source should have a pair of files as illustrated by the following directory listing extract:

<pre>
corum_interactions.txt
corum_interactors.txt
dip_interactions.txt
dip_interactors.txt
grid_interactions.txt
grid_interactors.txt
</pre>
  
A convenient way of getting similar tabular summaries to those returned by the above queries is to run the following commands:

<pre>
grep -e "total.*INTERACTION" /home/irefindex/data/2010-02-08/*.txt
grep -e "total.*INTERACTOR" /home/irefindex/data/2010-02-08/*.txt
</pre>

{{Note|
The above approaches do not seem to work with BIND Translation, since it provides experimental interactor details in addition to participant interactor details. However, a simple program can be written to perform a slightly more complicated textual search:

<pre>
#!/usr/bin/env python
import re
from glob import glob
l = glob("/home/irefindex/data/BIND_Translation/2011-06-11/*.xml")
p = re.compile(r"<participant.*?>\s*<interactor id", re.MULTILINE)
total = 0
for i in l:
    total += len(p.findall(open(i).read()))
print total
</pre>

This insists on counting only interactor elements within participant elements.
}}
It is especially important when testing new data sources to see whether undefined values (represented by <tt>-8</tt>) appear in the results:

<pre>
select name, count(*) from int_source inner join int_db on int_source.source = int_db.id where uid = -8 group by name;
select name, count(*) from int_object inner join int_db on int_object.source = int_db.id where uid = -8 group by name;
select name, count(*) from int_source2object inner join int_db on int_source2object.source = int_db.id where sourceid = -8 group by name;
select name, count(*) from int_source2object inner join int_db on int_source2object.source = int_db.id where objectid = -8 group by name;
select name, refno, type, count(*) from int_category inner join int_xref on int_category.refno = int_xref.category inner join int_db on int_xref.dbid = int_db.id where uid = -8 group by int_db.name, refno, type;
select refno, type, count(*) from int_category inner join int_name on int_category.refno = int_name.category where uid = -8 group by refno, type;
</pre>
  
=== Potential Problems ===

Any <tt>-8</tt> values indicate that information was not correctly captured for a particular field. The most severe case is <tt>-8</tt> in both the <tt>sourceid</tt> and <tt>objectid</tt> columns of the <tt>int_source2object</tt> table: such records merely indicate the presence of interactions with no indication of what is interacting or where the interaction information originated. The occurrence of <tt>-8</tt> is typically the result of a failure of the mapper component of iRef_PSI_XML2RDBMS to interpret a data file appropriately.

The statistics for <tt>int_source</tt>, <tt>int_object</tt> and <tt>int_source2object</tt> may differ, potentially showing that fewer interactions are recorded in the latter mapping table than are present in the "source" table, or that fewer interactors are involved in interactions than are present in the "object" table. This may also be due to a failure of the mapper component to associate interactors with interactions, but there may be legitimate reasons for it: data files may repeat definitions and identifiers, or they may contain currently unsupported forms of data, such as complex information, which the mapper does not yet support.

{{Note|
Make sure there is 100% agreement between the element count and what is loaded into the database. If there is a difference, even a difference of one, the reason should be located before proceeding. After parsing, it is important to make sure there is no overlap in the UIDs. The following queries should each return the empty set:

<pre>
select * from int_object where int_object.uid in (select uid from int_source);
select * from int_object where int_object.uid in (select uid from int_experiment);
select * from int_source where int_source.uid in (select uid from int_experiment);
</pre>
}}
  
== Running BioPSI_Suplimenter (continued) ==

'''New for iRefIndex 7.0:''' when running the "ROG_ass + RIG_fill+ make Cy" process for free and proprietary releases, there is a slight difference due to the ROG consistency issue. There is a checkbox in the GUI (in the red area) which is selected by default; it has to be deselected when making the ''free'' version.

'''Workaround for iRefIndex 7.0 and onwards:''' the <tt>rig2rigid</tt> and <tt>risg2risgid</tt> tables need to be copied from a previous build as follows:

<pre>
create table rig2rigid as (select * from iRefIndex_full_beta7.rig2rigid);
create table risg2risgid as (select * from iRefIndex_full_beta7.risg2risgid);
alter table rig2rigid add index rigid(rigid);
alter table rig2rigid add index rig(rig);
alter table risg2risgid add index risgid(risgid);
alter table risg2risgid add index id(id);
</pre>
  
=== ROG_ass + RIG_fill+ make Cy ===

'''Note:''' when building the free version of the release, the <tt>UniProt_table</tt>, <tt>gene2refseq table</tt>, <tt>SEGUID table</tt> and <tt>Pre_build Eutils</tt> tables from the full version's database should be specified. (When building the full version, data has already been cloned from a previous release, and the tables in the same database should be specified.)

A table prepared from Web service data should be given for the <tt>Pre_build Eutils</tt> field. For example:

<dl>
<dt>Pre_build Eutils
<dd>The name of the Web service data table; for example:
  <pre>irefindex.eutils</pre>
</dl>

It is possible to specify the location of a table residing in another database. In that case, the table will be copied into the target database if it proves not to be possible to download Eutils information or if such downloading is not undertaken by the user.

Upon initiating this activity, a dialogue will be displayed showing the following message:

<pre>This will reset all previous ROGFILL information. Do you want to continue ?</pre>

Selecting <tt>Yes</tt> will cause the activity to proceed. A dialogue will then appear:

<pre>Do you want to recreate Eutils (without using an existing version) ?</pre>

Selecting <tt>Yes</tt> will cause Eutils information to be downloaded, whereas <tt>No</tt> will take the table specified above into use in order to provide such information to the activity.

{{Note|
Due to restrictions around Eutils availability and the possibility that the program will need to access many Eutils records, the program will significantly reduce downloading activity outside weekend periods. Thus it is highly recommended that this activity be undertaken during a weekend.
}}
  
=== Canonical_Mapper ===

This process can be selected and run using the usual options.

=== Make Cy tables + PMID scorer ===

This process can be selected and run using the usual options.

=== SEGUID Manipulator ===

Before this process can be run, a collection of SQL commands must be run manually against the database. These commands reside in files in the <tt>SQL_commands</tt> directory alongside <tt>BioPSI_Suplimenter</tt> in CVS:

# The commands in <tt>make_export_table.sql</tt> should be executed in the build database.
# Then, the SEGUID Manipulator should be run.
# Then, the commands in <tt>make_export_table_products.sql</tt> should be executed in the build database.

=== Make Cy tables (canonical) ===

Before this process can be run, a collection of SQL commands must be run manually against the database. These commands reside in files in the <tt>SQL_commands</tt> directory alongside <tt>BioPSI_Suplimenter</tt> in CVS:

# The commands in <tt>make_canonical_tables.sql</tt> should be executed in the build database.
# Then, this process itself should be run.
# Then, the commands in <tt>make_canonical_table_products.sql</tt> should be executed in the build database.

== All iRefIndex Pages ==

Follow this link for a listing of all iRefIndex related pages (archived and current).

[[Category:iRefIndex]]

Latest revision as of 15:50, 4 October 2011

Downloading the Source Data

Before downloading the source data, a location must be chosen for the downloaded files. For example:

/home/irefindex/data

Some data sources need special links to be obtained from their administrators via e-mail, and in general there is a distinction between free and proprietary data sources, described as follows:

Free
BIND, BioGRID, Gene2Refseq (NCBI), InnateDB, IntAct, MatrixDB, MINT, MMDB/PDB, MPIDB, MPPI, OPHID, RefSeq, UniProt
Proprietary
BIND Translation, CORUM, DIP, HPRD, MPact

I2D, which was considered for iRefIndex 7.0, is currently under review for inclusion in future releases. The status of BIND Translation is currently under review for possible inclusion in the free dataset in future releases.

The FTPtransfer program will download data from the following sources:

  • Gene2Refseq
  • IntAct
  • MINT
  • MMDB
  • PDB
  • RefSeq
  • UniProt

Manual Downloads

More information can be found at the following location: Sources_iRefIndex

For each manual download, a subdirectory hierarchy must be created in the main data directory using a command of the following form:

mkdir -p <path-to-data>/<source>/<date>/

Here, <path-to-data> should be replaced by the location of the data directory, <source> should be replaced by the name of the source, and <date> should be replaced by the current date.

For example, for BIND this directory might be created as follows:

mkdir -p /home/irefindex/data/BIND/2010-02-08/
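The same directory layout can be produced programmatically, with the date generated at run time; a minimal sketch, using a stand-in /tmp location and BIND as the example source:

```shell
# Stand-in data directory for illustration; in practice use e.g. /home/irefindex/data.
DATA=/tmp/irefindex-data
SOURCE=BIND
DATE=$(date +%Y-%m-%d)

# Create <path-to-data>/<source>/<date>/ in one step.
mkdir -p "$DATA/$SOURCE/$DATE/"
ls -d "$DATA/$SOURCE/$DATE/"
```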

BIND

The FTP site was previously available at the following location:

ftp://ftp.bind.ca/pub/BIND/data/bindflatfiles/bindindex/

An archived copy of the data can be found at the following internal location:

/biotek/dias/donaldson3/Sabry/DATA_2006/BINDftp/

In the main downloaded data directory, create a subdirectory hierarchy as noted above. For example:

mkdir -p /home/irefindex/data/BIND/2010-02-08/

Copy the following files into the newly created data directory:

20060525.complex2refs.txt
20060525.complex2subunits.txt
20060525.ints.txt
20060525.labels.txt
20060525.refs.txt

BIND Translation

Note

This source should eventually be incorporated into the automated download functionality.

The location of BIND Translation downloads is as follows:

http://download.baderlab.org/BINDTranslation/

The location of the specific file to be downloaded is the following:

http://download.baderlab.org/BINDTranslation/release1_0/BINDTranslation_v1_xml_AllSpecies.tar.gz

(Note that the specific file varies from release to release - see the sources page for a particular release for more details.)

In the main downloaded data directory, create a subdirectory hierarchy as noted above. For example:

mkdir -p /home/irefindex/data/BIND_Translation/2010-02-08/

Download the file into the newly created data directory and unpack it as follows:

cd /home/irefindex/data/BIND_Translation/2010-02-08/
tar zxf BINDTranslation_v1_xml_AllSpecies.tar.gz

BioGRID

The location of BioGRID downloads is as follows:

http://www.thebiogrid.org/downloads.php

In the main downloaded data directory, create a subdirectory hierarchy as noted above. For example:

mkdir -p /home/irefindex/data/BioGRID/2010-02-08/

Select the BIOGRID-ALL-X.Y.Z.psi25.zip file (where X.Y.Z should be replaced by the actual release number) and download/copy it to the newly created data directory for BioGRID.

In the data directory for BioGRID, uncompress the downloaded file. For example:

cd /home/irefindex/data/BioGRID/2010-02-08/
unzip BIOGRID-ALL-2.0.62.psi25.zip

CORUM

The location of CORUM downloads is as follows:

http://mips.gsf.de/genre/proj/corum/index.html

The specific download file is this one:

http://mips.gsf.de/genre/export/sites/default/corum/allComplexes.psimi.zip

In the main downloaded data directory, create a subdirectory hierarchy as noted above. For example:

mkdir -p /home/irefindex/data/CORUM/2010-02-08/

Copy/download the file referenced above and uncompress it in the data directory for CORUM. For example:

cd /home/irefindex/data/CORUM/2010-02-08/
unzip allComplexes.psimi.zip

DIP

Access to data from DIP is performed via the following location:

http://dip.doe-mbi.ucla.edu/dip/Login.cgi?

You have to register, agree to terms, and get a user account.

Access credentials for internal users are available from Sabry.

In the main downloaded data directory, create a subdirectory hierarchy as noted above. For example:

mkdir -p /home/irefindex/data/DIP/2010-02-08/

Select the FULL - complete DIP data set from the Files page:

http://dip.doe-mbi.ucla.edu/dip/Download.cgi?SM=3

Download the latest PSI-MI 2.5 file (dip<date>.mif25) to the newly created data directory. If a compressed version of the file was chosen, uncompress the file using the gunzip tool. For example:

cd /home/irefindex/data/DIP/2010-02-08/
gunzip dip20080708.mif25.gz

HPRD

http://www.hprd.org/download/

In the main downloaded data directory, create a subdirectory hierarchy as noted above. For example:

mkdir -p /home/irefindex/data/HPRD/2010-02-08/

Download the PSI-MI single file (HPRD_SINGLE_PSIMI_<date>.xml.tar.gz) to the newly created data directory.

Note: you have to register each and every time, unfortunately.

Uncompress the downloaded file. For example:

cd /home/irefindex/data/HPRD/2010-02-08/
tar zxf HPRD_SINGLE_PSIMI_090107.xml.tar.gz

I2D

For iRefIndex 7.0, I2D was supposed to replace OPHID, but problems with the source files have excluded I2D from that release.

http://ophid.utoronto.ca/ophidv2.201/downloads.jsp

In the main downloaded data directory, create a subdirectory hierarchy as noted above. For example:

mkdir -p /home/irefindex/data/I2D/2010-02-08/

For the Download Format in the download request form, specify PSI-MI 2.5 XML. Unfortunately, each Target Organism must be specified in turn when submitting the form: there is no ALL option.

Uncompress each downloaded file. For example:

cd /home/irefindex/data/I2D/2010-02-08/
unzip i2d.HUMAN.psi25.zip

InnateDB

Note

This source should eventually be incorporated into the automated download functionality.

Select the "Curated InnateDB Data" download from the InnateDB downloads page:

http://www.innatedb.com/download.jsp

In the main downloaded data directory, create a subdirectory hierarchy as noted above. For example:

mkdir -p /home/irefindex/data/InnateDB/2010-02-08/

Uncompress the downloaded file. For example:

cd /home/irefindex/data/InnateDB/2010-02-08/
gunzip innatedb_20100716.xml.gz

MatrixDB

Note

This source should eventually be incorporated into the automated download functionality.

In the main downloaded data directory, create a subdirectory hierarchy as noted above. For example:

mkdir -p /home/irefindex/data/MatrixDB/2011-06-11/

The data is available from the following site:

http://matrixdb.ibcp.fr/

Selecting the "Download MatrixDB data" link leads to the following page:

http://matrixdb.ibcp.fr/cgi-bin/download

Here, selecting the "PSI-MI XML 2.5" download under "PSI-MI XML or TAB 2.5 MatrixDB literature curation interactions" will result in a file being downloaded, and this should be placed in the newly created directory.

Uncompress the data as follows:

cd /home/irefindex/data/MatrixDB/2011-06-11/
unzip MatrixDB_20100826.xml.zip

MIPS

Note

This source should eventually be incorporated into the automated download functionality.

In the main downloaded data directory, create a subdirectory hierarchy as noted above for MIPS and MPACT. For example:

mkdir -p /home/irefindex/data/MIPS/2010-02-08/
mkdir -p /home/irefindex/data/MPACT/2010-02-08/

For MPPI, download the following file:

http://mips.gsf.de/proj/ppi/data/mppi.gz

For MPACT, download the following file:

ftp://ftpmips.gsf.de/yeast/PPI/mpact-complete.psi25.xml.gz

Uncompress the downloaded files in their respective directories. For example:

cd /home/irefindex/data/MPACT/2010-02-08/
gunzip mpact-complete.psi25.xml.gz
cd /home/irefindex/data/MIPS/2010-02-08/
gunzip mppi.gz

MPIDB

Note

This source should eventually be incorporated into the automated download functionality.

In the main downloaded data directory, create a subdirectory hierarchy as noted above. For example:

mkdir -p /home/irefindex/data/MPIDB/2011-06-11/

For MPI-LIT, download the following resource:

http://www.jcvi.org/mpidb/interaction.php?dbsource=MPI-LIT

For MPI-IMEX, download the following resource:

http://www.jcvi.org/mpidb/interaction.php?dbsource=MPI-IMEX

The downloads should be placed in the MPIDB data directory with the names MPI-LIT.txt and MPI-IMEX.txt, perhaps using the following example download commands:

cd /home/irefindex/data/MPIDB/2011-06-11/
wget -O MPI-LIT.txt 'http://www.jcvi.org/mpidb/interaction.php?dbsource=MPI-LIT'
wget -O MPI-IMEX.txt 'http://www.jcvi.org/mpidb/interaction.php?dbsource=MPI-IMEX'

OPHID

From iRefIndex 8.0, I2D replaces OPHID.

OPHID is no longer available, so you have to use the local copy of the data:

/biotek/dias/donaldson3/Sabry/iRefIndex_Backup/BckUp15SEP2008/OPHID/2008MAR16

In the main downloaded data directory, create a subdirectory hierarchy as noted above. For example:

mkdir -p /home/irefindex/data/OPHID/2010-02-08/

Copy the file ophid1153236640123.xml to the newly created data directory.

SEGUID

Downloading of the SEGUID dataset is described below.

Build Dependencies

To build the software, Apache Ant needs to be available. This software could be retrieved from the Apache site...

http://ant.apache.org/bindownload.cgi

...or from a mirror such as one of the following:

http://mirrorservice.nomedia.no/apache.org//ant/binaries/apache-ant-1.8.2-bin.tar.gz

http://mirrors.powertech.no/www.apache.org/dist//ant/binaries/apache-ant-1.8.2-bin.tar.gz

This software can be extracted as follows:

tar zxf apache-ant-1.8.2-bin.tar.gz

This will produce a directory called apache-ant-1.8.2 containing a directory called bin. The outer directory should be recorded in the ANT_HOME environment variable, whereas the bin directory should be incorporated into the PATH environment variable on your system. For example, for bash:

export ANT_HOME=/home/irefindex/apps/apache-ant-1.8.2
export PATH=${PATH}:${ANT_HOME}/bin

It should now be possible to run the ant program.

Building FTPtransfer

The FTPtransfer.jar file needs to be obtained or built.

  1. Get the program's source code from this location:

    https://hfaistos.uio.no/cgi-bin/viewvc.cgi/bioscape/bioscape/modules/interaction/Sabry/FTPtransfer/

    Using CVS with the appropriate CVSROOT setting, run the following command:

    cvs co bioscape/bioscape/modules/interaction/Sabry/FTPtransfer

    The CVSROOT environment variable should be set to the following for this to work:

    export CVSROOT=:ext:<username>@hfaistos.uio.no:/mn/hfaistos/storage/cvsroot
    (The <username> should be replaced with your actual username.)
  2. Obtain the program's dependencies. This program uses the Apache commons-net package, and this must be available during compilation. This library could be retrieved from the Apache site...

    http://commons.apache.org/downloads/download_net.cgi

    ...or from a mirror such as one of the following:

    http://mirrorservice.nomedia.no/apache.org/commons/net/binaries/commons-net-1.4.1.tar.gz

    http://www.powertech.no/apache/dist/commons/net/binaries/commons-net-1.4.1.tar.gz

  3. Extract the dependencies:
    tar zxf commons-net-1.4.1.tar.gz

    This will produce a directory called commons-net-1.4.1 containing a file called commons-net-1.4.1.jar which should be placed in the lib directory in the FTPtransfer directory...

    mkdir lib
    cp commons-net-1.4.1/commons-net-1.4.1.jar lib/

    Alternatively, the external libraries can also be found in the following location:

    /biotek/dias/donaldson3/iRefIndex/External_libraries
  4. Compile the source code. In order to build the software on a computer which does not have the NetBeans IDE installed, copy the generic build file into the FTPtransfer directory:
    cp Build_files/build.xml .

    Compile and create the .jar file as follows:

    ant jar

Running FTPtransfer

To run the program, invoke the .jar file as follows:

java -Xms256m -Xmx256m -jar build/jar/FTPtransfer.jar log datadir

The specified log argument can be replaced with a suitable location for the program's execution log, whereas the datadir argument should be replaced with a suitable location for downloaded data (such as /home/irefindex/data).

Building SHA

The SHA.jar file needs to be obtained or built.

  1. Get the program's source code from this location:

    https://hfaistos.uio.no/cgi-bin/viewvc.cgi/bioscape/bioscape/modules/interaction/Sabry/SHA/

    Using CVS with the appropriate CVSROOT setting, run the following command:

    cvs co bioscape/bioscape/modules/interaction/Sabry/SHA

    The CVSROOT environment variable should be set to the following for this to work:

    export CVSROOT=:ext:<username>@hfaistos.uio.no:/mn/hfaistos/storage/cvsroot
    (The <username> should be replaced with your actual username.)
  2. Compile the source code. Compile and create the .jar file as follows:
    ant jar

    The SHA.jar file will be created in the dist directory.

Building SaxValidator

The SaxValidator.jar file needs to be obtained or built.

  1. Get the program's source code from this location:

    https://hfaistos.uio.no/cgi-bin/viewvc.cgi/Parser/SaxValidator/

    Using CVS with the appropriate CVSROOT setting, run the following command:

    cvs co Parser/SaxValidator

    The CVSROOT environment variable should be set to the following for this to work:

    export CVSROOT=:ext:<username>@hfaistos.uio.no:/mn/hfaistos/storage/cvsroot
    (The <username> should be replaced with your actual username.)
  2. Compile and create the .jar file as follows:
    ant jar

Running SaxValidator

The program used for validation and integrity checks is called SaxValidator; when the name was chosen, the program was merely a SAX-based validator, but more functionality has since been included:

  1. Validate XML files against a schema.
  2. XML parser-independent counting of elements (counting the number of </interaction> and </interactor> tags in each file). This gives an indication of what to expect at the end of the parsing.
  3. Count the number of lines in the BIND text files.
  4. Remove files containing negative interactions.

Run the program as follows:

java -jar -Xms256m -Xmx256m dist/SaxValidator.jar <data directory> <date extension> <validate true/false> <count elements true/false>

For example:

java -jar -Xms256m -Xmx256m dist/SaxValidator.jar /home/irefindex/data /2010-02-08/ true true

Be sure to include the leading and trailing / characters around the date information.

Handling Invalid Files

For each data source, invalid files will be moved to a subdirectory of that source's data directory. These subdirectories can be found by using the following Unix command:

find /home/irefindex/data -name inValid
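To see at a glance how many files each inValid subdirectory holds, the find command can be combined with a loop; a sketch using a stand-in /tmp tree (the DIP path and filename are illustrative):

```shell
# Build a stand-in tree for illustration.
DATA=/tmp/demo-data
mkdir -p "$DATA/DIP/2010-02-08/inValid"
touch "$DATA/DIP/2010-02-08/inValid/dip20080708.mif25"

# Report the number of files in each inValid subdirectory.
find "$DATA" -type d -name inValid | while read -r d; do
    printf '%s: %s file(s)\n' "$d" "$(ls "$d" | wc -l)"
done
```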

Known Issues

  • Some BIND Translation files do not have an appropriate encoding declaration
  • The BioGRID file may generate entity-related errors
  • The DIP file omits id attributes on experimentDescription elements
  • The HPRD file omits required elements from experimentDescription elements
  • MIPS MPACT/MPPI files may be flagged as invalid, but can still be parsed using workarounds in the parsing process

To fix the BIND Translation errors, prepend the following declaration to each incorrect file:

<?xml version="1.0" encoding="iso-8859-1"?>

For example, after saving the above declaration in a file called declaration.txt:

cat declaration.txt /home/irefindex/data/BIND_Translation/2010-02-08/inValid/taxid9606_PSIMI25.xml > /home/irefindex/data/BIND_Translation/2010-02-08/taxid9606_PSIMI25.xml
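To apply this fix to every file in an inValid directory at once, a loop can be used; a sketch with stand-in paths and a dummy file (real runs would use the BIND_Translation directories shown above):

```shell
DECL='<?xml version="1.0" encoding="iso-8859-1"?>'

# Stand-in directories and a dummy invalid file for illustration.
SRC=/tmp/bind-demo/inValid
DST=/tmp/bind-demo
mkdir -p "$SRC"
printf '<entrySet/>\n' > "$SRC/taxid9606_PSIMI25.xml"

# Prepend the declaration to each invalid file, writing the result
# back into the parent data directory.
for f in "$SRC"/*.xml; do
    { echo "$DECL"; cat "$f"; } > "$DST/$(basename "$f")"
done
head -n 1 "$DST/taxid9606_PSIMI25.xml"
```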

To fix the BioGRID entity errors, run the following script from the iRef_PSI_XML2RDBMS directory:

python tools/fix_biogrid.py <BioGRID data file> <new BioGRID data file>

For example:

python tools/fix_biogrid.py /home/irefindex/data/BioGRID/2010-02-08/inValid/BIOGRID-ALL-3.1.69.psi25.xml /home/irefindex/data/BioGRID/2010-02-08/BIOGRID-ALL-3.1.69.psi25.xml

Make sure that only one XML file resides in the date-specific BioGRID data directory. Here, it is assumed that the data file was moved into the inValid subdirectory by the validator.
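A quick check that the date-specific directory holds exactly one XML file might look like this (the directory and filename are stand-ins):

```shell
# Stand-in for /home/irefindex/data/BioGRID/<date>/
DIR=/tmp/biogrid-demo
mkdir -p "$DIR"
touch "$DIR/BIOGRID-ALL-3.1.69.psi25.xml"

# Warn unless exactly one XML file is present.
COUNT=$(ls "$DIR"/*.xml 2>/dev/null | wc -l)
if [ "$COUNT" -ne 1 ]; then
    echo "expected exactly one XML file in $DIR, found $COUNT" >&2
fi
echo "$COUNT"
```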

Alternatives and Utilities

The xmllint program provided in the libxml2 distribution, typically available as standard on GNU/Linux distributions, can be used to check and correct XML files. For example:

xmllint HPRD/2010-09-14/inValid/HPRD_SINGLE_PSIMI_041210.xml > HPRD_SINGLE_PSIMI_041210-corrected.xml

This corrects well-formedness issues with the source file in the output file.

Building BioPSI_Suplimenter

The BioPSI_Suplimenter.jar file needs to be obtained or built.

  1. Get the program's source code from this location:

    https://hfaistos.uio.no/cgi-bin/viewvc.cgi/bioscape/bioscape/modules/interaction/Sabry/BioPSI_Suplimenter/

    Using CVS with the appropriate CVSROOT setting, run the following command:

    cvs co bioscape/bioscape/modules/interaction/Sabry/BioPSI_Suplimenter

    The CVSROOT environment variable should be set to the following for this to work:

    export CVSROOT=:ext:<username>@hfaistos.uio.no:/mn/hfaistos/storage/cvsroot
    (The <username> should be replaced with your actual username.)
  2. Obtain the program's dependencies. This program uses the SHA.jar file created above as well as the MySQL Connector/J library which can be found at the following location:

    http://www.mysql.com/products/connector/j/

  3. Extract the dependencies. For example:
    tar zxf mysql-connector-java-5.1.6.tar.gz

    This will produce a directory called mysql-connector-java-5.1.6 containing a file called mysql-connector-java-5.1.6-bin.jar which should be placed in the lib directory in the BioPSI_Suplimenter directory...

    mkdir lib
    cp mysql-connector-java-5.1.6/mysql-connector-java-5.1.6-bin.jar lib/

    The filenames in the above example will need adjusting, depending on the exact version of the library downloaded.

    The SHA.jar file needs copying from its build location:

    cp ../SHA/dist/SHA.jar lib/

    Alternatively, the external libraries can also be found in the following location:

    /biotek/dias/donaldson3/iRefIndex/External_libraries
  4. Compile the source code. In order to build the software on a computer which does not have the NetBeans IDE installed, copy the generic build file into the BioPSI_Suplimenter directory:
    cp Build_files/build.xml .

    It might be necessary to edit the build.xml file, changing the particular filename for the .jar file whose name begins with mysql-connector-java, since this name will change between versions of that library.

    Compile and create the .jar file as follows:

    ant jar

Creating the Database

Enter MySQL using a command like the following:

mysql -h <host> -u <admin> -p -A

The <admin> is the name of the user with administrative privileges. For example:

mysql -h myhost -u admin -p -A

Then create a database and user using commands of the following form:

create database <database>;
create user '<username>'@'%' identified by '<password>';
grant all privileges on <database>.* to '<username>'@'%';

For example, with <database> given as irefindex, <username> given as irefindex, and a substitution for <password>:

create database irefindex;
create user 'irefindex'@'%' identified by 'mysecretpassword';
grant all privileges on irefindex.* to 'irefindex'@'%';

If difficulties occur granting privileges in this way, try the following statements:

grant select, insert, update, delete, create, drop, references, index, alter, create temporary tables, lock tables, execute, create view, show view, create routine, alter routine on <database>.* to '<username>'@'%';
grant process, file on *.* to '<username>'@'%';

Manual loading of data

In order to obtain the sequences of SEGUIDs that are not retrieved in later stages, the "seguid2sequence" table has to be created as follows.

Where SEGUID identifier consistency is required with a previous database, copy the table from the previous release:

create table seguid2sequence as
  select * from olddb.seguid2sequence;

Otherwise, perform the following steps:

  1. Obtain the file "seguidflat" from ftp://bioinformatics.anl.gov/seguid/ or (locally) /biotek/dias/donaldson3/DATA/SEGUID
  2. Use the following SQL commands to load this into a table:
create table seguid2sequence (
  seguid char(27) default '0',
  sequence varchar(49152) default 'NA',
  noaa int(11) default -1
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

load data infile '......../seguidflat' into table seguid2sequence FIELDS TERMINATED BY '\t';

update seguid2sequence set noaa=length(replace(sequence,' ',''));
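The noaa update above counts amino acids by stripping spaces from the sequence. Assuming seguidflat is a tab-separated file of seguid and sequence pairs (an assumption about the file's layout), the same computation can be sketched outside the database:

```shell
# Dummy two-column (seguid<TAB>sequence) line; the embedded space mimics
# formatting that the noaa update strips out.
printf 'aBcD\tMKV LID\n' > /tmp/seguidflat

# Equivalent of: update seguid2sequence set noaa=length(replace(sequence,' ',''));
awk -F'\t' '{gsub(/ /, "", $2); print $1, length($2)}' /tmp/seguidflat
```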

After populating the table in either situation, add an index as follows:

alter table seguid2sequence add index seguid(seguid);

Running BioPSI_Suplimenter

Please make sure that the manual loading of data was completed before this step, if appropriate.

Run the program as follows:

java -jar -Xms256m -Xmx768m build/jar/BioPSI_Suplimenter.jar &

In the dialogue box that appears, the following details must always be filled out:

Server
the <host> value specified when creating the database
Database
the <database> value specified when creating the database
User name
the <username> value specified above
Password
the <password> value specified above
Log file
the path to a log file where the program's output shall be written

Make sure that the log file will be written to a directory which already exists. For example:

mkdir /home/irefindex/logs/

The program will need to be executed a number of times for different activities, and these are described in separate sections below. For each one, select the corresponding menu item in the Program field shown in the dialogue.

Create tables

Note

Before beginning, test the following:

  • Verify the SQL file is up-to-date (especially if new database names were added during the previous build).
  • Make sure the permissions are appropriate for the user running the program.

The SQL file field should refer to the Create_iRefIndex.sql file in the SQL directory within BioPSI_Suplimenter, and this should be a full path to the file. For example:

/home/irefindex/build/bioscape/bioscape/modules/interaction/Sabry/BioPSI_Suplimenter/SQL/Create_iRefIndex.sql

Click the OK button to create the tables.

A brief review of the database is recommended to check whether all the tables in the SQL file were successfully created.

Clone seguid

Note

Before beginning, make sure the database user is privileged to read from the source SEGUID table. The following command can be used to grant privileges to access an earlier database:

grant select on <database>.* to '<username>'@'%';

For example:

grant select on olddb.* to 'irefindex'@'%';


From iRefIndex 6.0, each ROG is consistent with the previous release. Therefore, the first operation when creating the new SEGUID table is to copy the SEGUID table from the previous release; this is included as an option in BioPSI_Suplimenter. The seguidannotation file is no longer parsed; if there is an updated version of this file from the SEGUID project, it has to be applied as an updating step.

Note

A process for updating from a newer version of the seguidannotation file is not currently defined in BioPSI_Suplimenter.

When the "Clone SEGUID" option is selected from the BioPSI_Suplimenter GUI, the SEGUID file field specifies the source seguid table (so for the database beta7, the seguid table used would be beta6.seguid). The target database should not already have a seguid table; if it does, an error will be thrown.

Free and proprietary releases & clone seguid

Note

This section has been retained for historical purposes.

iRefIndex previously supplied two subversions or distributions for every release - free and proprietary - requiring that each ROG not merely be consistent with the previous release, but also be consistent between the free and proprietary versions of the release being made. Thus, the cloning always had to be done using the earlier full/proprietary version as the source and the current full/proprietary as target. In other words, the proprietary version was made first and then the free version. Once the proprietary version had been made, the SEGUID table of the free version was made by cloning the current proprietary version's SEGUID (not the previous version's).

It is recommended to check that the record counts match between source and target and that all the indices are properly made.

Recreate SEGUID

Note

This activity needs to be performed as an update step after the Clone SEGUID process has been performed.

The SEGUID file field should refer to the seguidannotation file in the SEGUID subdirectory hierarchy if a new release has been made available. For example, given the following data directory...

/home/irefindex/data

...an appropriate value for the SEGUID file field might be this:

/home/irefindex/data/SEGUID/2010-02-08/seguidannotation

Where the Clone SEGUID process has already populated the database with SEGUID information and no new file has been made available, the SEGUID file field can be left blank.

For the following fields, indicate the locations of the corresponding files and directories similarly:

Unip_SP_file
uniprot_sprot.dat (from UniProt); for example:
/home/irefindex/data/UniProt/2010-02-08/uniprot_sprot.dat
Unip_Trm_file
uniprot_trembl.dat (from UniProt); for example:
/home/irefindex/data/UniProt/2010-02-08/uniprot_trembl.dat
unip_Isoform_file
uniprot_sprot_varsplic.fasta (from UniProt); for example:
/home/irefindex/data/UniProt/2010-02-08/uniprot_sprot_varsplic.fasta
Unip_Yeast_file
yeast.txt (in the yeast directory); for example:
/home/irefindex/data/yeast/2010-02-08/yeast.txt
Unip_Fly_file
fly.txt (in the fly directory); for example:
/home/irefindex/data/fly/2010-02-08/fly.txt
RefSeq DIR
The specific download directory for RefSeq; for example:
/home/irefindex/data/RefSeq/2010-02-08/
Fasta 4 PDB
pdbaa.fasta (from PDB); for example:
/home/irefindex/data/PDB/2010-02-08/pdbaa.fasta
Tax Table 4 PDB
tax.table (from PDB); for example:
/home/irefindex/data/PDB/2010-02-08/tax.table
Gene info file
gene_info.txt (in the geneinfo directory); for example:
/home/irefindex/data/geneinfo/2010-02-08/gene_info.txt
gene2Refseq
gene2refseq.txt (in the NCBI_Mappings directory); for example:
/home/irefindex/data/NCBI_Mappings/2010-02-08/gene2refseq.txt

The SEGUID table field should specify the current database. For example:

beta9.seguid

Fill Bind info

The file fields should refer to the appropriate BIND files in the data directory hierarchy. For example, for Bind Ints file:

/home/irefindex/data/BIND/2010-02-08/20060525.ints.txt

To conveniently edit all file fields, you can edit the Base loc field, inserting the top-level data directory. For example:

/home/irefindex/data

In addition, the Date info can also be changed to indicate the common date directory name used by the data sources. For example:

2010-02-08

Be sure to check the final values of the file fields themselves before activating the operation.

Importing MPIDB Data

A distribution called mpidb2mitab has been created for the purpose of parsing and correcting the MPIDB data files, preparing the files for import into iRefIndex. An overview of the complete process is given as follows:

  1. Create a database for processing purposes. This currently uses PostgreSQL but could be changed to run within the database being built for iRefIndex.
  2. Parse the MPIDB data files.
  3. Initialise the processing database for MITAB-related data.
  4. Import the MITAB-related data.
  5. Convert the data to iRefIndex-compatible data.
  6. Export the data for presentation to iRefIndex.
  7. Inspect the iRefIndex database to obtain a starting unique identifier (uid) for the import.
  8. Import the iRefIndex-compatible data into iRefIndex, specifying the uid.

Creating taxid2name table

This table is created by manually loading data.

  1. Download the SQL dump from ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz :
    wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
  2. Create a database table:
    create table taxid2name(
      taxid int default -1,
      name varchar(256) default 'NA',
      unq_name varchar(256) default 'NA',
      cla_name varchar(256) default 'NA'
    ) ENGINE=InnoDB DEFAULT CHARSET=latin1;
  3. Unpack the data into a directory:
    mkdir taxdump
    mv taxdump.tar.gz taxdump
    tar zxf taxdump.tar.gz -C taxdump
  4. Import the name data:
    load data infile 'taxdump/names.dmp' into table taxid2name FIELDS TERMINATED BY '\|';
    (Note that if --local-infile is specified when logging into MySQL, client-side and relative paths can be used with the load data local infile command.)
  5. Post-processing:
    update taxid2name set name=(replace(name,'\t',''));
    update taxid2name set unq_name=(replace(unq_name,'\t',''));
    update taxid2name set cla_name=(replace(cla_name,'\t',''));
    alter table taxid2name add index taxid(taxid);
    alter table taxid2name add index name(name);
    alter table taxid2name add index unq_name(unq_name);
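The tab-stripping updates in step 5 are needed because names.dmp delimits its fields with tab–pipe–tab sequences (the NCBI taxdump convention), so loading on '|' alone leaves stray tabs in each column. A sketch of the format and the clean-up:

```shell
# A names.dmp row is delimited by "<TAB>|<TAB>".
printf '9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n' > /tmp/names.dmp

# Splitting on '|' leaves tabs around each field; removing them recovers
# the clean name, as the SQL updates above do with replace(name,'\t','').
cut -d'|' -f2 /tmp/names.dmp | tr -d '\t'
```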

Building iRef_PSI_XML2RDBMS

Note

iRef_PSI_XML2RDBMS replaces StaxPSIXML as the PSI-MI XML parsing component in iRefIndex from release 8.0.

  • Before beginning, test the following: Make sure the validator was run on all the source files. Check the validator's log to locate any anomalies. Check whether the files are in the place where the config file will search. Please note that if a file was found to be invalid, it will have been moved to a sub-folder called "inValid".

The iRef_PSI_XML2RDBMS.jar file needs to be obtained or built.

  1. Get the program's source code from this location:

    https://hfaistos.uio.no/cgi-bin/viewvc.cgi/bioscape/bioscape/modules/interaction/Sabry/iRef_PSI_XML2RDBMS/

    Using CVS with the appropriate CVSROOT setting, run the following command:

    cvs co bioscape/bioscape/modules/interaction/Sabry/iRef_PSI_XML2RDBMS

    The CVSROOT environment variable should be set to the following for this to work:

    export CVSROOT=:ext:<username>@hfaistos.uio.no:/mn/hfaistos/storage/cvsroot
    (The <username> should be replaced with your actual username.)
  2. Obtain the program's dependencies. This program uses the MySQL Connector/J library which can be found at the following location:

    http://www.mysql.com/products/connector/j/

    You may choose to refer to the download from the BioPSI_Suplimenter build process.
  3. Extract the dependencies:
    tar zxf mysql-connector-java-5.1.6.tar.gz

    This will produce a directory called mysql-connector-java-5.1.6 containing a file called mysql-connector-java-5.1.6-bin.jar which should be placed in the lib directory in the iRef_PSI_XML2RDBMS directory...

    mkdir lib
    cp mysql-connector-java-5.1.6/mysql-connector-java-5.1.6-bin.jar lib/

    You may instead choose to copy the library from the BioPSI_Suplimenter/lib directory:

    mkdir lib
    cp ../../BioPSI_Suplimenter/lib/mysql-connector-java-5.1.6-bin.jar lib/
    The filenames in the above examples will need adjusting, depending on the exact version of the library downloaded.
  4. Compile the source code. It might be necessary to edit the build.xml file, changing the particular filename for the .jar file whose name begins with mysql-connector-java, since this name will change between versions of that library. Compile and create the .jar file as follows:
    ant jar

Running iRef_PSI_XML2RDBMS

The software must first be configured using files provided in the config directory. This can be done using the make_config.py script provided:

python make_config.py <data_directory> <date_prefix> <log_directory>

For example:

python make_config.py /home/irefindex/data 2010-02-08 /home/irefindex/logs

This will produce a new version of the configFileList.txt file which should be appropriately configured.

Manual Configuration

In configFileList.txt, the CONFIG entries must be edited in order to refer to the locations of each of the individual configuration files. To remove a data source, add a leading # character to the appropriate line.
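The exact layout of configFileList.txt depends on the distribution, but the idea is a list of CONFIG entries, one per source, with unwanted sources commented out; a hypothetical fragment (the paths and the CONFIG= form are illustrative assumptions, not the verified syntax):

```
CONFIG=/home/irefindex/build/bioscape/bioscape/modules/interaction/Sabry/iRef_PSI_XML2RDBMS/config/config_25_BIOGRID.xml
CONFIG=/home/irefindex/build/bioscape/bioscape/modules/interaction/Sabry/iRef_PSI_XML2RDBMS/config/config_25_IntAct.xml
# OPHID excluded from this build:
#CONFIG=/home/irefindex/build/bioscape/bioscape/modules/interaction/Sabry/iRef_PSI_XML2RDBMS/config/config_25_OPHID.xml
```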

Each supplied configuration file has a name of the form config_X_SOURCE.xml, where X is an arbitrary identifier which helps to distinguish between different configuration versions, and where SOURCE is a specific data source name such as one of the following:

  • BIOGRID
  • DIP
  • HPRD
  • I2D
  • InnateDB
  • IntAct
  • MatrixDB
  • MINT
  • MIPS
  • MIPS_MPACT
  • OPHID

In each file (specified in configFileList.txt), a number of elements need to be defined within the locations element:

logger
The location of the log file to be created. If a log file already exists, the new information will be appended. In the event that the program throws more than 50000 exceptions, the errors will be continued in new files, distinguished by a numeric identifier at the end of each filename.
data
This is the location of the PSI-MI XML files to be parsed. For example:
/home/irefindex/data/BioGRID/2010-02-08/textfiles/

This directory will also hold the lastUpdate.obj file: this file records which files have been successfully parsed, allowing parsing to resume from the last successful point in the event of a disruption. It also prevents files from accidentally being parsed more than once. If all files have to be parsed again (in the case of a new build, for example), lastUpdate.obj has to be deleted. If only certain files are to be parsed again, use the Exemptions option instead.

Exemptions
This gives the location of PSI-MI files to be re-parsed, thus overriding the lastUpdate.obj control. This may be needed if information from certain files has to be parsed again, but the directory should be created and left empty initially.
mapper
This is the most important part of the parsing configuration. The mapper file defines where each piece of data is obtained in the XML file and which field of which database table it is destined for. The location of this file is typically within the source code distribution, and an absolute path should be specified. For example:
/home/irefindex/build/bioscape/bioscape/modules/interaction/Sabry/iRef_PSI_XML2RDBMS/mapper/Map25_INTACT_MINT_BIOGRID.xml

More information about the mapper is available in Readme_mapper.txt within the StaxPSIXML directory.

See also the documentation on the topic of adding sources to iRefIndex for details of writing new mapper configuration files.

For the MIPS and MIPS_MPACT sources, the following "specs" element needs to be changed:

filetype
A specific file should be specified. For MIPS, this should be something like the following:
mppi.xml

For MIPS_MPACT, the file should be something like this:

mpact-complete.psi25.xml
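As a sketch only (the element structure shown here is inferred from the description above, not copied from an actual configuration file), the relevant part of the MIPS configuration might resemble:

```xml
<specs>
  <filetype>mppi.xml</filetype>
</specs>
```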

Running the Program

Note

If running the program again using an existing set of data files, be sure to remove all lastUpdate.obj files residing within the various source data directories, or the program will happily ignore the data.
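One way to do this is with find. The sketch below rehearses the command in a scratch directory under /tmp; in practice the same find invocation would be pointed at the real data directory (for example /home/irefindex/data):

```shell
# Recreate a miniature data layout containing a stale lastUpdate.obj.
mkdir -p /tmp/irefdemo/BioGRID/2010-02-08/textfiles
touch /tmp/irefdemo/BioGRID/2010-02-08/textfiles/lastUpdate.obj

# Delete every lastUpdate.obj beneath the directory, then confirm none remain.
find /tmp/irefdemo -name lastUpdate.obj -delete
find /tmp/irefdemo -name lastUpdate.obj | wc -l    # prints 0
```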

When all configuration files and mapper files are ready, run the program:

ant run

This will display a graphical interface requesting information about the location of the configuration file configFileList.txt and a suitable log directory, as well as database credentials.

Validating the Results from iRef_PSI_XML2RDBMS

It is possible to validate the results of the parsing process by issuing the following queries against the database being prepared:

select name, count(uid) from int_source inner join int_db on int_source.source = int_db.id group by name;
select name, count(uid) from int_object inner join int_db on int_object.source = int_db.id group by name;
select name, count(distinct sourceid) from int_source2object inner join int_db on int_source2object.source = int_db.id group by name;
select name, count(distinct objectid) from int_source2object inner join int_db on int_source2object.source = int_db.id group by name;

The first two queries tabulate the interactions and interactors recorded for each data source; the last two count the distinct interactions and interactors actually referenced in the mapping table. The values can then be compared to the totals written to the files produced by the SaxValidator program; these files can be found in the data directory hierarchy in a location resembling the following:

/home/irefindex/data/2010-02-08

Each validated data source should have a pair of files as illustrated by the following directory listing extract:

corum_interactions.txt
corum_interactors.txt
dip_interactions.txt
dip_interactors.txt
grid_interactions.txt
grid_interactors.txt

A convenient way of getting similar tabular summaries as those returned by the above queries is to run the following commands:

grep -e "total.*INTERACTION" /home/irefindex/data/2010-02-08/*.txt
grep -e "total.*INTERACTOR" /home/irefindex/data/2010-02-08/*.txt
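These totals can also be collected programmatically. The sketch below assumes lines of the form "total number of INTERACTION elements: N" (the exact wording in the SaxValidator output files may differ, in which case the regular expression would need adjusting):

```python
import re

# Hypothetical extract from a SaxValidator output file.
sample = """
total number of INTERACTION elements: 41071
total number of INTERACTOR elements: 9423
"""

def totals(text):
    # Map each element name to its reported count.
    result = {}
    for name, count in re.findall(
            r"total[^\n]*?(INTERACTION|INTERACTOR)[^\d\n]*(\d+)", text):
        result[name] = int(count)
    return result

print(totals(sample))  # -> {'INTERACTION': 41071, 'INTERACTOR': 9423}
```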
Note

The above approaches do not seem to work with BIND Translation since it provides experimental interactor details in addition to participant interactor details. However, a simple program can be written to perform a slightly more complicated textual search:

#!/usr/bin/env python3
import re
from glob import glob

# Sum pattern matches across all BIND Translation data files.
pattern = re.compile(r"<participant.*?>\s*<interactor id", re.MULTILINE)
total = 0
for filename in glob("/home/irefindex/data/BIND_Translation/2011-06-11/*.xml"):
    with open(filename) as f:
        total += len(pattern.findall(f.read()))
print(total)

The pattern ensures that only interactor elements within participant elements are counted.

It is especially important when testing new data sources to see whether undefined values (represented by -8) appear in the results:

select name, count(*) from int_source inner join int_db on int_source.source = int_db.id where uid = -8 group by name;
select name, count(*) from int_object inner join int_db on int_object.source = int_db.id where uid = -8 group by name;
select name, count(*) from int_source2object inner join int_db on int_source2object.source = int_db.id where sourceid = -8 group by name;
select name, count(*) from int_source2object inner join int_db on int_source2object.source = int_db.id where objectid = -8 group by name;
select name, refno, type, count(*) from int_category inner join int_xref on int_category.refno = int_xref.category inner join int_db on int_xref.dbid = int_db.id where uid = -8 group by int_db.name, refno, type;
select refno, type, count(*) from int_category inner join int_name on int_category.refno = int_name.category where uid = -8 group by refno, type;

Potential Problems

Any -8 values indicate that information was not correctly captured for a particular field. The most severe case is -8 in both the sourceid and objectid columns of the int_source2object table: such records indicate the presence of interactions with no indication of what is interacting or where the interaction information originated. The occurrence of -8 is typically the result of the mapper component of iRef_PSI_XML2RDBMS failing to interpret a data file appropriately.

The statistics for int_source, int_object and int_source2object may differ: fewer interactions may be recorded in the mapping table than are present in the "source" table, or fewer interactors may be involved in interactions than are present in the "object" table. This may also be due to a failure of the mapper component to associate interactors with interactions, but there may be legitimate reasons: data files may repeat definitions and identifiers, or they may contain forms of data, such as complex information, which the mapper does not yet support.

Note

Make sure there is 100% agreement between the element count and what is loaded into the database. If there is a difference, even a difference of one, the reason should be located before proceeding. After parsing, it is also important to make sure there is no overlap between the UIDs used in the different tables. The following queries should each return the empty set:

select * from int_object where int_object.uid in (select uid from int_source);
select * from int_object where int_object.uid in (select uid from int_experiment);
select * from int_source where int_source.uid in (select uid from int_experiment);
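The overlap check can be rehearsed in miniature with Python's standard sqlite3 module standing in for the actual build database (the table contents here are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Miniature stand-ins for the build tables, using disjoint UID ranges.
cur.execute("create table int_source (uid integer)")
cur.execute("create table int_object (uid integer)")
cur.execute("create table int_experiment (uid integer)")
cur.executemany("insert into int_source values (?)", [(1,), (2,)])
cur.executemany("insert into int_object values (?)", [(3,), (4,)])
cur.executemany("insert into int_experiment values (?)", [(5,)])

# Each overlap query should return no rows when the UID ranges are disjoint.
for a, b in [("int_object", "int_source"),
             ("int_object", "int_experiment"),
             ("int_source", "int_experiment")]:
    rows = cur.execute(
        "select * from %s where uid in (select uid from %s)" % (a, b)
    ).fetchall()
    print(a, b, rows)  # rows is [] for every pair
```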

Running BioPSI_Suplimenter (continued)

New for iRefIndex 7.0: when running the "ROG_ass + RIG_fill+ make Cy" process, there is a slight difference between the free and proprietary releases due to the ROG consistency issue. A checkbox in the GUI (in the red area) is selected by default and has to be deselected when making the free version.

Workaround for iRefIndex 7.0 and onwards: the rig2rigid and risg2risgid tables need to be copied from a previous build as follows:

create table rig2rigid as (select * from iRefIndex_full_beta7.rig2rigid);
create table risg2risgid as (select * from iRefIndex_full_beta7.risg2risgid);
alter table rig2rigid add index rigid(rigid);
alter table rig2rigid add index rig(rig);
alter table risg2risgid add index risgid(risgid);
alter table risg2risgid add index id(id);

ROG_ass + RIG_fill+ make Cy

Note: when building the free version of the release, the UniProt_table, gene2refseq table, SEGUID table and Pre_build Eutils tables from the full version's database should be specified. (When building the full version, data has already been cloned from a previous release, and the tables in the same database should be specified.)

A table prepared from Web service data should be given for the Pre_build Eutils field:

Pre_build Eutils
The name of the Web service data table; for example:
irefindex.eutils

It is possible to specify a table residing in another database. In that case, the table will be copied into the target database if Eutils information cannot be downloaded or if the user chooses not to download it.

Upon initiating this activity, a dialogue will be displayed showing the following message:

This will reset all previous ROGFILL information. Do you want to continue ?

Selecting Yes will cause the activity to proceed. A dialogue will then appear:

Do you want to recreate Eutils (without using an existing version) ?

Selecting Yes will cause Eutils information to be downloaded, whereas No will take the table specified above into use in order to provide such information to the activity.

Note

Due to restrictions around Eutils availability and the possibility that the program will need to access many Eutils records, the program will significantly reduce downloading activity outside weekend periods. Thus it is highly recommended that this activity be undertaken during a weekend.

Canonical_Mapper

This process can be selected and run using the usual options.

Make Cy tables + PMID scorer

This process can be selected and run using the usual options.

SEGUID Manipulator

Before this process can be run, a collection of SQL commands must be run manually against the database. These commands reside in files in the SQL_commands directory alongside BioPSI_Suplimenter in CVS:

  1. The commands in make_export_table.sql should be executed in the build database.
  2. Then, the SEGUID Manipulator should be run.
  3. Then, the commands in make_export_table_products.sql should be executed in the build database.

Make Cy tables (canonical)

Before this process can be run, a collection of SQL commands must be run manually against the database. These commands reside in files in the SQL_commands directory alongside BioPSI_Suplimenter in CVS:

  1. The commands in make_canonical_tables.sql should be executed in the build database.
  2. Then, this process itself should be run.
  3. Then, the commands in make_canonical_table_products.sql should be executed in the build database.
