Difference between revisions of "iRefIndex Build Process"

From irefindex
(→‎Fill Bind info: Added notes about the "Base loc" and "Date info" fields.)
m (Changed dates and date formats.)

Revision as of 18:11, 18 February 2010

Downloading the Source Data

Before downloading the source data, a location must be chosen for the downloaded files. For example:

/home/irefindex/data

Some data sources need special links to be obtained from their administrators via e-mail, and in general there is a distinction between free and proprietary data sources, described as follows:

Free
BIND, BioGrid, Gene2Refseq (NCBI), I2D (from iRefIndex 7.0), IntAct, MINT, MMDB/PDB, MPPI, OPHID (before iRefIndex 7.0), RefSeq, UniProt
Proprietary
CORUM, DIP, HPRD, MPact

The FTPtransfer program will download data from the following sources:

  • Gene2Refseq
  • IntAct
  • MINT
  • MMDB
  • PDB
  • RefSeq
  • UniProt

Manual Downloads

More information can be found at the following location: Sources_iRefIndex_3.0

For each manual download, a subdirectory hierarchy must be created in the main data directory using a command of the following form:

mkdir -p <path-to-data>/<source>/<date>/

Here, <path-to-data> should be replaced by the location of the data directory, <source> should be replaced by the name of the source, and <date> should be replaced by the current date.

For example, for BIND this directory might be created as follows:

mkdir -p /home/irefindex/data/BIND/2010-02-08/
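
Since the same command is repeated for every manually downloaded source, the directories can also be created in one pass. The following is a minimal sketch and not part of the iRefIndex tools: the function name make_source_dirs is hypothetical, and the source names are simply the directory names used in the examples on this page.

```shell
# Hypothetical helper: create <path-to-data>/<source>/<date>/ for each source.
# Usage: make_source_dirs <path-to-data> <date> <source>...
make_source_dirs() {
    data_dir=$1
    date_dir=$2
    shift 2
    for src in "$@"; do
        mkdir -p "$data_dir/$src/$date_dir/" || return 1
    done
}

# Example (commented out so that nothing is created by accident):
# make_source_dirs /home/irefindex/data "$(date +%Y-%m-%d)" BIND BioGrid CORUM DIP HPRD I2D MIPS UniProt
```

Using one date value for all sources keeps the date directory name identical across the hierarchy.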

BIND

The FTP site was previously available at the following location:

ftp://ftp.bind.ca/pub/BIND/data/bindflatfiles/bindindex/

An archived copy of the data can be found at the following internal location:

/biotek/dias/donaldson3/Sabry/DATA_2006/BINDftp/

In the main downloaded data directory, create a subdirectory hierarchy as noted above. For example:

mkdir -p /home/irefindex/data/BIND/2010-02-08/

Copy the following files into the newly created data directory:

20060525.complex2refs.txt
20060525.complex2subunits.txt
20060525.ints.txt
20060525.labels.txt
20060525.refs.txt

BIND Translation

No download location is currently provided for BIND Translation, and this information is provided for reference only.

In the main downloaded data directory, create a subdirectory hierarchy as noted above. For example:

mkdir -p /home/irefindex/data/BIND_Translation/2010-02-08/

Copy the following files into the newly created data directory:

SelectedSpecies_1.zip
SelectedSpecies_2.zip

In the data directory, uncompress the files. For example:

unzip SelectedSpecies_1.zip
unzip SelectedSpecies_2.zip

BioGrid

The location of BioGrid downloads is as follows:

http://www.thebiogrid.org/downloads.php

In the main downloaded data directory, create a subdirectory hierarchy as noted above. For example:

mkdir -p /home/irefindex/data/BioGrid/2010-02-08/

Select the BIOGRID-ORGANISM-XXXXX.psi25.zip file and download/copy it to the newly created data directory for BioGrid.

In the data directory for BioGrid, uncompress the downloaded file. For example:

cd /home/irefindex/data/BioGrid/2010-02-08/
unzip BIOGRID-ORGANISM-2.0.49.psi25.zip

CORUM

The location of CORUM downloads is as follows:

http://mips.gsf.de/genre/proj/corum/index.html

The specific download file is this one:

http://mips.gsf.de/genre/export/sites/default/corum/allComplexes.psimi.zip

In the main downloaded data directory, create a subdirectory hierarchy as noted above. For example:

mkdir -p /home/irefindex/data/CORUM/2010-02-08/

Copy/download the file referenced above and uncompress it in the data directory for CORUM. For example:

cd /home/irefindex/data/CORUM/2010-02-08/
unzip allComplexes.psimi.zip

Important Note

The CORUM data must be adjusted before it will work with the StaxPSIXML software. See the Running StaxPSIXML section for details.

DIP

Access to data from DIP is performed via the following location:

http://dip.doe-mbi.ucla.edu/dip/Login.cgi?

You must register, agree to the terms of use, and obtain a user account.

Access credentials for internal users are available from Sabry.

In the main downloaded data directory, create a subdirectory hierarchy as noted above. For example:

mkdir -p /home/irefindex/data/DIP/2010-02-08/

Select the FULL - complete DIP data set from the Files page:

http://dip.doe-mbi.ucla.edu/dip/Download.cgi?SM=3

Download the latest PSI-MI 2.5 file (dip<date>.mif25) to the newly created data directory. If a compressed version of the file was chosen, uncompress the file using the gunzip tool. For example:

cd /home/irefindex/data/DIP/2010-02-08/
gunzip dip20080708.mif25.gz

HPRD

The location of HPRD downloads is as follows:

http://www.hprd.org/download/

In the main downloaded data directory, create a subdirectory hierarchy as noted above. For example:

mkdir -p /home/irefindex/data/HPRD/2010-02-08/

Download the PSI-MI single file (HPRD_SINGLE_PSIMI_<date>.xml.tar.gz) to the newly created data directory.

Note: unfortunately, registration is required for each and every download.

Uncompress the downloaded file. For example:

cd /home/irefindex/data/HPRD/2010-02-08/
tar zxf HPRD_SINGLE_PSIMI_090107.xml.tar.gz

I2D

The location of I2D downloads is as follows:

http://ophid.utoronto.ca/ophidv2.201/downloads.jsp

In the main downloaded data directory, create a subdirectory hierarchy as noted above. For example:

mkdir -p /home/irefindex/data/I2D/2010-02-08/

For the Download Format in the download request form, specify PSI-MI 2.5 XML. Unfortunately, each Target Organism must be specified in turn when submitting the form: there is no ALL option.

Uncompress each downloaded file. For example:

cd /home/irefindex/data/I2D/2010-02-08/
unzip i2d.HUMAN.psi25.zip

OPHID

From iRefIndex 7.0, OPHID is no longer used.

OPHID is no longer available, so you have to use the local copy of the data:

/biotek/dias/donaldson3/Sabry/iRefIndex_Backup/BckUp15SEP2008/OPHID/2008MAR16

In the main downloaded data directory, create a subdirectory hierarchy as noted above. For example:

mkdir -p /home/irefindex/data/OPHID/2010-02-08/

Copy the file ophid1153236640123.xml to the newly created data directory.

MIPS

In the main downloaded data directory, create a subdirectory hierarchy as noted above. For example:

mkdir -p /home/irefindex/data/MIPS/2010-02-08/

For MPPI, download the following file:

http://mips.gsf.de/proj/ppi/data/mppi.gz

For MPACT, download the following file:

ftp://ftpmips.gsf.de/yeast/PPI/mpact-complete.psi25.xml.gz

Uncompress the downloaded files. For example:

cd /home/irefindex/data/MIPS/2010-02-08/
gunzip mpact-complete.psi25.xml.gz
gunzip mppi.gz

UniProt

In the main downloaded data directory, create a subdirectory hierarchy as noted above. For example:

mkdir -p /home/irefindex/data/UniProt/2010-02-08/

Visit the following site:

http://www.uniprot.org/downloads

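Download the UniProtKB/Swiss-Prot and UniProtKB/TrEMBL files in text format, along with the isoform sequences file (uniprot_sprot_varsplic.fasta.gz) that is uncompressed below: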

Or from the EBI UK mirror:

These files should be moved into the newly created data directory and uncompressed. For example:

cd /home/irefindex/data/UniProt/2010-02-08/
gunzip uniprot_sprot.dat.gz
gunzip uniprot_trembl.dat.gz
gunzip uniprot_sprot_varsplic.fasta.gz
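
The download sections above use three different tools depending on the file type. As a convenience, the choice of tool can be made from the filename suffix. This is a minimal sketch under the assumption that the suffixes are reliable; the function name uncompress_download is hypothetical.

```shell
# Hypothetical helper: uncompress one downloaded file in its own directory,
# choosing the tool from the filename suffix as in the examples on this page.
# Usage: uncompress_download <file>
uncompress_download() {
    file=$1
    dir=$(dirname "$file")
    name=$(basename "$file")
    case "$name" in
        # The .tar.gz pattern must come first, since *.gz would also match it.
        *.tar.gz) (cd "$dir" && tar zxf "$name") ;;   # e.g. HPRD
        *.gz)     (cd "$dir" && gunzip "$name") ;;    # e.g. DIP, MIPS, UniProt
        *.zip)    (cd "$dir" && unzip "$name") ;;     # e.g. BioGrid, CORUM, I2D
        *)        echo "unrecognised file type: $name" >&2; return 1 ;;
    esac
}

# Example (commented out):
# uncompress_download /home/irefindex/data/UniProt/2010-02-08/uniprot_sprot.dat.gz
```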

Building FTPtransfer

The FTPtransfer.jar file needs to be obtained or built.

  1. Get the program's source code from this location:

    Using CVS with the appropriate CVSROOT setting, run the following command:

    cvs co bioscape/bioscape/modules/interaction/Sabry/FTPtransfer

    The CVSROOT environment variable should be set to the following for this to work:

    export CVSROOT=:ext:<username>@hfaistos.uio.no:/mn/hfaistos/storage/cvsroot
    (The <username> should be replaced with your actual username.)
  2. Obtain the program's dependencies. This program uses the Apache commons-net package, and this must be available during compilation. This library could be retrieved from the Apache site...

    ...or from a mirror such as the following:

  3. Extract the dependencies:
    tar zxf commons-net-1.4.1.tar.gz

    This will produce a directory called commons-net-1.4.1 containing a file called commons-net-1.4.1.jar which should be placed in the lib directory in the FTPtransfer directory...

    mkdir lib
    cp commons-net-1.4.1/commons-net-1.4.1.jar lib/

    Alternatively, the external libraries can also be found in the following location:

    /biotek/dias/donaldson3/iRefIndex/External_libraries
  4. Compile the source code. In order to build the software on a computer which does not have the NetBeans IDE installed, copy the generic build file into the FTPtransfer directory:
    cp Build_files/build.xml .

    Compile and create the .jar file as follows:

    ant jar

Running FTPtransfer

To run the program, invoke the .jar file as follows:

java -Xms256m -Xmx256m -jar build/jar/FTPtransfer.jar log datadir

The specified log argument can be replaced with a suitable location for the program's execution log, whereas the datadir argument should be replaced with a suitable location for downloaded data (such as /home/irefindex/data).

Building SHA

The SHA.jar file needs to be obtained or built.

  1. Get the program's source code from this location:

    Using CVS with the appropriate CVSROOT setting, run the following command:

    cvs co bioscape/bioscape/modules/interaction/Sabry/SHA

    The CVSROOT environment variable should be set to the following for this to work:

    export CVSROOT=:ext:<username>@hfaistos.uio.no:/mn/hfaistos/storage/cvsroot
    (The <username> should be replaced with your actual username.)
  2. Compile the source code. Compile and create the .jar file as follows:
    ant jar

    The SHA.jar file will be created in the dist directory.

Building SaxValidator

The SaxValidator.jar file needs to be obtained or built.

  1. Get the program's source code from this location:

    Using CVS with the appropriate CVSROOT setting, run the following command:

    cvs co Parser/SaxValidator

    The CVSROOT environment variable should be set to the following for this to work:

    export CVSROOT=:ext:<username>@hfaistos.uio.no:/mn/hfaistos/storage/cvsroot
    (The <username> should be replaced with your actual username.)
  2. Compile and create the .jar file as follows:
    ant jar

Running SaxValidator

SaxValidator is the program used for validation and integrity checks. When its name was chosen it was merely a SAX-based validator, but more functionality has since been included:

  1. Validate XML files against a schema.
  2. Count elements independently of any XML parser (the number of </interaction> and </interactor> tags in each file). This gives an indication of what to expect at the end of the parsing.
  3. Count the number of lines in the BIND text files.
  4. Remove files containing negative interactions.
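
The element counting in item 2 can be approximated from the command line, which may be useful for a quick cross-check. The following is a minimal sketch and not SaxValidator's own implementation; the function name count_closing_tags is hypothetical.

```shell
# Count closing tags without an XML parser, similar in spirit to item 2.
# Usage: count_closing_tags <tag> <file>
count_closing_tags() {
    tag=$1
    file=$2
    # grep -o prints each match on its own line, so several tags on one
    # line are all counted; tr strips the padding some wc versions add.
    grep -o "</$tag>" "$file" | wc -l | tr -d ' '
}

# Example (commented out; the file name is a placeholder):
# count_closing_tags interaction some_file.xml
# count_closing_tags interactor some_file.xml
```

Because the closing ">" is part of the pattern, a tag such as </interactionList> is not counted as </interaction>.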

Run the program as follows:

java -jar -Xms256m -Xmx256m dist/SaxValidator.jar <data directory> <date extension> <validate true/false> <count elements true/false>

For example:

java -jar -Xms256m -Xmx256m dist/SaxValidator.jar /home/irefindex/data /2010-02-08/ true true

Be sure to include the leading and trailing / characters around the date information.
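
If the date is kept in a variable without the slashes, they can be added before the value is passed to the program. This is a small sketch; the function name date_argument is hypothetical.

```shell
# Hypothetical helper: add the leading and trailing / characters to a date
# directory name if they are missing.
# Usage: date_argument <date>
date_argument() {
    d=$1
    d=${d#/}    # drop one leading slash, if present
    d=${d%/}    # drop one trailing slash, if present
    echo "/$d/"
}

# Example (commented out):
# java -jar -Xms256m -Xmx256m dist/SaxValidator.jar /home/irefindex/data "$(date_argument 2010-02-08)" true true
```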

Building BioPSI_Suplimenter

The BioPSI_Suplimenter.jar file needs to be obtained or built.

  1. Get the program's source code from this location:

    Using CVS with the appropriate CVSROOT setting, run the following command:

    cvs co bioscape/bioscape/modules/interaction/Sabry/BioPSI_Suplimenter

    The CVSROOT environment variable should be set to the following for this to work:

    export CVSROOT=:ext:<username>@hfaistos.uio.no:/mn/hfaistos/storage/cvsroot
    (The <username> should be replaced with your actual username.)
  2. Obtain the program's dependencies. This program uses the SHA.jar file created above as well as the MySQL Connector/J library which can be found at the following location:
  3. Extract the dependencies. For example:
    tar zxf mysql-connector-java-5.1.6.tar.gz

    This will produce a directory called mysql-connector-java-5.1.6 containing a file called mysql-connector-java-5.1.6-bin.jar which should be placed in the lib directory in the BioPSI_Suplimenter directory...

    mkdir lib
    cp mysql-connector-java-5.1.6/mysql-connector-java-5.1.6-bin.jar lib/

    The filenames in the above example will need adjusting, depending on the exact version of the library downloaded.

    The SHA.jar file needs copying from its build location:

    cp ../SHA/dist/SHA.jar lib/

    Alternatively, the external libraries can also be found in the following location:

    /biotek/dias/donaldson3/iRefIndex/External_libraries
  4. Compile the source code. In order to build the software on a computer which does not have the NetBeans IDE installed, copy the generic build file into the BioPSI_Suplimenter directory:
    cp Build_files/build.xml .

    It might be necessary to edit the build.xml file, changing the particular filename for the .jar file whose name begins with mysql-connector-java, since this name will change between versions of that library.

    Compile and create the .jar file as follows:

    ant jar

Creating the Database

Enter MySQL using a command like the following:

mysql -h <host> -u <admin> -p -A

The <admin> is the name of the user with administrative privileges. For example:

mysql -h myhost -u admin -p -A

Then create a database and user using commands of the following form:

create database <database>;
create user '<username>'@'%' identified by '<password>';
grant all privileges on <database>.* to '<username>'@'%';

For example, with <database> given as irefindex, <username> given as irefindex, and a substitution for <password>:

create database irefindex;
create user 'irefindex'@'%' identified by 'mysecretpassword';
grant all privileges on irefindex.* to 'irefindex'@'%';
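
When several databases are needed (for example, one per release), the statements can be generated rather than typed. The following is a minimal sketch; the function name create_database_sql is hypothetical, and the values are inserted without any escaping.

```shell
# Hypothetical helper: print the three statements above for given values.
# Values are inserted as-is, so they must not contain quote characters.
# Usage: create_database_sql <database> <username> <password>
create_database_sql() {
    database=$1
    username=$2
    password=$3
    cat <<EOF
create database $database;
create user '$username'@'%' identified by '$password';
grant all privileges on $database.* to '$username'@'%';
EOF
}

# Example (commented out): review the output before running it in MySQL.
# create_database_sql irefindex irefindex mysecretpassword
```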

Running BioPSI_Suplimenter

Run the program as follows:

java -jar -Xms256m -Xmx768m build/jar/BioPSI_Suplimenter.jar &

In the dialogue box that appears, the following details must always be filled out:

Server
the <host> value specified when creating the database
Database
the <database> value specified when creating the database
User name
the <username> value specified above
Password
the <password> value specified above
Log file
the path to a log file where the program's output shall be written

Make sure that the log file will be written to a directory which already exists. For example:

mkdir /home/irefindex/logs/

The program will need to be executed a number of times for different activities, and these are described in separate sections below. For each one, select the corresponding menu item in the Program field shown in the dialogue.

Create tables

The SQL file field should refer to the Create_iRefIndex.sql file in the SQL directory within BioPSI_Suplimenter, and this should be a full path to the file. For example:

/home/irefindex/build/bioscape/bioscape/modules/interaction/Sabry/BioPSI_Suplimenter/SQL/Create_iRefIndex.sql

Click the OK button to create the tables.

Clone seguid

From release beta 7.0, the ROG is consistent with the previous release. Therefore, the first operation when creating the new SEGUID table is to copy the SEGUID table from the previous release. This is included as an option in BioPSI_Suplimenter. The "seguidannotation" file is no longer parsed; if the SEGUID maintainers provide an updated version of this file, it has to be applied as a separate updating step.

When the "Clone SEGUID" option is selected from the BioPSI_Suplimenter GUI, the "SEGUID table" field is the source seguid table (i.e. for beta7, the seguid table field would be beta6.seguid). The selected target database should not already have a seguid table; if it does, an error will be thrown.

Free and proprietary releases & clone seguid

iRefIndex has two sub-versions for every release: free and proprietary. Therefore, the ROG should be consistent not only with the previous release but also between the free and proprietary versions. So, cloning is always done using the earlier full/proprietary version as the source and the current full/proprietary version as the target. In other words, the proprietary version is made first and then the free version. Once the proprietary version is made, the SEGUID table of the free version is made by cloning the SEGUID table of the current proprietary version (not that of the previous version).

Recreate SEGUID

The File field should refer to the seguidannotation file in the SEGUID subdirectory hierarchy. For example, given the following data directory...

/home/irefindex/data

...an appropriate value for the File field might be this:

/home/irefindex/data/SEGUID/2010-02-08/seguidannotation

For the following fields, indicate the locations of the corresponding files and directories similarly:

Unip_SP_file
uniprot_sprot.dat (from UniProt); for example:
/home/irefindex/data/UniProt/2010-02-08/uniprot_sprot.dat
Unip_Trm_file
uniprot_trembl.dat (from UniProt); for example:
/home/irefindex/data/UniProt/2010-02-08/uniprot_trembl.dat
unip_Isoform_file
uniprot_sprot_varsplic.fasta (from UniProt); for example:
/home/irefindex/data/UniProt/2010-02-08/uniprot_sprot_varsplic.fasta
Unip_Yeast_file
yeast.txt (in the yeast directory); for example:
/home/irefindex/data/yeast/2010-02-08/yeast.txt
Unip_Fly_file
fly.txt (in the fly directory); for example:
/home/irefindex/data/fly/2010-02-08/fly.txt
RefSeq DIR
The specific download directory for RefSeq; for example:
/home/irefindex/data/RefSeq/2010-02-08/
Fasta 4 PDB
pdbaa.fasta (from PDB); for example:
/home/irefindex/data/PDB/2010-02-08/pdbaa.fasta
Tax Table 4 PDB
tax.table (from PDB); for example:
/home/irefindex/data/PDB/2010-02-08/tax.table
Gene info file
gene_info.txt (in the geneinfo directory); for example:
/home/irefindex/data/geneinfo/2010-02-08/gene_info.txt
gene2Refseq
gene2refseq.txt (in the NCBI_Mappings directory); for example:
/home/irefindex/data/NCBI_Mappings/2010-02-08/gene2refseq.txt
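
Before starting this operation, it can be worth checking that the files and directories for the fields listed above are actually present. The following is a minimal sketch, assuming that every source uses the same date directory name; the function name check_seguid_inputs is hypothetical.

```shell
# Hypothetical pre-flight check: report which of the files and directories
# listed above are missing under <base>/<source>/<date>/.
# Usage: check_seguid_inputs <base> <date>
check_seguid_inputs() {
    base=$1
    date_dir=$2
    missing=0
    for rel in \
        UniProt/uniprot_sprot.dat \
        UniProt/uniprot_trembl.dat \
        UniProt/uniprot_sprot_varsplic.fasta \
        yeast/yeast.txt \
        fly/fly.txt \
        RefSeq/ \
        PDB/pdbaa.fasta \
        PDB/tax.table \
        geneinfo/gene_info.txt \
        NCBI_Mappings/gene2refseq.txt
    do
        src=${rel%%/*}     # source directory, e.g. UniProt
        name=${rel#*/}     # file name; empty for a directory entry
        path="$base/$src/$date_dir/$name"
        if [ ! -e "$path" ]; then
            echo "missing: $path"
            missing=1
        fi
    done
    return $missing
}

# Example (commented out):
# check_seguid_inputs /home/irefindex/data 2010-02-08
```

The function prints one line per missing entry and returns a non-zero status if anything is missing.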

Fill Bind info

The file fields should refer to the appropriate BIND files in the data directory hierarchy. For example, for Bind Ints file:

/home/irefindex/data/BIND/2010-02-08/20060525.ints.txt

To conveniently edit all file fields, you can edit the Base loc field, inserting the top-level data directory. For example:

/home/irefindex/data

In addition, the Date info field can be changed to indicate the common date directory name used by the data sources. For example:

2010-02-08

Be sure to check the final values of the file fields themselves before activating the operation.
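
To help with that final check, the expected values can be listed for comparison. This sketch only illustrates how the example paths on this page are composed from a top-level directory and a date directory name; it assumes the BIND filenames given in the download section, and the function name bind_file_paths is hypothetical.

```shell
# Illustration only: how a top-level directory and a date directory name
# combine into the BIND file locations used in this section.
# Usage: bind_file_paths <base loc> <date info>
bind_file_paths() {
    base_loc=$1
    date_info=$2
    for name in complex2refs complex2subunits ints labels refs; do
        echo "$base_loc/BIND/$date_info/20060525.$name.txt"
    done
}

# Example (commented out):
# bind_file_paths /home/irefindex/data 2010-02-08
```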

Building StaxPSIXML

The StaxPSIXML.jar file needs to be obtained or built.

  1. Get the program's source code from this location:

    Using CVS with the appropriate CVSROOT setting, run the following command:

    cvs co bioscape/bioscape/modules/interaction/Sabry/StAX/StaxPSIXML

    The CVSROOT environment variable should be set to the following for this to work:

    export CVSROOT=:ext:<username>@hfaistos.uio.no:/mn/hfaistos/storage/cvsroot
    (The <username> should be replaced with your actual username.)
  2. Obtain the program's dependencies. This program uses the MySQL Connector/J library which can be found at the following location: You may choose to refer to the download from the BioPSI_Suplimenter build process.
  3. Extract the dependencies:
    tar zxf mysql-connector-java-5.1.6.tar.gz

    This will produce a directory called mysql-connector-java-5.1.6 containing a file called mysql-connector-java-5.1.6-bin.jar which should be placed in the lib directory in the StaxPSIXML directory...

    mkdir lib
    cp mysql-connector-java-5.1.6/mysql-connector-java-5.1.6-bin.jar lib/

    You may instead choose to copy the library from the BioPSI_Suplimenter/lib directory:

    mkdir lib
    cp ../../BioPSI_Suplimenter/lib/mysql-connector-java-5.1.6-bin.jar lib/
    The filenames in the above examples will need adjusting, depending on the exact version of the library downloaded.
  4. Compile the source code. In order to build the software on a computer which does not have the NetBeans IDE installed, copy the generic build file into the StaxPSIXML directory:
    cp Build_files/build.xml .

    It might be necessary to edit the build.xml file, changing the particular filename for the .jar file whose name begins with mysql-connector-java, since this name will change between versions of that library.

    Compile and create the .jar file as follows:

    ant jar

Running StaxPSIXML

The software must first be configured using files provided in the config directory:

configFileList.txt
Edit the #CONFIG entries in order to refer to the locations of each of the individual configuration files. To remove a data source, remove the leading # character from the appropriate line.
config_X_SOURCE.xml
Each supplied configuration file has this form, where X is an arbitrary identifier which helps to distinguish between different configuration versions, and where SOURCE is a specific data source name such as one of the following:
  • BIOGRID
  • DIP
  • HPRD
  • IntAct
  • MINT
  • MIPS
  • MIPS_MPACT
  • OPHID

In each file (specified in configFileList.txt), a number of elements need to be defined within the locations element:

logger
The location of the log file to be created. If a log file already exists, the new information will be appended. In the event that the program throws more than 50000 exceptions, subsequent errors are written to additional files, each distinguished by a numeric identifier at the end of its filename.
data
This is the location of the PSI-XML files to be parsed. For example:
/home/irefindex/data/BioGrid/2010-02-08/textfiles/

This directory will also store the lastUpdate.obj file, which records the files that have been parsed successfully and allows parsing to resume from the last successful point in the event of a disruption. It also prevents files from being parsed more than once by accident. If all files have to be parsed again (in the case of a new build, for example), lastUpdate.obj has to be deleted. If only certain files are to be parsed again, use the Exemptions option instead.

Exemptions
This gives the location of PSI-MI files to be re-parsed, thus overriding the lastUpdate.obj control. This may be needed if information from certain files has to be parsed again, but this directory should be created and left empty initially.
mapper
This is the most important part of the parsing. This file defines where to obtain the data from the XML file and which field of the table in the database the data is destined for. The location of this file is typically within the source code distribution, and an absolute path should be specified. For example:
/home/irefindex/build/bioscape/bioscape/modules/interaction/Sabry/StAX/StaxPSIXML/mapper/Map25_INTACT_MINT_BIOGRID.xml

More information about the mapper is available in Readme_mapper.txt within the StaxPSIXML directory.

See also the documentation on the topic of adding sources to iRefIndex for details of writing new mapper configuration files.
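
Before a new build, the data and Exemptions locations described above can be prepared from the shell. The following is a minimal sketch: the function name reset_parse_state is hypothetical, and the Exemptions path in the example is an assumption (use whatever location is given in your own configuration file).

```shell
#!/bin/sh
# Sketch: prepare one source's directories for a complete re-parse.
# reset_parse_state is a hypothetical helper, not part of StaxPSIXML.
reset_parse_state() {
    data_dir=$1         # the "data" location from the configuration file
    exemptions_dir=$2   # the "Exemptions" location from the configuration file
    # Without lastUpdate.obj, no file is recorded as already parsed.
    rm -f "$data_dir/lastUpdate.obj"
    # The Exemptions directory should exist and be empty initially.
    mkdir -p "$exemptions_dir"
}

# Example with illustrative paths:
# reset_parse_state /home/irefindex/data/BioGrid/2010-02-08/textfiles \
#                   /home/irefindex/data/BioGrid/2010-02-08/exemptions
```

Only lastUpdate.obj is removed; the PSI-XML files in the data directory are left untouched.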

For the MIPS and MIPS_MPACT sources, the following "specs" element needs to be changed:

filetype
A specific file should be specified. For MIPS, this should be something like the following:
mppi.xml

For MIPS_MPACT, the file should be something like this:

mpact-complete.psi25.xml

Important Note

For CORUM, the downloaded data file must be modified before running the StaxPSIXML software.

Using a suitable XSLT tool such as xsltproc, transform the uncompressed downloaded file as follows (substituting the appropriate data directory details for your own environment):

mv /home/irefindex/data/CORUM/2010-02-08/allComplexes.psimi /home/irefindex/data/CORUM/2010-02-08/allComplexes.psimi.orig
xsltproc XSLT/fix_corum.xsl /home/irefindex/data/CORUM/2010-02-08/allComplexes.psimi.orig > /home/irefindex/data/CORUM/2010-02-08/allComplexes.psimi

The fix_corum.xsl file can be found in the XSLT directory within StaxPSIXML.
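
The two commands above can be wrapped so that the data directory is given only once and the original download is not overwritten if the step is repeated. This is a sketch under assumptions: fix_corum is a hypothetical function name, and the only behaviour relied upon is the xsltproc invocation already shown (stylesheet first, then the input file).

```shell
#!/bin/sh
# Sketch: apply fix_corum.xsl to allComplexes.psimi, keeping the download as .orig.
# fix_corum is a hypothetical helper, not part of StaxPSIXML.
fix_corum() {
    corum_dir=$1    # e.g. /home/irefindex/data/CORUM/2010-02-08
    stylesheet=$2   # e.g. XSLT/fix_corum.xsl
    input="$corum_dir/allComplexes.psimi"
    backup="$input.orig"
    # Rename the download only once, so a second run cannot replace the
    # pristine file with an already transformed one.
    if [ ! -e "$backup" ]; then
        mv "$input" "$backup" || return 1
    fi
    # Always transform from the untouched original.
    xsltproc "$stylesheet" "$backup" > "$input"
}

# Example:
# fix_corum /home/irefindex/data/CORUM/2010-02-08 XSLT/fix_corum.xsl
```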


When all configuration files and mapper files are ready, run the program:

java -jar -Xms128m -Xmx512m build/jar/StaxPSIXML.jar -f <config_file_list_file>

GUI version:

java -jar -Xms128m -Xmx512m build/jar/StaxPSIXML.jar

Running BioPSI_Suplimenter (continued)

  *New: when running "ROG_ass + RIG_fill+ make Cy" for the free and proprietary releases, there is a slight difference starting from beta7 due to the ROG constancy issue. A check box in the GUI (in the red area) is selected by default and has to be deselected when making the FREE version.

ROG_ass + RIG_fill+ make Cy

A table prepared from Web service data needs to be given for the Pre_build Eutils field. For example:

Pre_build Eutils
The name of the Web service data table; for example:
irefindex.eutils

One way of ensuring that this table exists and is suitable is to drop any existing table within the database being built, then to copy an existing table from a previously built database:

use <database>;
drop table if exists eutils;
create table eutils like <old_database>.eutils;
insert into eutils select * from <old_database>.eutils;

For example:

use irefindex;
drop table if exists eutils;
create table eutils like old_db.eutils;
insert into eutils select * from old_db.eutils;
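
If this step is repeated for several builds, the statements can be generated from the two database names instead of being edited by hand. The following is a minimal sketch: eutils_copy_sql is a hypothetical helper, and it uses "drop table if exists" so that a missing eutils table in the new database does not cause an error.

```shell
#!/bin/sh
# Sketch: print the SQL that copies the eutils table from a previous build.
# eutils_copy_sql is a hypothetical helper, not part of BioPSI_Suplimenter.
eutils_copy_sql() {
    new_db=$1   # database being built, e.g. irefindex
    old_db=$2   # previously built database, e.g. old_db
    cat <<EOF
use $new_db;
drop table if exists eutils;
create table eutils like $old_db.eutils;
insert into eutils select * from $old_db.eutils;
EOF
}

# Example: review the statements, then pipe them to the mysql client.
# eutils_copy_sql irefindex old_db
# eutils_copy_sql irefindex old_db | mysql -u <username> -p
```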