iRefIndex Build Process

Revision as of 13:17, 13 February 2009 by PaulBoddie (Added more build details.)

Downloading the Source Data

Before downloading the source data, a location must be chosen for the downloaded files. For example:

/biotek/prometheus/storage/Sabry/data

Download the files to create local copies. Automatic download is not possible for every data source: some sources require special links which must be requested from the source administrators by e-mail. The FTPtransfer program will download data from the following sources:

  • RefSeq
  • MMDB
  • PDB
  • gene2refseq
  • IntAct
  • MINT

Manual Downloads

More information can be found at the following location:

ftp://ftp.no.embnet.org/irefindex/data/current/sources.htm

For each manual download, a subdirectory hierarchy must be created in the main data directory using a command of the following form:

mkdir -p <path-to-data>/<source>/<date>/

Here, <path-to-data> should be replaced by the location of the data directory, <source> should be replaced by the name of the source, and <date> should be replaced by the current date.

For example, for BIND this directory might be created as follows:

mkdir -p /biotek/prometheus/storage/Sabry/data/BIND/09_22_2008/
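The same directory can be created without typing the date by hand. The following is a minimal sketch, assuming that the <date> component uses the MM_DD_YYYY form seen in the example above. The DATA_DIR and SOURCE variable names are illustrative, and DATA_DIR falls back to a scratch directory so that the sketch can be tried safely:

```shell
# Assumption: DATA_DIR defaults to a scratch directory for experimentation;
# set it to the real data directory before using this for an actual build.
DATA_DIR="${DATA_DIR:-$(mktemp -d)}"
SOURCE="BIND"

# MM_DD_YYYY, matching the 09_22_2008 example (an inferred convention)
TODAY="$(date +%m_%d_%Y)"

TARGET="$DATA_DIR/$SOURCE/$TODAY"
mkdir -p "$TARGET"
echo "$TARGET"
```

Setting DATA_DIR=/biotek/prometheus/storage/Sabry/data beforehand would place the new directory alongside the other sources.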

BIND

The FTP site was previously available at the following location:

ftp://ftp.bind.ca/pub/BIND/data/bindflatfiles/bindindex/

An archived copy of the data can be found at the following internal location:

/biotek/dias/donaldson3/Sabry/DATA_2006/BINDftp/

In the main downloaded data directory, create a subdirectory hierarchy as noted above.

Copy the following files into the newly created data directory:

20060525.complex2refs.txt
20060525.complex2subunits.txt
20060525.ints.txt
20060525.labels.txt
20060525.refs.txt
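The copy can be scripted so that a missing file is reported rather than silently skipped. This is only a sketch: copy_bind_files is a hypothetical helper, not part of the iRefIndex tools, and both directory arguments are supplied by the caller:

```shell
# Copy the five BIND flat files from an archive directory ($1)
# into a data directory ($2). Returns non-zero if any file is missing,
# but still copies the files that are present.
copy_bind_files() {
    src="$1"
    dest="$2"
    status=0
    for suffix in complex2refs complex2subunits ints labels refs; do
        f="$src/20060525.$suffix.txt"
        if [ -f "$f" ]; then
            cp "$f" "$dest/"
        else
            echo "missing: $f" >&2
            status=1
        fi
    done
    return "$status"
}
```

For example, copy_bind_files /biotek/dias/donaldson3/Sabry/DATA_2006/BINDftp /biotek/prometheus/storage/Sabry/data/BIND/09_22_2008 would copy from the archived location into the directory created earlier.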

BioGrid

The location of BioGrid downloads is as follows:

http://www.thebiogrid.org/downloads.php

In the main downloaded data directory, create a subdirectory hierarchy as noted above.

Select the BIOGRID-ORGANISM-XXXXX.psi25.zip file and download it to the newly created data directory.

In the data directory, uncompress the downloaded file; for example:

unzip BIOGRID-ORGANISM-2.0.44.psi25.zip

CORUM

The location of CORUM downloads is as follows:

http://mips.gsf.de/genre/proj/corum/index.html

The specific download file is this one:

http://mips.gsf.de/genre/export/sites/default/corum/allComplexes.psimi.zip

Uncompress the downloaded file:

unzip allComplexes.psimi.zip

Important Note

The CORUM data needs adjusting to work with the StaxPSIXML software. Using a suitable XSLT tool such as xsltproc, transform the uncompressed downloaded file as follows:

mv allComplexes.psimi allComplexes.psimi.orig
xsltproc fix_corum.xsl allComplexes.psimi.orig > allComplexes.psimi

The fix_corum.xsl file can be found in the XSLT directory within StaxPSIXML.
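Running the two commands above a second time would overwrite the .orig backup with the already transformed file. A guarded version might look like the following sketch; fix_corum is a hypothetical wrapper name, and the paths to the stylesheet and the data file are passed in by the caller:

```shell
# Transform a CORUM file ($2) with a stylesheet ($1), keeping a .orig backup.
fix_corum() {
    xsl="$1"
    data="$2"
    if [ ! -f "$xsl" ]; then
        echo "stylesheet not found: $xsl" >&2
        return 1
    fi
    if [ ! -f "$data" ]; then
        echo "data file not found: $data" >&2
        return 1
    fi
    # A backup from an earlier run suggests the file was transformed already.
    if [ -e "$data.orig" ]; then
        echo "backup already exists: $data.orig" >&2
        return 1
    fi
    mv "$data" "$data.orig"
    if ! xsltproc "$xsl" "$data.orig" > "$data"; then
        # Put the original back if the transformation failed.
        mv "$data.orig" "$data"
        return 1
    fi
}
```

It would be called with the stylesheet first and the data file second, for example fix_corum fix_corum.xsl allComplexes.psimi.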

DIP

DIP data is accessed via the following location:

http://dip.doe-mbi.ucla.edu/dip/Login.cgi?

You have to register, agree to terms, and get a user account.

Access credentials for internal users are available from Sabry.

In the main downloaded data directory, create a subdirectory hierarchy as noted above.

Select the FULL - complete DIP data set from the Files page:

http://dip.doe-mbi.ucla.edu/dip/Download.cgi?SM=3

Download the latest PSI-MI 2.5 file (dip<date>.mif25) to the newly created data directory. If a compressed version of the file was chosen, uncompress the file using the gunzip tool. For example:

gunzip dip20080708.mif25

HPRD

The location of HPRD downloads is as follows:

http://www.hprd.org/download/

In the main downloaded data directory, create a subdirectory hierarchy as noted above.

Download the PSI-MI single file (HPRD_SINGLE_PSIMI_<date>.xml.tar.gz) to the newly created data directory.

Note: registration is required again for every download.

Uncompress the downloaded file. For example:

tar zxf HPRD_SINGLE_PSIMI_090107.xml.tar.gz

OPHID

OPHID is no longer available, so you have to use the local copy of the data:

/biotek/dias/donaldson3/Sabry/iRefIndex_Backup/BckUp15SEP2008/OPHID/2008MAR16

In the main downloaded data directory, create a subdirectory hierarchy as noted above.

Copy the file ophid1153236640123.xml to the newly created data directory.

MIPS

In the main downloaded data directory, create a subdirectory hierarchy as noted above.

For MPPI, download the following file:

http://mips.gsf.de/proj/ppi/data/mppi.gz

For MPACT, download the following file:

ftp://ftpmips.gsf.de/yeast/PPI/mpact-complete.psi25.xml.gz

Uncompress the downloaded files:

gunzip mpact-complete.psi25.xml.gz
gunzip mppi.gz

UniProt

In the main downloaded data directory, create a subdirectory hierarchy as noted above.

Visit the following site:

http://www.uniprot.org/downloads

Download the UniProtKB/Swiss-Prot and UniProtKB/TrEMBL files in text format, together with the isoform sequences file (uniprot_sprot_varsplic.fasta.gz), either from this site or from the EBI UK mirror.

These files should be moved into the newly created data directory and uncompressed:

gunzip uniprot_sprot.dat.gz
gunzip uniprot_trembl.dat.gz
gunzip uniprot_sprot_varsplic.fasta.gz
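An incomplete download usually only shows up when decompression fails. The following sketch checks each archive with gzip -t before decompressing it; gunzip_all is a hypothetical helper name, and the directory argument stands for the newly created data directory:

```shell
# Decompress every .gz file in directory $1 that passes an integrity check.
# Returns non-zero if any file failed the check or could not be decompressed.
gunzip_all() {
    dir="$1"
    failed=0
    for gz in "$dir"/*.gz; do
        # If nothing matched, the pattern itself is left unexpanded; skip it.
        [ -e "$gz" ] || continue
        if gzip -t "$gz"; then
            gunzip "$gz" || failed=1
        else
            echo "not decompressed (failed integrity check): $gz" >&2
            failed=1
        fi
    done
    return "$failed"
}
```

Files which fail the check are left in place so that they can be downloaded again.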

Building FTPtransfer

The FTPtransfer.jar file needs to be obtained or built.

  1. Get the program's source code from this location:

    Using CVS with the appropriate CVSROOT setting, run the following command:

    cvs co bioscape/bioscape/modules/interaction/Sabry/FTPtransfer

    The CVSROOT environment variable should be set to the following for this to work:

    export CVSROOT=:ext:<username>@hfaistos.uio.no:/mn/hfaistos/storage/cvsroot
    (The <username> should be replaced with your actual username.)
  2. Obtain the program's dependencies. This program uses the Apache commons-net package, and this must be available during compilation. This library can be retrieved from the Apache site or from one of its mirrors.

  3. Extract the dependencies:
    tar zxf commons-net-1.4.1.tar.gz

    This will produce a directory called commons-net-1.4.1 containing a file called commons-net-1.4.1.jar which should be placed in the lib directory in the FTPtransfer directory...

      mkdir lib
      cp commons-net-1.4.1/commons-net-1.4.1.jar lib/

    Alternatively, the external libraries can also be found in the following location:

    /biotek/dias/donaldson3/iRefIndex/External_libraries
  4. Customise the output locations. Currently, the output locations are hard-coded, and changing them would involve searching for the following...
    /biotek/prometheus/storage/Sabry/data

    ...and replacing it with the path to the preferred output directory. The source code is found in the following directory within the FTPtransfer directory:

    src/ftptransfer
  5. Compile the source code. In order to build the software on a computer which does not have the NetBeans IDE installed, copy the generic build file into the FTPtransfer directory:
    cp Build_files/build.xml .

    Compile and create the .jar file as follows:

    ant jar
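The search and replace described in step 4 can be done with standard tools. The sketch below is one way of doing it, not the project's own procedure: replace_output_dir is a hypothetical helper name, and it assumes that the new path contains no | or & characters, which would be treated specially by sed here:

```shell
# Replace the hard-coded output directory in every file under $1 with $2.
replace_output_dir() {
    srcdir="$1"
    new="$2"
    old="/biotek/prometheus/storage/Sabry/data"
    # -F treats the old path as a fixed string; -l lists matching files only.
    grep -rlF -- "$old" "$srcdir" | while IFS= read -r f; do
        # Write to a temporary file first, since in-place editing
        # options for sed differ between platforms.
        sed "s|$old|$new|g" "$f" > "$f.tmp" && mv "$f.tmp" "$f"
        echo "updated: $f"
    done
}
```

From within the FTPtransfer directory, this would be invoked as replace_output_dir src/ftptransfer followed by the preferred output directory, before running ant jar.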

Running FTPtransfer

To run the program, invoke the .jar file as follows:

java -Xms256m -Xmx256m -jar build/jar/FTPtransfer.jar log

The specified log argument can be replaced with a suitable location for the program's execution log.

Building SHA

The SHA.jar file needs to be obtained or built.

  1. Get the program's source code from this location:

    Using CVS with the appropriate CVSROOT setting, run the following command:

    cvs co bioscape/bioscape/modules/interaction/Sabry/SHA

    The CVSROOT environment variable should be set to the following for this to work:

    export CVSROOT=:ext:<username>@hfaistos.uio.no:/mn/hfaistos/storage/cvsroot
    (The <username> should be replaced with your actual username.)
  2. Compile the source code. Compile and create the .jar file as follows:
    ant jar

    The SHA.jar file will be created in the dist directory.

Building BioPSI_Suplimenter

The BioPSI_Suplimenter.jar file needs to be obtained or built.

  1. Get the program's source code from this location:

    Using CVS with the appropriate CVSROOT setting, run the following command:

    cvs co bioscape/bioscape/modules/interaction/Sabry/BioPSI_Suplimenter

    The CVSROOT environment variable should be set to the following for this to work:

    export CVSROOT=:ext:<username>@hfaistos.uio.no:/mn/hfaistos/storage/cvsroot
    (The <username> should be replaced with your actual username.)
  2. Obtain the program's dependencies. This program uses the SHA.jar file created above as well as the MySQL Connector/J library, which must be obtained separately (for example, from the MySQL web site).
  3. Extract the dependencies:
    tar zxf mysql-connector-java-5.1.6.tar.gz

    This will produce a directory called mysql-connector-java-5.1.6 containing a file called mysql-connector-java-5.1.6-bin.jar which should be placed in the lib directory in the BioPSI_Suplimenter directory...

      mkdir lib
      cp mysql-connector-java-5.1.6/mysql-connector-java-5.1.6-bin.jar lib/

    The SHA.jar file needs copying from its build location:

    cp ../SHA/dist/SHA.jar lib/

    Alternatively, the external libraries can also be found in the following location:

    /biotek/dias/donaldson3/iRefIndex/External_libraries
  4. Compile the source code. In order to build the software on a computer which does not have the NetBeans IDE installed, copy the generic build file into the BioPSI_Suplimenter directory:
    cp Build_files/build.xml .

    Compile and create the .jar file as follows:

    ant jar

Creating the Database

Enter MySQL using a command like the following:

mysql -h <host> -u <admin> -p -A

Here, <host> is the name of the database server and <admin> is the name of a user with administrative privileges. For example:

mysql -h myhost -u admin -p -A

Then create a database and user using commands of the following form:

create database <database>;
create user '<username>'@'%' identified by '<password>';
grant all privileges on <database>.* to '<username>'@'%';

For example, with <database> given as irefindex, <username> given as irefindex, and a substitution for <password>:

create database irefindex;
create user 'irefindex'@'%' identified by 'mysecretpassword';
grant all privileges on irefindex.* to 'irefindex'@'%';
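To avoid retyping the statements for each new build, they can be generated from three values. This is a sketch only: make_setup_sql is a hypothetical helper, and it assumes that none of the values contains a single quote:

```shell
# Print the three setup statements for database $1, user $2 and password $3.
# Assumption: no argument contains a single quote character.
make_setup_sql() {
    db="$1"
    user="$2"
    pass="$3"
    cat <<EOF
create database $db;
create user '$user'@'%' identified by '$pass';
grant all privileges on $db.* to '$user'@'%';
EOF
}

make_setup_sql irefindex irefindex mysecretpassword
```

The output reproduces the example above and could be saved to a file and fed to the mysql client on standard input instead of being typed interactively.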

Running BioPSI_Suplimenter

Run the program as follows:

java -Xms256m -Xmx256m -jar build/jar/BioPSI_Suplimenter.jar &

In the dialogue box that appears, the following details must always be filled out:

  • Database: the <database> value specified when creating the database
  • User name: the <username> value specified above
  • Password: the <password> value specified above
  • Log file: the path to a log file where the program's output shall be written

Make sure that the log file will be written to a directory which already exists.
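The directory can be created before starting the program. A minimal sketch follows; the LOG_FILE variable and the log file name are illustrative, and the default location is a scratch directory so that the sketch can be tried safely:

```shell
# Assumption: LOG_FILE is a name chosen for this example; the program itself
# only sees whatever path is typed into the "Log file" field.
LOG_FILE="${LOG_FILE:-$(mktemp -d)/logs/biopsi_suplimenter.log}"

# Create the parent directory (and any missing ancestors) if necessary.
mkdir -p "$(dirname "$LOG_FILE")"
echo "$LOG_FILE"
```

The printed path is the value to enter in the Log file field.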

A number of steps or programs will need to be executed, and these are described in separate sections below. For each one, select the corresponding menu item in the "Program" field shown in the dialogue.

Create tables

The SQL file field should refer to the Create_iRefIndex.sql file in the SQL directory within BioPSI_Suplimenter, and this should be a full path to the file. For example:

/home/irefindex/build/bioscape/bioscape/modules/interaction/Sabry/BioPSI_Suplimenter/SQL/Create_iRefIndex.sql

Click the OK button to create the tables.

Recreate SEGUID

The File field should refer to the seguidannotation file in the SEGUID subdirectory hierarchy. For example, given the following data directory...

/biotek/prometheus/storage/Sabry/data

...an appropriate value for the File field might be this:

/biotek/prometheus/storage/Sabry/data/SEGUID/09_22_2008/seguidannotation

For the following fields, indicate the locations of the corresponding files similarly:

  • Unip_SP_file: uniprot_sprot.dat (from UniProt)
  • Unip_Trm_file: uniprot_trembl.dat (from UniProt)
  • unip_Isoform_file: uniprot_sprot_varsplic.fasta (from UniProt)
  • Unip_Yeast_file: yeast.txt (in the yeast directory)
  • Unip_Fly_file: fly.txt (in the fly directory)
  • Fasta 4 PDB: pdbaa.fasta (from PDB)
  • Tax Table 4 PDB: tax.table (from PDB)
  • Gene info file: gene_info.txt (in the geneinfo directory)
  • gene2Refseq: gene2refseq.txt (in the NCBI_Mappings directory)

For RefSeq DIR, a directory needs to be given instead of individual filenames.
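With this many fields, it may help to confirm that every file exists before filling in the dialogue. The following sketch reports each missing path; check_files is a hypothetical helper, and the paths passed to it are whatever values are about to be entered in the fields:

```shell
# Report every argument that is not a readable file.
# Returns zero only if all of the given files are present.
check_files() {
    missing=0
    for f in "$@"; do
        if [ ! -f "$f" ] || [ ! -r "$f" ]; then
            echo "missing or unreadable: $f" >&2
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}
```

For example, check_files /biotek/prometheus/storage/Sabry/data/SEGUID/09_22_2008/seguidannotation would test the File field value shown above; further paths can be appended as additional arguments.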

Fill Bind info

The file fields should refer to the appropriate BIND files in the data directory hierarchy. For example, for Bind Ints file:

/biotek/prometheus/storage/Sabry/data/BIND/09_22_2008/20060525.ints.txt

Building StaxPSIXML

The StaxPSIXML.jar file needs to be obtained or built.

  1. Get the program's source code from this location:

    Using CVS with the appropriate CVSROOT setting, run the following command:

    cvs co bioscape/bioscape/modules/interaction/Sabry/StAX/StaxPSIXML

    The CVSROOT environment variable should be set to the following for this to work:

    export CVSROOT=:ext:<username>@hfaistos.uio.no:/mn/hfaistos/storage/cvsroot
    (The <username> should be replaced with your actual username.)
  2. Obtain the program's dependencies. This program uses the MySQL Connector/J library. You may choose to reuse the download from the BioPSI_Suplimenter build process.
  3. Extract the dependencies:
    tar zxf mysql-connector-java-5.1.6.tar.gz

    This will produce a directory called mysql-connector-java-5.1.6 containing a file called mysql-connector-java-5.1.6-bin.jar which should be placed in the lib directory in the StaxPSIXML directory...

      mkdir lib
      cp mysql-connector-java-5.1.6/mysql-connector-java-5.1.6-bin.jar lib/

    You may instead choose to copy the library from the BioPSI_Suplimenter/lib directory:

      mkdir lib
      cp ../../BioPSI_Suplimenter/lib/mysql-connector-java-5.1.6-bin.jar lib/
  4. Compile the source code. In order to build the software on a computer which does not have the NetBeans IDE installed, copy the generic build file into the StaxPSIXML directory:
    cp Build_files/build.xml .

    Compile and create the .jar file as follows:

    ant jar