Difference between revisions of "iRefIndex Build Process"
PaulBoddie (talk | contribs) (→HPRD: Added examples.) |
PaulBoddie (talk | contribs) (→OPHID: Added examples.) |
Revision as of 13:05, 19 February 2009
Contents
- 1 Downloading the Source Data
- 2 Manual Downloads
- 3 Building FTPtransfer
- 4 Running FTPtransfer
- 5 Building SHA
- 6 Building BioPSI_Suplimenter
- 7 Creating the Database
- 8 Running BioPSI_Suplimenter
- 9 Building StaxPSIXML
- 10 Running StaxPSIXML
- 11 Running BioPSI_Suplimenter (continued)
- 12 Building PSI_Writer
- 13 Running PSI_Writer
Downloading the Source Data
Before downloading the source data, a location must be chosen for the downloaded files. For example:
/home/irefindex/data
Download the files to create local copies. Automatic download is not possible for all the data sources: some require special links which must be obtained from the source administrators via e-mail. The FTPtransfer program will download data from the following sources:
- RefSeq
- MMDB
- PDB
- gene2refseq
- IntAct
- MINT
Manual Downloads
More information can be found at the following location:
ftp://ftp.no.embnet.org/irefindex/data/current/sources.htm
For each manual download, a subdirectory hierarchy must be created in the main data directory using a command of the following form:
mkdir -p <path-to-data>/<source>/<date>/
Here, <path-to-data> should be replaced by the location of the data directory, <source> should be replaced by the name of the source, and <date> should be replaced by the current date.
For example, for BIND this directory might be created as follows:
mkdir -p /home/irefindex/data/BIND/2009-02-19/
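The directory convention above can be applied to all of the manually downloaded sources in one pass. The following is a minimal sketch, not part of the iRefIndex tooling: the DATA_ROOT and BUILD_DATE variable names and the temporary default location are assumptions made for illustration, and the source names are those of the subsections below.

```shell
#!/bin/sh
# Sketch: create <path-to-data>/<source>/<date>/ for each manual source.
# DATA_ROOT and BUILD_DATE are hypothetical variable names; the default
# location under the temporary directory is only a placeholder, so set
# DATA_ROOT to the real data directory (for example /home/irefindex/data).
DATA_ROOT="${DATA_ROOT:-${TMPDIR:-/tmp}/irefindex/data}"
# The date format follows the 2009-02-19 style used in the examples.
BUILD_DATE="${BUILD_DATE:-$(date +%Y-%m-%d)}"

for source in BIND BioGrid CORUM DIP HPRD OPHID MIPS UniProt; do
    mkdir -p "$DATA_ROOT/$source/$BUILD_DATE/"
    echo "created $DATA_ROOT/$source/$BUILD_DATE/"
done
```

Because mkdir -p does not complain about directories which already exist, the loop can be run again safely on the same day.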
BIND
The FTP site was previously available at the following location:
ftp://ftp.bind.ca/pub/BIND/data/bindflatfiles/bindindex/
An archived copy of the data can be found at the following internal location:
/biotek/dias/donaldson3/Sabry/DATA_2006/BINDftp/
In the main downloaded data directory, create a subdirectory hierarchy as noted above. For example:
mkdir -p /home/irefindex/data/BIND/2009-02-19/
Copy the following files into the newly created data directory:
20060525.complex2refs.txt
20060525.complex2subunits.txt
20060525.ints.txt
20060525.labels.txt
20060525.refs.txt
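Since five separate files have to be copied, a quick check that none was missed may be useful. This is a sketch only: check_bind_files is a hypothetical helper name, and it assumes the files keep the <prefix>.<name>.txt naming shown above.

```shell
#!/bin/sh
# check_bind_files DIR PREFIX   (hypothetical helper, not part of iRefIndex)
# Reports any of the five BIND flat files missing from DIR.
# Returns 0 when all five are present and 1 otherwise.
check_bind_files() {
    dir=$1
    prefix=$2
    missing=0
    for suffix in complex2refs complex2subunits ints labels refs; do
        file="$dir/$prefix.$suffix.txt"
        if [ ! -f "$file" ]; then
            echo "missing: $file" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example invocation with the directory and file prefix used above:
# check_bind_files /home/irefindex/data/BIND/2009-02-19 20060525
```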
BioGrid
The location of BioGrid downloads is as follows:
http://www.thebiogrid.org/downloads.php
In the main downloaded data directory, create a subdirectory hierarchy as noted above. For example:
mkdir -p /home/irefindex/data/BioGrid/2009-02-19/
Select the BIOGRID-ORGANISM-XXXXX.psi25.zip file and download/copy it to the newly created data directory for BioGrid.
In the data directory for BioGrid, uncompress the downloaded file; for example:
cd /home/irefindex/data/BioGrid/2009-02-19/
unzip BIOGRID-ORGANISM-2.0.49.psi25.zip
CORUM
The location of CORUM downloads is as follows:
http://mips.gsf.de/genre/proj/corum/index.html
The specific download file is this one:
http://mips.gsf.de/genre/export/sites/default/corum/allComplexes.psimi.zip
In the main downloaded data directory, create a subdirectory hierarchy as noted above. For example:
mkdir -p /home/irefindex/data/CORUM/2009-02-19/
Copy/download the file referenced above and uncompress it in the data directory for CORUM; for example:
cd /home/irefindex/data/CORUM/2009-02-19/
unzip allComplexes.psimi.zip
Important Note
The CORUM data needs adjusting to work with the StaxPSIXML software. See the #Running StaxPSIXML section for details.
DIP
Access to data from DIP is performed via the following location:
http://dip.doe-mbi.ucla.edu/dip/Login.cgi?
You have to register, agree to terms, and get a user account.
Access credentials for internal users are available from Sabry.
In the main downloaded data directory, create a subdirectory hierarchy as noted above. For example:
mkdir -p /home/irefindex/data/DIP/2009-02-19/
Select the FULL - complete DIP data set from the Files page:
http://dip.doe-mbi.ucla.edu/dip/Download.cgi?SM=3
Download the latest PSI-MI 2.5 file (dip<date>.mif25) to the newly created data directory. If a compressed version of the file was chosen, uncompress the file using the gunzip tool. For example:
cd /home/irefindex/data/DIP/2009-02-19/
gunzip dip20080708.mif25.gz
HPRD
In the main downloaded data directory, create a subdirectory hierarchy as noted above. For example:
mkdir -p /home/irefindex/data/HPRD/2009-02-19/
Download the PSI-MI single file (HPRD_SINGLE_PSIMI_<date>.xml.tar.gz) to the newly created data directory.
Note: you have to register each and every time, unfortunately.
Uncompress the downloaded file. For example:
cd /home/irefindex/data/HPRD/2009-02-19/
tar zxf HPRD_SINGLE_PSIMI_090107.xml.tar.gz
OPHID
OPHID is no longer available, so you have to use the local copy of the data:
/biotek/dias/donaldson3/Sabry/iRefIndex_Backup/BckUp15SEP2008/OPHID/2008MAR16
In the main downloaded data directory, create a subdirectory hierarchy as noted above. For example:
mkdir -p /home/irefindex/data/OPHID/2009-02-19/
Copy the file ophid1153236640123.xml to the newly created data directory.
MIPS
In the main downloaded data directory, create a subdirectory hierarchy as noted above.
For MPPI, download the following file:
http://mips.gsf.de/proj/ppi/data/mppi.gz
For MPACT, download the following file:
ftp://ftpmips.gsf.de/yeast/PPI/mpact-complete.psi25.xml.gz
Uncompress the downloaded files:
gunzip mpact-complete.psi25.xml.gz
gunzip mppi.gz
UniProt
In the main downloaded data directory, create a subdirectory hierarchy as noted above.
Visit the following site:
http://www.uniprot.org/downloads
Download the UniProtKB/Swiss-Prot and UniProtKB/TrEMBL files in text format:
- ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz
- ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.dat.gz
- ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot_varsplic.fasta.gz
Or from the EBI UK mirror:
- ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz
- ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.dat.gz
- ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot_varsplic.fasta.gz
These files should be moved into the newly created data directory and uncompressed:
gunzip uniprot_sprot.dat.gz
gunzip uniprot_trembl.dat.gz
gunzip uniprot_sprot_varsplic.fasta.gz
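The UniProt files are large, so an interrupted or repeated uncompress step is plausible. The sketch below wraps the gunzip commands so that files which are already uncompressed are skipped; uncompress_uniprot is a hypothetical helper name and the directory argument is whatever UniProt data directory was created earlier.

```shell
#!/bin/sh
# uncompress_uniprot DIR   (hypothetical helper, not part of iRefIndex)
# Uncompresses the three UniProt downloads found in DIR.
# Files which are already uncompressed are left alone; a file which is
# present in neither form causes the function to stop with status 1.
uncompress_uniprot() {
    dir=$1
    for name in uniprot_sprot.dat uniprot_trembl.dat uniprot_sprot_varsplic.fasta; do
        if [ -f "$dir/$name" ]; then
            echo "already uncompressed: $name"
        elif [ -f "$dir/$name.gz" ]; then
            gunzip "$dir/$name.gz"
        else
            echo "not found: $dir/$name.gz" >&2
            return 1
        fi
    done
}

# Example invocation (the date component is illustrative):
# uncompress_uniprot /home/irefindex/data/UniProt/2009-02-19
```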
Building FTPtransfer
The FTPtransfer.jar file needs to be obtained or built.
- Get the program's source code from this location:
Using CVS with the appropriate CVSROOT setting, run the following command:
cvs co bioscape/bioscape/modules/interaction/Sabry/FTPtransfer
The CVSROOT environment variable should be set to the following for this to work:
export CVSROOT=:ext:<username>@hfaistos.uio.no:/mn/hfaistos/storage/cvsroot
(The <username> should be replaced with your actual username.)
- Obtain the program's dependencies. This program uses the Apache commons-net package, and this must be available during compilation. This library could be retrieved from the Apache site...
...or from a mirror such as the following:
- Extract the dependencies:
tar zxf commons-net-1.4.1.tar.gz
This will produce a directory called commons-net-1.4.1 containing a file called commons-net-1.4.1.jar which should be placed in the lib directory in the FTPtransfer directory...
mkdir lib
cp commons-net-1.4.1/commons-net-1.4.1.jar lib/
Alternatively, the external libraries can also be found in the following location:
/biotek/dias/donaldson3/iRefIndex/External_libraries
- Customise the output locations. Currently, the output locations are hard-coded, and changing them would involve searching for the following...
/biotek/prometheus/storage/Sabry/data
...and replacing it with the path to the preferred output directory. The source code is found in the following directory within the FTPtransfer directory:
src/ftptransfer
- Compile the source code. In order to build the software on a computer which does not have the NetBeans IDE installed, copy the generic build file into the FTPtransfer directory:
cp Build_files/build.xml .
Compile and create the .jar file as follows:
ant jar
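The build steps above, and those for the other programs below, rely on the cvs, tar, ant and java commands being installed. A quick check before starting might look like the following sketch; check_tools is a hypothetical helper name and is not part of the iRefIndex source.

```shell
#!/bin/sh
# check_tools TOOL...   (hypothetical helper, not part of iRefIndex)
# Reports whether each named command can be found on the PATH.
# Returns 0 when all are found and 1 when at least one is missing.
check_tools() {
    status=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool" >&2
            status=1
        fi
    done
    return "$status"
}

# Tools referred to in the build instructions:
# check_tools cvs tar ant java
```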
Running FTPtransfer
To run the program, invoke the .jar file as follows:
java -Xms256m -Xmx256m -jar build/jar/FTPtransfer.jar log
The specified log argument can be replaced with a suitable location for the program's execution log.
Building SHA
The SHA.jar file needs to be obtained or built.
- Get the program's source code from this location:
Using CVS with the appropriate CVSROOT setting, run the following command:
cvs co bioscape/bioscape/modules/interaction/Sabry/SHA
The CVSROOT environment variable should be set to the following for this to work:
export CVSROOT=:ext:<username>@hfaistos.uio.no:/mn/hfaistos/storage/cvsroot
(The <username> should be replaced with your actual username.)
- Compile the source code. Compile and create the .jar file as follows:
ant jar
The SHA.jar file will be created in the dist directory.
Building BioPSI_Suplimenter
The BioPSI_Suplimenter.jar file needs to be obtained or built.
- Get the program's source code from this location:
Using CVS with the appropriate CVSROOT setting, run the following command:
cvs co bioscape/bioscape/modules/interaction/Sabry/BioPSI_Suplimenter
The CVSROOT environment variable should be set to the following for this to work:
export CVSROOT=:ext:<username>@hfaistos.uio.no:/mn/hfaistos/storage/cvsroot
(The <username> should be replaced with your actual username.)
- Obtain the program's dependencies. This program uses the SHA.jar file created above as well as the MySQL Connector/J library which can be found at the following location:
- Extract the dependencies:
tar zxf mysql-connector-java-5.1.6.tar.gz
This will produce a directory called mysql-connector-java-5.1.6 containing a file called mysql-connector-java-5.1.6-bin.jar which should be placed in the lib directory in the BioPSI_Suplimenter directory...
mkdir lib
cp mysql-connector-java-5.1.6/mysql-connector-java-5.1.6-bin.jar lib/
The SHA.jar file needs copying from its build location:
cp ../SHA/dist/SHA.jar lib/
Alternatively, the external libraries can also be found in the following location:
/biotek/dias/donaldson3/iRefIndex/External_libraries
- Compile the source code. In order to build the software on a computer which does not have the NetBeans IDE installed, copy the generic build file into the BioPSI_Suplimenter directory:
cp Build_files/build.xml .
Compile and create the .jar file as follows:
ant jar
Creating the Database
Enter MySQL using a command like the following:
mysql -h <host> -u <admin> -p -A
The <admin> is the name of the user with administrative privileges. For example:
mysql -h myhost -u admin -p -A
Then create a database and user using commands of the following form:
create database <database>;
create user '<username>'@'%' identified by '<password>';
grant all privileges on <database>.* to '<username>'@'%';
For example, with <database> given as irefindex, <username> given as irefindex, and a substitution for <password>:
create database irefindex;
create user 'irefindex'@'%' identified by 'mysecretpassword';
grant all privileges on irefindex.* to 'irefindex'@'%';
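If the database is recreated for each build, the three statements can be generated from the chosen names rather than typed by hand. This is a minimal sketch: create_db_sql is a hypothetical helper name, and it performs no quoting, so it assumes the values contain no quote characters.

```shell
#!/bin/sh
# create_db_sql DATABASE USERNAME PASSWORD   (hypothetical helper)
# Prints the statements shown above with the given values substituted.
# No escaping is done: the values must not contain quote characters.
create_db_sql() {
    cat <<EOF
create database $1;
create user '$2'@'%' identified by '$3';
grant all privileges on $1.* to '$2'@'%';
EOF
}

# Example: review the output, then pipe it into the mysql client using
# the connection options given earlier in this section:
# create_db_sql irefindex irefindex mysecretpassword | mysql -h myhost -u admin -p -A
```

Printing the statements first, instead of running them directly, leaves a chance to inspect them before they reach the server.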
Running BioPSI_Suplimenter
Run the program as follows:
java -jar -Xms256m -Xmx256m build/jar/BioPSI_Suplimenter.jar &
In the dialogue box that appears, the following details must always be filled out:
- Database
- the <database> value specified when creating the database
- User name
- the <username> value specified above
- Password
- the <password> value specified above
- Log file
- the path to a log file where the program's output shall be written
Make sure that the log file will be written to a directory which already exists.
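One way to guarantee that the log directory exists is to create it just before starting the program. In this sketch, ensure_log_dir is a hypothetical helper name and the example log path is an assumption, not a location used by iRefIndex.

```shell
#!/bin/sh
# ensure_log_dir LOGFILE   (hypothetical helper, not part of iRefIndex)
# Creates the directory which will hold LOGFILE, if it does not already
# exist, and prints the directory name.
ensure_log_dir() {
    logdir=$(dirname "$1")
    mkdir -p "$logdir" && echo "$logdir"
}

# Example with an illustrative log file path:
# ensure_log_dir /home/irefindex/logs/BioPSI_Suplimenter.log
```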
A number of steps or programs will need to be executed, and these are described in separate sections below. For each one, select the corresponding menu item in the "Program" field shown in the dialogue.
Create tables
The SQL file field should refer to the Create_iRefIndex.sql file in the SQL directory within BioPSI_Suplimenter, and this should be a full path to the file. For example:
/home/irefindex/build/bioscape/bioscape/modules/interaction/Sabry/BioPSI_Suplimenter/SQL/Create_iRefIndex.sql
Click the OK button to create the tables.
Recreate SEGUID
The File field should refer to the seguidannotation file in the SEGUID subdirectory hierarchy. For example, given the following data directory...
/home/irefindex/data
...an appropriate value for the File field might be this:
/home/irefindex/data/SEGUID/09_22_2008/seguidannotation
For the following fields, indicate the locations of the corresponding files similarly:
- Unip_SP_file
- uniprot_sprot.dat (from UniProt)
- Unip_Trm_file
- uniprot_trembl.dat (from UniProt)
- unip_Isoform_file
- uniprot_sprot_varsplic.fasta (from UniProt)
- Unip_Yeast_file
- yeast.txt (in the yeast directory)
- Unip_Fly_file
- fly.txt (in the fly directory)
- Fasta 4 PDB
- pdbaa.fasta (from PDB)
- Tax Table 4 PDB
- tax.table (from PDB)
- Gene info file
- gene_info.txt (in the geneinfo directory)
- gene2Refseq
- gene2refseq.txt (in the NCBI_Mappings directory)
For RefSeq DIR, a directory needs to be given instead of individual filenames.
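Because each field needs a full path, it can help to locate all of the named files under the data directory before filling out the dialogue. The following sketch searches by file name only; find_inputs is a hypothetical helper name, and where several copies of a file exist it simply reports the first one found.

```shell
#!/bin/sh
# find_inputs DATA_ROOT   (hypothetical helper, not part of iRefIndex)
# Looks for each input file named above anywhere below DATA_ROOT and
# prints the first match. Returns 1 if any file cannot be found.
find_inputs() {
    root=$1
    status=0
    for name in seguidannotation uniprot_sprot.dat uniprot_trembl.dat \
        uniprot_sprot_varsplic.fasta yeast.txt fly.txt pdbaa.fasta \
        tax.table gene_info.txt gene2refseq.txt; do
        path=$(find "$root" -type f -name "$name" | head -n 1)
        if [ -n "$path" ]; then
            echo "$name: $path"
        else
            echo "$name: NOT FOUND" >&2
            status=1
        fi
    done
    return "$status"
}

# Example invocation with the data directory used above:
# find_inputs /home/irefindex/data
```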
Fill Bind info
The file fields should refer to the appropriate BIND files in the data directory hierarchy. For example, for Bind Ints file:
/home/irefindex/data/BIND/09_22_2008/20060525.ints.txt
Building StaxPSIXML
The StaxPSIXML.jar file needs to be obtained or built.
- Get the program's source code from this location:
Using CVS with the appropriate CVSROOT setting, run the following command:
cvs co bioscape/bioscape/modules/interaction/Sabry/StAX/StaxPSIXML
The CVSROOT environment variable should be set to the following for this to work:
export CVSROOT=:ext:<username>@hfaistos.uio.no:/mn/hfaistos/storage/cvsroot
(The <username> should be replaced with your actual username.)
- Obtain the program's dependencies. This program uses the MySQL Connector/J library which can be found at the following location:
You may choose to refer to the download from the BioPSI_Suplimenter build process.
- Extract the dependencies:
tar zxf mysql-connector-java-5.1.6.tar.gz
This will produce a directory called mysql-connector-java-5.1.6 containing a file called mysql-connector-java-5.1.6-bin.jar which should be placed in the lib directory in the StaxPSIXML directory...
mkdir lib
cp mysql-connector-java-5.1.6/mysql-connector-java-5.1.6-bin.jar lib/
You may instead choose to copy the library from the BioPSI_Suplimenter/lib directory:
mkdir lib
cp ../../BioPSI_Suplimenter/lib/mysql-connector-java-5.1.6-bin.jar lib/
- Compile the source code. In order to build the software on a computer which does not have the NetBeans IDE installed, copy the generic build file into the StaxPSIXML directory:
cp Build_files/build.xml .
Compile and create the .jar file as follows:
ant jar
Running StaxPSIXML
The software must first be configured using files provided in the config directory:
- configFileList.txt
- Edit the #CONFIG entries in order to refer to the locations of each of the individual configuration files. To remove a data source, remove the leading # character from the appropriate line.
- config_X_SOURCE.xml
- Each supplied configuration file has this form, where X is an arbitrary identifier which helps to distinguish between different configuration versions, and where SOURCE is a specific data source name such as one of the following:
- BIOGRID
- DIP
- HPRD
- IntAct
- MINT
- MIPS
- MIPS_MPACT
- OPHID
In each file (specified in configFileList.txt), a number of elements need to be defined within the locations element:
- logger
- The location of the log file to be created. If a log file already exists, the new information will be appended. In the event that the program throws more than 50000 exceptions, the errors will be continued in new files ordered by a numeric identifier specified at the end of each filename.
- data
- This is the location of the PSI-XML files to be parsed. For example:
/home/irefindex/data/BioGrid/09_22_2008/textfiles/
This directory will also store the lastUpdate.obj file, which records the files that have been parsed successfully. It allows parsing to resume from the last successful point in the event of a disruption, and it prevents files from accidentally being parsed more than once. If all files have to be parsed again (in the case of a new build, for example), lastUpdate.obj has to be deleted. If only certain files are to be parsed again, use the Exemptions option instead.
- Exemptions
- This gives the location of PSI-MI files to be re-parsed, thus overriding the lastUpdate.obj control. This may be needed if information from certain files has to be parsed again, but this directory should be created and left empty initially.
- mapper
- This is the most important part of the parsing. This file defines where to obtain the data from the XML file and which field of the table in the database the data is destined for. The location of this file is typically within the source code distribution, and an absolute path should be specified. For example:
/home/irefindex/build/bioscape/bioscape/modules/interaction/Sabry/StAX/StaxPSIXML/mapper/Map25_INTACT_MINT_BIOGRID.xml
More information about the mapper is available in Readme_mapper.txt within the StaxPSIXML directory.
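For a new build, the lastUpdate.obj file kept in each data location has to be removed so that every file is parsed again. A small sketch of that step follows; reset_parse_state is a hypothetical helper name, and the argument is the same directory that is given in the data element.

```shell
#!/bin/sh
# reset_parse_state DATA_DIR   (hypothetical helper, not part of iRefIndex)
# Deletes lastUpdate.obj from DATA_DIR so that all files in that
# directory are parsed again on the next run.
reset_parse_state() {
    state="$1/lastUpdate.obj"
    if [ -f "$state" ]; then
        rm "$state"
        echo "removed $state"
    else
        echo "no lastUpdate.obj in $1; nothing to do"
    fi
}

# Example with the data location shown above:
# reset_parse_state /home/irefindex/data/BioGrid/09_22_2008/textfiles
```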
For the MIPS and MIPS_MPACT sources, the following "specs" element needs to be changed:
- filetype
- A specific file should be specified. For MIPS, this should be something like the following:
mppi.xml
For MIPS_MPACT, the file should be something like this:
mpact-complete.psi25.xml
Important Note
For #CORUM, the downloaded data file must be modified before running the StaxPSIXML software.
Using a suitable XSLT tool such as xsltproc, transform the uncompressed downloaded file as follows (substituting the appropriate data directory details for your own environment):
mv /home/irefindex/data/CORUM/2009-02-19/allComplexes.psimi /home/irefindex/data/CORUM/2009-02-19/allComplexes.psimi.orig
xsltproc XSLT/fix_corum.xsl /home/irefindex/data/CORUM/2009-02-19/allComplexes.psimi.orig > /home/irefindex/data/CORUM/2009-02-19/allComplexes.psimi
The fix_corum.xsl file can be found in the XSLT directory within StaxPSIXML.
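Running the two commands above a second time would overwrite the preserved original with already transformed data. The sketch below adds a guard against that and restores the original if the transformation fails; fix_corum is a hypothetical wrapper name, and xsltproc is assumed to be the XSLT tool in use.

```shell
#!/bin/sh
# fix_corum CORUM_DIR XSL_FILE   (hypothetical wrapper, not part of iRefIndex)
# Renames allComplexes.psimi to allComplexes.psimi.orig and writes the
# transformed data back under the original name.
fix_corum() {
    dir=$1
    xsl=$2
    data="$dir/allComplexes.psimi"
    # Refuse to run twice: the .orig file must remain the untouched download.
    if [ -f "$data.orig" ]; then
        echo "$data.orig already exists; refusing to overwrite it" >&2
        return 1
    fi
    if [ ! -f "$data" ] || [ ! -f "$xsl" ]; then
        echo "expected both $data and $xsl to exist" >&2
        return 1
    fi
    mv "$data" "$data.orig"
    # If the transformation fails, put the original file back in place.
    if ! xsltproc "$xsl" "$data.orig" > "$data"; then
        mv "$data.orig" "$data"
        return 1
    fi
}

# Example, run from the StaxPSIXML directory:
# fix_corum /home/irefindex/data/CORUM/2009-02-19 XSLT/fix_corum.xsl
```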
When all configuration files and mapper files are ready, run the program:
java -jar -Xms128m -Xmx512m build/jar/StaxPSIXML.jar -f <config_file_list_file>
GUI version:
java -jar -Xms128m -Xmx512m build/jar/StaxPSIXML.jar
Running BioPSI_Suplimenter (continued)
ROG_ass + RIG_fill+ make Cy
A table prepared from Web service data needs to be given for the Pre_build Eutils field.
One way of ensuring that this table exists and is suitable is to drop any existing table within the database being built, then to copy an existing table from a previously built database:
use <database>;
drop table eutils;
create table eutils like <old_database>.eutils;
insert into eutils select * from <old_database>.eutils;
For example:
use irefindex;
drop table eutils;
create table eutils like old_db.eutils;
insert into eutils select * from old_db.eutils;
Building PSI_Writer
The PSI_Writer.jar file needs to be obtained or built.
- Get the program's source code from this location:
Using CVS with the appropriate CVSROOT setting, run the following command:
cvs co bioscape/bioscape/modules/interaction/Sabry/PSI_Writer
The CVSROOT environment variable should be set to the following for this to work:
export CVSROOT=:ext:<username>@hfaistos.uio.no:/mn/hfaistos/storage/cvsroot
(The <username> should be replaced with your actual username.)
- Obtain the program's dependencies. This program uses the SHA.jar file created above as well as the MySQL Connector/J library which can be found at the following location:
- Extract the dependencies:
tar zxf mysql-connector-java-5.1.6.tar.gz
This will produce a directory called mysql-connector-java-5.1.6 containing a file called mysql-connector-java-5.1.6-bin.jar which should be placed in the lib directory in the PSI_Writer directory...
mkdir lib
cp mysql-connector-java-5.1.6/mysql-connector-java-5.1.6-bin.jar lib/
You may instead choose to copy the library from the BioPSI_Suplimenter/lib directory:
mkdir lib
cp ../BioPSI_Suplimenter/lib/mysql-connector-java-5.1.6-bin.jar lib/
The SHA.jar file needs copying from its build location:
cp ../SHA/dist/SHA.jar lib/
Alternatively, the external libraries can also be found in the following location:
/biotek/dias/donaldson3/iRefIndex/External_libraries
- Compile the source code. In order to build the software on a computer which does not have the NetBeans IDE installed, copy the generic build file into the PSI_Writer directory:
cp Build_files/build.xml .
Compile and create the .jar file as follows:
ant jar
Running PSI_Writer
In order to run the program, some additional database tables are required. One way of ensuring that such tables exist and are suitable is to drop any existing tables within the database being built, then to copy existing tables from a previously built database:
use <database>;
drop table mapping_intDitection;
drop table mapping_intType;
drop table mapping_partidentification;
create table mapping_intDitection like <old_database>.mapping_intDitection;
create table mapping_intType like <old_database>.mapping_intType;
create table mapping_partidentification like <old_database>.mapping_partidentification;
insert into mapping_intDitection select * from <old_database>.mapping_intDitection;
insert into mapping_intType select * from <old_database>.mapping_intType;
insert into mapping_partidentification select * from <old_database>.mapping_partidentification;
For example:
use irefindex;
drop table mapping_intDitection;
drop table mapping_intType;
drop table mapping_partidentification;
create table mapping_intDitection like old_db.mapping_intDitection;
create table mapping_intType like old_db.mapping_intType;
create table mapping_partidentification like old_db.mapping_partidentification;
insert into mapping_intDitection select * from old_db.mapping_intDitection;
insert into mapping_intType select * from old_db.mapping_intType;
insert into mapping_partidentification select * from old_db.mapping_partidentification;
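The drop, create and insert statements follow the same pattern for every table, so they can be generated for any list of tables. This is a sketch under stated assumptions: copy_tables_sql is a hypothetical helper name, the statements are grouped per table instead of by statement type, and "drop table if exists" is used in place of the plain "drop table" above so that a missing table is not an error.

```shell
#!/bin/sh
# copy_tables_sql DATABASE OLD_DATABASE TABLE...   (hypothetical helper)
# Prints SQL which replaces each TABLE in DATABASE with a copy of the
# table of the same name in OLD_DATABASE.
copy_tables_sql() {
    db=$1
    old_db=$2
    shift 2
    echo "use $db;"
    for table in "$@"; do
        # "if exists" is a deviation from the statements shown above.
        echo "drop table if exists $table;"
        echo "create table $table like $old_db.$table;"
        echo "insert into $table select * from $old_db.$table;"
    done
}

# Example producing statements for the three mapping tables:
# copy_tables_sql irefindex old_db mapping_intDitection mapping_intType mapping_partidentification
```

The output can be reviewed and then piped into the mysql client in the same way as any other SQL script.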
Run the program as follows:
java -jar -Xms256m -Xmx256m build/jar/PSI_Writer.jar
Follow the instructions, supplying the requested arguments when running the program again.