Difference between revisions of "iRefIndex Build Process"
PaulBoddie (talk | contribs) m (→Running BioPSI_Suplimenter: Qualify the initial statement.) |
PaulBoddie (talk | contribs) (→Clone seguid: Added a note about updating from a newer seguidannotation file. Modified the wording.) |
Revision as of 13:22, 30 September 2010
Contents
- 1 Downloading the Source Data
- 2 Manual Downloads
- 3 Build Dependencies
- 4 Building FTPtransfer
- 5 Running FTPtransfer
- 6 Building SHA
- 7 Building SaxValidator
- 8 Running SaxValidator
- 9 Building BioPSI_Suplimenter
- 10 Creating the Database
- 11 Running BioPSI_Suplimenter
- 12 Building StaxPSIXML
- 13 Running StaxPSIXML
- 14 Validating the Results from StaxPSIXML
- 15 Running BioPSI_Suplimenter (continued)
- 16 All iRefIndex Pages
Downloading the Source Data
Before downloading the source data, a location must be chosen for the downloaded files. For example:
/home/irefindex/data
Some data sources need special links to be obtained from their administrators via e-mail, and in general there is a distinction between free and proprietary data sources, described as follows:
- Free
- BIND, BioGrid, Gene2Refseq (NCBI), IntAct, MINT, MMDB/PDB, MPPI, OPHID, RefSeq, UniProt
- Proprietary
- BIND Translation, CORUM, DIP, HPRD, MPact
I2D, which was considered for iRefIndex 7.0, is currently under review for inclusion in future releases. The status of BIND Translation is currently under review for possible inclusion in the free dataset in future releases.
The FTPtransfer program will download data from the following sources:
- Gene2Refseq
- IntAct
- MINT
- MMDB
- PDB
- RefSeq
- UniProt
Manual Downloads
More information can be found at the following location: Sources_iRefIndex_8.0
For each manual download, a subdirectory hierarchy must be created in the main data directory using a command of the following form:
mkdir -p <path-to-data>/<source>/<date>/
Here, <path-to-data> should be replaced by the location of the data directory, <source> should be replaced by the name of the source, and <date> should be replaced by the current date.
For example, for BIND this directory might be created as follows:
mkdir -p /home/irefindex/data/BIND/2010-02-08/
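The directory convention above can be wrapped in a small shell function. This is only a minimal sketch: the function name make_source_dir is hypothetical and not part of the iRefIndex tools, and it assumes that the date directory uses the YYYY-MM-DD form shown in the examples.

```shell
# Hypothetical helper (not part of iRefIndex): create
# <path-to-data>/<source>/<date>/ and print the resulting path.
# If no date is given, the current date is used in YYYY-MM-DD form.
make_source_dir() {
    data_root=$1
    source_name=$2
    date_dir=${3:-$(date +%Y-%m-%d)}
    target="$data_root/$source_name/$date_dir"
    mkdir -p "$target" && printf '%s\n' "$target"
}
```

For example, make_source_dir /home/irefindex/data BIND 2010-02-08 would create the BIND directory shown above.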
BIND
The FTP site was previously available at the following location:
ftp://ftp.bind.ca/pub/BIND/data/bindflatfiles/bindindex/
An archived copy of the data can be found at the following internal location:
/biotek/dias/donaldson3/Sabry/DATA_2006/BINDftp/
In the main downloaded data directory, create a subdirectory hierarchy as noted above. For example:
mkdir -p /home/irefindex/data/BIND/2010-02-08/
Copy the following files into the newly created data directory:
20060525.complex2refs.txt
20060525.complex2subunits.txt
20060525.ints.txt
20060525.labels.txt
20060525.refs.txt
BIND Translation
The location of BIND Translation downloads is as follows:
http://download.baderlab.org/BIND/
The location of the specific file to be downloaded is the following:
http://download.baderlab.org/BIND/PSIMI25_AllInclusive_AllSpecies.zip
In the main downloaded data directory, create a subdirectory hierarchy as noted above. For example:
mkdir -p /home/irefindex/data/BIND_Translation/2010-02-08/
Download the file into the newly created data directory and unpack it as follows:
cd /home/irefindex/data/BIND_Translation/2010-02-08/
unzip PSIMI25_AllInclusive_AllSpecies.zip
BioGrid
The location of BioGrid downloads is as follows:
http://www.thebiogrid.org/downloads.php
In the main downloaded data directory, create a subdirectory hierarchy as noted above. For example:
mkdir -p /home/irefindex/data/BioGrid/2010-02-08/
Select the BIOGRID-ALL-X.Y.Z.psi25.zip file (where X.Y.Z should be replaced by the actual release number) and download/copy it to the newly created data directory for BioGrid.
In the data directory for BioGrid, uncompress the downloaded file. For example:
cd /home/irefindex/data/BioGrid/2010-02-08/
unzip BIOGRID-ALL-2.0.62.psi25.zip
CORUM
The location of CORUM downloads is as follows:
http://mips.gsf.de/genre/proj/corum/index.html
The specific download file is this one:
http://mips.gsf.de/genre/export/sites/default/corum/allComplexes.psimi.zip
In the main downloaded data directory, create a subdirectory hierarchy as noted above. For example:
mkdir -p /home/irefindex/data/CORUM/2010-02-08/
Copy/download the file referenced above and uncompress it in the data directory for CORUM. For example:
cd /home/irefindex/data/CORUM/2010-02-08/
unzip allComplexes.psimi.zip
DIP
Access to data from DIP is performed via the following location:
http://dip.doe-mbi.ucla.edu/dip/Login.cgi?
You have to register, agree to terms, and get a user account.
Access credentials for internal users are available from Sabry.
In the main downloaded data directory, create a subdirectory hierarchy as noted above. For example:
mkdir -p /home/irefindex/data/DIP/2010-02-08/
Select the FULL - complete DIP data set from the Files page:
http://dip.doe-mbi.ucla.edu/dip/Download.cgi?SM=3
Download the latest PSI-MI 2.5 file (dip<date>.mif25) to the newly created data directory. If a compressed version of the file was chosen, uncompress the file using the gunzip tool. For example:
cd /home/irefindex/data/DIP/2010-02-08/
gunzip dip20080708.mif25.gz
HPRD
In the main downloaded data directory, create a subdirectory hierarchy as noted above. For example:
mkdir -p /home/irefindex/data/HPRD/2010-02-08/
Download the PSI-MI single file (HPRD_SINGLE_PSIMI_<date>.xml.tar.gz) to the newly created data directory.
Note: you have to register each and every time, unfortunately.
Uncompress the downloaded file. For example:
cd /home/irefindex/data/HPRD/2010-02-08/
tar zxf HPRD_SINGLE_PSIMI_090107.xml.tar.gz
I2D
For iRefIndex 7.0, I2D was supposed to replace OPHID, but problems with the source files have excluded I2D from that release.
http://ophid.utoronto.ca/ophidv2.201/downloads.jsp
In the main downloaded data directory, create a subdirectory hierarchy as noted above. For example:
mkdir -p /home/irefindex/data/I2D/2010-02-08/
For the Download Format in the download request form, specify PSI-MI 2.5 XML. Unfortunately, each Target Organism must be specified in turn when submitting the form: there is no ALL option.
Uncompress each downloaded file. For example:
cd /home/irefindex/data/I2D/2010-02-08/
unzip i2d.HUMAN.psi25.zip
InnateDB
Select the "Curated InnateDB Data" download from the InnateDB downloads page:
http://www.innatedb.com/download.jsp
In the main downloaded data directory, create a subdirectory hierarchy as noted above. For example:
mkdir -p /home/irefindex/data/InnateDB/2010-02-08/
Uncompress the downloaded file. For example:
cd /home/irefindex/data/InnateDB/2010-02-08/
gunzip innatedb_20100716.xml.gz
OPHID
From iRefIndex 8.0, I2D replaces OPHID.
OPHID is no longer available, so you have to use the local copy of the data:
/biotek/dias/donaldson3/Sabry/iRefIndex_Backup/BckUp15SEP2008/OPHID/2008MAR16
In the main downloaded data directory, create a subdirectory hierarchy as noted above. For example:
mkdir -p /home/irefindex/data/OPHID/2010-02-08/
Copy the file ophid1153236640123.xml to the newly created data directory.
MIPS
In the main downloaded data directory, create a subdirectory hierarchy as noted above. For example:
mkdir -p /home/irefindex/data/MIPS/2010-02-08/
For MPPI, download the following file:
http://mips.gsf.de/proj/ppi/data/mppi.gz
For MPACT, download the following file:
ftp://ftpmips.gsf.de/yeast/PPI/mpact-complete.psi25.xml.gz
Uncompress the downloaded files. For example:
cd /home/irefindex/data/MIPS/2010-02-08/
gunzip mpact-complete.psi25.xml.gz
gunzip mppi.gz
SEGUID
Downloading of the SEGUID dataset is described below.
UniProt
In the main downloaded data directory, create a subdirectory hierarchy as noted above. For example:
mkdir -p /home/irefindex/data/UniProt/2010-02-08/
Visit the following site:
http://www.uniprot.org/downloads
Download the UniProtKB/Swiss-Prot and UniProtKB/TrEMBL files in text format:
- ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz
- ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.dat.gz
- ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot_varsplic.fasta.gz
Or from the EBI UK mirror:
- ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz
- ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.dat.gz
- ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot_varsplic.fasta.gz
These files should be moved into the newly created data directory and uncompressed. For example:
cd /home/irefindex/data/UniProt/2010-02-08/
gunzip uniprot_sprot.dat.gz
gunzip uniprot_trembl.dat.gz
gunzip uniprot_sprot_varsplic.fasta.gz
Build Dependencies
To build the software, Apache Ant needs to be available. This software could be retrieved from the Apache site...
http://ant.apache.org/bindownload.cgi
...or from a mirror such as one of the following:
http://mirrorservice.nomedia.no/apache.org//ant/binaries/apache-ant-1.8.1-bin.tar.gz
http://mirrors.powertech.no/www.apache.org/dist//ant/binaries/apache-ant-1.8.1-bin.tar.gz
This software can be extracted as follows:
tar zxf apache-ant-1.8.1-bin.tar.gz
This will produce a directory called apache-ant-1.8.1 containing a directory called bin. The outer directory should be recorded in the ANT_HOME environment variable, whereas the bin directory should be incorporated into the PATH environment variable on your system. For example, for bash:
export ANT_HOME=/home/irefindex/apps/apache-ant-1.8.1
export PATH=${PATH}:${ANT_HOME}/bin
It should now be possible to run the ant program.
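Before starting the builds, it may help to confirm that the required tools are visible on the PATH. The following is a sketch only: require_tool is a hypothetical function name, and the tools to check (ant, java, cvs) are inferred from the steps on this page.

```shell
# Hypothetical helper: succeed if the named program can be found on the PATH,
# otherwise print a message on standard error and fail.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    printf 'Missing tool: %s\n' "$1" >&2
    return 1
}

# Example use (not run here):
#   require_tool ant && require_tool java && require_tool cvs
```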
Building FTPtransfer
The FTPtransfer.jar file needs to be obtained or built.
- Get the program's source code from this location:
https://hfaistos.uio.no/cgi-bin/viewvc.cgi/bioscape/bioscape/modules/interaction/Sabry/FTPtransfer/
Using CVS with the appropriate CVSROOT setting, run the following command:
cvs co bioscape/bioscape/modules/interaction/Sabry/FTPtransfer
The CVSROOT environment variable should be set to the following for this to work:
export CVSROOT=:ext:<username>@hfaistos.uio.no:/mn/hfaistos/storage/cvsroot
(The <username> should be replaced with your actual username.)
- Obtain the program's dependencies. This program uses the Apache commons-net package, and this must be available during compilation. This library could be retrieved from the Apache site...
http://commons.apache.org/downloads/download_net.cgi
...or from a mirror such as one of the following:
http://mirrorservice.nomedia.no/apache.org/commons/net/binaries/commons-net-1.4.1.tar.gz
http://www.powertech.no/apache/dist/commons/net/binaries/commons-net-1.4.1.tar.gz
- Extract the dependencies:
tar zxf commons-net-1.4.1.tar.gz
This will produce a directory called commons-net-1.4.1 containing a file called commons-net-1.4.1.jar which should be placed in the lib directory in the FTPtransfer directory...
mkdir lib
cp commons-net-1.4.1/commons-net-1.4.1.jar lib/
Alternatively, the external libraries can also be found in the following location:
/biotek/dias/donaldson3/iRefIndex/External_libraries
- Compile the source code. In order to build the software on a computer which does not have the NetBeans IDE installed, copy the generic build file into the FTPtransfer directory:
cp Build_files/build.xml .
Compile and create the .jar file as follows:
ant jar
Running FTPtransfer
To run the program, invoke the .jar file as follows:
java -Xms256m -Xmx256m -jar build/jar/FTPtransfer.jar log datadir
The specified log argument can be replaced with a suitable location for the program's execution log, whereas the datadir argument should be replaced with a suitable location for downloaded data (such as /home/irefindex/data).
Building SHA
The SHA.jar file needs to be obtained or built.
- Get the program's source code from this location:
https://hfaistos.uio.no/cgi-bin/viewvc.cgi/bioscape/bioscape/modules/interaction/Sabry/SHA/
Using CVS with the appropriate CVSROOT setting, run the following command:
cvs co bioscape/bioscape/modules/interaction/Sabry/SHA
The CVSROOT environment variable should be set to the following for this to work:
export CVSROOT=:ext:<username>@hfaistos.uio.no:/mn/hfaistos/storage/cvsroot
(The <username> should be replaced with your actual username.)
- Compile the source code. Compile and create the .jar file as follows:
ant jar
The SHA.jar file will be created in the dist directory.
Building SaxValidator
The SaxValidator.jar file needs to be obtained or built.
- Get the program's source code from this location:
https://hfaistos.uio.no/cgi-bin/viewvc.cgi/Parser/SaxValidator/
Using CVS with the appropriate CVSROOT setting, run the following command:
cvs co Parser/SaxValidator
The CVSROOT environment variable should be set to the following for this to work:
export CVSROOT=:ext:<username>@hfaistos.uio.no:/mn/hfaistos/storage/cvsroot
(The <username> should be replaced with your actual username.)
- Compile and create the .jar file as follows:
ant jar
Running SaxValidator
The program used for validation and integrity checks is called SaxValidator and when the name was chosen it was merely a SAX-based validator. However, more functionality has since been included:
- Validate XML files against a schema.
- XML parser-independent counting of elements (counting the number of </interaction> and </interactor> tags in each file). This gives an indication of what to expect at the end of the parsing.
- Count number of lines in BIND text.
- Remove files containing negative interactions.
Run the program as follows:
java -jar -Xms256m -Xmx256m dist/SaxValidator.jar <data directory> <date extension> <validate true/false> <count elements true/false>
For example:
java -jar -Xms256m -Xmx256m dist/SaxValidator.jar /home/irefindex/data /2010-02-08/ true true
Be sure to include the leading and trailing / characters around the date information.
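Since the slashes are easy to forget, the date argument can be normalised before it is passed to the program. This is a minimal sketch; date_extension is a hypothetical helper name and not part of SaxValidator.

```shell
# Hypothetical helper: return the date wrapped in exactly one leading and
# one trailing slash, whether or not the input already has them.
date_extension() {
    d=${1#/}      # drop one leading slash, if present
    d=${d%/}      # drop one trailing slash, if present
    printf '/%s/\n' "$d"
}
```

The result could then be used as the second argument in the invocation above, for example "$(date_extension 2010-02-08)".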
Handling Invalid Files
For each data source, invalid files will be moved to a subdirectory of that source's data directory. These subdirectories can be found by using the following Unix command:
find /home/irefindex/data -name inValid
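To see the individual files that were moved, rather than just the inValid directories, a variation on the command above can be used. This sketch assumes only that invalid files sit somewhere beneath a directory named inValid; the function name list_invalid_files is hypothetical.

```shell
# Hypothetical helper: list every regular file found beneath any
# directory named inValid under the given data directory.
list_invalid_files() {
    find "$1" -type f -path '*/inValid/*' | sort
}
```

For example, list_invalid_files /home/irefindex/data would print one path per invalid file.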
Known Issues
- MIPS MPACT/MPPI files may be flagged as invalid, but can still be parsed using workarounds in the parsing process
Alternatives and Utilities
The xmllint program provided in the libxml2 distribution, typically available as standard on GNU/Linux distributions, can be used to check and correct XML files. For example:
xmllint HPRD/2010-09-14/inValid/HPRD_SINGLE_PSIMI_041210.xml > HPRD_SINGLE_PSIMI_041210-corrected.xml
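This writes a re-serialised copy of the source file to the output file. If the source file is not well-formed, xmllint's --recover option may be needed for it to emit a repaired document.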
Building BioPSI_Suplimenter
The BioPSI_Suplimenter.jar file needs to be obtained or built.
- Get the program's source code from this location:
Using CVS with the appropriate CVSROOT setting, run the following command:
cvs co bioscape/bioscape/modules/interaction/Sabry/BioPSI_Suplimenter
The CVSROOT environment variable should be set to the following for this to work:
export CVSROOT=:ext:<username>@hfaistos.uio.no:/mn/hfaistos/storage/cvsroot
(The <username> should be replaced with your actual username.)
- Obtain the program's dependencies. This program uses the SHA.jar file created above as well as the MySQL Connector/J library which can be found at the following location:
- Extract the dependencies. For example:
tar zxf mysql-connector-java-5.1.6.tar.gz
This will produce a directory called mysql-connector-java-5.1.6 containing a file called mysql-connector-java-5.1.6-bin.jar which should be placed in the lib directory in the BioPSI_Suplimenter directory...
mkdir lib
cp mysql-connector-java-5.1.6/mysql-connector-java-5.1.6-bin.jar lib/
The filenames in the above example will need adjusting, depending on the exact version of the library downloaded.
The SHA.jar file needs copying from its build location:
cp ../SHA/dist/SHA.jar lib/
Alternatively, the external libraries can also be found in the following location:
/biotek/dias/donaldson3/iRefIndex/External_libraries
- Compile the source code. In order to build the software on a computer which does not have the NetBeans IDE installed, copy the generic build file into the BioPSI_Suplimenter directory:
cp Build_files/build.xml .
It might be necessary to edit the build.xml file, changing the particular filename for the .jar file whose name begins with mysql-connector-java, since this name will change between versions of that library.
Compile and create the .jar file as follows:
ant jar
Creating the Database
Enter MySQL using a command like the following:
mysql -h <host> -u <admin> -p -A
The <admin> is the name of the user with administrative privileges. For example:
mysql -h myhost -u admin -p -A
Then create a database and user using commands of the following form:
create database <database>;
create user '<username>'@'%' identified by '<password>';
grant all privileges on <database>.* to '<username>'@'%';
For example, with <database> given as irefindex, <username> given as irefindex, and a substitution for <password>:
create database irefindex;
create user 'irefindex'@'%' identified by 'mysecretpassword';
grant all privileges on irefindex.* to 'irefindex'@'%';
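When several databases are created (for example, one per release), the statements above can be generated from the three values. The following is a sketch under stated assumptions: create_db_sql is a hypothetical function name, and the values are inserted verbatim with no escaping, so it is only suitable for simple names and passwords without quote characters.

```shell
# Hypothetical helper: print the three SQL statements shown above for the
# given database name, user name and password. No escaping is performed.
create_db_sql() {
    database=$1
    username=$2
    password=$3
    printf "create database %s;\n" "$database"
    printf "create user '%s'@'%%' identified by '%s';\n" "$username" "$password"
    printf "grant all privileges on %s.* to '%s'@'%%';\n" "$database" "$username"
}
```

The output can be reviewed and then pasted into the MySQL session opened above.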
Manual loading of data
To provide sequences for SEGUIDs that are not retrieved in later stages, the "seguid2sequence" table has to be created as follows.
Note: This process only applies when no previous database version exists with which SEGUID identifier consistency is required. For iRefIndex maintenance, this process is typically skipped.
- Obtain the file "seguidflat" from ftp://bioinformatics.anl.gov/seguid/ or (locally) /biotek/dias/donaldson3/DATA/SEGUID
- Use the following SQL commands to load this into a table:
create table seguid2sequence(
    seguid char(27) default '0',
    sequence varchar(49152) default 'NA',
    noaa int(11) default -1
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
load data infile '......../seguidflat' into table seguid2sequence FIELDS TERMINATED BY '\t';
alter table seguid2sequence add index seguid(seguid);
update seguid2sequence set noaa=length(replace(sequence,' ',''));
Running BioPSI_Suplimenter
Please make sure that the manual loading of data was completed before this step, if appropriate.
Run the program as follows:
java -jar -Xms256m -Xmx768m build/jar/BioPSI_Suplimenter.jar &
In the dialogue box that appears, the following details must always be filled out:
- Server
- the <host> value specified when creating the database
- Database
- the <database> value specified when creating the database
- User name
- the <username> value specified above
- Password
- the <password> value specified above
- Log file
- the path to a log file where the program's output shall be written
Make sure that the log file will be written to a directory which already exists. For example:
mkdir /home/irefindex/logs/
The program will need to be executed a number of times for different activities, and these are described in separate sections below. For each one, select the corresponding menu item in the Program field shown in the dialogue.
Create tables
The SQL file field should refer to the Create_iRefIndex.sql file in the SQL directory within BioPSI_Suplimenter, and this should be a full path to the file. For example:
/home/irefindex/build/bioscape/bioscape/modules/interaction/Sabry/BioPSI_Suplimenter/SQL/Create_iRefIndex.sql
Click the OK button to create the tables.
Clone seguid
From iRefIndex 7.0, each ROG is consistent with the previous release. Therefore, the first operation when creating the new SEGUID table is to copy the SEGUID table from the previous release. This is included as an option in BioPSI_Suplimenter. The seguidannotation file is no longer parsed; if there is an updated version of this file from the SEGUID project, it has to be applied as a separate updating step.
Note: A process for updating from a newer version of the seguidannotation file is not currently defined in BioPSI_Suplimenter.
When the "Clone SEGUID" option is selected in the BioPSI_Suplimenter GUI, the SEGUID table field should give the source seguid table (so for the database beta7, the seguid table used would be beta6.seguid). The selected target database must not already have a seguid table; if it does, an error will be raised.
Free and proprietary releases & clone seguid
iRefIndex has two subversions for every release: free and proprietary. Therefore, the ROG should not merely be consistent with the previous release: it also has to be consistent between the free and proprietary versions of the release being made. Thus, the cloning will always be done using the earlier full/proprietary version as the source and the current full/proprietary version as the target. In other words, the proprietary version will be made first and then the free version. Once the proprietary version is made, the SEGUID table of the free version is made by cloning the current proprietary version's SEGUID table (not the previous version's).
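The choice of source table described above can be summarised as a small lookup. This is an illustrative sketch only: the function name seguid_clone_source and the database names used in the example are hypothetical.

```shell
# Hypothetical helper: print the seguid table to clone from.
#   $1  release kind being built: "proprietary" or "free"
#   $2  name of the previous proprietary (full) database
#   $3  name of the current proprietary (full) database
seguid_clone_source() {
    release_kind=$1
    previous_proprietary_db=$2
    current_proprietary_db=$3
    case $release_kind in
        # The proprietary build is made first, from the previous release.
        proprietary) printf '%s.seguid\n' "$previous_proprietary_db" ;;
        # The free build clones from the current proprietary build.
        free)        printf '%s.seguid\n' "$current_proprietary_db" ;;
        *)           printf 'Unknown release kind: %s\n' "$release_kind" >&2
                     return 1 ;;
    esac
}
```

For example, with hypothetical databases beta6_full and beta7_full, the proprietary build would clone from beta6_full.seguid and the free build from beta7_full.seguid.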
Recreate SEGUID
The File field should refer to the seguidannotation file in the SEGUID subdirectory hierarchy. For example, given the following data directory...
/home/irefindex/data
...an appropriate value for the File field might be this:
/home/irefindex/data/SEGUID/2010-02-08/seguidannotation
For the following fields, indicate the locations of the corresponding files and directories similarly:
- Unip_SP_file
- uniprot_sprot.dat (from UniProt); for example:
/home/irefindex/data/UniProt/2010-02-08/uniprot_sprot.dat
- Unip_Trm_file
- uniprot_trembl.dat (from UniProt); for example:
/home/irefindex/data/UniProt/2010-02-08/uniprot_trembl.dat
- unip_Isoform_file
- uniprot_sprot_varsplic.fasta (from UniProt); for example:
/home/irefindex/data/UniProt/2010-02-08/uniprot_sprot_varsplic.fasta
- Unip_Yeast_file
- yeast.txt (in the yeast directory); for example:
/home/irefindex/data/yeast/2010-02-08/yeast.txt
- Unip_Fly_file
- fly.txt (in the fly directory); for example:
/home/irefindex/data/fly/2010-02-08/fly.txt
- RefSeq DIR
- The specific download directory for RefSeq; for example:
/home/irefindex/data/RefSeq/2010-02-08/
- Fasta 4 PDB
- pdbaa.fasta (from PDB); for example:
/home/irefindex/data/PDB/2010-02-08/pdbaa.fasta
- Tax Table 4 PDB
- tax.table (from PDB); for example:
/home/irefindex/data/PDB/2010-02-08/tax.table
- Gene info file
- gene_info.txt (in the geneinfo directory); for example:
/home/irefindex/data/geneinfo/2010-02-08/gene_info.txt
- gene2Refseq
- gene2refseq.txt (in the NCBI_Mappings directory); for example:
/home/irefindex/data/NCBI_Mappings/2010-02-08/gene2refseq.txt
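Before filling in the fields, it can be useful to check that the files listed above are actually present. The sketch below assumes the <source>/<date>/<file> layout used in the examples; the function name missing_seguid_inputs is hypothetical.

```shell
# Hypothetical helper: print the path of each expected input that is
# missing under the given data directory and date; fail if any is missing.
missing_seguid_inputs() {
    base=$1
    date_dir=$2
    status=0
    for rel in \
        SEGUID/seguidannotation \
        UniProt/uniprot_sprot.dat \
        UniProt/uniprot_trembl.dat \
        UniProt/uniprot_sprot_varsplic.fasta \
        yeast/yeast.txt \
        fly/fly.txt \
        PDB/pdbaa.fasta \
        PDB/tax.table \
        geneinfo/gene_info.txt \
        NCBI_Mappings/gene2refseq.txt
    do
        # Insert the date directory between the source name and the file name.
        path="$base/${rel%/*}/$date_dir/${rel##*/}"
        if [ ! -f "$path" ]; then
            printf '%s\n' "$path"
            status=1
        fi
    done
    # RefSeq is given as a directory rather than a single file.
    if [ ! -d "$base/RefSeq/$date_dir" ]; then
        printf '%s\n' "$base/RefSeq/$date_dir/"
        status=1
    fi
    return "$status"
}
```

For example, missing_seguid_inputs /home/irefindex/data 2010-02-08 prints nothing and succeeds when all inputs are in place.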
Fill Bind info
The file fields should refer to the appropriate BIND files in the data directory hierarchy. For example, for Bind Ints file:
/home/irefindex/data/BIND/2010-02-08/20060525.ints.txt
To conveniently edit all file fields, you can edit the Base loc field, inserting the top-level data directory. For example:
/home/irefindex/data
In addition, the Date info can also be changed to indicate the common date directory name used by the data sources. For example:
2010-02-08
Be sure to check the final values of the file fields themselves before activating the operation.
Building StaxPSIXML
The StaxPSIXML.jar file needs to be obtained or built.
- Get the program's source code from this location:
Using CVS with the appropriate CVSROOT setting, run the following command:
cvs co bioscape/bioscape/modules/interaction/Sabry/StAX/StaxPSIXML
The CVSROOT environment variable should be set to the following for this to work:
export CVSROOT=:ext:<username>@hfaistos.uio.no:/mn/hfaistos/storage/cvsroot
(The <username> should be replaced with your actual username.)
- Obtain the program's dependencies. This program uses the MySQL Connector/J library which can be found at the following location:
You may choose to refer to the download from the BioPSI_Suplimenter build process.
- Extract the dependencies:
tar zxf mysql-connector-java-5.1.6.tar.gz
This will produce a directory called mysql-connector-java-5.1.6 containing a file called mysql-connector-java-5.1.6-bin.jar which should be placed in the lib directory in the StaxPSIXML directory...
mkdir lib
cp mysql-connector-java-5.1.6/mysql-connector-java-5.1.6-bin.jar lib/
You may instead choose to copy the library from the BioPSI_Suplimenter/lib directory:
mkdir lib
cp ../../BioPSI_Suplimenter/lib/mysql-connector-java-5.1.6-bin.jar lib/
The filenames in the above examples will need adjusting, depending on the exact version of the library downloaded.
- Compile the source code. In order to build the software on a computer which does not have the NetBeans IDE installed, copy the generic build file into the StaxPSIXML directory:
cp Build_files/build.xml .
It might be necessary to edit the build.xml file, changing the particular filename for the .jar file whose name begins with mysql-connector-java, since this name will change between versions of that library.
Compile and create the .jar file as follows:
ant jar
Running StaxPSIXML
The software must first be configured using files provided in the config directory:
- configFileList.txt
- Edit the #CONFIG entries in order to refer to the locations of each of the individual configuration files. To remove a data source, remove the leading # character from the appropriate line.
- config_X_SOURCE.xml
- Each supplied configuration file has this form, where X is an arbitrary identifier which helps to distinguish between different configuration versions, and where SOURCE is a specific data source name such as one of the following:
- BIOGRID
- DIP
- HPRD
- IntAct
- MINT
- MIPS
- MIPS_MPACT
- OPHID
In each file (specified in configFileList.txt), a number of elements need to be defined within the locations element:
- logger
- The location of the log file to be created. If a log file already exists, the new information will be appended. In the event that the program throws more than 50000 exceptions, the errors will be continued in new files ordered by a numeric identifier specified at the end of each filename.
- data
- This is the location of the PSI-XML files to be parsed. For example:
/home/irefindex/data/BioGrid/2010-02-08/textfiles/
This directory will also store the lastUpdate.obj file. This file records the files that were parsed successfully and allows parsing to resume from the last successful point in the event of a disruption. It also prevents files from being accidentally parsed more than once. If all files have to be parsed again (in the case of a new build, for example), lastUpdate.obj has to be deleted. If only certain files are to be parsed again, use the Exemptions option instead.
- Exemptions
- This gives the location of PSI-MI files to be re-parsed, thus overriding the lastUpdate.obj control. This may be needed if information from certain files has to be parsed again, but this directory should be created and left empty initially.
- mapper
- This is the most important part of the parsing. This file defines where to obtain the data from the XML file and which field of the table in the database the data is destined for. The location of this file is typically within the source code distribution, and an absolute path should be specified. For example:
/home/irefindex/build/bioscape/bioscape/modules/interaction/Sabry/StAX/StaxPSIXML/mapper/Map25_INTACT_MINT_BIOGRID.xml
More information about the mapper is available in Readme_mapper.txt within the StaxPSIXML directory.
See also the documentation on the topic of adding sources to iRefIndex for details of writing new mapper configuration files.
For the MIPS and MIPS_MPACT sources, the following "specs" element needs to be changed:
- filetype
- A specific file should be specified. For MIPS, this should be something like the following:
mppi.xml
For MIPS_MPACT, the file should be something like this:
mpact-complete.psi25.xml
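For a new build, the lastUpdate.obj files described under the data element have to be deleted so that all files are parsed again. A minimal sketch of doing this for a whole data directory follows; the function name clear_parse_state is hypothetical, and it assumes that the only files named lastUpdate.obj under that directory are the ones written by StaxPSIXML.

```shell
# Hypothetical helper: print and delete every lastUpdate.obj file found
# beneath the given data directory, leaving all other files untouched.
clear_parse_state() {
    find "$1" -type f -name lastUpdate.obj -print -exec rm -f {} +
}
```

For example, clear_parse_state /home/irefindex/data would reset the parsing state for every source at once; pass a single source's directory to reset only that source.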
When all configuration files and mapper files are ready, run the program:
java -jar -Xms128m -Xmx512m build/jar/StaxPSIXML.jar -f <config_file_list_file>
For the GUI version, omit the program arguments:
java -jar -Xms128m -Xmx512m build/jar/StaxPSIXML.jar
Validating the Results from StaxPSIXML
It is possible to validate the results of the parsing process by issuing the following queries against the database being prepared:
select name, count(uid) from int_source inner join int_db on int_source.source = int_db.id group by name;
select name, count(uid) from int_object inner join int_db on int_object.source = int_db.id group by name;
This will tabulate the interactions and interactors respectively for each data source. The values can then be compared to the totals written to the files produced by the SaxValidator program; these files can be found in the data directory hierarchy in a location resembling the following:
/home/irefindex/data/2010-02-08
Each validated data source should have a pair of files as illustrated by the following directory listing extract:
corum_interactions.txt
corum_interactors.txt
dip_interactions.txt
dip_interactors.txt
grid_interactions.txt
grid_interactors.txt
A convenient way of getting similar tabular summaries as those returned by the above queries is to run the following commands:
grep -e "total.*INTERACTION" /home/irefindex/data/2010-02-08/*.txt
grep -e "total.*INTERACTOR" /home/irefindex/data/2010-02-08/*.txt
Running BioPSI_Suplimenter (continued)
New for iRefIndex 7.0: when running the "ROG_ass + RIG_fill+ make Cy" process for free and proprietary releases there is a slight difference due to the ROG consistency issue. There is a checkbox in the GUI (in the red area) which is selected by default, which has to be deselected when making the free version.
Workaround for iRefIndex 7.0: the rig2rigid table needs to be copied from a previous build as follows:
create table rig2rigid as (select * from iRefindex_full_beta7.rig2rigid);
ROG_ass + RIG_fill+ make Cy
Note: when building the free version of the release, the UniProt_table, gene2refseq table, SEGUID table and Pre_build Eutils tables from the full version's database should be specified. (When building the full version, data has already been cloned from a previous release, and the tables in the same database should be specified.)
A table prepared from Web service data needs to be given for the Pre_build Eutils field. For example:
- Pre_build Eutils
- The name of the Web service data table; for example:
irefindex.eutils
It is possible to specify the location of a table residing in another database. As a consequence, the table will be copied into the target database.
All iRefIndex Pages
Follow this link for a listing of all iRefIndex related pages (archived and current).