The University of Arizona
runMultiTCW Guide

MultiTCW (mTCW) is the comparative module of Transcriptome Computational Workbench. This module takes as input two or more singleTCW databases (sTCWdb). It has been tested with four sTCWdbs (134k total sequences); it can probably handle more input, but the viewMultiTCW queries become very slow.

Note, familiarity with singleTCW is essential, as MultiTCW projects are created by merging existing sTCW projects.

Overview

Common abbreviations:
mTCW   multiTCW database
sTCW   singleTCW database
NT   Nucleotide (transcript, gene)
AA   Amino acid (translated ORF, protein)
NT-sTCW   singleTCW created from NT sequences.
AA-sTCW   singleTCW created from AA sequences.
NT-mTCW   multiTCW built from only NT-sTCW.
AA-mTCW   multiTCW built from only AA-sTCW or a mix of AA-sTCW and NT-sTCW.

runMultiTCW

  1. Input:
    • Two or more sTCW databases. The sequences, their annoDB hits, RPKM and DE p-values are imported to the multiTCW database.
    • For NT-sTCW, the mTCWdb will contain for each seqID the nucleotide, CDS, and protein sequences. The protein sequence is created from the CDS sequence, which is created from the TCW computed ORF.
    • (Optional) A file of clusters.
  2. The results are best if:
    • The sTCW databases are annotated the same.
    • The condition names are the same (when applicable). For example, if two species both have counts for the tissue type 'leaf', the condition name provided in runSingleTCW must be the same for both (e.g. leaf); the match is case-insensitive.
    • The DE column names are the same (when applicable), as with the condition names.
  3. Computation:
    • Compare the AA sequences using one of the supported Search programs, and the NT sequences with blastn.
    • Compute one or more sets of clusters using BBH (bi-directional best hit), Closure, OrthoMCL (Li et al. 2003), or user-supplied clusters.
    • For an NT-mTCW database (created from only NT-sTCW databases), statistics such as Ka/Ks (Zhang et al. 2006), synonymous codons, etc. are computed.

Software Requirements and Installation

The multiTCW executables were installed when you installed TCW (see Installation).
Software   Version   Source   Additional info
Optional package for transcript comparison:
KaKs_Calculator   1.2   KaKs_calculator (google for other download sites)   For KaKs analysis in runMultiTCW.
Additional packages (supplied with TCW)+:
OrthoMCL   2.0   OrthoMCL   Clusters orthologous proteins in runMultiTCW. Uses Perl and MySQL, which requires DBD::mysql.
Muscle   3.8.31   Muscle   Multiple alignment of protein clusters in viewMultiTCW.

+ Linux binaries are supplied under the external directory and Mac binaries are supplied under the external_osx directory. You do not need to do anything special, as TCW will find the packages there if they are not in your path. We cannot guarantee the packages will work on every Linux/Mac machine; if they do not, you will need to replace them.

Running the Demo

This section is essential for learning how to use runMultiTCW. It describes how to make a multiTCW project starting from singleTCW demos. You can use demoTra, demoAsm, and demoPro for a three-way comparison that includes an AA-sTCW. However, the three 'ex' demos included in the package have more homology and make a better example, so this guide uses them.
Using runSingleTCW, create sTCW_exBar, as follows:
  • Create the annotation files:
    1. ./runAS
    2. Fill out the panel as shown on the right, then select Tax Download.
    3. You may select GO Download to build the GO database for singleTCW, but everything will execute faster if you skip this step.
    4. Select TCW.anno, which writes a file for input to runSingleTCW.
  • Create sTCW_exBar:
    1. ./runSingleTCW
    2. Select "exBar" from the Project dropdown.
    3. Select Import annoDBs and select TCW.anno.UniProt_ex file.
    4. Execute the three steps to load data, instantiate, and annotate. If you have the search program diamond, it is suggested that you use it, as it executes much faster (set it for each annoDB using the "Edit" button).
  • Create sTCW_exFoo as above. If you want to include a 3rd sTCW, create sTCW_exFly.

To create the multiTCW project, start by running the mTCW Manager:

./runMultiTCW
This brings up the Manager interface, shown on the lower right (though all fields will be blank).
Click Add Project to create a new project. Enter "ex" in the entry box, and click ok. The Manager interface will have mTCW database filled in as mTCW_ex, which will be the MySQL database name.

The steps are as follows; more detail is provided below.

  1. Using the Add beside the "single TCW databases":
    1. Add the exBar database with its translated ORF file (projcmp/AAfiles/exBar_aaORFs.fa)
    2. Likewise, add exFoo and exFly databases with their respective aaORFs file.
  2. Select Build Database.
  3. Select Run Blast.
  4. Select Add Pairs from Blast.
  5. Using the Add button beside the "Cluster Methods" table, add one or more methods. Execute "Add New Clusters"; the added methods in the table will be italicized. You may add additional methods at any time.
  6. Select Run Stats. When it completes, the label will change to "No action selected".
  7. Exit runMultiTCW in order to run KaKs_calculator.
    1. Download the KaKs_calculator.
    2. Change directory to projcmp. Put the KaKs_calculator executable in this directory.
    3. Change directory to ex/KaKs. Execute "sh runKaKs".
    4. Start up runMultiTCW again. The label on the 4th section should say "Read KaKs"; if it does not, select Settings and select it. Then select Run Stats.
    5. Select Launch viewMultiTCW to query the results.
The overview will look similar to this Overview.

The four steps

The following continues to use exBar and exFoo as examples of sTCW databases, but will work for any set of sTCW databases.

Top three rows

Add Project   A popup window will appear where you enter the project name. On 'OK', the following occurs:
(1) A project directory will be created under projcmp with the project name.
(2) A file called mTCW.cfg is created and written to the project directory.
(3) The database will have the same name with the prefix 'mTCW_' added.
Help   A pop-up window that provides similar information to this guide.
Project   The drop-down lists all sub-directories under projcmp. When you select one, the project's mTCW.cfg file will be read and its values entered into the interface.
Save   Everything you enter gets saved into mTCW.cfg every time you make a change. However, you can initiate the save with this button if you want to be sure the new information is saved.
Remove   See below.
Overview   Once you have selected a project, you can select 'Overview' to see its state.
mTCW database   By default, the MySQL name will be mTCW_<project-name>. You can change the name, though it must start with mTCW_.

Remove: Select one or more options. When you select 'Ok', you will be prompted to verify each removal.

From database
Pairs..   Removes the pairs and clusters so you can start over without running Build Database again, e.g. if you want to use a different blast file, create all new clusters, etc. Once the pairs are removed, you can change the blast settings (e.g. use filtered blast) and then re-add the pairs.
mTCW database   Removes the database but leaves the project on disk.
From disk
Pairs and method files   If you remove the pairs and clusters from the database, it is a good idea to remove all associated files from disk using this option.
Blast files   If you recreate the database and think the sequences in it may have changed, you should remove the blast files so that you can re-blast. Likewise, if you want to re-run blast for any other reason, you first need to remove the blast files using this option.
All files   If you are no longer using the project, you can delete the database (above) and all the relevant files here.

1. single TCW databases

SingleTCW databases can be created from nucleotide (NT-sTCW) or protein (AA-sTCW) sequences. A multiTCW database can be created with a mix of NT-sTCW and AA-sTCW databases. If the multiTCW is created with only AA-sTCW or a mix, only the PCC statistics are available (see Step 4. Pair Statistics).
Click the Add button next to Single TCW databases. This brings up the sTCW selection panel shown on the right.

Clicking Select sTCW Database produces a popup of existing sTCW databases. Choose the sTCW from the list.

The 'prefix' is only used in the Method files, so it does not matter what it is as long as it is unique.

The remark can be anything, and can be added/changed after the database is created. Avoid special characters such as quotes.

When you select Keep, it will take you back to the main panel and this database will be shown in the "single TCW databases" table.

Repeat to add all the sTCWdbs you want to compare.

2. Self-Blast

Run Blast puts the results in blastAA.tab and blastNT.tab; these names cannot be changed.

On the "Settings" page, you can change the search program (see Search). By default, blast is run with "-use_sw_tback" for the nucleotide self-blast, which is very important as it runs dynamic programming for the final score.

Once blast is run, you can no longer change the parameters; if you want to change the parameters and re-run blast, first remove the blast files using the main panel Remove....

Add Pairs from blast loads the pairs from the blast result file or filtered file.

3. Cluster Methods

Click Add in the "Cluster Methods" section to add a new clustering method; this brings up the Method panel. The drop-down beside "Method" shows BBH, Closure, Ortholog, and User defined. You can add any number of cluster methods, and you can add the same method multiple times with different parameters, as long as the "Prefix" is different. Every method needs a unique prefix, which is used to prefix the cluster names, e.g. a method with prefix "BB8" will have cluster names BB8_00001, BB8_00002, etc. The prefix can be at most 5 characters, so make them meaningful.
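As an illustration of the naming convention only, the following Python sketch generates cluster names from a method prefix. The function name is hypothetical, and the five-digit zero padding is an assumption inferred from the BB8_00001 example; it is not taken from the TCW source.

```python
def cluster_names(prefix, count):
    """Return `count` cluster names of the form PREFIX_00001, PREFIX_00002, ...

    Assumption: a prefix has 1 to 5 characters and the counter is
    zero-padded to five digits, as in the BB8_00001 example.
    """
    if not 1 <= len(prefix) <= 5:
        raise ValueError("prefix must have between 1 and 5 characters")
    return ["%s_%05d" % (prefix, i) for i in range(1, count + 1)]


if __name__ == "__main__":
    for name in cluster_names("BB8", 3):
        print(name)
```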

When the multiTCW database is created from nucleotide sTCW databases, it is advantageous to have BBH be at least one of the methods, as these pairs are used for the overall summary and for the KaKs pairs.

The Help page for clustering provides more detail, but the following is an overview.

BBH

The BBH finds the bi-directional best hit based on blast e-value. It uses the blast hits that were loaded into the database with Add Pairs from Blast. The following explains the parameters:
  1. Amino acid or nucleotide (for NT-mTCW only).
  2. %Similarity - the Blast similarity (Identity).
  3. %Overlap - the alignment length is divided by the length of the sequence times 100 to get the %Overlap for each sequence of the pair (Olap1 and Olap2). You can choose "Either", which requires that either Olap1>%Overlap OR Olap2>%Overlap. If you choose "Both", then Olap1>%Overlap AND Olap2>%Overlap. For example, for an alignment:
    Seq1    -------------------------------
    Seq2     ----------	
    
    This will pass the filter %Overlap>=80 if "Either" is selected, but not "Both".
  4. The "Select sTCWdbs" will only be present if there are more than two sTCWdbs loaded into the mTCWdb. The rules are as follows:
    1. Select two sTCWdbs for the standard BBH of one pair per cluster.
    2. Select N (N>2) sTCWdbs, and clusters of exactly size N will be created, where each pair in the cluster is a BBH pair.
    3. Do not select any sTCWdbs, and one cluster set will be created from all pairs of sTCWdbs.
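The %Overlap filter and the bi-directional best hit rule for two sTCWdbs can be sketched in Python as follows. This is a minimal illustration, not the TCW implementation: the Hit record and function names are hypothetical, the hit list is assumed to contain only hits between sequences of different sTCWdbs, ties on e-value are broken by input order, and the filter uses >= as in the "%Overlap>=80" example.

```python
from collections import namedtuple

# Hypothetical record for one blast hit (not a TCW data structure).
Hit = namedtuple("Hit", ["query", "subject", "evalue"])


def overlap_pass(align_len, len1, len2, min_olap, mode="Either"):
    """Apply the %Overlap filter to one aligned pair."""
    olap1 = 100.0 * align_len / len1  # %Overlap relative to sequence 1
    olap2 = 100.0 * align_len / len2  # %Overlap relative to sequence 2
    if mode == "Both":
        return olap1 >= min_olap and olap2 >= min_olap
    return olap1 >= min_olap or olap2 >= min_olap


def bbh_pairs(hits):
    """Return the set of pairs that are each other's best hit by e-value."""
    best = {}
    for h in hits:
        cur = best.get(h.query)
        if cur is None or h.evalue < cur.evalue:
            best[h.query] = h
    pairs = set()
    for query, h in best.items():
        back = best.get(h.subject)
        if back is not None and back.subject == query:
            pairs.add(tuple(sorted((query, h.subject))))
    return pairs


if __name__ == "__main__":
    # The Seq1/Seq2 picture above: a short sequence fully covered by a long one.
    print(overlap_pass(10, 31, 10, 80, "Either"))
    print(overlap_pass(10, 31, 10, 80, "Both"))
```

In this sketch a pair is reported only when the best hit of each sequence is the other sequence, which is the one-pair-per-cluster case for two selected sTCWdbs.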

Closure

Closure has the following requirements:
  1. All sequences in a cluster must have a blast hit with all other sequences in the cluster.
  2. Each sequence must pass the filters with at least one other sequence in the cluster, where the filter parameters are exactly as described for BBH.
The algorithm uses the blast hits from the database.
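The two Closure requirements can be expressed as a check on a candidate cluster, sketched below in Python. This only verifies a given cluster; it does not show how TCW builds the clusters. The function name and the two input sets (pairs with any blast hit, and pairs that pass the filters) are hypothetical.

```python
from itertools import combinations


def satisfies_closure(cluster, hit_pairs, filtered_pairs):
    """Check the two Closure requirements for one candidate cluster.

    cluster        -- iterable of sequence names
    hit_pairs      -- set of frozenset({a, b}) for pairs with a blast hit
    filtered_pairs -- set of frozenset({a, b}) for pairs passing the filters
    """
    members = list(cluster)
    # Requirement 1: every sequence has a blast hit with every other sequence.
    all_hit = all(
        frozenset(pair) in hit_pairs for pair in combinations(members, 2)
    )
    # Requirement 2: every sequence passes the filters with at least one other.
    each_passes = all(
        any(frozenset((s, t)) in filtered_pairs for t in members if t != s)
        for s in members
    )
    return all_hit and each_passes


if __name__ == "__main__":
    hits = {frozenset(p) for p in [("a", "b"), ("a", "c"), ("b", "c")]}
    passed = {frozenset(p) for p in [("a", "b"), ("b", "c")]}
    print(satisfies_closure(["a", "b", "c"], hits, passed))
```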

OrthoMCL

OrthoMCL requires numerous steps to run, and uses a temporary MySQL database; TCW organizes all these details.

OrthoMCL uses the blast file blastAA.tab. It does not guarantee that all sequences in a cluster have a blast hit with each other.

OrthoMCL produces the following message, but it works anyway: Error: acquiring genes from Combined.fasta

OrthoMCL occasionally fails; every time this has happened to me, rerunning it has worked.

User-defined clusters

For this you create a file specifying the groupings, and the interface simply uploads that file. Blast results are not used. The group file has the following format:
..
D26: tra|tra_030 tra|tra_184 tra|tra_094 pro|pro_100
D27: tra|tra_045 tra|tra_209 pro|pro_011
...
Each line starts with "DN", where N is the group number, and then has a space-separated list of the sequences in the group, prefixed by the project prefix that you entered when you set up the mTCW.
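To check a group file before uploading it, the format can be parsed with a few lines of Python. This is a sketch under the format described above; the function name is hypothetical, and lines without a colon (such as blank lines) are simply skipped.

```python
def parse_cluster_lines(lines):
    """Parse 'DN: prefix|seqID prefix|seqID ...' lines.

    Returns a dict mapping the group name (e.g. 'D26') to a list of
    (prefix, seqID) tuples.
    """
    clusters = {}
    for raw in lines:
        line = raw.strip()
        if not line or ":" not in line:
            continue  # not a group line
        name, members = line.split(":", 1)
        seqs = []
        for token in members.split():
            prefix, seq_id = token.split("|", 1)
            seqs.append((prefix, seq_id))
        clusters[name.strip()] = seqs
    return clusters


if __name__ == "__main__":
    example = [
        "D26: tra|tra_030 tra|tra_184 tra|tra_094 pro|pro_100",
        "D27: tra|tra_045 tra|tra_209 pro|pro_011",
    ]
    for group, seqs in parse_cluster_lines(example).items():
        print(group, len(seqs), seqs)
```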

4. Run Stats

The statistics are broken into three sections:
  1. Run on all pairs in the database (e.g. 459):
    • The PCC (Pearson Correlation Coefficient) is only relevant if there are shared conditions, as it is used to determine how similar the RPKM values of the conditions are. It is run on all pairs in the database.
  2. This is "only" relevant for a mTCW database created from only nucleotide sTCW databases:
    • Blast pairs in cluster (e.g. 704)
      • Synonymous codons, nonsynonymous codons, %match, #gaps, GC content, etc.
      • The summary statistics shown on the Overview for "Pairs".
      • Outputs the Ka/Ks files for input into KaKs_calculator.
  3. Only if Ka/Ks input files exist.
    • Run Stats with Write selected to output the files for input to KaKs_calculator
    • Run the KaKs_calculator from a terminal window.
    • Execute Run Stats again with Read selected.
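For reference, the Pearson Correlation Coefficient of two sequences' RPKM values over their shared conditions can be computed as in the Python sketch below. This is the textbook formula, not TCW's code; the function name is hypothetical, and returning None for a constant vector (where the coefficient is undefined) is a choice made here.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists of RPKM values.

    Returns None when either list has zero variance.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical RPKM values for three shared conditions.
    seq1_rpkm = [1.0, 2.0, 3.0]
    seq2_rpkm = [2.0, 4.0, 6.0]
    print(pearson(seq1_rpkm, seq2_rpkm))
```

A value near 1 means the two sequences have similar expression profiles across the shared conditions, and a value near -1 means opposite profiles.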
After adding clusters and running stats, you can add more clusters. In order to update the stats after adding more clusters:
  1. Selecting Compute Statistics will align any new unaligned pairs in clusters and update the summary.
  2. Selecting KaKs Write will align ALL pairs in clusters and update the summary.

Details on running KaKs_calculator

After the KaKs files have been created using Run Stats:
  • Exit runMultiTCW or use a different terminal window.
  • Change directories to projcmp. Put the KaKs_calculator executable in this directory.
  • Change directories to projcmp/<project-name>/KaKs. There will be multiple files with the name oTCWn.awt, where n starts at '1'. Each has pairs of aligned sequences minus the gaps. There is also a file called runKaKs, which has the commands to run KaKs_calculator on each file, e.g.
    ../../KaKs_calculator -i oTCW1.awt -o iTCW1.xls -m YN &
    ../../KaKs_calculator -i oTCW2.awt -o iTCW2.xls -m YN &
    ../../KaKs_calculator -i oTCW3.awt -o iTCW3.xls -m YN &
    
    If you prefer to use a different method than 'YN', edit this file to change it to whatever method you want (see the KaKs_calculator documentation).
  • From the command line, type sh runKaKs.
  • Read the KaKs files:
    1. If you have exited runMultiTCW and start it back up, the label beside the pair statistics Setting will say "Read KaKs". Just select Run Stats and the files will be read.
    2. If you did not exit runMultiTCW but ran runKaKs from another terminal window, the label will still say "No action to be performed". Select Settings, select "Read KaKs files", then Keep. Now select Run Stats.
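TCW writes the runKaKs file for you, so the following Python sketch is only an illustration of the file-naming convention it uses (oTCWn.awt as input, iTCWn.xls as output). The function name is hypothetical, only the -i, -o and -m options shown above are used, and the trailing "&" of the generated script is omitted here.

```python
import re


def kaks_commands(filenames, method="YN", exe="../../KaKs_calculator"):
    """Build one KaKs_calculator command line per oTCWn.awt file name."""
    numbered = []
    for name in filenames:
        match = re.fullmatch(r"oTCW(\d+)\.awt", name)
        if match:
            numbered.append((int(match.group(1)), name))
    commands = []
    for n, name in sorted(numbered):
        commands.append("%s -i %s -o iTCW%d.xls -m %s" % (exe, name, n, method))
    return commands


if __name__ == "__main__":
    for cmd in kaks_commands(["oTCW2.awt", "oTCW1.awt", "runKaKs"]):
        print(cmd)
```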

Additional details


Timings

The following times are from the log files for building an mTCW database with three NT-sTCWdbs.
Step   Time   Added
Build Database   5h:0m:36s   138,907 sequences
Add Pairs   2h:3m:04s   454,568 pairs
Add New Clusters   1h:23m:05s   46,831 clusters
Run Stats   1h:33m:15s   116,109 alignments

The longest task is Add GOs (timing not shown); this task can be done at any time, so it is recommended to wait until everything else is finalized before adding the GOs.

Blast is run using #CPU CPUs, but all other mTCW tasks use only one CPU.

Troubleshooting

runMultiTCW is not very forgiving if datasets or cluster methods are entered wrong. It is easiest to just Remove the offending dataset or cluster and re-enter it.

A file called mTCW.error.log is created if there is an error. If it is not clear how to fix the problem, send the file to tcw@agcol.arizona.edu.

View/Query with viewMultiTCW

The clusters can be viewed in any of the following ways:
  1. Click the Launch viewMultiTCW button in the runMultiTCW interface.
  2. Execute './viewMultiTCW' and a window of existing mTCW databases will be displayed, where databases can be selected for display.
  3. Execute './viewMultiTCW <database name>', e.g. ./viewMultiTCW demo displays the window on the right.
There is Help on all the viewMultiTCW views, and Tour shows snapshots of some of the viewMultiTCW windows.

Running as an applet

viewMultiTCW can be used as an applet, as follows:
  1. The mtcw.jar file must be signed with a code-signing certificate, e.g. using Digicert (which is easy, but costs money), or your university may have an account with InCommon, which provides code-signing certificates.
  2. You need mysql-connector-java-5.0.5-bin.jar (contact tcw at agcol.arizona.edu if you would like us to provide you a copy).
  3. Modify the following code as appropriate, and put it in your cgi-bin.
		<HTML>
		<HEAD>
		<TITLE>viewMultiTCW</TITLE>
		</HEAD>
		<BODY>
		Running viewMultiTCW  (please wait)
  		<applet 
    		CODE="cmp.viewer.MTCWApplet"
        	ARCHIVE="location_of_your_jars/mtcw.jar,jars/mysql-connector-java-5.0.5-bin.jar"
    		width=0 height=0
  		MAYSCRIPT>
  		<param name="ASSEMBLY_DB" value="mTCW_your_db_name">
  		<param name="DB_URL"      value="www.your URL">
  		<param name="DB_USER"     value="your username (read-only)">
  		<param name="DB_PASS"     value="your password">
  		<hr>
  		Unable to display the viewMultiTCW applet. Please verify your Java installation.
  		<hr>
  		</applet>
		</BODY>
		</HTML>
		
Applet users may have to adjust their applet memory; see instructions in the Troubleshooting Guide.

References

  1. Li, L., Stoeckert, C.J., Jr. and Roos, D.S. (2003) OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res, 13, 2178-2189.
  2. Zhang Z, Li J, Zhao XQ, Wang J, Wong GK, Yu J (2006) KaKs_Calculator: Calculating Ka and Ks through model selection and model averaging. Geno. Prot. Bioinfo. Vol 4 No 4. 259-263.
  3. Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32: 1792-1797.
  4. UniProt Consortium (2007) The Universal Protein Resource (UniProt). Nucleic Acids Res 35: D193-197.

Email Comments To: tcw@agcol.arizona.edu