The University of Arizona
Multi-species TCW Guide

Multi-species TCW (mTCW) is the comparative module of Transcriptome Computational Workbench.

Note: familiarity with singleTCW is essential, as multiTCW projects are created by merging existing sTCW projects.

Overview

runMultiTCW
  1. Input:
    1. Two or more singleTCW databases (nucleotide or protein). The sequences, their annoDB hits, RPKM and DE p-values are imported to the multiTCW database.
    2. For nucleotide singleTCWs, input either the corresponding translated sequences (i.e. from runSingleTCW) or a SMAT file so TCW can run ESTScan [3] to generate the translated sequences.
    3. (Optional) A file of clusters.
  2. The results are best if:
    • The singleTCW databases are annotated the same.
    • The condition names are exactly the same (when applicable). For example, if two species both have counts for the tissue type 'leaf', the condition name provided in runSingleTCW must be the same for both (e.g. leaf), though the name is case-insensitive.
    • The DE column names are exactly the same (when applicable), as in the previous point.
  3. Computation:
    • Compare the sequences using BLAST [1,2].
    • Compute one or more sets of clusters using the BBH (bi-directional best hit), Transitive, OrthoMCL [4], or user-supplied clusters.
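
The naming requirement in item 2 can be checked before building the database. The following is a minimal sketch, assuming you have written down the condition names of each sTCW project as plain lists; the function names and the mixed-case demo values are illustrative and are not part of TCW.

```python
def shared_conditions(conditions_by_project):
    """Return the condition names (lower-cased) present in every project.

    Names are compared case-insensitively, as described above.
    """
    name_sets = [
        {name.lower() for name in names}
        for names in conditions_by_project.values()
    ]
    if not name_sets:
        return set()
    return set.intersection(*name_sets)


def unshared_conditions(conditions_by_project):
    """Return names found in some projects but not all.

    These are either genuinely absent conditions or possible typos.
    """
    all_names = set()
    for names in conditions_by_project.values():
        all_names.update(name.lower() for name in names)
    return all_names - shared_conditions(conditions_by_project)


# Hypothetical input: condition names typed in by hand for three projects.
demo = {
    "asm": ["Sanger", "Tip", "Zone"],
    "tra": ["tip", "Zone", "Root"],
    "pro": ["Tip", "zone", "Root"],
}
print(sorted(shared_conditions(demo)))    # ['tip', 'zone']
print(sorted(unshared_conditions(demo)))  # ['root', 'sanger']
```

Any name in the second list is worth a second look: it will not be comparable across all of the species.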

viewMultiTCW

  1. View and Query the results.
    • If the annotations are the same, then clusters can be annotated with the majority hit.
    • If the condition names are the same, then a given condition can be compared across species, i.e. the Pearson's Correlation Coefficient is computed using RPKM values of conditions with the same name.
    • If the DE names are the same, then DE values can easily be compared across datasets.
  2. Run pairwise alignment or MUSCLE [5] to view the alignment of the clusters.
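
To illustrate the correlation mentioned above, here is a minimal sketch of Pearson's Correlation Coefficient computed over the conditions that two sequences share. It assumes the RPKM values are available as dictionaries keyed by condition name; the function names and values are hypothetical, and this is not necessarily how viewMultiTCW computes it internally.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one profile is constant
    return cov / math.sqrt(var_x * var_y)


def correlate_shared(rpkm_a, rpkm_b):
    """Correlate two RPKM profiles over the condition names they share.

    Condition names are matched case-insensitively.
    """
    a = {name.lower(): value for name, value in rpkm_a.items()}
    b = {name.lower(): value for name, value in rpkm_b.items()}
    shared = sorted(set(a) & set(b))
    return pearson([a[c] for c in shared], [b[c] for c in shared])


# Hypothetical RPKM values for one sequence from each of two species.
seq_a = {"Tip": 10.0, "Zone": 20.0, "Root": 30.0}
seq_b = {"tip": 1.0, "zone": 2.0, "root": 3.0, "leaf": 9.0}
print(correlate_shared(seq_a, seq_b))  # 1.0
```

Only conditions present under the same name in both profiles contribute, which is why consistent condition names matter.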

Software Requirements and Installation

The multiTCW was installed when you installed TCW (see Installation).
Software   Version  Source    Additional info
For protein or transcript comparison:
BLAST      2.x      NCBI      Used for annotation and assembly. Legacy BLAST (and Megablast) or BLAST+ may be used.
Additional packages (supplied with TCW; see footnote 1):
OrthoMCL   2.0      OrthoMCL  Clusters orthologous transcripts. Uses Perl and MySQL, which requires DBI::mysql.
ESTScan    3.0.3    ESTScan   Extracts protein sequences from transcripts.
Muscle     3.8.31   Muscle    Multiple alignment of transcript cluster.

1. Linux binaries are supplied under the external directory and Mac binaries under the external_osx directory. You do not need to do anything special, as TCW will find the packages there if they are not in your path. We cannot guarantee the packages will work on every Linux or Mac machine; if they do not, you will need to replace them.

Running the Demo

This section is essential for learning how to use runMultiTCW, and describes how to make a multiTCW project starting from the three singleTCW demos which are included in the package.

Using runSingleTCW, create the pre-configured sTCW_demoTra, sTCW_demoPro and sTCW_demoAsm, i.e.

  1. Select the project, execute Load Data, Instantiate and Annotate Sequences.
  2. (Optional) All three have the conditions Tip and Zone, and demoPro and demoTra have condition Root. Using runDE, compute pairwise differential expression.
To create the multiTCW project, start by running the mTCW Manager,
./runMultiTCW
This brings up the Manager interface, shown at right.

Click Add Project to create a new project. Enter "demo" in the entry box, and click ok. The Manager interface will have DB Name filled in as mTCW_demo. This will be the MySQL database name.

 

Step 1: Click the Add button next to Single TCW databases. This brings up the sTCW Selection dialog on the lower right.

Clicking Select sTCW Database produces a popup of existing sTCW databases (you will have to click on the server name first, probably "localhost"). Choose sTCW_demoTra.

MultiTCW uses proteins for its alignments, hence you will need to generate the amino acid sequences. For this demo:

  • Select Generate protein file
  • Select "..." and the file chooser will take you to the external/ESTScan/smat directory, which contains the file embr.smat -- select it.

Alternatively:

  • Select Use the existing protein file.
  • Select "...". With the file chooser, go to
       projects/demoTra/ORF/ORFbestTranslated.fasta
    and select it.
If you use this option, the results will come out slightly different from those below.
Follow the same process to add the demoAsm and demoPro projects, though demoPro is a protein database, so no SMAT file or translated file is necessary. The resulting Manager interface is shown on the right.

Select Build Database to create the mTCW database. This takes a few minutes as sequences and annotation are transferred from the sTCW databases to the new mTCW database, and ESTScan is run to create the amino acid sequences.

Note: only a limited number of hits are transferred for each sequence, generally the top 3 hits, but the 'Best Annotation' hit is always transferred.

Many messages print to the console during loading, ending with a summary, shown below:

	
Project: demo    multiTCW 1.6.8

Sequence datasets: 3  Created 03-Jan-17
             #aaSeq  #ntSeq  #annoSeq  #annoDB  Created    Remark           
      asm       104     104       102        7  03-Jan-17  assembly demo    
      tra       211     211       208        7  03-Jan-17  transcript demo  
      pro       128       0       128        7  03-Jan-17  protein demo     
      TOTAL     443     315       438           

Associated Counts  
           Sanger      Tip     Zone     Root  
      asm      98  120,361   45,776       --  
      tra      --  473,150  210,795  971,049  
      pro      --   22,235   21,355   12,355  

Differential Expression (number with p-value < 0.05: 
           TiZo  RoTi  RoZo  
      asm     0     0     0  
      tra     1    41    52  
      pro    31    37     6  
	
Complete mTCW_demo at  03-Jan-17 19:34:42 Elapse time  0m:07s
Step 2: Select Run Blast. This step blasts the amino acid sequences against each other to find the basic similarities used for clustering.
Step 3: You can create multiple sets of clusters with different methods or parameters.

Click Add in section "Cluster Methods" to add a new clustering method; this brings up the Method dialog, shown at right.

  1. Select Add; BBH is the default method, so select Keep. Control returns to the main panel.
  2. Select Add again, select Transitive in the dropdown beside Method, and select Keep.
  3. Select Add again, select OrthoMCL, and select Keep.
Select the Add New Clusters button to perform all clustering. It runs OrthoMCL, which is supplied with TCW, and the built-in transitive method. Status messages are printed to the console, ending with the summary shown below.
   Sizes: 
      Method  =2  3-5  6-10  11-20  21-30  31-40  41-50  >50  Total  #Seqs  
      BB      62    0     0      0      0      0      0    0     62    124  
      TR      59    9     0      0      0      0      0    0     68    145  
      OM      55   17    11      4      2      0      1    0     90    413  

Complete adding 3 methods for mTCW_demo at  03-Jan-17 19:41:47 Elapse time  0m:03s
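
The Sizes table bins the clusters of each method by the number of sequences they contain. The following sketch shows how such a row could be tallied from a list of cluster sizes. The bin edges are copied from the column headers; the function name is hypothetical, and treating #Seqs as the sum of all cluster sizes is an assumption (it is consistent with the BB row, where 62 clusters of size 2 give 124).

```python
from collections import Counter

# (label, smallest size, largest size); None means no upper limit.
BINS = [
    ("=2", 2, 2), ("3-5", 3, 5), ("6-10", 6, 10), ("11-20", 11, 20),
    ("21-30", 21, 30), ("31-40", 31, 40), ("41-50", 41, 50),
    (">50", 51, None),
]


def size_histogram(cluster_sizes):
    """Tally cluster sizes into the bins used by the summary table."""
    sizes = [s for s in cluster_sizes if s >= 2]  # a cluster has >= 2 members
    counts = Counter()
    for size in sizes:
        for label, low, high in BINS:
            if size >= low and (high is None or size <= high):
                counts[label] += 1
                break
    row = {label: counts[label] for label, _, _ in BINS}
    row["Total"] = len(sizes)
    row["#Seqs"] = sum(sizes)
    return row


# 59 clusters of size 2 and 9 of size 3 reproduce the TR row above.
tr_row = size_histogram([2] * 59 + [3] * 9)
print(tr_row["=2"], tr_row["3-5"], tr_row["Total"], tr_row["#Seqs"])
# 59 9 68 145
```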

				
You can add more clusters using different parameters: select Add, change the parameters, make the prefix unique, and select Keep; then select 'Add New Clusters'. You may also remove clusters by selecting the name in the cluster table and then selecting Remove; the first Remove deletes the clusters from the database, and a second Remove deletes the entry from the table.

Creating a project

Building a multiTCW database is as simple as creating the demo above. The following are some details:

Transcripts to proteins

If your sTCW databases were created with protein sequences, skip this section.
  1. You may supply your own protein sequences, where the sequence identifiers (i.e. on the ">" description line) must correspond to the sequence identifiers in the sTCW database. runSingleTCW produces a file in the project directory called ORFbestTranslated.fasta, which can be used as input (see TCW ORF finder).

  2. You may have runMultiTCW run ESTScan, but you must supply the SMAT file (or use the existing one, which is very old). Go to estscan.sourceforge.net for the source code and instructions for building a SMAT file.
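
If you supply your own protein file (option 1), it can help to confirm beforehand that its identifiers match those in the sTCW database. The following is a minimal sketch; it assumes the identifier is the first whitespace-delimited token after ">" and that you have exported the sTCW sequence identifiers to a list yourself. The function names and example identifiers are hypothetical.

```python
import io


def fasta_ids(handle):
    """Yield the first token after '>' from each FASTA description line."""
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                yield fields[0]


def compare_ids(fasta_handle, db_ids):
    """Return (IDs only in the FASTA file, IDs only in the database)."""
    found = set(fasta_ids(fasta_handle))
    expected = set(db_ids)
    return sorted(found - expected), sorted(expected - found)


# Hypothetical protein file and database identifier list.
protein_file = io.StringIO(">tra_001 type=best\nMKV\n>tra_003\nMAA\n")
only_in_file, only_in_db = compare_ids(protein_file, ["tra_001", "tra_002"])
print(only_in_file)  # ['tra_003']
print(only_in_db)    # ['tra_002']
```

For a real file, pass `open("your_file.fasta")` instead of the `io.StringIO` object. Two empty lists mean the identifiers correspond one to one.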

Blast Settings

Using the Settings, you may request that it run a self-blast with the nucleotide sequences in the database; these can be used in the BBH or Transitive algorithm.

You may request that it filter the Blast file to remove near-identical sequences. The Setting "Help" page provides more detail.

Clustering Methods

The Help page for clustering provides more detail, but the following is an overview.

BBH

The BBH method finds the reciprocal best match based on blast e-value. Hence, it makes clusters strictly of size 2.
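
The reciprocal best match idea can be sketched as follows. This is an illustration of the general technique, not TCW's implementation: it assumes the blast hits have been reduced to (query, subject, e-value) tuples, keeps the first hit seen on ties, and does not require the two sequences to come from different datasets. All names and values are hypothetical.

```python
def best_hits(hits):
    """Map each query to its lowest e-value subject, ignoring self hits."""
    best = {}
    for query, subject, evalue in hits:
        if query == subject:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return {query: subject for query, (subject, _) in best.items()}


def bbh_pairs(hits):
    """Return sorted pairs (a, b) where a's best hit is b and b's is a."""
    best = best_hits(hits)
    pairs = set()
    for a, b in best.items():
        if best.get(b) == a:
            pairs.add(tuple(sorted((a, b))))
    return sorted(pairs)


# Hypothetical hits: (query, subject, e-value).
hits = [
    ("tra_1", "pro_1", 1e-50),
    ("tra_1", "pro_2", 1e-20),
    ("pro_1", "tra_1", 1e-48),
    ("pro_2", "tra_1", 1e-20),
    ("pro_2", "tra_2", 1e-30),
    ("tra_2", "pro_2", 1e-10),
]
print(bbh_pairs(hits))
# [('pro_1', 'tra_1'), ('pro_2', 'tra_2')]
```

Because each sequence has exactly one best hit, a sequence can belong to at most one reciprocal pair, which is why every cluster has size 2.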

Transitive

Transitive builds nearest-neighbor clusters (i.e., "transitive closure"), based on blast alignment parameters such as %similarity and bases of overlap.
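
Transitive closure means that if A matches B and B matches C, then A, B and C land in the same cluster even if A and C do not match directly. A minimal union-find sketch of the idea is shown below. The thresholds, the tuple layout (query, subject, %similarity, overlap) and the function name are assumptions for illustration; see the clustering Help page for the parameters TCW actually uses.

```python
def transitive_clusters(hits, min_similarity=80.0, min_overlap=100):
    """Group sequences linked by any chain of hits that pass the cutoffs."""
    parent = {}

    def find(x):
        # Find the representative of x, shortening the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for query, subject, similarity, overlap in hits:
        passes = similarity >= min_similarity and overlap >= min_overlap
        if query != subject and passes:
            union(query, subject)

    groups = {}
    for member in list(parent):
        groups.setdefault(find(member), []).append(member)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical hits: (query, subject, %similarity, overlap).
example_hits = [
    ("a", "b", 90.0, 150),
    ("b", "c", 85.0, 120),
    ("c", "d", 60.0, 200),  # fails the similarity cutoff
    ("e", "f", 95.0, 50),   # fails the overlap cutoff
    ("g", "h", 99.0, 300),
]
print(transitive_clusters(example_hits))
# [['a', 'b', 'c'], ['g', 'h']]
```

Unlike BBH, the clusters can grow to any size, so stricter cutoffs are the main way to keep unrelated sequences from being chained together.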

OrthoMCL

OrthoMCL requires numerous steps to run, and uses a temporary MySQL database; TCW organizes all these details.

OrthoMCL produces the following message, but it works anyway:

Error: acquiring genes from Combined.fasta

User-defined clusters

For this you create a file specifying the groupings, and the interface simply uploads that file. Blast results are not used. The group file has the following format:
...
D26: tra|tra_030 tra|tra_184 tra|tra_094 pro|pro_100
D27: tra|tra_045 tra|tra_209 pro|pro_011
...
Each line starts with "DN:", where N is the group number, followed by a space-separated list of the sequences in the group, each prefixed by the project prefix that you entered when you set up the mTCW (e.g. tra|tra_030).
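
Before uploading a group file, it can be useful to check that it parses as expected. The sketch below reads the format shown above, taking the colon after the group name and the "|" between prefix and identifier from the example lines. The function name is hypothetical, and the checks are only those implied by the example.

```python
import io


def parse_cluster_file(handle):
    """Parse group lines into {group name: [(prefix, sequence id), ...]}."""
    clusters = {}
    for line_number, line in enumerate(handle, start=1):
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blank lines and anything that is not a group
        label, _, rest = line.partition(":")
        label = label.strip()
        if not (label.startswith("D") and label[1:].isdigit()):
            raise ValueError(f"line {line_number}: bad group name {label!r}")
        members = []
        for token in rest.split():
            prefix, separator, seq_id = token.partition("|")
            if not separator or not prefix or not seq_id:
                raise ValueError(f"line {line_number}: bad member {token!r}")
            members.append((prefix, seq_id))
        clusters[label] = members
    return clusters


# The two example lines from the format description above.
sample = io.StringIO(
    "D26: tra|tra_030 tra|tra_184 tra|tra_094 pro|pro_100\n"
    "D27: tra|tra_045 tra|tra_209 pro|pro_011\n"
)
groups = parse_cluster_file(sample)
print(len(groups["D26"]), len(groups["D27"]))  # 4 3
print(groups["D27"][2])                        # ('pro', 'pro_011')
```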

Directory structure

A directory called projcmp/<project name> is created for the mTCW project. A file called mTCW.cfg is also created; it contains all the information about the project and is reloaded into runMultiTCW when you select the project.

Troubleshooting

runMultiTCW is not very forgiving if datasets or cluster methods are entered incorrectly. It's easiest to just Remove the offending dataset or cluster and re-enter it.

A file called mTCW.error.log is created if there is an error. If it's not clear how to fix the problem, send the file to tcw@agcol.arizona.edu.

View/Query with viewMultiTCW

The clusters can be viewed by either:
  1. Click the Launch viewMultiTCW button in the runMultiTCW interface.
  2. Execute './viewMultiTCW' and a window of existing mTCW databases will be displayed, where databases can be selected for display.
  3. Execute './viewMultiTCW <database name>', e.g. ./viewMultiTCW demo displays the window on the right.
There is Help on all the viewMultiTCW views, and Tour shows snapshots of some of the viewMultiTCW windows.

Running as an applet

The following can be used to display viewMultiTCW on the web with a given mTCW database:

    <HTML>
    <HEAD>
    <TITLE>viewMultiTCW</TITLE>
    </HEAD>
    <BODY>
    Running viewMultiTCW (please wait)
    <applet
        CODE="cmp.viewer.MTCWApplet"
        ARCHIVE="location_of_your_jars/mtcw.jar,jars/mysql-connector-java-5.0.5-bin.jar"
        width=0 height=0
        MAYSCRIPT>
      <param name="ASSEMBLY_DB" value="mTCW_your_db_name">
      <param name="DB_URL"      value="www.your URL">
      <param name="DB_USER"     value="your username (read-only)">
      <param name="DB_PASS"     value="your password">
      <hr>
      Unable to display the viewMultiTCW applet. Please verify your Java installation.
      <hr>
    </applet>
    </BODY>
    </HTML>

Applet users may have to adjust their applet memory; see instructions in the Troubleshooting Guide.

References

  1. Zhang, Z., Schwartz, S., Wagner, L. and Miller, W. (2000) A greedy algorithm for aligning DNA sequences. J Comput Biol 7:203-214.
  2. Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K. and Madden, T. (2009) BLAST+: architecture and applications. BMC Bioinformatics 10:421.
  3. Iseli, C., Jongeneel, C.V. and Bucher, P. (1999) ESTScan: a program for detecting, evaluating, and reconstructing potential coding regions in EST sequences. Proc Int Conf Intell Syst Mol Biol, 138-148.
  4. Li, L., Stoeckert, C.J., Jr. and Roos, D.S. (2003) OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 13:2178-2189.
  5. Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32:1792-1797.
  6. UniProt Consortium (2007) The Universal Protein Resource (UniProt). Nucleic Acids Res 35:D193-197.

Email Comments To: tcw@agcol.arizona.edu