The University of Arizona
Searching via TCW
runSingleTCW can use blast+, usearch or diamond for annotation.

Contents:

  1. Using different search programs in TCW
  2. Timings and results
  3. Search program differences

Using different search programs in TCW

  1. Download (blast is required, the other two are optional):
     
  2. Set the path to each search program in the HOSTS.cfg file (see HOSTS.cfg for more detail):
    blast_path =
    diamond_path =
    usearch_path =
    
  3. Execute runSingleTCW.
When you add an annoDB (i.e., a database to search against, such as UniProt), you can select the search program to use. Only the programs specified in the HOSTS.cfg file are listed. When more than one search program is available, they are listed along with TCW Select.
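As an illustration of how the listed programs follow from HOSTS.cfg, here is a minimal Python sketch that reports which of the three path entries are filled in. It is not TCW's own code: the function name `configured_programs` and the example paths are hypothetical, and it assumes the file consists of simple `key = value` lines in which an empty value means the program is not configured.

```python
# Hypothetical helper, not part of TCW: reports which search programs
# have a non-empty path in HOSTS.cfg-style text.
PATH_KEYS = {
    "blast_path": "blast+",
    "diamond_path": "diamond",
    "usearch_path": "usearch",
}

def configured_programs(cfg_text):
    """Return the programs whose path entry is present and non-empty."""
    found = []
    for line in cfg_text.splitlines():
        if "=" not in line:
            continue  # skip blank lines and anything that is not key = value
        key, value = (part.strip() for part in line.split("=", 1))
        if key in PATH_KEYS and value:
            found.append(PATH_KEYS[key])
    return found

# Example paths are made up for illustration only.
example = """
blast_path = /usr/local/ncbi/blast/bin
diamond_path = /usr/local/bin
usearch_path =
"""
print(configured_programs(example))  # ['blast+', 'diamond']
```

With this example input, only blast+ and diamond would be offered, because usearch_path is empty.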

TCW Select will automatically select the program to use according to the following rules:

  1. SwissProt -- uses blast+ because we want every possible hit.
  2. TrEMBL -- these databases have grown so large that a blast+ search can take weeks; hence, a fast program is used according to rule 4.
  3. For any other annoDB type: blast+ will be used if the database is <1Gb, else the fast program will be used.
  4. Fast program: If diamond is available, it takes precedence over usearch since it performs 6-frame translation whereas usearch uses ORFs.
To be clear, you can override the TCW automatic selection by manually selecting the program from the runSingleTCW interface, and it can be selected separately for each annoDB, as shown in the figures.
AnnoDB table in runSingleTCW interface

AnnoDB add/edit interface
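The four TCW Select rules listed above can be sketched in Python as follows. This is a reading of the rules as written, not TCW's actual implementation. The names `tcw_select` and `fast_program` are hypothetical, "1Gb" is assumed to mean 10^9 bytes, and the fallback to blast+ when neither fast program is configured is an assumption, since the rules do not cover that case.

```python
ONE_GB = 10**9  # assumption: "1Gb" in rule 3 is read as 10^9 bytes

def fast_program(available):
    """Rule 4: diamond takes precedence over usearch."""
    if "diamond" in available:
        return "diamond"
    if "usearch" in available:
        return "usearch"
    # Assumption: the rules do not say what happens when no fast
    # program is configured; this sketch falls back to blast+.
    return "blast+"

def tcw_select(anno_type, db_size_bytes, available):
    """Pick a search program for one annoDB following rules 1-4."""
    kind = anno_type.lower()
    if kind == "swissprot":        # rule 1: every possible hit is wanted
        return "blast+"
    if kind == "trembl":           # rule 2: too large for blast+
        return fast_program(available)
    if db_size_bytes < ONE_GB:     # rule 3: small databases use blast+
        return "blast+"
    return fast_program(available)

programs = {"blast+", "diamond", "usearch"}
print(tcw_select("SwissProt", 5 * ONE_GB, programs))  # blast+
print(tcw_select("TrEMBL", 2 * ONE_GB, programs))     # diamond
print(tcw_select("other", ONE_GB // 2, programs))     # blast+
```

A manual selection in the runSingleTCW interface would simply bypass a function like this for the annoDB in question.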

Timings and results

The following results use the TCW defaults and 16 CPUs.
2921 proteins against the 19M SwissProt plants
Program   Time [1]   Unique   Total    Sim>=40   Sim<40
Blast+    5m21s      11,634   25,382   16,326    9,056
Diamond   3s         10,679   21,225   15,551    5,674
Usearch   5s         12,877   34,412   21,444    12,968

26,685 transcripts against 19M SwissProt plants
Program   Time [1]   Unique   Total         Sim>=40   Sim<40
Blast+    2hr4m2s    23,465   227,025       135,955   91,070
Diamond   33s        23,449   203,219       131,127   72,092
Usearch   34s        26,174   595,494 [2]   222,806   372,688
[1] The times do not include formatting.
    For the SwissProt plants, each of these programs takes less than 3 seconds to format the database.
    For the 1.7Gb TrEMBL plants, diamond takes 1m30s and blast+ takes 3m34s, while usearch runs out of memory due to its 32-bit limit.
[2] For blast+, -max_target_seqs 25 was used to limit the output; an equivalent option was not used for usearch.
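To put the timings in perspective, the following Python sketch converts the reported times to seconds and computes how many times faster diamond and usearch were than blast+ on each dataset. The times are copied from the tables above; the helper names `to_seconds` and `speedup` are made up for this example. As the footnotes say, formatting time is excluded and the usearch output was not limited in the same way as the blast+ output, so the ratios are only a rough guide.

```python
import re

def to_seconds(text):
    """Convert a time such as '2hr4m2s' or '33s' to seconds."""
    units = {"hr": 3600, "m": 60, "s": 1}
    return sum(int(n) * units[u] for n, u in re.findall(r"(\d+)(hr|m|s)", text))

# Search times as reported in the two tables (16 CPUs, TCW defaults).
TIMINGS = {
    "proteins": {"blast+": "5m21s", "diamond": "3s", "usearch": "5s"},
    "transcripts": {"blast+": "2hr4m2s", "diamond": "33s", "usearch": "34s"},
}

def speedup(dataset, program):
    """Ratio of the blast+ time to the given program's time."""
    times = TIMINGS[dataset]
    return to_seconds(times["blast+"]) / to_seconds(times[program])

for dataset in TIMINGS:
    for program in ("diamond", "usearch"):
        print(dataset, program, round(speedup(dataset, program), 1))
# proteins diamond 107.0
# proteins usearch 64.2
# transcripts diamond 225.5
# transcripts usearch 218.9
```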

Search program differences as of 24May15

  • The 32-bit version of usearch is free; however, it exceeds its memory limit on the TrEMBL databases, e.g. the 1.7Gb plant TrEMBL. A 64-bit version is available for a cost.
  • Diamond can take zipped databases as input; the other two cannot.
  • Usearch does not need a separate step for formatting the database, though formatting is a good idea for large databases. The other two must format the database.
  • It is not necessary to specify the database type for usearch; it is necessary for the other two.
  • Blast and usearch find hits in the gray zone, i.e. similarity <40%; diamond does not find most of these.
  • Blast provides tblastx (6 frame translated nucleotide against 6 frame translated nucleotide) and tblastn (protein against 6 frame translated nucleotide database), whereas the other two programs do not.
  • Using Nt=nucleotide, Pr=protein, Tr=translated nucleotide:
    Action              Blast+   Diamond   Usearch
    Pr to Pr (blastp)   Yes      Yes       Yes
    Tr to Pr (blastx)   Yes      Yes       Yes [1]
    Nt to Nt (blastn)   Yes      No        No [2]

    [1] Usearch computes all ORFs and translates them for alignment.
    [2] The strand must be specified, so this is not offered in TCW at this time.
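The search modes described in this list can also be written as a small lookup table, sketched here in Python. The table reflects what TCW offers as of the date above, not everything each program can do (usearch Nt to Nt is marked unavailable only because TCW does not offer it). The names `SUPPORT` and `programs_for` are hypothetical.

```python
# Which programs TCW offers for each search mode, per the list above.
SUPPORT = {
    "blastp":  {"blast+": True, "diamond": True,  "usearch": True},
    "blastx":  {"blast+": True, "diamond": True,  "usearch": True},   # usearch aligns translated ORFs
    "blastn":  {"blast+": True, "diamond": False, "usearch": False},  # usearch: not offered in TCW
    "tblastx": {"blast+": True, "diamond": False, "usearch": False},
    "tblastn": {"blast+": True, "diamond": False, "usearch": False},
}

def programs_for(action):
    """Return the programs available in TCW for a search mode."""
    return [name for name, ok in SUPPORT[action].items() if ok]

print(programs_for("blastx"))  # ['blast+', 'diamond', 'usearch']
print(programs_for("blastn"))  # ['blast+']
```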


Email Comments To: tcw@agcol.arizona.edu