runMultiTCW - Build Comparison Database from Multiple sTCWdbs
runMultiTCW takes as input two or more singleTCW (sTCW) databases
and builds a multiTCW (mTCW) database of clustered sequences.
See mTCW UserGuide for details.
1. sTCWdbs (single TCW databases).
Add/Edit: Define the input sTCW databases
(see panel #1), which can be created from
nucleotide sequences (NT-sTCW) and/or protein sequences (AA-sTCW).
For NT-sTCW, the nucleotide sequences and translated ORFs will be loaded.
Build database: Builds a database of all sequences, TPM, DE, annotations and GOs.
2. Compare sequences.
Run Search: Creates a file of all sequences and searches it against itself with a heuristic search program [1]
to find similar sequences. The search program and parameters can be changed (see panel #2).
Add Pair from Hits: All pairs from the Hit file are entered into the database.
3. Cluster Sets.
Add/Edit: Add a cluster type to be computed (see panel #3), where the methods are:
- BBH - TCW algorithm for Bi-directional Best Hit between N sTCWdbs, where each resulting cluster has N sequences and all pairs are BBH.
- Closure - TCW algorithm for determining clusters, where each sequence in a cluster has
a hit, good similarity and good overlap with all other sequences in the cluster.
- Best Hit - TCW algorithm for clustering on Hit ID or Hit Descriptions.
- OrthoMCL [2] - runMultiTCW executes the
OrthoMCL scripts and then loads the results into the mTCW database.
- User Defined - Loads a file of clusters in the OrthoMCL format.
Add New Cluster: The clusters defined in the table will be computed and added to the database.
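The BBH idea listed above can be illustrated with a minimal sketch for the two-dataset case. This is not the TCW implementation: the tuple layout of the hits, the "dataset|name" ID convention, and the function names are assumptions made for illustration, and the Sim/Cov filters are left out.

```python
def best_hits(hits):
    """Map (query, target dataset) to the best-scoring hit in that dataset.

    `hits` is an iterable of (query_id, target_id, bit_score) tuples.
    IDs are assumed (hypothetically) to look like "dataset|name".
    """
    best = {}
    for query, target, score in hits:
        q_set, t_set = query.split("|")[0], target.split("|")[0]
        if q_set == t_set:
            continue  # only hits between different datasets are considered
        key = (query, t_set)
        # Keep the first hit seen on ties; replace only on a strictly better score.
        if key not in best or score > best[key][1]:
            best[key] = (target, score)
    return best


def bbh_pairs(hits):
    """Return the set of sequence pairs that are each other's best hit."""
    best = best_hits(hits)
    pairs = set()
    for (query, _t_set), (target, _score) in best.items():
        q_set = query.split("|")[0]
        back = best.get((target, q_set))
        if back is not None and back[0] == query:
            pairs.add(tuple(sorted((query, target))))
    return pairs


# Hypothetical hits between two datasets named "bar" and "foo".
hits = [
    ("bar|b1", "foo|f1", 200.0),
    ("bar|b1", "foo|f2", 150.0),
    ("foo|f1", "bar|b1", 200.0),
    ("foo|f2", "bar|b1", 150.0),
    ("foo|f2", "bar|b2", 90.0),
    ("bar|b2", "foo|f1", 80.0),
]
print(bbh_pairs(hits))
```

For N datasets, a cluster in this sense would need every pair of its N sequences to appear in the returned set.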
Add Stats: Add statistics (see panel #4), as follows.
- PCC (Pearson Correlation Coefficient): This is only relevant if there are shared conditions,
as it is used to determine how similar the TPM values of the conditions are.
It is run on all pairs in the database.
- The following are only relevant for mTCW databases built exclusively from
NT-sTCW databases, as they are based on the aligned nucleotide coding regions:
  - Each pair that has a hit and is in a cluster is pairwise aligned.
  - Statistics are computed from the alignment, such as synonymous and nonsynonymous codons, ts/tv, etc.
  (see, e.g., the summary below).
  - The KaKs files are written from the alignments for input to
  KaKs_Calculator [3], along with a script to run it from the terminal.
  - Read KaKs files: Only relevant after KaKs_Calculator has been
  run on the KaKs files. It reads the results into the database.
- Compute the MSA for all clusters and score them. The MAFFT [4] program is used; if it
fails on a cluster, MUSCLE [5] is used instead.
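As a rough illustration of the PCC statistic listed above, the following sketch computes a Pearson correlation between the TPM values of two sequences over their shared conditions. The function name and the example values are hypothetical, and how TCW handles constant or missing values is not stated in this document; returning NaN for a constant vector is an assumption.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # assumption: correlation is undefined for constant TPM
    return cov / math.sqrt(var_x * var_y)


# Hypothetical TPM values of one pair over three shared conditions.
tpm_seq1 = [1.0, 2.0, 3.0]
tpm_seq2 = [2.0, 4.0, 6.0]
print(pearson(tpm_seq1, tpm_seq2))
```

A value near 1 means the two sequences rise and fall together across the shared conditions; a value near -1 means they move in opposite directions.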
Panel #1: Add single TCW database
Panel #2: Run Search
Panel #3: Add a cluster method
Panel #4: Add statistics. The counts on the bottom are updated after Run Stats is executed.
Project: ex  Cluster: 868  Pairs: 997  Seqs: 707  Hits: 2.1k  PCC Stats KaKs Multi
Created: 24-Oct-20 v3.1.0  Last Update: 26-Oct-20 v3.1.0

      Type  #Seq  #annotated  #annoDB  Created    Remark
bar   NT     250         244        2  05-Oct-20  exBar
foo   NT     250         245        2  05-Oct-20  exFoo
fly   NT     207         205        2  18-Oct-20  exFly

CLUSTER SETS: 4
Prefix  Method        Parameters
BB      BBH           Sim 60; Cov 40(Both); bar,fly,foo
HT      BestHit       Description; Sim 20; Cov 50
CL      Closure       Sim 60; Cov 40(Both)
OM      User Defined  ./projcmp/ex/orthoMCL.OM-40

Prefix   =2   =3  4-5  6-10  11-15  16-20  21-25  >25  Total  #Seqs
BB        0  155    0     0      0      0      0    0    155  65.8%
HT       74  147    4     0      0      0      0    0    225  86.0%
CL       90  153    1     1      0      0      0    0    245  91.9%
OM       51  186    3     3      0      0      0    0    243  98.3%

Prefix  conLen  sdLen  Score    SD  Trident    SD
BB      535.55  56.94  11.04  3.44     0.75  0.17
HT      599.68  80.10   8.08  5.24     0.71  0.22
CL      569.77  57.45   8.48  5.36     0.74  0.18
OM      596.01  84.14   8.80  6.69     0.68  0.22

AA  Diff 860  Same 110  Similarity 69.5%  Coverage 82.8%
NT  Diff 661  Same   2  Similarity 86.9%  Coverage 64.0%

Aligned: 706  CDS: 1.0Mb  5UTR: 80.3kb  3UTR: 111.0kb

Codons         306.6k   Amino Acids               Nucleotides
Exact           58.6%   Exact             87.8%   CDS Diff   25.4%
Synonymous      29.2%   Substitution >0    5.8%   Gaps        9.3%
Fourfold        15.2%   Substitution <=0   6.4%   SNPs       16.1%
Twofold         11.4%                             5UTR Diff  34.4%
Nonsynonymous   12.2%                             3UTR Diff  36.7%

                Pos1   Pos2   Pos3  Total                GC  CpG-Nt  CpG-Cd
Transition      9.1%   4.6%  36.9%  50.5%   Both      37.7%    3.9%    2.2%
Transversion   11.3%   7.2%  30.9%  49.5%   Either    54.3%   11.3%    5.8%
ts/tv           0.80   0.64   1.19   1.02   Jaccard    0.69    0.35    0.38

KaKs method: YN  Pairs: 706
Average          Ka/Ks        Quartiles             P-value
Ka       0.085   Zero    51   Q1(Lower)   0.01670   <1E-100  428
Ks       2.527   Ka=Ks    0   Q2(Median)  0.03763   <1E-10    88
P-value  0.023   Ka<Ks  650   Q3(Upper)   0.09073   <0.001    57
                 Ka>Ks    5                         Other    133
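
The Transition, Transversion and ts/tv rows in the summary above can be understood with a small sketch that counts both substitution types in one pairwise alignment. It assumes two aligned nucleotide strings of equal length with "-" for gaps; the function names are hypothetical, and this is not the TCW code, which also breaks the counts down by codon position.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def ts_tv_counts(seq1, seq2):
    """Count transitions and transversions between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    ts = tv = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        # Skip matches, gaps and ambiguous bases.
        if a == b or a not in "ACGT" or b not in "ACGT":
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            ts += 1  # A<->G or C<->T
        else:
            tv += 1  # purine <-> pyrimidine
    return ts, tv


def ts_tv_ratio(seq1, seq2):
    """Return ts/tv, or NaN when there are no transversions."""
    ts, tv = ts_tv_counts(seq1, seq2)
    return ts / tv if tv else float("nan")


# Hypothetical aligned pair: three transitions and one transversion.
print(ts_tv_counts("ATGGCA", "ACGATT"))
print(ts_tv_ratio("ATGGCA", "ACGATT"))
```

In the summary, the same ratio is reported from percentages, for example 50.5% / 49.5% = 1.02 in the Total column.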

1. Supported search programs: any of the following programs can be used for the AA search, and blastn is used for the NT search.
   - Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389-3402.
   - Buchfink B, Xie C, Huson DH (2015) Fast and sensitive protein alignment using DIAMOND. Nat Methods 12: 59-60. doi:10.1038/nmeth.3176.
2. Li L, Stoeckert CJ Jr, Roos DS (2003) OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 13: 2178-2189.
3. Zhang Z, Li J, Zhao XQ, Wang J, Wong GK, Yu J (2006) KaKs_Calculator: calculating Ka and Ks through model selection and model averaging. Genomics Proteomics Bioinformatics 4: 259-263.
4. Katoh K, Standley DM (2013) MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol 30: 772-780.
5. Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32: 1792-1797.