TCW ORF finder
This document discusses the |
TCW ORF finder that is released with
TCW v2.5 (12 May 2018).
runSingleTCW - Running the ORF finder
- ORF finder algorithm
- Running from the ORF finder only
- Output file
viewSingleTCW - Viewing the ORFs
TCW ORF finder and
- Explanation of summary output
TCW has programs to build and view a single species (singleTCW), and to build and view
the orthologs of multiple species (multiTCW). For the comparison, it is necessary to have
the protein sequences, i.e. translated Open Reading Frame (ORF).
TCW annotation step computes the best ORF for each sequence using the annotation hits,
5th-order Markov model2,3 and length of candidate ORF. It outputs a protein sequence for
runSingleTCW - Running the ORF finder
|Go to top|
runSingleTCW has an Option menu that allows the user to
set the GO, ORF and similarity options for annotations. The section on the right
are the parameters for ORF finding. The algorithm and parameters are discussed in the next section.
This ORF finder algorithm was designed to find the 'best' ORF from de novo assembled
transcripts for input into
runMultiTCW. However, a file of all the 'good' ORFs is output
(see Output files).
Figure 1: The
runSingleTCW ORF finder options.
The algorithm is:
1. For each of the 6 frames:
Find the best ORF as follows:
If the frame computes an ORF using Rule 1, no further ORFs are found.
Otherwise, find all possible ORFs and select the best using Rule 2 and Rule 3.
2. From the 6 ORFs found in Step 1, find the best ORF using Rules 1, 2 and 3.
Rule 1: Use best hit frame if E-value < [default 1E-30] or has good Markov score
If this rule succeeds, the ORF found in the best frame is the selected one. See Additional heuristics for information on assigning coordinates.
Rule 2: Else use the longest ORF frame if the log length ratio > [default 0.3].
If the lengths of two candidate ORFs are very different, than it is best to
use the longest ORF. To determine if their lengths are 'very' different,
the log length ratio is used. For example:
In Table 1 on the right, the 0.1, 0.2 and 0.3 are Log Length Ratio cutoffs.
A value of 'false' indicates that the longest ORF will be selected.
A value of 'true' indicates that Rule 3 will be used.
Table 1. Length differences
Rule 3: Else use the best Markov score.
is the 5th-order Markov model as computed by
Transcoder3 and originally used
GeneID2. It is trained using either of the following:
- Train with hits that have E-value < [default 1E-75]. Minimum Set [default 50]
There must be at least Minimum Set sequences that pass the E-value
cutoff or no training set is created.
The training set is created from the exact nucleotide regions of blast4 alignments.
- Train with CDS file [default none]
The file name of a fasta file of nucleotide coding sequences.
Note: Its rare to use Rule 3 since generally a sequence has a good hit or is a sufficiently long
ORF. However, the scores are interesting to view in viewSingleTCW, where the
codon usage score is also shown (trained with the same set).
Additional heuristics: These rules were developed by studying the results.
An ORF must be greater than 30bp unless the sequence length is less than 30bp, then it must
be at least 9bps.
For Rule 1:
For Rules 2 and 3:
- If the hit frame is used, the coordinates are set as follows:
- If the hit E-value<1E-100 and the hit coverage is >95%,
search for the first Start and Stop codons within
the first 40 upstream and downstream codons, respectively. If no Start is found, the hit start coordinate
will be used, and if no Stop is found, the hit end coordinate will be used.
- Otherwise, search to the ends for the first Start and Stop codons.
- In either case, the extension will not pass through a string of >8 consecutive N's.
- The hit frame will be ignored in the following two conditions:
- If there are Stop codons within the hit region, the hit is ignored.
- If a sequence has hits to multiple frames, heuristics are applied to determine whether to ignore the
- The 'Best Eval' hit is used for this rule, but if the 'Best Eval' hit is from a nucleotide annoDB (i.e. annotation database),
it cannot be used. In this case, the 'Best Annotation' hit is used if it is from a protein database.
It is common that even if the hit is being ignored, the hit frame is selected anyway as it will have
the longest ORF or best Markov score. Ambiguity tends to arise with the shorter sequence, those
with many n's, and multi-frame sequences.
- If a frame is being analyzed that does not have a hit or it is being ignored:
N's in the sequence are ignored. The reason that consecutive N's are allowed in ORFs is that they can
be found in hit regions, as show in Figure 2.
- If the Markov score is being compared: if both are negative, if one has a hit it will be selected,
else the longest is selected.
- For training, only the best E-value hit from each sequence is evaluated.
Figure 2: A hit containing strings of consecutive n's. The italics region is the hit region.
Running the ORF finder only
The ORF finder is run after the sequences are annotated (Exec Annotate Sequences in
If you want to change the ORF finder options and run it again, there is a "Exec ORF only" function
runSingleTCW. Or you can run it from the command line:
This will uses the option set using
The ORF finder prints the progress to the terminal, finishing with a summary of the results.
The following files are written into the project's subdirectory "ORF".
The following examples are from the demoTra demo files.
For a given project, the "demoTra" will be replaced with its project name.
Table 3. Output files of the
|demoTra_aaORFs.fa||The translated CDS for the best ORF for each sequence. This file is also written
into the projcmp/AAfiles directory in order to make it easier to use with |
|demoTra_ntORFs.fa||The CDS for the best ORF for each sequence.
|demoTra_goodORFs.fa||The translated CDSs for all good ORFs*.
|AllGoodORFs.txt||A list of the coordinates for all good ORFs*.
|BestORFscores.txt||For the best ORF for each sequence, the 6-frame scores for the ORF.
|ScoreCodon.txt||The computed codon usage frequencies.
|ScoreMarkov.txt||The computed 5th-order Markov model loglikelihood.
|multiFrame.log||Provides information on sequences with hits to multiple frames.
TCW ORF finder
* Selected hit, or >= 900nt, or Markov score is >0 and best frame.
Example of BestORFScore.txt of the 6-frame scores for the best ORF:
>tra_001 type:5p-partial seqLen:1024 orfLen:564 frame:1 coords:106..669 Hit:1.9E-86
Markov 47.03 -9.11 -23.90 8.36 -6.33 -14.37 3
Codon 37.91 -131.61 -377.46 -58.59 -83.01 -239.04 3
The first score is for the ORF starting at 0 (hence, the current frame),
the next two are offset by 1 and 2 respectively, the last
three are for the reverse sequence with offsets 0,1,2.
TCW ORF finder assigns remarks to the sequence about the selected ORF;
TCW assigned remarks start with a "$". In the
images below, an ORF assigned remark is searched on in A, viewed as a column in B, and shown in the sequence frame view of C.
A. Basic Sequence Query for Remark
B. Columns of the Sequence table
C. Sequence Details Frame view
D. 6-frame scores for the displayed ORF
viewSingleTCW features for viewing the ORFs.
A. Basic Sequence Query for Remark: The Basic Sequence Query allows searching on the remark, in
this example, the remark "!LG" was searched on (substring of remark $ORF!LG), which shows the sequences where the ORF is not the longest.
All rows were selected and "View Sequences" shows them in the main Sequence table as shown in Figure 3B.
B. Columns of the Sequence table: This figure
shows most of the ORF columns. A row can be selected followed by
"View Selected Sequence" to view the frame as shown in Figure 3C.
C-D. Sequence Details Frame view: This example shows that the longest ORF was not the
selected one. The selected one has a hit with an E-value of 8E-52 and good Markov and Codon
scores; by toggling the "ORFs" to "Scores", the upper region will show the 6-frame scores as shown in Figure 3D.
The full sequence is shown, where the start and stop codons are highlighted in green, and the hit region is in italics.
TCW ORF finder prints out various summary counts. The results below are from a RNA-seq assembled
dataset of 20990 sequences.
Is Longest ORF 18767 (89%) Has Start&Stop 9187 (44%) Used Hit frame 16638 (79%)
ORF>=300 17729 (84%) Has Start 12852 (61%) Markov Best Frame 17783 (85%)
Average ORF len 840 Has Stop 12877 (61%) Codon Best Frame 19518 (93%)
|The Markov and Codon Best Frame score is one that is positive and greater than all other frames for the ORF.
Though the Codon percentage is better than the Markov in this case, the Markov score is much more discriminatory.
For example, when the ORF finder was executed with a file of UTRs as input, where the table on the right show the
Table 2: Training with CDS vs UTRs
|Hit Regions||17783 (85%)||19518 (93%)
|UTRs||5832 (28%)||13370 (64%)
From the 16638 ORFs that used the hit frame, 14956 (90%) were the longest ORF. If this is the case in general,
it would indicate that selecting the longest ORF would be correct in at least 90% of the cases. However,
the longest ORF is not necessary the best start codon, as shown in the example on the right.
Figure 3. Not longest ORF for frame. An ATG codon is further upstream then the one selected.
TCW ORF finder and Trinity Transcoder
|Go to top|
Transcoder3 Perl code for computing the reading frame specific
5th-order Markov model was written in
Java for the
TCW ORF finder, that is, it computes the exact same scores.
- If the
TCW file HitRegion.txt
was entered into the
Transcoder seq_n_baseprobs_to_loglikelihood_vals.pl program,
TCW ScoreMarkov.txt file will be the same as the
Transcoder hexamer.scores file.
- If the project_ntORFs.fa file is used as input to the score_CDS_likelihood_all_6_frames.pl
TCW BestORFscores.txt file will have similar Markov score as the
longest_orfs.cds.scores file; there is a slight variation because of a difference in precision between
Perl and Java.
TCW ORF finder results were compared for the same input file of
250 sequences. As the
Transcoder may list more than one ORF for a frame, only the first (and best) was
compared with the best TCW ORF. There were the 131 differences: Frame: 6, Start 94, Stop 12, Start&Stop 19.
There was a hit for
all difference except for 11; one of these had a different frame due to TCW Rule 3, the rest were
insignificant differences in coordinates, e.g.
Bar_101 Diff start
1644 3p-partial 2 17..1660 TCW
1662 internal 2 2..1660 Transcoder
Of the six with a frame difference, one was just mentioned and the rest were due to selecting the hit frame; they were
all marked as "ORF!LG", that is, not the longest. Most of the difference were because TCW was able to
use the hit coordinates as a guide (e.g. see Figure 3).
Transcoder can use Pfam results, which was not used in the above comparison. To compare the two
without any Hit/Pfam results, hits were removed from the TCW database and the
500 CDSs was used as the training set. The resulting comparison had only 35 differences, two were frame differences
due to Rule 3 and the rest were the minor differences in end coordinates (as shown in Bar_101 above).
GeneID: Parra G, Blanco E, Guigo R (2000) GeneID in Drosophilia. Genome Research 10:511-515.
Transcoder: Hass BJ, Papanicolaou A, Yassour M. et al. (2013) De novo transcript sequence reconstruction from RNA-seq
using the Trinity platform for reference generation and analysis. Nature Protocols 8:1494-1512
Search programs used in
BLAST: Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389-3402.
Diamond: Benjamin Buchfink, Chao Xie & Daniel H. Huson, Fast and Sensitive Protein Alignment using DIAMOND, Nature Methods, 12, 59-60 (2015) doi:10.1038/nmeth.3176.
Ublast: Edgar,RC (2010) Search and clustering orders of magnitude faster than BLAST, Bioinformatics 26(19), 2460-2461.