- runSingleTCW - Running the ORF finder
- ORF finder algorithm
- viewSingleTCW - Viewing the ORFs
TCW has programs to build and view a single species (singleTCW), and to build and view
the orthologs of multiple species (multiTCW). For the comparison, it is necessary to have
the translated protein sequences. In order to create them, runMultiTCW can run
ESTscan (Iseli C, Jongeneel CV, Bucher P (1999)), which uses a Hidden Markov Model to find
the correct reading frame, taking into account sequencing errors. The user must supply a 'smat'
file that is created using the ESTscan software. In order to provide a simpler method,
the TCW annotation step computes the best ORF for each sequence using the annotation hits,
codon usage, hexamer usage and length of candidate ORF. It outputs a protein sequence for
every transcript that has an ORF >=30bp.
runSingleTCW - Running the ORF finder
Figure 1.
runSingleTCW has an Option menu that allows the user to
change the default ORF finder rules.
|Use Alternative Starts||By default, TCW only uses ATG as a start codon. If you select this option, it will also use CTG and TTG.
|To select the ORF, one of the following two methods may be chosen:
|1. Use the longest ORF||No parameters. No hit or usage data is used.
|2. Use Hits and Usage||The following parameters are used:
| Hit e-value||If a sequence has an annoDB hit with e-value <= this number [default 1e-05], use the corresponding frame.
| Log ratio of lengths||If a sequence has two candidate frames where the log ratio of their lengths is <= this number [default 0.2], the codon and hexamer LLR are used to determine the best frame. See Table 1 for some examples.
| Computing the usage matrices||One of the following three methods is used:
| 1. Train with Hits||TCW uses the best hit regions of the sequences to compute the codon and hexamer usage matrices. For a sequence region to be used in training, it must have a hit e-value <=1e-50 and contain no Stop codons.
| 2. Train with CDS file||The file must be in fasta format, where each sequence starts with a start codon and ends at the stop. These sequences are used to compute the codon and hexamer usage matrices.
| 3. Codon Usage table||The file must contain a line for each codon and the corresponding frequency per 1000 (see demo/projects/demoOl_DE/usage.txt for an example). If this is used, there will not be hexamer values for the frames.
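To make the usage options more concrete, the following is a minimal Python sketch of how codon frequencies per 1000 could be computed from a set of CDS sequences and then used to score a candidate ORF. It is not the TCW implementation: the function names, the uniform background of 1000/64 per codon, and the pseudo-frequency for unseen codons are all assumptions made for illustration, since this document does not state how TCW computes its LLR.

```python
import math
from collections import Counter

def codon_usage_per_1000(cds_seqs):
    """Count in-frame codons over all CDS sequences; return frequency per 1000."""
    counts = Counter()
    for seq in cds_seqs:
        seq = seq.upper()
        # Step through complete codons only; a trailing partial codon is ignored.
        for i in range(0, len(seq) - len(seq) % 3, 3):
            counts[seq[i:i + 3]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: 1000.0 * n / total for codon, n in counts.items()}

def codon_llr(orf, usage, pseudo=0.1):
    """Sum of log(observed / background) over the codons of a candidate ORF.

    ASSUMPTION: the background is uniform (1000/64 per codon) and codons that
    are missing from the usage table get a small pseudo-frequency.
    """
    background = 1000.0 / 64
    orf = orf.upper()
    score = 0.0
    for i in range(0, len(orf) - len(orf) % 3, 3):
        score += math.log(usage.get(orf[i:i + 3], pseudo) / background)
    return score

if __name__ == "__main__":
    usage = codon_usage_per_1000(["ATGAAAGGGTAA", "ATGAAACCCTGA"])
    print(codon_llr("ATGAAAGGG", usage))
```

A hexamer table could be built the same way by counting 6-base windows instead of codons; whether those windows overlap or step by codon is another choice this sketch leaves open.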
Running the ORF finder only
The ORF finder is run after the sequences are annotated ("Exec Annotate Sequences" in runSingleTCW).
If you want to change the ORF finder options and run it again, there is an "Exec ORF only" function
in runSingleTCW. Or you can run it from the command line:
This will use the options set using
Rule shown in TCW
The viewSingleTCW overview page states the rules that were used.
Sequences have remarks such as those shown in Figure 2 (on the right of the selected frame):
- !LG indicates that it is not the longest ORF.
- If the frame was selected based on the hit,
the remark indicates the relation:
  - ORF contains Hit
  - Hit contains ORF
  - ORF overlaps Hit (indicating there was a Stop codon within the hit region).
Examples of Log Ratio of lengths
Table 1. Examples of the Log Ratio with various lengths and cutoffs.
The 0.1, 0.2 and 0.3 columns are Log Ratio cutoffs.
A value of 'true' indicates that the Log Ratio is <= the cutoff, so
the LLR will be used to determine which ORF to select.
|Length 1||Length 2||Log Ratio||0.1||0.2||0.3
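The cutoff test behind the table columns can be sketched in a few lines of Python. This is an illustration only: the document does not state the base of the logarithm, so the natural log is assumed here, and the function names are hypothetical.

```python
import math

def log_ratio(len1, len2):
    """Log of the longer length over the shorter one.

    ASSUMPTION: natural log; the base TCW uses is not stated in this document.
    """
    longer, shorter = max(len1, len2), min(len1, len2)
    return math.log(longer / shorter)

def use_llr(len1, len2, cutoff=0.2):
    """True when the two lengths are close enough that the LLR decides the frame."""
    return log_ratio(len1, len2) <= cutoff

if __name__ == "__main__":
    # Hypothetical lengths, chosen only to exercise the three cutoffs.
    for a, b in [(1000, 900), (1000, 800), (1000, 500)]:
        print(a, b, round(log_ratio(a, b), 3),
              [use_llr(a, b, c) for c in (0.1, 0.2, 0.3)])
```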
|ORFbestTranslated.fasta||recommended for input in |
|ORFallTranslated.fasta||translated ORF for each frame.
|ORFbestUntranslated.fasta||the CDS sequences.
|ORFframe.txt|| a list of the ORFs with length, codon LLR and hexamer LLR.
|ORFcodonUsage.txt||the computed codon usage frequencies (unless a codon usage table was input).
|ORFhexUsage.txt||the computed hexamer usage frequencies (unless a codon usage table was input).
The algorithm is as follows:
Step 1: All ORFs for a frame are generated. An ORF may be from the start coordinate to the first Stop,
a start site to a Stop, or from a start site to the end coordinate. If a start site is within 30bp of
the start coordinate, it is used instead of the start coordinate.
Step 2: The best ORF for a given frame is selected using the following rules.
Step 3: The overall best ORF is selected from all frames using the following rules.
Rule 1: ORF with the Best hit.
This frame is selected if the
hit e-value is less than a user-supplied cutoff [default 1E-05] and if there is an ORF
>=30bp that at least overlaps the hit region.
Rule 2: Longest ORF.
If rule 1 fails, this frame is selected if the log ratio
of its length with the second longest ORF is greater than a user-supplied cutoff [default 0.2].
Rule 3: ORF with the highest LLR.
If rule 2 fails, then between the two longest ORFs, the frame with the best
Codon LLR + Hexamer LLR is selected, where LLR is the log-likelihood ratio.
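The steps and rules above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the TCW code: it keeps only the first start site per open region rather than every possible ORF, handles one forward frame at a time, takes the natural log for the length ratio, and expects each candidate as a dictionary with hypothetical keys `frame`, `length`, `evalue` (None when there is no overlapping hit) and `llr` (codon LLR + hexamer LLR). Following the parameter description, the longest ORF is taken only when the log ratio exceeds the cutoff.

```python
import math

STOPS = {"TAA", "TAG", "TGA"}

def orfs_in_frame(seq, offset, starts=("ATG",), edge=30, min_len=30):
    """Step 1 sketch: (begin, end) spans of candidate ORFs in one forward frame."""
    seq = seq.upper()
    orfs = []
    begin = None        # begin of the current candidate, None if not yet open
    at_5prime = True    # True until the first Stop in this frame is passed
    last = offset
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        last = i + 3
        if codon in STOPS:
            if begin is None and at_5prime:
                begin = offset            # no start site: use the start coordinate
            if begin is not None and i - begin >= min_len:
                orfs.append((begin, i))   # the span excludes the Stop codon
            begin, at_5prime = None, False
        elif codon in starts and begin is None:
            # Near the 5' end a start site replaces the start coordinate.
            begin = i if (not at_5prime or i - offset < edge) else offset
    if begin is None and at_5prime:
        begin = offset
    if begin is not None and last - begin >= min_len:
        orfs.append((begin, last))        # open ORF running to the end coordinate
    return orfs

def pick_best(cands, evalue_cutoff=1e-5, ratio_cutoff=0.2):
    """Step 3 sketch: apply rules 1-3 to the best ORF of each frame."""
    hits = [c for c in cands
            if c["evalue"] is not None and c["evalue"] <= evalue_cutoff]
    if hits:                                            # Rule 1: best hit
        return min(hits, key=lambda c: c["evalue"]), "hit"
    ranked = sorted(cands, key=lambda c: c["length"], reverse=True)
    if len(ranked) == 1:
        return ranked[0], "longest"
    first, second = ranked[0], ranked[1]
    if math.log(first["length"] / second["length"]) > ratio_cutoff:
        return first, "longest"                         # Rule 2: clearly longest
    return max((first, second), key=lambda c: c["llr"]), "llr"  # Rule 3
```

For the negative frames, the same function could be applied to the reverse complement of the sequence. To use alternative starts, pass `starts=("ATG", "CTG", "TTG")`.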
On the left are two examples, where the selected frame is listed first and the
corresponding ORF is shown within green start and stop codons.
Figure 2a. The frame with a hit was selected;
the italicized codons indicate where the hit aligns.
Figure 2b. The top two are close in length,
so the LLR values are used to select the frame.
Results are obtained from 3 datasets:
- Mouse CCDS are a high quality
set with no UTRs. This set was downloaded around August 2015.
- Diaphorina citri GS
mRNA sequences, which are predicted from genome sequence and contain UTRs.
- Diaphorina citri de novo
transcripts from the Homoptera project, which are de novo assembled from 54bp paired end reads.
Table 2: Statistics for the 3 datasets.
| ||1. Mouse CCDS||2. D. citri GS||3. D. citri de novo
|Sequences ||24833 ||21986 ||21072
|Used for training||24007 ||16446 ||11510
|ORF >= 30 ||100% ||100% ||100%
|ORF >= 300 ||97.6% ||91.7% ||83.1%
|Has Start&Stop ||0.1% ||38.1% ||53.6%
|Used Annotation ||99.6% ||91.5% ||80.0%
|Is Longest ORF ||99.6% ||97.1% ||85.9%
|Codon LLR>=0 ||99.7% ||98.4% ||94.4%
|Hexamer LLR>=0 ||99.8% ||98.7% ||94.1%
|Best Codon LLR ||62.2% ||59.2% ||49.4%
|Best Hexamer LLR||64.2% ||64.7% ||53.2%
|Average ORF length||1671 ||1095 ||798
1. Mouse CCDS dataset.
The mouse TCW database was annotated against the Rodent UniProt database,
the rest of UniProt, and Genbank nr database. 1029 sequences had hits in more than
The ORF finder was run with default parameters. When the CCDS set was used directly for training the codon and hexamer usage tables,
it produced about the same results.
For this dataset, using the "Longest ORF" option always gets the correct frame.
The ORF should always be in frame 1 and extend the whole length (there are no stops at
the end); however, 60 sequences had an ORF in frame 2 or 3.
- 28 had hits in multiple frames, where the best hit was in a frame other than 1.
- As shown in Figure 4, one ORF has an e-value of 0E0 in frame 3 and the
ORF extended the length of the sequences. A similar situation occurred for two
- Most of these also had a hit in
frame 1, showing a weakness in the algorithm, i.e. it should take into account
the best hit in each frame. However, for all but 3 sequences, the e-value was
between 1E-05 and 1E-70, with the average about 1E-30.
- 10 had hits only in frames other than 1.
- 22 had no hits and an ORF in a frame other than 1 that was almost as long and had much better LLR values.
Figure 4a. Best hits per annoDB.
SP=SwissProt, TR=Trembl, Rod=Rodent, Ful=everything but Rodents, GInr is Genbank nr.
Figure 4b. The ORFs per frame.
The one difference is that the ORF in frame 3 does not start with an ATG whereas
the ORF in frame 1 does. However, since this algorithm is tuned for de novo transcripts,
the start of the transcript can be missing; hence, having a start site is not given
2. D. citri genome-sequence dataset.
This dataset was annotated against the UniProt invertebrate databases,
the full database minus the invertebrate, and the Genbank nr (which contains many of the D. citri genes).
The ORF finder was run with the default ORF finding options. All but 290 sequences have a positive frame.
Of these 290:
- 24 had hits in multiple frames.
- 23 had a hit only in a negative frame.
- The rest had no hit, or a hit > 1E-05, or a hit that did not overlap an ORF.
An example of a frame -3 ORF with no hit is shown in Figure 5a.
The number of ORFs with both a start and stop is low;
7711 are full length where none of them have a Stop codon;
2458 sequences have no start or
stop for the entire sequence for the selected frame, as shown in Figure 5b.
Of course, any of the ATGs may be the 'correct' start or any of the alternative starts.
The algorithm used the default of only allowing ATG, and if there is no ATG within 30bp of
the start, the start coordinate is used.
Figure 5a. Best ORF in frame -3. No hit for sequence.
Figure 5b. Full length with no start or stop.
Assuming that all ORFs should be in a positive frame, the Hit&Usage option performs slightly better than the Longest ORF option.
3. D. citri de novo dataset.
Since these are de novo assembled with short reads (54bp),
they are not going to be as good as the other two sets. This dataset was annotated with the same databases
as stated for dataset #2. Since there will be many partial transcripts, the option to use
alternative start sites was checked to get the longest possible ORFs.
To estimate the number of correct ORFs, the file "ORFallTranslated.fasta",
which contains the translated ORFs for each frame, was blasted against the translated ORFs
from D. citri GS mRNA (dataset #2). The blast results were analyzed to determine how many
of the selected ORFs received the best e-value mRNA hit. The same procedure was used after
computing ORFs based on the longest ORF only.
These results indicate that for de novo transcripts, which will contain errors, it is
advantageous to use the Hit&Usage option.
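The counting step of this comparison can be sketched as follows. This is a hypothetical helper, not part of TCW: it assumes the blast results have already been reduced to the best e-value per frame for each transcript, and it treats a selected frame as correct when its e-value ties the best one. Transcripts with no mRNA hit in any frame are skipped.

```python
def count_false_frames(best_hits, selected):
    """Count selected ORFs that did not receive the best e-value mRNA hit.

    best_hits: {transcript: {frame: best e-value for that frame}}  (assumed layout)
    selected:  {transcript: frame chosen by the ORF finder}
    Returns (false_frames, transcripts_with_hits).
    """
    false = total = 0
    for name, frame in selected.items():
        hits = best_hits.get(name)
        if not hits:
            continue                      # no mRNA hit in any frame
        total += 1
        # A frame with no hit at all is treated as having an infinite e-value.
        if hits.get(frame, float("inf")) > min(hits.values()):
            false += 1
    return false, total

if __name__ == "__main__":
    hits = {"t1": {1: 1e-50, 2: 1e-5}, "t2": {-3: 1e-30}}
    print(count_false_frames(hits, {"t1": 2, "t2": -3}))
```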
Table 3. Comparison results.
| ||Use Hits & Usage||Use Longest Only
|Selected ORF is not best mRNA hit:
|Has good annotation||80 of 14866 (0.53%)||42 of 13177 (0.32%)¹
|No good annotation²||15 of 162 (9%)||895 of 925 (96%)
|Total false frames ||95 of 15063 (0.63%)||937 of 15012 (6.24%)
¹ The annotation was not actually used, but 13177 of the longest ORFs had annotation
without specifically selecting for it.
² No annotation or poor annotation (e-value>1E-05 or Hit does not overlap with a valid ORF).