The University of Arizona
TCW ORF finder

Contents:

  1. Overview
  2. runSingleTCW - Running the ORF finder
  3. ORF finder algorithm
  4. viewSingleTCW - Viewing the ORFs
  5. Results

Overview

TCW has programs to build and view a single species (singleTCW), and to build and view the orthologs of multiple species (multiTCW). The multi-species comparison requires the translated protein sequences. To create them, runMultiTCW can run ESTscan (Iseli C, Jongeneel CV, Bucher P (1999)), which uses a Hidden Markov Model to find the correct reading frame while taking sequencing errors into account; the user must supply a 'smat' file that is created using the ESTscan software. As a simpler method, the TCW annotation step computes the best ORF for each sequence using the annotation hits, codon usage, hexamer usage and length of the candidate ORF. It outputs a protein sequence for every transcript that has an ORF >=30bp.

runSingleTCW - Running the ORF finder

Figure 1. runSingleTCW has an Option menu that allows the user to change the default ORF finder rules.

Use Alternative Starts By default, TCW only uses ATG. If you select this option, it will also use CTG and TTG.
To select the ORF, one of the following two methods may be chosen:
1. Use the longest ORF No parameters. No hits or usage data is used.
2. Use Hits and Usage The following parameters are used:
    Hit e-value If a sequence has an annoDB hit with e-value <= this number [default 1e-05], use the corresponding frame.
    Log ratio of lengths If a sequence has two candidate frames where the log ratio of their lengths <= this number [default 0.2], the codon and hexamer LLR are used to determine the best frame. See Table 1 for some examples.
    Computing the usage matrices One of the following three methods is used:
       1. Train with Hits TCW uses the best hit regions of the sequences to compute the codon and hexamer usage matrices. For a sequence region to be used in training, it must have a hit e-value <=1e-50 and contain no Stop codons.
       2. Train with CDS file The file must be in fasta format, where each sequence starts with a start codon and ends at the stop. These sequences are used to compute the codon and hexamer usage matrices.
       3. Codon Usage table The file must contain a line for each codon and the corresponding frequency per 1000 (see demo/projects/demoOl_DE/usage.txt for an example). If this is used, there will not be hexamer values for the frames.
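
As an illustration of the "frequency per 1000" values mentioned for the codon usage table, the following is a minimal Python sketch that counts in-frame codons over a set of CDS sequences. It is not TCW's code: the function name `codon_usage_per_1000` is hypothetical, and skipping codons that contain ambiguous bases is an assumption.

```python
from collections import Counter

def codon_usage_per_1000(cds_seqs):
    """Count in-frame codons over CDS sequences and scale to frequency per 1000.

    Assumption: each sequence is read in frame from its first base, and
    codons containing characters other than A, C, G, T are skipped.
    """
    counts = Counter()
    for seq in cds_seqs:
        seq = seq.upper()
        # Step through complete codons only; a trailing partial codon is ignored.
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if set(codon) <= set("ACGT"):
                counts[codon] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: 1000.0 * n / total for codon, n in counts.items()}

usage = codon_usage_per_1000(["ATGAAATAA", "ATGAAAAAATGA"])
for codon in sorted(usage):
    print(codon, round(usage[codon], 1))
```

The same counting idea extends to hexamers by using a window of six bases, although how TCW defines and steps its hexamers is not stated here.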

Running the ORF finder only

The ORF finder is run after the sequences are annotated ("Exec Annotate Sequences" in runSingleTCW). If you want to change the ORF finder options and run it again, there is an "Exec ORF only" function in runSingleTCW. Alternatively, you can run it from the command line:
./execAnno  -r
This will use the options set using runSingleTCW.

Rule shown in TCW

The viewSingleTCW overview page states the rules that were used.
Sequences have remarks such as those shown in Figure 2 (on the right of the selected frame):
  • !LG indicates it is not the longest ORF.
  • If the frame was selected based on the hit, the remark indicates the relation:
    • ORF contains Hit
    • Hit contains ORF
    • ORF overlaps Hit (indicating there was a Stop codon within the hit region).

Examples of Log Ratio of lengths

Table 1. Example of the Log Ratio with various lengths and cutoffs.

The 0.1, 0.2 and 0.3 are Log Ratio cutoff.

A value of 'true' indicates that the LLR will be used to determine which ORF to select.

Length 1  Length 2  Log Ratio  0.1    0.2    0.3
34        30        0.125      false  true   true
340       300       0.125      false  true   true
3400      3000      0.125      false  true   true
39        30        0.262      false  false  true
390       300       0.262      false  false  true
3900      3000      0.262      false  false  true
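
The values in Table 1 can be reproduced with a short Python sketch. The table values are consistent with a natural logarithm of the length ratio; that base, and the helper names `log_ratio` and `use_llr`, are assumptions for illustration rather than TCW's actual code.

```python
import math

def log_ratio(len_a, len_b):
    """Absolute natural-log ratio of two ORF lengths (assumed base e)."""
    return abs(math.log(len_a / len_b))

def use_llr(len_a, len_b, cutoff=0.2):
    """True when the two lengths are close enough that the LLR decides the frame."""
    return log_ratio(len_a, len_b) <= cutoff

for a, b in [(34, 30), (340, 300), (39, 30), (390, 300)]:
    flags = [use_llr(a, b, c) for c in (0.1, 0.2, 0.3)]
    print(a, b, round(log_ratio(a, b), 3), flags)
```

Because the ratio depends only on the relative lengths, 34 vs 30 and 3400 vs 3000 give the same value, which is why the rows of Table 1 repeat.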

Output files

ORFbestTranslated.fasta    recommended for input in runMultiTCW.
ORFallTranslated.fasta     translated ORF for each frame.
ORFbestUntranslated.fasta  the CDS sequences.
ORFframe.txt               a list of the ORFs with length, codon LLR and hexamer LLR.
ORFcodonUsage.txt          the computed codon usage frequencies (unless a codon usage table was input).
ORFhexUsage.txt            the computed hexamer usage frequencies (unless a codon usage table was input).

ORF finder algorithm


The algorithm is as follows:

Step 1: All ORFs for a frame are generated. An ORF may be from the start coordinate to the first Stop, a start site to a Stop, or from a start site to the end coordinate. If a start site is within 30bp of the start coordinate, it is used instead of the start coordinate.
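
Step 1 can be sketched in Python for a single forward frame as follows. This is an illustrative reading of the description, not TCW's implementation: the function name `orfs_in_frame`, the 0-based coordinates, and the interpretation of "within 30bp" as "fewer than 30 bases from the start coordinate" are all assumptions. Negative frames could be handled by applying the same function to the reverse complement.

```python
STOPS = {"TAA", "TAG", "TGA"}

def orfs_in_frame(seq, offset, starts=("ATG",), near=30):
    """Enumerate candidate ORFs for one forward frame (offset 0, 1 or 2).

    Returns (begin, end, has_start, has_stop) tuples; coordinates are
    0-based, end is exclusive and does not include the stop codon.
    """
    seq = seq.upper()
    end_coord = len(seq) - (len(seq) - offset) % 3  # end of last complete codon
    orfs = []
    seg_begin = offset   # first codon of the current stop-free segment
    first_start = None   # first start codon seen in the current segment
    for pos in range(offset, end_coord + 1, 3):
        at_end = pos == end_coord
        codon = seq[pos:pos + 3]
        if not at_end and codon in starts and first_start is None:
            first_start = pos
        if at_end or codon in STOPS:
            begin = None
            has_start = False
            if seg_begin == offset and (first_start is None
                                        or first_start - offset >= near):
                # First segment with no start site near the start coordinate:
                # the ORF runs from the start coordinate.
                begin = offset
            elif first_start is not None:
                begin, has_start = first_start, True
            if begin is not None and pos > begin:
                orfs.append((begin, pos, has_start, not at_end))
            seg_begin = pos + 3
            first_start = None
    return orfs

print(orfs_in_frame("CCCATGAAATAAGGGATGCCC", 0))
```

Passing `starts=("ATG", "CTG", "TTG")` corresponds to the "Use Alternative Starts" option. A minimum-length filter (the 30bp threshold mentioned in the Overview) would be applied to the returned list.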

Step 2: The best ORF for a given frame is selected using the following rules.

Step 3: The overall best ORF is selected from all frames using the following rules.

Rule 1: ORF with the Best hit.
This frame is selected if the hit e-value is less than a user-supplied cutoff [default 1E-05] and if there is an ORF >=30bp that at least overlaps the hit region.

Rule 2: Longest ORF.
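If rule 1 fails, the frame with the longest ORF is selected if the log ratio of its length to that of the second longest ORF is > a user-supplied cutoff [default 0.2]; a log ratio <= the cutoff means the two lengths are too close to decide by length alone (see Table 1).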

Rule 3: ORF with the highest LLR.
If rule 2 fails, then between the two longest ORFs, the frame with the best Codon LLR + Hexamer LLR is selected, where LLR is the log-likelihood ratio.
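
The three rules can be summarized with a small Python sketch. It follows the "Log ratio of lengths" option and Table 1, where a log ratio <= the cutoff means the LLR decides between the two longest ORFs. The `FrameORF` record, the `select_best` function, and the use of a natural logarithm are assumptions made for illustration; the inputs (per-frame best ORF, its summed LLR, and the e-value of an overlapping hit) are taken as already computed.

```python
import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class FrameORF:
    frame: int
    length: int                         # ORF length in bp
    llr: float                          # codon LLR + hexamer LLR
    hit_evalue: Optional[float] = None  # e-value of a hit the ORF overlaps, if any

def select_best(orfs, evalue_cutoff=1e-5, ratio_cutoff=0.2, min_len=30):
    """Return (chosen ORF, rule name) following Rules 1-3 as described above."""
    candidates = [o for o in orfs if o.length >= min_len]
    if not candidates:
        return None, "none"
    # Rule 1: a frame whose ORF overlaps a sufficiently good hit.
    with_hit = [o for o in candidates
                if o.hit_evalue is not None and o.hit_evalue <= evalue_cutoff]
    if with_hit:
        return min(with_hit, key=lambda o: o.hit_evalue), "hit"
    by_len = sorted(candidates, key=lambda o: o.length, reverse=True)
    if len(by_len) == 1:
        return by_len[0], "longest"
    first, second = by_len[0], by_len[1]
    # Rule 2: the longest ORF wins when it is clearly longer than the runner-up.
    if math.log(first.length / second.length) > ratio_cutoff:
        return first, "longest"
    # Rule 3: lengths are close, so the higher summed LLR decides.
    return max((first, second), key=lambda o: o.llr), "llr"

best, rule = select_best([FrameORF(1, 340, 1.0), FrameORF(2, 300, 5.0)])
print(best.frame, rule)
```

In this example the two lengths have a log ratio of about 0.125, which is below the default cutoff of 0.2, so the frame with the higher LLR is chosen even though it is not the longest.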

Figure 2 shows two examples, where the selected frame is listed first and the corresponding ORF is shown between green start and stop codons.

Figure 2a. The frame with a hit was selected; the italicized codons indicate where the hit aligns.

Figure 2b. The top two are close in length, so the LLR values are used to select the frame.

viewSingleTCW - Viewing the ORFs

  1. Viewing the ORF as shown in Figure 2: From a table of sequences, select a sequence followed by "View Selected Sequence". On the resulting display, using the drop-down next to the "Show" label, select "Show sequence by frame".

    The "Start/Stop" option can be changed to "Codon" or "Hexamer" (if available), in which case, the codons/hexamers are highlighted by quartile Q1-Q4.

  2. Both "Columns" and "Filters" have an ORF section, where the possible ORF columns and filters are shown in Figure 3a.

  3. The "Basic Query" on sequence allows a search on a remark, so entering "LG" shows all the sequences where the selected ORF is not the longest ORF, as in the example of Figure 2b.
Figure 3a. ORF Columns and Filters panels.

Figure 3b. Basic Query by Sequence: search for sequences with "LG" in their remark.

Results

Results are obtained from 3 datasets:
  1. Mouse CCDS are a high quality set with no UTRs. This set was downloaded around August 2015.
  2. Diaphorina citri GS mRNA sequences, which are predicted from genome sequence and contain UTRs.
  3. Diaphorina citri de novo transcripts from the Homoptera project, which are de novo assembled from 54bp paired end reads.
Table 2: Statistics for the 3 datasets.
                     1. Mouse CCDS  2. D. citri GS  3. D. citri de novo
Sequences            24833          21986           21072
Used for training    24007          16446           11510
Selected ORF:
  ORF >= 30          100%           100%            100%
  ORF >= 300         97.6%          91.7%           83.1%
  Has Start&Stop     0.1%           38.1%           53.6%
  Used Annotation    99.6%          91.5%           80.0%
  Is Longest ORF     99.6%          97.1%           85.9%
  Codon LLR>=0       99.7%          98.4%           94.4%
  Hexamer LLR>=0     99.8%          98.7%           94.1%
  Best Codon LLR     62.2%          59.2%           49.4%
  Best Hexamer LLR   64.2%          64.7%           53.2%
Average ORF length   1671           1095            798

1. Mouse CCDS dataset.
The mouse TCW database was annotated against the Rodent UniProt database, the rest of UniProt, and the Genbank nr database. 1029 sequences had hits in more than one frame. The ORF finder was run with default parameters. When the CCDS set was used directly for training the codon and hexamer usage tables, it produced about the same results.

The ORF should always be in frame 1 and extend the whole length (there are no stops at the end). However, 60 sequences had an ORF in frame 2 or 3.
  • 28 had hits in multiple frames, where the best hit was in a frame other than 1.
    • As shown in Figure 4, one sequence has a hit with an e-value of 0E0 in frame 3, and the ORF extended the length of the sequence. A similar situation occurred for two other sequences.
    • Most of these also had a hit in frame 1, showing a weakness in the algorithm, i.e. it should take into account the best hit in each frame. However, for all but 3 sequences, the e-value was between 1E-05 and 1E-70, with the average about 1E-30.
  • 10 had hits only in frames other than 1.
  • 22 had no hits and an ORF in a frame other than 1 that was almost as long and had much better LLR values.
Figure 4a. Best hits per annoDB.
SP=SwissProt, TR=Trembl, Rod=Rodent, Ful=everything but Rodents, GInr is Genbank nr.

Figure 4b. The ORFs per frame.

The one difference is that the ORF in frame 3 does not start with an ATG whereas the ORF in frame 1 does. However, since this algorithm is tuned for de novo transcripts, where the start of the transcript can be missing, having a start site is not given much weight.

For this dataset, using the "Longest ORF" option always gets the correct frame.

2. D. citri genome-sequence dataset.
This dataset was annotated against the UniProt invertebrate databases, the full database minus the invertebrate, and the Genbank nr (which contains many of the D. citri genes). It used the default ORF finding options. All but 290 sequences have a positive frame. Of these 290:

  • 24 had hits in multiple frames.
  • 23 had a hit only in a negative frame.
  • The rest had no hit, or a hit > 1E-05, or a hit that did not overlap an ORF. An example of a frame -3 ORF with no hit is shown in Figure 5a.
The number of ORFs with both a start and stop is low; 7711 are full length where none of them have a Stop codon; 2458 sequences have no start or stop for the entire sequence for the selected frame, as shown in Figure 5b. Of course, any of the ATGs may be the 'correct' start or any of the alternative starts. The algorithm used the default of only allowing ATG, and if there is no ATG within 30bp of the start, the start coordinate is used.
Figure 5a. Best ORF in frame -3. No hit for sequence.

Figure 5b. Full length with no start or stop.

Assuming that all ORFs should be in a positive frame, the Hit&Usage option performs slightly better than the Longest ORF option.


3. D. citri de novo dataset.
Since these are de novo assembled with short reads (54bp), they are not going to be as good as the other two sets. It was annotated with the same databases as stated for dataset #2. Since there will be many partial transcripts, the option to use alternative start sites was checked to get the longest possible ORFs.

To estimate the number of correct ORFs, the file "ORFallTranslated.fasta", which contains the translated ORFs for each frame, was blasted against the translated ORFs from D. citri GS mRNA (dataset #2). The blast results were analyzed to determine how many of the selected ORFs received the best e-value mRNA hit. The same procedure was used after computing ORFs based on the longest ORF only.

These results indicate that for de novo transcripts, which will have errors, it is advantageous to use the Hit&Usage option.

Table 3. Comparison results.
                          Use Hits & Usage       Use Longest Only
Number hit                15063                  15012
Multi-frame               5121                   5089
Selected ORF is not best mRNA hit:
  Has good annotation     80 of 14866 (0.53%)    42 of 13177 (0.32%) (1)
  No good annotation (2)  15 of 162 (9%)         895 of 925 (96%)
  Total false frames      95 of 15063 (0.63%)    937 of 15012 (6.24%)

(1) The annotation was not actually used, but 13177 of the longest ORFs had annotation without specifically selecting for it.
(2) No annotation or poor annotation (e-value>1E-05 or Hit does not overlap with a valid ORF).
Email Comments To: tcw@agcol.arizona.edu