The University of Arizona
TCW ORF finder v2.5
AGCoL | TCW Home | Doc Index | singleTCW Guide | DE Guide | multiTCW Guide | Tour
This document discusses the TCW ORF finder that is released with TCW v2.5 (12 May 2018).

Contents:

  1. Overview
  2. runSingleTCW - Running the ORF finder
    1. ORF finder algorithm
    2. Running from the ORF finder only
    3. Output file
  3. viewSingleTCW - Viewing the ORFs
  4. TCW ORF finder and Trinity Transcoder
  5. Explanation of summary output
  6. References

Overview

TCW has programs to build and view a single species (singleTCW), and to build and view the orthologs of multiple species (multiTCW). For the comparison, it is necessary to have the protein sequences, i.e. translated Open Reading Frame (ORF). In order to create them, runMultiTCW can run ESTscan1, which uses a Hidden Markov Model to find the correct reading, taking into account sequencing errors; the user must supply a 'smat' file that is created using the ESTscan software. In order to provide a simpler method, the TCW annotation step computes the best ORF for each sequence using the annotation hits, 5th-order Markov model2,3 and length of candidate ORF. It outputs a protein sequence for every transcript.

runSingleTCW - Running the ORF finder

Go to top
runSingleTCW has an Option menu that allows the user to set the GO, ORF and similarity options for annotations. The section on the right are the parameters for ORF finding. The algorithm and parameters are discussed in the next section.

This ORF finder algorithm was designed to find the 'best' ORF from de novo assembled transcripts for input into runMultiTCW. However, a file of all the 'good' ORFs is output (see Output files).


Figure 1: The runSingleTCW ORF finder options.

Algorithm

The algorithm is:
   1. For each of the 6 frames:
         Find the best ORF as follows:
            If the frame computes an ORF using Rule 1, no further ORFs are found.
            Otherwise, find all possible ORFs and select the best using Rule 2 and Rule 3.
   2. From the 6 ORFs found in Step 1, find the best ORF using Rules 1, 2 and 3.
	

Rule 1: Use best hit frame if E-value < [default 1E-30] or has good Markov score
If this rule succeeds, the ORF found in the best frame is the selected one. See
Additional heuristics for information on assigning coordinates.

Rule 2: Else use the longest ORF frame if the log length ratio > [default 0.3].


If the lengths of two candidate ORFs are very different, than it is best to use the longest ORF. To determine if their lengths are 'very' different, the log length ratio is used. For example:

In Table 1 on the right, the 0.1, 0.2 and 0.3 are Log Length Ratio cutoffs.
A value of 'false' indicates that the longest ORF will be selected.
A value of 'true' indicates that Rule 3 will be used.

Table 1. Length differences
LengthLengthLogCutoff
12Ratio0.10.20.3
39300.262 falsefalsetrue
3903000.262 falsefalsetrue
390030000.262falsefalsetrue

Rule 3: Else use the best Markov score.
Markov is the 5th-order Markov model as computed by Transcoder3 and originally used in GeneID2. It is trained using either of the following:

  • Train with hits that have E-value < [default 1E-75]. Minimum Set [default 50]
    There must be at least Minimum Set sequences that pass the E-value cutoff or no training set is created. The training set is created from the exact nucleotide regions of blast4 alignments.
  • Train with CDS file [default none]
    The file name of a fasta file of nucleotide coding sequences.

Note: Its rare to use Rule 3 since generally a sequence has a good hit or is a sufficiently long ORF. However, the scores are interesting to view in viewSingleTCW, where the codon usage score is also shown (trained with the same set).

Additional heuristics: These rules were developed by studying the results.

An ORF must be greater than 30bp unless the sequence length is less than 30bp, then it must be at least 9bps.

For Rule 1:

  • If the hit frame is used, the coordinates are set as follows:
    1. If the hit E-value<1E-100 and the hit coverage is >95%, search for the first Start and Stop codons within the first 40 upstream and downstream codons, respectively. If no Start is found, the hit start coordinate will be used, and if no Stop is found, the hit end coordinate will be used.
    2. Otherwise, search to the ends for the first Start and Stop codons.
    3. In either case, the extension will not pass through a string of >8 consecutive N's.
  • The hit frame will be ignored in the following two conditions:
    1. If there are Stop codons within the hit region, the hit is ignored.
    2. If a sequence has hits to multiple frames, heuristics are applied to determine whether to ignore the hit.
  • The 'Best Eval' hit is used for this rule, but if the 'Best Eval' hit is from a nucleotide annoDB (i.e. annotation database), it cannot be used. In this case, the 'Best Annotation' hit is used if it is from a protein database.
For Rules 2 and 3:
  • If a frame is being analyzed that does not have a hit or it is being ignored: N's in the sequence are ignored. The reason that consecutive N's are allowed in ORFs is that they can be found in hit regions, as show in Figure 2.
  • If the Markov score is being compared: if both are negative, if one has a hit it will be selected, else the longest is selected.
  • For training, only the best E-value hit from each sequence is evaluated.
It is common that even if the hit is being ignored, the hit frame is selected anyway as it will have the longest ORF or best Markov score. Ambiguity tends to arise with the shorter sequence, those with many n's, and multi-frame sequences.

Figure 2: A hit containing strings of consecutive n's. The italics region is the hit region.

Running the ORF finder only

The ORF finder is run after the sequences are annotated (Exec Annotate Sequences in runSingleTCW). If you want to change the ORF finder options and run it again, there is a "Exec ORF only" function in runSingleTCW. Or you can run it from the command line:
./execAnno  -r
This will uses the option set using runSingleTCW.

The ORF finder prints the progress to the terminal, finishing with a summary of the results.

Output files

The following files are written into the project's subdirectory "ORF". The following examples are from the demoTra demo files. For a given project, the "demoTra" will be replaced with its project name.
demoTra_aaORFs.faThe translated CDS for the best ORF for each sequence. This file is also written into the projcmp/AAfiles directory in order to make it easier to use with runMultiTCW.
demoTra_ntORFs.faThe CDS for the best ORF for each sequence.
demoTra_goodORFs.faThe translated CDSs for all good ORFs*.
AllGoodORFs.txtA list of the coordinates for all good ORFs*.
BestORFscores.txtFor the best ORF for each sequence, the 6-frame scores for the ORF.
ScoreCodon.txtThe computed codon usage frequencies.
ScoreMarkov.txtThe computed 5th-order Markov model loglikelihood.
multiFrame.logProvides information on sequences with hits to multiple frames.
Table 3. Output files of the TCW ORF finder
* Selected hit, or >= 900nt, or Markov score is >0 and best frame.

Example of BestORFScore.txt of the 6-frame scores for the best ORF:

>tra_001 type:5p-partial seqLen:1024 orfLen:564 frame:1 coords:106..669 Hit:1.9E-86
 Markov   47.03   -9.11  -23.90    8.36   -6.33  -14.37    3
 Codon    37.91 -131.61 -377.46  -58.59  -83.01 -239.04    3
The first score is for the ORF starting at 0 (hence, the current frame), the next two are offset by 1 and 2 respectively, the last three are for the reverse sequence with offsets 0,1,2.

viewSingleTCW - Viewing the ORFs

Go to top
The TCW ORF finder assigns remarks to the sequence about the selected ORF; all TCW assigned remarks start with a "$". In the images below, an ORF assigned remark is searched on in A, viewed as a column in B, and shown in the sequence frame view of C.
A. Basic Sequence Query for Remark
B. Columns of the Sequence table
C. Sequence Details Frame view
D. 6-frame scores for the displayed ORF
Figure 3: viewSingleTCW features for viewing the ORFs.

A. Basic Sequence Query for Remark: The Basic Sequence Query allows searching on the remark, in this example, the remark "!LG" was searched on (substring of remark $ORF!LG), which shows the sequences where the ORF is not the longest. All rows were selected and "View Sequences" shows them in the main Sequence table as shown in Figure 3B.

B. Columns of the Sequence table: This figure shows most of the ORF columns. A row can be selected followed by "View Selected Sequence" to view the frame as shown in Figure 3C.

C-D. Sequence Details Frame view: This example shows that the longest ORF was not the selected one. The selected one has a hit with an E-value of 8E-52 and good Markov and Codon scores; by toggling the "ORFs" to "Scores", the upper region will show the 6-frame scores as shown in Figure 3D. The full sequence is shown, where the start and stop codons are highlighted in green, and the hit region is in italics.

Results

Go to top
The TCW ORF finder prints out various summary counts. The results below are from a RNA-seq assembled dataset of 20990 sequences.
  
Is Longest ORF   18767 (89%)    Has Start&Stop   9187 (44%)    Used Hit frame     16638 (79%)    
ORF>=300         17729 (84%)    Has Start       12852 (61%)    Markov Best Frame  17783 (85%)    
Average ORF len          840    Has Stop        12877 (61%)    Codon Best Frame   19518 (93%)     
The Markov and Codon Best Frame score is one that is positive and greater than all other frames for the ORF. Though the Codon percentage is better than the Markov in this case, the Markov score is much more discriminatory. For example, when the ORF finder was executed with a file of UTRs as input, where the table on the right show the difference.
Table 2: Training with CDS vs UTRs
InputMarkovCodon
Hit Regions17783 (85%)19518 (93%)
UTRs5832 (28%)13370 (64%)

 

From the 16638 ORFs that used the hit frame, 14956 (90%) were the longest ORF. If this is the case in general, it would indicate that selecting the longest ORF would be correct in at least 90% of the cases. However, the longest ORF is not necessary the best start codon, as shown in the example on the right.
Figure 3. Not longest ORF for frame. An ATG codon is further upstream then the one selected.

TCW ORF finder and Trinity Transcoder

Go to top
The Transcoder3 Perl code for computing the reading frame specific 5th-order Markov model was written in Java for the TCW ORF finder, that is, it computes the exact same scores.
  • If the TCW file HitRegion.txt was entered into the Transcoder seq_n_baseprobs_to_loglikelihood_vals.pl program, the TCW ScoreMarkov.txt file will be the same as the Transcoder hexamer.scores file.
  • If the project_ntORFs.fa file is used as input to the score_CDS_likelihood_all_6_frames.pl program, the TCW BestORFscores.txt file will have similar Markov score as the longest_orfs.cds.scores file; there is a slight variation because of a difference in precision between Perl and Java.

The Transcoder and TCW ORF finder results were compared for the same input file of 250 sequences. As the Transcoder may list more than one ORF for a frame, only the first (and best) was compared with the best TCW ORF. There were the 131 differences: Frame: 6, Start 94, Stop 12, Start&Stop 19. There was a hit for all difference except for 11; one of these had a different frame due to TCW Rule 3, the rest were insignificant differences in coordinates, e.g.

Bar_101       Diff start    
   1644  3p-partial             2    17..1660   TCW
   1662  internal               2     2..1660   Transcoder
Of the six with a frame difference, one was just mentioned and the rest were due to selecting the hit frame; they were all marked as "ORF!LG", that is, not the longest. Most of the difference were because TCW was able to use the hit coordinates as a guide (e.g. see Figure 3).

The Transcoder can use Pfam results, which was not used in the above comparison. To compare the two without any Hit/Pfam results, hits were removed from the TCW database and the Transcoder top 500 CDSs was used as the training set. The resulting comparison had only 35 differences, two were frame differences due to Rule 3 and the rest were the minor differences in end coordinates (as shown in Bar_101 above).

References

Go to top
  1. ESTscan: Iseli C, Jongeneel CV, Bucher P (1999) ESTScan: a program for detecting, evaluating, and reconstructing potential coding regions in EST sequences.
  2. GeneID: Parra G, Blanco E, Guigo R (2000) GeneID in Drosophilia. Genome Research 10:511-515.
  3. Transcoder: Hass BJ, Papanicolaou A, Yassour M. et al. (2013) De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nature Protocols 8:1494-1512
  4. Search programs used in TCW:
    1. BLAST: Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389-3402.
    2. Diamond: Benjamin Buchfink, Chao Xie & Daniel H. Huson, Fast and Sensitive Protein Alignment using DIAMOND, Nature Methods, 12, 59-60 (2015) doi:10.1038/nmeth.3176.
    3. Ublast: Edgar,RC (2010) Search and clustering orders of magnitude faster than BLAST, Bioinformatics 26(19), 2460-2461.
Go to top
Email Comments To: tcw@agcol.arizona.edu