TCW ORF finder

TCW ORF finder

This document discusses the TCW ORF finder.

Contents:

ORF Finder
viewSingleTCW - Viewing the ORFs
TCW ORF finder and Trinity TransDecoder
References

ORF Finder

The runSingleTCW annotation step computes the best ORF for each sequence using the annotation hits, 5th-order Markov model^2,3 and length of candidate ORF. It outputs a protein sequence for every transcript. This document has been updated for v3.3.4.

runSingleTCW has an Option menu that allows the user to set the ORF options, which is shown on the right.

This ORF finder algorithm was designed to find the 'best' ORF for input into runMultiTCW. However, a file of all the 'good' ORFs is output (see Output files).

Figure 1: The runSingleTCW ORF finder options.

Algorithm

Go to top

The algorithm for computing the best ORF for a given sequence is:

   1. For each of the 6 frames, find the best ORF as follows:
		If there is a good hit (E-value < N or %HitSim>=M), use the hit frame.
			If the hit terminates with a Start and Stop codon, use the corresponding coordinates.
			Else compute all ORFs surrounding the hit and select the best based on the Rules.
		If there is not a good hit, find all possible ORFs and select the best based on the Rules.
   2. From the 6 ORFs found in Step 1, find the best ORF using all Rules.

Rules

When comparing two ORFs, the following rules are applied in order.

☞ Rule 1: Use best hit frame if E-value < N or %HitSim>=M.

☞ Rules 2: Compute lgLen=log ratio of the lengths and 3: lgMk=log ratio of Markov scores.
Evaluate the following in order:

If (lgLen>len-cutoff && (len1>100 || len2>100)) use ORF with longest length
If (lgMK>MK-cutoff && (Mk1>10.0 || Mk2>10.0)) use ORF with best Markov score
If (lgLen>len-cutoff) use ORF with longest length
If (lgMK>MK-cutoff) use ORF with best Markov score

☞ Rule 4: Use the ORF with start and stop codons.

☞ If all else fails: Use a poor hit frame or the longest ORF.

The Markov model is trained using either of following:

Train with Best Hits (E-value <= N or %HitSim >= M). Minimum Set [default 500]
There must be at least Minimum Set sequences that pass the E-value and %Sim cutoffs, otherwise Rule 3 is ignored. The training set is created from the exact nucleotide regions of hit alignments from the top longest ORFs and best hits.
Train with CDS file [default none]
The file name of a FASTA file of nucleotide coding sequences.

Considering log ratio cutoffs using length as an example:

If the lengths of two candidate ORFs are very different, than it is best to use the longest ORF.
To determine if their lengths are 'very' different, the log length ratio is used.

For example, in the Table 1 on the right, the 0.3, 0.4 and 0.5 are Log Ratio cutoffs.

A value of 'T' indicates that the longest ORF will be selected.

			Cutoff
Length 1	Length 2	Log Ratio	0.3	0.4	0.5
30	45	0.40547	T	T	F
300	450	0.40547	T	T	F
3000	4500	0.40547	T	T	F

Markov training set

The default requires at least 500 sequences (Minimum Set). The algorithm finds the 2000 unique longest sequences to train with; if there is not 2000, all unique sequences will be used. The Minimum Set can be changed in the interface. The number of training sequences can be changed by running the ORF finder from the command line, e.g.

	./execAnno <project> -r -t 5000

The training set that TCW uses to create the Markov model can make a big difference in the Markov scores, hence, resulting ORFs. If your dataset is small and there is a file of CDS sequences available, use it!

Additional heuristics

Length: An ORF must be greater than 30bp unless the sequence length is less than 30bp, then it must be at least 9bps.

Rule 1:

Multi-frame: The best hit frame will be used if it is a good hit, else, the hit is ignored.
Stops in Hit: All possible ORFs are computed, and the selected one will be from those that cover a region of the hit.
Nucleotide Best Bit-score: The 'Best Anno' hit will be used instead (assuming its protein).

Executing the ORF finder

The ORF finder is run after the sequences are annotated (Step 3. Annotate in runSingleTCW). If you want to change the ORF finder options and run it again, there is a "ORF only" function in runSingleTCW. Or you can run it from the command line:

	./execAnno <project> -r

Running from the command line will use the options set using runSingleTCW.

Output files

Go to top

The following files are written into the <project>orfFiles directory.

Table 2. Output files of the TCW ORF finder.

bestORFs.pep.fa	The translated CDS for the best ORF for each sequence.
bestORFs.cds.fa	The CDS for the best ORF for each sequence.
bestORF.scores.txt	For the best ORF for each sequence, the 6-frame scores for the ORF.
allGoodORFs.pep.fa*	The translated CDSs for all good ORFs.
allGoodORFs.scores.txt*	A list of the coordinates and scores for all good ORFs.
scoreCodon.txt	The computed codon usage frequencies.
scoreMarkov.txt	The computed 5th-order Markov model loglikelihood.

* Selected hit, or >= 900nt, or good Markov score (>0 and best of 6-frames for ORF).

Example of AllGoodORFs.txt, where the first one listed is the selected ORF.

    ### Name        ntLen aaLen        Type Frame  Start..Stop     Markov Codon
	tra_002_a        2160   720  5p-partial     3      3..2162   139.08   67.16   Hit: 3E-215  52%  84% (6..2150)
	tra_002_b         141    47  5p-partial     1   2269..2409     5.01    5.50

Example of bestORFs.scores.txt of the 6-frame scores for the best ORF:

	>tra_002         type:5p-partial seqLen:3111     orfLen:2160     ORF:3-2162(3)        Hit:2.94E-215,52%,84%
 	 Markov  139.08  -68.93  -26.47  -51.14  -25.33  -29.83

The first Markov score is for the ORF starting at 0 (hence, the current frame), the next two are offset by 1 and 2 respectively, the last three are for the reverse sequence with offsets 0,1,2.

Summary of Results

Go to top

The TCW ORF finder prints out various summary statistics to the <projects>/logs/anno.log file, and the top part of the following is in the Overview. The results below are from a RNA-seq assembled dataset of sequences.

 
ORF Stats:   Average length 1272
	Has Hit            52,854  (99.0%)    Both Ends    40,855  (76.5%)    Multi-frame    12,988  (24.3%)
	Is Longest ORF     41,639  (78.0%)    ORF>=300     48,891  (91.5%)    Stops in Hit    7,849  (14.7%)
	Markov Best Score  46,046  (86.2%)    ORF=Hit      32,253  (60.4%)    >=9 Ns in ORF       0     (0%)
	All of the above   38,628  (72.3%)      with Ends  28,922  (54.2%)

  Additional ORF info                      For seqs with hit   52,859  (99.0%)     ORF=Hit with Ends   28,922  (54.2%)
	One End             52,328  (98.0%)      Both Ends          40,717  (77.0%)      ORF>=300           28,012  (96.9%)
	Markov Good Frame   47,700  (89.3%)      Markov Good Frame  47,398  (89.7%)      Markov Good Frame  27,792  (96.1%)
	ORF=Hit             32,253  (60.4%)      Markov Best Score  45,724  (86.5%)      Markov Best Score  27,353  (94.6%)
	ORF~Hit              7,488  (14.0%)      Is Longest ORF     41,329  (78.2%)      Is Longest ORF     23,328  (80.7%)
	ORF>Hit             11,827  (22.1%)      Longest & Markov   38,628  (73.1%)      Longest & Markov   22,776  (78.7%)
	  with Ends          4,878   (9.1%)      Not hit frame                    5      Sim>=90            28,399  (98.2%)

	Frame: 3(30.5%)  2(29.9%)  1(34.3%)  -1( 1.9%)  -2( 1.6%)  -3( 1.9%)

  Both Ends:           Has Start and Stop codon
  ORF=Hit with ends:   ORF coordinates=Hit coordinates with ends
  Markov Best Score:   Best score from best ORF for each of 6 frames
  Markov Good Frame:   Score>0 and best score from 6 RFs of selected ORF

viewSingleTCW - Viewing the ORFs

Go to top

The TCW ORF finder assigns remarks to the sequence about the selected ORF (see TCW Basic Sequence Help, which has a "TCW Remark Legend" section). In the images below, (A) an ORF assigned remark is searched on, (B) the set is viewed as a Sequence table, (C) the sequence frame view and (D) the Markov scores for the best ORF from C. Further description is provided below the set of images. Click on an image for a clearer view.

A. Basic Sequence Query for Remark

B. Columns of the Sequence table

C. Sequence Details Frame view

D. 6-frame scores for the displayed ORF

Figure 3: viewSingleTCW features for viewing the ORFs.

A. Basic Sequence Filter for TCW Remark: In this example, the remark "!Lg" was searched on, which shows the sequences with ORFs that are not the longest. All rows were selected and Seq Table shows them in the main Sequence table as shown in Figure 3B.

B. Columns of the Sequence table: This figure shows most of the ORF columns. A row can be selected followed by Seq Detail to view the frame as shown in Figure 3C.

C. Sequence Details Frame view: This example shows that the selected ORF is not the longest ORF, but has a hit with an E-value of 3.3E-98 and good Markov and Codon scores.

D. By toggling the ORFs/NT to Scores/AA, the upper region will show the 6-frame scores of the selected ORF.

The Markov and Codon Best Frame score is one that is positive and greater than all other frames for the ORF. Though the Codon percentage is sometimes better than the Markov, the Markov score is much more discriminatory. For example, when the ORF finder was executed with a file of UTRs as input, the table on the right show the difference.

Table 3: Training with CDS vs UTRs

Input	Markov	Codon
Hit Regions	17783 (85%)	19518 (93%)
UTRs	5832 (28%)	13370 (64%)

TCW ORF finder and Trinity TransDecoder

Go to top

Markov is the 5th-order Markov model as computed by TransDecoder³ and originally used in GeneID². The TransDecoder² Perl code for computing the reading frame specific 5th-order Markov model was translated to Java for the TCW ORF finder. However, they use different training sets, so their scores tend to be different. They are the same +/-0.02 if either of the following is done: (1) Use the TransDecoder longest_orfs.cds.top_500_longest as the TCW CDS file and run ./execAnno <project> -f -r. (2) Use the TCW orfFiles/bestORFs.cds.fa file as input to the TD score_CDS_likelihood_all_6_frames.pl program. In either case, compare the TransDecoder longest_orfs.cds.scores (.p1 only) with the TCW orfFiles/bestORF.scores.txt file, for the ORFs where the start and ends are the same. For example:

TransDecoder: longest_orfs.cds.scores
#acc    Markov_order    seq_length      score_1 score_2 score_3 score_4 score_5 score_6
Osj_00011.p1    5       1047    83.19   0.06    15.85   27.04   26.83   5.91

TCW: bestORF.scores.txt
>Osj_00011       type:complete   seqLen:1509     orfLen:1047     ORF:286-1332(1)      Hit:2.72E-79,45%,93%
 Markov   82.69    0.06   15.86   27.04   26.84    5.92

References

Go to top

GeneID: Parra G, Blanco E, Guigo R (2000) GeneID in Drosophilia. Genome Research 10:511-515.
TransDecoder: Hass BJ, Papanicolaou A, Yassour M. et al. (2013) De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nature Protocols 8:1494-1512
Search programs used in TCW:
1. BLAST: Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389-3402.
2. DIAMOND: Benjamin Buchfink, Chao Xie & Daniel H. Huson, Fast and Sensitive Protein Alignment using DIAMOND, Nature Methods, 12, 59-60 (2015) doi:10.1038/nmeth.3176.

Email Comments To: tcw@agcol.arizona.edu