sTCW Assembly Guide

sTCW Assembly Guide

This document describes the TCW assembly process, which is run via the runSingleTCW interface or the script execAssm using the parameter file projects/<project-name>/sTCW.cfg.

Note that TCW cannot assemble raw RNA-seq reads. Rather, assembly in TCW serves primarily the following purposes:

Assemble ESTs or 454 reads.
Assemble multiple transcripts libraries with optional count data.
Assemble transcript library(s) with optional count data with ESTs.

Assembly of the demo project
Transcript with counts and EST libraries
Choosing Read Names
The Assembly Process
Assembly Parameters
SNPs and Extras
Trouble shooting

Assembly of the Demo Project

This demo uses the project demoAsm. The assembly process uses blast and cap3 (see Installation). IMPORTANT NOTE: The Blast path MUST be defined in the HOSTS.cfg using the blast_path parameter; see Install.

Start by running runSingleTCW and selecting project demoAsm, giving the interface shown on the right.

The project has two libraries, Illumina (an Illumina RNA-seq library) and Sanger (an EST Sanger library). The Illumina transcripts have read counts from two count libraries, tip and zone, reflecting two different tissues of the rhizome.

Press the Step 1. Build Database button to create the database and load the data.

Then set #CPUs to the number of processors your machine can spare for the assembly (one or two are sufficient for this project).

Press Step 2. Instantiate to start the assembly. Make sure the Skip Assembly is NOT checked.

When done, it prints a summary (see below).

Final assembly summary

Below is the summary (the numbers can be slightly different due to a different machine, #CPUs, and blast version).

>>>Assembly Statistics 27-Mar-16 19:31:29

	DATASET                  #SEQS                    #SINGLETONS              #BURIED
	Illumina                 112                      78 (69%)                 13 (11%)
	Sanger                   98                       7 (7%)                   1 (1%)

	Total reads:  210
	Total buried: 14  Initial buries: 14   Buried during assembly: 0

	Contig sizes (#reads)
	Counts     =2         3-5        6-10       11=20      21-50      51-100     101-1000   >1000
	#Contigs   6          12         4          1          1          0          0          0

	Contig lengths (bp)
	Length     1-100      101-500    501-1000   1001-2000  2001-3000  3001-4000  4001-5000  >5000
	#Contigs   0          60         19         21         2          1          0          1

	Total contigs:     104
	Contigs(>1 seq):    24
	  Single mate-pair: 5
	Singletons:         80

	Finished in  1m:40s

Transcript with counts and EST libraries

When assembling transcripts with counts and EST libraries, a resulting contig with one or more transcript sequences and one or more ESTs will add the transcript counts and add the aligned EST.

Choosing Read Names

Using consistent and well-chosen read names makes data analysis much easier, and is essential for some aspects of TCW.

The name of the read is the string immediately following the ">" in the fasta file. For example, if your fasta file contains the lines

	>ZM_BFa0001A01.f
	AAGATCCGCCTCATTCACACCCCCATCTACCTAGCTAGCTAGTTTACCAAAAAAAAATCTGGCCACA
	GGGATGCGGTGGCGGCTGCAGCCGGCGCCGGCGCCGACGCTGCTCCTCGTCCTGCTGGTG
	>ZM_BFa0001A01.r
	AAAAAGCAAAATACAAACCAAGCTCCAGTTCCAATACATTACTCTAGCACAAGCTTTCAG
	CACATTACAAAGTAGGAACCAAGACCACCCAAGCTCCAATCACACTACAATTCATCACCA

then the two read names are ZM_BFa0001A01.f and ZM_BFa0001A01.r.

Naming guidelines:

length/characters: Keep read names under 25 characters, using only letters, numbers, and underscores.
uniqueness: No two reads in a TCW database may have the same name.
prefixes: Use the library name as the read prefix (e.g. ZM_BFa in this example).
This makes it much easier to study the assembled contigs where different libraries are mixed up.
For 454 data, the names are meaningless to the typical user, hence, the reads should be renamed with the library name followed by consecutive numbers.
mate-pair suffixes: If your read contain 5'/3' mate pairs, indicate this with suffixes (e.g. ".r", ".f" in this example). The suffixes must be absolutely consistent within a library. If some read have ".r", while others have just "r", or if some have ".r" meaning 3', while for others it means 5', then TCW cannot use the mate pair information to improve the assembly.

The Assembly Process

Following are the main stages of TCW assembly, organized by their headings which print to the screen. The sample durations are for a 700k read assembly using 6, 2.4 Ghz CPUs and the default settings.

Section Heading	What TCW is doing	Sample duration
>>>Delete previous assembly	There was a previous assembly of the same name, which you have selected to delete and start over. This can take a while for big projects!!	2h
>>>Initial bury alignment	TCW sets aside ("buries") reads which are nearly identical to another read, in order to reduce redundant assembly effort. It runs a "self-blast" of all the reads against each other, using Megablast. It then parses the output and saves the buries to the database.	1h 10m
>>>Compute cliques	A clique is a special type of cluster¹. TCW groups the reads into cliques and builds the initial contigs from them. To do this it again runs Megablast of the reads against each other, this time using only the non-buried reads.	1h 10m
>>>Clique assembly	Each clique is given to cap3 to assemble. Any leftover read are made into singleton "contigs".	2h
>>>Clique cap buries	The initial contigs from clique assembly are now analyzed for additional reads to bury. Any read lying in a region of 5x or greater coverage may be "cap-buried" in another read whose span in the contig is close enough. This is controlled by sTCW.cfg parameters CAP_BURY_MIN_DEPTH and CAP_BURY_MAX_HANG. For a large project, often over 50% of the clones will be buried by the end of this stage.	10m
>>>Contig merge rounds	Now TCW goes through the contig merge rounds specified by the "TC"² parameters in sTCW.cfg. For each round it writes out the current contig consensus sequences, blasts them against each other, and attempts to merge each overlapping pair.	9h
>>>Finalize contigs	Mate-pair contigs are joined together by N's. All buried reads are collected and assigned to their correct final contig. Each read is re-aligned to the consensus sequence of its contig, and the SNPs and extras are identified. Suspect contigs are flagged.	1h 10m

¹ In a clique, each read must have an overlap with all reads in the clique.
² In a TC (transitive closure), each contig must have an overlap with at least one contig in the TC.

Assembly Parameters

Note, the default parameters have been extensively tested and you will probably not want to change them. Most of the parameters for assembly are available for change through the runSingleTCW interface (press the Options button in the Assembly section).

Calculation of SNPs and extras

The parameters listed in the following can be set in the LIB.cfg file with an editor.

A SNP is possible when one or more read have a different base at some location than is found in the consensus. However, base-calling error can lead to many false positives, so TCW applies two screens to the possible SNPs. First, at least two reads must contain the SNP (you can change this with the SNP_CONFIRM parameter).

Also, a probability score is applied. The probability ('p-value') is computed using a binomial score based on the number of confirming reads, the depth of the contig at that base, and the estimated basecall error rate. The error rate is estimated from mismatches seen in the clique assembly, or it can be set using BASECALL_ERROR_RATE. The p-value threshold can also be set using SNP_SCORE.

When there are extra bases in some reads which are not in the consensus sequence generated by cap3, TCW uses another probability score to determine whether to regard the extras as "real" and add a pad character (*) to the consensus. The score is computed in the same way as for SNPs, and uses the config parameters EXTRA_CONFIRM, EXTRA_RATE, EXTRA_SCORE. Extras not determined to be real are stored in the database and shown in the UI.

Trouble Shooting

Blast fails. Try running the blast executable from the command line, which will usually elucidate the problem.
IMPORTANT NOTE: The full Blast path MUST be defined in the HOSTS.cfg using the blast_path parameter; see Install.
CAP3 fails.
- If you get an "java.io.IOException: Cannot run program", then the supplied CAP3 is not compatible with your OS. Go to seq.cs.iastate.edu/cap3.html, and download a CAP3 compatible with your systems and put it in /Ext/linux/CAP3 or /Ext/mac/CAP3 for Linux or Mac, respectively.
- On recent Mac OS (e.g. Catalina), external programs that are not registered with Apple will not automatically run. To fix this, see External program.
Interruptions. An assembly lasting multiple days can be interrupted for numerous reasons, e.g. running out of memory, losing connection to the database, or having the system reboot. In most cases this is not a problem and the assembly can be restarted to resume where it left off. It will check its database for consistency, and if the assembly continues successfully, then it should be fine, while if there are errors then it should be restarted from the beginning.
Clone assigned to multiple contigs. This happens sometime when a assembly has been restarted. Unfortunately, you need to delete the database and start again.
Crashes. If the assembly crashes, it will usually write the Java exception error into a file stcw.error1.log. This is information that we can use to debug and fix the problem.
For other problems, see Trouble Shooting.

For assembly, the database must support Innodb tables

TCW checks this using the "show engines" command in MySQL. If the Innodb engine is not listed as supported, this error is shown; however, you can still perform all TCW functions except for assembly.

The most common cause of this problem is a mismatch in the innodb log file size. The MySQL error log will contain messages like

	InnoDB: Error: log file ./ib_logfile0 is of different size 0 5242880 bytes
	InnoDB: than specified in the .cnf file 0 104857600 bytes!

Solution is to delete this log file and restart MySQL.

Doing an assembly, the database is very slow

The default parameters of MySQL are not suitable for large high-performance databases. Especially, the innodb_buffer_pool_size must be increased. 100M is sufficient for one large project, but for many large projects it should be 1G at minimum. For more see
http://dev.mysql.com/doc/refman/5.0/en/innodb-buffer-pool.html. Note, this only affects usage during an assembly, when InnoDB tables are used.

Email Comments To: tcw@agcol.arizona.edu

Contents