The University of Arizona
singleTCW Guide

This document covers building a TCW database for a single species. Terminology:

  • Dataset is one of the following: (1) a set of sequences to assemble, with optional quality data, or (2) a file of sequences with optional count data from conditions. The sequences can be nucleotide or amino acid.
  • Conditions may be tissues, treatments, etc., that are to be compared, with optional replicates.
  • AnnoDBs are fasta files of sequences (nucleotide or amino acid) to compare the dataset sequences against for annotation. TCW provides special support for UniProt, but can use any file of sequences (e.g. GenBank nr).
Note: some old terminology is still in use. The term "library" used to be used for both "dataset" and "condition", and the acronym PAVE is still found in the configuration files because that was the name before TCW.

Running the demo

Demo as of 27Mar16   Description                                  Documentation              Old demo
demoTra              Transcripts with counts, locations, remarks  Here - this section        demoOl_DE
demoAsm              Assemble transcripts and ESTs                Assembly Guide             demoZo_ASM
demoPro              Protein sequences with counts                Same steps as for demoTra  --

Follow steps of Installation. Make sure your HOSTS.cfg is correct.

At the command line, type

./runSingleTCW
The window shown on the right will be launched. Follow the instructions below on the left of the image.
Note that this demo project is pre-configured; to create a new project from scratch, refer to Creating a New Project. Select demoTra from the Project drop-down list. Note that you can set the number of CPUs to use, although the demo does not need more than one.

1. Click the Exec Load Data command
This loads the datasets for this demo: one dataset of transcripts ("Ginger") and three conditions with count data ("Root", "Tip", "Zone"), where the first condition has five replicates and the other two conditions have one replicate each.

Note: A command button will turn gray while it is executing, and there will be output to the terminal. Keep an eye on the terminal because you may be prompted with yes/no questions that need to be answered. If you take too long to respond, the prompt times out; in that case, press Ctrl-C and restart.
Note: A command will not be active (i.e. grayed out) when it is not valid to run, e.g. the Exec Instantiate command is grayed out since the datasets have not been loaded.

2. Click the Exec Instantiate command.
Skip Assembly is checked, so the transcripts will simply be loaded without assembly.
Use Sequence Names From File is not checked, meaning that TCW will assign new, sequentially numbered names prefixed by the singleTCW ID.

From this point on, you may run

./viewSingleTCW tra
after any step to view the results that have been entered.

3. Click Exec Annotate Sequences
This searches against several UniProt partial databases which have been provided as part of the package.

If the GO database has not been built yet (see Step 5), the following will be written to the terminal:

+++Warning: GO_tree go_demo is missing; ignoring GO step
--Please confirm above parameters. Continue with annotation?  (y/n)?
				
Answer 'y' to continue.

The output to the terminal and to the file projects/demoTra/logs/annotator.log will look something like this log (this includes the GO annotation).

4. Click the Add Remarks or Location button (bottom of window); a window will pop up (not shown). Select the file "traRemarks.txt" (when you press the "..." button, you will see the file), then select Add Remarks from File. This illustrates how to add remarks to one or more sequences, which can then be searched in the TCW Basic search. You may also add the "traLocations.txt" file to view locations in the resulting database.

5. To build the GO database, see Demo annotation setup. Then execute Exec GO Only.

6. To compute differential expression (DE), install R and the respective packages. From the command line, execute

./runDE tra
The DE Guide describes how to install the necessary packages, and how to add DE p-values to the TCW database. If Step 5 has been run, then you can also add the p-values for the GO.

CREATING A NEW PROJECT

Follow steps of Installation. Make sure your HOSTS.cfg is correct.

To create a new project, press the Add Project button at the top of the runSingleTCW interface. You will be prompted for a project name; enter a name using letters, numbers, or underscores (no spaces). Do not include "sTCW"; it will be added automatically. When you select "OK":
  1. The name will be entered for singleTCW ID, and for Database with the prefix "sTCW_".
  2. The libraries/<name> and projects/<name> directories will be created with files LIB.cfg and sTCW.cfg, respectively. These maintain all information you enter.
For example, if you enter the name "example", the ID will be "example" and the database will be "sTCW_example". You cannot change the Database name, but you can shorten the singleTCW ID.

singleTCW ID: This can be used as a command line parameter to runDE and viewSingleTCW. Also, if TCW generates the sequence names (if Use Sequence Names from File is not checked), the singleTCW ID followed by sequential numbers will be used.

Instead of using Add Project, you may create a directory under /libraries and put your sequence files in it, along with any optional files (e.g. quality and count). When you select the project pulldown, you will see your project, because the pulldown lists all directories under /libraries.

Load Data

Define all datasets, then select Exec Load Data to load the data into the database. Datasets cannot be added to an existing database.

Defining a sequence dataset

A TCW project must have at least one sequence dataset, which is a FASTA file of sequences. Select Add beside the Sequence Datasets and the panel will be replaced with the one shown on the lower right.

After entering the files and attributes, select Keep. The main panel will reappear with the SeqID and Title added to the Sequence Datasets table. Additionally, the information will be written in libraries/<name>/LIB.cfg.

SeqID: Enter the dataset name (a brief identifier) in the first box; if you want TCW to generate sequence names, it will use this identifier followed by consecutive numbers. It is very important that you create a very short but descriptive name. Add a more descriptive title and information in the ATTRIBUTES section, which is shown on the Overview page of viewSingleTCW.

Sequence File: Click the browse button labeled "..." to select the fasta file of sequences. The ">" lines are the sequence names, where characters other than letters, numbers, and underscores will be changed to underscores. Additional files may be added:

  1. For already assembled transcripts (e.g. Illumina) or protein sequences, there may be associated count files. Adding them is covered in the next section.
  2. For Sanger ESTs or 454 data, there may be quality data. Enter the name of the quality file.
  3. For Sanger ESTs, TCW assumes the 5' ESTs have the ".f" suffix and the 3' ESTs have the ".r" suffix. If your suffixes are different, enter the correct ones.

Defining attributes and updating attributes

Enter any additional information as desired in this section. This information will be shown on the Overview panel of viewSingleTCW.

The attribute information can be added or changed after the database is created by using the Edit button on the main panel. Note that the attributes can only be changed when running from the directory that contains libraries/<name>/LIB.cfg.

Defining count data

Associated with each dataset may be one or more conditions with count data. Each condition has a count for each of the sequences.

Count File: Click the browse button labeled "..." to select the file containing the table of counts for the sequences (see sample below).

If you have the replicate library counts in separate files, you can use the Build combined count file option to create the 'table of counts' file, as discussed below.

 

Keep: The dataset panel disappears and the main panel returns. Note that the sequence dataset is now shown in the first table, while the conditions (column headings from the count file) are listed in the second table.

Use Define Replicates to group replicates and Edit Attributes to add information for the conditions, which will be shown on the viewSingleTCW overview.

Biological replicates

If you have biological replicates, you now need to click Define Replicates to define them. This brings up the panel shown on the right, which is shown with the "demoTra" example data.

The first column shows the sequence SeqID, which you entered when creating the dataset. The second column shows the "Column name from file", which are the column headers in the count file (see example below). The third column shows the "Condition", which groups the replicates into a single condition.

If the replicate names in the count file are not named with the format "NameNumber", then TCW cannot automatically group them correctly. You will need to edit the "Condition" column to map your replicate names to their condition.
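The "NameNumber" convention can be illustrated with a short Python sketch. This is not TCW's own code: the helper names are hypothetical, and the rule used here (the condition is the column header with its trailing digits removed) is an assumption based on the description above.

```python
import re
from collections import defaultdict

# Hypothetical helpers (not part of TCW). ASSUMPTION: a replicate header is
# a condition name followed by trailing digits, e.g. "Root3".
_REP_PATTERN = re.compile(r"^(.*?)(\d+)$")


def split_replicate(header):
    """Return (condition, replicate number), or None if the header does not fit."""
    match = _REP_PATTERN.match(header)
    if match is None or not match.group(1):
        # No trailing number, or nothing but digits: cannot be grouped.
        return None
    return match.group(1), int(match.group(2))


def group_replicates(headers):
    """Map each condition to its replicate headers; also list headers that do not fit."""
    groups = defaultdict(list)
    ungrouped = []
    for header in headers:
        parsed = split_replicate(header)
        if parsed is None:
            ungrouped.append(header)
        else:
            groups[parsed[0]].append(header)
    return dict(groups), ungrouped


if __name__ == "__main__":
    demo_headers = ["Root1", "Root2", "Root3", "Root4", "Root5", "Tip1", "Zone1"]
    groups, ungrouped = group_replicates(demo_headers)
    print(groups)
    print(ungrouped)
```

Headers that end up in the "ungrouped" list correspond to the names you would need to map by hand in the "Condition" column.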

On Keep, the Associated Counts table on the main panel will be updated, showing the correct condition list and number of replicates for each. In order to add metadata (e.g. title), select a row followed by Edit.

Build combined count file

The Count File that is input to the Load Data routine is a tabular file where the first line contains the word "SeqID" (or any label) followed by the condition column headings. The file must be space delimited; if your file uses commas, they can be replaced with spaces using sed -i 's/,/ /g' filename (GNU sed; on macOS/BSD use sed -i '' 's/,/ /g' filename).

Example Count File: This is part of the demoTra count file, showing the first 4 transcripts and counts for three conditions, where the first has five replicates. The condition names and replicate numbers are automatically determined from the column headers. Note that the sequences in sTCW_demoTra are prefixed with "tra_" and numbered sequentially; this is because the 'Use Sequence Names from File' was not selected.

SeqID           Root1 Root2 Root3 Root4 Root5 Tip1 Zone1
tZoR_000023     378   1002  1649  826   1195  726  151
tZoR_000117     101   206   151   109   185   129  58
tZoR_000246     1859  2506  1334  1541  2307  3012 976
tZoR_000335     529   919   1103  810   1427  2338 1438
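
Before loading, it can be useful to check that a count file has the layout described above. The following Python sketch reads a space-delimited table and verifies that every row has one count per column heading. The function name is hypothetical, and the sketch assumes integer counts; it is not the parser that TCW itself uses.

```python
import io


def read_count_table(handle):
    """Parse a whitespace-delimited count table.

    Returns (column_headings, {sequence_name: [counts]}).
    ASSUMPTION: counts are integers; adapt if your counts are decimal.
    """
    lines = [line.split() for line in handle if line.strip()]
    if not lines:
        raise ValueError("count file is empty")
    header, rows = lines[0], lines[1:]
    columns = header[1:]  # the first word is the "SeqID" label
    counts = {}
    for fields in rows:
        name, values = fields[0], fields[1:]
        if len(values) != len(columns):
            raise ValueError(
                f"{name}: expected {len(columns)} counts, found {len(values)}"
            )
        counts[name] = [int(value) for value in values]
    return columns, counts


if __name__ == "__main__":
    sample = io.StringIO(
        "SeqID Root1 Root2 Root3 Root4 Root5 Tip1 Zone1\n"
        "tZoR_000023 378 1002 1649 826 1195 726 151\n"
        "tZoR_000117 101 206 151 109 185 129 58\n"
    )
    columns, counts = read_count_table(sample)
    print(columns)
    print(counts["tZoR_000117"])
```

A row with too few or too many values raises a ValueError that names the offending sequence, which is usually enough to locate a stray comma or a missing field.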
		

However, it is common to have separate count files for each sample. In this case, you can use the Build combined count file to generate the combined count file.

To see how this works, create a new project and add a dataset using the demoTra sequences, as shown above, but instead of adding the demo demoTra.cnt.txt, select Build combined count file. This brings up the interface shown at right (which shows the files already added).

All the files have been put in a sub-directory called "count":

Root1.cnt	Root2.cnt	Root3.cnt	Root4.cnt	
Root5.cnt	Tip1.cnt	Zone1.cnt
				
Note that each file name starts with the condition name followed by the replicate number. Since the files are all in one directory, and the filename up to the first "." is the "library name + replicate number", we can use the bulk load. Select Add Directory of Files, which brings up a file chooser window; select the directory containing the files (e.g. "count"), and all files from the selected directory are loaded, as shown on the right.

If your files are not in one directory, or not named correctly, you will need to add them individually using the Add Rep, as shown on the right. The Rep Name must be the (abbreviated) condition name followed by the replicate number.

After adding all files, click Generate File, and the panel closes. The Add/Edit panel returns, and now the Count File is filled in with a file named "Combined_read_counts.csv".
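
For readers who prefer to build the combined table outside the interface, the idea can be sketched in Python. Everything here is an assumption rather than a description of what Generate File does: the function names are hypothetical, each per-replicate file is assumed to hold two whitespace-delimited columns (sequence name and count), and a sequence missing from a replicate is given a count of 0.

```python
from pathlib import Path


def rep_name(path):
    """Replicate name = file name up to the first '.', e.g. Root1.cnt -> Root1."""
    return Path(path).name.split(".", 1)[0]


def combine_counts(paths):
    """Merge per-replicate files into (column names, rows).

    ASSUMPTION: each input file has two whitespace-delimited columns,
    sequence name and count. The real per-replicate format may differ.
    """
    columns = []
    table = {}  # sequence name -> {replicate name: count}
    for path in sorted(paths, key=rep_name):
        rep = rep_name(path)
        columns.append(rep)
        for line in Path(path).read_text().splitlines():
            if not line.strip():
                continue
            name, count = line.split()[:2]
            table.setdefault(name, {})[rep] = count
    # ASSUMPTION: a sequence absent from a replicate file has a count of 0.
    rows = [
        [name] + [table[name].get(rep, "0") for rep in columns]
        for name in sorted(table)
    ]
    return columns, rows


def write_combined(paths, out_path):
    """Write a space-delimited count table with a 'SeqID' header line."""
    columns, rows = combine_counts(paths)
    with open(out_path, "w") as out:
        out.write(" ".join(["SeqID"] + columns) + "\n")
        for row in rows:
            out.write(" ".join(row) + "\n")
```

The output has the same shape as the example count file shown earlier, so the replicate columns can still be grouped with Define Replicates.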

As already described, the replicates are grouped by clicking the Define Replicates button on the main window.

Instantiation (with optional assembly)

If the input is already assembled transcripts, or protein sequences, or gene sequences, check Skip Assembly on the project interface.

If you want the original sequence names to be used by TCW, check Use Sequence Names from File; otherwise, TCW will rename the sequences using a simple naming scheme. Note that, if the original sequence names are used, characters other than letters, numbers, or underscores will be replaced by underscores. Note also that the names supplied by sequencers are often longer than necessary, in which case, it is better to have the TCW assign names, which will use the library name followed by consecutive numbers, hence, retaining their order information.
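
To preview how original names would look after the character replacement described above, the rule can be sketched in Python. The function names are hypothetical; the sketch assumes "letters" means ASCII letters and that the name is the first word after ">", neither of which is confirmed here.

```python
import re


def sanitize_name(name):
    """Replace every character that is not a letter, digit, or underscore with '_'."""
    # ASSUMPTION: "letters" means ASCII A-Z and a-z.
    return re.sub(r"[^A-Za-z0-9_]", "_", name)


def fasta_names(lines):
    """Yield sanitized names taken from the '>' lines of a FASTA file."""
    for line in lines:
        if line.startswith(">"):
            words = line[1:].split()
            if words:
                # ASSUMPTION: the name is the first word after '>'.
                yield sanitize_name(words[0])


if __name__ == "__main__":
    example = [">TRINITY_DN100|c0.g1 len=512", "ACGTACGT", ">seq_2", "GGCC"]
    for name in fasta_names(example):
        print(name)
```

Running a check like this on your FASTA file also shows whether two different original names would collapse to the same sanitized name.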

If the datasets are ESTs, or multiple transcript datasets that you want to assemble together, do not select Skip Assembly. You can tune the assembly with Options, but it typically is not necessary.

In either case, you must press Exec Instantiate to either assemble or finalize the sequences in the database.

Annotation

Annotation should be defined through the Manager, as in step 3 above. Three types of computation are performed:
  1. Basic annotation: GC content and ORFs
  2. Functional annotation: Compares one or more protein and/or nucleotide databases ("annoDBs") to each sequence in the database.
  3. Similar sequences: Compute similar sequences, which is particularly useful for analyzing the results of assembled transcripts.

AnnoDBs and UniProt

The term "AnnoDB" refers to a fasta file of nucleotide or protein sequences, where the TCW sequences (transcript or protein) will be blasted against the annoDBs for functional annotation. Read Annotation Setup for obtaining UniProt and other annotation databases.

Import AnnoDBs provides a way to enter all UniProt databases at once. If you have used runAS for the annotation setup, make sure the last step was TCW.anno, which creates a file of UniProt databases. Select Import AnnoDBs, which pops up a file chooser, then select the projects/TCW.anno.<date>; all your UniProts will be added at once with appropriate taxonomy values. Any additional databases (e.g. Genbank nr) need to be added one by one.

To add an annotation database, press the Add button next to the AnnoDB table. This brings up the panel shown on the right.

Taxonomy: this does not need to be unique, but the DB type must be unique. The DB type is created as follows:

  • Type: defined in the ">" of the annoDB fasta file of sequences, e.g. "sp" for SwissProt.
  • Taxonomy: first 3 letters of the taxonomy, e.g. "pla" for plants.
  • DB type: the type+taxonomy, e.g. if the type is "sp" and the taxonomy is "plant", the TCW "DB type" will be "SPpla".
DB type is shown in various tables of viewSingleTCW to indicate the origin of the hit.
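
The DB type rule can be sketched in Python to check for clashes before adding many annoDBs. The function names are hypothetical, and the capitalization (upper-case type, lower-case taxonomy prefix) is inferred from the single "SPpla" example, so treat it as an assumption.

```python
from collections import Counter


def db_type(seq_type, taxonomy):
    """Build a DB type from the fasta type and the taxonomy.

    ASSUMPTION: the type is upper-cased and the first three letters of the
    taxonomy are lower-cased, as in the example "sp" + "plant" -> "SPpla".
    """
    return seq_type.upper() + taxonomy[:3].lower()


def duplicate_db_types(annodbs):
    """Return the DB types produced more than once by (type, taxonomy) pairs."""
    tally = Counter(db_type(seq_type, taxonomy) for seq_type, taxonomy in annodbs)
    return sorted(name for name, count in tally.items() if count > 1)


if __name__ == "__main__":
    print(db_type("sp", "plant"))
    print(duplicate_db_types([("sp", "plant"), ("sp", "plants"), ("tr", "plant")]))
```

Two annoDBs whose taxonomies share the same first three letters and the same type would produce the same DB type, which is the case the uniqueness requirement rules out.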
Generate Hit Tabular File:
Search Program: The drop-down will list TCW Select if you have listed additional search programs (e.g. diamond) in the HOSTS.cfg file. The drop-down will also contain blast and any other search programs.
  • If the drop-down is set at TCW Select, then TCW will select which search program to use, as described in selecting a search program.
  • If you select a program, then you can also alter the parameters.
  • If only blast is shown, then you have not identified any additional search programs in HOSTS.cfg.
You can supply your own blast results file (must be in tabular format). You must still provide the name of the annoDB fasta file, because TCW extracts the description and species from it.

The Options button below the AnnoDBs table provides additional options for:
  1. Define the GO database for GO, KEGG, Pfam, and EC annotation.
  2. ORF finder parameters.
  3. Options for self-comparison of transcript sequences ("similar sequences").
The last two options are not available for TCW databases created from protein sequences.

GO Database (GO, KEGG, EC, InterPro, Pfam annotations)

See Annotation Setup for creating the GO database. Once the GO database is created, it can be selected, as shown on the right. Once selected, it will display the available GO Slim categories in the drop-down below it; alternatively, an OBO-formatted file may be entered.

ORF finder options

See the ORF document.

Similar Sequences

The annotator can also (optionally) compare all sequences and determine the top N pairs, where their alignments can be viewed in viewSingleTCW. This can be helpful for assessing the stringency of an assembly, by noting how similar sequences are.

Adding to annotation

Additional annotation can be added at a later time; see Update and Redo annotation in the Annotation Guide.

Adding remarks and locations

...                      Select to enter a file name.
Add Locations from File  Reads the file, removes any existing location information, and adds the location information to each sequence.
Add Remark from File     Reads the file and adds the remarks to any existing remark for the sequence.
Remove Remark            Removes all remarks.

Locations

The file is a set of rows where the first word of each row is the sequence ID and the rest of the row is the format:
		group:start-end(strand), e.g. SC_1:392-496(-)
  • The group would be the supercontig, scaffold, chromosome, linkage, etc. If TCW can extract numbers/X/Y from the end of the group (e.g. chr1, chrX), it adds a column containing just the "Group" number that can be sorted numerically in viewSingleTCW.
  • The sequence ID must match a sequence ID in the database, so you probably will want to "Instantiate" the sequences using "Use Sequence Names from File".
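
A location file can be checked before loading with a small Python sketch of the row format above. The function name and the returned field names are hypothetical, the strand is assumed to be "+" or "-", and the trailing number/X/Y extraction is only an approximation of what TCW does with the group name.

```python
import re

# group:start-end(strand), e.g. SC_1:392-496(-)
_LOCATION = re.compile(
    r"^(?P<group>.+):(?P<start>\d+)-(?P<end>\d+)\((?P<strand>[+-])\)$"
)
# Trailing digits, X, or Y at the end of the group name (e.g. chr1, chrX).
_GROUP_SUFFIX = re.compile(r"(\d+|X|Y)$")


def parse_location_row(row):
    """Parse 'seqID group:start-end(strand)' into a dictionary."""
    seq_id, location = row.split(None, 1)  # ValueError if there is only one word
    match = _LOCATION.match(location.strip())
    if match is None:
        raise ValueError(f"bad location for {seq_id}: {location.strip()}")
    suffix = _GROUP_SUFFIX.search(match.group("group"))
    return {
        "seq_id": seq_id,
        "group": match.group("group"),
        "start": int(match.group("start")),
        "end": int(match.group("end")),
        "strand": match.group("strand"),
        "group_number": suffix.group(1) if suffix else None,
    }


if __name__ == "__main__":
    print(parse_location_row("tra_000001 SC_1:392-496(-)"))
```

The sequence ID "tra_000001" in the usage line is made up for illustration; use the IDs that are actually in your database.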

A script is available, scripts/extractCodingLoc.pl, to generate the transcript sequence file from a genome sequence and GFF3 file. It also generates the location file in the format needed by TCW. It only works with a subset of the GFF3 files, so probably needs to be modified for your GFF3 or GTF file. It uses BioPerl.

The group name, start, end and strand are columns in the TCW database that can be viewed in viewSingleTCW by selecting the "Columns" tab on the left, then checking the columns under "General".

Remarks

The file is a set of rows where the first word of each row is the sequence ID and the rest of the row is the Remark.

  • Single and double quotes will be changed to spaces.
  • Semi-colon will be changed to a colon.
  • Do not use the following remarks, as they are added during annotation:
        Multi-frame; ORF contains Hit; Hit contains ORF; ORF overlaps Hit; !LG
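
If you generate the remarks file with a script, the substitutions listed above can be applied in advance so the file matches what will be stored. This Python sketch uses hypothetical function names, and it treats a remark as reserved only when it exactly equals one of the listed strings, which is an assumption.

```python
# Remarks that are added during annotation and should not be supplied by the user.
RESERVED_REMARKS = (
    "Multi-frame",
    "ORF contains Hit",
    "Hit contains ORF",
    "ORF overlaps Hit",
    "!LG",
)


def clean_remark(text):
    """Apply the documented substitutions: quotes -> spaces, semicolon -> colon."""
    return text.replace("'", " ").replace('"', " ").replace(";", ":")


def is_reserved(text):
    """True if the remark equals one of the reserved annotation remarks."""
    # ASSUMPTION: only an exact match (ignoring surrounding spaces) is a clash.
    return text.strip() in RESERVED_REMARKS


def remark_row(seq_id, remark):
    """Format one row of a remarks file: sequence ID, then the remark."""
    if is_reserved(remark):
        raise ValueError(f"reserved remark for {seq_id}: {remark}")
    return f"{seq_id} {clean_remark(remark)}"


if __name__ == "__main__":
    print(remark_row("tra_000001", 'tissue=root; note="high"'))
```

The "keyword=value" style shown in the usage line is left untouched by the substitutions, so it remains searchable by keyword.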

In viewSingleTCW, the Remark can be viewed by selecting the "Columns" tab on the left, then checking the "Remark" column under "General". The remark can also be searched on in the "Basic Queries Sequence"; this is a great way to add additional information about your sequences. If you make the remarks "keyword=value", you can then search on the keyword to get a specific group of sequences.

Troubleshooting

If errors occur, a message is written to the terminal and the Java stack trace is written to the file sTCW.error.log. If the terminal message is not sufficient to indicate how to fix the problem, this file can be sent to us so we can help you troubleshoot. NOTE: errors are appended to this file, so if an error keeps recurring without getting fixed, the file can get quite large.

If the information on the runSingleTCW panel does not look right, remove /libraries/<project>/LIB.cfg to start over. You can try fixing the problem by editing this file, but it must be formatted correctly. If that does not work, email tcw@agcol.arizona.edu and we will guide you on how to enter the information.

If you have many sequences (e.g. transcripts) in the database and/or many annoDBs, this can take a lot of memory. Running Exec Annotate Sequences sets the memory to 4096, which may not be enough; in this case, once runSingleTCW is ready to annotate your database, exit and run from the command line:

	./execAnno <project>
You can increase the memory size in the execAnno script.

Differential Expression

TCW integrates several different R packages for computing differential expression of transcripts, including edgeR and DESeq2; an R script containing the R commands to compute DE can also be supplied. GOseq is supported for computing DE enrichment of GO categories. The DE computations are pairwise, i.e. conditions are compared two at a time. Each comparison results in a column added to the database, which may be viewed and queried. The DE modules are accessed either through ./runDE on the command line, or the 'Launch DE' button at the bottom of the Manager interface (Fig. 1).

For full details on the DE modules, see the Differential Expression Guide.
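
To get a feel for how many pairwise comparisons a set of conditions can produce, the candidate pairs can be enumerated in Python. This sketch only lists possible pairs; it makes no claim about which pairs runDE computes, and the function name is hypothetical.

```python
from itertools import combinations


def pairwise_comparisons(conditions):
    """Return every unordered pair of conditions, in input order."""
    return list(combinations(conditions, 2))


if __name__ == "__main__":
    # The three demoTra conditions give three candidate comparisons.
    for first, second in pairwise_comparisons(["Root", "Tip", "Zone"]):
        print(first, second)
```

With n conditions there are n(n-1)/2 such pairs, so the number of potential DE columns grows quickly as conditions are added.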

IMPORTANT DETAILS

Directory structure and configuration files

Users do not ordinarily need to look at the underlying directories and files used by TCW, however they are described here since it may be helpful at times.

When a project is added with Add Project, runSingleTCW creates one directory each under /libraries and /projects with the user supplied name (referred to here as <project>). The libraries/<project> directory has a LIB.cfg file where runSingleTCW saves all the library information and the projects/<project> directory has a sTCW.cfg file with all the assembly and annotation information.

Though you can put your data files anywhere that runSingleTCW can access, you may want to put your data files in the libraries/<project> directory in order to keep everything in one place. Also, the /DBfasta directory contains a subset of the UniProt files for the demo; this is a good location to put all annoDB files.

Both execAssm and execAnno write to the projects/<project> directory: execAssm writes mainly log files, while execAnno puts all blast results in the uniblast subdirectory. Blast files are NOT removed as they may be reused. Both programs write log files to the logs subdirectory.

Batch Processing

As mentioned above, runSingleTCW creates /libraries/<project>/LIB.cfg and projects/<project>/sTCW.cfg. Once created, you can run the three steps from the command line instead of through the interface.
Action              Executable   Configuration
Load Data           execLoadLib  LIB.cfg
Instantiate         execAssm     sTCW.cfg
Annotate Sequences  execAnno     sTCW.cfg

The LIB.cfg and sTCW.cfg files can be edited with a text editor, but be careful, as runSingleTCW and the executables expect the syntax to be exactly as runSingleTCW writes it.
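
The three batch steps can be chained from a small Python wrapper. This is a sketch under assumptions: only ./execAnno <project> is shown in this guide, so passing the project name as the single argument to execLoadLib and execAssm is assumed, and the function names are hypothetical.

```python
import subprocess

# (action, executable) in the order the steps must run.
STEPS = [
    ("Load Data", "execLoadLib"),
    ("Instantiate", "execAssm"),
    ("Annotate Sequences", "execAnno"),
]


def build_commands(project):
    """Return one command per step.

    ASSUMPTION: every executable takes the project name as its only
    argument, as shown for ./execAnno <project> in this guide.
    """
    return [[f"./{executable}", project] for _, executable in STEPS]


def run_pipeline(project, tcw_dir="."):
    """Run the steps in order from the TCW directory; stop at the first failure."""
    for (action, _), command in zip(STEPS, build_commands(project)):
        print(f"== {action}: {' '.join(command)}")
        # stdin is inherited, so any y/n prompt can still be answered.
        subprocess.run(command, cwd=tcw_dir, check=True)


if __name__ == "__main__":
    for command in build_commands("demoTra"):
        print(" ".join(command))
```

Because check=True raises an exception on a non-zero exit status, a failed load stops the run before instantiation or annotation is attempted.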

VIEW/QUERY (viewSingleTCW)

Once a project is created, it is viewed and queried using viewSingleTCW. This program is launched either:
  1. From the command line (./viewSingleTCW), which brings up a panel of MySQL databases with the sTCW_ prefix, where databases can be selected to view.
  2. From the command line using the singleTCW ID or database name as a parameter (e.g. ./viewSingleTCW tra).
  3. Through the button at the bottom of the Manager interface (Fig. 1).
The TCW Tour shows the various displays.

Running as an applet

The following can be used to display viewSingleTCW on the web with a given TCW database:
		<HTML>
		<HEAD>
		<TITLE>viewSingleTCW</TITLE>
		</HEAD>
		<BODY>
		Running viewSingleTCW  (please wait)
		<applet
		    CODE="jpave.viewer.STCWApplet"
		    ARCHIVE="your_location_of_the_jar/stcw.jar,jars/mysql-connector-java-5.0.5-bin.jar"
		    width=0 height=0
		    MAYSCRIPT>
		  <param name="ASSEMBLY_DB" value="sTCW_your_db_name">
		  <param name="DB_URL"      value="www.your.hostname">
		  <param name="DB_USER"     value="your username (read-only)">
		  <param name="DB_PASS"     value="your password">
		  <param name="ASSEMBLY_ID" value="your singleTCW ID">
		  <param name="DESCRIPTION" value="">
		  <hr>
		  Unable to display the viewSingleTCW applet. Please verify your Java installation.
		  <hr>
		</applet>
		</BODY>
		</HTML>
		
Important Notes:
  • Before using it on the web, run it as a desktop application as it may need to update the Overview page.
    That is, every time the database is changed, the Overview needs to be updated, which is done when you execute viewSingleTCW or the Overview option from any of the "run" executables. If it is not updated before running from the web, the web applet will not work because it cannot write to the database for overview update.
  • Also, you may need to use a port other than the default MySQL port; see MySQL port access. Also, users may need to increase their allowed memory for Java applets, see Out of memory.

Email Comments To: tcw@agcol.arizona.edu