1. Load Data
- The input may be transcripts with optional read counts, proteins with optional counts,
or sequences to be assembled such as Sanger ESTs, 454 reads, and/or transcript libraries.
- The Add window allows you to define a dataset along with its conditions (e.g. tissue, treatment).
It also allows you to define the count file(s) for transcript or protein libraries; this
information is then listed under Associated Counts on the main window.
- If there are replicates, you may define them with the Define Replicates window.
- Skip Assembly - the sequences can be instantiated with no assembly.
- Assembly - the sequences can be assembled. If Read sequences (i.e. ESTs) are mixed with
transcripts that have read counts, the EST dataset 'counts' for the contig will be the number
of ESTs in the given contig, and the transcripts will retain their counts from the input.
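The counting rule for mixed assemblies can be sketched as follows. This is a minimal illustration, not TCW's code: the function name `contig_counts`, the tuple layout, and the dataset and condition names are all hypothetical, and summing the counts of several transcripts in one contig is an assumption.

```python
from collections import defaultdict

def contig_counts(members, input_counts):
    """Compute the counts for one contig (hypothetical helper, not TCW code).

    members: list of (seq_id, dataset, kind), where kind is "est" or "transcript".
    input_counts: {seq_id: {condition: count}} as loaded for transcripts.
    """
    counts = defaultdict(int)
    for seq_id, dataset, kind in members:
        if kind == "est":
            # Each EST contributes 1 to its own dataset's count for the contig.
            counts[dataset] += 1
        else:
            # Transcripts keep the counts they were loaded with
            # (assumption: several transcripts in one contig are summed).
            for condition, n in input_counts.get(seq_id, {}).items():
                counts[condition] += n
    return dict(counts)

# Two ESTs and one transcript with read counts ended up in the same contig.
members = [
    ("est1", "RootEST", "est"),
    ("est2", "RootEST", "est"),
    ("tr1", "RNAseq", "transcript"),
]
input_counts = {"tr1": {"leaf": 40, "root": 12}}
result = contig_counts(members, input_counts)
```

Here `result` holds a count of 2 for the EST dataset and the unchanged input counts for the transcript's two conditions.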
3. Annotate Sequences
- Use runAS to download and format the UniProt and GO databases; these
are input to the next two steps.
- Add one or more databases to search against, which can be protein or nucleotide.
- In the Options window, define the GO database.
4. Add Remarks and Locations
Remarks and Locations (i.e. chromosome, start, end, strand) can be added to sequences and queried in viewSingleTCW.
Load Data - Add/Edit: Selecting Add shows this window. Selecting Edit shows
a similar window, but only the Attribute values can be changed after the database has been created.
The counts can be in one file or generated with the Build combined count file option.
On Save, the condition names will be written in the Associated Counts table on the main window.
Build count file: Generate File will generate a file called Combined_read_count.csv,
where the columns are conditions with their respective counts.
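The layout of such a combined count file can be illustrated with a short sketch. It is not TCW's implementation: the `SeqID` header, the comma delimiter, the zero fill for missing sequences, and the function names are assumptions made for the example.

```python
import csv
import io

def combine_counts(per_condition):
    """Merge {condition: {seq_id: count}} into rows with one column per condition."""
    conditions = list(per_condition)
    seq_ids = sorted({s for counts in per_condition.values() for s in counts})
    rows = [["SeqID"] + conditions]  # header name is an assumption
    for seq_id in seq_ids:
        # Assumption: a sequence absent from a condition's file gets a count of 0.
        rows.append([seq_id] + [per_condition[c].get(seq_id, 0) for c in conditions])
    return rows

def write_counts(rows, handle):
    """Write the rows as comma-separated text to an open file-like object."""
    csv.writer(handle, lineterminator="\n").writerows(rows)

# Hypothetical counts for two conditions.
rows = combine_counts({"leaf": {"t1": 5, "t2": 3}, "root": {"t1": 7}})
buffer = io.StringIO()
write_counts(rows, buffer)
combined_text = buffer.getvalue()
```

To produce a file on disk, the same `write_counts` call can be given a handle opened with `open("Combined_read_count.csv", "w", newline="")`.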
Define Replicates (Main window): If the Associated Counts table has replicates,
they can be defined in this window, which will update the table in the Main window.
Annotate Sequences - Add/Edit: Databases to search against are referred to as "annoDBs".
Any FASTA file can be used as an annoDB, though TCW gives special support to UniProt taxonomic
databases: besides allowing taxonomy-specific querying in viewSingleTCW,
the UniProt .dat file is used to extract GO, KEGG, EC and Pfam information.
(A script is provided to download and format the UniProt databases.)
As the UniProt TrEMBL databases increase in size (tr_bacteria.dat was 76 GB as of 5/17/15), it has
become impractical to search them with BLAST (Altschul et al. 1997). Fortunately, the
Diamond and Usearch programs listed below provide much faster results
(see fast searching for timing and results).
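To make the .dat extraction concrete, the sketch below pulls GO, KEGG, EC and Pfam identifiers out of UniProt flat-file text. It relies only on the standard flat-file layout (two-letter line codes, `DR` cross-reference lines, `EC=` inside `DE` lines, `//` ending an entry). It is not TCW's parser, the function name `parse_uniprot_dat` is hypothetical, and the sample entry is made up for illustration.

```python
import re

EC_RE = re.compile(r"EC=([\w.\-]+)")

def parse_uniprot_dat(lines):
    """Yield one dict per entry with its GO, KEGG, EC and Pfam identifiers."""
    entry = None
    for line in lines:
        code, rest = line[:2], line[5:].rstrip()
        if code == "ID":
            # A new entry starts; the first token is the entry name.
            entry = {"id": rest.split()[0], "GO": [], "KEGG": [], "EC": [], "Pfam": []}
        elif entry is None:
            continue
        elif code == "DE":
            # EC numbers appear in description lines as "EC=x.x.x.x".
            entry["EC"].extend(EC_RE.findall(rest))
        elif code == "DR":
            # Cross-references look like "DR   GO; GO:0005524; F:ATP binding; ...".
            fields = [f.strip() for f in rest.rstrip(".").split(";")]
            if fields[0] in ("GO", "KEGG", "Pfam") and len(fields) > 1:
                entry[fields[0]].append(fields[1])
        elif code == "//":
            yield entry
            entry = None

# Made-up entry in UniProt flat-file layout, for illustration only.
sample = """ID   KIN_TEST                Reviewed;         351 AA.
DE   RecName: Full=Example kinase;
DE            EC=2.7.11.11;
DR   KEGG; hsa:5566; -.
DR   GO; GO:0005524; F:ATP binding; IEA:UniProtKB-KW.
DR   Pfam; PF00069; Pkinase; 1.
//
""".splitlines()

entries = list(parse_uniprot_dat(sample))
```

Because the function accepts any iterable of lines, a large .dat file could be streamed through it from an open file handle rather than read into memory at once.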
- BLAST is used for assembly, annotation and interactive searching in viewSingleTCW.
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389-3402.
- Diamond can be used for blastx and blastp annotation searches.
Buchfink B, Xie C, Huson D (2015) Fast and sensitive protein alignment using DIAMOND. Nature Methods 12: 59-60. doi:10.1038/nmeth.3176.
- Usearch can be used for blastp and modified blastx annotation searches.
Edgar RC (2010) Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26(19): 2460-2461.
- CAP3 is used for assembly.
Huang X, Madan A (1999) CAP3: A DNA sequence assembly program. Genome Res 9: 868-877.
- UniProt is recommended for protein annotation as the GO and other information can
be extracted by the TCW and added to the database.
Dimmer EC, Huntley RP, Alam-Faruque Y, Sawford T, O'Donovan C, et al. (2012) The UniProt-GO Annotation database in 2011. Nucleic Acids Res 40: D565-570.
- The Gene Ontology MySQL database is used for GO levels and descriptions.
GO Consortium (2012) The Gene Ontology: enhancements for 2011. Nucleic Acids Res 40: D559-564.