To prepare for annotation with
runSingleTCW, it is necessary
to download the databases to compare against. TCW provides support
for downloading the taxonomic and full UniProts
and mapping from the UniProt IDs to GO, KEGG, Pfam, EC, and InterPro.
runAS has been tested on Mac OSX 10.9.5 and Linux.
I would really appreciate you letting me know if you have any problems.
Email cari at agcol.arizona.edu or tcw at agcol.arizona.edu.
The setup uses
curl for downloading and
mySQL for the GO database.
For mySQL, the command mysqladmin is used, so
you may need to define its path, e.g. on Mac,
alias mysqladmin '/usr/local/mysql/bin/mysqladmin' #tcsh
alias mysqladmin='/usr/local/mysql/bin/mysqladmin' #bash
Note -- if you do not have
runAS will still work, though may not be
Terminology: the term "AnnoDB" refers to any database that will be used for annotation,
i.e. the sequences in TCW will be searched against all AnnoDB databases and the hits stored
in the TCW database for query. For a TCW created with nucleotide sequences, an annoDB may be nucleotide or protein. For
a TCW created with protein sequences, an annoDB can only be protein. TCW can use the super-fast
usearch programs, see fast searching.
If multiple species are to be compared using
multiTCW, it is important they all
use the same set of AnnoDBs and GO database.
Processing steps of
- Create the directory for the downloads and generated FASTA files.
The default location is the TCW sub-directory projects/DBfasta, but it can be put elsewhere.
- Download Taxonomic UniProts ".dat" files and create FASTA files.
- Download full UniProt, remove entries from taxonomic databases, and create FASTA files.
- Create GO database, which contains mappings from UniProt IDs to GO, KEGG, EC, Pfam and InterPro.
- Download go_<date>-termdb-tables.tar.gz.
- Create a local mySQL GO database with the information from this tar file.
- Add information to the local GO database from the .fasta and .dat files in the UniProt directory.
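The "remove entries from taxonomic databases" step can be pictured with a minimal sketch. It is not the runAS implementation: it assumes the inputs are FASTA lines and that the first token of each ">" line identifies the entry. The function names fasta_ids and subset_full are hypothetical.

```python
def fasta_ids(lines):
    """Collect the first token of every '>' line (assumed to identify the entry)."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def subset_full(full_lines, taxonomic_ids):
    """Yield only the records of the full FASTA whose ID is not in taxonomic_ids."""
    keep = False
    for line in full_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            keep = bool(tokens) and tokens[0] not in taxonomic_ids
        if keep:
            yield line


# Made-up records, for illustration only.
taxonomic = [">sp|P1|A_X first protein OS=Species x", "MKV",
             ">sp|P2|B_X second protein OS=Species x", "MAA"]
full = taxonomic + [">sp|P3|C_Y third protein OS=Species y", "MTT"]
subset = list(subset_full(full, fasta_ids(taxonomic)))
```

Here subset keeps only the P3 record, since P1 and P2 already appear in the taxonomic file.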
Other databases can be used for annotation, as discussed in using other databases,
but they will not have GO, KEGG, EC, Pfam, or InterPro support.
Note: I have not downloaded the Full TrEMBL UniProt in years; I find it sufficient to
download the SwissProt of all relevant taxonomies and the Full SwissProt database, as these have the best annotation;
then I include the TrEMBL of the most relevant taxonomies.
Typically, all you need to do is make sure you have an internet connection open
and that you have enough disk space (see Memory).
The TrEMBL Bacteria and Full databases are
very large (41GB and 61GB as of
Nov 2017), so do not download these unless you really need them. Also, they take a lot of memory
- The "TCW Annotation Directories" define where the files will be put (see Details below);
TCW provides defaults as shown on the right.
- Select the taxonomic databases you want to use, then select Build Tax,
which downloads the respective files and creates FASTA files.
- Select the full databases you want to use, then select Build Full,
which downloads the respective files, creates a subset of the full by removing
all entries found in the downloaded taxonomic databases, and creates a FASTA file.
- Select Build GO, which downloads the GO database,
creates a local mySQL GO database with a mapping of the UniProts from your downloaded
set. This uses the information in HOSTS.cfg.
Example log file.
- Select TCW.anno, which writes a file projects/TCW.anno.<date>.
- Run ./runSingleTCW.
- Select Import Anno; a file chooser will pop up, select
- Select Options in the annotation section, and select the GO database.
- Now you are ready to Exec Annotate Sequences with the UniProt and GO you
Check: Selecting this button highlights everything that has been done.
For example, the figure on the upper right shows that the directory
UniProt_Sep2018 has been created and only Archaea SwissProt has been downloaded and processed.
The Check automatically runs on startup.
The download uses curl.
You can get this on most Linux machines with 'sudo yum install curl'.
If you cannot install it,
runAS will prompt you as shown on
the right; if you select Continue, it will perform the download with its own Java code,
though it may take longer and is not as robust, i.e. could have potential problems due to network latency, etc.
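Before starting runAS, it can be useful to confirm that curl and mysqladmin are reachable. The following is a minimal sketch using the Python standard library. It is not part of TCW, and the helper names find_tools and missing_tools are hypothetical.

```python
import shutil


def find_tools(names=("curl", "mysqladmin")):
    """Map each tool name to its full path, or None if it is not on the PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(found):
    """Return the sorted names of the tools that could not be located."""
    return sorted(name for name, path in found.items() if path is None)


found = find_tools()
for name in missing_tools(found):
    print(name + " was not found on the PATH")
```

Note that shutil.which only searches the PATH, so a shell alias for mysqladmin will not be detected; in that case, add its directory to the PATH instead.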
The rest of this section provides details:
The default directory for annoDBs is projects/DBfasta.
UniProt directory is where the UniProt files will be downloaded. If the
directory does not exist, it will be created. As shown in the image above, the default
name is UniProt_<date>, though it does not have to be this name.
GO directory is where the GO file will be downloaded. If the directory
does not exist, it will be created. As shown in the image above, the default
name is go_tmp<date>, though it does not have to be this name.
Swiss and TrEMBL headings indicate that selecting
a check box under Swiss will download the SwissProt database for the corresponding
taxonomic database (label to the right). Similarly, selecting the check box
under TrEMBL will download the TrEMBL taxonomic database.
Full UniProt will download the full SwissProt or TrEMBL database.
It expects that you will have downloaded at least one taxonomic database, i.e. the one
that corresponds with your species.
GO (Gene Ontology): a tar file containing the schema and data is downloaded
The most current file is go_daily-termdb-tables.tar.gz
Database: This is the name of the GO database that will be created; you will enter
this name in
runAS will not replace an existing downloaded file. If you select Continue
on the prompt on the right, it will skip the download but perform the rest of the processing. This
is necessary if you have (1) run Build Full to create the subset, or (2)
run Build GO to create the GO database, but then download another Taxonomic UniProt;
these two steps need to be re-run.
For each of the three download steps, there will be an initial prompt to ensure that you meant to
select the download.
runAS does not remove files that are no longer necessary, which are
the files downloaded from the internet:
- All "dat.gz" files in the UniProt directories, as the information has
been transferred to the FASTA files and GO database.
- The GO directory, as the information has been transferred to the GO database.
These files can be removed, as annotation uses only the FASTA files in the UniProt directories and the GO mySQL database. However, if you
do not have a space problem, keep them just for insurance.
When UniProt does the monthly update, your downloaded files will no longer be available
on their site.
For the FASTA files that you will be using diamond to search against, you
can gzip them, as diamond can search against gzipped files.
As of 26-Nov-17, the table shows the sizes of the UniProt .dat files that will be downloaded.
To view the latest sizes, go to
Sizes (rounded) in Megabytes for the .dat.gz file
In addition to the .dat.gz files, you also need space for the .fasta file.
Here are a few numbers from Aug 2018:
The FASTA files are not written in gzipped format.
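Since both the .dat.gz download and the uncompressed .fasta file need room, a quick free-space check before a large download can save time. The sketch below is only an illustration. The margin factor of 2.0 is an assumed safety factor, not a figure from TCW, and both function names are hypothetical.

```python
import shutil


def enough_space(free_bytes, download_gb, margin=2.0):
    """True if free_bytes covers the download size times an assumed safety margin."""
    return free_bytes >= download_gb * margin * 1e9


def check_download_dir(path, download_gb):
    """Return (ok, free_gb) for the file system that holds path."""
    free = shutil.disk_usage(path).free
    return enough_space(free, download_gb), free / 1e9


# 61 GB is the size quoted above for the Full TrEMBL download (Nov 2017).
ok, free_gb = check_download_dir(".", 61)
```

Pass the UniProt directory (by default under projects/DBfasta) as the path once it exists.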
- Only download what is relevant!
For example, download both SwissProt and TrEMBL taxonomic databases
for your species of interest, plus any related taxonomies, then only download the full SwissProt.
- Avoid the TrEMBL Bacteria and Full UniProt unless you really need them.
- Do not perform constant downloads; it is a drain on the UniProt servers.
The databases do not
change that fast, and a new download changes 'best' hits in TCW, which can disturb any on-going analysis.
Times: The following shows the times for downloading 4 SwissProt and 2 TrEMBL files using a ~214 Mbps download connection (Aug 2018). (I don't
know why the trembl_invertebrates took so much longer than the trembl_plants when it's not that much bigger; probably contention on the network.)
Download SwissProt files
curl complete ./projects/DBfasta/UniProt_Aug2018/sp_bacteria/uniprot_sprot_bacteria.dat.gz 6m:50s
curl complete ./projects/DBfasta/UniProt_Aug2018/sp_fungi/uniprot_sprot_fungi.dat.gz 2m:35s
curl complete ./projects/DBfasta/UniProt_Aug2018/sp_invertebrates/uniprot_sprot_invertebrates.dat.gz 2m:15s
curl complete ./projects/DBfasta/UniProt_Aug2018/sp_plants/uniprot_sprot_plants.dat.gz 2m:48s
Download TrEMBL files
curl complete ./projects/DBfasta/UniProt_Aug2018/tr_invertebrates/uniprot_trembl_invertebrates.dat.gz 3h:3m:15s
curl complete ./projects/DBfasta/UniProt_Aug2018/tr_plants/uniprot_trembl_plants.dat.gz 22m:56s
Create FASTA files
Make FASTA from ./projects/DBfasta/UniProt_Aug2018/sp_bacteria/uniprot_sprot_bacteria.dat.gz
333,576 written to uniprot_sprot_bacteria.fasta 0m:25s
Make FASTA from ./projects/DBfasta/UniProt_Aug2018/sp_fungi/uniprot_sprot_fungi.dat.gz
33,564 written to uniprot_sprot_fungi.fasta 0m:04s 3Mb
Make FASTA from ./projects/DBfasta/UniProt_Aug2018/sp_invertebrates/uniprot_sprot_invertebrates.dat.gz
27,252 written to uniprot_sprot_invertebrates.fasta 0m:02s 3Mb
Make FASTA from ./projects/DBfasta/UniProt_Aug2018/sp_plants/uniprot_sprot_plants.dat.gz
41,977 written to uniprot_sprot_plants.fasta 0m:04s 3Mb
Make FASTA from ./projects/DBfasta/UniProt_Aug2018/tr_invertebrates/uniprot_trembl_invertebrates.dat.gz
8,862,962 written to uniprot_trembl_invertebrates.fasta 8m:47s 3Mb
Make FASTA from ./projects/DBfasta/UniProt_Aug2018/tr_plants/uniprot_trembl_plants.dat.gz
7,308,873 written to uniprot_trembl_plants.fasta 6m:50s 3Mb
Complete Taxonomic UniProt 3h:56m:58s
Time for creating GO database: The following shows the times for building the GO database for 4 SwissProt, Full SwissProt, and 2 TrEMBL.
Create a table of levels
988,921 tree entries for biological_process
196,066 tree entries for cellular_component
23,744 tree entries for molecular_function
Add GO level numbers to term
Complete GO database modifications 2h:57m:27s 4Mb
Complete creating GO database go_Aug2018 4h:16m:41s
NOTE: It takes much longer if you have many large TrEMBL taxonomic databases downloaded; e.g. with 5 TrEMBL including tr_bacteria, it took 16 hours to build the GO database, so I recommend starting this in the evening and letting it run overnight.
In order to add GO annotations, a local GO mySQL 'demo' database needs to be created.
- From the TCW_2 directory, execute:
The "-d" will cause it to enter the demo parameters, as shown on the right.
The highlighted entries already exist. It is only necessary to build the GO database,
which takes about 10 minutes.
- Execute Build GO.
Enter "Confirm" on the first prompt to continue.
The GO tables are available for the demo, i.e. they will not be downloaded. This is evident from the purple GO label.
Building the GO database takes anywhere from 10 to 60 minutes.
Details about the Demo setup
In the projects/DBfasta directory, there are the sub-directories UniProt_demo and GO_tmpdemo.
UniProt_demo contains the following:
sp_bacteria/ sp_fungi/ sp_plants/ tr_plants/
sp_fullSubset/ sp_invertebrates/ tr_invertebrates/
Each taxonomic directory has a .dat and a .fasta file,
which are very small subsets of the original UniProt taxonomic .dat file.
See Overview for more details.
Why use taxonomic databases instead of the full UniProt
- TCW refers to the annoDBs by the 'type' and 'taxonomy', e.g. sp_fungi has
a type of 'sp' and a taxonomy of 'fungi'. Within TCW, you can query by taxonomy.
Also, all results will show this information so you can easily see which taxonomic database a hit is from.
The following shows an example:
- You only need to annotate against the taxonomies of interest. The bacterial
subset is especially large and should be left out if not specifically desired.
- The complete UniProt has everything but bacteria. There is an option to download it, then make
a subset to search against that has everything but what is in the taxonomic databases. This will be
referred to as the "Full" database below. In the above figure, "SPful" refers to the SwissProt full subset
database and "TRful" refers to the TrEMBL full subset.
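The 'type' and 'taxonomy' convention can be illustrated with a small sketch that splits a directory name such as sp_fungi. It assumes the name always has the form type, underscore, taxonomy, as in the examples above. The function name split_annodb_name is hypothetical.

```python
def split_annodb_name(name):
    """Split a name such as 'sp_fungi' into its type ('sp') and taxonomy ('fungi')."""
    db_type, _, taxonomy = name.partition("_")
    return db_type, taxonomy


pairs = [split_annodb_name(n) for n in ("sp_fungi", "tr_plants", "sp_fullSubset")]
```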
NCBI-nr works with TCW. For other databases, you will need to make sure they have a TCW-accepted
descriptor line.
Using other databases for annotation
The description line is the ">" line that describes the subsequent sequence in a FASTA file.
From it, execAnno extracts:
- DB type: used in naming the blast output file and in
viewSingleTCW to aid
in identifying where the hitID is from.
- hitID: the unique identifier of the hit.
- description: generally the functional description
- species: the species
>sp|Q9V2L2|1A1D_PYRAB Putative 1-ami OS=Pyrococcus abyssi GN=PYRAB00630 PE=3 SV=1v
- For TrEMBL, the first two characters would be 'tr'. The 'sp' or 'tr' is the DB type.
- The third entry of the first string is the identifier (e.g. 1A1D_PYRAB).
- The string up to the OS is the description.
- The string after the "OS=" is the species.
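The UniProt rules above can be sketched as a short parser. This is an illustration only, not the execAnno code. It assumes that the species ends at the next two-letter "XX=" field (such as GN=), which the rules above do not state. The function name parse_uniprot_header is hypothetical.

```python
import re


def parse_uniprot_header(line):
    """Split a UniProt FASTA description line into DB type, hitID, description, species."""
    first, _, rest = line.lstrip(">").strip().partition(" ")
    # The first string has three '|' separated entries: type, accession, identifier.
    db_type, accession, hit_id = first.split("|")
    description, _, after_os = rest.partition(" OS=")
    # Assumption: the species stops at the next field of the form " XX=".
    species = re.split(r"\s[A-Z]{2}=", after_os, maxsplit=1)[0]
    return {"db_type": db_type, "hit_id": hit_id,
            "description": description, "species": species}


example = ">sp|Q9V2L2|1A1D_PYRAB Putative 1-ami OS=Pyrococcus abyssi GN=PYRAB00630 PE=3 SV=1v"
fields = parse_uniprot_header(example)
```

A line whose first string does not have exactly three "|" separated entries raises a ValueError in this sketch.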
The NCBI-nr database can be downloaded from
Zipped, this is 32Gb as of Nov 2017. The Diamond program can search this database quickly;
however, there will be no GO, KEGG, EC, or Pfam information associated with these hits.
In the summer of 2016, NCBI changed the format of the subject line. TCW can parse the old or new format,
though the recent NCBI-nr file (Oct 2016) has some badly formed entries, which are ignored. The descriptor line
may have multiple entries separated by Ctrl+A; only the first entry is used.
>XP_642837.1 hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4]
- The first entry is the identifier (e.g. XP_642837.1).
Note, there is no longer a way to detect the database origin within the file, hence, the DB type will be the generic 'PR' for protein.
- The text from the first space to the first "[" is the description.
- The text within the "[ ]" is the species.
>gi|66818355|ref|XP_642837.1| hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4]
- The 'gi' is the DB type.
- The fourth entry of the first string is the identifier (e.g. XP_642837.1)
- The text from the last "|" to the first "[" is the description.
- The text within the "[ ]" is the species.
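Both NCBI-nr formats, together with the Ctrl+A rule, can be sketched in one function. This is a minimal illustration of the rules listed above, not TCW's parser. It assumes the species sits in the first pair of square brackets, and the function name parse_nr_header is hypothetical.

```python
def parse_nr_header(line):
    """Parse an NCBI-nr description line in either the new or the old 'gi' format."""
    # Only the first of several Ctrl+A (\x01) separated entries is used.
    entry = line.lstrip(">").strip().split("\x01")[0]
    first, _, rest = entry.partition(" ")
    if first.startswith("gi|"):
        # Old format: the fourth '|' separated entry is the identifier.
        db_type, hit_id = "gi", first.split("|")[3]
    else:
        # New format: the origin cannot be detected, so the generic 'PR' is used.
        db_type, hit_id = "PR", first
    description, _, tail = rest.partition("[")
    species = tail.partition("]")[0]
    return {"db_type": db_type, "hit_id": hit_id,
            "description": description.strip(), "species": species.strip()}


new_style = parse_nr_header(
    ">XP_642837.1 hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4]")
old_style = parse_nr_header(
    ">gi|66818355|ref|XP_642837.1| hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4]")
```

Both calls give the same hitID, description, and species; only the DB type differs.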
If you have a file other than UniProt or nr, make the descriptor lines as follows:
>CC|ID description OS=species
- CC is the type code, and will be used as the DB type in TCW.
- ID is the unique identifier.
- Everything up to the OS is the description.
- Everything after the OS= is the species.
You want the CC type code + taxonomy (entered when the annoDB is added) to be unique for each annoDB. The type code + first three letters of the taxonomy are used to
name the blast output, and are also used in
viewSingleTCW to easily determine
what annoDB a hit came from.
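For a custom database, the descriptor format can be generated and the uniqueness advice checked with a small sketch. The helper names are hypothetical. Treating the pair (type code, first three letters of the taxonomy) as the key that must not collide is an assumption drawn from the naming rule above, not TCW's actual check.

```python
from collections import Counter


def make_header(type_code, hit_id, description, species):
    """Build a '>CC|ID description OS=species' descriptor line."""
    for value in (type_code, hit_id):
        if not value or any(ch in value for ch in "| \t"):
            raise ValueError("type code and ID must be non-empty, without '|' or spaces")
    return ">{}|{} {} OS={}".format(type_code, hit_id, description, species)


def find_collisions(annodbs):
    """Return the (type code, first three taxonomy letters) keys used more than once."""
    counts = Counter((code, taxonomy[:3]) for code, taxonomy in annodbs)
    return sorted(key for key, n in counts.items() if n > 1)


header = make_header("CC", "ID1", "some protein", "Some species")
clashes = find_collisions([("CC", "plants"), ("CC", "planaria"), ("sp", "plants")])
```

In this made-up example, "plants" and "planaria" under the same type code share the key ('CC', 'pla'), so one of them would need a different type code or taxonomy name.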
Entering this data into
The AnnoDBs can be entered using the "Add" button, where the taxonomy is defined.
Alternatively, use the "Import AnnoDBs" to add the databases from an existing sTCW.cfg
file or from the TCW.anno file created from selecting "TCW.anno" on
the full path of the AnnoDBs is not shown in this image.
The GO database is defined in the "Options" menu.