Annotation Setup and runAS

To prepare for annotation with runSingleTCW, it is necessary to download the databases to compare against. The runAS program provides support for downloading the taxonomic and full UniProts along with mappings from the UniProt IDs to GO, KEGG, Pfam, EC, and InterPro.

Tested: runAS has been tested on Linux, MacOS 10.9 and 10.15. If you have any problems, please let me know at tcw at agcol.arizona.edu. On 18-Mar-21, the runAS program was updated to use the "GO Basic OBO file" in place of the "old GO MySQL tar file".

Contents:
Overview
Demo annotation setup
Using Java graphical interface -- runAS
Details and file structure
Cleanup
Memory and time
What AnnoDBs to use
Creating AnnoDBs from other databases (e.g. NCBI-nr)
Entering AnnoDBs and GOs into runSingleTCW
Why use taxonomic databases
Parsing go-basic.obo
Links to relevant databases

Overview

Terminology:

The term "AnnoDB" refers to any database that will be used for annotation, i.e. the sequences in TCW will be searched against all AnnoDB databases and the hits stored in the single TCW database (sTCWdb) for query.

Requirements:

runAS uses curl for downloading AnnoDBs and the GO database.

You can get curl on most Linux machines with 'sudo yum install curl', and MacOS comes with it. If you cannot install it, runAS will prompt you as shown on the right; if you select Continue, it will perform the download with its own Java code, which may take longer and is less robust (e.g. it is more susceptible to problems caused by network latency).

Processing steps: The TCW runAS will perform the following:

Create the directory under projects/DBfasta for the downloads and generated FASTA files.
Download the selected Taxonomic UniProts .dat files and create FASTA files.
Download the selected full UniProt .dat file and create a FASTA file of the sequences not found in any of the downloaded taxonomic files.
Create the GO database, which contains mappings from UniProt IDs to GO, KEGG, EC, Pfam and InterPro.
1. Download go-basic.obo from http://current.geneontology.org/ontology/
2. Create a local MySQL GO database (GOdb) with the information from this file.
3. Add information to the GOdb from the .fasta and .dat files in the UniProt directory.
Create the file projects/AnnoDBs_UniProt_<date>.cfg to be imported to runSingleTCW.
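To illustrate the "create FASTA files" step above, the following is a minimal sketch of extracting FASTA records from UniProt flat-file (.dat) lines. It is not the runAS code: the function name dat_to_fasta, the sample record, and the choice to keep only the first accession and the first "Full=" name are assumptions made for illustration. The header is written in the UniProt style used elsewhere in this document.

```python
# Sketch only: a hypothetical helper, not the runAS implementation.
SAMPLE_DAT = """\
ID   TEST1_DEMO              Reviewed;          12 AA.
AC   P00001; Q00001;
DE   RecName: Full=Hypothetical demo protein;
OS   Demo organism.
SQ   SEQUENCE   12 AA;  1300 MW;  0000000000000000 CRC64;
     MKTAYIAKQR QI
//
"""


def dat_to_fasta(dat_lines, db_type="sp"):
    """Return a list of (header, sequence) pairs from UniProt .dat lines."""
    records = []
    entry = acc = desc = None
    species, seq, in_seq = [], [], False
    for raw in dat_lines:
        line = raw.rstrip("\n")
        if line.startswith("//"):  # end of one record
            if entry and acc:
                name = " ".join(species).rstrip(".")
                header = f">{db_type}|{acc}|{entry} {desc or ''} OS={name}"
                records.append((header, "".join(seq)))
            entry = acc = desc = None
            species, seq, in_seq = [], [], False
        elif in_seq:  # sequence lines follow the SQ line
            seq.append(line.replace(" ", ""))
        elif line.startswith("ID   "):
            entry = line[5:].split()[0]
        elif line.startswith("AC   ") and acc is None:
            acc = line[5:].split(";")[0].strip()  # assumption: first accession only
        elif line.startswith("DE   ") and desc is None and "Full=" in line:
            desc = line.split("Full=", 1)[1].rstrip(";")
        elif line.startswith("OS   "):
            species.append(line[5:].strip())
        elif line.startswith("SQ   "):
            in_seq = True
    return records


for header, sequence in dat_to_fasta(SAMPLE_DAT.splitlines()):
    print(header)
    print(sequence)
```

For a real download, the lines could be read from the .dat.gz file with Python's gzip module in text mode instead of from a string.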

Important:

Time and Memory: This can take a lot of time and memory, so make sure to read this section.
What AnnoDBs to use: To reduce the time and memory, make sure to read this section.
Using other databases: Other databases, such as NCBI nr, can be used for annotation, but they will not have GO, KEGG, EC, Pfam, or InterPro.

Demo annotation setup

The TCW package provides subsets of UniProt for annotating the demo. In order to add GO annotations, a local GO MySQL 'demo' database needs to be created.

From the TCW_4 directory, execute:
./runAS -d
The "-d" will cause it to enter the demo parameters, as shown on the right. The highlighted entries already exist. It is only necessary to build the GO database.
Execute Build GO.
The GO tables are available for the demo, i.e. they will not be downloaded. This is evident from the purple GO label.

Building the GO database takes approximately 5-10 minutes.

Details about the Demo setup

In the projects/DBfasta directory, there are the sub-directories UniProt_demo and GO_obodemo, which contain the following:

	GO_obodemo:
	go_basic.obo

	UniProt_demo:
	sp_bacteria/	sp_fungi/	   sp_plants/	      tr_plants/
	sp_full/		sp_invertebrates/  tr_invertebrates/

Each taxonomic directory has a .dat and a .fasta file, which are very small subsets of the original UniProt taxonomic .dat file.

Java graphical interface -- runAS

Typically, all you need to do is make sure you have an internet connection open and that you have enough disk space (see Memory), then start the interface shown on the lower left by typing at the command line: ./runAS

The TCW Annotation Directories define where the files will be put. TCW provides defaults as shown on the right; it is recommended you use the defaults.
Select the taxonomic databases you want to use, then select Build Tax, which downloads the respective .dat.gz files and creates FASTA files.
Select the full databases you want to use, then select Build Full, which downloads the respective .dat.gz file and creates a subset FASTA file that only contains the sequence NOT in the downloaded taxonomic FASTA files. See Full subsets for more detail.
Select Build GO, which downloads the GO Basic OBO file and creates a local MySQL GO database with a mapping of the UniProts from your downloaded set. This uses the information in HOSTS.cfg.
Select AnnoDB.cfg, which writes a file called projects/AnnoDBs_UniProt_<date>.cfg that contains all the information downloaded; this can be used as input to runSingleTCW (see Import AnnoDBs).
Check: The Check function automatically runs on startup and after any Build. It highlights everything that has been done. For example, the figure on the right shows that fungi and plant SwissProt have been downloaded and processed. To force a check, or to view the UniProts in an existing goDB, select this button.

A log of the processing is written to projects/DBfasta/logs/runAS.log. See the log file for an example.

Important points:

runAS will not replace an existing downloaded file: It will overwrite a .fasta file, but never a .dat file. If you want a .dat file downloaded again, you must remove it yourself.
Build GOdb only after all desired taxonomic and full databases are downloaded: It is important that you create the GO database right after downloading the UniProt files so that they correspond. It is also important that you have downloaded all desired taxonomic and full UniProt databases.
Only download what you need! See Memory and Time and What AnnoDBs to use.
runMultiTCW: If multiple sTCWdbs are to be compared using multiTCW, it is important they use the same set of AnnoDBs and GO database (see Entering AnnoDBs).

Full subsets:

When you select Build Full, a pop-up similar to the one on the right will be shown, where only the taxonomic names will be shown that correspond to downloaded taxonomic .dat files. This allows you to create different subsets. Typically, you will only want one subset, which is the one corresponding to the taxonomic files downloaded.

The FASTA file will have a suffix indicating what subset it corresponds to. For example, the selection on the right would create the file uniprot_sprot_xBFxIxPxxV.fasta, where the 10 characters represent the 10 taxonomic databases in alphabetic order, and the capital letters represent the taxonomic sequences removed (Bacteria, Fungi, Invertebrate, Plant, Virus).

Details: You may unselect all entries and it will create a FASTA file of all sequences. When runSingleTCW loads UniProt IDs, it only loads the first occurrence of a UniProt ID, so duplicates will not cause errors. However, by using the proper subset, processing is faster and the e-values are lower since there are fewer sequences in the database. You may create new subsets at any time, as it does not affect the GOdb. Only one file will be shown in the AnnoDB.cfg file.
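The 10-character subset suffix described above can be sketched as follows. This is an illustration, not the runAS code: the function names are hypothetical, and the list of 10 taxonomic names is assumed to be the UniProt taxonomic divisions in alphabetic order.

```python
# Assumption: the 10 UniProt taxonomic divisions, in alphabetic order.
TAXA = ["archaea", "bacteria", "fungi", "human", "invertebrates",
        "mammals", "plants", "rodents", "vertebrates", "viruses"]


def subset_suffix(removed):
    """One character per taxonomic database: a capital letter if its
    sequences were removed from the full set, 'x' otherwise."""
    removed = {name.lower() for name in removed}
    unknown = removed - set(TAXA)
    if unknown:
        raise ValueError(f"unknown taxonomic names: {sorted(unknown)}")
    return "".join(t[0].upper() if t in removed else "x" for t in TAXA)


def subset_fasta_name(removed, prefix="uniprot_sprot"):
    """Hypothetical helper that builds the subset FASTA file name."""
    return f"{prefix}_{subset_suffix(removed)}.fasta"


print(subset_fasta_name(["bacteria", "fungi", "invertebrates", "plants", "viruses"]))
```

Because "vertebrates" and "viruses" share a first letter, it is the position in the suffix, not the letter alone, that identifies which database was removed.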

Details and file structure

Check: Select to update the highlighting, as discussed below. Check is automatically run on startup, and when any of the three "Builds" are executed.

Label Highlights

At the top:
- If the UniProt directory label is highlighted in blue, it exists.
- If the GO directory label is highlighted in pink, it exists but the GO OBO file has not been downloaded.
- If the GO directory label is highlighted in blue, the GO OBO file has been downloaded.
On the middle right:
- If the GO Database label is highlighted in blue, the GO database exists.

Taxonomic and Full UniProt Highlights

Clear checkbox: If a Taxonomic checkbox is clear, then neither the .dat file nor the .fasta file exists for it. When you check the box followed by Build Tax, you will need to confirm a popup that states "Download SP - xxx", where xxx will be the list of files to download. The download is always automatically followed by creating the .fasta files. The same applies to the Full checkboxes.

Pink checkbox: If the .dat file exists, but the .fasta file does not, the checkbox will be highlighted pink. Check the pink box(es) and run Build Tax in order to create the .fasta file only. The same applies to the Full checkboxes.

Blue checkbox: If both the .dat file and the .fasta file exist, the checkbox will be highlighted blue.
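The three checkbox states above amount to a check on which files exist in a taxonomic directory. The following is a minimal sketch of that logic, not the runAS code: the function name highlight_state is hypothetical, and accepting both plain and gzipped file names is an assumption.

```python
from pathlib import Path


def highlight_state(tax_dir):
    """Return 'clear', 'pink' or 'blue' for one UniProt sub-directory."""
    d = Path(tax_dir)
    if not d.is_dir():
        return "clear"
    names = [p.name for p in d.iterdir()]
    # Assumption: the download may be stored as .dat or .dat.gz, and the
    # FASTA file may have been gzipped for DIAMOND.
    has_dat = any(n.endswith((".dat", ".dat.gz")) for n in names)
    has_fasta = any(n.endswith((".fasta", ".fasta.gz")) for n in names)
    if has_dat and has_fasta:
        return "blue"
    if has_dat:
        return "pink"
    if has_fasta:
        # A .fasta without a .dat is not described in the text above.
        return "undocumented"
    return "clear"


print(highlight_state("projects/DBfasta/UniProt_demo/sp_plants"))
```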

File Structure

For each taxonomic and full UniProt that you downloaded, a directory will be created under the UniProt directory. For example,

	./TCW/projects/DBfasta/UniProt_Dec2021%> ls *
	sp_archaea:
	uniprot_sprot_archaea.dat.gz	uniprot_sprot_archaea.fasta

	sp_full:
	uniprot_sprot.dat.gz	uniprot_sprot_AxxxxxxxxV.fasta

	sp_viruses:
	uniprot_sprot_viruses.dat.gz	uniprot_sprot_viruses.fasta

When you run the BLAST or DIAMOND search programs from runSingleTCW, the formatted files will be placed in the corresponding directory.

Compress Fasta: If you plan on using DIAMOND as the search program, you may compress the fasta files after download, e.g.

	cd projects/DBfasta/UniProt_<date>
	gzip */*.fasta

GO (Gene Ontology)

The go-basic.obo file is downloaded from http://current.geneontology.org/ontology/.

Database: This text entry on the runAS interface is the name of the GO MySQL database that will be created; you will enter this name in runSingleTCW.

The processing steps are as follows:

Download the GO Basic OBO file to the GO directory.
Build a GO-specific MySQL database (referred to as GOdb) with the contents of the file.
Add the UniProts from all subdirectories under the UniProt directory (e.g. projects/DBfasta/UniProt_Mar2021) to the GOdb.

Clean up

runAS does not remove files that are no longer necessary, which are the files downloaded from the internet:

All "dat.gz" files in the UniProt directories, as the information has been transferred to the FASTA files and GO database.
The GO directory, as the information has been transferred to the GO database.

These files can be removed, as runSingleTCW uses the FASTA files in the UniProt directories and the GO MySQL database. However, if you do not have a space problem, keep them just for insurance; when UniProt does the monthly update, your downloaded files will no longer be available on their site.

For the FASTA files that you will be using DIAMOND to search against, you can gzip them as DIAMOND can search against gzipped files.

When you are calculating space, remember that the BLAST and DIAMOND programs will format the .fasta file, which takes up even more space. For example:

	/TCW/projects/DBfasta/UniProt_Dec2021/sp_full% ls -hlG
	-rw-r--r--  1 cari  staff   597M Dec 20 07:07 uniprot_sprot.dat.gz
	-rw-r--r--  1 cari  staff    54M Dec 20 15:55 uniprot_sprot_xBFxIxPxxV.fasta
	-rw-r--r--  1 cari  staff    55M Dec 20 16:15 uniprot_sprot_xBFxIxPxxV.fasta.dmnd

Memory and Time

Taxonomic

The following downloads were run on 6-Jun-2021 on a Linux machine with a ~500 Mbps download connection and 128Gb of RAM, on a Sunday afternoon. Note that download times can vary considerably.

File	.dat Size	Download	.fasta Size¹	Creation
uniprot_sprot_bacteria.dat.gz	203Mb	0m:27s	150Mb	0m:25s
uniprot_sprot_fungi.dat.gz	49Mb	0m:15s	21Mb	0m:04s
uniprot_sprot_invertebrates.dat.gz	34Mb	0m:05s	14Mb	0m:02s
uniprot_sprot_plants.dat.gz	51Mb	0m:10s	21Mb	0m:04s
uniprot_sprot_viruses.dat.gz	16Mb	0m:06s	9Mb	0m:01s
uniprot_sprot.dat.gz	587Mb	1m:09s	55Mb²	1m:43s

uniprot_trembl_bacteria.dat.gz	87Gb	1h:57m:02s	64Gb	2h:24m:45s
uniprot_trembl_fungi.dat.gz	8.3Gb	13m:41s	7.4Gb	13m:33s
uniprot_trembl_invertebrates.dat.gz	7.8Gb	12m:05s	6.8Gb	12m:50s
uniprot_trembl_plants.dat.gz	12Gb	16m:21s	10.3Gb	18m:57s
uniprot_trembl_viruses.dat.gz	3.5Gb	5m:20s	2.3Gb	5m:21s

¹When TCW extracts the sequence into a FASTA file, it is not written in a gzipped format. However, if you are going to use DIAMOND, you can gzip them (the gzipped uniprot_trembl_bacteria.fasta file is 31Gb).
²The subset, i.e. full SwissProt minus all downloaded taxonomic entries.

GO database

It takes less than a minute to download the GO file. The time it takes to build the GO database is proportional to the number of UniProts to be processed. For example,

Machine	AnnoDBs	Time	Database size
MacOS Catalina	SwissProt Plant and Full, TrEMBL Plant	24m:47s	2.7Gb
Linux (as specified above)	The 11 taxonomic and full listed above	10h:42m:27s¹	26Gb¹

¹Most of the time, 8h:23m:30s, was for loading the uniprot_trembl_bacteria.dat, which would also account for the database size.

What AnnoDBs to use

Strong suggestions:

Only download what is relevant!
- Download all relevant SwissProt files and the Full SwissProt UniProt.
- Download only the most relevant TrEMBL files, and never the Full TrEMBL UniProt unless absolutely necessary.
Do not perform constant downloads; it is a drain on the UniProt servers.
The UniProts do not change that fast, and re-downloading changes the 'best' hits in TCW, which can disturb any ongoing analysis.

Evidence

The dataset used for the following tests is from de novo assembled sequences from Andropogon gerardii, which is related to Sorghum. It was downloaded from Dryad and published by Hoffman and Smith (2017). The full dataset had >60k transcripts, which was reduced to 27,085 (it was reduced to be able to run faster tests, though care was taken to use unannotated sequences from an earlier annotation).

Four annotations were compared:

Annotation	AnnoDBs	#Annotated
#1	sp_plants, tr_plants, sp_full	25,049 (92.5%)
#2	#1 + sp_virus, sp_fungi, sp_invertebrate, sp_bacteria	25,052 (92.5%)
#3	#2 + tr_virus, tr_fungi, tr_invertebrate, tr_bacteria, tr_full	25,070 (92.6%)
#4	#1 + nr	25,160 (92.9%)

If your organism is not closely related to any model organism, then there will likely be a bigger difference.
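The percentages in the table above follow from the 27,085 transcripts in the reduced dataset. The short check below recomputes them and shows how few additional sequences each larger AnnoDB set annotates compared with annotation #1; the variable and function names are only illustrative.

```python
TOTAL = 27085  # transcripts in the reduced dataset
annotated = {"#1": 25049, "#2": 25052, "#3": 25070, "#4": 25160}


def percent(n, total=TOTAL):
    """Percentage of annotated sequences, rounded to one decimal."""
    return round(100.0 * n / total, 1)


for label, n in annotated.items():
    gain = n - annotated["#1"]
    print(f"{label}: {percent(n)}% annotated, {gain} more than #1")
```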

Creating AnnoDBs from other databases

UniProt and NCBI-nr descriptor lines work with TCW. For other databases, you will need to make sure they have a TCW-accepted descriptor line.

Description lines

The description line is the ">" line that describes the subsequent sequence in a FASTA file. From it, runSingleTCW extracts:

DB type: used in naming the tab output file and is used in viewSingleTCW to aid in identifying where the hitID is from.
hitID: the unique identifier of the hit.
description: generally the functional description
species: the species

UniProt

	>sp|Q9V2L2|1A1D_PYRAB Putative 1-ami OS=Pyrococcus abyssi GN=PYRAB00630 PE=3 SV=1v

For TrEMBL, the first two characters would be 'tr'. The 'sp' or 'tr' is the DB type.
The third entry of the first string is the identifier (e.g. 1A1D_PYRAB)
The string up to the OS is the description.
The string after the "OS=" is the species.
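The UniProt rules above can be sketched as a small parser. This is not the runSingleTCW code: the function name is hypothetical, and ending the species at the next two-letter "XX=" field (such as GN=) is an assumption added so that the trailing fields are not included in the species.

```python
import re


def parse_uniprot_header(line):
    """Split a UniProt description line into the four fields TCW uses."""
    text = line.strip().lstrip(">")
    first, _, rest = text.partition(" ")
    db_type, accession, hit_id = first.split("|")
    description, _, after_os = rest.partition(" OS=")
    # Assumption: the species stops at the next " XX=" field, if any.
    species = re.split(r"\s[A-Z]{2}=", after_os, maxsplit=1)[0]
    return {
        "db_type": db_type,
        "accession": accession,
        "hit_id": hit_id,
        "description": description,
        "species": species,
    }


example = ">sp|Q9V2L2|1A1D_PYRAB Putative 1-ami OS=Pyrococcus abyssi GN=PYRAB00630 PE=3 SV=1"
print(parse_uniprot_header(example))
```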

NCBI nr (See Download NR)

	>XP_642837.1 hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4]

The first entry is the identifier (e.g. XP_642837.1). Note, there is no longer a way to detect the database origin within the file, hence, the DB type will be the generic 'PR' for protein.
The text from the first space to the first "[" is the description.
The text within the "[]" is the species.

As it does not have a "type code", its type will default to "PR". If the taxonomy is given as "nr", the abbreviation for this database will be PRnr.
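The nr rules above can be sketched in the same way. Again, this is only an illustration with a hypothetical function name, not the runSingleTCW code; it reads a single identifier, description and species from the line, as described above.

```python
def parse_nr_header(line):
    """Split an NCBI nr description line into the fields TCW uses."""
    text = line.strip().lstrip(">")
    hit_id, _, rest = text.partition(" ")
    description, bracket, tail = rest.partition("[")
    species = tail.partition("]")[0] if bracket else ""
    return {
        "db_type": "PR",  # generic protein type, as there is no type code
        "hit_id": hit_id,
        "description": description.strip(),
        "species": species.strip(),
    }


example = ">XP_642837.1 hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4]"
print(parse_nr_header(example))
```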

Generic

If you have a file other than UniProt or nr, format the descriptor lines as follows:

	>CC|ID description OS=species

CC is the type code, and will be used as the DB type in TCW.
ID is the unique identifier
Everything up to the OS is the description
Everything after the OS is the species
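When converting another database to this generic format, a small helper like the following can build the descriptor line and catch values that would break it. The function name and the validation rules are assumptions for illustration; TCW does not provide this function.

```python
def make_generic_header(type_code, hit_id, description, species):
    """Build a '>CC|ID description OS=species' descriptor line."""
    for label, value in (("type code", type_code), ("ID", hit_id)):
        if not value or "|" in value or any(ch.isspace() for ch in value):
            raise ValueError(f"{label} must be non-empty, without '|' or spaces")
    if " OS=" in description:
        raise ValueError("description must not contain ' OS='")
    return f">{type_code}|{hit_id} {description.strip()} OS={species.strip()}"


# Hypothetical values, only to show the resulting layout.
print(make_generic_header("CC", "ID001", "some functional description", "Genus species"))
```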

Example 1: The TCW perl script scripts/formatPlantTFDB.pl takes as input a file from PlantTFDB, which has header lines like:

	>KFK36254.1 Arabis alpina|G2-like|G2-like family protein

and converts them to header lines:

	>tf|G2_like_1 G2-like family protein {KFK36254.1} OS=Arabis alpina

The type will be "TF". If the taxonomy entered into runSingleTCW is "plants", the abbreviation for this database will be TFpla.

Example 2: The TCW python script scripts/formatNCBIrna.py takes as input an RNA file from NCBI, which has header lines like:

	>XM_002436391.2 PREDICTED: Sorghum bicolor GDP-mannose 4,6 dehydratase 1 (LOC8069086), mRNA

and converts them to header lines:

	>XM_002436391.2 GDP-mannose 4,6 dehydratase 1 (LOC8069086), mRNA OS=Sorghum bicolor

As this does not have a type code at the beginning, its type will default to "NT". If the taxonomy is entered as "sb", the abbreviation for this database will be NTsb. The script can be modified to add a type code.
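The conversion in Example 2 can be sketched as below. This is not the logic of scripts/formatNCBIrna.py, which is not shown here; it is a minimal version that assumes an optional "PREDICTED: " prefix followed by a two-word species name, and the function name is hypothetical.

```python
def convert_ncbi_rna_header(line):
    """Move the species of an NCBI RNA header to a trailing 'OS=' field."""
    text = line.strip().lstrip(">")
    hit_id, _, rest = text.partition(" ")
    prefix = "PREDICTED: "
    if rest.startswith(prefix):
        rest = rest[len(prefix):]
    words = rest.split(" ")
    if len(words) < 3:
        raise ValueError("expected a two-word species followed by a description")
    species = " ".join(words[:2])  # assumption: species is exactly two words
    description = " ".join(words[2:])
    return f">{hit_id} {description} OS={species}"


example = (">XM_002436391.2 PREDICTED: Sorghum bicolor "
           "GDP-mannose 4,6 dehydratase 1 (LOC8069086), mRNA")
print(convert_ncbi_rna_header(example))
```

Species names with more than two words (e.g. with a subspecies or strain) would need extra handling.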

Entering AnnoDBs and GOs into runSingleTCW

Execute ./runSingleTCW and select your project.

Select Import Anno; a file chooser will pop up. Select either of the following to enter the names of the UniProts in the AnnoDB table and the GO database:

projects/AnnoDBs_UniProt_<date>.cfg: This will use the AnnoDBs & GO written by AnnoDB.cfg.
projects/<project-name>/sTCW.cfg: This will use the AnnoDBs & GO used by another project.
Now you are ready to run Annotate with the UniProt and GO you just downloaded.

AnnoDBs can be entered using the Add button, where the taxonomy is defined. They can also be changed with Edit.

The GO database and GO slim category are defined or changed in the Options menu.

Why use taxonomic databases instead of the full UniProt

viewSingleTCW refers to the annoDBs by the 'DBtype' and 'taxonomy', with them combined into 'DBtax'. The DBtype and taxonomy can be queried on and columns of the data viewed. The "sp" is SwissProt and the "tr" is "TrEMBL".

The following shows an example of a set of hit proteins:

The following shows a table of sequences:

The following shows the details of a specific sequence:

Parsing go-basic.obo

The following is an example record in the OBO file:

[Term]
id: GO:0000785
name: chromatin
namespace: cellular_component
alt_id: GO:0000789
alt_id: GO:0000790
alt_id: GO:0005717
def: "The ordered and organized complex of DNA, protein, ....
comment: Chromosomes include parts that are not part of  ....
synonym: "chromosome scaffold" RELATED []
synonym: "cytoplasmic chromatin" NARROW []
synonym: "nuclear chromatin" NARROW []
xref: NIF_Subcellular:sao1615953555
is_a: GO:0110165 ! cellular anatomical entity
relationship: part_of GO:0005694 ! chromosome

TCW parses for the following keywords:

Keyword	AmiGO term	TCW term	Example
id	Accession	GO ID	GO:0000785
name	Name	Description	chromatin
namespace	Ontology	Domain	cellular_component
is_a	is_a	is_a	GO:0110165
relationship: part_of	?	part_of	GO:0005694
alt_id	Alternate ID	Alternate ID	GO:0000790
replaced_by	replaced by	Replaced by	GO:0000785
is_obsolete: true	Name: obsolete	Description: obsolete	obsolete replicative cell aging
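The keywords in the table above can be pulled from the [Term] stanzas with a small parser. The following is a sketch under stated assumptions, not the TCW code: the function name and the dictionary layout are hypothetical, and only the keywords listed in the table are kept.

```python
SAMPLE_OBO = """\
[Term]
id: GO:0000785
name: chromatin
namespace: cellular_component
alt_id: GO:0000789
alt_id: GO:0000790
alt_id: GO:0005717
is_a: GO:0110165 ! cellular anatomical entity
relationship: part_of GO:0005694 ! chromosome
"""


def parse_obo_terms(lines):
    """Return one dict per [Term] stanza with the keywords from the table."""
    terms, cur = [], None
    for raw in lines:
        line = raw.strip()
        if line.startswith("["):  # start of a new stanza of any kind
            if cur is not None:
                terms.append(cur)
            cur = None
            if line == "[Term]":
                cur = {"id": None, "name": None, "namespace": None,
                       "is_a": [], "part_of": [], "alt_id": [],
                       "replaced_by": None, "is_obsolete": False}
        elif cur is not None and ": " in line:
            key, value = line.split(": ", 1)
            if key in ("id", "name", "namespace", "replaced_by"):
                cur[key] = value
            elif key == "alt_id":
                cur["alt_id"].append(value)
            elif key == "is_a":
                cur["is_a"].append(value.split(" ! ")[0])  # drop the comment
            elif key == "relationship" and value.startswith("part_of "):
                cur["part_of"].append(value.split()[1])
            elif key == "is_obsolete":
                cur["is_obsolete"] = value == "true"
    if cur is not None:
        terms.append(cur)
    return terms


for term in parse_obo_terms(SAMPLE_OBO.splitlines()):
    print(term["id"], term["name"], term["namespace"], term["is_a"], term["part_of"])
```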

Views in AmiGO and TCW:

AmiGO	TCW

NOTES:

UniProt occasionally uses the Alternate IDs and has a few Obsolete GO terms.
I cannot guarantee that AmiGO always treats "alt_id" as specified here.

Links to relevant databases

To download the UniProt files without runAS:

Go to UniProt Downloads.
In the second line from the top, it says "For downloading complete data sets we recommend using ftp.uniprot.org." Click the ftp.uniprot.org.
This brings up the UniProt download directories in a Finder window. You may view it as "Guest".
Click "Current_release", "knowledgebase". Here you will see "complete" and "taxonomic_divisions".

The NCBI-nr database can be downloaded:

NCBI nr (https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA).
As of 24-Jan-21, it is 89GB and took 1h:45m to download.
It is called nr.gz; since the File Chooser requires a FASTA suffix, rename it: mv nr.gz nr.fa.gz

GO Basic OBO file: http://current.geneontology.org/ontology/.

Email Comments To: tcw@agcol.arizona.edu