1. Raw reads (pipeline only)
The naming is very important, as the runAW software uses the name to decode the
conditions and replicates. The AW allows one or two conditions, which you will define in
the runAW program that builds the database. The runAW has a table for each condition,
where it asks for the name and abbreviation. The best way to explain is with an example:
Say there were two inbred parents and a hybrid, where the transcriptome was extracted
from the root and leaves with two replicates; there would be 6 libraries and 12 samples.
If you start with unpaired raw read files as follows (for pairing see below):
R9Rt1.fq, R9Rt2.fq, R9Lf1.fq, R9Lf2.fq
XzRt1.fq, XzRt2.fq, XzLf1.fq, XzLf2.fq
X9Rt1.fq, X9Rt2.fq, X9Lf1.fq, X9Lf2.fq
The names will be preserved through the AW pipeline, resulting in the heterozygous SNP files,
which are input to runAW. They will be named:
R9Rt1.bed, R9Rt2.bed, R9Lf1.bed, R9Lf2.bed
XzRt1.bed, XzRt2.bed, XzLf1.bed, XzLf2.bed
X9Rt1.bed, X9Rt2.bed, X9Lf1.bed, X9Lf2.bed
- The first condition must be the first part of the file name, the second part must be
the second condition (if it exists) and the replicate number must be last.
- There can be a '-' or '_' between the conditions and replicate number, e.g. R9_Rt_1.fq,
but no other characters are allowed.
- The abbreviations must be EXACTLY like you have entered into runAW interface.
If necessary, rename your files -- it does not take long.
- If you have a 3rd condition, just merge two conditions into one, e.g.
Infected root->iRt, uninfected root->Rt, infected leaf->iLf, uninfected leaf->Lf.
- The input file to AW must end with the suffix ".bed".
Anything from the first "." to the end is removed before parsing for the library name
and replicate number (e.g. for X9Rt1.ase.bed, the ".ase.bed" is removed).
- If there are no replicates, then it is okay for none of the files to have replicate numbers.
- To indicate pairing, add suffixes "_R1","_R2", e.g. "R9Rt1_R1.fq". (You can use other
suffixes too; see the Pipeline documentation)
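The naming rules above can be illustrated with a short Python sketch. This is not the
runAW parser; the function name parse_library_name and its arguments are hypothetical, and
it assumes the abbreviations are the ones entered into the runAW interface. Pairing
suffixes such as "_R1" are not handled here.

```python
import re

def parse_library_name(filename, cond1_abbrevs, cond2_abbrevs=()):
    """Split a file name into (condition1, condition2, replicate).

    Hypothetical helper for illustration; abbreviations are matched
    case-sensitively, since they must match the runAW entries exactly.
    """
    # Drop everything from the first "." onward: "X9Rt1.ase.bed" -> "X9Rt1".
    stem = filename.split(".", 1)[0]

    def alternatives(abbrevs):
        # Try longer abbreviations first so "iRt" is tested before "Rt".
        ordered = sorted(abbrevs, key=len, reverse=True)
        return "|".join(re.escape(a) for a in ordered)

    pattern = f"({alternatives(cond1_abbrevs)})"
    if cond2_abbrevs:
        # An optional '-' or '_' may separate the two conditions.
        pattern += f"[-_]?({alternatives(cond2_abbrevs)})"
    # The replicate number comes last and is optional.
    pattern += r"(?:[-_]?(\d+))?"

    match = re.fullmatch(pattern, stem)
    if match is None:
        raise ValueError(f"cannot parse library name from {filename!r}")
    groups = match.groups()
    cond1 = groups[0]
    cond2 = groups[1] if cond2_abbrevs else None
    replicate = int(groups[-1]) if groups[-1] is not None else None
    return cond1, cond2, replicate
```

For the example above, parse_library_name("X9Rt1.ase.bed", ["R9", "Xz", "X9"], ["Rt", "Lf"])
gives the strain "X9", the tissue "Rt" and replicate 1. A name whose abbreviations differ
in case from the ones supplied raises ValueError.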
2. Genome Annotation GTF file
This file must be formatted according to http://mblab.wustl.edu/GTF22.html.
The AW parser works well with the Ensembl files; if you are using a GTF other than Ensembl,
you may need to rename some keywords.
The GTF file must have 9 tab-separated columns, where the following are required:
1st column: Seqname
- This is generally the chromosome or contig.
- AW determines the "root" of the seqname, which is the part that is followed by a number,
e.g. for chr1 the root is "chr", and for contig1 the root is "contig".
- As long as there are entries with this format, then it also accepts chrX, chrY, etc.
- All other input files that specify a seqname must use the same root or identify the
sequence without a root, e.g. if the GTF has chr1, chr2 and chrX, then all other files
must use those terms or the suffixes without the root, i.e. 1, 2 and X.
2nd column: Source
- If the file is from Ensembl, then the source is the type, and only the type of
"protein_coding" is entered into the database.
- If the file does not contain "protein_coding" for the first 10000 entries, then all
entries are added and this column is ignored.
3rd column: Feature.
- The feature CDS is required.
- The features exon, start_codon and end_codon are also used if available.
Last column: Attributes.
- Attributes "gene_id" and "transcript_id" are required.
If you only have genes, then just name the transcript_id the same as the gene_id.
- It will use the attribute keywords "gene_name" and "transcript_name" if they are available.
- If you want to add the NCBI annotation (see below), these should be the common names used in NCBI.
- If there is no gene_name or transcript_name, then the "name" will just be geneN and transN
where N is a sequential number.
- If a gene name is duplicated (i.e. associated with two different gene_id), then a numbered
suffix is added to the gene name.
- All entries for a gene must be contiguous in the file.
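Some of the GTF requirements above can be checked before running runAW. The following
Python sketch is only an illustration and not the AW parser: the function names are
hypothetical, the attributes are read from the last tab-separated field, and only the
seqname root, the required attributes, the CDS feature and gene contiguity are examined.

```python
import re

def seqname_root(seqnames):
    """Return the shared prefix of names like "chr1" or "contig12", else None."""
    roots = set()
    for name in seqnames:
        match = re.fullmatch(r"(\D+)(\d+)", name)
        if match:
            roots.add(match.group(1))
    return roots.pop() if len(roots) == 1 else None

def strip_root(name, root):
    """Map "chrX" to "X" so that names with and without the root compare equal."""
    return name[len(root):] if root and name.startswith(root) else name

def parse_attributes(field):
    """Turn 'gene_id "G1"; transcript_id "T1";' into a dict."""
    return dict(re.findall(r'(\w+)\s+"([^"]*)"', field))

def check_gtf(lines):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    finished_genes = set()
    current_gene = None
    has_cds = False
    for n, line in enumerate(lines, 1):
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 3:
            problems.append(f"line {n}: too few tab-separated columns")
            continue
        attrs = parse_attributes(cols[-1])
        gene = attrs.get("gene_id")
        if gene is None or "transcript_id" not in attrs:
            problems.append(f"line {n}: missing gene_id or transcript_id")
            continue
        if cols[2] == "CDS":
            has_cds = True
        if gene != current_gene:
            # A gene that reappears after another gene is not contiguous.
            if gene in finished_genes:
                problems.append(f"line {n}: entries for gene {gene} are not contiguous")
            if current_gene is not None:
                finished_genes.add(current_gene)
            current_gene = gene
    if not has_cds:
        problems.append("no CDS feature found")
    return problems
```

For example, seqname_root(["chr1", "chr2", "chrX"]) returns "chr", and
strip_root("chrX", "chr") returns "X", which is the form another input file may use.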
3. Genome Sequence
For input into runAW, the genome sequence must be split by chromosome (or contig).
Each file name must correspond to a seqname found in the GTF file, e.g. chr1.fa,
chr2.fa and chrX.fa. You may use the script GSsplit.pl to split the genome.
This is used in conjunction with the GTF annotation file to create the spliced nucleotide
sequences and amino acid sequences, which are written into the project directory.
The routine was tested using the Ensembl Mouse GTF file. The AW transcript cDNA sequences are exactly like
the ones from Ensembl, and most of the proteins are the same except for some without start_codon
and/or end_codon. A small set (0.3%) had no translation (i.e. no frame without stop codons), whereas the
Ensembl proteins were fine. runAW will remark such transcripts; the cause will be determined and
fixed in a future release.
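If the genome is in a single FASTA file, the split into one file per sequence can be
sketched in Python as follows. This is a stand-in for illustration and not the GSsplit.pl
script; the function name split_fasta is hypothetical, and it assumes that the first word
of each ">" header is the seqname used in the GTF file.

```python
from pathlib import Path

def split_fasta(fasta_path, out_dir):
    """Write each sequence of a multi-FASTA file to <out_dir>/<seqname>.fa.

    Returns the list of file names written, in input order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    out = None
    with open(fasta_path) as fin:
        try:
            for line in fin:
                if line.startswith(">"):
                    # A new header starts a new output file.
                    if out is not None:
                        out.close()
                    # Assumption: the seqname is the first word after ">".
                    name = line[1:].split()[0]
                    path = out_dir / f"{name}.fa"
                    out = open(path, "w")
                    written.append(path.name)
                if out is not None:
                    out.write(line)
        finally:
            if out is not None:
                out.close()
    return written
```

The returned names can then be compared with the seqnames in the GTF file to confirm
that every chromosome or contig has a matching file.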
4. Variant Call Format file
Standard format. You may want to use dbSNP names where possible.
5. Variant effect (runAW - optional)
Either snpEff or Ensembl Variant Effect Predictor annotation can be used.
runAW will expect either of the following two lines at the top:
## ENSEMBL VARIANT EFFECT PREDICTOR
It will fail if neither is present. This tells runAW/buildAW what the file format is.
The Ensembl Variant Effect Predictor can be run from the Ensembl website for specific organisms.
snpEff is available from http://snpeff.sourceforge.net, and is very easy
to use if your genome sequence is known by it.
6. Gene (NCBI) Annotation (runAW - optional)
If your GTF file has "gene_name" that corresponds to LOCUS in Genbank, then you can add the
gene annotation. Go to NCBI, select database "Protein", search on your organism, then select
"Send to:", select File, select format GenPept. The lines the AW parser uses are, e.g.
DEFINITION transient receptor potential cation channel subfamily V member 6
AW matches the LOCUS with the GTF gene_name, and enters the DEFINITION and gene_synonym.
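To show which parts of a GenPept record are involved, here is a minimal Python sketch.
It is not the AW parser: the function name parse_genpept is hypothetical, it assumes the
usual GenBank flat-file layout (LOCUS and DEFINITION keywords at the start of a line,
"/gene_synonym=" qualifiers in the feature table, "//" ending a record), and it does not
handle gene_synonym values wrapped over several lines.

```python
def parse_genpept(lines):
    """Return {locus: {"definition": str, "synonyms": [str, ...]}}."""
    records = {}
    locus = None
    in_definition = False  # True while reading DEFINITION continuation lines
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("//"):
            # End of record.
            locus, in_definition = None, False
        elif line.startswith("LOCUS"):
            locus = line.split()[1]
            records[locus] = {"definition": "", "synonyms": []}
            in_definition = False
        elif locus is None:
            continue
        elif line.startswith("DEFINITION"):
            records[locus]["definition"] = line[len("DEFINITION"):].strip()
            in_definition = True
        elif in_definition and line.startswith(" "):
            # Indented lines continue the DEFINITION text.
            records[locus]["definition"] += " " + line.strip()
        else:
            in_definition = False
            stripped = line.strip()
            if stripped.startswith("/gene_synonym="):
                value = stripped.split("=", 1)[1].strip('"')
                synonyms = [s.strip() for s in value.split(";") if s.strip()]
                records[locus]["synonyms"].extend(synonyms)
    return records
```

The resulting dictionary is keyed by LOCUS, so a lookup with a GTF gene_name returns the
DEFINITION text and the list of synonyms for that gene, if the record exists.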