Pipeline Processing Scripts
The AW package includes a set of scripts designed to facilitate batch processing for
transcript DE and ASE analysis. The scripts cover quality control, trimming, alignment
to reference, variant calling, quantification of expression, and removal of reference bias (for ASE studies).
The scripts call on a number of external tools (such as samtools and tophat2) which have
also been included in the package (in the "ext" directory).
These batch functions are also available on the iPlant Atmosphere image, which you may
wish to use if your computational resources are limited. You can expand and reduce the image
size as needed for a pipeline step.
The batch scripts follow consistent conventions:
- Running any script with no parameters gives its help text
- All scripts create a capitalized directory (e.g., TRIM) containing output
- The main outputs are in a subdirectory "Results" (e.g.,TRIM/Results)
- All scripts create a Summary.html with collected run statistics
- The primary input to one script is usually the Results directory from a prior
- Most scripts have a "-t" flag for number of threads/processes to use
- If you wish to make more detailed parameter changes or use different versions of
the underlying tools such as tophat, you can do this easily by editing variables at the top of the scripts
Important:If your read files are paired, you must name them using suffixes to
identify the pairs. By default, the scripts look for suffixes "_R1", "_R2" to indicate
pairing; if you use a different convention then edit the variables "$fSuffix" and "$rSuffix"
at the top of the Align and Trim scripts.
In addition, for loading into the AW database, filenames
need to contain labels identifying the sample and conditions; see the Files documentation
The following table summarizes the batch scripts provided and the tools they call. (Before
running the pipeline, it is important also to familiarize yourself with the underlying tools
such as tophat).
|1. GSmask||Genome sequence file, variant file (vcf)||Masked genome sequence
||Bedtools || -
|2. QC||NGS read files (fastq)||HTML files with quality metrics.
||FastQC || Unified HTML
|3. Trim||NGS read files ||Trimmed read files
||Trimmomatic || -
|4. Align ||Trimmed read files, masked genome||alignment files (.bam)
||Tophat2 || -
|5. snpASE||Alignment files + variant file||Ref/alt coverage counts at each SNP location
(in .bed file format)||Samtools || Parse pileup to counts
|6. transASE||Parental transcript files (see below)||Ref/alt estimated expression levels
for each transcript||STAR + eXpress || Create TCW counts
|7. Variants||Alignment files||SNPs/Indels for each alignment, and combined
||Samtools ||Combine sample calls
The package also includes a utility script GSsplit.pl which splits a whole-genome fasta file into its separate chromosomes, for loading into AW.
The required input is:
For ASE you will need also a variant file (.vcf); this may also be generated by calling the variants
on your read set. (Note that indels may be loaded and viewed in AW, but only SNPs are used in the expression analysis.)
- Raw read files; see the Files documentation for important naming requirements.
- Genome annotation in GTF2.2 format (see http://mblab.wustl.edu/GTF22.html for details),
along with the Files documentation.
- Genome sequence directory of chromosome FASTA files (see the Files documentation for details; if
your genome is in a single file, use the GSsplit.pl utility to split it).
Example Batch Pipeline Run
- Start with fastq paired read files in directory reads/, genome fasta file(s) in directory genome/,
and annotation file annotation.gtf. We will assume that you also already have SNP calls
snps.vcf (for SNP calling, see below).
- Start by checking the read quality:
scripts/QC.sh -d reads
- Study the html files in the resulting QC directory and decide on trimming parameters.
- Execute a trim, e.g.:
scripts/Trim.pl -d reads -p -P "LEADING:30 TRAILING:30 HEADCROP:5 MINLEN:30"
- and then check the quality of the resulting reads:
scripts/QC.sh -d TRIM/Results
If the quality is not acceptable, re-run the trim with different parameters (rename the TRIM
the directory if you wish to preserve prior results).
- Next, for ASE studies it is important to reduce reference bias, which can be done by masking
the SNP loci in the genome before aligning reads:
scripts/GSmask.pl -i genome -v snps.vcf
- Now the masked genome files are in GSMASK/Results. Align the trimmed reads to the
genome, providing the annotation as guidance, and using 10 threads:
scripts/Align -g GSMASK/Results -r TRIM/Results -p -a annotation.gtf -t 10
This can take some time and generates bam alignment files in TOPHAT/Results, as
well as an alignment summary TOPHAT/Summary.html.
- Check the summary and if it seems
ok then proceed to generate the ASE SNP coverage counts; these measure ASE by counting the read
coverage for the ref/alt alleles at each SNP locus.
scripts/snpASE.pl -i TOPHAT/Results -v snps.vcf
This script produces count files in the SNPCOV/Results directory which have a .bed
format and may be loaded into AW.
- At this point you have everything needed for AW analysis of ASE. Build the AW project using runAW.
- When building the AW database, the SNP coverage is summed for each transcript, however
you may also want a more explicit estimation of read coverage per transcript.
For this purpose the eXpress package has been included, which takes into account isoforms,
and can also compute ASE if provided with the transcripts from the different alleles.
- To run eXpress, first create the AW project first create the AW database
using data generated so far.
This automatically creates "ref" and "alt" transcript sets by using the given
SNP and annotation files. The transcript files are located
in directory projects/<project_name>/AW_outputs, and are named
- Once you have the generated transcript files, align the reads to them using the
transASE.pl batch script. This
script calls the STAR aligner to do the alignment, and then eXpress to quantify
the expression; ASE is seen as expression differences between ref and alt transcripts.
(STAR is used because
it runs much faster than Bowtie with the lenient parameters which are recommended for eXpress.)
scripts/transASE.pl -t ntRef.fasta -u ntAlt -i TRIM/Results -p
- The result is expression files (.xprs) which can be loaded into AW for the
second measure of ASE. A directory ResultsTCW is also created which contains
ordinary (non-ASE) expression quantification, and can be loaded into TCW for DE studies.
Batch Script Parameters and Versions of Tools
The batch scripts expose only a few of the underlying parameters of the tools which they use.
If you wish to adjust more of these parameters, you can edit variables at the head
of each script. Also, if you wish to use a different tool version than supplied
in the ext directory, you can edit the tool path variables at the head of the scripts.
If the strains you are working with do not have predetermined variants, then you will need to
call them using your dataset. For this purpose a script Variants.pl has
been provided, which uses the Samtools/Bcftools variant call functions. Before calculating
the variants, you must trim your reads and alignment them with align.pl.
will have one bam alignment file per sample, and the tool will call separately for each alignment
and then combine them into one using a simple threshold (by default, called in at least 5 samples):
scripts/Variants.pl -g genome -i TOPHAT/Results -t 01 -m 5
The result is a file VARIANTS/combined.vcf with the combined calls, as well as Results directory
containing the calls for each sample.
The script uses samtools to call variants separately for each alignment
bam file, and then calls another script VarComb.pl to combine them
into a single output combined.vcf.
The -t and -m are VarComb.pl parameters.
If there are no results in combined.vcf, then your -m value
is probably too high. Run VarComb.pl from the VARIANTS directory using perl ../scripts/VarComb.pl with the
- -t indicates selecting for heterozygous variants, appropriate to an F1 hybrid. If you have an inbred, choose -t 11 or leave off
the parameter to keep all variants.
- -m sets the threshold number of samples need to call a combined snp.
After SNP calling, then you must do the masking (step 6), re-alignments (step 7), and then continue.
SNP-calling is highly dependent on project details, e.g. whether you have data for inbred (homozygous) lines,
whether the reference sequence matches at least one strain, etc. The provided scripts should prove useful
in many scenarios but will often need to be supplemented with other SNP manipulation
programs such as vcftools.
The external binaries supplied are 64-bit Linux and the pipeline has been
tested on RedHat Enterprise Server 6.5 and CentOS 6.5. On other platforms
the scripts should work, however you may need to supply some or all of
the external tools and alter the path variables in the script headers.