The University of Arizona
AW Pipeline
AW Guide v1.0 | Pipeline | Files | runAW

Pipeline Processing Scripts

The AW package includes a set of scripts designed to facilitate batch processing for transcript DE and ASE analysis. The scripts cover quality control, trimming, alignment to reference, variant calling, quantification of expression, and removal of reference bias (for ASE studies). The scripts call on a number of external tools (such as samtools and tophat2) which have also been included in the package (in the "ext" directory).

These batch functions are also available on the iPlant Atmosphere image, which you may wish to use if your computational resources are limited. You can expand and reduce the image size as needed for a pipeline step.

The batch scripts follow consistent conventions:

  • Running any script with no parameters gives its help text
  • All scripts create a capitalized directory (e.g., TRIM) containing output
  • The main outputs are in a subdirectory "Results" (e.g.,TRIM/Results)
  • All scripts create a Summary.html with collected run statistics
  • The primary input to one script is usually the Results directory from a prior
  • Most scripts have a "-t" flag for number of threads/processes to use
  • If you wish to make more detailed parameter changes or use different versions of the underlying tools such as tophat, you can do this easily by editing variables at the top of the scripts

Important:If your read files are paired, you must name them using suffixes to identify the pairs. By default, the scripts look for suffixes "_R1", "_R2" to indicate pairing; if you use a different convention then edit the variables "$fSuffix" and "$rSuffix" at the top of the Align and Trim scripts.

In addition, for loading into the AW database, filenames need to contain labels identifying the sample and conditions; see the Files documentation for details.

The following table summarizes the batch scripts provided and the tools they call. (Before running the pipeline, it is important also to familiarize yourself with the underlying tools such as tophat).

ScriptInputOutputTools UsedPostprocess
1. GSmaskGenome sequence file, variant file (vcf)Masked genome sequence Bedtools -
2. QCNGS read files (fastq)HTML files with quality metrics. FastQC Unified HTML
3. TrimNGS read files Trimmed read files Trimmomatic -
4. Align Trimmed read files, masked genomealignment files (.bam) Tophat2 -
5. snpASEAlignment files + variant fileRef/alt coverage counts at each SNP location (in .bed file format)Samtools Parse pileup to counts
6. transASEParental transcript files (see below)Ref/alt estimated expression levels for each transcriptSTAR + eXpress Create TCW counts
7. VariantsAlignment filesSNPs/Indels for each alignment, and combined Samtools Combine sample calls

The package also includes a utility script GSsplit.pl which splits a whole-genome fasta file into its separate chromosomes, for loading into AW.

Input Files

The required input is:
  • Raw read files; see the Files documentation for important naming requirements.
  • Genome annotation in GTF2.2 format (see http://mblab.wustl.edu/GTF22.html for details), along with the Files documentation.
  • Genome sequence directory of chromosome FASTA files (see the Files documentation for details; if your genome is in a single file, use the GSsplit.pl utility to split it).
For ASE you will need also a variant file (.vcf); this may also be generated by calling the variants on your read set. (Note that indels may be loaded and viewed in AW, but only SNPs are used in the expression analysis.)

Example Batch Pipeline Run

  1. Start with fastq paired read files in directory reads/, genome fasta file(s) in directory genome/, and annotation file annotation.gtf. We will assume that you also already have SNP calls snps.vcf (for SNP calling, see below).

  2. Start by checking the read quality:

    scripts/QC.sh -d reads

  3. Study the html files in the resulting QC directory and decide on trimming parameters.

  4. Execute a trim, e.g.:

    scripts/Trim.pl -d reads -p -P "LEADING:30 TRAILING:30 HEADCROP:5 MINLEN:30"

  5. and then check the quality of the resulting reads:

    scripts/QC.sh -d TRIM/Results

    If the quality is not acceptable, re-run the trim with different parameters (rename the TRIM the directory if you wish to preserve prior results).

  6. Next, for ASE studies it is important to reduce reference bias, which can be done by masking the SNP loci in the genome before aligning reads:

    scripts/GSmask.pl -i genome -v snps.vcf

  7. Now the masked genome files are in GSMASK/Results. Align the trimmed reads to the genome, providing the annotation as guidance, and using 10 threads:

    scripts/Align -g GSMASK/Results -r TRIM/Results -p -a annotation.gtf -t 10

    This can take some time and generates bam alignment files in TOPHAT/Results, as well as an alignment summary TOPHAT/Summary.html.

  8. Check the summary and if it seems ok then proceed to generate the ASE SNP coverage counts; these measure ASE by counting the read coverage for the ref/alt alleles at each SNP locus.

    scripts/snpASE.pl -i TOPHAT/Results -v snps.vcf

    This script produces count files in the SNPCOV/Results directory which have a .bed format and may be loaded into AW.

  9. At this point you have everything needed for AW analysis of ASE. Build the AW project using runAW.

  10. When building the AW database, the SNP coverage is summed for each transcript, however you may also want a more explicit estimation of read coverage per transcript. For this purpose the eXpress package has been included, which takes into account isoforms, and can also compute ASE if provided with the transcripts from the different alleles.

  11. To run eXpress, first create the AW project first create the AW database using data generated so far. This automatically creates "ref" and "alt" transcript sets by using the given SNP and annotation files. The transcript files are located in directory projects/<project_name>/AW_outputs, and are named ntRef.fasta, ntAlt.fasta.

  12. Once you have the generated transcript files, align the reads to them using the transASE.pl batch script. This script calls the STAR aligner to do the alignment, and then eXpress to quantify the expression; ASE is seen as expression differences between ref and alt transcripts. (STAR is used because it runs much faster than Bowtie with the lenient parameters which are recommended for eXpress.)

    scripts/transASE.pl -t ntRef.fasta -u ntAlt -i TRIM/Results -p

  13. The result is expression files (.xprs) which can be loaded into AW for the second measure of ASE. A directory ResultsTCW is also created which contains ordinary (non-ASE) expression quantification, and can be loaded into TCW for DE studies.

Batch Script Parameters and Versions of Tools

The batch scripts expose only a few of the underlying parameters of the tools which they use. If you wish to adjust more of these parameters, you can edit variables at the head of each script. Also, if you wish to use a different tool version than supplied in the ext directory, you can edit the tool path variables at the head of the scripts.

Variant Calling

If the strains you are working with do not have predetermined variants, then you will need to call them using your dataset. For this purpose a script Variants.pl has been provided, which uses the Samtools/Bcftools variant call functions. Before calculating the variants, you must trim your reads and alignment them with align.pl.

Typically you will have one bam alignment file per sample, and the tool will call separately for each alignment and then combine them into one using a simple threshold (by default, called in at least 5 samples):

scripts/Variants.pl -g genome -i TOPHAT/Results -t 01 -m 5

The result is a file VARIANTS/combined.vcf with the combined calls, as well as Results directory containing the calls for each sample.

The script uses samtools to call variants separately for each alignment bam file, and then calls another script VarComb.pl to combine them into a single output combined.vcf. The -t and -m are VarComb.pl parameters.

  • -t indicates selecting for heterozygous variants, appropriate to an F1 hybrid. If you have an inbred, choose -t 11 or leave off the parameter to keep all variants.
  • -m sets the threshold number of samples need to call a combined snp.
If there are no results in combined.vcf, then your -m value is probably too high. Run VarComb.pl from the VARIANTS directory using perl ../scripts/VarComb.pl with the appropriate parameters.

After SNP calling, then you must do the masking (step 6), re-alignments (step 7), and then continue.

SNP-calling is highly dependent on project details, e.g. whether you have data for inbred (homozygous) lines, whether the reference sequence matches at least one strain, etc. The provided scripts should prove useful in many scenarios but will often need to be supplemented with other SNP manipulation programs such as vcftools.

Supported Platforms

The external binaries supplied are 64-bit Linux and the pipeline has been tested on RedHat Enterprise Server 6.5 and CentOS 6.5. On other platforms the scripts should work, however you may need to supply some or all of the external tools and alter the path variables in the script headers.
Email Comments To: tcw@agcol.arizona.edu