The University of Arizona
sTCW Differential Expression Guide
AGCoL | TCW Home | Doc Index | singleTCW Guide | DE Guide | multiTCW Guide | Tour
The TCW includes the runDE interface to several popular R packages for computing differential expression (DE), including
  1. EdgeR, DESeq2 or supply a file of R commands.
  2. GOseq for DE enrichment in GO categories.
This document assumes that the "demoTra" project has been loaded as described in the User Guide. If you have the demos from before the 27Mar16 release, the demo demoOl_DE is similar to demoTra.

Contents

      
  1. Installing R and packages
  2. Example: demoTra
  3. DE between conditions
    1. Methods
    2. Options
    3. Execute
  4. DE for GO categories
  5. R directory permissions

1. Installing R and packages

Before running DE computations the relevant R packages must be installed. Note, you only need to install those packages which you intend to use.

Installing R

The following instructions are for Linux systems. See
Mac-OSX for Mac instructions. On Redhat or CentOS Linux, R can be installed simply by
sudo yum install R-devel
If you do not have sudo privileges (sudo allows a general user to run some commands as the root user), you will need an administrator to do the install. Note, even if R is already installed, you probably still need to run this command to install the "devel" libraries, which are needed for rJava package used by TCW to interface with R.

This is all that you strictly need to do; however, unless you plan to run as root (not recommended), or will be the only user of DE computations, you will also want to set up R so that one user can install packages to be used by all users. See R directory permissions.

Installing rJava

TCW uses the rJava package to interface directly to R. Starting with v1.3, you should not need to do anything for rJava on Linux or Mac; the necessary library is included in the package and should be found. If, however, you see an error involving libjri.so when trying DE functions, you may try the following: copy the library to a system directory, e.g. on Linux:
sudo cp external/libjri.so  /usr/lib64
Also use the file function to verify that the library is actually usable on your system. A 64-bit library is supplied, which will not work on 32-bit systems, or with 32-bit Java installed on 64-bit systems.

Installing packages

The packages used by TCW can all be installed through the "biocLite" system of Bioconductor as follows:
  1. Run R (type "R" at the command line)
  2. At the R prompt, type: source("http://bioconductor.org/biocLite.R")
  3. At the R prompt, install the packages you need:
    PackageInstallVersion as of 13-Dec-151NoteDependencies, e.g.2
    edgeRbiocLite("edgeR")3.12.0Used for demobiocLite("locfit")
    DESeq2biocLite("DESeq2")1.10.0  sudo yum install libxml2-devel
    GOSeqbiocLite("goseq")1.22.0 sudo yum install curl-devel
    multtestbiocLite("multtest")2.26.0For FDR 
    1These packages have been tested on Linux and Mac as of 13-Dec-15 with the latest versions and R 3.2.2, but also worked on older 2014/2015 versions. 2There are many dependencies, most are automatically resolved with the biocLite command, but some you may need to track down as they are dependent on your system. You will probably need the ones listed in the last column. Note, if you do not have sudo permissions, your system administrator will need to perform the 'yum install' commands.

2. Example: demoTra (release 27Mar16)

Go to top
From the command line, type:
    >runDE tra
Alternatively, type runDE with no arguments and the sTCW database graphical chooser will appear with all sTCW databases listed. This brings up the runDE interface shown at right.

Define a pairwise comparison by selecting "Root" for Group 1 and "Tip" for Group 2, as shown.

Check the box Save results in p-value column, the text box will display the column name "RoTi", which can be changed.

Assuming you have installed edgeR, leave the method as is.

Execute Group 1 - Group 2. The edgeR package will run, with commands printed to the console.

Note that both "Root" has 5 replicates and "Tip" has one replicates, indicated on the terminal:

Root has 5 replicates: Root1, Root2, Root3, Root4, Root5
Tip has 1 replicates: Tip1
Collecting count data (may take several minutes)
Using count filter >1
   4 filtered sequences
The last two lines of the above printout indicate that 4 sequences were filtered out because they did not have any replicates with more than a count of 1. When the computation is done you will see on the terminal:
Number of DE results:
   <1e-5 <1e-4 <0.001 <0.01 <0.05
      16    17    20    29    41

Saving 207 scores
Adding database column RoTi (may take a while)...
    2 NA scores
    207 saved scores
Finish adding scores to sTCW
Create library DE table

Finished DE execution for tra at:  0m:4s

EdgeR plots:
  Scatter plot: plotBCV(y)
  Cluster plot: plotMDS(y, labels=grpNames)
The console is in R, you may run R commands -- q() or  Cntl-C when done, or perform another Execute.
The R console remains open and you can now manually expore the data using R, if desired. EdgeR has a couple of useful plot commands (plotBCV, plotMDS) which are indicated in the text; many of the other packages also have useful data exploration functions, described in their documentation. If you don't wish to use the R console you can return to the DE interface and run additional comparisons.

Note: the line editor for R (i.e. scrolling through the R commands and editing the commands) may not work when run from the TCW initiated R session; if you want line editing, type q(), say 'y' to saving the workspace image, then restart R in order to examine results and view plots with line editing.

View Results: To view the resulting p-values, run 'viewSingleTCW tra' from the command line, and click Columns. The column options has a section Differential Expression with the DE columns listed, e.g. the DE columns for demoTra are shown on the right. There is also a section for Conditions, which has the "Root", "Tip, and "Zone" column options. Select the desired columns ( e.g. "Root", "Tip", "RoTi"), then select Show All; you may need to click Refresh Columns at the top to display the conditions and DE columns. Note in the below figure that if Root < Tip, the p-values RoTi is negative, else it is positive. Filters has extensive filter options for condition values and p-values.

 

3. DE between conditions

3.A Methods

Go to top

Use of replicates

Biological replicates are valuable for RNA-seq to provide the baseline estimate of variability. This is incorporated into the models (typically as dispersion of the negative-binomial distribution) and makes a big difference in the result DE scores.

If replicates are provided for a condition, they will automatically be used for the applicable R packages (e.g., edgeR). All of the methods also work without replicates, but are likely to overestimate DE.

Replicates should be biological replicates, i.e. sequencing from distinct individuals having the same treatments, rather than technical replicates, i.e. re-sequencing of the same individuals. Technical replicates can be input to TCW and will be used in the same way, but do not provide a useful estimate of variability.

Built-in methods

edgeR and DESeq2 are built-in methods, meaning the code is executed from within TCW. It cannot be changed unless the TCW code is changed. Both produce FDR adjusted p-values.

R-scripts

Go to top
When the R-script option is selected, TCW performs the following:
  1. Writes the necessary information to the R environment based on the options selected.
  2. Runs the R script using the source command.
  3. Reads the results variable from the R enviroment, which should contain the p-values.
TCW writes the following variables to the R environment (the following assignments are from the above demo example):
    gc: GC values of sequences
    rowNames: sequences (row) names
    grpNames: group (column) names
    repNames: replicate names
    counts: counts of sequences
    countData <- array(counts,dim=c(109,6))
    rownames(countData) <- rowNames
    nGroup1 <- 3
    nGroup2 <- 3
	
The variables displayed with a ":" are assigned with the assign function, whereas the others are the exact assignment.
The most important variable are countData and grpNames, e.g.
>head(countData)
        [,1] [,2] [,3] [,4] [,5] [,6]
tra_001  378 1002 1649  826 1195  726
tra_002  101  206  151  109  185  129
tra_003 1859 2506 1334 1541 2307 3012
tra_004  529  919 1103  810 1427 2338
tra_005  114  192  151  232  351  317
tra_006  632  839  905  670  823 1080
> grpNames
[1] "Root" "Root" "Root" "Root" "Root" "Tip"
The R script should put the p-values in the results variables, e.g.
> head(results)
[1] 5.239394e-102  3.837063e-51  1.621835e-15  6.043034e-13  7.094994e-13
[6]  2.309597e-12
The order should correspond to the input order; if changed, then change the order in rowNames to correspond.

An example script is:

# edgeR classic method using exactTest
library(edgeR)
y <- DGEList(counts=countData,group=grpNames)
y <- calcNormFactors(y)
if (nGroup1==1 && nGroup2==1) {
  writeLines("Using dispersion")
  et <- exactTest(y, dispersion=disp)
} else {
  y <- estimateDisp(y)
  et <- exactTest(y)
} 
# Use the following for non-adjusted: results <- et$table$PValue
# 4th column is FDR
res <- topTags(et, n=nrow(et), adjust.method="BH")
results <- res$table[,4]
rowNames <- rownames(res)

3.B Options

Go to top

Pre-filter

This example uses the demo demoOl_DE (not in latest TCW package).
FilterDescriptionColumn**
Count >[n]Filter all sequences that do not have any replicates with a count > 1 RhRo1
CPM >=[n] for >[m]Filter all sequences that do not have more than m replicates with CPM >= n RhRo2
None No filtering RhRo3
** Column in table on right.

In the figure on the right, the filtered sequences have a value of 3 and sequences that recieved an NA from the DE computation have a value of 2. Note that with edgeR, the number of analyzed sequences effects the resulting DE value. Note also that the filtered sequences either recieve a value of 2 or a very high p-value in the non-filter RzRo3 column.

CPM (Counts per million) takes into account replicate size. There will be N replicates/samples to be analyzed, e.g. for the above demo, there are 6 replicates. The CPM value for a sequence j replicate i is computed (countj/Si)*1e6, where Si is sum of the counts for each replicate i.

The columns prefixed with '#' are the counts, which are used by the filters. The two columns at the end (Rhiz and Root) are the RPKM values. The DE value is set to negative if Rhiz<Root.
To view the first 50 filtered sequences that do not have all zero count values, execute runDE with the ID or database name and a -v, e.g.
runDE OlD -v
When this is run on the above demo setting CPM>=20 for >1 count gives the following output:
Using CPM filter >=20 for >1
     Rhiz1 44440
     Rhiz2 38819
     Rhiz3 34045
     Root1 113138
     Root2 229738
     Root3 115595
                   Counts                        CPM
        OlD_006    1    0    0    2    0    0  : 22.5  0.0  0.0 17.7  0.0  0.0 
        OlD_021    0    0    0    0    1    2  :  0.0  0.0  0.0  0.0  4.4 17.3 
        OlD_069    0    0    0    2    9    0  :  0.0  0.0  0.0 17.7 39.2  0.0 
        OlD_070    0    1    0    0    0    0  :  0.0 25.8  0.0  0.0  0.0  0.0 
        OlD_089    0    0    0   24    1    1  :  0.0  0.0  0.0 212.1  4.4  8.7 
        OlD_101    0    0    0    0    0    7  :  0.0  0.0  0.0  0.0  0.0 60.6 
        OlD_107    0    1    0    0    0    0  :  0.0 25.8  0.0  0.0  0.0  0.0 
   4 sequences with all zero counts
   11 filtered sequences
A 20 cutoff was used since the samples sizes are small. Use a smaller cutoff for larger sample sizes.

Fixed dispersion

If there are not biological replicates, then it is necessary to set the dispersion. If it is not set, TCW will use 0.1 by default.

False Discovery Rate

P-values should generally be adjusted for multiple testing. The supplied methods automatically provide adjusted p-values. If you don't want them adjusted, edit one of the R-scripts; instructions are in the scripts on how to obtain the non-adjusted p-values.

If you have supplied an R-script that does not provide FDR adjusted p-values, you can use this option to post-process the values. TCW uses the popular Benjamini-Hochberg False Discovery Rate adjustment. Selecting this option uses the R package multtest (which must be installed). The adjusted scores are then entered into the database column, as described above. The raw scores are not retained, although they may be seen in the R console.

3.C Execute

Go to top
The most meaningful DE comparison is between exactly two conditions, as illustrated in the example above. Groups of more than one condition may also be selected, however the conditions in each group will be treated as replicates, leading (probably) to an overestimation of biological variability and underestimation of DE.

If multi-condition testing is desired, e.g. to find transcripts reduced in condition A compared to any or all of conditions B,C,D, then the recommended process is to compute the individual pair DEs A-B, A-C, A-D, and then use filter options in viewSingleTCW to perform the screen.

Group 1 - Group 2

  • Select at least one condition from each group.
  • If you want to store the results in the database, check "Save results in P-value column". If the groups have been selected, a column name will be auto-generated which you can change. When you change the groups for a subsequent execution, uncheck-then-check the column box to auto-generated a name for the new groups.
  • Execute: The condition(s) in Group 1 will be compared with the condition(s) in group 2.

All Pairs for Group 1

  • Only check conditions in Group 1.
  • Results will be saved in the database with auto-generated column names.
  • Execute: Each condition selected in group 1 will be pairwise compared; any selections in Group 2 are ignored.
For example, if all three conditions (root, tip, zone) were selected, then it would compute:
Root  Tip   RoTi
Root  Zone  RoZo
Tip   Zone  TiZo
where the last column is the generated column name.

All Pairs from File

  • Any checked conditions in Group 1 and Group 2 are ignored; the conditions to compare are specified in the file.
  • Results will be saved in the database with the provided column names.
  • Execute: All pairs from the file will be compared with the selected method and options.
File format: Each line has 3 columns which represent Group 1, Group 2 and the column name to use, respectively. For example:
Root     Tip  RoTi
Root     Zone RoZo
Tip      Zone TiZo
Tip:Zone Root TpZo_Ro
The last row shows how to select two or more from Group 1; no spaces are allowed between the ":". Group 2 can only have one condition.

Keep the column names short but meaningful. A file similar to the above is in the projects/demoTra directory, filename traDE to allow user to try this option.

4. Differential Expression for GO categories

Go to top
If GO annotations are loaded (see User Guide), you can compute p-values for enrichment of DE in GO categories using the Execute GOseq button. For this, you first choose one of the DE comparison pairs (or all of them), and a p-value threshold to define DE of individual transcripts (or a percentage ranking).

The result is p-value columns for the GO, having the same names as the corresponding transcript DE columns, accessible through the "Basic GO Query" function of viewSingleTCW.

The meaning of a low p-value for a given GO is that the transcripts labeled by that GO have a higher percentage of DE than the transcripts as a whole. For example, if GO:000123 has 1000 labeled transcripts, of which 500 are DE (50%), while only 25% of transcripts as a whole are DE, then GO:000123 will receive a very low p-value. Note that the direction of the DE for each transcript (elevated or suppressed) is not relevant.

To use goseq therefore requires a threshold for defining DE of transcripts. The default is a typical .05 value (which assumes FDR correction was applied). Since the DE p-values still contain considerable uncertainty as absolute measures of probability, we have also included an option to defined DE in terms of ranking, i.e., top 10% (the ranking is ordered by the p-values).

5. R Directory Permissions

Go to top
When you install an R package, it can go either to a shared location or to a user-specific location. Installing to the shared location is better, as others can then use the package, however in the default installation of R the shared directories are owned by the root user and their permissions must be changed to allow ordinary users to install shared packages. This is done as follows (Unless you have sudo access, you will need an administrator to do this):
  1. Run R (by typing "R")
  2. In R, use Sys.getenv("R_HOME") to find the R installation directory (usually /usr/lib64/R)
  3. exit from R (command is "q()", no need to save workspace)
  4. ls /usr/lib64/R (or wherever the R_HOME was)
  5. The directory listing should have a library subdirectory; if not, R is not installed right
  6. Change permissions on the library directory (and everything under it) such that ordinary users can write to it. Ideally there is an appropriate user group and you can use chgrp -R <group> library and chmod -R g+w library to give this group write permissions. Otherwise, chmod -R 777 library will be necessary.
Also, you can check the R-devel installation using ls <R_HOME>/etc. There should be a file Renviron in this etc directory; if not, or if etc is not there, then R-devel was not installed.
Email Comments To: tcw@agcol.arizona.edu