The University of Arizona
sTCW Differential Expression Guide
AGCoL | TCW Home | Doc Index | singleTCW Guide | DE Guide | multiTCW Guide | Tour
RunDE computes differential expression (DE) between conditions for an existing singleTCW database, as follows:
  1. DE enrichment per sequence: built-in EdgeR, DESeq2 or custom R script.
    • The built-in approaches are very simplistic and typically are NOT the best approach.
    • It is important that you study the documentation for the respective DE method to determine the best approach for your data, then customize one of the existing R-scripts or create a new one.
    • The R-scripts explains how runDE writes the data to R variables, executes the R-script, then reads the results into sTCWdb.
  2. DE enrichment in GO categories: GOseq.
This document uses the "demoTra" project as an example, see User Guide.

Contents

      
  1. Installing R and packages
  2. Example: demoTra
  3. DE between conditions
    1. Methods
    2. Options
    3. Execute
  4. DE for GO categories
  5. R directory permissions

1. Installing R and packages

Before running DE computations the relevant R packages must be installed. Note, you only need to install those packages which you intend to use.

Installing R

The following instructions are for Linux systems. See
Mac-OSX for Mac instructions. On Redhat or CentOS Linux, R can be installed simply by
sudo yum install R-devel
If you do not have sudo privileges (sudo allows a general user to run some commands as the root user), you will need an administrator to do the install. Note, even if R is already installed, you probably still need to run this command to install the "devel" libraries, which are needed for the rJava package used by runDE to interface with R.

This is all that you strictly need to do; however, unless you plan to run as root (not recommended), or will be the only one running runDE, you will also want to set up R so that one user can install packages to be used by all users (see R directory permissions). Note: you do you need R to view the DE results with viewSingleTCW.

Installing rJava

TCW uses the rJava package to interface directly to R. However, you should not need to do anything for rJava on Linux or Mac since the necessary library is included in the package. If, however, you see an error involving libjri.so when trying DE functions, you may try the following: copy the library to a system directory, e.g. on Linux:
sudo cp external/libjri.so  /usr/lib64
Also use the file function to verify that the library is actually usable on your system. A 64-bit library is supplied, which will not work on 32-bit systems, or with 32-bit Java installed on 64-bit systems.

6/29/18: I updated my R biocLite on Linux, and the rJava quite working. In R, I ran install.packages("rJava"), then modified runDE to set JPATH=/usr/lib64/R/library/rJava/jri, and it works now.

Installing packages

If you are going to supply your own R-scripts, skip this section; interfacing your R-scripts with runDE is covered in the R-scripts section.

The packages used by TCW can all be installed through the "biocLite" system of Bioconductor as follows:

  1. Run R (type "R" at the command line)
  2. At the R prompt, type: source("http://bioconductor.org/biocLite.R")
  3. At the R prompt, install the packages you need:
    PackageInstallVersion as of 13-Dec-151NoteDependencies, e.g.2
    edgeRbiocLite("edgeR")3.12.0Used for demobiocLite("locfit")
    DESeq2biocLite("DESeq2")1.10.0  sudo yum install libxml2-devel
    GOSeqbiocLite("goseq")1.22.0 sudo yum install curl-devel
    multtestbiocLite("multtest")2.26.0For FDR 
    1These packages have been tested on Linux and Mac as of 13-Dec-15 with the latest versions and R 3.2.2, but also worked on older 2014/2015 versions.
    2There are many dependencies, most are automatically resolved with the biocLite command, but some you may need to track down as they are dependent on your system. You will probably need the ones listed in the last column. Note, if you do not have sudo permissions, your system administrator will need to perform the 'yum install' commands.

2. Example: demoTra (release 28Aug18)

Go to top
From the command line, type:
    >runDE tra
Or, if you type runDE with no arguments, the sTCW graphical chooser will appear with all sTCW databases listed; select the demoTra database. Either way, this brings up the runDE interface shown at right.

Define a pairwise comparison by selecting "Stem" for Group 1 and "Root" for Group 2, as shown.

Check the box Save results in p-value column, the text box will display the column name "StRo", which can be changed.

Assuming you have installed edgeR, leave the method as is.

Execute Group 1 - Group 2. The default is to run the edgeRglm.R script. The output will be as follows:

Root has 5 replicates: Root1, Root2, Root3, Root4, Root5
Tip has 1 replicates: Tip1
Collecting count data (may take several minutes)
Using count filter >1
   4 filtered sequences
The last two lines of the above printout indicate that 4 sequences were filtered out because they did not have any replicates with more than a count of 1. When the computation is done you will see on the terminal:
Stem has 5 replicates: Stem1,Stem2,Stem3,Stem4,Stem5
Root has 5 replicates: Root1,Root2,Root3,Root4,Root5
Collecting count data (may take several minutes)
Using count filter >1
   1 filtered sequences
Assigning R variables
    gc: GC values of sequences
    rowNames: sequences (row) names
    grpNames: group (column) names
    repNames: replicate names
    counts: counts of sequences
    countData <- array(counts,dim=c(210,10))
    rm(counts)
    rownames(countData) <- rowNames
    nGroup1 <- 5
    nGroup2 <- 5

Start R-script
    source('R-scripts/edgeRglm.R')
Loading required package: limma
Using tradional glm
    results
    rowNames
R-script done

Number of DE results:
   <1e-5 <1e-4 <0.001 <0.01 <0.05
      15    21    36    65    77

Saving 210 scores
The R console remains open and you can now manually expore the data using R, if desired. However, it easiest to explore it be selecting "Close" on runDE, which closed the java program but not R. Then do the following:
>q()
Save workspace image? [y/n/c]: y
R
This saves all variables written by runDE and the scripts (in this case edgeRglm.R. By restarting R, you can now explore the data, e.g.
> library(edgeR)
> plotBCV(y)
> plotMDS(y, labels=grpNames)

View Results: To view the resulting p-values, run 'viewSingleTCW tra' from the command line, and click Columns. The column options has a section Differential Expression with the DE columns listed, e.g. the DE columns for demoTra are shown on the right. There is also a section for Conditions, which has the "Stem", "Root, and "Oleaf" column options. Select the desired columns ( e.g. "Stem", "Root", "StRo"); you may then select a filter (e.g. all sequences that have an absolute p-value < 0.05). Note in the below figure that if Stem < Root, the p-values StRo is negative, else it is positive. viewSingleTCW DE section with all pairs computed.

 

3. DE between conditions

3.A Methods

Go to top

Use of replicates

Biological replicates are valuable for RNA-seq to provide the baseline estimate of variability. This is incorporated into the models (typically as dispersion of the negative-binomial distribution) and makes a big difference in the result DE scores.

If replicates are provided for a condition, they will automatically be used for the applicable R packages (e.g., edgeR). All of the methods also work without replicates, but are likely to overestimate DE.

Replicates should be biological replicates, i.e. sequencing from distinct individuals having the same treatments, rather than technical replicates, i.e. re-sequencing of the same individuals. Technical replicates can be input to TCW and will be used in the same way, but do not provide a useful estimate of variability.

Built-in methods

Both edgeR and DESeq2 produce FDR adjusted p-values. They are built-in methods, meaning the code is executed from within runDE. However, it is better to use an R-script that has been customized for your data, as described in the next section.

R-scripts

Go to top
When the R-script option is selected, TCW performs the following:
  1. Writes the necessary information to the R environment based on the options selected.
  2. Runs the R script using the source command.
  3. Reads the results variable from the R enviroment, which should contain the p-values.
TCW writes the following variables to the R environment (the following assignments are from the above demo example):
gc: GC values of sequences
    rowNames: sequences (row) names
    grpNames: group (column) names
    repNames: replicate names
    counts: counts of sequences
    countData <- array(counts,dim=c(210,10))
    rm(counts)
    rownames(countData) <- rowNames
    nGroup1 <- 5
    nGroup2 <- 5
   
	
The variables displayed with a ":" after them are assigned with the assign function, whereas the others are the exact assignment.
The most important variables are countData and grpNames, e.g.
> head(countData)
        [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
tra_001 1017  594 1222 1209 1315  378 1002 1649  826  1195
tra_002  272  239  431  400  368  101  206  151  109   185
tra_003 3830 5185 4847 4857 5451 1859 2506 1334 1541  2307
tra_004 1707 1088 2429 2210 2334  529  919 1103  810  1427
tra_005  479  369  439  444  565  114  192  151  232   351
tra_006 1122  923 1381 1320 1482  632  839  905  670   823
> grpNames
 [1] "Stem" "Stem" "Stem" "Stem" "Stem" "Root" "Root" "Root" "Root" "Root"
The R script should put the p-values in the results variables, e.g.
> head(results)
[1] 7.322066e-10 1.615220e-09 4.467008e-09 4.721506e-09 9.586951e-08
[6] 1.314299e-07

The order should correspond to the input order; if changed, then change the order in rowNames to correspond.

An example script is:

# edgeR gln method unless no replicates
library(edgeR)
y <- DGEList(counts=countData,group=grpNames)
y <- calcNormFactors(y)
if (nGroup1==1 && nGroup2==1) {
    writeLines("Using classic with fixed dispersion")
    et <- exactTest(y, dispersion=disp)
    res <- topTags(et, n=nrow(et), adjust.method="BH")
} else {
 	writeLines("Using tradional glm")
    design <- model.matrix(~grpNames)
    y <- estimateDisp(y, design)
    fit <- glmQLFit(y,design)
    qlf <- glmQLFTest(fit,coef=2)
    res <- topTags(qlf, n=nrow(qlf), adjust.method="BH")
} 
# Use the following for non-adjusted: results <- et$table$PValue
# 4th column is FDR
results <- res$table[,4]
rowNames <- rownames(res)

3.B Options

Go to top

Pre-filter

This example uses a database with 26k sequences and 5 replicates for root and stem. The three filtering options where run with defaults(table on lower left) before input to edgeR (results in figure on lower right). The first option filtered 1713 sequences and the second option filtered 7518; with edgeR, the number of analyzed sequences effects the resulting DE value.
FilterDescriptionColumn1
Count >[n] Filter all sequences that do not have any replicates with a count > n (default 1) RoSt1
CPM2 >=[n] for >[m] Filter all sequences that do not have more than m replicates with CPM >= n (default 2 and 1) RoSt2
None No filtering RoSt3
1 Column in table on right.
2 CPM (Counts per million) is defined below.
The DE value (RoSt) is set to negative if Root<Stem.
The columns prefixed with '#' are the counts, which are used by the filters.
The two columns at the end (Root and Stem) are the RPKM values.
In the figure on the right:
  1. DE value 3.00: the filtered sequences (i.e. not input to the DE computation) have a value of 3.
  2. DE value 2.00: the sequences that recieved an NA (i.e. no result) from the DE computation have a value of 2.

CPM (Counts per million) takes into account replicate size. There will be N replicates/samples to be analyzed, e.g. there are 10 replicates for the above dataset. The CPM value for a sequence j replicate i is computed (countj/Si)*1E6, where Si is sum of the counts for each replicate i.

To view the first 50 filtered sequences that do not have 'all' zero count values, execute runDE with the ID or database name and a -v,
e.g. runDE NnR -v.

Using Count > 1, the following are the Counts of the filtered sequences.

  NnR_00003    0    0    0    1    0    0    0    0    0    0
  NnR_00016    0    0    0    0    0    0    1    0    0    0
  NnR_00032    0    0    0    0    0    1    0    0    0    0
  NnR_00090    1    0    0    1    0    1    0    0    0    0
  NnR_00095    1    1    1    1    0    1    0    0    0    0
  NnR_00112    0    0    0    1    0    0    0    0    0    0
  ...
  1713 filtered sequences
Using CPM >=2 for >1 replicates, the following are the Counts:CPM values of the filtered sequences.
 NnR_00002   6   10   13    8   21   20   16    9   13    3  :  0.2  0.4  0.4  0.3  0.6  0.8  0.5  0.3  0.4  0.1
 NnR_00003   0    0    0    1    0    0    0    0    0    0  :  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 NnR_00004  19   27   18   15   17   16   17    4   22   10  :  0.8  1.0  0.6  0.5  0.5  0.6  0.5  0.1  0.7  0.3
 NnR_00007  34   23   29   29   31   12    6    1    7    2  :  1.4  0.9  0.9  1.0  0.9  0.5  0.2  0.0  0.2  0.1
 NnR_00009  45   23   33   22   46   17   26    7   22    3  :  1.8  0.9  1.1  0.8  1.3  0.6  0.8  0.3  0.7  0.1
 NnR_00010  19   16   29   23   26    5    4    3    4    1  :  0.8  0.6  0.9  0.8  0.7  0.2  0.1  0.1  0.1  0.0
 NnR_00014  28   21   38   26   31   13   13   15   14   11  :  1.1  0.8  1.2  0.9  0.9  0.5  0.4  0.5  0.5  0.4
 NnR_00016   0    0    0    0    0    0    1    0    0    0  :  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 NnR_00017   5    4    4   10   11    0    0    2    1    0  :  0.2  0.2  0.1  0.4  0.3  0.0  0.0  0.1  0.0  0.0
 ...
 7518 filtered sequences

Fixed dispersion

If there are not biological replicates, then it is necessary to set the dispersion. If it is not set, TCW will use 0.1 by default.

False Discovery Rate

P-values should generally be adjusted for multiple testing. The supplied methods automatically provide adjusted p-values. If you don't want them adjusted, edit one of the R-scripts; instructions are in the scripts on how to obtain the non-adjusted p-values.

If you have supplied an R-script that does not provide FDR adjusted p-values, you can use this option to post-process the values. TCW uses the popular Benjamini-Hochberg False Discovery Rate adjustment. Selecting this option uses the R package multtest (which must be installed). The adjusted scores are then entered into the database column, as described above. The raw scores are not retained, although they may be seen in the R console.

3.C Execute

Go to top
The most meaningful DE comparison is between exactly two conditions, as illustrated in the example above. Groups of more than one condition may also be selected, however the conditions in each group will be treated as replicates, leading (probably) to an overestimation of biological variability and underestimation of DE.

If multi-condition testing is desired, e.g. to find transcripts reduced in condition A compared to any or all of conditions B,C,D, then the recommended process is to compute the individual pair DEs A-B, A-C, A-D, and then use filter options in viewSingleTCW to perform the screen.

There are four options to add DE values to the sTCW database, as explained in the following four sections.

Group 1 - Group 2

  • Select at least one condition from each group.
  • If you want to store the results in the database, check "Save results in P-value column". If the groups have been selected, a column name will be auto-generated which you can change. When you change the groups for a subsequent execution, uncheck-then-check the column box to auto-generated a name for the new groups.
  • Execute: The condition(s) in Group 1 will be compared with the condition(s) in group 2.

All Pairs for Group 1

  • Only check conditions in Group 1.
  • Results will be saved in the database with auto-generated column names.
  • Execute: Each condition selected in group 1 will be pairwise compared; any selections in Group 2 are ignored.
For example, if all three conditions (Stem, Root, Oleaf) were selected, then it would compute:
Stem  Root   StRo
Stem  Oleaf  StOl
Root  Oleaf  RoOl
where the last column is the generated column name.

All Pairs from File

  • Any checked conditions in Group 1 and Group 2 are ignored; the conditions to compare are specified in the file.
  • Results will be saved in the database with the provided column names.
  • Execute: All pairs from the file will be compared with the selected method and options.
File format: Each line has 3 columns which represent Group 1, Group 2 and the column name to use, respectively. For example:
Root     Stem  RoSt
Root     Oleaf RoOl
Stem     Oleaf StmLea
Root:Stem Oleaf RoSt_Ol
The last row shows how to select two or more from Group 1; no spaces are allowed between the ":". Group 2 can only have one condition.

Keep the column names short but meaningful. A file similar to the above is in the projects/demoTra directory, filename traDE to allow user to try this option.

P-values from File

The first line defines the two groups and the p-value column name, using the same format as the "Get Pairs from File". All remaining lines have a seqID followed by the p-value. For example,
#Example p-value file 
root stem RoSt
tra_001	0.14
tra_002	0.14
tra_003	0.65
The p-values are read as absolute values since runDE makes values negative if Cond1<Cont2.

4. Differential Expression for GO categories

Go to top
If GO annotations are loaded (see User Guide), you can compute p-values for enrichment of DE in GO categories using the Execute GOseq button. For this, you first choose one of the DE comparison pairs (or all of them), and a p-value threshold to define DE of individual transcripts (or a percentage ranking).

The result is p-value columns for the GO, having the same names as the corresponding transcript DE columns, accessible through the "Basic GO Query" function of viewSingleTCW.

The meaning of a low p-value for a given GO is that the transcripts labeled by that GO have a higher percentage of DE than the transcripts as a whole. For example, if GO:000123 has 1000 labeled transcripts, of which 500 are DE (50%), while only 25% of transcripts as a whole are DE, then GO:000123 will receive a very low p-value. Note that the direction of the DE for each transcript (elevated or suppressed) is not relevant.

To use GOseq therefore requires a threshold for defining DE of transcripts. The default is a typical .05 value (which assumes FDR correction was applied). Since the DE p-values still contain considerable uncertainty as absolute measures of probability, we have also included an option to defined DE in terms of ranking, i.e., top 10% (the ranking is ordered by the p-values).

5. R Directory Permissions

Go to top
When you install an R package, it can go either to a shared location or to a user-specific location. Installing to the shared location is better, as others can then use the package, however in the default installation of R the shared directories are owned by the root user and their permissions must be changed to allow ordinary users to install shared packages. This is done as follows (unless you have sudo access, you will need an administrator to do this):
  1. Run R (by typing "R")
  2. In R, use Sys.getenv("R_HOME") to find the R installation directory (usually /usr/lib64/R)
  3. Exit from R with command "q()"
  4. ls /usr/lib64/R (or wherever the R_HOME was)
  5. The directory listing should have a library subdirectory; if not, R is not installed right
  6. Change permissions on the library directory (and everything under it) such that ordinary users can write to it. Ideally there is an appropriate user group and you can use chgrp -R <group> library and chmod -R g+w library to give this group write permissions. Otherwise, chmod -R 777 library will be necessary.
Also, you can check the R-devel installation using ls <R_HOME>/etc. There should be a file Renviron in this etc directory; if not, or if etc is not there, then R-devel was not installed.
Email Comments To: tcw@agcol.arizona.edu