The University of Arizona
sTCW Differential Expression Guide
  AGCoL | TCW Home | Doc Index | singleTCW Guide | DE Guide | multiTCW Guide | Tour
RunDE computes differential expression (DE) between conditions for an existing singleTCW database, as follows:
  1. DE per sequence: R scripts for EdgeR, DESeq2.
    • It is important that you study the documentation for the respective DE method to determine the best approach for your data. You may either use one of the existing scripts, customize an existing script, or create a new one.
    • The R-scripts for DE sections explains how runDE writes the data to R variables, executes the R-script, then reads the results into sTCWdb.
  2. GO enhancement: R script for GOseq.
This document uses the "demoTra" project as an example. The User Guide explains how to built it.

Contents:

      
  1. Installing R and packages
  2. Example: demoTra
  3. DE between conditions
    1. Methods and Options
    2. Execute
  4. GO enhancement
  5. Remove
  6. R-scripts
  7. R directory permissions

1. Installing R and packages

Installing R

Installing rJava and R packages

2. Example: demoTra

Go to top
From the command line, type:
    >runDE tra
	
Or, if you type runDE with no arguments, the sTCW graphical chooser will appear with all sTCW databases listed; select the demoTra database. Either way, this brings up the runDE interface shown at right.

Define a pairwise comparison by selecting "Stem" for Group 1 and "Root" for Group 2, as shown.

Check the box Save results in p-value column; the text box will display the column name "StRo", which can be changed.

Assuming you have installed edgeR, leave the Differential Expression R-script as edgeRglm.R .

Execute Group 1 - Group 2. See Execute DE for the output.

The GO enrichment R-script will be goSeqBH.R by default, which performs BH multiple hypothesis testing correction. However, this results in 0 p-value<0.05 for the demo, hence, for the demo only, change this to goSeqNoFDR.R. Then select Execute. See Execute GO for the output.

Execute DE output

> ****************************************************************
******** Start DE execution for column: StRo *********
****************************************************************
Replicates:
      5 Stem
      5 Root
Collecting count data (may take several minutes)
Using CPM filter > 1 for >= 2
      1 filtered sequences
Assigning R variables
   gc: GC values of sequences
   rowNames: sequences (row) names
   grpNames: group (column) names
   repNames: replicate names
   counts: counts of sequences
   countData <- array(counts,dim=c(210,10))
   rm(counts)
   rownames(countData) <- rowNames
   nGroup1 <- 5
   nGroup2 <- 5

Start R-script
   source('R-scripts/edgeRglm.R')
Loading required package: limma
Using traditional glm (quasi-likelihood F-tests)
   results
   rowNames
R-script done

Number of DE results:
   <1e-5 <1e-4 <0.001 <0.01 <0.05
       6    17    32    65    97

Saving 210 scores for StRo
   Adding column to database...
                            .
Finished DE execution for StRo             0m:0s  (8Mb)
Complete all Group1-Group2 for tra
The console is in R, you may run R commands -- q() or Cntl-C when done, or perform another Execute.
The R console remains open and you can now manually explore the data using R, if desired. However, it easiest to explore it be selecting "Close" on runDE, which closed the java program but not R. Then do the following:
	>q()
	Save workspace image? [y/n/c]: y
	R
This saves all variables written by runDE and the scripts (in this case edgeRglm.R). By restarting R, you can now explore the data, e.g.
	> library(edgeR)
	> plotBCV(y)
	> plotMDS(y, labels=grpNames)

View DE Results

To view the resulting p-values, run 'viewSingleTCW tra' from the command line, and click >> Columns. The column options has a section Differential Expression with the DE columns listed, e.g. the DE columns for demoTra are shown on the right. There is also a section for Conditions, which has the "Stem", "Root, and "Oleaf" column options. Select the desired columns ( e.g. "Stem", "Root", "StRo"); you may then select a filter (e.g. all sequences that have an absolute p-value < 0.05). Note in the below figure that if Stem < Root, the p-value StRo is negative, else it is positive. viewSingleTCW DE section with all pairs computed.

Execute GO output

****************************************************************
******** 1. Start GO enrichment for column: StRo *********
****************************************************************
Assigning R variables
   208  Sequences with GOs (from 208)
   seqNames: sequence names
   seqLens:  sequence lengths
   nSeqs <- 208
   seqDEs: DE binary vector (97 seq p-value < 0.05)
   names(seqDEs) <- seqNames
   seqGOs <- vector("list",nSeqs)
   For all n: seqGOs[[n]] <- c(gonum list)
   names( seqGOs) =  seqNames                       ...

Start R-script
   source('R-scripts/goSeqNoFDR.R')
Using manually entered categories.
Calculating the p-values...
   oResults
   goNums
R-script done

Saving 6651 values to database
   Adding column to go_info table
    311 GO p-values < 0.05
Finished GO enrichment for StRo                                          0m:2s
Complete GO enrichment for tra
The console is in R, you may run R commands -- q() or Cntl-C when done, or perform another Execute.
The R console remains open and you can now manually explore the data using R, if desired. However, it easiest to explore it be selecting "Close" on runDE, which closed the java program but not R. Then do the following:
	>q()
	Save workspace image? [y/n/c]: y
	R
This saves all variables written by runDE and the scripts (in this case goSeqNoFDR.R). By restarting R, you can now explore the data.

View GO Enhancement Results

3. DE between conditions

Methods & Options

Go to top

R-script

The default R-script is edgeR.R, however, there are other scripts in the R-script, or the user can supply there own (see R-script for DE).

Use of replicates

Biological replicates are valuable for RNA-seq to provide the baseline estimate of variability. This is incorporated into the models (typically as dispersion of the negative-binomial distribution) and makes a big difference in the result DE scores.

If replicates are provided for a condition, they will automatically be used for the applicable R packages (e.g., edgeR). All of the methods also work without replicates, but are likely to overestimate DE.

Replicates should be biological replicates, i.e. sequencing from distinct individuals having the same treatments, rather than technical replicates, i.e. re-sequencing of the same individuals. Technical replicates can be input to TCW and will be used in the same way, but do not provide a useful estimate of variability.

Pre-filter

This example uses a database with 26k sequences and 5 replicates for root and stem. The three filtering options were run with defaults (table on lower left) before input to edgeR (results in figure on lower right). The first option filtered 1713 sequences and the second option filtered 7518; with edgeR, the number of analyzed sequences effects the resulting DE value.
FilterDescriptionColumn1
Count >[n] Filter all sequences that do not have any replicates with a count > n (default 1) RoSt1
CPM2 >=[n] for >[m] Filter all sequences that do not have more than m replicates with CPM >= n (default 2 and 1) RoSt2
None No filtering RoSt3
1 Column in table on right.
2 CPM (Counts per million).

The DE value (RoSt) is set to negative if Root<Stem.
The columns prefixed with '#' are the counts, which are used by the filters.
The two columns at the end (Root and Stem) are the RPKM values.
In the figure on the right:
  1. DE value 3.00: the filtered sequences have a value of 3.
  2. DE value 2.00: the sequences that received an NA from the DE computation have a value of 2.

To view the first 50 filtered sequences that do not have 'all' zero count values, execute runDE with the ID or database name and a -v,
e.g. runDE NnR -v.

Using Count > 1, the following are the Counts of the filtered sequences.

	NnR_00003    0    0    0    1    0    0    0    0    0    0
	NnR_00016    0    0    0    0    0    0    1    0    0    0
	NnR_00032    0    0    0    0    0    1    0    0    0    0
	NnR_00090    1    0    0    1    0    1    0    0    0    0
	NnR_00095    1    1    1    1    0    1    0    0    0    0
	NnR_00112    0    0    0    1    0    0    0    0    0    0
	...
	1713 filtered sequences
Using CPM >=2 for >1 replicates, the following are the Counts:CPM values of the filtered sequences.
	NnR_00002   6   10   13    8   21   20   16    9   13    3  :  0.2  0.4  0.4  0.3  0.6  0.8  0.5  0.3  0.4  0.1
	NnR_00003   0    0    0    1    0    0    0    0    0    0  :  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
	NnR_00004  19   27   18   15   17   16   17    4   22   10  :  0.8  1.0  0.6  0.5  0.5  0.6  0.5  0.1  0.7  0.3
	NnR_00007  34   23   29   29   31   12    6    1    7    2  :  1.4  0.9  0.9  1.0  0.9  0.5  0.2  0.0  0.2  0.1
	NnR_00009  45   23   33   22   46   17   26    7   22    3  :  1.8  0.9  1.1  0.8  1.3  0.6  0.8  0.3  0.7  0.1
	NnR_00010  19   16   29   23   26    5    4    3    4    1  :  0.8  0.6  0.9  0.8  0.7  0.2  0.1  0.1  0.1  0.0
	NnR_00014  28   21   38   26   31   13   13   15   14   11  :  1.1  0.8  1.2  0.9  0.9  0.5  0.4  0.5  0.5  0.4
	NnR_00016   0    0    0    0    0    0    1    0    0    0  :  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
	NnR_00017   5    4    4   10   11    0    0    2    1    0  :  0.2  0.2  0.1  0.4  0.3  0.0  0.0  0.1  0.0  0.0
	...
	7518 filtered sequences

Fixed dispersion

If there are not biological replicates, then it is necessary to set the dispersion. If it is not set, TCW will use 0.1 by default.

Execute

Go to top
The most meaningful DE comparison is between exactly two conditions, as illustrated in the example above. Groups of more than one condition may also be selected, however the conditions in each group will be treated as replicates, leading (probably) to an overestimation of biological variability and underestimation of DE.

If multi-condition testing is desired, e.g. to find sequences reduced in condition A compared to any or all of conditions B,C,D, then the recommended process is to compute the individual pair DEs A-B, A-C, A-D, and then use filter options in viewSingleTCW to perform the screen.

There are four options to add DE values to the sTCW database, as explained in the following four sections.

All Pairs for Group 1

  • Only check conditions in Group 1.
  • Results will be saved in the database with auto-generated column names.
  • Execute: Each condition selected in group 1 will be pairwise compared; any selections in Group 2 are ignored.
For example, if all three conditions (Stem, Root, Oleaf) were selected, then it would compute:
	Stem  Root   StRo
	Stem  Oleaf  StOl
	Root  Oleaf  RoOl
where the last column is the generated column name.

Group 1 - Group 2

  • Select at least one condition from each group.
  • If you want to store the results in the database, check Save results in P-value column. If the groups have been selected, a column name will be auto-generated which you can change. When you change the groups for a subsequent execution, uncheck-then-check the column box to auto-generated a name for the new groups.
  • Execute: The condition(s) in Group 1 will be compared with the condition(s) in group 2.

All Pairs from File

  • Any checked conditions in Group 1 and Group 2 are ignored; the conditions to compare are specified in the file.
  • Results will be saved in the database with the provided column names.
  • Execute: All pairs from the file will be compared with the selected method and options.
File format: Each line has 3 columns which represent Group 1, Group 2 and the column name to use, respectively. For example:
	Root     Stem  RoSt
	Root     Oleaf RoOl
	Stem     Oleaf StmLea
	Root:Stem Oleaf RoSt_Ol
The last row shows how to select two or more from Group 1; no spaces are allowed between the ":". Group 2 can only have one condition.

Keep the column names short but meaningful. A file similar to the above is in the projects/demoTra directory, filename traDE.tsv to allow user to try this option.

P-values from File

The first line defines the two groups and the p-value column name, using the same format as the "Get Pairs from File". All remaining lines have a seqID followed by the p-value. For example,
	#Example p-value file 
	root stem RoSt
	tra_001	0.14
	tra_002	0.14
	tra_003	0.65
The p-values are read as absolute values since runDE makes values negative if Cond1<Cont2.

4. GO enrichment (over-represented categories)

Go to top
If GO annotations are loaded (see User Guide), you can compute p-values for enrichment of DE in GO categories (terms/IDs).
  • R-script: Use the default goSeqBH.R or supply your own.
  • P-value columns: If one or more DE p-values have been added to the database, the <---> box will say All p-values; this can be changed to compute GO enrichment for just one p-value column.
  • P-value cutoff: A p-value of 0.05 is the default; this applies to the DE p-value so that all sequences with a p-value<0.05 will be considered DE (which assumes FDR correction was applied). Since the DE p-values still contain considerable uncertainty as absolute measures of probability, there is an option to defined DE in terms of ranking, i.e., top 10% (the ranking is ordered by the p-values).
The meaning of a low p-value for a given GO is that the sequences assigned the GO have a higher percentage of DE than the sequences as a whole. For example, if GO:000123 is assigned to 1000 sequences, of which 500 are DE (50%), while only 25% of sequences as a whole are DE, then GO:000123 will receive a very low p-value. Note that the direction of the DE for each sequence (elevated or suppressed) is not relevant.

The result is p-value columns for the GO, having the same names as the corresponding sequence DE columns, accessible through the Basic GO Query function of viewSingleTCW.

5. Remove

Go to top

5. R-scripts

Go to top

R-scripts for DE

When the Execute is selected, TCW performs the following:
  1. Writes the necessary information to the R environment based on the options selected.
  2. Runs the R script using the source command.
  3. Reads the results variable from the R environment, which should contain the p-values.
TCW writes the following variables to the R environment (the following assignments are from the above demo example):
	> gc: GC values of sequences
		rowNames: sequences (row) names
		grpNames: group (column) names
		repNames: replicate names
		counts: counts of sequences
		countData <- array(counts,dim=c(210,10))
		rm(counts)
		rownames(countData) <- rowNames
		nGroup1 <- 5
		nGroup2 <- 5
The variables displayed with a ":" after them are assigned with the assign function, whereas the others are the exact assignment.
The most important variables are countData and grpNames, e.g.
	> head(countData)
			[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
	tra_001 1017  594 1222 1209 1315  378 1002 1649  826  1195
	tra_002  272  239  431  400  368  101  206  151  109   185
	tra_003 3830 5185 4847 4857 5451 1859 2506 1334 1541  2307
	tra_004 1707 1088 2429 2210 2334  529  919 1103  810  1427
	tra_005  479  369  439  444  565  114  192  151  232   351
	tra_006 1122  923 1381 1320 1482  632  839  905  670   823
	> grpNames
	 [1] "Stem" "Stem" "Stem" "Stem" "Stem" "Root" "Root" "Root" "Root" "Root"
The R script should put the p-values in the results variables, e.g.
	> head(results)
	[1] 1.537634e-07 1.695981e-07 2.478791e-07 2.478791e-07 4.026519e-06
	[6] 4.600047e-06
The order should correspond to the input order; if changed, then change the order in rowNames to correspond.

An example script is:

	# edgeR gln method unless no replicates
	library(edgeR)
	y <- DGEList(counts=countData,group=grpNames)
	y <- calcNormFactors(y)
	if (nGroup1==1 && nGroup2==1) {
		writeLines("Using classic with fixed dispersion")
		et <- exactTest(y, dispersion=disp)
		res <- topTags(et, n=nrow(et), adjust.method="BH")
	} else {
		writeLines("Using traditional glm")
		design <- model.matrix(~grpNames)
		y <- estimateDisp(y, design)
		fit <- glmQLFit(y,design)
		qlf <- glmQLFTest(fit,coef=2)
		res <- topTags(qlf, n=nrow(qlf), adjust.method="BH")
	} 
	# Columns are:  logFC    logCPM        F       PValue          FDR
	results <- res$table$FDR
	rowNames <- rownames(res)

R-scripts for GO

This works like the R-scripts for DE, with the variables shown below in the GO enrichment output:
****************************************************************
******** 1. Start GO enrichment for column: StRo *********
****************************************************************
Assigning R variables
   208  Sequences with GOs (from 208)
   seqNames: sequence names
   seqLens:  sequence lengths
   nSeqs <- 208
   seqDEs: DE binary vector (97 seq p-value < 0.05)
   names(seqDEs) <- seqNames
   seqGOs <- vector("list",nSeqs)
   For all n: seqGOs[[n]] <- c(gonum list)
   names( seqGOs) =  seqNames                       ...

Start R-script
   source('R-scripts/goSeqBH.R')
Using manually entered categories.
Calculating the p-values...
   oResults
   goNums
R-script done

Saving 6651 values to database
    0 GO p-values < 0.05
Finished GO enrichment for StRo                                          0m:2s
See R-scripts/goSeqBH.R for an example. The p-values to be loaded must be in "oResults" and the associated GO numbers are in "goNums" (note that TCW removes the "GO:" prefix and just used the number).

seqGOs

The GOs assigned to each sequence are both the direct and indirect.

FDR

According to the GOseq Manual: "Having performed the GO analysis, you may now wish to interpret the results. If you wish to identify categories significantly enriched/unenriched below some p-value cutoff, it is necessary to first apply some kind of multiple hypothesis testing correction. For example, GO categories over enriched using a .05 FDR cutoff [Benjamini and Hochberg, 1995] are:"
> enriched.GO=GO.wall$category[p.adjust(GO.wall$over_represented_pvalue, + method="BH")<.05]
Hence, the values loaded into "oResults" for the TCW supplied GOseqBH.R:
oResults <- p.adjust(GO.wall$over_represented_pvalue, method="BH")
The p.adjust methods are: c("holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none")

demoTra: The TCW supplied demo result in 0 p-values<0.05. In order to view some p-values<0.05 results for the demo, use GOseqNoFDR.R, which has no multiple testing correction.

Naming R-scripts and Overview

Go to top
If you use your own R-script, make sure the name reflects its function. The R-script names are shown in the TCW Overview, as illustrated below:
DE (Differential Expression) computation: 
      Column       Method               Conditions
      StRo         edgeRglm.R           Stem : Root 
      StOl         edgeRglm.R           Stem : Oleaf 
      RoOl         edgeRglm.R           Root : Oleaf 

   GO enrichment computation: 
      Column       Method               Cutoff
      StRo         goSeqNoFDR.R         0.0500
      StOl         goSeqNoFDR.R         0.0500
      RoOl         goSeqNoFDR.R         0.0500

6. R Directory Permissions

Go to top
The following was written in 2014, so may be dated.

Unless you plan to run as root (not recommended), or will be the only one running runDE, you will also want to set up R so that one user can install packages to be used by all users. Note: you do NOT need R to view the DE results with viewSingleTCW.

When you install an R package, it can go either to a shared location or to a user-specific location. Installing to the shared location is better, as others can then use the package, however in the default installation of R the shared directories are owned by the root user and their permissions must be changed to allow ordinary users to install shared packages. This is done as follows (unless you have sudo access, you will need an administrator to do this):

  1. Run R (by typing "R")
  2. In R, use Sys.getenv("R_HOME") to find the R installation directory (usually /usr/lib64/R)
  3. Exit from R with command "q()"
  4. ls /usr/lib64/R (or wherever the R_HOME was)
  5. The directory listing should have a library subdirectory; if not, R is not installed right
  6. Change permissions on the library directory (and everything under it) such that ordinary users can write to it. Ideally there is an appropriate user group and you can use chgrp -R <group> library and chmod -R g+w library to give this group write permissions. Otherwise, chmod -R 777 library will be necessary.
Also, you can check the R-devel installation using ls <R_HOME>/etc. There should be a file Renviron in this etc directory; if not, or if etc is not there, then R-devel was not installed.
Email Comments To: tcw@agcol.arizona.edu