Data Mining Algorithms In R/Sequence Mining/DEGSeq

Working with DEGseq

Introduction
First we have to choose the way to organize your data. There are two ways to do that: using the MA-plot-based method with random sampling model or MA-plot-based method with technical replicates. MA-plot-based method with random sampling model 

The MA-plot is a statistical analysis tool that having been used to detect and visualize intensity-dependent ratio of microarray data. Let C1 and C2 denote the counts of reads mapped to a specific gene obtained from two samples, with Ci ~ binomial(ni, pi), I = 1,2, where ni denotes the total number of mapped reads and pi denotes the probability of a read coming from that gene. We define M = log2C1 - log2C2, and A = (log2C1 + log2C2)/2. It can be proven that under the random sampling assumption the conditional distribution of M given that A = a (a is an observation of A), follows an approximate normal distribution. For each gene on the MA-plot, we do the hypothesis test of H0: p1 = p2 versus H1: p1 ≠ p2. Then a P-value could be assigned based on the conditional normal distribution.

MA-plot-based method with technical replicates

Though it has been reported that sequencing platform has low background noise, technical replicates would still be informative for quality control and to estimate the variation due to different machines or platforms. MA-plot-based is an another method which estimates the noise level by comparing technical replicates in the data (if available). In this method, a sliding-window is first applied on the MA-plot of the two technical replicates along the A-axis to estimate the random variation corresponding to different expression levels. A smoothed estimate of the intensity-dependent noise level is done by loess regression, and converted to local standard deviations of M conditioned on A, under the assumption of normal distribution. The local standard deviations are then used to identify the difference of the gene expression between the two samples.

Multiple testing correction

For the above methods, the P-values calculated for each gene are adjusted to Q-values for multiple testing corrections. Users can set either a P-value or a false discovery rate (FDR) threshold to identify differentially expressed genes.

Once you choose your data, you can apply the R package DEGseq. There are five different methods to run the program. They are DEGexp, DEGseq, readGeneExp, samWrapper and getGeneExp.

DEGexp
Description

This function is used to identify differentially expressed genes when users already have the gene expression values (such as the number of reads mapped to each gene). Usage

Arguments

Example: > library(DEGseq) > geneExpFile <- system.file("data", "GeneExpExample5000.txt", + package = "DEGseq") > if (geneExpFile == "") { + zipFile <- system.file("data", "Rdata.zip", package = "DEGseq") + if (zipFile != "") { + unzip(zipFile, "GeneExpExample5000.txt", exdir = tempdir) + geneExpFile <- file.path(tempdir, "GeneExpExample5000.txt") + } + } > layout(matrix(c(1, 2, 3, 4, 5, 6), 3, 2, byrow = TRUE)) > par(mar = c(2, 2, 2, 2)) > DEGexp(geneExpFile1 = geneExpFile, expCol1 = c(7, 9, 12, 15, 18), groupLabel1 = "kidney", geneExpFile2 = geneExpFile, expCol2 = c(8, 10, 11, 13, 16), groupLabel2 = "liver", method = "MARS") Please wait... geneExpFile1: D:/myrpackage/DEGseq.Rcheck/DEGseq/data/GeneExpExample5000.txt gene id column in geneExpFile1: 1 expression value column(s) in geneExpFile1: 7 9 12 15 18 total number of reads uniquely mapped to genome obtained from sample1: 345504 354981 334557 geneExpFile2: D:/myrpackage/DEGseq.Rcheck/DEGseq/data/GeneExpExample5000.txt gene id column in geneExpFile2: 1 expression value column(s) in geneExpFile2: 8 10 11 13 16 total number of reads uniquely mapped to genome obtained from sample2: 274430 274486 264999 method to identify differentially expressed genes: MARS pValue threshold: 0.001 output directory: none The outputDir is not specified! Please wait … Identifying differentially expressed genes ... Please wait patiently ... output … The results can be observed in directory: none [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/26/1/136 From Wang et. Al (2009)]

DEGseq
Description

This function is used to identify differentially expressed genes from RNA-seq data. It takes uniquely mapped reads from RNA-seq data for the two samples with a gene annotation as input. So users should map the reads (obtained from sequencing libraries of the samples) to the corresponding genome in advance.

Usage

DEGseq ( mapResultBatch1, mapResultBatch2, fileFormat="bed", readLength=32, strandInfo=FALSE, refFlat, groupLabel1="group1", groupLabel2="group2", method=c("LRT", "CTR", "FET", "MARS", "MATR", "FC"), pValue=1e-3, zScore=4, qValue=1e-3, foldChange=4, thresholdKind=1, outputDir="none", normalMethod=c("none", "loess", "median"), depthKind=1, replicate1="none", replicate2="none", replicateLabel1="replicate1", replicateLabel2="replicate2" )

Arguments

Example:

> kidneyR1L1 <- system.file("data", "kidneyChr21.bed.txt", package = "DEGseq") > liverR1L2 <- system.file("data", "liverChr21.bed.txt", package = "DEGseq") > refFlat <- system.file("data", "refFlatChr21.txt", package = "DEGseq") > mapResultBatch1 <- c(kidneyR1L1) > mapResultBatch2 <- c(liverR1L2) > outputDir <- file.path(tempdir, "DEGseqExample") > DEGseq(mapResultBatch1, mapResultBatch2, fileFormat = "bed",refFlat = refFlat, outputDir = outputDir, method = "LRT")

Please wait...

mapResultBatch1: D:/myrpackage/DEGseq.Rcheck/DEGseq/data/kidneyChr21.bed.txt mapResultBatch2: D:/myrpackage/DEGseq.Rcheck/DEGseq/data/liverChr21.bed.txt file format: bed refFlat: D:/myrpackage/DEGseq.Rcheck/DEGseq/data/refFlatChr21.txt Ignore the strand information when count the reads mapped to genes! Count the number of reads mapped to each gene ... This will take several minutes, please wait patiently!

Please wait...

SampleFiles: D:/myrpackage/DEGseq.Rcheck/DEGseq/data/kidneyChr21.bed.txt Count the number of reads mapped to each gene. This will take several minutes.

Please wait …

total 259 unique genes processed 0 reads (kidneyChr21.bed.txt) processed 10000 reads (kidneyChr21.bed.txt) processed 20000 reads (kidneyChr21.bed.txt) processed 30000 reads (kidneyChr21.bed.txt) processed 34304 reads (kidneyChr21.bed.txt) total used 0.328000 seconds!

Please wait...

SampleFiles: D:/myrpackage/DEGseq.Rcheck/DEGseq/data/liverChr21.bed.txt Count the number of reads mapped to each gene. This will take several minutes.

Please wait …

total 259 unique genes processed 0 reads (liverChr21.bed.txt) processed 10000 reads (liverChr21.bed.txt) processed 20000 reads (liverChr21.bed.txt) processed 30000 reads (liverChr21.bed.txt) processed 30729 reads (liverChr21.bed.txt) total used 0.297000 seconds!

Please wait...

geneExpFile1: C:\DOCUME~1\wanglk\LOCALS~1\Temp\RtmpUvf7Fw/DEGseqExample/group1.exp gene id column in geneExpFile1: 1 expression value column(s) in geneExpFile1: 2 total number of reads uniquely mapped to genome obtained from sample1: 34304 geneExpFile2: C:\DOCUME~1\wanglk\LOCALS~1\Temp\RtmpUvf7Fw/DEGseqExample/group2.exp gene id column in geneExpFile2: 1 expression value column(s) in geneExpFile2: 2 total number of reads uniquely mapped to genome obtained from sample2: 30729 method to identify differentially expressed genes: LRT pValue threshold: 0.001 output directory: C:\DOCUME~1\wanglk\LOCALS~1\Temp\RtmpUvf7Fw/DEGseqExample

Please wait …

Identifying differentially expressed genes ... Please wait patiently ... output ...

Done ... The results can be observed in directory: C:\DOCUME~1\wanglk\LOCALS~1\Temp\RtmpUvf7Fw/DEGs

getGeneExp
Description

This function is used to count the number of reads and calculate the RPKM for each gene. It takes uniquely mapped reads from RNA-seq data for a sample with a gene annotation file as input. So users should map the reads (obtained from sequencing library of the sample) to the corresponding genome in advance.

Usage

getGeneExp( mapResultBatch, fileFormat="bed", readLength=32, strandInfo=FALSE, refFlat, output=paste(mapResultBatch[1],".exp",sep=""), min.overlapPercent= 1 )

Arguments

Example:

> kidneyR1L1 <- system.file("data", "kidneyChr21.bed.txt", package = "DEGseq") > refFlat <- system.file("data", "refFlatChr21.txt", package = "DEGseq") > mapResultBatch <- c(kidneyR1L1) > output <- file.path(tempdir, "kidneyChr21.bed.exp") > getGeneExp(mapResultBatch, refFlat = refFlat, output = output) Please wait... SampleFiles: D:/myrpackage/DEGseq.Rcheck/DEGseq/data/kidneyChr21.bed.txt Count the number of reads mapped to each gene. This will take several minutes. Please wait ... total 259 unique genes processed 0 reads (kidneyChr21.bed.txt) processed 10000 reads (kidneyChr21.bed.txt) processed 20000 reads (kidneyChr21.bed.txt) processed 30000 reads (kidneyChr21.bed.txt) processed 34304 reads (kidneyChr21.bed.txt) total used 0.328000 seconds! > exp <- readGeneExp(file = output, geneCol = 1, valCol = c(2, + 3), label = c("raw count", "RPKM")) > exp[30:32, ] raw 		count 		RPKM C21orf131 	0 		0.000 C21orf15 	0 		0.000 C21orf2 	51 		665.789

readGeneExp
Description

This method is used to read gene expression values from a file to a matrix in R workspace. So that the matrix can be used as input of other packages, such as edgeR. The input of the method is a file that contains gene expression values.

Usage

readGeneExp(file, geneCol=1, valCol=2, label = NULL, header=TRUE, sep="")

Arguments

Argument Description file file containing gene expression values. GeneCol gene id column in file. valCol expression value columns to be read in the file. label label for the columns. Header a logical value indicating whether the file contains the names of the variables as its first line. sep the field separator character. If sep = "" (the default for read.table) the separator is white space, that is one or more spaces, tabs, newlines or carriage returns.

Example:

> geneExpFile <- system.file("data", "GeneExpExample1000.txt", + package = "DEGseq") > exp <- readGeneExp(file = geneExpFile, geneCol = 1, valCol = c(7, + 9, 12, 15, 18, 8, 10, 11, 13, 16)) > exp[30:32, ]

R1L1Kidney R1L3Kidney R1L7Kidney R2L2Kidney R2L6Kidney ENSG00000188976 73 77 68 70 82 ENSG00000187961 15 15 13 12 15 ENSG00000187583 1 1 3 0 3

R1L2Liver R1L4Liver R1L6Liver R1L8Liver R2L3Liver ENSG00000188976 34 56 45 55 42 ENSG00000187961 8 13 11 12 20 ENSG00000187583 0 1 0 0 2

samWrapper
Description

This function is a wrapper of the functions in samr. It is used to identify differentially expressed genes for two sets of samples with multiple replicates or two groups of samples from different individuals (e.g. disease samples vs. control samples).

Usage

samWrapper(    geneExpFile1, geneCol1=1, expCol1=2, measure1=rep(1, length(expCol1) ), geneExpFile2, geneCol2=1, expCol2=2, measure2=rep(2, length(expCol2) ), header=TRUE, sep="", paired=FALSE, s0=NULL, s0.perc=NULL, nperms=100, testStatistic= c("standard","wilcoxon"), max.qValue=1e-3,  in.foldchange=logged2=FALSE, output )

Arguments

Argument Description geneExpFile1 file containing gene expression values for group1. geneCol1 gene id column in geneExpFile1. expCol1 expression value columns in geneExpFile1. See the example. measure1 numeric vector of outcome measurements for group1. like c(1,1,1...) when paired=FALSE, or like c(-1,-2,-3,...) when paired=TRUE. geneExpFile2 file containing gene expression values for group2. geneCol2 gene id column in geneExpFile2. ExpCol2 expression value columns in geneExpFile2. See the example. Measure2 numeric vector of outcome measurements for group2. like c(2,2,2...) when paired=FALSE, or like c(1,2,3,...) when paired=TRUE. header a logical value indicating whether geneExpFile1 and geneExpFile2 contain the names of the variables as its first line. sep the field separator character. If sep = "" (the default for read.table) the separator is white space, that is one or more spaces, tabs, newlines or carriage returns. paired a logical value indicating whether the samples are paired. s0 exchangeability factor for denominator of test statistic; Default is automatic choice. s0.perc percentile of standard deviation values to use for s0; default is automatic  choice; -1 means s0=0 (different from s0.perc=0, meaning s0=zeroeth percentile of standard deviation values= min of sd values. Nperms number of permutations used to estimate false discovery rates. TestStatistic test statistic to use in two class unpaired case. Either "standard" (t-statistic) or "wilcoxon" (Two-sample wilcoxon or Mann-Whitney test). recommend "standard". max.qValue the max qValue desired; shoube be <1. min.foldchange the minimum fold change desired; should be >1. default is zero, meaning no fold change criterion is applied. logged2 a logical value indicating whether the expression values are logged2. output the output file.

Example

> geneExpFile <- system.file("data", "GeneExpExample1000.txt", + package = "DEGseq") > set.seed(100) > geneExpFile1 <- geneExpFile > geneExpFile2 <- geneExpFile > output <- file.path(tempdir, "samWrapperOut.txt") > expCol1 = c(7, 9, 12, 15, 18) > expCol2 = c(8, 10, 11, 13, 16) > measure1 = c(-1, -2, -3, -4, -5) > measure2 = c(1, 2, 3, 4, 5) > samWrapper(geneExpFile1 = geneExpFile1, geneCol1 = 1, expCol1 = expCol1, measure1 = measure1, geneExpFile2 = geneExpFile2, geneCol2 = 1, expCol2 = expCol2, measure2 = measure2, nperms = 100, min.foldchange = 2, max.qValue = 1e-04, output = output, paired = TRUE)