# T-statistic

Let $\displaystyle{ \scriptstyle\hat\beta }$ be an estimator of parameter β in some statistical model. Then a t-statistic for this parameter is any quantity of the form

$\displaystyle{ t_{\hat{\beta}} = \frac{\hat\beta - \beta_0}{\mathrm{s.e.}(\hat\beta)}, }$

where β0 is a non-random, known constant, and $\displaystyle{ \scriptstyle s.e.(\hat\beta) }$ is the standard error of the estimator $\displaystyle{ \scriptstyle\hat\beta }$.

## Two sample test assuming equal variance

The t statistic (df = $\displaystyle{ n_1 + n_2 - 2 }$) to test whether the means are different can be calculated as follows:

$\displaystyle{ t = \frac{\bar {X}_1 - \bar{X}_2}{s_{X_1X_2} \cdot \sqrt{\frac{1}{n_1}+\frac{1}{n_2}}} }$

where

$\displaystyle{ s_{X_1X_2} = \sqrt{\frac{(n_1-1)s_{X_1}^2+(n_2-1)s_{X_2}^2}{n_1+n_2-2}}. }$

$\displaystyle{ s_{X_1X_2} }$ is an estimator of the common/pooled standard deviation of the two samples. The square-root of a pooled variance estimator is known as a pooled standard deviation.

• Pooled variance from Wikipedia
• The pooled sample variance is an unbiased estimator of the common variance if Xi and Yi are following the normal distribution.
• (From minitab) The pooled standard deviation is the average spread of all data points about their group mean (not the overall mean). It is a weighted average of each group's standard deviation. The weighting gives larger groups a proportionally greater effect on the overall estimate.
• Type I error rates in two-sample t-test by simulation

## Two sample test assuming unequal variance

The t statistic (Behrens-Welch test statistic) to test whether the population means are different is calculated as:

$\displaystyle{ t = {\overline{X}_1 - \overline{X}_2 \over s_{\overline{X}_1 - \overline{X}_2}} }$

where

$\displaystyle{ s_{\overline{X}_1 - \overline{X}_2} = \sqrt{{s_1^2 \over n_1} + {s_2^2 \over n_2}}. }$

Here s2 is the unbiased estimator of the variance of the two samples.

The degrees of freedom is evaluated using the Satterthwaite's approximation

$\displaystyle{ df = { ({s_1^2 \over n_1} + {s_2^2 \over n_2})^2 \over {({s_1^2 \over n_1})^2 \over n_1-1} + {({s_2^2 \over n_2})^2 \over n_2-1} }. }$

## Z-value/Z-score

If the population parameters are known, then rather than computing the t-statistic, one can compute the z-score.

## Nonparametric test: Wilcoxon rank sum test

Sensitive to differences in location

Wilcoxon-Mann-Whitney test and a small sample size. if performing a WMW test comparing S1=(1,2) and S2=(100,300) it wouldn’t differ of comparing S1=(1,2) and S2=(4,5). Therefore when having a small sample size this is a great loss of information.

## Nonparametric test: Kolmogorov-Smirnov test

Sensitive to difference in shape and location of the distribution functions of two groups

Kolmogorov–Smirnov test. The Kolmogorov–Smirnov statistic for a given cumulative distribution function F(x) is

$\displaystyle{ D_n= \sup_x |F_n(x)-F(x)| }$

where supx is the supremum of the set of distances.

Gene set enrichment analysis The Enrichment score ES is is the maximum deviation from zero of P(hit)-P(miss). For a randomly distributed S, ES will be relatively small, but if it is concentrated at the top or bottom of the list, or otherwise nonrandomly distributed, then ES(S) will be correspondingly high. When p=0, ES reduces to the standard Kolmogorov–Smirnov statistic. c; when p=1, we are weighting the genes in S by their correlation with C normalized by the sum of the correlations over all of the genes in S.

$\displaystyle{ P_{hit}(S, i) = \sum \frac{|r_j|^p}{N_R}, P_{miss}(S, i)= \sum \frac{1}{(N-N_H)} }$ where $\displaystyle{ N_R = \sum_{g_j \in S} |r_j|^p }$.

## Limma: Empirical Bayes method

• Some Bioconductor packages: limma, RnBeads, IMA, minfi packages.
• The moderated T-statistics used in Limma is defined on Limma's user guide.
• Diagram of usage ?makeContrasts, ?contrasts.fit, ?eBayes
           lmFit        contrasts.fit           eBayes       topTable
x ------> fit ------------------> fit2  -----> fit2  --------->
^                       ^
|                       |
model.matrix   |    makeContrasts      |
class ---------> design ----------> contrasts
• Moderated t-test mod.t.test() from MKmisc package
• Examples of contrasts (search contrasts.fit and/or model.matrix from the user guide)
# Ex 1 (Single channel design):
design <- model.matrix(~ 0+factor(c(1,1,1,2,2,3,3,3))) # number of samples x number of groups
colnames(design) <- c("group1", "group2", "group3")
fit <- lmFit(eset, design)
contrast.matrix <- makeContrasts(group2-group1, group3-group2, group3-group1,
levels=design)        # number of groups x number of contrasts
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit2)
topTable(fit2, coef=1, sort = "none", n = Inf, adjust="BH")$adj.P.Val # Ex 2 (Common reference design): targets <- readTargets("runxtargets.txt") design <- modelMatrix(targets, ref="EGFP") contrast.matrix <- makeContrasts(AML1,CBFb,AML1.CBFb,AML1.CBFb-AML1,AML1.CBFb-CBFb, levels=design) fit <- lmFit(MA, design) fit2 <- contrasts.fit(fit, contrasts.matrix) fit2 <- eBayes(fit2) # Ex 3 (Direct two-color design): design <- modelMatrix(targets, ref="CD4") contrast.matrix <- cbind("CD8-CD4"=c(1,0),"DN-CD4"=c(0,1),"CD8-DN"=c(1,-1)) rownames(contrast.matrix) <- colnames(design) fit <- lmFit(eset, design) fit2 <- contrasts.fit(fit, contrast.matrix) # Ex 4 (Single channel + Two groups): fit <- lmFit(eset, design) cont.matrix <- makeContrasts(MUvsWT=MU-WT, levels=design) fit2 <- contrasts.fit(fit, cont.matrix) fit2 <- eBayes(fit2) # Ex 5 (Single channel + Several groups): f <- factor(targets$Target, levels=c("RNA1","RNA2","RNA3"))
design <- model.matrix(~0+f)
colnames(design) <- c("RNA1","RNA2","RNA3")
fit <- lmFit(eset, design)
contrast.matrix <- makeContrasts(RNA2-RNA1, RNA3-RNA2, RNA3-RNA1,
levels=design)
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit2)

# Ex 6 (Single channel + Interaction models 2x2 Factorial Designs) :
cont.matrix <- makeContrasts(
SvsUinWT=WT.S-WT.U,
SvsUinMu=Mu.S-Mu.U,
Diff=(Mu.S-Mu.U)-(WT.S-WT.U),
levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2)

• Example from user guide 17.3 (Mammary progenitor cell populations)
setwd("~/Downloads/IlluminaCaseStudy")
url <- c("http://bioinf.wehi.edu.au/marray/IlluminaCaseStudy/probe%20profile.txt.gz",
"http://bioinf.wehi.edu.au/marray/IlluminaCaseStudy/control%20probe%20profile.txt.gz",
"http://bioinf.wehi.edu.au/marray/IlluminaCaseStudy/Targets.txt")
for(i in url)  system(paste("wget ", i))
system("gunzip probe%20profile.txt.gz")
system("gunzip control%20probe%20profile.txt.gz")

source("http://www.bioconductor.org/biocLite.R")
biocLite("limma")
biocLite("statmod")
library(limma)
targets

x <- read.ilmn(files="probe profile.txt",ctrlfiles="control probe profile.txt",
other.columns="Detection")
options(digits=3)
head(x$E) boxplot(log2(x$E),range=0,ylab="log2 intensity")
y <- neqc(x)
dim(y)
expressed <- rowSums(y$other$Detection < 0.05) >= 3
y <- y[expressed,]
dim(y) # 24691 12
plotMDS(y,labels=targets$CellType) ct <- factor(targets$CellType)
design <- model.matrix(~0+ct)
colnames(design) <- levels(ct)
dupcor <- duplicateCorrelation(y,design,block=targets$Donor) # need statmod dupcor$consensus.correlation

fit <- lmFit(y, design, block=targets$Donor, correlation=dupcor$consensus.correlation)
contrasts <- makeContrasts(ML-MS, LP-MS, ML-LP, levels=design)
fit2 <- contrasts.fit(fit, contrasts)
fit2 <- eBayes(fit2, trend=TRUE)
summary(decideTests(fit2, method="global"))
topTable(fit2, coef=1) # Top ten differentially expressed probes between ML and MS
#                 SYMBOL TargetID logFC AveExpr     t  P.Value adj.P.Val    B
# ILMN_1766707     IL17B     <NA> -4.19    5.94 -29.0 2.51e-12  5.19e-08 18.1
# ILMN_1706051      PLD5     <NA> -4.00    5.67 -27.8 4.20e-12  5.19e-08 17.7
# ...
tT <- topTable(fit2, coef=1, number = Inf)
dim(tT)
# [1] 24691     8

• Three groups comparison (What is the difference of A vs Other AND A vs (B+C)/2?). Contrasts comparing one factor to multiple others
library(limma)
set.seed(1234)
n <- 100
testexpr <- matrix(rnorm(n * 10, 5, 1), nc= 10)
testexpr[, 6:7] <- testexpr[, 6:7] + 7  # mean is 12

design1 <- model.matrix(~ 0 + as.factor(c(rep(1,5),2,2,3,3,3)))
design2 <- matrix(c(rep(1,5),rep(0,5),rep(0,5),rep(1,5)),ncol=2)
colnames(design1) <- LETTERS[1:3]
colnames(design2) <- c("A", "Other")

fit1 <- lmFit(testexpr,design1)
contrasts.matrix1 <- makeContrasts("AvsOther"=A-(B+C)/2, levels = design1)
fit1 <- eBayes(contrasts.fit(fit1,contrasts=contrasts.matrix1))

fit2 <- lmFit(testexpr,design2)
contrasts.matrix2 <- makeContrasts("AvsOther"=A-Other, levels = design2)
fit2 <- eBayes(contrasts.fit(fit2,contrasts=contrasts.matrix2))

t1 <- topTable(fit1,coef=1, number = Inf)
t2 <- topTable(fit2,coef=1, number = Inf)

#        logFC  AveExpr         t      P.Value    adj.P.Val         B
# 92 -5.293932 5.810926 -8.200138 1.147084e-15 1.147084e-13 26.335702
# 81 -5.045682 5.949507 -7.815607 2.009706e-14 1.004853e-12 23.334600
# 37 -4.720906 6.182821 -7.312539 7.186627e-13 2.395542e-11 19.625964
# 27 -2.127055 6.854324 -3.294744 1.034742e-03 1.055859e-03 -1.141991
# 86 -1.938148 7.153142 -3.002133 2.776390e-03 2.804434e-03 -2.039869
# 75 -1.876490 6.516004 -2.906626 3.768951e-03 3.768951e-03 -2.314869
#        logFC  AveExpr          t    P.Value adj.P.Val         B
# 92 -4.518551 5.810926 -2.5022436 0.01253944 0.2367295 -4.587080
# 81 -4.500503 5.949507 -2.4922492 0.01289503 0.2367295 -4.587156
# 37 -4.111158 6.182821 -2.2766414 0.02307100 0.2367295 -4.588728
# 27 -1.496546 6.854324 -0.8287440 0.40749644 0.4158127 -4.595601
# 86 -1.341607 7.153142 -0.7429435 0.45773401 0.4623576 -4.595807
# 75 -1.171366 6.516004 -0.6486690 0.51673851 0.5167385 -4.596008

var(as.numeric(testexpr[, 6:10]))
# [1] 12.38074
var(as.numeric(testexpr[, 6:7]))
# [1] 0.8501378
var(as.numeric(testexpr[, 8:10]))
# [1] 0.9640699

As we can see the p-values returned from the first contrast are very small (large mean but small variance) but the p-values returned from the 2nd contrast are large (still large mean but very large variance). The variance from the "Other" group can be calculated from a mixture distribution ( pdf = .4 N(12, 1) + .6 N(5, 1), VarY = E(Y^2) - (EY)^2 where E(Y^2) = .4 (VarX1 + (EX1)^2) + .6 (VarX2 + (EX2)^2) = 73.6 and EY = .4 * 12 + .6 * 5 = 7.8; so VarY = 73.6 - 7.8^2 = 12.76).
• Correct assumptions of using limma moderated t-test and the paper Should We Abandon the t-Test in the Analysis of Gene Expression Microarray Data: A Comparison of Variance Modeling Strategies.
• Evaluation: statistical power (figure 3, 4, 5), false-positive rate (table 2), execution time and ease of use (table 3)
• RVM inflates the expected number of false-positives when sample size is small. On the other hand the, RVM is very close to Limma from either their formulas (p3 of the supporting info) or the Hierarchical clustering (figure 2) of two examples.
• Slides
• Use Limma to run ordinary T tests
# where 'fit' is the output from lmFit() or contrasts.fit().
unmod.t <- fit$coefficients/fit$stdev.unscaled/fit$sigma pval <- 2*pt(-abs(unmod.t), fit$df.residual)

# Following the above example
t.test(testexpr[1, 1:5], testexpr[1, 6:10], var.equal = T)
# Two Sample t-test
#
# data:  testexpr[1, 1:5] and testexpr[1, 6:10]
# t = -1.2404, df = 8, p-value = 0.25
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
#   -7.987791  2.400082
# sample estimates:
#   mean of x mean of y
# 4.577183  7.371037
fit2$coefficients[1] / (fit2$stdev.unscaled[1] * fit2$sigma[1]) # Ordinary t-statistic # [1] -1.240416 fit2$coefficients[1] / (fit2$stdev.unscaled[1] * sqrt(fit2$s2.post[1])) # moderated t-statistic
# [1] -1.547156
topTable(fit2,coef=1, sort.by = "none")[1,]
#       logFC  AveExpr         t   P.Value adj.P.Val         B
# 1 -2.793855 5.974110 -1.547156 0.1222210 0.2367295 -4.592992

# Square root of the pooled variance
fit2$sigma[1] # [1] 3.561284 (((5-1)*var(testexpr[1, 1:5]) + (5-1)*var(testexpr[1, 6:10]))/(5+5-2)) %>% sqrt() # [1] 3.561284  • Comparison of ordinary T-statistic, RVM T-statistic and Limma/eBayes moderated T-statistic. Test statistic for gene g Ordinary T-test $\displaystyle{ \frac{\overline{y}_{g1} - \overline{y}_{g2}}{S_g^{Pooled}/\sqrt{1/n_1 + 1/n_2}} }$ $\displaystyle{ (S_g^{Pooled})^2 = \frac{(n_1-1)S_{g1}^2 + (n_2-1)S_{g2}^2}{n1+n2-2} }$ RVM $\displaystyle{ \frac{\overline{y}_{g1} - \overline{y}_{g2}}{S_g^{RVM}/\sqrt{1/n_1 + 1/n_2}} }$ $\displaystyle{ (S_g^{RVM})^2 = \frac{(n_1+n_2-2)S_{g}^2 + 2*a*(a*b)^{-1}}{n1+n2-2+2*a} }$ Limma $\displaystyle{ \frac{\overline{y}_{g1} - \overline{y}_{g2}}{S_g^{Limma}/\sqrt{1/n_1 + 1/n_2}} }$ $\displaystyle{ (S_g^{Limma})^2 = \frac{d_0 S_0^2 + d_g S_g^2}{d_0 + d_g} }$ • In Limma, • $\displaystyle{ \sigma_g^2 }$ assumes an inverse Chi-square distribution with mean $\displaystyle{ S_0^2 }$ and $\displaystyle{ d_0 }$ degrees of freedom • $\displaystyle{ d_0 }$ (fit$df.prior) and $\displaystyle{ d_g }$ are, respectively, prior and residual/empirical degrees of freedom.
• $\displaystyle{ S_0^2 }$ (fit$s2.prior) is the prior distribution and $\displaystyle{ S_g^2 }$ is the pooled variance. • $\displaystyle{ (S_g^{Limma})^2 }$ can be obtained from fit$s2.post.
• Empirical Bayes estimation of normal means, accounting for uncertainty in estimated standard errors Lu 2019

# ANOVA

## Post-hoc test

Determine which levels have significantly different means.

## TukeyHSD (Honestly Significant Difference), diagnostic checking

https://datascienceplus.com/one-way-anova-in-r/, Tukey HSD for Post-Hoc Analysis (detailed explanation including the type 1 error problem in multiple testings)

• TukeyHSD for the pairwise tests
• You can’t just perform a series of t tests, because that would greatly increase your likelihood of a Type I error.
• compute something analogous to a t score for each pair of means, but you don’t compare it to the Student’s t distribution. Instead, you use a new distribution called the studentized range (from Wikipedia) or q distribution.
• Suppose that we take a sample of size n from each of k populations with the same normal distribution N(μσ) and suppose that $\displaystyle{ \bar{y} }$min is the smallest of these sample means and $\displaystyle{ \bar{y} }$max is the largest of these sample means, and suppose S2 is the pooled sample variance from these samples. Then the following random variable has a Studentized range distribution: $\displaystyle{ q = \frac{\overline{y}_{\max} - \overline{y}_{\min}}{S/\sqrt{n}} }$
• One-Way ANOVA Test in R from sthda.com.
res.aov <- aov(weight ~ group, data = PlantGrowth)
summary(res.aov)
#              Df Sum Sq Mean Sq F value Pr(>F)
#  group        2  3.766  1.8832   4.846 0.0159 *
#  Residuals   27 10.492  0.3886
TukeyHSD(res.aov)
# Tukey multiple comparisons of means
# 95% family-wise confidence level
#
# Fit: aov(formula = weight ~ group, data = PlantGrowth)
#
# $group # diff lwr upr p adj # trt1-ctrl -0.371 -1.0622161 0.3202161 0.3908711 # trt2-ctrl 0.494 -0.1972161 1.1852161 0.1979960 # trt2-trt1 0.865 0.1737839 1.5562161 0.0120064 # Extra: # Check your data my_data <- PlantGrowth levels(my_data$group)
set.seed(1234)
dplyr::sample_n(my_data, 10)

# compute the summary statistics by group
library(dplyr)
group_by(my_data, group) %>%
summarise(
count = n(),
mean = mean(weight, na.rm = TRUE),
sd = sd(weight, na.rm = TRUE)
)

• Or we can use Benjamini-Hochberg method for p-value adjustment in pairwise comparisons
library(multcomp)
pairwise.t.test(my_data$weight, my_data$group,
#      ctrl  trt1
# trt1 0.194 -
# trt2 0.132 0.013
#
# P value adjustment method: BH

• Shapiro-Wilk test for normality
# Extract the residuals
aov_residuals <- residuals(object = res.aov )
# Run Shapiro-Wilk test
shapiro.test(x = aov_residuals )

• Bartlett test and Levene test for the homogeneity of variances across the groups