Probe specific quantile normalization to adjust batch differences in MUGA and MegaMUGA genotyping data

Daniel M. Gatti Dec. 3, 2013

When measuring allele intensities on the MUGA family arrays, we often see intensity differences that vary by batch. These differenes add noise to the genotyping algorithm and should be removed. Note that we do not expect the intensity distribution of each sample to be identical and that any normalization will have to be performed on each SNP independently. Global quantile normalization of the genotyping array intensities fails because it destroys the allele clusters at each SNP. Other global methods such as mean or median subtraction do not account for probe specific differences in th eX and Y direction. Here, I use one SNP to demonstrate the method.

Load in sample data for Chr 1.

load("d:/182_DO_QTL_Mapping/AllData/allDO.Rdata")
load(url("ftp://ftp.jax.org/MUGA/muga_snps.Rdata"))
load(url("ftp://ftp.jax.org/MUGA/muga_x.Rdata"))
load(url("ftp://ftp.jax.org/MUGA/muga_y.Rdata"))
keep = which(muga_snps[, 2] == 1)
muga_snps = muga_snps[keep, ]
x = x[, keep]
y = y[, keep]
x2 = t(muga_x[keep, ])
y2 = t(muga_y[keep, ])

The problem of batch differences can be seen in the following plot in which one batch is in black and the other in red.

s = 22
par(las = 1)
plot(x[, s], y[, s], pch = 16, xlab = "X intensity", ylab = "Y intensity", main = "Before Normalization")
points(x2[, s], y2[, s], pch = 16, col = 2)

plot of chunk unnamed-chunk-2

It seems that the X intensities in the red batch are shifted right and the Y intensities are compressed.

A simple approach would be to adjust each sample by the mean batch intensity in the X and Y direction at each SNP.

x1med = apply(x, 2, mean, na.rm = T)
y1med = apply(y, 2, mean, na.rm = T)
x2med = apply(x2, 2, mean, na.rm = T)
y2med = apply(y2, 2, mean, na.rm = T)
x1adj = sweep(x, 2, x1med, "-")
y1adj = sweep(y, 2, y1med, "-")
x2adj = sweep(x2, 2, x2med, "-")
y2adj = sweep(y2, 2, y2med, "-")
s = 22
par(las = 1)
plot(x1adj[, s], y1adj[, s], pch = 16, xlab = "X intensity", ylab = "Y intensity", 
    main = "After Mean subtraction")
points(x2adj[, s], y2adj[, s], pch = 16, col = 2)

plot of chunk unnamed-chunk-4

This does not work well because the mean value in each direction is not the same for each batch. Further, the mean may be vulnerable to outliers, which will pull the intensities to one side. In order to avoid this, we can subtract the median intensity for each batch from the X and Y directions.

x1med = apply(x, 2, median, na.rm = T)
y1med = apply(y, 2, median, na.rm = T)
x2med = apply(x2, 2, median, na.rm = T)
y2med = apply(y2, 2, median, na.rm = T)
x1adj = sweep(x, 2, x1med, "-")
y1adj = sweep(y, 2, y1med, "-")
x2adj = sweep(x2, 2, x2med, "-")
y2adj = sweep(y2, 2, y2med, "-")
s = 22
par(las = 1)
plot(x1adj[, s], y1adj[, s], pch = 16, xlab = "X intensity", ylab = "Y intensity", 
    main = "After Median Subtraction")
points(x2adj[, s], y2adj[, s], pch = 16, col = 2)

plot of chunk unnamed-chunk-6

Median subtraction does not work in this example because the median Y intensity for batch 2 (in red) is lower than that of batch 1 (black).

The failure of these simple approaches led me to investigate quantile normalization between batches at each SNP. This was motivated by Staaf et.al., BMC Bioinformatics, 2008. Quantile normalization assumes that the distrubution of X (or Y) intensities in each batch will be similar. To apply it, at each probe, I create intensity quantiles for the batch with the larger number of samples. Then I create quantiles for the second batch and adjust the second batch's intensities to match the quantiles of the first. I will demonstrate using the SNP plotted above.

First, remove the top 0.5% of the X and Y intensities from batch 1 to reduce the effect of outliers. This is an arbitrary threshold. We also remove points for which X or Y is NA.

keepx = quantile(x[, s], 0.995)
keepy = quantile(y[, s], 0.995)
keep = which(x[, s] <= keepx & y[, s] <= keepy)
keep = intersect(keep, which(!is.na(x[, s])))
keep = intersect(keep, which(!is.na(y[, s])))
x1keep = x[keep, s]
y1keep = y[keep, s]
x1sort = sort(x1keep)
x1quant = cbind((x1sort - min(x1sort, na.rm = T)), x1sort)
x1quant[, 1] = x1quant[, 1]/max(x1quant[, 1], na.rm = T)
colnames(x1quant) = c("quantiles", "intensities")
y1sort = sort(y1keep)
y1quant = cbind((y1sort - min(y1sort, na.rm = T)), y1sort)
y1quant[, 1] = y1quant[, 1]/max(y1quant[, 1], na.rm = T)
colnames(y1quant) = colnames(x1quant)

x1quant and y1quant contais quantiles in column 1 and intensity values in column 2.

head(x1quant)
##      quantiles intensities
## [1,]  0.000000    0.000000
## [2,]  0.000000    0.000000
## [3,]  0.007039    0.006097
## [4,]  0.007514    0.006508
## [5,]  0.009259    0.008019
## [6,]  0.010153    0.008794

Next, create quantiles for batch 2. We may need to add some handling of outliers here. We cannot simply discard them.

x2quant = cbind(x2[, s] - min(x2[, s], na.rm = T), x2[, s], rep(0, nrow(x2)))
x2quant[, 1] = x2quant[, 1]/max(x2quant[, 1], na.rm = T)
y2quant = cbind(y2[, s] - min(y2[, s], na.rm = T), y2[, s], rep(0, nrow(y2)))
y2quant[, 1] = y2quant[, 1]/max(y2quant[, 1], na.rm = T)
colnames(x2quant) = c("quantiles", "intensities", "normalized")
colnames(y2quant) = colnames(x2quant)

x2quant and y2quant contain quantiles, intensities and normalized values in columnes 1 through 3. Column 3 contains zeros at this stage.

head(x2quant)
##             quantiles intensities normalized
## Cast           0.1773       0.248          0
## Pwk            0.5870       0.627          0
## 129S1/SvImJ    0.8346       0.856          0
## A/J            0.8724       0.891          0
## C57BL/6J       0.0627       0.142          0
## CAST/EiJ       0.2227       0.290          0

For each X (or Y) intensity value in batch 2, we will find the quantiles in batch 1 that bracket the quantile in batch 2. We will interpolate between them and calulate an adjusted intensity from the batch 1 intensities between the two bracketing quantiles.

for (i in 1:nrow(x2quant)) {
    lower = max(which(x1quant[, 1] <= x2quant[i, 1]))
    upper = lower + 1
    if (upper > nrow(x1quant)) {
        x2quant[i, 3] = x1quant[lower, 2]
    } else {
        x2quant[i, 3] = x1quant[lower, 2] + (x1quant[upper, 2] - x1quant[lower, 
            2]) * (x2quant[i, 1] - x1quant[lower, 1])/(x1quant[upper, 1] - x1quant[lower, 
            1])
    }  # else

    lower = max(which(y1quant[, 1] <= y2quant[i, 1]))
    upper = lower + 1
    if (upper > nrow(y1quant)) {
        y2quant[i, 3] = y1quant[lower, 2]
    } else {
        y2quant[i, 3] = y1quant[lower, 2] + (y1quant[upper, 2] - y1quant[lower, 
            2]) * (y2quant[i, 1] - y1quant[lower, 1])/(y1quant[upper, 1] - y1quant[lower, 
            1])
    }  # else
}  # for(i)

Quantile normalization aligns the two batches.

s = 22
par(las = 1)
plot(x[, s], y[, s], pch = 16, xlab = "X intensity", ylab = "Y intensity", main = "After Quantile Normalization")
points(x2quant[, 3], y2quant[, 3], pch = 16, col = 2)

plot of chunk unnamed-chunk-12