
**Large-Scale Simultaneous Hypothesis Testing:**
**The Choice of a Null Hypothesis**
**Bradley Efron**
**Abstract**
Current scientific techniques in genomics and image processing routinely produce hypothesis testing problems with hundreds or thousands of cases to consider simultaneously. This poses new difficulties for the statistician, but also opens new opportunities. In particular it allows empirical estimation of an appropriate null hypothesis. The empirical null may be considerably more dispersed than the usual theoretical null distribution that would be used for any one case considered separately. An empirical Bayes analysis plan for this situation is developed, using a local version of the false discovery rate to examine the inference issues. Two genomics problems are used as examples to show the importance of correctly choosing the null hypothesis.

**Key Words**: local false discovery rate, empirical Bayes, microarray analysis, empirical null

**1. Introduction**
Until recently “simultaneous inference” meant considering two or five or perhaps ten hypothesis tests at the same time, as in Miller’s classic 1981 text. Rapid progress in technology, particularly in genomics and imaging, has vastly upped the ante for simultaneous inference problems: now 500 or 5000 or even 50,000 tests may need to be evaluated at once, raising new problems for the statistician, but also opening new analytic opportunities. This paper concerns the choice of an appropriate null hypothesis in large-scale testing situations, and how this choice affects well-known inference methods such as the false discovery rate.

Simultaneous hypothesis testing begins with a collection of null hypotheses

$$H_1, H_2, \ldots, H_N, \tag{1.1}$$

corresponding test statistics, possibly not independent,

$$Y_1, Y_2, \ldots, Y_N, \tag{1.2}$$

and their $p$-values, $P_1, P_2, \ldots, P_N$, with $P_i$ measuring how strongly $y_i$, the observed value of $Y_i$, contradicts $H_i$, for instance $P_i = \mathrm{prob}\{|Y_i| > |y_i|\}$. “Large-scale” means that $N$ is a big number, say at least $N > 100$.

It is convenient though not necessary to work with $z$-*values* instead of the $Y_i$’s or $P_i$’s,

$$z_i = \Phi^{-1}(P_i), \qquad i = 1, 2, \ldots, N, \tag{1.3}$$

$\Phi$ indicating the standard normal cumulative distribution function (cdf), $\Phi^{-1}(.95) = 1.645$ etc. If $H_i$ is exactly true then $z_i$ will have a standard normal distribution,

$$z_i \mid H_i \sim N(0, 1). \tag{1.4}$$

We will call (1.4) *the theoretical null hypothesis*.

Our motivating example concerns an HIV study of 1391 patients, investigating which of
6 Protease Inhibitor (“PI”) drugs cause mutations at which of 74 sites on the viral genome.

Each patient provided a vector of predictors

$$\mathbf{x} = (x_1, x_2, \ldots, x_6), \tag{1.5}$$

$x_j = 1$ or 0 indicating whether or not the patient used $PI_j$, $1 \le j \le 6$, and a vector of responses

$$\mathbf{v} = (v_1, v_2, \ldots, v_{74}), \tag{1.6}$$

$v_k = 1$ or 0 indicating whether or not a mutation occurred at site $k$. Remark A of Section 7 describes the study in a little more detail.

For each of the 74 genomic sites, a separate logistic regression analysis was run using all 1391 cases, with that site’s mutation indicators as responses and the PI indicators as predictors. Together these yielded $444 = 6 \times 74$ $z$-values, one for testing each null hypothesis that drug $j$ does not cause mutations at site $k$, $j = 1, 2, \ldots, 6$ and $k = 1, 2, \ldots, 74$. The $z$-values were based on the usual approximation

$$z_i = y_i / se_i, \qquad i = 1, 2, \ldots, 444, \tag{1.7}$$

(using a single subscript $i$ in place of $(j, k)$) where $y_i$ is the maximum likelihood estimate (MLE) of the logistic regression coefficient and $se_i$ its approximate large-sample standard error.

Figure 1 shows a histogram of the 444 $z$-values, with negative $z_i$’s indicating greater mutational effects. The smooth curve $f(z)$ is a natural spline with seven degrees of freedom, fit to the histogram counts by Poisson regression. It emphasizes the *central peak* near $z = 0$, presumably the large majority of uninteresting drug-site combinations that have negligible mutation effects. Near its center the peak is well-described by a normal density having mean $-0.35$ and standard deviation 1.20, which we will call the *empirical null hypothesis*,

$$z_i \mid H_i \sim N(-0.35, 1.20^2). \tag{1.8}$$

Section 3 describes the estimation methodology for (1.8), with a brief discussion of the normality assumption in Remark D of Section 7.

The difference between the theoretical null $N(0, 1)$ and empirical null $N(-0.35, 1.20^2)$ may not seem worrisome here but we will see that it substantially affects any simultaneous inference procedure. A more dramatic example is given in Section 6, for a microarray analysis where going from the theoretical to empirical null totally negates any findings of significance. Situations going in the reverse direction also occur.

In classic situations involving only a single hypothesis test we must, of necessity, employ the theoretical null hypothesis $z \sim N(0, 1)$. The main point of this paper is that large-scale testing situations permit empirical estimation of the null distribution. Sections 3 through 5 concern reasons why the empirical and theoretical null might differ, and which might be preferable in different situations.

**Figure 1**: *Histogram of 444 $z$-values from the Drug-mutation analysis (horizontal axis: $z$-values, larger mutation effects toward the left); smooth curve $f(z)$ is natural spline fit to histogram counts. The central peak near $z = 0$ is approximately $N(-0.35, 1.20^2)$: the “empirical null hypothesis”. Simultaneous hypothesis tests for the 444 cases depend critically on the choice between the empirical or theoretical $N(0, 1)$ null.*

There are scientific as well as statistical differences between small-scale and large-scale
hypothesis testing situations. A single hypothesis test is most often run with the expectation and hope of rejecting the null, “with 80% power” in a typical clinical trial. Nobody wants to reject 80% of $N = 5000$ null hypotheses. The usual point of large-scale testing is to identify a small percentage of interesting cases that deserve further investigation. While not exactly looking for a needle in a haystack, we don’t want the whole haystack either. An important assumption of what follows is that the proportion of interesting cases is small, perhaps 1%, or 5% of $N$, but not more than 10%. This is made explicit in Section 2, in our description of the local false discovery rate as an analytic tool for large-scale testing. There are situations
where the 10% limit is irrelevant, for example, in constructing prediction models, but these fall outside the scope of this paper.

The terminology “Interesting/Uninteresting” used in this paper in preference to “Significant/Nonsignificant” is discussed near the end of Section 5. We conclude in Sections 7 and 8 with remarks, including most of the technical details, and a summary.

**2. The Local False Discovery Rate**

It is convenient to discuss large-scale testing problems in terms of the local false discovery rate (fdr), an empirical Bayes version of Benjamini and Hochberg’s (1995) methodology focusing on densities rather than tail areas; see Efron et al. (2001) and Efron and Tibshirani (2002).

We begin with a simple Bayes model. Suppose that the $N$ $z$-values fall into two classes, “Uninteresting” or “Interesting”, corresponding to whether or not $z_i$ is generated according to the null hypothesis, with prior probabilities $p_0$ and $p_1 = 1 - p_0$ for the classes; and that $z_i$ has density either $f_0(z)$ or $f_1(z)$ depending on its class,

$$\begin{aligned}
p_0 &= \mathrm{Prob}\{\text{Uninteresting}\}, &\quad f_0(z) &\ \text{density if Uninteresting (Null)} \\
p_1 &= \mathrm{Prob}\{\text{Interesting}\}, &\quad f_1(z) &\ \text{density if Interesting (Non-Null).}
\end{aligned} \tag{2.1}$$

The smooth curve in Figure 1 estimates the *mixture density* $f(z)$,

$$f(z) = p_0 f_0(z) + p_1 f_1(z). \tag{2.2}$$

According to Bayes theorem the *a posteriori* probability of being in the Uninteresting class given $z$ is

$$\mathrm{Prob}\{\text{Uninteresting} \mid z\} = p_0 f_0(z)/f(z). \tag{2.3}$$

Here we define the *local false discovery rate* to be

$$\mathrm{fdr}(z) \equiv f_0(z)/f(z), \tag{2.4}$$

ignoring the factor $p_0$ in (2.3), so $\mathrm{fdr}(z)$ is an upper bound on $\mathrm{Prob}\{\text{Uninteresting} \mid z\}$. In fact $p_0$ can be roughly estimated, see Remark B, but we are assuming that $p_0$ is near 1, say $p_0 \ge 0.90$, so $\mathrm{fdr}(z)$ is not a flagrant overestimator.

The local fdr provides a useful methodology for identifying Interesting cases in a situation like that of Figure 1: (1) estimate $f(z)$ from the observed ensemble of $z$-values, for example by the natural spline fit to the histogram counts; (2) assign a null density $f_0(z)$; (3) calculate $\mathrm{fdr}(z) = f_0(z)/f(z)$; (4) report as Interesting those cases with $\mathrm{fdr}(z_i)$ less than some threshold value, perhaps $\mathrm{fdr}(z_i) \le 0.10$. Remark B discusses the close connection between this algorithm and Benjamini and Hochberg’s (1995) method.
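
The four steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used for Figure 1: $f(z)$ is estimated by raw histogram proportions rather than by a natural spline Poisson regression, the $z$-values are simulated, and the names `histogram_density` and `local_fdr` are hypothetical.

```python
import random
from statistics import NormalDist


def histogram_density(zs, lo, hi, n_bins):
    """Step 1 (simplified): estimate f(z) by bin proportion divided by bin width."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for z in zs:
        if lo <= z < hi:
            counts[int((z - lo) / width)] += 1

    def f_hat(z):
        k = min(max(int((z - lo) / width), 0), n_bins - 1)
        return counts[k] / (len(zs) * width)

    return f_hat


def local_fdr(zs, null, lo=-8.0, hi=8.0, n_bins=32):
    """Steps 2-3: fdr(z_i) = f0(z_i) / f(z_i), capped at 1."""
    f_hat = histogram_density(zs, lo, hi, n_bins)
    out = []
    for z in zs:
        fz = f_hat(z)
        out.append(1.0 if fz == 0 else min(1.0, null.pdf(z) / fz))
    return out


# Simulated z-values (an assumption for illustration): 95% null, 5% shifted left.
rng = random.Random(0)
zs = [rng.gauss(0, 1) for _ in range(1900)] + [rng.gauss(-4, 1) for _ in range(100)]

fdr = local_fdr(zs, NormalDist(0, 1))                     # theoretical null as f0
interesting = [z for z, r in zip(zs, fdr) if r <= 0.10]   # step 4
```

Passing a different `NormalDist` as `null`, for instance one with the empirical center and width, changes which cases fall below the threshold, which is the comparison the rest of the section is concerned with.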

This paper concerns the choice of $f_0(z)$, the null hypothesis density. In the drug-mutation example it is crucial whether $f_0$ is taken to be the theoretical or empirical null, $N(0, 1)$ or $N(-0.35, 1.20^2)$. This is illustrated in Figure 2, a close-up view of Figure 1 focusing on the bin containing $z = -3$. The expected number of the 444 $z_i$ values falling into this bin is 6.37 for $f(z)$, and either 0.62 or 3.90 as $f_0(z)$ is $N(0, 1)$ or $N(-0.35, 1.20^2)$. Thus $\mathrm{fdr}(z) = f_0(z)/f(z)$ at $z = -3$ is estimated to be either

$$\mathrm{fdr}(-3) = \begin{cases} .097 & \text{using theoretical null } N(0, 1) \\ .612 & \text{using empirical null } N(-0.35, 1.20^2). \end{cases}$$

In this bin, changing from the theoretical to empirical null changes our inferences from Interesting to definitely Uninteresting.

**Figure 2**: *Close-up view of the bin containing $z = -3$ in Figure 1. Expected number in bin: 6.37 for $f(z)$, 0.62 for $f_0 = N(0, 1)$, 3.90 for $f_0 = N(-0.35, 1.20^2)$, the empirical null. Corresponding estimates of $\mathrm{fdr}(-3)$: 0.097 for $N(0, 1)$ versus 0.612 for $N(-0.35, 1.20^2)$. Should we report the cases in this bin as Interesting?*

Figure 3 compares the two estimates of $\log \mathrm{fdr}(z)$ over most of the $z$ scale. Of the 444 $z$-values, 18 have $\mathrm{fdr}(z) < 0.10$ for $f_0 = N(0, 1)$ but $> 0.10$ for $f_0 = N(-0.35, 1.20^2)$, with 17 of these at the left end of the scale. All told the empirical null yields only two-thirds as many cases with $\mathrm{fdr} < 0.10$ as the theoretical null, 35 compared to 53.

**Figure 3**: *Comparison of estimates of $\log \mathrm{fdr}(z)$ for the Drug-Mutation data; empirical null estimate (solid curve) declines more slowly than theoretical null estimate (dotted). Dashes indicate the 444 $z$-values. 17 cases on left have $\mathrm{fdr}(z) < 1/10$ for theoretical but $> 1/10$ for empirical null.*

**3. Estimating the Empirical Null Distribution**

Our estimate of the empirical null distribution for the Drug-mutation data was obtained in two steps: the curve $f(z)$ shown in Figure 1 was fit to the histogram counts by Poisson regression, and then the center and half-width of the central peak, say $\delta_0$ and $\sigma_0$, were obtained from $f(z)$,

$$\delta_0 = \arg\max\{f(z)\} \quad\text{and}\quad \sigma_0 = \left[-\frac{d^2}{dz^2}\log f(z)\right]_{\delta_0}^{-1/2}, \tag{3.1}$$

yielding $(\delta_0, \sigma_0) = (-0.35, 1.20)$. Details are given in Remark D, where the possibility of a non-normal empirical null distribution is briefly discussed.
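
The center and width calculation can be sketched numerically as follows. This is only an illustration of the idea, assuming a smooth log-density is already available as a Python callable; the grid search and finite-difference second derivative stand in for whatever differentiation the fitted spline allows, and `center_and_width` is a hypothetical name.

```python
import math
from statistics import NormalDist


def center_and_width(log_f, lo=-2.0, hi=2.0, steps=4000, h=1e-3):
    """delta0 = argmax of the curve; sigma0 = (-second derivative of log f)^(-1/2) there."""
    grid = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    delta0 = max(grid, key=log_f)
    # Central second difference approximates d^2/dz^2 log f at delta0.
    second = (log_f(delta0 + h) - 2.0 * log_f(delta0) + log_f(delta0 - h)) / h ** 2
    sigma0 = (-second) ** -0.5
    return delta0, sigma0


# Check on a curve whose central peak is exactly N(-0.35, 1.20^2).
peak = NormalDist(-0.35, 1.20)
delta0, sigma0 = center_and_width(lambda z: math.log(peak.pdf(z)))
```

For an exactly normal peak the log-density is quadratic, so the curvature at the maximum is $-1/\sigma_0^2$ and the function returns the mean and standard deviation of the peak.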

More direct estimation methods for $f_0$ seem possible, for example estimating $\delta_0$ by the median of the $z$-values. Suppose though that 10% of the $z$-values came from the non-null distribution and all of these were located at the far left end of Figure 1. Then the median of all the $z$’s would be the 4/9 quantile of the actual null distribution, not its median, yielding a badly biased estimate of $\delta_0$. Similar comments apply to estimating $\sigma_0$; see Remark D. Method (3.1) does not require preliminary estimates of the proportion $p_0$ in the null population of (2.1), a considerable practical advantage.
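
The 4/9 figure follows from a one-line calculation. As a sketch, write $m$ for the median of all the $z$-values and $F_0$ for the null cdf, and suppose, as in the thought experiment, that the non-null 10% all lie to the left of $m$. The numerical value in the last line additionally assumes a standard normal null, purely for illustration:

$$\begin{aligned}
0.10 + 0.90\,F_0(m) &= 0.50, \\
F_0(m) &= \frac{0.50 - 0.10}{0.90} = \frac{4}{9} \approx 0.444, \\
m &= \Phi^{-1}(4/9) \approx -0.14 \quad \text{if } F_0 = \Phi .
\end{aligned}$$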

How accurate are the estimates $(-0.35, 1.20)$? The usual standard error approximations for a Poisson regression fit are not appropriate here since the $z_i$’s are not independent of each other. A nonparametric bootstrap analysis was performed instead, with the 1391 80-dimensional vectors $(\mathbf{x}, \mathbf{v})$, (1.5)-(1.6), as the resampling units. This gave .09 and .08 for the bootstrap standard errors of $\delta_0$ and $\sigma_0$ respectively, i.e.

$$(\delta_0, \sigma_0) = (-0.35, 1.20) \pm (.09, .08). \tag{3.2}$$

It seems quite unlikely that estimation error alone accounts for the difference between the empirical null and the theoretical values $(\delta_0, \sigma_0) = (0, 1)$. (Notice that this type of bootstrap analysis, which requires independent sampling units, is not applicable to the microarray example of Section 6, where we expect correlations among the genes.)

The next two sections concern other possible causes for empirical/theoretical differences, diagnostics for these causes, and their interpretations. Our list is not exhaustive and in fact the microarray example of Section 6 demonstrates another form of pathology.

**4. Permutation Tests and Unobserved Covariates**

The theoretical $N(0, 1)$ null hypothesis (1.4) is usually based on asymptotic approximations like those for the logistic regression coefficients in the Drug-mutation study. Permutation methods can be used to avoid these approximations, perhaps in the hope that an improved theoretical null will more closely match the empirical null.

This was not the case for the Drug-mutation data. Permutation testing was implemented by randomly pairing the 1391 predictor vectors $\mathbf{x}$, (1.5), with the 1391 response vectors $\mathbf{v}$, (1.6), and recalculating the 444 $z$-values. This whole process was independently repeated 20 times, yielding a total of $20 \times 444$ permutation $z$’s. Their distribution was well approximated by a $N(0, .965^2)$ density (the “permutation null”) except for a prominent spike near $z = 0.3$. In this case the permutation-improved theoretical null differs more rather than less from the empirical null $N(-0.35, 1.20^2)$.

Permutation methods are popular in the microarray literature as a way of avoiding assumptions and approximations, see Efron et al. (2001) or Dudoit et al. (2003), *but they do not automatically resolve the question of an appropriate null hypothesis*. This can be seen in the following hypothetical example, which is a stylized version of the two-sample microarray testing problem in Section 6: the data $x_{ij}$ comes from $N$ simultaneous two-sample experiments, each comparing $2n$ subjects,

$$x_{ij}: \begin{cases} \text{Controls} & j = 1, 2, \ldots, n \\ \text{Treatments} & j = n + 1, n + 2, \ldots, 2n \end{cases} \qquad (i = 1, \ldots, N). \tag{4.1}$$

The $i$th test statistic $Y_i$ is the usual two-sample $t$-statistic, comparing Treatments versus Controls for the $i$th experiment.

Suppose that, unknown to the statistician, the data was actually generated from

$$x_{ij} = u_{ij} + \frac{I_j}{2}\,\beta_i, \qquad u_{ij} \sim N(0, 1), \quad \beta_i \sim N(0, \sigma^2), \tag{4.2}$$

with the $u_{ij}$ and $\beta_i$ mutually independent and

$$I_j = \begin{cases} -1 & j = 1, 2, \ldots, n \\ \phantom{-}1 & j = n + 1, \ldots, 2n. \end{cases} \tag{4.3}$$

Then it is easy to show that the statistics $Y_i$ follow a dilated $t$-distribution with $2n - 2$ degrees of freedom,

$$Y_i \sim \left(1 + \frac{n}{2}\sigma^2\right)^{1/2} t_{2n-2}, \tag{4.4}$$

while the permutation distribution, permuting Treatments and Controls within each experiment, has nearly a standard $t_{2n-2}$ null distribution. So for example if $\sigma^2 = 2/n$, the empirical null (4.4) is $\sqrt{2}$ times as wide as the permutation null.
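
A small simulation can make the dilation visible. The sketch below assumes the random-effect model just described, with unit-variance normal noise and each $\beta_i$ split as $\mp\beta_i/2$ between Controls and Treatments; the sample sizes, seed, and function names are arbitrary choices for illustration, and a single random relabelling per experiment stands in for a full permutation distribution.

```python
import math
import random
from statistics import mean, pstdev, variance


def t_stat(controls, treatments):
    """Usual pooled two-sample t-statistic (equal group sizes), Treatments minus Controls."""
    n = len(controls)
    pooled = (variance(controls) + variance(treatments)) / 2.0
    return (mean(treatments) - mean(controls)) / math.sqrt(pooled * 2.0 / n)


def simulate(N=2000, n=20, seed=1):
    rng = random.Random(seed)
    sigma = math.sqrt(2.0 / n)          # sigma^2 = 2/n, as in the text's example
    observed, permuted = [], []
    for _ in range(N):
        beta = rng.gauss(0.0, sigma)    # unobserved random effect beta_i
        x = [rng.gauss(0.0, 1.0) - beta / 2 for _ in range(n)]    # Controls
        x += [rng.gauss(0.0, 1.0) + beta / 2 for _ in range(n)]   # Treatments
        observed.append(t_stat(x[:n], x[n:]))
        rng.shuffle(x)                  # one random relabelling within the experiment
        permuted.append(t_stat(x[:n], x[n:]))
    return pstdev(observed), pstdev(permuted)


sd_observed, sd_permuted = simulate()
ratio = sd_observed / sd_permuted      # theory suggests a value near sqrt(2)
```

The ratio of the two standard deviations estimates how much wider the observed $t$-statistics are than their permutation counterparts.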

The quantity $\beta_i$ in (4.2)-(4.3) causes the only consistent differences between Treatments and Controls in experiment $i$. If $\beta_i$ is a dependable feature of the $i$th experiment, and would appear again with the same value in a replication of the study, then the permutation null $t_{2n-2}$ is a reasonable basis for inference. With $n$ large and $\sigma^2 = 2/n$, it results in $\mathrm{fdr}(y_i) \le 0.10$ for the most extreme 2% of the observed $t$-statistics, favoring those with the largest values of $|\beta_i|$.

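
The 2% figure can be checked directly. As a sketch, assume $n$ is large enough that $t_{2n-2}$ is effectively $N(0, 1)$, so the permutation null is $f_0 = N(0, 1)$; assume the observed statistics are dilated by the factor $\sqrt{2}$, so that $f = N(0, 2)$; and take the threshold $\mathrm{fdr} \le 0.10$ suggested in Section 2:

$$\begin{aligned}
\mathrm{fdr}(y) &= \frac{\varphi(y)}{\varphi(y/\sqrt{2})/\sqrt{2}} = \sqrt{2}\, e^{-y^2/4}, \\
\mathrm{fdr}(y) \le 0.10 &\iff |y| \ge 2\sqrt{\log(10\sqrt{2})} \approx 3.26, \\
\mathrm{Prob}\{|Y_i| \ge 3.26\} &= 2\,\Phi(-3.26/\sqrt{2}) \approx 2\,\Phi(-2.30) \approx 0.021 .
\end{aligned}$$
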

Suppose though that $\beta_i$ is not inherent to experiment $i$, but rather a purely random effect that would have a different value and perhaps a different sign if the study were repeated; that is, $\beta_i$ is part of the noise and not part of the signal. In this case the appropriate choice is the empirical null (4.4). The equivalent of Figure 1 will be *all* central peak, with no interesting outliers, and there will be no cases having small values of $\mathrm{fdr}(y_i)$. This is appropriate since now there is no real Treatment effect.

In this last context $\beta_i$ acts as an *unobserved covariate*, a quantity which the statistician would use to correct the Treatment-Control comparison if it were observable. Unobserved covariates are ubiquitous in observational studies. There are several obvious ones in the Drug-mutation study: personal characteristics of the patients such as age and gender, prior use of AZT and other non-PI drugs, years since infection, geographical location, etc.

The effect of important unobserved covariates is to dilate the null hypothesis density $f_0(z)$, as happens in (4.4). Unobserved covariates will also dilate the “Interesting” density $f_1(z)$ in (2.1), and the mixture density $f(z)$, (2.2). However an empirical fitting method for estimating $f(z)$, such as the spline fit in Figure 1, automatically includes any dilation effects. In estimating $\mathrm{fdr}(z) = f_0(z)/f(z)$ it is important to also allow for dilation of the numerator $f_0$. *This is a strong argument for preferring the empirical null hypothesis in observational studies.*

**5. A Structural Model for the z-values**

The Bayesian specifications (2.1) underlying our fdr results have the advantage of not requiring a structural model for the $z$-values; in particular it is not necessary to motivate, or even describe, the non-null density $f_1(z)$. There is however a simple structural model that helps elucidate the Interesting-Uninteresting distinction.

The structural model assumes that $z_i$, the $i$th $z$-value, is normally distributed around a “true value” $\mu_i$, its expectation,

$$z_i \sim N(\mu_i, 1) \quad \text{for } i = 1, 2, \ldots, N, \tag{5.1}$$

with $\mu_i$ having some prior distribution $g(\mu)$,

$$\mu_i \sim g(\mu) \quad \text{for } i = 1, 2, \ldots, N. \tag{5.2}$$

Structure (5.1) is often a good approximation, see Section 4 of Efron (1988), and in fact proved reasonably accurate in the bootstrap experiment giving (3.2). Together (5.1)-(5.2) say that the mixture density $f(z)$, (2.2), is a convolution of $g(\mu)$ with the standard normal density $\varphi$,

$$f(z) = \int_{-\infty}^{\infty} \varphi(z - \mu)\, g(\mu)\, d\mu \tag{5.3}$$

(with the understanding that $g(\mu)$ may include discrete probability atoms).

As a first application of the structural model, suppose we insist that $g(\mu)$ put probability $p_0$ on $\mu = 0$,

$$g(\{0\}) = p_0, \tag{5.4}$$

for some fixed value of $p_0$ between 0 and 1. This amounts to our original Bayes model (2.1) with $p_0 = \mathrm{Prob}\{\text{Uninteresting}\}$, $f_0(z)$ the theoretical null hypothesis $N(0, 1)$, and

$$f_1(z) = \int_{\mu \ne 0} \varphi(z - \mu)\, g(\mu)\, d\mu \Big/ (1 - p_0). \tag{5.5}$$

In the context of this paper, $p_0$ should be 0.90 or greater.

For any $f(z)$ of the convolution form (5.3) let $(\delta_g, \sigma_g)$ be the center and width parameters $(\delta_0, \sigma_0)$ defined by (3.1). Figure 4 answers the following question: for a given choice of $p_0$ in constraint (5.4), what are the maximum possible values of $|\delta_g|$ and of $\sigma_g$,

$$\delta_{\max} = \max\{|\delta_g| \mid p_0\} \quad\text{and}\quad \sigma_{\max} = \max\{\sigma_g \mid p_0\}. \tag{5.6}$$

**Figure 4**: *Maximum possible values of the center and width parameters $(\delta_0, \sigma_0)$, (3.1), when the structural model (5.1)-(5.3) is constrained to put probability $p_0$ on $\mu = 0$. For $1 - p_0 \le 0.10$ the maxima are not much greater than the theoretical null values $(0, 1)$, as shown in Table 1.*

Three curves appear for $\sigma_{\max}$: for the general case just described, for the case where the non-zero component of $g(\mu)$ is required to be symmetric around zero, and for the case where it is also required to be normal. Here we will only mention the general case. Remark F discusses the solution of (5.6), which turns out to have a simple “single-point” form.

The notable feature of Figure 4 is that for $p_0 \ge 0.90$, our preferred realm for large-scale hypothesis testing, $(\delta_{\max}, \sigma_{\max})$ must be quite near the theoretical null values $(0, 1)$:

$$\delta_{\max} \le 0.07 \quad\text{and}\quad \sigma_{\max} \le 1.04.$$

Table 1 shows $(\delta_{\max}, \sigma_{\max})$ for various choices of $p_0$. We see that the “Interesting” probability $1 - p_0$ would have to be nearly 0.30, very big by the standards of large-scale testing, in order to obtain the observed Drug-mutation values $(\delta_0, \sigma_0) = (-0.35, 1.20)$. The inference is that uninteresting effects, such as the unobserved covariates of Section 4, are dilating the null hypothesis.

**Table 1**: Value of $\sigma_{\max}$ and $\delta_{\max}$ as a function of $1 - p_0$, (5.4).

The main point here is that our measures (3.1) of center and width are quite robust to the arrangement of Interesting values $\mu_i$ as long as the Interesting percentage does not exceed 10%. If $(\delta_0, \sigma_0)$ for the central peak is much different than $(0, 1)$, as it is in Figure 1, then use of the theoretical null is bound to result in identifying an uncomfortably large percentage of supposedly Interesting cases.

We can pursue this last point for the Drug-mutation data by removing constraint (5.4). Figure 5 shows an unconstrained estimate of $g(\mu)$. For computational simplicity $g(\mu)$ was assumed to be discrete, with at most $J = 8$ support points $\mu_1, \mu_2, \ldots, \mu_J$, so that (5.3) becomes

$$f(z) = \sum_{j=1}^{J} \pi_j\, \varphi(z - \mu_j),$$

$\pi_j$ being the probability $g$ puts on $\mu_j$, with $\pi_j \ge 0$ and $\sum_j \pi_j = 1$. A non-linear minimization program was employed to find the best-fit curve of form (5.8) to the histogram counts in Figure 1, using Poisson deviance as the fitting criterion. The vertical bars in Figure 5 are located at the resulting 8 values $\mu_j$, with the bar’s height proportional to $\pi_j$. For example the little bar at far left represents an atom of probability $\pi_1 = .015$ at $\mu_1 = -10.9$. The resulting $f(z)$ estimate (5.7) closely resembles the natural spline fit of Figure 1. Table 2 shows all 8 $(\pi_j, \mu_j)$ pairs.

Suppose for a moment that the estimated $g(\mu)$ is exactly correct, so 1.5% of the 444 cases have their $\mu_i$’s equal $-10.9$, 1.3% have $-7.0$, etc., and that an oracle has told us the eight $(\pi_j, \mu_j)$ values. Given an observed $z_i$ we can now calculate $\mathrm{Prob}\{\text{Uninteresting} \mid z\}$, (2.3), exactly, *once the scientist specifies the definition of Uninteresting versus Interesting*.

**Figure 5**: *Best-fit discrete mixing function $g(\mu)$, (5.2), for Drug-mutation data; bars located at support points $\mu_j$, heights proportional to weights $\pi_j$; tall bar at $\mu_j = 0$ has weight $\pi_j = 0.61$. Solid curve is best-fit estimate $f(z) = \sum_j \pi_j \varphi(z - \mu_j)$; it closely matches natural spline fit of Figure 1.*

It seems obvious that the 60.8% at $\mu_j = 0$ are Uninteresting, and that the 10.6% at $\mu_j = -10.9, -7.0, -4.9$, and 6.1 deserve Interesting status. However the status of the 28.6% at $\mu_j = -1.8, -1.1$, and 2.4 is less clear.

If the 28.6% are deemed Interesting, this leaves only the 60.8% at $\mu_j = 0$ as Uninteresting. In terms of our Bayes model (2.1) we have $p_0 = .608$ and $f_0(z) \sim N(0, 1)$, the theoretical null. About 174 of the 444 cases will be identified as Interesting, too many for a typical screening exercise. Shifting the 28.6% to the Uninteresting classification increases $p_0$ to $.608 + .286 = .894$, a more manageable value, and changes $f_0(z)$ to the version of (5.7) supported on the four Uninteresting $\mu_j$’s; this is approximately $N(-0.34, 1.19^2)$, almost the same as the empirical null (1.8).

In other words the definition of “Interesting” determines the relevant choice of the null hypothesis $f_0$. If we want to keep the proportion of Interesting cases manageably small then $f_0(z)$ has to grow wider than $N(0, 1)$.

Use of the term “Interesting” rather than “Significant” reflects a difference in intent between large-scale and classical testing. In the hypothetical context of Figure 5 and Table 2, all of the 39.2% of the cases with non-zero $\mu_i$’s would eventually be declared as “significantly different from zero” if we vastly increased the sample size of patients. Section 4 suggests that minor deviations from $N(0, 1)$ might arise from scientifically uninteresting causes such as unobserved covariates. However even if a modestly non-zero $\mu_i$ is genuine in some sense, it may still be Uninteresting when viewed in comparison with an ensemble of more dramatic possibilities. Nonsignificant implies Uninteresting but not conversely.

**6. A Microarray Example**

Microarrays have become a prime source of large-scale simultaneous testing problems. Figure 6 relates to a well-known microarray experiment concerning differences between two types of genetic mutations causing increased breast cancer risk,
“BRCA1” and “BRCA2”; see Hedenfalk et al. (2001), also Efron and Tibshirani (2002), and

The experiment included 15 breast cancer patients, seven with the BRCA1 mutation and eight with BRCA2. Each woman’s tumor was analyzed on a separate microarray, each microarray reporting on the same set of $N = 3226$ genes. For each gene the two-sample $t$-statistic $y_i$ comparing the 7 BRCA1 responses with the 8 BRCA2’s was computed. The $y_i$’s were then converted to $z$-values,

$$z_i = \Phi^{-1}\bigl(F_{13}(y_i)\bigr), \tag{6.1}$$

where $F_{13}$ is the cdf of a standard $t$-distribution with 13 degrees of freedom. Figure 6 displays the histogram of the 3226 $z$-values.
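
The conversion from $t$-statistics to $z$-values can be sketched with the standard library alone. The version below is an illustration rather than the computation used in the study: it obtains the $t$ cdf by Simpson's rule on the $t$ density, which is an assumption of convenience, and the names `t_cdf` and `z_from_t` are hypothetical.

```python
import math
from statistics import NormalDist


def t_cdf(y, df, steps=2000):
    """cdf of a standard t-distribution, by Simpson's rule on the density from 0 to |y|."""
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(t):
        return const * (1 + t * t / df) ** (-(df + 1) / 2)

    a = abs(y)
    h = a / steps                      # steps is even, as Simpson's rule requires
    area = pdf(0.0) + pdf(a)
    for k in range(1, steps):
        area += (4 if k % 2 else 2) * pdf(k * h)
    area *= h / 3
    return 0.5 + area if y >= 0 else 0.5 - area


def z_from_t(y, df=13):
    """z = Phi^{-1}(F_df(y)), clipped so that inv_cdf stays strictly inside (0, 1)."""
    p = min(max(t_cdf(y, df), 1e-15), 1 - 1e-15)
    return NormalDist().inv_cdf(p)
```

Because the $t$-distribution has heavier tails than the normal, a $t$-statistic of 3 on 13 degrees of freedom maps to a $z$-value noticeably smaller than 3.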

**Table 2**: Weights $\pi_j$ and locations $\mu_j$ for 8-point best-fit estimate $g(\mu)$ of Figure 5. Which locations we deem Interesting versus Uninteresting determines the choice between the theoretical or empirical null hypothesis. (Numerical results accurate to one decimal place.)

**Figure 6**: *Histogram of $N = 3226$ $z$-values from breast cancer study. The theoretical $N(0, 1)$ null is much narrower than the central peak, which has $(\delta_0, \sigma_0) = (-0.02, 1.58)$. In this case the central peak seems to include the entire histogram.*


The central peak is wider here than in Figure 1, with center-width estimates $(\delta_0, \sigma_0) = (-0.02, 1.58)$. More importantly, the histogram seems to be *all* central peak, with no interesting outliers such as those seen at the left of Figure 1. This was reflected in the local fdr calculations: using the theoretical $N(0, 1)$ null yielded 35 genes having $\mathrm{fdr}(z_i) < 0.1$, those with $|z_i| > 3.35$; using the empirical $N(-0.02, 1.58^2)$ null, *no genes at all had* $\mathrm{fdr} < 0.1$ (or for that matter $\mathrm{fdr} < 0.9$, the histogram in fact being a little short-tailed compared to $N(-0.02, 1.58^2)$).

There is ample reason to distrust the theoretical null in this case. The microarray experiment for all its impressive technology is still an observational study, with a wide range of unobserved covariates possibly distorting the BRCA1-BRCA2 comparison.

Another reason for doubt can be found in the data itself. The fdr methodology does not require independence of the $y_i$’s or $z_i$’s across genes. However it does require that the 15 measurements for *each* gene be independent across the microarrays. Otherwise the two-sample $t$-statistic $y_i$ will not have an $F_{13}$ null distribution, not even approximately.

Unfortunately the experimental methodology used in the breast cancer study seems to have induced substantial correlations among the various microarrays. In particular, as discussed in Remark G, the first four microarrays in the BRCA2 group were mutually correlated, and likewise the last four. Correlations reduce the effective sample size for a two-sample $t$-statistic, just the type of effect that would induce overdispersion in (6.1).

This does not say that there are no BRCA1-BRCA2 diﬀerences, only that it is dangerous
to compare the

*t*-statistics with a standard

*t*13 null distribution, even if simultaneous inference

**7. Remarks**
**A.** *Drug-mutation Study* The data base for the Drug-mutation study, Wu et al. (2002), included 2497 patients having HIV subtype B, of whom 1391 had received at least one of six popular Protease Inhibitor drugs: amprenavir, indinavir, lopinavir, nelfinavir, ritonavir, or saquinavir. Among the 1391, the mean number of PI drugs taken was 2.05 per patient. Amino acid sequences were obtained at all 99 positions on the HIV protease gene, and mutations from wild-type recorded; 25 positions showed 3 or fewer mutations among the 1391 patients, deemed too few for analysis, leaving 74 positions for the investigation here. Each of the 74 individual logistic regressions included an intercept term as well as the six PI main effects, but no other covariates.

**B.** *The Local False Discovery Rate* The local fdr, (2.3) or (2.4), is closely related to Benjamini and Hochberg’s (1995) “tail-area” False Discovery Rate, as discussed in Efron et al. (2001) and Efron and Tibshirani (2002). Substituting cdf’s $F_0$ and $F$ for the densities $f_0$ and $f$, Bayes theorem gives a tail-area version of (2.3),

$$\mathrm{Prob}\{\text{Uninteresting} \mid z \le z_0\} = p_0 F_0(z_0)/F(z_0) \equiv \mathrm{FDR}(z_0). \tag{7.1}$$

$\mathrm{FDR}(z_0)$ turns out to be the conditional expectation of $\mathrm{fdr}(z) \equiv p_0 f_0(z)/f(z)$ given $z \le z_0$,

$$\mathrm{FDR}(z_0) = \int_{-\infty}^{z_0} \mathrm{fdr}(z)\, f(z)\, dz \Big/ \int_{-\infty}^{z_0} f(z)\, dz. \tag{7.2}$$

Benjamini and Hochberg work in a frequentist framework but their False Discovery Rate control rule can be stated in empirical Bayes terms: given $F_0$, which they usually take to be what we called the theoretical null, estimate $\mathrm{FDR}(z_0)$ by

$$\widehat{\mathrm{FDR}}(z_0) = p_0 F_0(z_0)/\bar{F}(z_0), \tag{7.3}$$

where $\bar{F}$ is the empirical cdf of the $z_i$’s; for a desired control level $\alpha$, say $\alpha = .05$, define

$$z_0 = \arg\max_z \{\widehat{\mathrm{FDR}}(z) \le \alpha\}; \tag{7.4}$$

then rejecting all cases with $z_i \le z_0$ gives an expected (frequentist) rate of false discoveries no greater than $\alpha$.
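
The rule can be sketched in code as follows. This is an illustrative left-tail version under stated assumptions: $p_0$ defaults to 1 (the conservative choice, which is an assumption here), ties among the $z$-values are ignored, and `fdr_control_cutoff` is a hypothetical name. The example data are constructed by hand, with three extreme values added to 97 evenly spaced standard normal quantiles.

```python
from statistics import NormalDist


def fdr_control_cutoff(zs, alpha=0.05, null=NormalDist(0, 1), p0=1.0):
    """Largest observed z0 with p0 * F0(z0) / Fbar(z0) <= alpha, or None if there is none."""
    ordered = sorted(zs)
    n = len(ordered)
    cutoff = None
    for rank, z0 in enumerate(ordered, start=1):
        fbar = rank / n                      # empirical cdf at z0
        if p0 * null.cdf(z0) / fbar <= alpha:
            cutoff = z0                      # keep the largest qualifying z0
    return cutoff


# Hand-built example: 97 null-like values plus three far-left cases.
null_like = [NormalDist().inv_cdf((k + 0.5) / 97) for k in range(97)]
example = [-5.0, -4.5, -4.0] + null_like
z0 = fdr_control_cutoff(example)
rejected = [z for z in example if z0 is not None and z <= z0]
```

Supplying a different `null`, for example one with the empirical center and width, changes the cutoff in the same way that it changes the local fdr.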

With $z_0$ as in (7.4), relation (7.2) (applied to the estimated versions of FDR, fdr, and $f$) says that the weighted average of $\mathrm{fdr}(z_i)$ for the cases rejected by the FDR level-$\alpha$ rule is itself $\alpha$. As an example take $\alpha = .05$ and $f_0$ equal the theoretical $N(0, 1)$ null. Applying the FDR control rule to the negative side of Figure 1’s Drug-mutation data rejects the null hypothesis for the 56 cases having $z_i \le -2.61$; the corresponding 56 values of $\mathrm{fdr}(z_i)$ have weighted average $\alpha = .05$. They vary from nearly zero at the far left to .19 at the boundary value $z = -2.61$, justifying the name “local”: $z_i$’s near the boundary are more likely to be false discoveries than the overall .05 rate suggests.

Our concern with a correct choice of null hypothesis applies to FDR just as well as fdr. In the microarray study, FDR with $F_0 = N(0, 1)$ gives 24 significant genes at $\alpha = .05$, while $F_0 = N(-.02, 1.58^2)$ gives none. In fact any simultaneous testing procedure, the popular Westfall-Young method (1993) for example, will depend on a correct assessment of $p$-values for the individual cases, i.e. on the choice of $F_0$.

**C.** *Estimating $f(z)$* The Poisson regression method used in Figure 1 to estimate the mixture density $f(z)$, (2.2), originates in an idea of Lindsey described in Section 2 of Efron and Tibshirani (1996): the range of the sample $z_1, z_2, \ldots, z_N$ is partitioned into $K$ equal intervals, with interval $k$ having midpoint $x_k$ and containing count $s_k$ of the $N$ $z$-values; the expectation $\lambda_k$ of $s_k$ is nearly proportional to $f_k \equiv f(x_k)$, and if the $z_i$'s are independent the counts approximate independent Poisson variates,

$$s_k \overset{\text{ind}}{\sim} \mathrm{Poisson}(\lambda_k), \qquad \lambda_k = c f_k \qquad [k = 1, 2, \ldots, K], \tag{7.5}$$

$c$ a constant depending on $N$ and the interval length.

Lindsey's method is to estimate the $\lambda_k$'s with a Poisson regression, which because of (7.5) amounts to estimating a scaled version of the $f_k$'s; in other words, estimating $f(z)$. $K$ equals 60 in Figure 1, with the regression model being a natural spline with 7 degrees of freedom, roughly equivalent to a sixth degree polynomial fit in $z$.
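Lindsey's method can be illustrated with a short, dependency-free Python sketch. Several things here are assumptions of the sketch rather than the paper's implementation: a low-degree polynomial in $x_k$ stands in for the natural spline, the Poisson regression is fit by iteratively reweighted least squares, and the names `solve_linear`, `bin_counts`, and `lindsey_density` are hypothetical. The demo sample is an idealized $N(0,1)$ sample made of exact normal quantiles.

```python
import math
from statistics import NormalDist


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            fac = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= fac * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def bin_counts(z, K):
    """Partition the range of z into K equal intervals; return midpoints, counts, width."""
    lo, hi = min(z), max(z)
    width = (hi - lo) / K
    counts = [0] * K
    for v in z:
        counts[min(K - 1, int((v - lo) / width))] += 1
    mids = [lo + (k + 0.5) * width for k in range(K)]
    return mids, counts, width


def lindsey_density(z, K=30, degree=4, iters=50):
    """Estimate f at the bin midpoints via Poisson regression of counts on a polynomial."""
    mids, counts, width = bin_counts(z, K)
    scale = max(abs(x) for x in mids)          # rescale x for numerical stability
    X = [[(x / scale) ** j for j in range(degree + 1)] for x in mids]
    mu = [s + 0.5 for s in counts]             # data-based starting values
    for _ in range(iters):                     # IRLS for the log-link Poisson model
        eta = [math.log(m) for m in mu]
        work = [e + (s - m) / m for e, s, m in zip(eta, counts, mu)]
        A = [[sum(m * r[i] * r[j] for m, r in zip(mu, X)) for j in range(degree + 1)]
             for i in range(degree + 1)]
        rhs = [sum(m * r[i] * w for m, r, w in zip(mu, X, work)) for i in range(degree + 1)]
        beta = solve_linear(A, rhs)
        mu = [math.exp(sum(b * v for b, v in zip(beta, r))) for r in X]
    n = len(z)
    return mids, [m / (n * width) for m in mu]  # fitted counts rescaled to a density


# Idealized demo sample: exact N(0, 1) quantiles, no randomness.
demo_z = [NormalDist().inv_cdf((i - 0.5) / 2000) for i in range(1, 2001)]
demo_mids, demo_f = lindsey_density(demo_z)
```

Because the model contains an intercept, the fitted counts sum to $N$ at convergence, so the returned values integrate to one over the binned range.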

Poisson regression based on (7.5) is almost fully efficient for estimating $f(z)$ if the $z_i$'s are independent. Here we do not expect independence, but we still have the expectation of $s_k$ proportional to $f_k$. The Poisson regression method will still tend to estimate $f(z)$ unbiasedly, assuming the regression model is sufficiently flexible, though we may lose estimating efficiency.

The bootstrap analysis that gave the standard errors in (3.2) was also used to check (7.5). This turned out to be surprisingly accurate for the Drug-mutation data. Had it not been, we might have used the bootstrap estimate of covariance for the $s_k$'s to motivate a more efficient estimation procedure, though this is unlikely to be important for large values of $N$. In any case bootstrap analyses as in (3.2) will provide legitimate standard errors for the Poisson regression whether or not (7.5) is valid.

**D.** *Estimating the Empirical Null Distribution* The main tactic of this paper is to estimate the null distribution $f_0(z)$ in (2.1) from the central peak in the $z$-values' histogram. Assuming a normal null density, $f_0 \sim N(\delta_0, \sigma_0^2)$, gives

$$-\log f(z) \doteq \frac{1}{2}\left(\frac{z - \delta_0}{\sigma_0}\right)^2 + \text{constant} \tag{7.6}$$

for $z$ near zero, so that $\delta_0$ and $\sigma_0$ can be estimated by differentiating $\log f(z)$ as in (3.1). The constant depends on $N$ and $p_0$, but it has no effect on the derivatives of (3.1).

Directly differentiating the spline estimate of $\log f(z)$ can give an overly variable estimate of $\sigma_0$. One more smoothing step was employed here: a quadratic curve $a_0 + a_1 x_k + a_2 x_k^2$ was fit by ordinary least squares to the estimated values $\log \hat{f}_k$, for $x_k$ within 1.5 units of the maximum $\hat{\delta}_0$, yielding $\hat{\sigma}_0 = [-2 a_2]^{-1/2}$ as in (3.1). This procedure gave the small bootstrap standard error estimate in (3.2).
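The quadratic smoothing step can be sketched as follows. This is a minimal illustration assuming the estimated log density is already available at the bin midpoints; the function names are hypothetical, and the demo input is an exact $N(-0.35, 1.20^2)$ log density on a grid rather than real data, so the fit recovers $\sigma_0$ essentially exactly.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            fac = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= fac * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def empirical_null_from_log_density(mids, log_f, window=1.5):
    """Return (delta0, sigma0) from a quadratic OLS fit to log_f near its maximum."""
    k_max = max(range(len(mids)), key=lambda k: log_f[k])
    delta0 = mids[k_max]                       # location of the central peak
    # Keep points within `window` units of the peak; center x at delta0
    # (centering changes a0 and a1 but leaves the curvature a2 unchanged).
    pts = [(x - delta0, y) for x, y in zip(mids, log_f) if abs(x - delta0) <= window]
    A = [[sum(u ** (i + j) for u, _ in pts) for j in range(3)] for i in range(3)]
    rhs = [sum(y * u ** i for u, y in pts) for i in range(3)]
    a0, a1, a2 = solve_linear(A, rhs)          # normal equations for a0 + a1 u + a2 u^2
    sigma0 = (-2.0 * a2) ** -0.5
    return delta0, sigma0


# Demo: exact log density of N(-0.35, 1.20^2), up to an additive constant.
demo_mids = [-6.0 + 0.2 * (k + 0.5) for k in range(60)]
demo_log_f = [-0.5 * ((x + 0.35) / 1.20) ** 2 for x in demo_mids]
demo_delta0, demo_sigma0 = empirical_null_from_log_density(demo_mids, demo_log_f)
```

The additive constant in the log density drops out, in line with the remark that it has no effect on the derivatives in (3.1). Here `delta0` is simply the grid midpoint with the largest fitted value.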

None of this methodology is crucial, though it is important that the estimates $\hat{\delta}_0$ and $\hat{\sigma}_0$ relate directly to $f_0(z)$, and are not much affected by the non-Null distribution $f_1(z)$ in (2.1). As an example of what can go wrong, suppose we try to estimate $\sigma_0$ by a “robust” scale measure such as (84th quantile minus 16th quantile)/2. This gives $\hat{\sigma}_0 = 1.47$ for the Drug-mutation data, reflecting long tails due to the Interesting cases in Figure 1. Similar difficulties arise using the central slope of a $qq$ plot. Basically a density estimate of the central peak is required, and then some assessment of its center and width.
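The sensitivity of the quantile-based scale measure to non-null cases is easy to reproduce on artificial data. The sketch below is illustrative only: the mixture (900 exact $N(0,1)$ quantiles plus 50 cases at each of $\pm 4$) and the function name `robust_scale` are assumptions made here, not the Drug-mutation data.

```python
from statistics import NormalDist


def robust_scale(z):
    """(84th quantile minus 16th quantile) / 2, using simple order statistics."""
    zs = sorted(z)
    n = len(zs)
    return (zs[int(0.84 * n)] - zs[int(0.16 * n)]) / 2.0


null = NormalDist(0.0, 1.0)

# Pure null sample: the measure is close to the true sigma0 = 1.
null_only = [null.inv_cdf((i - 0.5) / 1000) for i in range(1, 1001)]

# 90% null cases plus 10% "Interesting" cases far out in both tails.
with_tails = ([null.inv_cdf((i - 0.5) / 900) for i in range(1, 901)]
              + [-4.0] * 50 + [4.0] * 50)

scale_null = robust_scale(null_only)
scale_mixed = robust_scale(with_tails)
```

Although the central peak of `with_tails` is exactly $N(0,1)$, the tail cases push the 16th and 84th quantiles outward, so `scale_mixed` overstates $\sigma_0$.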

More ambitiously, we might try extending the estimation of $f_0(z)$ to third moments, permitting a skew null distribution. Expression (7.6) could be generalized to

$$-\log f(z) \doteq c_0 + c_1 z + c_2 z^2/2 + c_3 z^3/6,$$

now requiring three derivatives to estimate the coefficients rather than the two of (3.1). This is an unexplored path, and in particular Table 1 has not been extended to include skewness.

Familiarity was the only reason for using $z$-values instead of $t$-values in Figures 1 and 6.

**E.** *Estimating $p_0$* We can obtain reasonable upper bounds for $p_0$ in (2.1) from estimates of

$$\pi(c) \equiv \mathrm{Prob}_f\{z_i \in \delta_0 \pm c\sigma_0\}.$$

Supposing $f_0(z) = N(\delta_0, \sigma_0^2)$, define

$$G_0(c) = 2\Phi(c) - 1 \quad\text{and}\quad G_1(c) = \int_{\delta_0 - c\sigma_0}^{\delta_0 + c\sigma_0} f_1(z)\,dz,$$

the probabilities that $z_i \in \delta_0 \pm c\sigma_0$ under $f_0$ and $f_1$ respectively. Then, since $\pi(c) = p_0 G_0(c) + (1 - p_0) G_1(c)$ for the mixture density $f$,

$$p_0 = \frac{\pi(c) - G_1(c)}{G_0(c) - G_1(c)} \le \frac{\pi(c)}{G_0(c)},$$

the inequality following from the assumption that $G_1(c) \le G_0(c)$, i.e. that the $f_1$ density is more dispersed than $f_0$.

This leads to the estimated upper bound for $p_0$,

$$\hat{p}_0 = \hat{\pi}(c)/G_0(c), \qquad \hat{\pi}(c) = \#\{z_i \in \delta_0 \pm c\sigma_0\}/N.$$

In particular if we assume $G_1(c) = 0$, in other words that the Interesting $z_i$'s always fall outside $\delta_0 \pm c\sigma_0$, then $\hat{p}_0 = \hat{\pi}(c)/G_0(c)$ is unbiased. (This is the same estimate suggested in Remark F of Efron et al. (2001).) Choosing $(\delta_0, \sigma_0) = (-0.35, 1.20)$ and $c = 1.5$ gave $\hat{p}_0 = 0.88$ for the Drug-mutation data, with bootstrap standard error 0.024.
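The upper-bound estimate is a one-line computation once $(\delta_0, \sigma_0)$ and $c$ are chosen. The sketch below uses a hypothetical function name and an artificial sample (900 exact $N(0,1)$ quantiles plus 100 cases placed at $z = 6$, so that $G_1(c) = 0$ holds by construction); it does not reproduce the Drug-mutation value.

```python
from statistics import NormalDist


def p0_upper_bound(z, delta0, sigma0, c=1.5):
    """Estimate pi(c) / G0(c): the proportion of z in delta0 +/- c*sigma0,
    divided by the null probability 2*Phi(c) - 1 of that interval."""
    n_inside = sum(1 for v in z if abs(v - delta0) <= c * sigma0)
    pi_hat = n_inside / len(z)
    g0 = 2.0 * NormalDist().cdf(c) - 1.0
    return pi_hat / g0


# Artificial sample with true p0 = 0.90 and all non-null cases outside the interval.
demo_z = [NormalDist().inv_cdf((i - 0.5) / 900) for i in range(1, 901)] + [6.0] * 100
demo_p0 = p0_upper_bound(demo_z, delta0=0.0, sigma0=1.0, c=1.5)
```

When some non-null cases do fall inside $\delta_0 \pm c\sigma_0$, the same computation overstates $p_0$, which is why it is described as an upper bound.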

**F.** *Single-point Solutions for $(\delta_{\max}, \sigma_{\max})$* The distributions $g(\mu)$ providing $(\delta_{\max}, \sigma_{\max})$ in (5.6), as graphed in Figure 4, have their non-zero components supported at a single point $\mu_1$. For example, $g(\mu)$ for the entry giving $\sigma_{\max} = 1.04$ in Table 1 puts probability 0.90 at $\mu = 0$ and 0.10 at $\mu_1 = 1.47$. Single-point optimality was proved for three of the four cases in Figure 4, and verified by numerical maximization for the “General” case. Here is the proof for the $\sigma_{\max}$ “Symmetric” case, the other two proofs being similar.

We consider symmetric distributions putting probability $p_0$ on $\mu = 0$ and probabilities $p_j$ on symmetric pairs $(-\mu_j, \mu_j)$, $j = 1, 2, \ldots, J$, so (5.3) becomes

$$f(z) = p_0 \varphi(z) + \sum_{j=1}^{J} p_j [\varphi(z - \mu_j) + \varphi(z + \mu_j)]/2.$$

Defining $c_0 = p_0/(1 - p_0)$, $r_j = p_j/p_0$, and $r_+ = \sum_{j=1}^{J} r_j = 1/c_0$, we can express $\sigma_0$ in (3.1) as

$$\sigma_0 = (1 - Q)^{-1/2}, \tag{7.12}$$

where

$$Q = \frac{\sum_{j=1}^{J} r_j \mu_j^2 e^{-\mu_j^2/2}}{\sum_{j=1}^{J} r_j \left(c_0 + e^{-\mu_j^2/2}\right)}. \tag{7.13}$$

Here we have used $\delta_0 = 0$, which is true by symmetry assuming $p_0 \ge 1/2$. Then $\sigma_{\max}$ in (5.6) can be found by maximizing $Q$.

We will show that with $p_0$ (and $c_0$) and $\mu_1, \mu_2, \ldots, \mu_J$ held fixed in (7.12), $Q$ is maximized by a choice of $p_1, p_2, \ldots, p_J$ having $J - 1$ zero values; this is a stronger version of the single-point result. Because $Q$ is homogeneous in $\mathbf{r} = (r_1, r_2, \ldots, r_J)$ in (7.13), we can consider the unconstrained maximization of $Q(\mathbf{r})$, subject only to $r_j \ge 0$ for $j = 1, 2, \ldots, J$. Differentiation gives

$$\frac{\partial Q(\mathbf{r})}{\partial r_j} = \frac{\mu_j^2 e^{-\mu_j^2/2} - Q(\mathbf{r})\left(c_0 + e^{-\mu_j^2/2}\right)}{\text{den}},$$

with “den” the denominator of $Q$. At a maximizing point $\mathbf{r}$ we must have $\partial Q(\mathbf{r})/\partial r_j \le 0$, with equality if $r_j > 0$. Defining $R_j = \mu_j^2/(1 + c_0 e^{\mu_j^2/2})$, this gives

$$Q(\mathbf{r}) \ge R_j, \quad\text{with equality if } r_j > 0.$$

Since $Q(\mathbf{r})$ is the maximum, this says that $r_j$, and $p_j$, can only be non-zero if $j$ maximizes $R_j$. In case of ties we can arbitrarily choose one of the maximizing $j$'s.

All of this shows that we need only consider $J = 1$ in (7.12). The global maximized value of $\sigma_0$ in (7.12) is $\sigma_{\max} = (1 - R_{\max})^{-1/2}$, where

$$R_{\max} = \max_{\mu_1} \left\{ \mu_1^2 / \left(1 + c_0 e^{\mu_1^2/2}\right) \right\}.$$

The maximizing argument $\mu_1$ ranges from 1.43 for $p_0 = .95$ to 1.51 for $p_0 = .70$. The corresponding result for $\delta_{\max}$ is simpler, $\mu_1 = \delta_{\max} + 1$.
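The single-point maximization for the “Symmetric” case is simple to check numerically. The sketch below rests on assumptions made here: that $\sigma_0$ is the inverse square root of the curvature of $-\log f(z)$ at $z = 0$, that the single-point criterion is $R(\mu) = \mu^2/(1 + c_0 e^{\mu^2/2})$, and that a plain grid search over $\mu$ is accurate enough. All function names are hypothetical.

```python
import math
from statistics import NormalDist


def r_single_point(mu, p0):
    """R(mu) = mu^2 / (1 + c0 * exp(mu^2 / 2)), with c0 = p0 / (1 - p0)."""
    c0 = p0 / (1.0 - p0)
    return mu ** 2 / (1.0 + c0 * math.exp(mu ** 2 / 2.0))


def sigma_max_symmetric(p0, step=0.001, mu_hi=4.0):
    """Grid-search the maximizing mu1; return (mu1, sigma_max = (1 - R_max)^(-1/2))."""
    grid = [step * k for k in range(1, int(mu_hi / step) + 1)]
    mu_star = max(grid, key=lambda m: r_single_point(m, p0))
    return mu_star, (1.0 - r_single_point(mu_star, p0)) ** -0.5


def sigma0_numeric(p0, mu, h=1e-3):
    """sigma0 from a central-difference second derivative of log f at z = 0,
    for f = p0*phi(z) + (1 - p0)*[phi(z - mu) + phi(z + mu)]/2."""
    phi = NormalDist().pdf

    def log_f(z):
        return math.log(p0 * phi(z) + (1.0 - p0) * 0.5 * (phi(z - mu) + phi(z + mu)))

    curvature = (log_f(h) - 2.0 * log_f(0.0) + log_f(-h)) / h ** 2
    return (-curvature) ** -0.5


mu_95, sigma_95 = sigma_max_symmetric(0.95)
mu_70, sigma_70 = sigma_max_symmetric(0.70)
```

Under these assumptions the maximizing $\mu_1$ moves from about 1.43 to about 1.51 as $p_0$ decreases from .95 to .70, and `sigma0_numeric` agrees with the closed form at the maximizing point.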

**G.** *Microarray Correlation in the Breast Cancer Study* This remark concerns the correlation structure among the eight BRCA2 microarrays. Let $X$ be the $3226 \times 8$ matrix of BRCA2 data, with the columns of $X$ standardized to have mean 0 and variance 1. A “de-gened” matrix $\tilde{X}$ was formed by subtracting row-wise averages from each element of $X$,

$$\tilde{x}_{ij} = x_{ij} - \bar{x}_{i\cdot}, \qquad \bar{x}_{i\cdot} = \sum_{j=1}^{8} x_{ij}/8. \tag{7.17}$$

Table 3 shows the $8 \times 8$ correlation matrix of $\tilde{X}$. With genuine gene effects subtracted out, the correlations should vary around $-1/7 = -0.14$ if the columns of $X$ are independent. Instead we see that the columns are correlated in blocks of four, with the off-diagonal block too negative and the on-diagonal blocks too positive.

**Table 3**: Correlation matrix for the BRCA2 data with row-wise means subtracted off, (7.17). It indicates positive correlations within the two blocks of four.
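The $-1/7$ baseline can be checked by simulation. The sketch below substitutes independent Gaussian noise for the BRCA2 measurements (an assumption of this illustration; the real data are not reproduced here), standardizes the columns, subtracts row-wise averages as in (7.17), and returns the 28 off-diagonal correlations. The function name and seed are arbitrary.

```python
import math
import random


def offdiag_correlations_after_degening(n_rows=3226, n_cols=8, seed=0):
    """Correlations between columns of a row-centered matrix of independent noise."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_cols)] for _ in range(n_rows)]

    # Standardize each column to mean 0 and variance 1.
    for j in range(n_cols):
        col = [row[j] for row in X]
        mean = sum(col) / n_rows
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / n_rows)
        for row in X:
            row[j] = (row[j] - mean) / sd

    # Subtract the row-wise average from each element ("de-gening").
    X_tilde = []
    for row in X:
        row_mean = sum(row) / n_cols
        X_tilde.append([v - row_mean for v in row])

    cols = [[row[j] for row in X_tilde] for j in range(n_cols)]

    def corr(a, b):
        ma, mb = sum(a) / len(a), sum(b) / len(b)
        sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
        saa = sum((x - ma) ** 2 for x in a)
        sbb = sum((y - mb) ** 2 for y in b)
        return sab / math.sqrt(saa * sbb)

    return [corr(cols[j], cols[k]) for j in range(n_cols) for k in range(j + 1, n_cols)]


demo_corrs = offdiag_correlations_after_degening()
demo_mean_corr = sum(demo_corrs) / len(demo_corrs)
```

With independent columns every off-diagonal correlation scatters closely around $-1/7$, so a block pattern like the one described for Table 3 would not arise from row-centering alone.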

**8. Summary**
Large-scale simultaneous hypothesis testing, where the number of cases exceeds say 100, permits the empirical estimation of a null hypothesis distribution. The empirical null may be wider (more dispersed) than the theoretical null distribution that would ordinarily be used for a single hypothesis test. The choice between empirical and theoretical nulls can greatly influence which cases are identified as “Significant” or “Interesting”, as opposed to “Null” or “Uninteresting”, this being true no matter which simultaneous hypothesis testing method is used.

We present an analysis plan for large-scale testing situations:

*• *A density fitting technique is used to estimate the null hypothesis distribution $f_0$.

*• *The local false discovery rate, an empirical Bayes version of standard FDR theory, provides inferences for the $N$ cases, Figure 3 and Section 2.

There are many possible reasons for overdispersion of the empirical null distribution that
would lead to the empirical null being preferred for simultaneous testing:

*• *Unobserved covariates in an observational study, Section 4.

*• *Hidden correlations, Section 6.

*• *A large proportion of genuine but uninterestingly small eﬀects, Figure 5.

Large-scale testing differs in scientific intent from an individual hypothesis test. The latter is most often designed to reject the null hypothesis with high probability. Large-scale testing is usually more of a screening operation, intended to identify a *small* percentage of Interesting cases, assumed to be on the order of 10% or less in this paper. Our estimation technique for the empirical null hypothesis is designed to be accurate under this constraint, Figure 4. More traditional estimation methods, involving permutations or quantiles, give incorrect $f_0$ estimates, Section 4 and Remark D.

**Acknowledgment **I am grateful to Robert Shafer, David Katzenstein, and Rami Kantor

for bringing the Drug-mutation data to my attention, and to Robert Tibshirani for several

**References**
Benjamini, Y. and Hochberg, Y. (1995). “Controlling the false discovery rate: A practical and powerful approach to multiple testing”. *J.R. Stat. Soc. Ser. B Stat. Methodol.* **57**

Dudoit, S., Shaffer J., and Boldrick J. (2003). “Multiple hypothesis testing in microarray experiments”. *Statistical Science* **18** 71-103.

Efron, B. (2003). “Robbins, empirical Bayes, and microarrays”. *Annals Stat.* **31** 366-378.

Efron, B. and Tibshirani, R. (2002). “Empirical Bayes methods and false discovery rates for microarrays”. *Genetic Epidemiology* **23** 70-86.

Efron, B., Tibshirani, R., Storey, J. and Tusher, V. (2001). “Empirical Bayes analysis of a microarray experiment”. *J. Amer. Statist. Assoc.* **96** 1151-1160.

Efron, B. and Tibshirani, R. (1996). “Using specially designed exponential families for density estimation”. *Annals Stat.* **24** 2431-61.

Efron, B. (1988). “Three examples of computer-intensive statistical inference”. *Sankhya* **50**

Hedenfalk, I., Duggen, D., Chen, Y., et al. (2001). “Gene expression profiles in hereditary breast cancer”. *New Engl. Jour. Medicine* **344** 539-48.

Miller, R. (1981). *Simultaneous Statistical Inference*, Second Edition, Springer-Verlag, New York.

Westfall, P. and Young, S. (1993). *Resampling-based multiple testing: examples and methods for p-value adjustments*. Wiley, New York.

Wu, T., Schiffer, C., Shafer, R. et al. (2003). “Mutation patterns and structural correlates in Human Immunodeficiency Virus Type 1 Protease following different protease inhibitor treatments”. *Jour. Virology* **77(8)** 4836-47.

Source: http://www.stats.org.uk/statistical-inference/Efron2004.pdf
