I tried my best not to come up with such a click-bait-y title for this post. However, Mr. Neyman (1894-1981) and Mr. Pearson (1895-1980) didn’t leave much room for that.
This is Part-1 of a two-part post. I have built an interactive Shiny app in R for visualizing what goes on under the hood of these hypothesis tests. That can be found in Part-2.
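First, let me define the “power” of a hypothesis test. In statistical terms, power is the probability of rejecting a null hypothesis when it is actually false. In layman's terms, let us say we have a testing protocol which decides between two possible outcomes: A and B. In this case, we have four possible scenarios:
Decide A when A is true.
Decide B when B is true, with probability $P_D = P(\text{decide } B \mid B \text{ is true})$.
Decide B when A is true, with probability $P_{FA} = P(\text{decide } B \mid A \text{ is true})$.
Decide A when B is true, with probability $P_M = P(\text{decide } A \mid B \text{ is true})$.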
If you have a close look, the last two scenarios represent errors of two kinds. Historically, these are known as Type-1 and Type-2 errors. However, I find this nomenclature extremely non-intuitive. To make the entire topic a little more intuitive, let us use an example. Let us say we are trying to detect the presence of a signal in a radar output in Gaussian noise. Outcome A indicates the absence of signal, which means the output contains noise only. Outcome B indicates the presence of signal in Gaussian noise. Clearly, a Type-1 error means a false alarm whereas a Type-2 error means a missed detection. In the above equations, $P_D$, $P_{FA}$ and $P_M$ denote the probabilities of correct detection, false alarm and missed detection respectively. There obviously exists another combination (deciding A when A is true), which has not been named by statisticians. This is the nameless combo.
Also, the detection and missed detection probabilities are connected via
$$P_D + P_M = 1$$
because, when B is true, the test must decide either B or A.
We have set our conceptual backdrop. Now we want to make this test the most powerful. This means we want to come up with a decision rule that maximizes $P_D$. Naturally, the first question that pops up in our mind is: is it possible to design a test where we can arbitrarily maximize the probability of correct decisions?
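We need to answer this question first. Let us say the decision rule divides the entire field (over which the likelihood functions are defined) into two disjoint regions $\mathcal{R}_A$ and $\mathcal{R}_B$, and we decide for B whenever the observation falls in $\mathcal{R}_B$. For a signal vector $\mathbf{x}$ we can write the following:
$$P_D = \int_{\mathcal{R}_B} p(\mathbf{x} \mid B)\, d\mathbf{x} \qquad (5)$$
$$P_{FA} = \int_{\mathcal{R}_B} p(\mathbf{x} \mid A)\, d\mathbf{x} \qquad (6)$$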
Here $p(\mathbf{x} \mid A)$ and $p(\mathbf{x} \mid B)$ are the probability density functions of $\mathbf{x}$ under each hypothesis. An interesting phenomenon lies in Equations (5) and (6). If the region $\mathcal{R}_B$, where we decide for B, is so big that it engulfs the entire field, then both $P_D$ and $P_{FA}$ converge toward unity. This rather interesting phenomenon tells us that, if the conditional densities $p(\mathbf{x} \mid A)$ and $p(\mathbf{x} \mid B)$ overlap (so that both integrals in Equations (5) and (6) pick up a non-zero contribution from the shared region), then it is not possible to derive the ultimate decision rule for which $P_D = 1$ and $P_{FA} = 0$. If you want to increase $P_D$, you will end up increasing $P_{FA}$ in the process, but at a different rate depending on the density functions and their extent of overlap. This is where the Neyman-Pearson criterion comes to our aid.
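To make this trade-off concrete, here is a minimal Python sketch. It assumes a scalar toy version of the radar example, which is my simplification and not something defined above: the observation is $\mathcal{N}(0, \sigma^2)$ under A and $\mathcal{N}(\mu, \sigma^2)$ under B, and the rule is "decide B if the observation exceeds a threshold". The function names, the mean `mean_b` and the threshold values are all made up for illustration.

```python
import math


def q_function(z):
    """Tail probability of a standard normal variable: P(Z > z)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def pfa_and_pd(threshold, mean_b=1.0, sigma=1.0):
    """Return (P_FA, P_D) for the rule 'decide B if x > threshold'.

    Assumed toy model: x ~ N(0, sigma^2) under A, x ~ N(mean_b, sigma^2) under B.
    """
    p_fa = q_function(threshold / sigma)
    p_d = q_function((threshold - mean_b) / sigma)
    return p_fa, p_d


# Lowering the threshold enlarges the "decide B" region.
for threshold in (3.0, 2.0, 1.0, 0.0, -1.0, -3.0):
    p_fa, p_d = pfa_and_pd(threshold)
    print(f"threshold={threshold:+.1f}  P_FA={p_fa:.4f}  P_D={p_d:.4f}")
```

As the threshold drops, both probabilities climb toward one together, just at different rates. That is the behaviour described above.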
What is the Neyman-Pearson Criterion?
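The Neyman-Pearson criterion sees that problem in a rather different light. It asks: how can we maximize $P_D$ for a given $P_{FA}$? In short, it poses this problem as a constrained optimization problem, as follows
$$\max_{\mathcal{R}_B} \; P_D \quad \text{subject to} \quad P_{FA} \le \alpha$$
where $\alpha$ is the false alarm probability we are willing to tolerate.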
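Here is a small Python sketch of what "maximize detection for a given false alarm rate" looks like in practice. It again assumes a scalar toy model of my own choosing ($\mathcal{N}(0, \sigma^2)$ under A, $\mathcal{N}(\mu, \sigma^2)$ under B) and a rule of the form "decide B if the observation exceeds a threshold". The function names and the numbers 0.05 and 2.0 are hypothetical.

```python
from statistics import NormalDist


def neyman_pearson_threshold(alpha, sigma=1.0):
    """Threshold t with P(x > t | A) = alpha, assuming x ~ N(0, sigma^2) under A."""
    return NormalDist(0.0, sigma).inv_cdf(1.0 - alpha)


def detection_probability(threshold, mean_b, sigma=1.0):
    """P(x > threshold | B), assuming x ~ N(mean_b, sigma^2) under B."""
    return 1.0 - NormalDist(mean_b, sigma).cdf(threshold)


alpha = 0.05  # tolerated false alarm probability (illustrative value)
threshold = neyman_pearson_threshold(alpha)
p_d = detection_probability(threshold, mean_b=2.0)
print(f"threshold={threshold:.4f}  P_D={p_d:.4f}")
```

The false alarm budget fixes the threshold first, and the detection probability is whatever that threshold then delivers under B.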
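It can be solved using a Lagrange multiplier. I will take a rather different path by solving it graphically because that is more insightful. For that, let us consider a simple yet common case. The likelihood functions under the two hypotheses are both Gaussian, differing only by mean, and they share the same covariance matrix, taken here to be $\sigma^2\mathbf{I}$:
$$A: \; \mathbf{x} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}) \qquad B: \; \mathbf{x} \sim \mathcal{N}(\mathbf{m}, \sigma^2\mathbf{I})$$
where $\mathbf{x} \in \mathbb{R}^N$, which means $\mathbf{x}$ is a vector of length $N$. Immediately, the likelihood functions can be written as
$$p(\mathbf{x} \mid A) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left(-\frac{\mathbf{x}^T\mathbf{x}}{2\sigma^2}\right)$$
$$p(\mathbf{x} \mid B) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left(-\frac{(\mathbf{x}-\mathbf{m})^T(\mathbf{x}-\mathbf{m})}{2\sigma^2}\right)$$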
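Taking the log-likelihood ratio of the above likelihood functions and simplifying further (which is really straightforward), we end up with the following expression
$$\ln \Lambda(\mathbf{x}) = \ln \frac{p(\mathbf{x} \mid B)}{p(\mathbf{x} \mid A)} = \frac{1}{\sigma^2}\mathbf{m}^T\mathbf{x} - \frac{1}{2\sigma^2}\mathbf{m}^T\mathbf{m}$$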
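Now, the log-likelihood ratio test can be written as follows by rearranging the above a little, with $\eta$ denoting the threshold applied to the likelihood ratio
Decide for A if $\frac{1}{\sigma^2}\mathbf{m}^T\mathbf{x} - \frac{1}{2\sigma^2}\mathbf{m}^T\mathbf{m} < \ln \eta$ … (8)
Decide for B if $\frac{1}{\sigma^2}\mathbf{m}^T\mathbf{x} - \frac{1}{2\sigma^2}\mathbf{m}^T\mathbf{m} > \ln \eta$ … (9)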
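If you would rather check the algebra numerically than by hand, the following Python sketch compares the closed-form log-likelihood ratio with a brute-force difference of log densities. It assumes zero-mean noise under A, mean vector $\mathbf{m}$ under B, and a shared covariance $\sigma^2\mathbf{I}$. The function names, the sample vectors and the threshold `eta` are hypothetical values picked for illustration.

```python
import math


def log_gaussian_density(x, mean, sigma):
    """Log of the N(mean, sigma^2 I) density evaluated at the vector x."""
    n = len(x)
    squared_distance = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return (-0.5 * n * math.log(2.0 * math.pi * sigma ** 2)
            - squared_distance / (2.0 * sigma ** 2))


def log_likelihood_ratio(x, m, sigma):
    """Closed form: (m . x) / sigma^2 - (m . m) / (2 sigma^2)."""
    m_dot_x = sum(mi * xi for mi, xi in zip(m, x))
    m_dot_m = sum(mi * mi for mi in m)
    return m_dot_x / sigma ** 2 - m_dot_m / (2.0 * sigma ** 2)


def decide(x, m, sigma, eta=1.0):
    """Decide 'B' if the log-likelihood ratio exceeds ln(eta), otherwise 'A'."""
    return "B" if log_likelihood_ratio(x, m, sigma) > math.log(eta) else "A"


x = [0.9, 1.4, 0.2, 1.1]   # illustrative observation
m = [1.0, 1.0, 1.0, 1.0]   # illustrative mean vector under B
sigma = 1.5
zero = [0.0] * len(x)

direct = log_gaussian_density(x, m, sigma) - log_gaussian_density(x, zero, sigma)
closed = log_likelihood_ratio(x, m, sigma)
print(direct, closed, decide(x, m, sigma))
```

The two numbers agree up to floating-point error, because the normalizing constants and the $\mathbf{x}^T\mathbf{x}$ terms cancel in the ratio.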
In fact, the likelihood ratio test can be further simplified by introducing the notion of “Sufficient Statistic”.
An extremely small primer on Sufficient Statistic:
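A sufficient statistic is a particular statistic or algebraic manipulation of our observation $\mathbf{x}$. Instead of dealing with the entire set of observed data points, a sufficient statistic allows us to encode the observation in a simpler yet sufficient manner. This train of jargon is made simpler in the following example. We can define the test in a more compact manner by simply rearranging (8) and (9), with $\eta$ the threshold on the likelihood ratio and $\sigma^2$ the noise variance
Decide for A if $\mathbf{m}^T\mathbf{x} < \sigma^2 \ln \eta + \frac{1}{2}\mathbf{m}^T\mathbf{m}$
Decide for B if $\mathbf{m}^T\mathbf{x} > \sigma^2 \ln \eta + \frac{1}{2}\mathbf{m}^T\mathbf{m}$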
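It is clear that the term $\mathbf{m}^T\mathbf{x}$, which is essentially the dot product of the observation vector and the mean vector, alone can be used as a decision rule for this test. Hence, the term $\mathbf{m}^T\mathbf{x}$ is sufficient to decide the fate of the test. For the case of $\mathbf{m} = m\mathbf{1}_N$ (every element of the length-$N$ mean vector equal to the same constant $m$), it becomes $m\sum_{n=1}^{N} x_n$. This means, in this case, the sufficient statistic is simply the sum of all elements in $\mathbf{x}$ scaled by $m$. Now, using the sufficient statistic $\Upsilon(\mathbf{x}) = \mathbf{m}^T\mathbf{x}$, exactly the same test can be rewritten as
Decide for A if $\Upsilon(\mathbf{x}) < T$
Decide for B if $\Upsilon(\mathbf{x}) > T$
where the threshold $T$ stands for the right-hand side of the previous pair of inequalities.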
For this example, the sufficient statistic is again a Gaussian random variable under each hypothesis. This time, it is not a vector, rather a scalar.
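A quick simulation makes this tangible. The Python sketch below assumes zero-mean Gaussian noise with covariance $\sigma^2\mathbf{I}$ under A and the same noise shifted by $\mathbf{m}$ under B. In that model the statistic $\mathbf{m}^T\mathbf{x}$ has mean $0$ under A, mean $\mathbf{m}^T\mathbf{m}$ under B, and variance $\sigma^2\mathbf{m}^T\mathbf{m}$ in both cases. The function names, the mean vector, the seed and the trial count are hypothetical choices for illustration.

```python
import random
import statistics


def sufficient_statistic(x, m):
    """Dot product of the mean vector m and the observation x."""
    return sum(mi * xi for mi, xi in zip(m, x))


def simulate(m, sigma, signal_present, trials, rng):
    """Draw `trials` observations and return the statistic for each one."""
    samples = []
    for _ in range(trials):
        x = [(mi if signal_present else 0.0) + rng.gauss(0.0, sigma) for mi in m]
        samples.append(sufficient_statistic(x, m))
    return samples


rng = random.Random(0)   # fixed seed so the run is repeatable
m = [0.5] * 8            # illustrative mean vector, so m.m = 2.0
sigma = 1.0
stats_a = simulate(m, sigma, False, 20000, rng)
stats_b = simulate(m, sigma, True, 20000, rng)

print("under A: mean", statistics.fmean(stats_a), "variance", statistics.variance(stats_a))
print("under B: mean", statistics.fmean(stats_b), "variance", statistics.variance(stats_b))
```

The sample means should land near 0 and 2, and both sample variances near 2. So the whole $N$-dimensional problem collapses to telling apart two scalar Gaussians that differ only in their mean.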
In the next part, we will look for the optimum decision rule and develop the tools to evaluate the performance of our tests.
(The comic in the featured image is stolen from the amazing xkcd)