Statistics/Testing Data/compare-prop2

A running example from the 2004 American presidential race follows. The choice of poll, and of who is leading, is irrelevant to the presentation of the concepts. According to an October 2nd poll by Newsweek, 47% of 1,013 registered voters would vote for John Kerry/John Edwards if the election were held today, 45% would vote for George Bush/Dick Cheney, and 2% would vote for Ralph Nader/Peter Camejo.


 * Open a new Blank Workbook in the program Microsoft Excel.
 * Enter Kerry's reported percentage p in cell A1 (0.47).
 * Enter Bush's reported percentage q in cell B1 (0.45).
 * Enter the number of respondents N in cell C1 (1013). This can be found in most responsible reports on polls.
 * In cell A2, copy and paste the next line of text in its entirety and press Enter. This is the Microsoft Excel expression of the standard error of the difference as shown above.


 * =sqrt(A1*(1-A1)/C1+B1*(1-B1)/C1+2*A1*B1/C1)


 * In cell A3, copy and paste the next line of text in its entirety and press Enter. This is the Microsoft Excel expression of the probability that Kerry is leading based on the normal distribution given the logic here.


 * =normdist((A1-B1),0,A2,1)


 * Don't forget that the percentages must be entered in decimal form. The resulting probability in A3 will be 0.5 (50%) if A1 and B1 are equal, of course.
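For readers who do not use a spreadsheet, the same two steps can be sketched in Python. This is only an illustration of the spreadsheet formulas above: the function names are made up for this example, and `statistics.NormalDist` from the standard library stands in for Excel's `normdist`.

```python
from math import sqrt
from statistics import NormalDist

def standard_error_of_difference(p, q, n):
    """Mirror of cell A2: standard error of the difference of two poll fractions."""
    return sqrt(p * (1 - p) / n + q * (1 - q) / n + 2 * p * q / n)

def probability_leading(p, q, n):
    """Mirror of cell A3: cumulative normal probability at the observed difference."""
    se = standard_error_of_difference(p, q, n)
    return NormalDist(mu=0.0, sigma=se).cdf(p - q)

# Values from the Newsweek poll (cells A1, B1 and C1).
se = standard_error_of_difference(0.47, 0.45, 1013)
prob = probability_leading(0.47, 0.45, 1013)
print(f"standard error: {se:.4f}")
print(f"probability:    {prob:.3f}")
```

Because the two fractions are passed in decimal form, the result is also a decimal fraction rather than a percentage.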

The above text might be enough to do the necessary calculation, but it doesn't contribute to the understanding of the statistical test involved. Much too often people think statistics is merely a matter of calculation with complex formulas.

So here is the problem: Let p be the population fraction of the registered voters who vote for Kerry and q likewise for Bush. In a poll, n = 1013 respondents are asked to state their choice. A number K of respondents say they will vote for Kerry, and a number B say they will vote for Bush. K and B are random variables. The observed values of K and B are k and b respectively (numbers). So k/n is an estimate of p and b/n an estimate of q. The random variables K and B follow a trinomial distribution with parameters n, p, q and 1-p-q. Will Kerry be ahead of Bush? That is to say: is p > q? To investigate this we perform a statistical test, with null hypothesis:
 * $$\, H_0: p = q$$

against the alternative
 * $$\, H_1: p > q$$.

What is an appropriate test statistic T? We take:


 * $$\, T=K-B$$.

(In the above calculation $$T=\frac Kn - \frac Bn = \frac{K-B}n$$ is taken, which leads to the same result, because dividing T by n also divides its standard deviation by n.)

We have to state the distribution of T under the null hypothesis. We may assume T is approximately normally distributed.

It is quite obvious that its expectation under H0 is:


 * $$\, E_0T = 0$$.
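A small simulation can make the sampling model concrete. The sketch below draws repeated polls of 1013 respondents under the null hypothesis and records T = K - B for each. The common value p = q = 0.46, the seed, and the number of simulated polls are assumptions chosen only for illustration; they do not come from the poll.

```python
import random

def simulate_poll(n, p, q, rng):
    """Draw one poll and return (k, b): the counts for the two candidates."""
    k = b = 0
    for _ in range(n):
        u = rng.random()
        if u < p:            # respondent chooses the first candidate
            k += 1
        elif u < p + q:      # respondent chooses the second candidate
            b += 1
        # otherwise: another candidate or undecided
    return k, b

rng = random.Random(2004)    # fixed seed so that the run is reproducible
polls = [simulate_poll(1013, 0.46, 0.46, rng) for _ in range(500)]
differences = [k - b for k, b in polls]
mean_diff = sum(differences) / len(differences)
print(f"average of T over {len(polls)} simulated polls: {mean_diff:.2f}")
```

Individual values of T scatter widely from poll to poll, but their average stays close to 0, in line with the expectation stated above.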

Its variance under H0 is not as obvious. For a trinomial distribution the covariance of K and B is negative, cov(K,B) = -npq, so:


 * $$\, var_0(T) = var(K-B) = var(K) + var(B) - 2cov(K,B) = np(1-p) + nq(1-q) + 2npq $$.
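The variance formula can be checked numerically. The following sketch enumerates every possible outcome (k, b) of a small trinomial experiment, computes the variance of K - B exactly from the probabilities, and compares it with the closed form. The sample size n = 12 is an assumption made only to keep the enumeration short, and the function names are illustrative.

```python
from math import comb

def trinomial_pmf(k, b, n, p, q):
    """P(K = k, B = b) for cell probabilities p, q and 1 - p - q."""
    r = 1.0 - p - q
    return comb(n, k) * comb(n - k, b) * p**k * q**b * r**(n - k - b)

def exact_variance_of_difference(n, p, q):
    """Variance of K - B obtained by summing over all outcomes (k, b)."""
    mean = 0.0
    second_moment = 0.0
    for k in range(n + 1):
        for b in range(n - k + 1):
            weight = trinomial_pmf(k, b, n, p, q)
            t = k - b
            mean += weight * t
            second_moment += weight * t * t
    return second_moment - mean**2

def formula_variance(n, p, q):
    """Closed form: np(1-p) + nq(1-q) + 2npq."""
    return n * p * (1 - p) + n * q * (1 - q) + 2 * n * p * q

exact = exact_variance_of_difference(12, 0.47, 0.45)
closed = formula_variance(12, 0.47, 0.45)
print(exact, closed)
```

The two numbers agree up to floating-point rounding, which confirms that the covariance term enters with a plus sign in the final expression.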

We approximate the variance by using the sample fractions instead of the population fractions:


 * $$var_0(T) \approx 1013\times 0.47(1-0.47) + 1013\times 0.45(1-0.45) + 2\times 1013\times 0.47\times 0.45 \approx 931.6$$.

The standard deviation s will approximately be:


 * $$\, s = \sqrt{var_0(T)} \approx \sqrt{931.6} \approx 30.5 $$.
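The arithmetic is easy to get wrong by hand, so here is a short check of the three terms with the sample fractions 0.47 and 0.45 plugged in. The variable names are illustrative only.

```python
from math import sqrt

n = 1013
p_hat = 0.47   # sample fraction for Kerry
q_hat = 0.45   # sample fraction for Bush

var_k = n * p_hat * (1 - p_hat)       # estimate of var(K)
var_b = n * q_hat * (1 - q_hat)       # estimate of var(B)
cov_term = 2 * n * p_hat * q_hat      # estimate of -2 cov(K, B)

variance = var_k + var_b + cov_term
s = sqrt(variance)
print(f"variance: {variance:.2f}, standard deviation: {s:.2f}")
```

The three terms are roughly 252.3, 250.7 and 428.5, so the covariance term contributes almost half of the total variance.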

In the sample we have found a value t = k - b = (0.47 - 0.45) × 1013 = 20.26 for T. We will reject the null hypothesis in favour of the alternative for large values of T. So the question is: is 20.26 to be considered a large value for T? The criterion will be the so-called p-value of this outcome:


 * $$\, \text{p-value} = P(T\ge t; H_0) = P(T\ge 20.26; H_0) = P\left(Z\ge \frac{20.26}{30.5}\right) = 1-\Phi(0.66) \approx 0.25$$.
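The tail probability can also be computed without a table of the normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library; the function name is made up for this example, and the inputs are the rounded values t = 20.26 and s = 30.5 from the text.

```python
from statistics import NormalDist

def upper_tail_p_value(t, s):
    """P(T >= t) when T is approximately normal with mean 0 and standard deviation s."""
    return 1.0 - NormalDist(mu=0.0, sigma=s).cdf(t)

p_value = upper_tail_p_value(20.26, 30.5)
print(f"p-value: {p_value:.2f}")   # about 0.25
```

Working with fractions instead of counts, that is with t/n and s/n, gives the same value, since only the ratio t/s matters.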

This p-value is far above conventional significance levels such as 0.05, so there is no reason whatsoever to reject the null hypothesis.