IB Mathematics (HL)/Further Statistics

Topic 8: Option - Statistics and Probability
As learnt in the statistics and probability section of the core syllabus, the expected value $$\mathrm{E}(X)$$, or the mean $$\mu$$, of a distribution $$X$$ is: $$\mu = \mathrm{E}(X) = \begin{cases} \sum xP(x), & \mbox{for discrete } X \\ \int xf(x)dx,  & \mbox{for continuous } X \end{cases}$$ The variance of a distribution is defined as: $$ \begin{align} \sigma^2 = \mathrm{Var}(X) &= \mathrm{E}((X-\mu)^2)\\ &= \mathrm{E}(X^2) - (\mathrm{E}(X))^2 \end{align} $$ If X is discrete, this identity can be derived as follows: $$ \begin{align} \sigma^2 = \mathrm{Var}(X) &= \mathrm{E}((X-\mu)^2)\\ &= \sum(x-\mu)^2P(x) \\ &= \sum x^2P(x) - 2\mu\sum xP(x) + \mu^2\sum P(x) \\ &= \sum x^2P(x) - 2\mu^2 + \mu^2 & \left \{\sum xP(x) \mbox{ is } \mu \mbox{ and } \sum P(x) \mbox{ is } 1\right \}\\ &= \sum x^2P(x) - \mu^2 \\ \therefore \sigma^2 = \mathrm{Var}(X) &= \mathrm{E}(X^2) - (\mathrm{E}(X))^2 \end{align} $$

Linear Transformations
If one displaces and scales a distribution, the mean and the variance change in accordance with the following formulas: $$\mathrm{E}(aX+b)=a\mathrm{E}(X)+b$$ $$\mathrm{Var}(aX+b)=a^2\mathrm{Var}(X)$$ Notice that the variance is unaffected by the value of b: displacing a distribution never changes its spread. Only the value a, representing a horizontal stretch of the distribution, modifies the spread.

Linear Combinations of Independent Random Variables
When one takes multiple samples of a random variable X, these are often independent random variables, and in that case the distribution must be treated differently. $$\mathrm{E}(a_1X_1\pm a_2X_2)=a_1\mathrm{E}(X_1)\pm a_2\mathrm{E}(X_2)$$ $$\mathrm{Var}(a_1X_1\pm a_2X_2)=a_1^2\mathrm{Var}(X_1)+a_2^2\mathrm{Var}(X_2)$$ Notice that the variances are always added, never subtracted. Also, the random variables need not be drawn from the same population X.

These rules also apply to situations with n independent random variables. $$\mathrm{E}(a_1X_1\pm a_2X_2\pm...\pm a_nX_n)=a_1\mathrm{E}(X_1)\pm a_2\mathrm{E}(X_2)\pm...\pm a_n\mathrm{E}(X_n)$$ $$\mathrm{Var}(a_1X_1\pm a_2X_2\pm...\pm a_nX_n)=a_1^2\mathrm{Var}(X_1)+a_2^2\mathrm{Var}(X_2)+...+a_n^2\mathrm{Var}(X_n)$$ The derivation of this rule is beyond the scope of the syllabus.
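These rules can be checked numerically. The following is a minimal simulation sketch (the distributions and coefficients are arbitrary choices for illustration): a die roll $$X_1$$ has mean 3.5 and variance 35/12, and a Bernoulli variable $$X_2$$ with p = 0.4 has mean 0.4 and variance 0.24.

```python
import random
import statistics

# Monte Carlo check of the linear-combination rules.
random.seed(0)
N = 200_000
a1, a2 = 3, 2

# Two independent random variables with known mean and variance:
# X1 ~ Uniform{1..6} (mean 3.5, variance 35/12), X2 ~ Bernoulli(0.4).
x1 = [random.randint(1, 6) for _ in range(N)]
x2 = [1 if random.random() < 0.4 else 0 for _ in range(N)]

u = [a1 * v1 - a2 * v2 for v1, v2 in zip(x1, x2)]

# E(a1*X1 - a2*X2) = a1*E(X1) - a2*E(X2) = 3*3.5 - 2*0.4 = 9.7
# Var(a1*X1 - a2*X2) = a1^2*Var(X1) + a2^2*Var(X2)   (variances ADD)
#                    = 9*(35/12) + 4*(0.24) = 27.21
mean_u = statistics.fmean(u)
var_u = statistics.pvariance(u)
```

Note that the simulated variance matches $$a_1^2\mathrm{Var}(X_1)+a_2^2\mathrm{Var}(X_2)$$ even though the combination uses a minus sign.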

Unbiased estimators of mean and variance
$$ \begin{align} \mathrm{E}(\overline{X})& = \mathrm{E}(\tfrac{X_1+X_2+...+X_n}{n})& \\ & = \tfrac{\mathrm{E}(X_1)+\mathrm{E}(X_2)+...+\mathrm{E}(X_n)}{n}&\{\mbox{by linearity of expectation}\}\\ & = \tfrac{n\mu}{n}&\{n\mbox{ terms, each equal to }\mu\}\\ \mathrm{E}(\overline{X}) & = \mu & \therefore \overline{x} \mbox{ is an unbiased estimate of } \mu \end{align}$$

The unbiased estimate of the population variance is obtained by multiplying the sample variance by $$\tfrac{n}{n-1}$$: $$S_{n-1}^2 = \tfrac{n}{n-1}S_n^2$$
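The n/(n-1) correction is exactly the difference between the two variance functions in Python's standard library, as this small sketch (with made-up data) shows:

```python
import statistics

# statistics.pvariance divides by n (the "population" variance s_n^2 of
# the sample); statistics.variance divides by n-1, giving the unbiased
# estimate s_{n-1}^2 directly.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)

s_n2 = statistics.pvariance(data)     # divides by n      -> 4.0
s_n1_2 = statistics.variance(data)    # divides by n - 1  -> 32/7
```

Multiplying `s_n2` by `n / (n - 1)` reproduces `s_n1_2`.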

Discrete Distributions
Binomial, B(n,p) When X~B(n,p), P(X=x) denotes the probability of x successes when n trials, each with probability of success p, are performed. Applies when:
 * There are exactly two possible outcomes
 * The number of trials is fixed
 * Each trial is independent of the outcomes of other trials
 * The probability of success remains constant from trial to trial.
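The binomial probability mass function can be written directly from its definition. A minimal sketch, using X~B(10, 0.5) as an arbitrary example:

```python
from math import comb

# Binomial pmf: P(X = x) = C(n, x) * p^x * (1-p)^(n-x)
def binom_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p)**(n - x)

# e.g. X ~ B(10, 0.5): probability of exactly 5 successes in 10 trials
p5 = binom_pmf(5, 10, 0.5)                             # 252/1024
# The pmf sums to 1 over all possible outcomes 0..n.
total = sum(binom_pmf(x, 10, 0.5) for x in range(11))
```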

Negative Binomial, NB(r,p) Models the number of Bernoulli trials B(1,p) required to achieve r successes. The combinatorial coefficient in the probability mass function is merely to account for the number of ways such a number of successes could be arranged.

Geometric, Geo(p) Models the number of Bernoulli trials B(1,p) needed until the first success, ie it is equivalent to NB(1,p). No combinatorial coefficient is needed because "counting" stops once the first success has been achieved; hence there is only one possible arrangement of outcomes.

Poisson, Po(m) A Poisson distribution measures the number of successes in:
 * a fixed interval
 * an infinite number of trials.
For instance, there could potentially be a huge number of phonecalls per hour, but if the mean is two per hour, the chance of this is slim. NB: This assumes that m is constant, which in real situations is not true (eg frequency of phonecalls depends on time of day, day of week, etc). Questions of this type often involve converting between time intervals: for instance, if the mean is two calls per hour and the probability of a certain number of calls in 5 hours is needed, the mean used would be 10. Also specific to this distribution is the fact that E(X) = Var(X) = m.
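The interval conversion and the E(X) = Var(X) property can both be checked from the pmf. A sketch using the two-calls-per-hour example:

```python
from math import exp, factorial

# Poisson pmf: P(X = x) = m^x * e^(-m) / x!
def pois_pmf(x, m):
    return m**x * exp(-m) / factorial(x)

# Mean of 2 calls per hour -> over 5 hours use m = 2 * 5 = 10.
p_none_in_hour = pois_pmf(0, 2)     # e^-2, no calls in one hour
p_ten_in_5hrs = pois_pmf(10, 10)    # exactly 10 calls in 5 hours

# E(X) = Var(X) = m; check numerically over a long tail (m = 2).
mean = sum(x * pois_pmf(x, 2) for x in range(100))
var = sum(x**2 * pois_pmf(x, 2) for x in range(100)) - mean**2
```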

Continuous Distributions
These can follow any function $$f(x)$$ where:
 * $$ {f(x)}\geq{0}$$ for all $${x}\in$$ the domain of f(x).
 * $$\int_{a}^{b}f(x)\, dx = 1$$ if the domain of $$f(x)$$ is $$a \leq x \leq b$$.
Furthermore, x can be any value $$-\infty \leq x \leq \infty$$ that is within the domain of $$f(x)$$. Cumulative probabilities are calculated by integrating $$f(x)$$. In addition, the syllabus expects knowledge of three particular continuous probability distributions.

Exponential, Exp(λ) This distribution models the expected interval between events (assumed to be instantaneous) in a Poisson distribution. Eg for Po(2) calls an hour, the expected number of calls in one hour is 2; the expected value of the exponential distribution, $$\tfrac{1}{\lambda}$$, is half an hour. The exponential distribution can also be seen as the continuous equivalent of the geometric distribution, which models the time until the first success.
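The link between the two distributions in the phone-call example can be verified directly: the probability of waiting more than one hour under Exp(2) equals the probability of zero calls in one hour under Po(2). A minimal sketch:

```python
from math import exp

# Calls arrive at lam = 2 per hour; waiting times follow
# f(t) = lam * e^(-lam * t), with mean 1/lam = 0.5 hours.
lam = 2

def exp_cdf(t, lam):
    return 1 - exp(-lam * t)        # P(wait <= t)

# P(no call in the first hour) from Exp(2)...
p_wait_over_1h = 1 - exp_cdf(1, lam)
# ...equals P(X = 0) for X ~ Po(2) over the same hour: e^-2.
p_zero_calls = exp(-2)
```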

Normal, N(μ,σ²) This is the most interesting distribution, and the most relevant to the statistics option. Due to the central limit theorem, a large portion of the Statistics option is based on the normal distribution. On questions about the normal distribution, the question must state that the data at hand “follows a normal distribution”, “is normally distributed”, etc. This makes it easy to identify. The standard normal variable Z follows the distribution N(0,1); converting to Z-scores using $$Z = \tfrac{X-\mu}{\sigma}$$ is important for calculating confidence intervals and for hypothesis testing.

Normal approximation to the binomial distribution For large values of n, X~B(n,p) can be approximated as X~N(np,npq). (This can be shown on a histogram.) There are different estimates for how large n should be; np and nq greater than 5 is sometimes considered sufficient, but the IB states $$np \geq 10$$ and $$nq \geq 10$$ as its rules. In situations which do not satisfy these conditions, it should be clearly stated that the approximation is not good.
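The quality of the approximation can be seen by comparing an exact binomial probability with its normal counterpart. A sketch using the arbitrary example X~B(100, 0.5) (a continuity correction of 0.5 is applied, which the syllabus does not require but which improves the match):

```python
from math import comb, erf, sqrt

# Compare P(X <= 55) for X ~ B(100, 0.5) with N(np, npq).
n, p = 100, 0.5
q = 1 - p

exact = sum(comb(n, x) * p**x * q**(n - x) for x in range(56))

def norm_cdf(x, mu, sigma):
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

# np = 50, npq = 25; continuity correction uses 55.5 rather than 55.
approx = norm_cdf(55.5, n * p, sqrt(n * p * q))
```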

Summary of Distributions
The summary of equations, functions and notation for each distribution is shown below.

Linear Combinations
It is often useful to combine variables, eg to determine the probability that $$X > Y$$. This is done by rearranging the inequality and defining a new variable, for instance $$U$$, as the resulting combination: $$\mathrm{P}(X > Y) = \mathrm{P}(X-Y > 0)$$ $$\text{let } U = X-Y\!$$

We now need to find the distribution of $$U$$ using the rules:
 * $$\mathrm{E}(X \pm Y) = \mathrm{E}(X) \pm \mathrm{E}(Y)$$
 * $$\mathrm{Var}(X \pm Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$$
Note that the variances are always added. It is also important to convert standard deviation to variance before attempting to combine variables.
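As a worked sketch with hypothetical numbers: take independent normals X ~ N(105, 16) and Y ~ N(100, 9), and find P(X > Y) via U = X - Y.

```python
from math import erf, sqrt

# Hypothetical example: X ~ N(105, 16), Y ~ N(100, 9), independent.
mu_x, var_x = 105, 16
mu_y, var_y = 100, 9

# E(U) = E(X) - E(Y); Var(U) = Var(X) + Var(Y)  (variances still add)
mu_u = mu_x - mu_y      # 5
var_u = var_x + var_y   # 25

def norm_cdf(x, mu, sigma):
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

# P(X > Y) = P(U > 0), where U ~ N(5, 25).
p_x_gt_y = 1 - norm_cdf(0, mu_u, sqrt(var_u))
```

Here z = (0 - 5)/5 = -1, so the answer is Φ(1) ≈ 0.841.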

Questions also discuss combinations where multiple picks are made. Note the difference between
 * $$U = X_1 + X_2 + X_3 + X_4\!$$ and
 * $$U = 4X\!$$.
In the first, four separate picks are made, and the variance is $$4\mathrm{Var}(X)$$. In the second, the value of one pick is multiplied by four, and the variance is $$\mathrm{Var}(4X) = 16\mathrm{Var}(X)$$. In short, separate picks should be treated as separate variables, despite having the same μ and σ.
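The distinction shows up clearly in a simulation. A sketch using a fair die (Var(X) = 35/12) as an arbitrary example:

```python
import random
import statistics

# Four separate picks X1+X2+X3+X4 versus one pick scaled, 4X.
# Theory: Var(sum of 4 picks) = 4 Var(X), but Var(4X) = 16 Var(X).
random.seed(1)
N = 200_000
var_x = 35 / 12  # variance of one fair die

four_picks = [sum(random.randint(1, 6) for _ in range(4)) for _ in range(N)]
one_pick_x4 = [4 * random.randint(1, 6) for _ in range(N)]

var_sum = statistics.pvariance(four_picks)   # ~ 4 * 35/12
var_4x = statistics.pvariance(one_pick_x4)   # ~ 16 * 35/12
```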

Central limit theorem
When taking samples from a non-normal population $$X$$ whose mean is μ and variance is σ², the means of these samples will be approximately normally distributed as $$\bar X \sim N(\mu, \tfrac{\sigma^2}{n})$$, where n is the number of data points each sample is based on (the sample size). This applies when $$n \geq 30$$.

The samples must be independent. Although a proof of the CLT is not required, it may be useful to see the link to a binomial distribution: each sample mean either satisfies $$\bar x < \mu$$ or $$\bar x \geq \mu$$, with constant and independent probability, so counts of sample means are binomially distributed. We already know from the normal approximation to the binomial distribution that when the sample size is large enough, the distribution will be approximately normal.

This is a reliable approximation for sample sizes $$n \geq 30$$. Note that the variance, $$\tfrac{\sigma^2}{n}$$, of the normal distribution decreases with larger values of n, meaning that the probability distribution will be narrower, ie more precise. The "standard deviation" of the normal distribution when using the CLT, $$\tfrac{\sigma}{\sqrt{n}}$$, is also known as the sampling error or standard error.
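The theorem can be demonstrated by sampling from a deliberately skewed population. A sketch using an exponential population with μ = 1 and σ² = 1 (arbitrary choices for illustration):

```python
import random
import statistics

# CLT sketch: sample means from a skewed (exponential) population with
# mu = 1, sigma^2 = 1 cluster around mu with variance sigma^2 / n.
random.seed(2)
n = 36           # sample size (>= 30)
trials = 50_000  # number of samples drawn

means = [statistics.fmean(random.expovariate(1) for _ in range(n))
         for _ in range(trials)]

grand_mean = statistics.fmean(means)   # ~ mu = 1
se = statistics.pstdev(means)          # ~ sigma / sqrt(n) = 1/6
```

The spread of the sample means matches the standard error $$\tfrac{\sigma}{\sqrt{n}}$$, not the population standard deviation.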

Normality of a proportion
Proportions with large sample sizes also follow a normal distribution. Following a similar logic as for sample means, each member of a sample is either a success or a failure. The probability of success is considered fixed, meaning that the number of successes is binomially distributed. The sample proportion is defined as: $$ \widehat{p} = \frac{X}{n}, \text{ where } \begin{cases} \widehat{p} = \text{ sample proportion} \\ X = \text{ number of successes} \\ n = \text{ sample size.} \end{cases} $$ If p is the true proportion of successes and n is the sample size then X~B(n,p). Hence we can show that:
 * Expected value $$\mathrm{E}(\widehat{p}) = \mathrm{E}(\tfrac{1}{n}X) = \tfrac{1}{n}\mathrm{E}(X) = \tfrac{1}{n} \times np = p$$
 * Variance $$\mathrm{Var}(\widehat{p}) = \mathrm{Var}(\tfrac{1}{n}X) = \tfrac{1}{n^2}\mathrm{Var}(X) = \tfrac{npq}{n^2} = \tfrac{pq}{n}$$
By the Central Limit Theorem, for large values of n, $$\widehat{p} \sim \mathrm{N}(p,\tfrac{pq}{n})$$.
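These two results can be checked by simulating many sample proportions. A sketch with hypothetical values p = 0.3 and n = 100 (so np = 30 and nq = 70, satisfying the rules above):

```python
import random
import statistics
from math import sqrt

# Sampling distribution of p-hat = X/n for X ~ B(n, p).
random.seed(3)
p, n, trials = 0.3, 100, 20_000

phats = [sum(random.random() < p for _ in range(n)) / n
         for _ in range(trials)]

mean_phat = statistics.fmean(phats)   # ~ p
sd_phat = statistics.pstdev(phats)    # ~ sqrt(pq/n)
```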

Confidence Intervals
A confidence interval is a range around a sample statistic which, at a given confidence level, can be expected to contain the true population value. For instance, a 90% confidence interval of 2±0.01 kg means that intervals constructed in this way will contain the true mean weight 90% of the time. Confidence intervals work the same way for sample means and for proportions: in each case the interval is centred on $$\bar{x}$$ or $$\widehat{p}$$. The difference arises when the population variance is not known. These two situations are explored below.

When σ is known
The data booklet gives the expression for a confidence interval as: $$ \bar{x} \pm z \times \frac{\sigma}{\sqrt{n}}$$ (given $$n \geq 30$$). For a proportion, the same expression is written in terms of the standard error of the sample proportion: $$ \widehat{p} \pm z \times \sqrt{\frac{\widehat{p}\widehat{q}}{n}}$$ (when $$np \geq 10$$ and $$nq \geq 10$$).

$$z$$ is the z-score corresponding to the percentage of the confidence interval. This can be looked up using the tables which occupy the last few pages of the data booklet or using the invNorm function on the calculator. Note, however, that entering invNorm(0.9) will not give the z-score of a 90% confidence interval: the remaining 10% must be distributed evenly above and below the target range. Therefore z = invNorm(0.95). This can be clearly seen in the illustration above - we want to find the value of a, and should therefore use either 0.95 or 0.05. This is the same concept as a two-tailed test (described below) - if we were saying that 90% of the values were below a certain value, then we would use invNorm(0.9).
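The invNorm step corresponds to the inverse normal cdf in Python. A sketch, including a 90% interval for a made-up sample (x̄ = 2, σ = 0.05, n = 64):

```python
from math import sqrt
from statistics import NormalDist

# The calculator's invNorm corresponds to NormalDist().inv_cdf here.
z_90 = NormalDist().inv_cdf(0.95)    # NOT inv_cdf(0.90): 5% in each tail
z_95 = NormalDist().inv_cdf(0.975)   # ~ 1.960
z_98 = NormalDist().inv_cdf(0.99)    # ~ 2.326

# A 90% interval for a hypothetical sample: x-bar = 2, sigma = 0.05, n = 64.
xbar, sigma, n = 2, 0.05, 64
half_width = z_90 * sigma / sqrt(n)
ci = (xbar - half_width, xbar + half_width)
```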

Calculator functions ZInterval: Either enter a set of data or statistics. Note that in both cases, σ is explicitly requested. Enter the C(onfidence)-Level as a fraction. When using data, select the list name and set Frequency=1. 1-PropZInt: x is the number of successes in n trials. Enter the C(onfidence)-Level as a fraction.

When σ is unknown
When the population standard deviation, σ, is unknown, we must approximate it using the sample data: $$S_{n-1}$$ is used as the unbiased estimate of σ. Note that the sample standard deviation $$S_n$$ may be known without the population standard deviation being known. When σ was known, we said that $$\frac{\bar{X}-\mu}{\tfrac{\sigma}{\sqrt{n}}} = Z$$, the standard normal distribution N(0,1). Likewise, when σ is not known, $$\frac{\bar{X}-\mu}{\tfrac{S_{n-1}}{\sqrt{n}}} = t$$ follows the t-distribution. It is simply a "fatter" version of the standard normal curve N(0,1).

When using a t-distribution, state the degrees of freedom: ν = n-1 where, as usual, n is the sample size. (This gains significance in hypothesis testing.)

Calculator function TInterval: Data or statistics. When using data, same input method as ZInterval is used. Note that $$S_x$$ is the same as $$S_{n-1}$$, and has to be calculated manually from the sample standard deviation $$S_n$$ using $$S^2_{n-1} = \tfrac{n}{n-1}S^2_n$$ (from data booklet).
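The manual conversion from $$S_n$$ to $$S_{n-1}$$ is one line. A sketch with hypothetical numbers (n = 10, sample standard deviation 4.2):

```python
from math import sqrt

# Data booklet: s_{n-1}^2 = n/(n-1) * s_n^2.
n, s_n = 10, 4.2
s_n1 = sqrt(n / (n - 1) * s_n ** 2)   # the value to enter as Sx
```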

Determining appropriate sample sizes
In order for an estimate to be sufficiently precise, the sample size must be large enough. As n increases, the variance falls, increasing the precision (the distribution narrows). The example below is taken from Haese & Harris' IBDP Mathematics (Options):

"How large should a sample be if we wish to be 98% confident that the sample mean will differ from the population mean by less than 0.3 if we know that population standard deviation σ = 1.365?"

This means that $$-0.3 < \mu - \bar{x} < 0.3$$ [where $$ \bar{x} $$ is the furthest acceptable point from μ.]

From the data booklet formula $$ \mu = \bar{x} \pm z \times \tfrac{\sigma}{\sqrt{n}} $$ we know that:

$$ \bar{x} - z \times \tfrac{\sigma}{\sqrt{n}} < \mu < \bar{x} + z \times \tfrac{\sigma}{\sqrt{n}}$$ and invNorm(0.99) [not 0.98!] is 2.326: $$-2.326\tfrac{\sigma}{\sqrt{n}} < \mu - \bar{x} < 2.326\tfrac{\sigma}{\sqrt{n}}$$

$$\therefore 2.326\tfrac{\sigma}{\sqrt{n}} = 0.3$$

$$\sqrt{n} = \tfrac{2.326\sigma}{0.3} = \tfrac{2.326 \times 1.365}{0.3} \approx 10.583$$

$$n \approx 112$$ So a sample of 112 should be taken to be 98% sure that sample means will differ from the population mean by less than 0.3 (n was rounded up to 112).

Note that for proportions, $$\widehat{p}$$ might not always be known. In such a case the largest possible error should be used, ie $$ \pm z \sqrt{\tfrac{\left ( \frac{1}{2} \right )\left ( \frac{1}{2} \right )}{n}} \to \pm z \tfrac{1}{2\sqrt{n}} $$ As above, this is set equal to the maximum acceptable range, eg 0.03 if the proportion must be "within 3%".
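The worst-case proportion calculation can be sketched the same way as the worked example above. Here a hypothetical 95% interval "within 3%" is assumed:

```python
from math import ceil
from statistics import NormalDist

# Worst-case proportion sample size: set z * 1/(2*sqrt(n)) = 0.03
# and solve for n, rounding UP.
z = NormalDist().inv_cdf(0.975)   # ~ 1.960 for a 95% interval
max_error = 0.03

n = ceil((z / (2 * max_error)) ** 2)
```

Solving gives n = 1068, since n must always be rounded up to guarantee the required precision.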

Significance/Hypothesis Testing
The aim of hypothesis testing is to consider the validity of hypotheses at particular levels of significance and come to a conclusion regarding their accuracy. The idea is to:
 * Formulate a hypothesis
 * Collect sample data
 * Determine whether the data supports the hypothesis
A level of significance is much like the confidence level of a confidence interval. For example, a 90% confidence interval contains 90% of the spread, while a test at a 10% level of significance rejects the null hypothesis with a 10% chance of doing so in error.

Null and alternative hypothesis
In any hypothesis test, there will be two mutually exclusive hypotheses:
 * $$H_0$$, the null hypothesis, which states equality. This is assumed true until proven false.
 * $$H_1$$, the alternative hypothesis, which is adopted if $$H_0$$ has been proven false by random sample data.

For example, testing that the mean number of phone-calls per hour is greater than 6:
 * $$H_0 : \mu = 6$$
 * $$H_1 : \mu > 6$$

An alternative hypothesis can either be one-sided, as above, or two-sided. If we wanted to prove that the mean number of phone-calls is not 6, we would say that $$H_1: \mu \neq 6$$. This would imply either that $$\mu > 6$$ or that $$\mu < 6$$. This creates a slight difference in how the probability is calculated (see the invNorm argument in Confidence Intervals) but this is largely handled by the calculator.

Significance Testing for Mean and Proportion
To perform a test, data must be collected, giving a value for $$\bar{x}$$. Then the z- or t-score ($$z*\!$$ or $$t*\!$$) is calculated, depending on whether the population σ is known, using
 * $$z = \frac{\bar{x} - \mu}{\tfrac{\sigma}{\sqrt{n}}}$$ or
 * $$t = \frac{\bar{x} - \mu}{\tfrac{S_{n-1}}{\sqrt{n}}}$$ (remembering to state $$n-1$$ degrees of freedom).
When using proportions, substitute:
 * $$\bar{x} = \widehat{p} = \tfrac{x}{n}$$
 * $$\sigma = \sigma_{\widehat{p}} = \sqrt{\tfrac{pq}{n}}$$
The CLT requirements need to be met for each respective method: $$n \geq 30$$ for sample means, and $$np \geq 10$$ and $$nq \geq 10$$ for proportions.
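A one-tailed z-test can be sketched end to end. The numbers here are hypothetical: testing H₀: μ = 6 against H₁: μ > 6 with σ = 2, n = 36 and an observed x̄ = 6.8:

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical one-tailed z-test: H0: mu = 6, H1: mu > 6.
mu0, sigma, n, xbar = 6, 2, 36, 6.8

z_star = (xbar - mu0) / (sigma / sqrt(n))   # test statistic, = 2.4
p_value = 1 - NormalDist().cdf(z_star)      # one-tailed: P(Z >= z*)

reject_at_5pct = p_value < 0.05
```

Here z* = 2.4 gives a p-value of about 0.008, so H₀ would be rejected at the 5% (and even the 1%) level.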

The next step is to determine the p-value, the probability of a z- or t-score at least as extreme as the one observed. For z-scores this can be done with normalcdf, but for t-scores the whole process must be done on the calculator (explained below). The p-value measures the likelihood of obtaining $$\bar{x}$$ given a mean of μ and standard deviation σ. If the p-value is low, then either the sample is unrepresentative (the sample size could be increased to verify this) or the mean is not μ (reject the null hypothesis). The cut-off below which the p-value is considered "too" low is the level of significance: given a 0.05 (5%) level of significance, $$H_0$$ will be rejected if the p-value is below 0.05.

For two-tailed tests ($$\neq$$), the p-value is the probability $$\mathrm{P}(t \geq t*) + \mathrm{P}(t \leq -t*)$$ whereas for a single-tailed test ($$<$$ or $$>$$) it is simply $$\mathrm{P}(t \geq t*)$$ or the equivalent in terms of $$z\!$$ and $$z*\!$$.

Steps to hypothesis testing
 * 1) State $$H_0$$, $$H_1$$ and whether the test is one- or two-tailed.
 * 2) State whether z- or t-distribution, calculate corresponding test statistic $$z*\!$$ or $$t*\!$$.
 * 3) State decision rule (reject $$H_0$$ if p-value is...). Calculate p-value for test statistic.
 * 4) Make decision: "Reject $$H_0$$" (ie accept $$H_1$$) or "Accept $$H_0$$".
 * 5) Brief statement putting the decision into context.

The "brief" statement should involve as much of the information from the question as possible. For example, "Based on a sample of 200 cookies, insufficient evidence is provided to accept at the 1% level of significance that more than 60% of cookies contain chocolate".

There is also a slightly different way of determining whether the calculated z-score is significant. Instead of converting the z-score into a p-value which is then compared to the level of significance, calculate the critical z-score based on the level of significance, then use logic to determine whether the calculated z-score falls within the rejection region. For instance, when testing at a 5% level of significance with a "<" one-tailed test, $$H_0$$ can be rejected if $$z*\!$$ < invNorm(0.05). For a ">" one-tailed test, reject if $$z*\!$$ > invNorm(0.95), and for a two-sided test reject if $$\left\vert z* \right\vert \geq $$ invNorm(0.975).

Calculators (TI) Calculator functions are quite straightforward to use. Sorting out which variables are known will indicate the right function in the case that it isn't clear to start with. In general:
 * $$\mu_0:$$ and $$\mu:$$ combine to form the alternative hypothesis. For example, if $$H_1: \mu < 6 $$ then $$\mu_0 = 6$$ and $$\mu < \mu_0$$. The same applies to "$$\text{prop}$$" and $$P_0$$ in the 1-PropZTest.
 * $$x$$ is the number of successes
 * $$n$$ is the number of trials
 * $$\bar{x}$$ is the sample mean
 * $$\sigma$$ is the population standard deviation (z-tests) and $$S_x$$ is $$S_{n-1}$$, the unbiased estimate of the population standard deviation (t-tests).

Type I and II errors

 * Falsely rejecting $$H_0$$ is a Type I error. The chance of this occurring is equal to the level of significance at which the test is performed.
 * Falsely accepting $$H_0$$ is a Type II error. The chance of making a Type II error increases with stricter levels of significance, as the critical region shrinks. Calculating the probability of a Type II error requires an alternative value, ie the "true" mean: a Type II error is accepting that the mean is a when it is in fact b, so its probability depends on b. It is the chance of the sample mean falling in the acceptance region when the true mean is b, which can be calculated using normalcdf.
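A Type II error calculation can be sketched with hypothetical numbers: testing H₀: μ = 50 against H₁: μ > 50 at the 5% level, with σ = 6 and n = 36, when the true mean is actually 53:

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical test: H0: mu = 50, H1: mu > 50, 5% level, true mean 53.
mu0, true_mu, sigma, n = 50, 53, 6, 36
se = sigma / sqrt(n)   # standard error = 1.0

# Critical value: reject H0 whenever x-bar exceeds this.
crit = mu0 + NormalDist().inv_cdf(0.95) * se   # ~ 51.645

# Type II error: x-bar falls BELOW crit even though the true mean is 53.
beta = NormalDist(true_mu, se).cdf(crit)
```

Here β ≈ 0.088: even with the true mean at 53, the test fails to reject H₀ about 9% of the time.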

Chi-Squared Distribution
This distribution can be used to test whether a data set follows a particular distribution by comparing expected values to observed data. It can likewise be used to hypothesise whether two variables are dependent.

The chi-squared (χ²) distribution depends on the number of degrees of freedom: the higher the degrees of freedom, the closer to the normal curve it becomes.

Note that all χ2 tests are one-tailed. So don't go dividing the p-value in two or any of that funny stuff.

Goodness of fit
For GOF tests, $$H_0$$ states that the data follows a distribution while $$H_1$$ states that it does not follow the distribution. For example:
 * $$H_0$$: The data is from a uniform distribution
 * $$H_1$$: The data is not from a uniform distribution.

Degrees of freedom ν = number of classes (n) - number of restrictions (k). When there do not appear to be any restrictions (most cases), k = 1, because the class frequencies must sum to the total: once the values of all but one class have been found, the last class is unable to fluctuate. This means that in general for GOF-tests ν = n - 1.

To calculate the probability that a data set follows a particular distribution, enter the observed and expected values into separate lists. If any of the expected frequencies are below five, combine that group with a neighbouring group. This avoids dividing by small numbers when calculating the test statistic, which would produce disproportionately large values. Because the number of classes falls when this is done, decrease the degrees of freedom accordingly. Additionally, subtract 1 from ν for each statistic ($$m, \bar{x}, p, \mu, \sigma^2$$) which is used to calculate the expected data but which is itself estimated from the observed data.

Then perform a χ²GOF-Test using that data. As with hypothesis testing, the p-value shows how likely it is to get a χ²-score greater than the one achieved with this data set, so if it is smaller than the level of significance, $$H_0$$ should be rejected and it can be concluded that the data does not follow the distribution at this level.
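The test statistic itself, $$\chi^2 = \sum \tfrac{(f_o - f_e)^2}{f_e}$$, is easy to compute by hand. A sketch with hypothetical die-roll data, compared against the tabulated critical value for ν = 5 at the 5% level (11.07):

```python
# Hypothetical GOF test: are 120 die rolls uniform?
observed = [22, 15, 17, 26, 12, 28]   # 120 rolls
expected = [20] * 6                   # uniform: 120 / 6 in each class

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# nu = 6 classes - 1 = 5; critical value at the 5% level is 11.07.
reject_uniform = chi2 > 11.07
```

Here χ² = 10.1 < 11.07, so H₀ (uniformity) is not rejected at the 5% level.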

Contingency tables
When testing for the independence of variables, a two-variable contingency table is used. This lists the combinations of frequencies of the two variables. For instance:

When entering the data into the calculator the "total" column and row are omitted. Expected values do not have to be calculated manually (this would be done by multiplying the row total by the column total and dividing by the overall total, eg 70×44/200 = 15.4, the first cell in the expected value table). Instead, a χ²-test is performed and the expected values are inserted into the "Expected" matrix. Degrees of freedom in a χ²-test are (rows-1)(columns-1). For a two-by-two contingency table, ν = 1; normally one would use Yates's continuity correction, but this has been removed from the syllabus, so just proceed as normal with ν = 1.
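The expected-value rule can be sketched for a hypothetical 2×3 table, chosen here so that the first cell reproduces the 70×44/200 = 15.4 example above:

```python
# Hypothetical 2x3 contingency table (row totals 70 and 130,
# first column total 44, grand total 200).
observed = [[20, 25, 25],
            [24, 40, 66]]

row_totals = [sum(row) for row in observed]         # [70, 130]
col_totals = [sum(col) for col in zip(*observed)]   # [44, 65, 91]
grand = sum(row_totals)                             # 200

# Expected frequency = row total * column total / grand total.
expected = [[r * c / grand for c in col_totals] for r in row_totals]

# Degrees of freedom = (rows - 1)(columns - 1).
nu = (len(observed) - 1) * (len(observed[0]) - 1)   # 1 * 2 = 2
```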

For independence test:
 * $$H_0$$: Variables are independent.
 * $$H_1$$: Variables are dependent.

As always, the p-value is the probability of a χ²-value larger than that observed occurring. $$H_0$$ is therefore rejected if the p-value is lower than the level of significance.