n = number in a trial or sample.

The aim of statistical testing is to uncover a significant difference when one actually exists. In its simplest form this involves comparing samples between one regime and another (which may be a control). Sample size is important because:

- Larger samples increase the chance of finding a significant difference, but
- Larger samples cost more money.

The sample size is chosen to maximise the chance of detecting a specific mean difference that is also statistically significant. Please note that specific difference and statistically significant are two quite different ideas.

- The specific difference is chosen by the researcher in terms of the outcome measure of the experiment: for instance, a 3kg mean weight change in a diet experiment, or a 10% mean improvement in a teaching method experiment.
- Statistical significance is a probability statement telling us how likely it is that a difference as large as the one observed could have arisen by chance alone.

The reason larger samples increase your chance of significance is that they more reliably reflect the population mean.

Imagine we are doing a trial on whether a particular diet regime helps with weight loss. A random sample of people is chosen, and each person is weighed before and after the diet, giving us their weight change. Finally, we work out the mean weight change of the entire sample. To get a statistically significant result we want a result which would be unlikely to occur if the diet made no difference (the null hypothesis).

Imagine a scenario where one researcher has a sample size of 20 and another a sample size of 40, both drawn from the same population, and both happen to find a mean weight change of 3kg. How likely is it that the 3kg weight change will be statistically significant in each scenario?

To help us here we’ll show a distribution curve from each scenario.

What you see above are two distributions of possible sample means (see below) for 20 people (n=20) and 40 people (n=40), both drawn from the same population. On each we have superimposed a sample mean weight change of 3kg. The curves are both centred on zero to reflect the null hypothesis of “no difference” (i.e. that the diet has no effect).

The 3kg change is more likely to be significant when n=40, because that distribution curve is narrower and 3kg sits further out towards its tail than it does in the n=20 scenario. This points to how you can increase the power of your experiment. The reason the n=40 curve is spikier is something called the standard error of the mean: the larger the sample, the more accurately it reflects the population it was drawn from, so its mean is distributed more closely around the population mean.
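The narrowing of the curve can be checked with a quick simulation: draw many samples of each size from the same population and look at how spread out their means are. This is only an illustrative sketch; the text gives no population sd, so the 8kg figure below is an assumption.

```python
import random
import statistics

random.seed(42)

# Assumed illustrative values (not given in the text): individual weight
# changes have population mean 0 (the null hypothesis: no effect) and
# population standard deviation 8 kg.
POP_MEAN, POP_SD = 0.0, 8.0

def sample_mean_spread(n, trials=20_000):
    """Draw many samples of size n and return the sd of their sample means."""
    means = [
        statistics.fmean(random.gauss(POP_MEAN, POP_SD) for _ in range(n))
        for _ in range(trials)
    ]
    return statistics.stdev(means)

for n in (20, 40):
    print(f"n={n}: spread of sample means ~ {sample_mean_spread(n):.2f} kg")
```

The n=40 spread comes out smaller than the n=20 spread, which is exactly the narrower, spikier curve described above.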

Hopefully you will have an intuitive feeling that the larger your sample is, the more accurately it reflects the population: an exit poll at an election just asking two people how they voted is clearly less useful than one which asks 2,000 people. In Statistics this needs to be quantified and pinned down, and you want to make your sample as accurate as possible.

This reliability of the sample mean as a reflection of the population mean is quantified by something called the standard error of the mean (se), which is essentially the sd of the population of all the sample means we would get if we took infinitely many random samples rather than just the one. The two curves above show these distributions for our two imaginary samples. (You can find out more about this in the section ‘Numeric Data Description’ in Statistics for the Terrified.) The standard error of the mean is calculated using two things:

- the standard deviation (sd) of the parent population (which we can’t change)
- the sample size (which we can change).
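Specifically, the standard error is the population sd divided by the square root of the sample size. A minimal sketch of the two ingredients above, using an assumed population sd of 8kg (not given in the text):

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: the population sd divided by sqrt(n)."""
    return sd / math.sqrt(n)

# Assumed population sd of 8 kg, for our two sample sizes:
for n in (20, 40):
    print(f"n={n}: se = {standard_error(8.0, n):.2f} kg")
# n=20: se = 1.79 kg
# n=40: se = 1.26 kg
```

Doubling the sample size does not halve the se (because of the square root), but it does make the curve of possible sample means noticeably narrower.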

In order to show that the weight change we have seen is significant and not just random weight fluctuation, our sample mean needs to fall out in one of the tails of the curve. If our sample mean falls in the middle section of the curve, then the observed weight change could easily have happened by chance.

Notice that the curve showing the se of the sample with 20 people is much wider (covering a wider range of weight changes) than the curve of the se of the sample with 40 people. You can see that a change of 3kg is right up at the end of the n=40 curve (significant!), whereas it is more in the central region of the n=20 curve (not significant).

With a sample size of n=20 we cannot tell whether the change of 3kg is down to chance or to the diet. By increasing the sample size we increase the reliability of the sample mean (making the curve narrower and spikier), so any change we detect is more likely to sit out at one extreme, and therefore to be statistically significant.
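The two verdicts can be made concrete with a one-sample z-test of the 3kg mean change against the null hypothesis of zero change. The population sd of 8kg below is again an assumed illustrative value, chosen so that the two scenarios land on opposite sides of the 5% significance level:

```python
import math

def z_test_p_value(mean_change, pop_sd, n):
    """Two-sided p-value for a sample mean under the null of zero change."""
    se = pop_sd / math.sqrt(n)   # the se shrinks as n grows
    z = mean_change / se         # how extreme 3 kg is on that curve
    # Standard normal cdf via the error function:
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return 2 * (1 - cdf)

for n in (20, 40):
    p = z_test_p_value(3.0, 8.0, n)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"n={n}: p = {p:.3f} ({verdict} at the 5% level)")
```

With these assumed numbers the same 3kg change gives p above 0.05 at n=20 but below 0.05 at n=40, mirroring the two curves in the text.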

In reality, of course, you will have to decide on your sample size before you begin, and there is a formula for calculating the n needed to detect your specific difference with reasonable confidence. This formula uses the specific difference and the sd of the population. As mentioned above, the specific difference is proposed by the researcher, and the population sd has to be obtained from previously published research or from a pilot study.
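The text does not give the formula itself; a standard version for a two-sided test at significance level alpha with a chosen power is n = ((z for alpha/2 + z for power) x sd / specific difference)^2. A sketch using the same assumed values as before (3kg specific difference, 8kg population sd):

```python
import math
from statistics import NormalDist

def sample_size(delta, sd, alpha=0.05, power=0.80):
    """Sample size needed to detect a mean difference delta with the given power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # ~0.84 for 80% power
    # Round up: a fractional participant is not an option.
    return math.ceil(((z_alpha + z_power) * sd / delta) ** 2)

# Assumed values: detect a 3 kg change, population sd 8 kg.
print(sample_size(delta=3.0, sd=8.0))
```

Note how the formula rewards a larger specific difference and punishes a noisier population: halving the difference you want to detect roughly quadruples the required n.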

It is possible to get a statistically significant difference that is not relevant. Imagine you did a study of a new (but not very effective) fever control drug with so many people in the samples that you had a statistically significant finding with a mean drop in temperature of 0.1°C. It may be statistically significant, but it won’t be very relevant if you have a high fever!

Concept Stew Services