Power in statistics and statistical significance

The science of statistics is largely based on principles of probability. On an unbiased coin toss, heads are as likely to turn up as tails. The chance of getting three heads in a row somewhere in five coin tosses can also be calculated. These are widely understood ideas, and such chances are computed using statistical concepts. The actual result of tossing the coin may or may not match the prediction. Just as these statements describe the likelihood of a certain outcome, scientists can compute how likely their statistical statements are to be correct, or how much confidence they have in their results.
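As a small illustration of how such a chance can be calculated, the sketch below lists every possible result of five fair coin tosses and counts those that contain three heads in a row. It reads "three heads in a row in any five coin tosses" as "at least one run of three or more heads", which is an assumption about the intended meaning. The function name is made up for this example.

```python
from itertools import product


def prob_run_of_heads(n_tosses: int, run_length: int) -> float:
    """Chance that a fair coin shows at least one run of `run_length`
    heads in a row somewhere within `n_tosses` tosses."""
    target = "H" * run_length
    # Every sequence of heads (H) and tails (T) is equally likely on a fair coin.
    outcomes = ["".join(seq) for seq in product("HT", repeat=n_tosses)]
    hits = sum(1 for seq in outcomes if target in seq)
    return hits / len(outcomes)


print(prob_run_of_heads(5, 3))  # 8 of the 32 possible sequences qualify: 0.25
```

Listing every outcome only works for a handful of tosses, but it makes the reasoning behind the calculation easy to check by hand.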

Each of us uses these concepts in our daily lives. We estimate statistical likelihoods when we say, "I am 90% sure that I've received a passing grade on my statistics exam," or, "There is a 30% chance of rain."

If we throw an honest die over and over again, we would expect each number to show up about one-sixth of the time. So, if we get the number "2" 10 times out of 18, we would suspect that this was not mere coincidence (we would expect it to turn up 3 times out of 18), especially if the person who supplied the die was betting that "2" would turn up more often than any other number! We would be less suspicious if the number "2" turned up 3 times out of 12 (because 2 times out of 12 is expected). Three times is close enough to the expected value that it could be a chance occurrence, even though it is still higher than expected.
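To put rough numbers on this intuition, the sketch below computes how often an honest die would show a "2" at least as many times as in each example. It uses the standard binomial probability formula. The function name is chosen for this illustration only.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """Chance of seeing a given face at least k times in n throws,
    when each throw shows that face with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


p_honest = 1 / 6  # an honest die shows each face one-sixth of the time

# A "2" at least 10 times in 18 throws (3 expected)
print(prob_at_least(10, 18, p_honest))

# A "2" at least 3 times in 12 throws (2 expected)
print(prob_at_least(3, 12, p_honest))
```

The first chance comes out far below 1%, while the second is roughly one in three. This matches the suspicion we feel in the first case and the lack of it in the second.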

Statistical comparison tests can be used to characterize the data we observe (in the Love Canal health study, the data are the number of deaths, cancers, and different birth outcomes). These tests help scientists know how confident they can be about whether the observed outcomes are similar to what they would expect. If the observed outcomes are different enough from the expected ones, the difference isn't likely to be due to chance, and scientists must look for other explanations. When the phrase "statistical significance" is used, it means the difference is not likely to be due to chance.

Confidence in statistical test results will vary. The ability to statistically detect a difference when the difference truly exists (that is, when it is not due to chance) is called the power of the test. We would be fairly confident in saying that rolling a "2" 10 times out of 18 on an honest die is unlikely to happen without some interference. We would be less confident in concluding that rolling a "2" 3 times out of 12 is not due to chance, since we expected 2 out of 12 and 3 is very close. The ability to detect a true difference between actual and expected results (our power) is much greater when we roll a "2" 10 times out of 18 than when we roll a "2" 3 times out of 12. As you might expect, our power is higher when the difference between actual and expected results is bigger and when we increase the number of observations.

There are mathematical formulae to determine the power of statistical tests. Such a formula was used to look at the power to see statistically significant differences in the Love Canal study group. After doing this calculation, scientists are concerned that they do not have enough power to see a small difference in some health outcomes experienced by the Love Canal residents, even if there really is a difference. In other words, there is low power to conclude whether small differences in some outcomes are statistically significant. It is as if the die had been thrown only six times instead of 18: even if most of the throws yielded a "2", there are too few throws to be sure. In a game, we can simply throw the die more times to increase the power of our conclusions about whether the die is "loaded".
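The dice example can show how power grows with the number of throws. The sketch below is only an illustration and is not the formula used for the Love Canal study. It assumes a one-sided exact binomial test, a 5% cutoff for calling a result significant, and a hypothetical loaded die that shows a "2" one-third of the time. All of these choices, and the function names, are assumptions made for this example.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """Chance of at least k occurrences in n throws with per-throw chance p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def critical_count(n: int, p_expected: float, alpha: float = 0.05) -> int:
    """Smallest count that an honest die would reach or exceed no more than
    alpha of the time. Returns n + 1 if no possible count is that unusual."""
    for k in range(n + 2):
        if prob_at_least(k, n, p_expected) <= alpha:
            return k
    return n + 1  # not reached: the chance of at least n + 1 is always 0


def power(n: int, p_expected: float, p_true: float, alpha: float = 0.05) -> float:
    """Chance that the test flags the die as unusual when its true
    chance of showing the face is p_true."""
    return prob_at_least(critical_count(n, p_expected, alpha), n, p_true)


p_honest = 1 / 6  # expected chance of a "2" on an honest die
p_loaded = 1 / 3  # hypothetical loaded die (assumed for illustration)

for throws in (6, 18, 60):
    print(throws, round(power(throws, p_honest, p_loaded), 2))
```

Under these assumptions the power is low with six throws and rises steadily as throws are added, which is why throwing the die more times helps in a game. A study of a fixed group of people has no such option.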

Little or nothing can be done to increase the power in the Love Canal study. We can't change the expected rate of cancers, deaths, and birth outcomes. Nor can we make large changes in the number of observations (the people who lived at the Love Canal) although we will try to include as many residents as possible in the study and gather as complete information as we can about those people. Further, we can't change the size of the difference between the two numbers (observed and expected). Using the mathematical formula, it turns out that we have very low power to detect small differences in the more common health effects. We will have good power to detect different rates of illnesses that are relatively rare, and to see large differences if they exist. This is because you would not expect to see the rare illnesses in any group, so seeing them is more likely to be significant, and because a striking difference between the two groups is not likely due to chance.

Former Love Canal residents are concerned that statistics will not find a real difference between the Love Canal group and other groups because of the low power of the study. They are also concerned that reporting a "no difference" finding could be misleading. Not seeing a difference, especially when the power is low, does not mean there is no difference, and a report of no difference could lead someone to dismiss a real health effect. We agree, and we are already thinking about the best way to release study results so that they are clearly understood. While different approaches have been suggested to handle this problem, everyone on the Committee agreed that we should not rely on tests of statistical significance as a way to determine biological importance. Most, but not all, of the experts felt that we should use other descriptive statistics to help report the data in a meaningful way. We will use this newsletter to continue this dialogue with residents. We envision the Committee discussing this further as we go along.

If you have any questions, you can call us at 518-402-7530 or 1-800-458-1158.