Suppose you are breeding a new variety of potato. Your data suggest that this breed is more resistant to some pest. But all such data are subject to many sources of error, so you can't be fully confident that the numbers support that conclusion – certainly not as confident as a physicist who can make very precise measurements and eliminate most errors. Fisher realised that the key issue is to distinguish a genuine difference from one arising purely by chance, and that the way to do this is to ask how probable that difference would be if only chance were involved.
Assume, for instance, that the new breed of potato appears to confer twice as much resistance, in the sense that the proportion of the new breed that survives the pest is double the proportion for the old breed. It is conceivable that this effect is due to chance, and you can calculate its probability. In fact, what you calculate is the probability of a result at least as extreme as the one observed in the data. What is the probability that the proportion of the new breed that survives the pest is at least twice what it was for the old breed? Even larger proportions are permitted here because the probability of getting exactly twice the proportion is bound to be very small. The wider the range of results you include, the more probable the effects of chance become, so you can have greater confidence in your conclusion if your calculation suggests it is not the result of chance. If the probability derived from this calculation is low, say 0.05, then the result is unlikely to be due to chance; it is said to be significant at the 95% level. If the probability is lower, say 0.01, then the result is extremely unlikely to be due to chance, and it is said to be significant at the 99% level. The percentages indicate that by chance alone, the result would not be as extreme as the one observed in 95% of trials, or in 99% of them.
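To make the idea concrete, here is a rough sketch in Python. It is not Fisher's own calculation: the counts are invented (100 plants of each breed, with 20 and 40 survivors), and the chance probability is estimated by simulation rather than by an exact formula.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented counts: 100 plants of each breed were exposed to the pest.
old_survivors, new_survivors, n = 20, 40, 100
observed_ratio = (new_survivors / n) / (old_survivors / n)   # 2.0

# Null hypothesis: both breeds really share the same survival rate,
# estimated here by pooling the two samples.
pooled_rate = (old_survivors + new_survivors) / (2 * n)

# Simulate many trials in which only chance is at work, and count how
# often the new breed appears to do at least twice as well anyway.
trials = 100_000
sim_old = rng.binomial(n, pooled_rate, size=trials)
sim_new = rng.binomial(n, pooled_rate, size=trials)
ratios = sim_new / np.maximum(sim_old, 1)        # guard against division by zero
p_value = np.mean(ratios >= observed_ratio)

print(f"chance of a result at least this extreme: {p_value:.4f}")
# A value below 0.05 would be called significant at the 95% level,
# below 0.01 at the 99% level.
```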
Fisher described his method as a comparison between two distinct hypotheses: the hypothesis that the data are significant at the stated level, and the so-called null hypothesis that the results are due to chance. He insisted that his method must not be interpreted as confirming the hypothesis that the data are significant; it should be interpreted as a rejection of the null hypothesis. That is, it provides evidence against the data not being significant.
This may seem a very fine distinction, since evidence against the data not being significant surely counts as evidence in favour of it being significant. However, that's not entirely true, and the reason is that the null hypothesis has an extra built-in assumption. In order to calculate the probability that a result at least as extreme is due to chance, you need a theoretical model. The simplest way to get one is to assume a specific probability distribution. This assumption applies only in connection with the null hypothesis, because that's what you use to do the sums. You don't assume the data are normally distributed. But the default distribution for the null hypothesis is normal: the bell curve.
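The following sketch shows what that default looks like in practice. It reuses the invented potato counts from above and compares the exact chance calculation, from the binomial distribution, with the bell-curve approximation that the standard significance machinery quietly substitutes for it.

```python
from scipy.stats import binom, norm

# Invented figures again: under the null hypothesis both breeds survive at
# the pooled rate of 30%, and the new breed produced 40 survivors out of 100.
n, p, observed = 100, 0.3, 40

# Exact probability of a count at least this large under pure chance.
exact_tail = binom.sf(observed - 1, n, p)

# The same tail probability computed from the normal (bell-curve) model
# that classical significance tests build into the null hypothesis.
mean = n * p
sd = (n * p * (1 - p)) ** 0.5
normal_tail = norm.sf(observed - 0.5, loc=mean, scale=sd)   # continuity correction

print(f"exact binomial tail:  {exact_tail:.4f}")
print(f"normal approximation: {normal_tail:.4f}")
```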
This built-in model has an important consequence, which 'reject the null hypothesis' tends to conceal. The null hypothesis is 'the data are due to chance'. So it is all too easy to read that statement as 'reject the data being due to chance', which in turn means you accept that they're not due to chance. Actually, though, the null hypothesis is 'the data are due to chance and the effects of chance are normally distributed', so there might be two reasons to reject the null hypothesis: the data are not due to chance, or they are not normally distributed. The first supports the significance of the data, but the second does not. It says you might be using the wrong statistical model.
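A small simulation makes the second escape route visible. Here chance alone is at work, but the chance fluctuations follow a skewed distribution (chosen arbitrarily for the illustration) rather than the bell curve, and results that a normal model would call extreme turn up noticeably more often than that model predicts.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Pure chance, but drawn from a right-skewed (lognormal) distribution
# rather than the bell curve: a stand-in for "the wrong statistical model".
chance_only = rng.lognormal(mean=0.0, sigma=1.0, size=200_000)

mu, sigma = chance_only.mean(), chance_only.std()
threshold = mu + 2 * sigma            # the usual "two standard deviations" cut-off

predicted = norm.sf(2.0)              # what a normal model expects beyond that cut-off
actual = np.mean(chance_only > threshold)

print(f"extreme results the normal model predicts: {predicted:.3f}")
print(f"extreme results chance actually produces:  {actual:.3f}")
```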
In Fisher's agricultural work, there was generally plenty of evidence for normal distributions in the data. So the distinction I'm making didn't really matter. In other applications of hypothesis testing, though, it might. Saying that the calculations reject the null hypothesis has the virtue of being true, but because the assumption of a normal distribution is not explicitly mentioned, it is all too easy to forget that you need to check normality of the distribution of the data before you conclude that your results are statistically significant. As the method gets used by more and more people, who have been trained in how to do the sums but not in the assumptions behind them, there is a growing danger of wrongly assuming that the test shows your data to be significant. Especially when the normal distribution has become the automatic default assumption.
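Making that check explicit takes only a couple of lines. The sketch below uses the Shapiro-Wilk test from the SciPy library on two invented samples; a small p-value is evidence that the data are not bell-shaped.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(2)

# Two invented samples: one drawn from a bell curve, one heavily skewed.
bell_shaped = rng.normal(loc=50.0, scale=10.0, size=200)
skewed = rng.lognormal(mean=3.0, sigma=0.8, size=200)

# Shapiro-Wilk test: a small p-value is evidence that a sample does NOT
# come from a normal distribution, so the usual significance machinery
# should not be trusted blindly on it.
for name, sample in [("bell-shaped sample", bell_shaped), ("skewed sample", skewed)]:
    statistic, p = shapiro(sample)
    print(f"{name}: Shapiro-Wilk p = {p:.4f}")
```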
In the public consciousness, the term 'bell curve' is indelibly associated with the controversial 1994 book The Bell Curve by two Americans, the psychologist Richard J. Herrnstein and the political scientist Charles Murray. The main theme of the book is a claimed link between intelligence, measured by intelligence quotient (IQ), and social variables such as income, employment, pregnancy rates, and crime. The authors argue that IQ levels are better at predicting such variables than the social and economic status of the parents or their level of education. The reasons for the controversy, and the arguments involved, are complex. A quick sketch cannot really do justice to the debate, but the issues go right back to Quetelet and deserve mention.
Controversy was inevitable, no matter what the academic merits or demerits of the book might have been, because it touched a sensitive nerve: the relation between race and intelligence. Media reports tended to stress the proposal that differences in IQ have a predominantly genetic origin, but the book was more cautious about this link, leaving the interaction between genes, environment, and intelligence open. Another controversial issue was an analysis suggesting that social stratification in the United States (and indeed elsewhere) increased significantly throughout the twentieth century, and that the main cause was differences in intelligence. Yet another was a series of policy recommendations for dealing with this alleged problem. One was to reduce immigration, which the book claimed was lowering average IQ. Perhaps the most contentious was the suggestion that social welfare policies allegedly encouraging poor women to have children should be stopped.
Ironically, this idea goes back to Galton himself. His 1869 book Hereditary Genius built on earlier writings to develop the idea that 'a man's natural abilities are derived by inheritance, under exactly the same limitations as are the form and physical features of the whole organic world. Consequently . . . it would be quite practicable to produce a highly-gifted race of men by judicious marriages during several consecutive generations.' He asserted that fertility was higher among the less intelligent, but avoided any suggestion of deliberate selection in favour of intelligence. Instead, he expressed the hope that society might change so that the more intelligent people understood the need to have plenty of children.
To many, Herrnstein and Murray's proposal to re-engineer the welfare system was uncomfortably close to the eugenics movement of the early twentieth century, in which 60,000 Americans were sterilised, allegedly because of mental illness. Eugenics became widely discredited when it became associated with Nazi Germany and the Holocaust, and many of its practices are now considered to be violations of human rights legislation, in some cases amounting to crimes against humanity. Proposals to breed humans selectively are widely viewed as inherently racist. A number of social scientists endorsed the book's scientific conclusions but disputed the charge of racism; some of them were less sure about the policy proposals.
The Bell Curve initiated a lengthy debate about the methods used to compile data, the mathematical methods used to analyse them, the interpretation of the results, and the policy suggestions based on those interpretations. A task force set up by the American Psychological Association concluded that some points made in the book are valid: IQ scores are good for predicting academic achievement, this correlates with employment status, and there is no significant difference in the performance of males and females. On the other hand, the task force's report reaffirmed that both genes and environment influence IQ, and it found no significant evidence that racial differences in IQ scores are genetically determined.
Other critics have argued that there are flaws in the scientific methodology, such as inconvenient data being ignored, and that the study and some responses to it may to some extent have been politically motivated. For example, it is true that social stratification has increased dramatically in the United States, but it could be argued that the main cause is the refusal of the rich to pay taxes, rather than differences in intelligence. There also seems to be an inconsistency between the alleged problem and the proposed solution. If poverty causes people to have more children, and you believe that this is a bad thing, why on earth would you want to make them even poorer?
An important part of the background, often ignored, is the definition of IQ. Rather than being something directly measurable, such as height or weight, IQ is inferred statistically from tests. Subjects are set questions, and their scores are analysed using an offshoot of the method of least squares called analysis of variance. Like the method of least squares, this technique assumes that the data are normally distributed, and it seeks to isolate those factors that determine the largest amount of variability in the data, and are therefore the most important for modelling the data. In 1904 the psychologist Charles Spearman applied this technique to several different intelligence tests. He observed that the scores that subjects obtained on different tests were highly correlated; that is, if someone did well on one test, they tended to do well on them all. Intuitively, they seemed to be measuring the same thing. Spearman's analysis showed that a single common factor – one mathematical variable, which he called g, standing for 'general intelligence' – explained almost all of the correlation. IQ is a standardised version of Spearman's g.
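A toy simulation shows the effect Spearman saw, though with invented loadings and sample sizes rather than his data, and with an eigenvalue decomposition of the correlation matrix standing in for his factor analysis: five test scores are generated from one hidden variable plus noise, and a single factor then accounts for most of the correlation between them.

```python
import numpy as np

rng = np.random.default_rng(3)

# 1,000 imaginary people sit five different tests.  Every score is driven
# by a single hidden variable ("g") plus independent noise; the loadings
# are invented numbers, not estimates from real tests.
n_people = 1_000
loadings = np.array([0.9, 0.85, 0.8, 0.75, 0.7])
g = rng.normal(size=n_people)
noise = rng.normal(size=(n_people, loadings.size))
scores = g[:, None] * loadings + noise * np.sqrt(1 - loadings**2)

# The tests are highly correlated, and one factor (the leading eigenvector
# of the correlation matrix) accounts for most of that shared structure.
corr = np.corrcoef(scores, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]
print("correlations between tests:\n", corr.round(2))
print("share of variance explained by one factor:",
      round(eigenvalues[0] / eigenvalues.sum(), 2))
```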
A key question is whether g is a real quantity or a mathematical fiction.
The answer is complicated by the methods used to choose IQ tests. These assume that the 'correct' distribution of intelligence in the population is normal – the eponymous bell curve – and calibrate the tests by manipulating scores mathematically to standardise the mean and standard deviation. A potential danger here is that you get what you expect because you take steps to filter out anything that would contradict it. Stephen Jay Gould made an extensive critique of such dangers in 1981 in The Mismeasure of Man, pointing out among other things that raw scores on IQ tests are often not normally distributed at all.
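One common norming recipe makes the point starkly (this is a sketch, not the exact procedure used by any particular test): raw scores are replaced by their ranks, the ranks are mapped onto the quantiles of a normal distribution, and the result is rescaled to the conventional IQ mean of 100 and standard deviation of 15. Whatever shape the raw scores had, the published scores come out bell-shaped by construction.

```python
import numpy as np
from scipy.stats import norm, rankdata

rng = np.random.default_rng(4)

# Invented raw test scores that are clearly not bell-shaped (right-skewed).
raw = rng.lognormal(mean=3.0, sigma=0.6, size=5_000)

# Replace each raw score by its rank, map the ranks onto normal quantiles,
# then rescale to a mean of 100 and a standard deviation of 15.
quantiles = (rankdata(raw) - 0.5) / raw.size
normed = 100 + 15 * norm.ppf(quantiles)

print("raw scores:    mean", round(raw.mean(), 1), " median", round(np.median(raw), 1))
print("normed scores: mean", round(normed.mean(), 1), " sd", round(normed.std(), 1))
```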
The main reason for thinking that g represents a genuine feature of human intelligence is that it is one factor: mathematically, it defines a single dimension. If many different tests all seem to be measuring the same thing, it is tempting to conclude that the thing concerned must be real. If not, why would the results all be so similar? Part of the answer could be that the results of IQ tests are reduced to a single numerical score. This squashes a multidimensional set of questions and potential attitudes down to a one-dimensional answer. Moreover, the test has been selected so that the score correlates strongly with the designer's view of intelligent answers – if not, no one would consider using it.
By analogy, imagine collecting data on several different aspects of 'size' in the animal kingdom. One might measure mass, another height, others length, width, diameter of left hind leg, tooth size, and so on. Each such measure would be a single number. They would in general be closely correlated: tall animals tend to weigh more, have bigger teeth, thicker legs . . . If you ran the data through an analysis of variance you would very probably find that a single combination of those data accounted for the vast majority of the variability, just like Spearman's g does for different measurements of things thought to relate to intelligence. Would this necessarily imply that all of these features of animals have the same underlying cause? That one thing controls them all? Possibly: a growth hormone level, perhaps? But probably not. The richness of animal form does not comfortably compress into a single number. Many other features do not correlate with size at all: ability to fly, being striped or spotted, eating flesh or vegetation. The single special combination of measurements that accounts for most of the variability could be a mathematical consequence of the methods used to find it – especially if those variables were chosen, as here, to have a lot in common to begin with.
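The same eigenvalue sketch used earlier makes the point for the animal analogy. The measurements below are invented, and every 'size' variable is built from a common scale plus noise, which is exactly why one combination then dominates; the striped-or-not column, which has nothing to do with size, is left out of the story that dominant combination tells.

```python
import numpy as np

rng = np.random.default_rng(5)

# Invented measurements for 500 animals.  The four "size" variables are all
# built from the same overall scale plus noise, because that is how they
# were chosen; stripes have nothing to do with size.
n = 500
scale = rng.normal(size=n)
mass    = scale + 0.3 * rng.normal(size=n)
height  = scale + 0.3 * rng.normal(size=n)
tooth   = scale + 0.3 * rng.normal(size=n)
leg     = scale + 0.3 * rng.normal(size=n)
striped = rng.integers(0, 2, size=n).astype(float)

data = np.column_stack([mass, height, tooth, leg, striped])
corr = np.corrcoef(data, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]

print("share of variability in the leading combination:",
      round(eigenvalues[0] / eigenvalues.sum(), 2))
# The dominant single combination falls straight out of the way the size
# variables were constructed; the striped-or-not column barely features in it.
```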
Going back to Spearman, we see that his much-vaunted g may be one-dimensional because IQ tests are one-dimensional. IQ is a statistical method for quantifying specific kinds of problem-solving ability, mathematically convenient but not necessarily corresponding to a real attribute of the human brain, and not necessarily representing whatever it is that we mean by 'intelligence'.
By focusing on one issue, IQ, and using that to set policy, The Bell Curve ignores the wider context. Even if it were sensible to genetically engineer a nation's population, why confine the process to the poor? Even if on average the poor have lower IQs than the rich, a bright poor child will outperform a dumb rich one any day, despite the obvious social and educational advantages that children of the rich enjoy. Why resort to welfare cuts when you could aim more accurately at what you claim to be the real problem: intelligence itself? Why not improve education? Indeed, why aim your policy at increasing intelligence at all? There are many other desirable human traits. Why not reduce gullibility, aggressiveness, or greed?