If, however, the effects of medication and placebo are not additive, you would have to design an experiment that isolates them from each other. You cannot use a double-blind experiment, because you will need a four-way distribution of patients: those who are given treatment and are told it is treatment, those who are given a placebo and are told it is treatment, those who are given treatment and are told it is a placebo, and those who are given a placebo and are told it is a placebo. This should isolate the effect of the drug from that of the placebo; but instead of one comparison, you would need to make six (one for every pairing of the four groups), with all the usual problems of significance, sample size, and chance variation. It would be a large and complex trial, but without it there would always be the suspicion that most antidepressants are little more than a very expensive but pharmacologically dubious pink pill.
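To see where the six comes from, here is a minimal sketch in Python (an illustration, not anything from the original trial literature) that simply enumerates the four arms of such a design and every possible pairing of them:

    from itertools import combinations, product

    # The four arms: what a patient is given, crossed with what the patient is told.
    arms = [f"given {given}, told {told}"
            for given, told in product(("drug", "placebo"), ("drug", "placebo"))]

    # Every pairing of the four arms: the comparisons the analysis would have to make.
    pairs = list(combinations(arms, 2))
    for a, b in pairs:
        print(f"{a}  versus  {b}")
    print(len(pairs), "comparisons instead of one")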
In the United States today the whole question has become academic, because it is almost impossible to get the required informed consent for a classic randomized placebo trial: nobody who would sign up is willing to take the chance of not getting the newest treatment. Instead you do crossover trials, in which you give one of your randomized groups the treatment, and if it seems to be effective, the other group then gets it too. This solves the ethical problem of denying treatment, but makes it much harder to show a clear difference in results. “First, do no harm”: but what if that means you can do no good?
* * *
In 1766, Jean Astruc, onetime physician to the Regent of France, published a long and learned treatise on the art of midwifery that began with the observation that he had not himself been present at a birth (except, one assumes, his own).
For a long time, such a combination of prudery, presumption, and tradition could conceal the fact that women and men are confounded variables in any calculation of human health. We do overlap in many respects; indeed, as our roles in life converge we can see points where our health histories are also aligning, both for good (fewer men dying of industrial diseases) and for ill (more women dying of lung cancer). Nevertheless, there are significant variations between the sexes, ranging from the obvious reproductive and hormonal differences to types of depression, overall life expectancy, and unequal response to painkillers.
What should experimenters do with confounded variables? Isolate them, if possible. When obvious differences appear in the clinical record, we can design studies to quantify those differences with the same precision as any other comparison. The residual problem, though, is: what about differences that are not obvious? As we have seen, the significance or insignificance of clinical trials can hang on a single percentage point. What if the responses of men and women placed in the same group actually differed from one another? Would we be justified in taking the mean of our results? You can hardly say that a bucket of boiling water in a tub full of ice is the equivalent of a lukewarm bath.
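A toy calculation, with invented numbers, shows how pooling can hide exactly this kind of split: if one group improves as much as the other worsens, the combined mean reads as no effect at all.

    from statistics import mean

    # Hypothetical responses to the same treatment: men improve, women worsen.
    men = [+10.0] * 50
    women = [-10.0] * 50

    print(mean(men))            # +10.0
    print(mean(women))          # -10.0
    print(mean(men + women))    #   0.0 -- the "lukewarm bath"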
It's not only an issue of l'homme moyen differing from la femme moyenne: in an age of mass movements of populations, countries with large numbers of people from different parts of the world are particularly aware of the effects of genetic variations, from lactose intolerance to sickle-cell anemia. The National Institutes of Health Revitalization Act of 1993 required that clinical trials be “designed and carried out in a manner sufficient to provide for a valid analysis of whether the variables being studied in the trial affect women or members of minority subgroups, as the case may be, differently than [sic] other subjects in the trial.”
This seems fair enough, but the Act was less specific on how it was to be achieved, and with reason. A trial is designed to have a certain “power”: a probability that a genuine effect will be recognized and not mistaken for the workings of chance. Like a signal-to-noise ratio, it depends crucially on the number of subjects. Increasing the number increases the probability that random variation will cancel out, but the relation is not linear: if you want to reduce the random effect by half, you need four times as many observations; by three-quarters, you need 16 times as many; by seven-eighths, 64 times. The minimum number of patients for a given trial is determined by the minimum size of effect that the researchers hope to observe, the error of observation, and the power of the experiment. These factors are mutually connected and, as the number of patients in the study decreases, they all work together to reduce the validity of the results.
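That square-law relation is just the familiar behavior of a standard error, which shrinks as one over the square root of the sample size. A few lines make the arithmetic explicit (the spread and starting sample are purely illustrative figures):

    import math

    sigma = 1.0          # assumed spread of individual observations
    n = 100              # observations in the original design
    base_error = sigma / math.sqrt(n)

    for shrink in (2, 4, 8):                 # halve, quarter, eighth the random error
        needed = n * shrink ** 2
        new_error = sigma / math.sqrt(needed)
        print(f"error reduced to 1/{shrink}: {needed} observations "
              f"({new_error:.4f} vs {base_error:.4f})")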
If you want to look at the difference between male and female responses to a given treatment, you need to look at the female response in isolation, then the male response, and then compare the two. So if you had needed 1,000 patients to assure the required power for a sex-blind experiment, you would need 2,000 to achieve the same power for an investigation of the male and female response in isolation. If you want to compare those responses, you are now working with the results of this initial experiment, already affected by one layer of error; so, if your comparison is to have the same power as the experiment you first proposed, you need 4,000 patients. Add two further conditions, say age and income, and you need an initial sample of 64,000. If your group reflects the relative proportion of black Americans in the population (10 percent), you need to multiply your initial sample by 10 to be able to say anything of significance about them in isolation. Determining the numbers you need is easy; achieving them may be impossible.
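Taking those multipliers at face value (each extra two-level comparison quadruples the sample, and a subgroup that is a tenth of the population multiplies whatever sample you planned by ten), a short sketch reproduces the quoted figures; the final line is one reading of “multiply your initial sample by 10,” applied to the 64,000:

    sex_blind = 1_000                       # patients needed for the original power
    separate_sexes = sex_blind * 2          # male and female responses in isolation
    compare_sexes = separate_sexes * 2      # compare the two at the same power
    age_and_income = compare_sexes * 4 * 4  # two further two-level conditions

    print(separate_sexes, compare_sexes, age_and_income)   # 2000 4000 64000

    # A subgroup that is only 10 percent of the population multiplies the
    # planned sample by ten again.
    print(age_and_income * 10)              # 640000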
In medicine, we all want certainty, but we'd settle for rigor. Rigor, though, demands a high price in the complexity and size of experiment; and the numbers required for confidence in the results may be beyond any institution's capacity to administer. Ultimately, we reach a point where society has to trust the researchers to isolate the right variables in the right studies. We will never be entirely free of medical tact.
* * *
Fisher never took the Hippocratic oath; the beings whose development he encouraged or stunted were mere plants. His standards were mathematical, and his highest duty was to precision under uncertainty. Importing Fisher's methods into medicine, however, brings clinical researchers constantly up against ethical questions. The point of randomization is to purge inference of human preconception and allow simple error full play. The point of ethics is to save the situation from error in the name of a human preconception. The two do not sit easily together.
Do you withhold untested but promising AIDS treatments from dying patients in the name of controlled experiment? Do you, to preserve the validity of your results, continue hormone replacement therapy trials in the face of statistically significant increases in the incidence of breast cancer? Every question puts you back in the Lanarkshire classroom, milk jug in hand, choosing between braw Sandy and poor wee Robert.
In practice, somebody else will usually make the choice. Almost all clinical trials now have to be approved beforehand by institutional ethical committees, often combining laypeople with experts. These committees are increasingly overworked and sometimes ill equipped to assess the studies they must consider. Of course, ethical decisions must trump scientific ones, but this raises the question whether it is ethical to involve patients in a poorly designed study with insufficient statistical power to provide definitive results. And although the Declaration of Helsinki is supposed to govern all research committees, different countries have different standards; so, just as some shipowners register their leaky tankers in a country with low safety standards, others take their dubious experiments to less demanding jurisdictions.
You can see where this is leading: the ethical component has become a variable in itself. When, as now so often happens, a study hopes to include samples from many institutions in different countries, setting up the protocol to harmonize the review process can be just as important as the experimental design. This requires so much time and money that there is now an international review of multicenter studies of review bodies to determine the international protocols that can govern the review of multicenter studies, proving there is such a thing as a conceptual palindrome.
* * *
Medical research is self-sacrifice: years of study, long hours in ill-smelling buildings, complex statistical analysis; and, brooding behind it all, the Null Hypothesis: a form of self-mortification far more difficult to bear than eschewing fish without scales or fasting during daylight for a month.
Fisher said that experiment gives facts a chance to disprove the null hypothesis. As in hide-and-seek, the positive result is “it”: it will find you, assuming that it exists. If it doesn't, the null hypothesis prevails: alone in your office, you mark down a negative result, document your methodology, tidy up your notes, and send it all off to a journal.
This is not failure. Verifying the null hypothesis should be a valuable result; assuming the trial stands up to scrutiny, it consigns the treatment tested to history's trash heap: we no longer bleed fever patients because of a negative result for Dr. Broussais' leeches. But a negative result, like a positive one, requires statistical power: there is a big difference between “We found nothing” and “We didn't find anything.” An effect might be strong enough to appear even in an underpowered study where, if it hadn't shown up, the null hypothesis could not be assumed. Short are the steps from uncertainty to ambiguity to confusion.
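A small simulation, with an assumed effect size and noise level, shows how much “we didn't find anything” can owe to sample size rather than to the treatment: the same real effect goes unseen in most small trials and is almost certain to appear in large ones.

    import random
    from statistics import NormalDist, mean, stdev

    random.seed(1)

    def detects_effect(n, effect=0.3, alpha=0.05):
        """One simulated trial: does a two-sided z-test reach significance?"""
        control = [random.gauss(0.0, 1.0) for _ in range(n)]
        treated = [random.gauss(effect, 1.0) for _ in range(n)]
        se = (stdev(control) ** 2 / n + stdev(treated) ** 2 / n) ** 0.5
        z = (mean(treated) - mean(control)) / se
        p = 2 * (1 - NormalDist().cdf(abs(z)))
        return p < alpha

    for n in (20, 100, 500):
        hits = sum(detects_effect(n) for _ in range(500))
        print(f"n = {n:3d} per arm: the real effect shows up in {hits / 5:.0f}% of trials")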
The Harvard statistician William Cochran said that experimenters always started a consultation by saying “I want to do an experiment to show that . . .” We are a hopeful species with an urge toward the positive, even in science: beneath the lab coat beats a human heart. Does this urge tip the results of clinical trials?
Statistically, it does. In many medical fields, published work shows a slight overrepresentation of positive results, what's called “publication bias,” since, after all, no news is no news.
All the things that affect the quality of a study (sample size, randomization, double-blinding, placebo selection, statistical power) err statistically in the same direction: the less perfect the study, the more likely a positive result. Not every trial can afford the 58,050 patients of ISIS-4; not every scientific committee insists on perfect methodology; and every such departure from absolute rigor increases the chance of seeing something that might not be there. Bias need not be intentional, or even human error (the researcher's desire to have something positive to say); it is inherent in the experimental process. We may think of a positive result as something hewn with great effort out of the surrounding randomness, but genuine randomness is actually harder to demonstrate. When it comes to seeking out the Null Hypothesis, the researcher is “it.”
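The same machinery, run as a toy model of publication bias: here the treatment does nothing at all, yet if only the trials that cross the significance threshold get written up, the literature still fills with positives (the numbers are invented and purely illustrative).

    import random
    from statistics import NormalDist, mean, stdev

    random.seed(2)

    def p_value(n=15):
        """One small trial of a treatment with no real effect; approximate z-test."""
        a = [random.gauss(0.0, 1.0) for _ in range(n)]
        b = [random.gauss(0.0, 1.0) for _ in range(n)]
        se = (stdev(a) ** 2 / n + stdev(b) ** 2 / n) ** 0.5
        z = (mean(b) - mean(a)) / se
        return 2 * (1 - NormalDist().cdf(abs(z)))

    trials = [p_value() for _ in range(2000)]
    published = [p for p in trials if p < 0.05]        # no news is no news
    print(f"{len(published)} 'positive' findings out of {len(trials)} trials of nothing")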
* * *
Doctors' mailboxes are full of glossy advertisements for new drugs, and doctors are eager to get their hands on new cures. A harassed GP hasn't time to trawl through peer-reviewed journals on the off chance of finding what he needs. His starting place has to be the flyer with the sample attached; FDA or MCA approval gives a guarantee that he's not dealing with crooks or wishful thinkers, but it is his responsibility to read the very fine print and decide if the product is safe, effective, and appropriate for his patients.
Richard Franklin has a Ph.D. in mathematics as well as an M.D.; he is also president of a medical information search company, so he is unusually aware of the situation our doctor is in: “He'll read that, in a double-blind placebo-controlled clinical trial, this drug was shown to be effective in the treatment of a particular problem. If it's a fatal disease, his presumption is that patients who took the drug didn't die, or at least that fewer people died. But if we have the leisure to look at the design of a clinical trial, we discover that there is a definition of the word ‘effective’, and that might be, for this particular trial, that there was a 30 percent reduction in the size of a lesion. You have your own understanding of ‘effective’, but in fact ‘effective’ has its specific definition in each individual context: a definition that aims to produce the minimum commercially acceptable difference.”
In many cases what a layman would consider to be the real clinical trial of a new drug takes place only after it has been approved, is on the market, and is being prescribed: postmarketing use. Over time, enough data accumulate that even if a drug fails for its intended indication, it may succeed for another. The most famous example is Viagra, which was initially tested as a hypertension reducer.
* * *
Numbers can be just as slippery as words. Suppose that you are a doctor and have been presented with a choice of four cancer-screening programs to recommend for your hospital: here are the results as laid out in your questionnaire. All you need to do is mark your grade for each on a line stretching from 0 (“would not support”) to 10 (“definitely would support”).
⢠Program A reduced the death rate by 34 percent
⢠Program B produced an absolute reduction in deaths of 0.06 percent
⢠Program C increased the patients' survival rate from 99.82 percent to 99.88 percent
⢠Program D meant that 1,592 patients needed to be screened to prevent 1 death.
Program A looks pretty good, doesn't it? Doctors and administrators who were given this questionnaire agreed: they gave it a score of 7.9 out of 10, well above its rivals. In fact, these numbers all describe exactly the same program.
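Working backward from the survival figures, a few lines reproduce all four descriptions from one underlying pair of death rates; because the published figures are rounded, the computed relative reduction and number needed to screen land near, rather than exactly on, the quoted 34 percent and 1,592.

    survival_without = 0.9982
    survival_with    = 0.9988

    death_without = 1 - survival_without        # 0.18 percent
    death_with    = 1 - survival_with           # 0.12 percent

    absolute_reduction = death_without - death_with          # Program B
    relative_reduction = absolute_reduction / death_without  # Program A
    needed_to_screen   = 1 / absolute_reduction              # Program D

    print(f"relative risk reduction: {relative_reduction:.0%}")        # ~33 percent
    print(f"absolute risk reduction: {absolute_reduction:.2%}")        # 0.06 percent
    print(f"survival: {survival_without:.2%} -> {survival_with:.2%}")  # Program C
    print(f"screened per death prevented: {needed_to_screen:.0f}")     # ~1,667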
The same misunderstanding appeared in studies of decisions by health purchasers in the UK, teaching-hospital doctors in Canada, physicians in the United States and Europe, and American pharmacists. All plumped for relative risk reduction, the percentage drop in the rate of deaths. We live in a world of percentages, but because they are a measure of proportion, not of absolute size, they contain the seeds of confusion. “Compared to what?” is never a pointless question.