How to Read a Paper: The Basics of Evidence-Based Medicine

The chapter ‘The Clinical Examination’ in Sackett and colleagues' book ‘Clinical epidemiology: a basic science for clinical medicine’ [14] provides substantial evidence that when examining patients, doctors find what they expect and hope to find. It is rare for two competent clinicians to reach complete agreement for any given aspect of the physical examination or interpretation of any diagnostic test. The level of agreement beyond chance between two observers can be expressed mathematically as the Kappa score, with a score of 1.0 indicating perfect agreement. Kappa scores for specialists in the field assessing the height of a patient's jugular venous pressure, classifying diabetic retinopathy from retinal photographs and interpreting a mammogram X-ray, were, respectively, 0.42, 0.55 and 0.67 [14].
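To make the idea of 'agreement beyond chance' concrete, here is a minimal sketch of one common form of the Kappa score (Cohen's kappa for two observers). The choice of Cohen's formula, the function name and the ratings below are my own illustrative assumptions; they are not taken from reference [14].

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters beyond that expected by chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Proportion of cases on which the two raters actually agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's own category frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical data: two clinicians judging jugular venous pressure in 10 patients.
clinician_1 = ["raised", "raised", "normal", "normal", "raised",
               "normal", "normal", "raised", "normal", "normal"]
clinician_2 = ["raised", "normal", "normal", "normal", "raised",
               "normal", "raised", "raised", "normal", "normal"]

print(round(cohen_kappa(clinician_1, clinician_2), 2))  # 0.58
```

In this made-up example the clinicians agree on 8 of 10 patients, but about half of that agreement would be expected by chance alone, which is why the score is well below 0.8.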

This digression into clinical disagreement should have persuaded you that efforts to keep assessors ‘blind’ (or, to avoid offence to the visually impaired, masked) to the group allocation of their patients are far from superfluous. If, for example, I knew that a patient had been randomised to an active drug to lower blood pressure rather than to a placebo, I might be more likely to re-check a reading that was surprisingly high. This is an example of performance bias, which, along with other pitfalls for the unblinded assessor, is listed in Figure 4.1.

An excellent example of controlling for bias by adequate ‘blinding’ was published in the Lancet a few years ago [15]. Majeed and colleagues performed an RCT that demonstrated, in contrast with the findings of several previous studies, that the recovery time (days in hospital, days off work and time to resume full activity) after laparoscopic removal of the gallbladder (the ‘keyhole surgery’ approach) was no quicker than that associated with the traditional open operation. The discrepancy between this trial and its predecessors may have been because of the authors' meticulous attempt to reduce bias (see Figure 4.1). The patients were not randomised until after induction of general anaesthesia. Neither the patients nor their carers were aware of which operation had been performed, as all patients left the operating theatre with identical dressings (complete with blood stains!). These findings challenge previous authors to ask themselves whether it was expectation bias (see section ‘Ten questions to ask about a paper that claims to validate a diagnostic or screening test’), rather than swifter recovery, which spurred doctors to discharge the laparoscopic surgery group earlier.

Were preliminary statistical questions addressed?

As a non-statistician, I tend only to look for three numbers in the methods section of a paper.

a. The size of the sample;
b. The duration of follow-up; and
c. The completeness of follow-up.

Sample size

One crucial prerequisite before embarking on a clinical trial is to perform a sample size (‘power’) calculation. A trial should be big enough to have a high chance of detecting, as statistically significant, a worthwhile effect if it exists, and thus to be reasonably sure that no benefit exists if it is not found in the trial.

In order to calculate sample size, the clinician must decide two things.

 
  • The level of difference between the two groups that would constitute a clinically significant effect. Note that this may not be the same as a statistically significant effect. To cite an example from a famous clinical trial of hypertension therapy, you could administer a new drug that lowered blood pressure by around 10 mmHg, and the effect would be a statistically significant lowering of the chances of developing stroke (i.e. the odds are less than 1 in 20 that the reduced incidence occurred by chance) [16]. However, if the people being asked to take this drug had only mildly raised blood pressure and no other major risk factors for stroke (i.e. they were relatively young, not diabetic, had normal cholesterol levels, etc.), this level of difference would only prevent around one stroke in every 850 patients treated, a clinical difference in risk which many patients would classify as not worth the hassle of taking the tablets. This was shown over 20 years ago and has been confirmed by numerous studies since (see a recent Cochrane review [17]). Yet far too many doctors still treat their patients according to the statistical significance of the findings of mega trials rather than the clinical significance for their patient; hence (some argue), we now have a near-epidemic of over-treated mild hypertension [18].
  • The mean and the standard deviation (abbreviated SD; see ‘a’ of section ‘Have the authors set the scene correctly?’) of the principal outcome variable.

If the outcome in question is an event (such as hysterectomy) rather than a quantity (such as blood pressure), the items of data required are the proportion of people experiencing the event in the population, and an estimate of what might constitute a clinically significant change in that proportion.

Once these items of data have been ascertained, the minimum sample size can be easily computed using standard formulae, nomograms or tables, which may be obtained from published papers [19], textbooks [20], free access websites (try http://www.macorr.com/ss_calculator.htm) or commercial statistical software packages (see, for example, http://www.ncss.com/pass.html). Hence, the researchers can, before the trial begins, work out how large a sample they will need in order to have a moderate, high or very high chance of detecting a true difference between the groups. The likelihood of detecting a true difference is known as the power of the study. It is common for studies to stipulate a power of between 80% and 90%. Hence, when reading a paper about an RCT, you should look for a sentence that reads something like this (which is taken from Majeed and colleagues' cholecystectomy paper described earlier) [15].

For a 90% chance of detecting a difference of one night's stay in hospital using the Mann–Whitney U-test [see Chapter 5, Table 5.1], 100 patients were needed in each group (assuming SD of 2 nights). This gives a power greater than 90% for detecting a difference in operating times of 15 minutes, assuming a SD of 20 minutes.
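For readers who would like to see what one of the 'standard formulae' looks like, here is a minimal sketch using the textbook normal approximation for comparing two means. It assumes a two-sided test at the 5% significance level (the quoted sentence does not state the level) and it is not the Mann–Whitney-based calculation the authors actually used, so it should not be expected to reproduce their figure of 100 exactly. The function name is my own.

```python
import math
from statistics import NormalDist


def n_per_group(delta, sd, power=0.90, alpha=0.05):
    """Approximate sample size per group to detect a difference in means of `delta`.

    Uses the normal approximation n = 2 * ((z_alpha + z_beta) * sd / delta) ** 2,
    assuming equal group sizes, a common SD and a two-sided test.
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = z.inv_cdf(power)           # value corresponding to the desired power
    return math.ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


# Figures from the quoted sentence: 1 night's difference with SD 2 nights,
# and 15 minutes' difference in operating time with SD 20 minutes.
print(n_per_group(delta=1, sd=2))    # 85
print(n_per_group(delta=15, sd=20))  # 38
```

Under these assumptions the approximation gives about 85 patients per group for hospital stay, a little below the 100 the authors recruited, and far fewer for operating time, which is consistent with their statement that 100 per group gave more than 90% power for that outcome.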

If the paper you are reading does not give a sample size calculation and it appears to show that there is no difference between the intervention and control arms of the trial, you should extract from the paper (or directly from the authors) the two items of information listed earlier (the clinically significant difference, and the mean and SD of the principal outcome variable) and do the calculation yourself. Underpowered studies are ubiquitous in the medical literature, usually because the authors found it harder than they anticipated to recruit their participants. Such studies typically lead to a Type II or β error, that is, the erroneous conclusion that an intervention has no effect. (In contrast, the rarer Type I or α error is the conclusion that a difference is significant when, in fact, it is because of sampling error.)
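As a rough illustration of doing the calculation yourself, the sketch below estimates the power of a trial that has already been run, given the number of participants per group. It uses the same normal approximation for a difference in means, assumes a two-sided 5% significance level and ignores the negligible chance of a 'significant' result in the wrong direction. The function name and the 20-per-group trial are hypothetical.

```python
from statistics import NormalDist


def approx_power(n_per_group, delta, sd, alpha=0.05):
    """Approximate power to detect a true difference in means of `delta`.

    Assumes two equal groups, a common SD and a two-sided test.
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)
    # Standard error of the difference between the two group means.
    standard_error = sd * (2 / n_per_group) ** 0.5
    return z.cdf(abs(delta) / standard_error - z_alpha)


# A difference of 1 night's stay with SD 2 nights, as in the cholecystectomy trial.
print(f"{approx_power(100, delta=1, sd=2):.2f}")  # 100 per group
print(f"{approx_power(20, delta=1, sd=2):.2f}")   # hypothetical small trial
```

With 100 per group the estimated power is roughly 94%, whereas a hypothetical trial of 20 per group would have only about a 35% chance of detecting the same difference, so a 'negative' result from the smaller trial would tell you very little.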

Duration of follow-up

Even if the sample size itself was adequate, a study must be continued for long enough for the effect of the intervention to be reflected in the outcome variable. If the authors were looking at the effect of a new painkiller on the degree of post-operative pain, their study may only have needed a follow-up period of 48 h. On the other hand, if they were looking at the effect of nutritional supplementation in the preschool years on final adult height, follow-up should have been measured in decades.

Even if the intervention has demonstrated a significant difference between the groups after, say, 6 months, that difference may not be sustained. As many dieters know from bitter experience, strategies to reduce obesity often show dramatic results after 2 or 3 weeks, but if follow-up is continued for a year or more, the unfortunate participants have (more often than not) put most of the weight back on.

Completeness of follow-up

It has been shown repeatedly that participants who withdraw from research studies are less likely to have taken their tablets as directed, more likely to have missed their interim check-ups and more likely to have experienced side effects on any medication, than those who do not withdraw (incidentally, don't use the term drop out as this is pejorative). People who fail to complete questionnaires may feel differently about the issue (and probably less strongly) than those who send them back by return of post. People on a weight-reducing programme are more likely to continue coming back if they are actually losing weight.

The following are among the reasons patients withdraw (or are withdrawn by the researchers) from clinical trials.

1. Incorrect entry of patient into trial (i.e. researcher discovers during the trial that the patient should not have been randomised in the first place because he or she did not fulfil the entry criteria).
2. Suspected adverse reaction to the trial drug. Note that you should never look at the ‘adverse reaction’ rate in the intervention group without comparing it with that on placebo. Inert tablets bring people out in a rash surprisingly frequently.
3. Loss of participant motivation (‘I don't want to take these tablets any more’).
4. Clinical reasons (e.g. concurrent illness, pregnancy).
5. Loss to follow-up (e.g. participant moves away).
6. Death. Clearly, people who die will not attend for their outpatient appointments, so unless specifically accounted for they might be misclassified as withdrawals. This is one reason why studies with a low follow-up rate (say, below 70%) are generally considered untrustworthy.

Ignoring everyone who has failed to complete a clinical trial will bias the results, usually in favour of the intervention. It is, therefore, standard practice to analyse the results of comparative studies on an intent-to-treat basis. This means that all data on participants originally allocated to the intervention arm of the study, including those who withdrew before the trial finished, those who did not take their tablets and even those who subsequently received the control intervention for whatever reason, should be analysed along with data on the patients who followed the protocol throughout. Conversely, withdrawals from the placebo arm of the study should be analysed with those who faithfully took their placebo. If you look hard enough in a paper, you will usually find the sentence, ‘results were analysed on an intent-to-treat basis’, but you should not be reassured until you have checked and confirmed the figures yourself.

There are, in fact, a few situations when intent-to-treat analysis is, rightly, not used. The most common is the efficacy (or per-protocol) analysis, which seeks to explain the effects of the intervention itself and is therefore an analysis of the treatment actually received. But even if the participants in an efficacy analysis are part of an RCT, for the purposes of the analysis they effectively constitute a cohort study (see section ‘Cohort studies’).
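The difference between the two analyses is easiest to see with numbers. The sketch below is purely illustrative: the participants, the field names and the rule that a per-protocol participant must both have received the allocated treatment and completed the trial are my own assumptions, and withdrawals are assumed to have a recorded outcome (here, 'not improved').

```python
from dataclasses import dataclass


@dataclass
class Participant:
    allocated: str   # arm assigned at randomisation
    received: str    # treatment actually taken
    completed: bool  # stayed in the trial to the end
    improved: bool   # outcome of interest


def intent_to_treat(participants, arm):
    """Everyone randomised to `arm`, whatever happened afterwards."""
    return [p for p in participants if p.allocated == arm]


def per_protocol(participants, arm):
    """Only those allocated to `arm` who received it and completed the trial."""
    return [p for p in participants
            if p.allocated == arm and p.received == arm and p.completed]


def success_rate(group):
    if not group:
        raise ValueError("cannot compute a rate for an empty group")
    return sum(p.improved for p in group) / len(group)


# Hypothetical trial with six participants randomised to each arm.
trial = (
    [Participant("drug", "drug", True, True)] * 3
    + [Participant("drug", "drug", True, False)]
    + [Participant("drug", "drug", False, False)]      # withdrew
    + [Participant("drug", "placebo", True, False)]    # ended up on the control
    + [Participant("placebo", "placebo", True, True)] * 2
    + [Participant("placebo", "placebo", True, False)] * 3
    + [Participant("placebo", "placebo", False, False)]  # withdrew
)

for arm in ("drug", "placebo"):
    itt = success_rate(intent_to_treat(trial, arm))
    pp = success_rate(per_protocol(trial, arm))
    print(f"{arm}: intent-to-treat {itt:.2f}, per-protocol {pp:.2f}")
```

In this made-up trial the drug's success rate is 50% by intent-to-treat but 75% per protocol, which shows how leaving out those who did not complete the trial can flatter the intervention.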

Summing up

Having worked through the Methods section of a paper, you should be able to tell yourself in a short paragraph what sort of study was performed, on how many participants, where the participants came from, what treatment or other intervention was offered, how long the follow-up period was (or, if a survey, what the response rate was) and what outcome measure(s) were used. You should also, at this stage, identify what statistical tests, if any, were used to analyse the data (see Chapter 5). If you are clear about these things before reading the rest of the paper, you will find the results easier to understand, interpret and, if appropriate, reject. You should be able to come up with descriptions such as those given here.

This paper describes an unblinded randomised trial, concerned with therapy, in 267 hospital outpatients aged between 58 and 93 years, in which four-layer compression bandaging was compared with standard single-layer dressings in the management of uncomplicated venous leg ulcers. Follow-up was six months. Percentage healing of the ulcer was measured from baseline in terms of the surface area of a tracing of the wound taken by the district nurse and calculated by a computer scanning device. Results were analysed using the Wilcoxon matched-pairs test.
This is a questionnaire survey of 963 general practitioners randomly selected from throughout the UK, in which they were asked their year of graduation from medical school and the level at which they would begin treatment for essential hypertension. Response options on the structured questionnaire were ‘below 89 mm Hg’, ‘90-99 mm Hg’ and ‘100 mm Hg or greater’. Results were analysed using a Chi-squared test on a 3 × 2 table to see whether the threshold for treating hypertension was related to whether the doctor graduated from medical school before or after 1985.
This is a case report of a single patient with a suspected fatal adverse drug reaction to the newly-released hypnotic drug Sleepol.

When you have had a little practice in looking at the Methods section of research papers along the lines suggested in this chapter, you will find that it is only a short step to start using the checklists in Appendix 1, or the more comprehensive Users' Guides to the Medical Literature (http://www.cche.net/usersguides/main.asp). I will return to many of the issues discussed here in Chapter 6, in relation to evaluating papers on trials of drug therapy and other simple interventions.
