Evidence-Based Practice & Medical Statistics
Because "I read something online" doesn't quite cut it in the AKT.
Last updated: April 2026 · Also known as Evidence-Based Medicine (EBM)
Downloads
Handouts, summaries, and teaching extras, ready when you are.
- kappa values - why you should know a bit about them.doc
- nnt - 3 examples.doc
- nnt-what is it.pdf
- nnt.docx
- p value in plain english.doc
- prevalence and incidence.doc
- sensitivity specificity and ppv (TEACHING RESOURCE).ppt
- sensitivity specificity ppv and npv.docx
- statistical terms made simple.pdf
- statistics - basic statistics.ppt
- statistics - including forest plots reader nnt pvalues aar rrr.doc
- statistics handbook - unit 1 basic ideas.pdf
- statistics handbook - unit 2 graphing data.pdf
- statistics handbook - unit 3 describing data - measures of location.pdf
- statistics handbook - unit 4 describing data - measures of spread.pdf
- statistics handbook - unit 5 probability, risk and odds.pdf
- statistics handbook - unit 6 estimation and confidence intervals.pdf
- statistics handbook - unit 7 hypothesis tests and p values.pdf
- statistics handbook - unit 8 relationships between variables.pdf
- statistics handbook - unit 9 diagnostic tests.pdf
Web Resources
A hand-picked mix of official guidance and real-world GP training resources. Because sometimes the best pearls are not hiding in the official documents.
One-Minute Recall
Scanning this before clinic, or the night before an AKT paper? These are the things that score you marks.
Risk Formulas
- ARR = CER − EER
- NNT = 1 ÷ ARR
- RRR = ARR ÷ CER
- RR = EER ÷ CER
- NNH = 1 ÷ ARI
Diagnostic Testing
- Sens = TP ÷ (TP+FN): SnNout (a negative result on a highly Sensitive test rules the disease out)
- Spec = TN ÷ (TN+FP): SpPin (a positive result on a highly Specific test rules the disease in)
- PPV = TP ÷ (TP+FP): falls with low prevalence
- NPV = TN ÷ (TN+FN): falls with high prevalence
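The four formulas above, and the way PPV and NPV shift with prevalence, can be sketched in a few lines of Python. This is an illustration only: the function names are made up for this example, and the test (90% sensitive, 90% specific) is hypothetical.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV and NPV from the four cells of a 2x2 table."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


def predictive_values(sensitivity, specificity, prevalence):
    """PPV and NPV of a test when it is used at a given disease prevalence."""
    tp = sensitivity * prevalence                # diseased, test positive
    fn = (1 - sensitivity) * prevalence          # diseased, test negative
    tn = specificity * (1 - prevalence)          # healthy, test negative
    fp = (1 - specificity) * (1 - prevalence)    # healthy, test positive
    return tp / (tp + fp), tn / (tn + fn)


# Hypothetical test: 90% sensitive and 90% specific, used in three settings.
for prevalence in (0.01, 0.10, 0.50):
    ppv, npv = predictive_values(0.90, 0.90, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.1%}, NPV {npv:.1%}")
```

With the same test, the PPV is roughly 8% when prevalence is 1% and 90% when prevalence is 50%, which is why a positive result means much less in a low-prevalence setting such as general practice.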
Study Designs
- SR/Meta-analysis: highest evidence
- RCT: gold standard for treatment
- Cohort: RR & incidence
- Case-control: OR, rare diseases
- Cross-sectional: prevalence
Graphs
- Forest plot: diamond crosses the line of no effect = not significant
- Funnel asymmetry = publication bias
- Cates plot: NNT = 100 ÷ yellow faces
- Box plot middle line = median
- I² >50% = substantial heterogeneity
Significance
- p < 0.05 = statistically significant
- CI crosses 1.0 (ratio) = not significant
- CI crosses 0 (difference) = not significant
- Mean = average; Median = middle value
- 68-95-99.7 rule for normal distribution
Bias Types
- Selection: unrepresentative sample
- Recall: cases remember more
- Publication: positive studies only
- Lead time: screening illusion
- Attrition: dropout distorts results
Why This Matters in GP & the AKT
EBM and statistics aren't just theoretical. In your consulting room, every conversation about treatment options involves NNTs whether you name them or not. Every blood test has a sensitivity and specificity. Every new guideline is based on a study design that affects how much trust you should place in it.
In the AKT, this topic accounts for a significant proportion of marks: roughly 10% of the paper according to RCGP guidance. It is one of the few areas where a small amount of targeted revision pays dividends immediately. Many candidates lose easy marks here not because the concepts are difficult, but because they've never sat down and learned them systematically.
The statistics questions in the AKT often present a table of trial data and ask you to calculate a value, interpret a graph, or identify the best study design. They reward methodical thinking, not medical knowledge. This makes them the most "learnable" marks in the paper.
Evidence-Based Medicine (EBM)
"When I was in training in the mid-1980s, I gave an intravenous infusion of lidocaine to every patient who came through the door after a heart attack. That was the standard. Everyone did it. It seemed to make perfect sense."
Professor Gordon Guyatt, the physician who coined the term "Evidence-Based Medicine", describing his own training before EBM existed
He later discovered that the practice he'd been trained in, and that hundreds of thousands of doctors worldwide were performing, was not only useless, but potentially killing people. Not through negligence. Not through incompetence. But because no one had ever properly tested whether it actually worked.
That story is why Evidence-Based Medicine exists, and why it matters deeply to every patient you will ever see.
"The conscientious, explicit and judicious use of current best evidence in making decisions about the care of individual patients."
David Sackett, BMJ 1996: the most widely cited definition in medicine
In plain English: rather than treating patients based on habit, opinion, tradition, or what your professor told you, EBM requires you to base clinical decisions on the best available research, rigorously conducted, critically appraised, and honestly interpreted.
It does not replace clinical judgement; it informs it. EBM rests on three inseparable pillars that must work together: the best available research evidence, the clinician's own expertise, and the patient's values and preferences.
How Did EBM Come About? A Brief History
EBM didn't appear from nowhere. It was the culmination of decades of quietly revolutionary thinking in Canada and the UK.
| Year | Event |
|---|---|
| 1938 | John Paul (Yale) coins the term "clinical epidemiology": the idea that medicine should be studied scientifically in populations, not just observed in individual patients |
| 1967 | McMaster University (Hamilton, Canada) opens its new medical school with a Department of Clinical Epidemiology and Biostatistics, radical at the time and dedicated to applying research methods to clinical decisions |
| 1972 | Archie Cochrane, a Scottish epidemiologist, publishes Effectiveness and Efficiency: Random Reflections on Health Services, a landmark text arguing that medicine must test its own treatments rigorously. His work eventually gives birth to the Cochrane Collaboration, named in his honour. |
| 1981 | David Sackett and colleagues at McMaster publish a nine-article series in the Canadian Medical Association Journal teaching clinicians how to critically appraise medical literature. This is the formal beginning of the EBM movement. |
| 1990 | Gordon Guyatt, a young resident director at McMaster, designs a new teaching programme and initially calls it "Scientific Medicine." Colleagues recoil: the implication that current practice isn't scientific is too direct. |
| 1991 | Guyatt renames the approach "Evidence-Based Medicine" and publishes the term in an editorial in the ACP Journal Club. The phrase sticks immediately. |
| 1992 | The landmark JAMA paper, "Evidence-Based Medicine: A New Approach to Teaching the Practice of Medicine", introduces EBM to the world. The response, Guyatt recalls, was initially "rage." Colleagues felt they were being told they weren't good doctors. |
| 1993 | The Cochrane Collaboration is formally founded: an international network to produce and disseminate systematic reviews of healthcare evidence. |
| 1996 | David Sackett publishes the definitive three-pillar definition in the BMJ. EBM becomes mainstream. |
| 2000s to present | EBM becomes embedded in UK training: NICE guidelines, the GMC's Good Medical Practice, and the RCGP curriculum all require it. It is the foundation of how every UK doctor is now trained and assessed. |
What Was Medicine Like Before EBM? The Problem It Solved
Before EBM, medicine ran on what Gordon Guyatt memorably called "GOBSAT": Good Old Boys Sitting Around a Table. Clinical guidelines were written by senior experts who pooled their personal opinions, and what happened to your patient depended entirely on which doctor happened to see them.
- Eminence-based medicine: You treated patients the way your professor did. Authority came from seniority, not evidence. A consultant who had "always done it this way" for 30 years was deferred to, even if "this way" had never been tested.
- Intuition-based medicine: If a treatment seemed to make physiological sense, it was used. If suppressing abnormal heart rhythms seemed logical, you suppressed them. Whether it actually helped patients was rarely tested.
- Anecdote-based medicine: "In my experience, I've found that..." was the standard of evidence. Individual cases drove practice, even when those cases were statistical outliers.
- Enormous variation: The same patient presenting to two different hospitals, or even two different doctors in the same hospital, might receive completely different treatment for exactly the same condition.
The Example That Changed Medicine: The CAST Trial
This is not a hypothetical. It is one of the most important true stories in modern medicine, and one of the strongest arguments for why EBM is needed.
The Setup: The Logic That Seemed Unassailable
Heart attacks cause dangerous heart rhythm abnormalities (ventricular arrhythmias). Ventricular arrhythmias cause sudden death. Therefore: suppress the arrhythmias and you prevent sudden death. This seemed so obviously right that from the 1970s onwards, antiarrhythmic drugs (particularly lidocaine, flecainide, and encainide) were routinely given to post-MI patients in hospitals across the world. Not occasionally. Routinely. As standard care.
The Trial: Someone Actually Tested It
In 1987, the Cardiac Arrhythmia Suppression Trial (CAST) enrolled over 1,700 post-MI patients and randomised them to antiarrhythmic drugs (flecainide or encainide) or placebo. The drugs did exactly what they were supposed to: they successfully suppressed the arrhythmias. But something unexpected happened.
The Result: What Nobody Expected
Patients on the drugs were 2.5 times more likely to die than those on placebo. The trial had to be stopped early because the harm was so clear. The drugs had been killing the very patients they were meant to protect.
NNH = 21. For every 21 patients treated with flecainide or encainide, one additional person died who would otherwise have survived.
The Lesson
Gordon Guyatt, then a young physician, had personally given lidocaine infusions to every post-MI patient who came through his ward. He was following best practice. He had been taught correctly. He had good intentions. And yet, without the rigorous test of an RCT, neither he nor his colleagues had any way of knowing the treatment was harmful. This experience became central to why he dedicated his career to EBM. The history of medicine, he later said, "is full of treatments that were based mostly on guess-work and intuition rather than solid evidence."
Before EBM, what you got depended on where you happened to live, which hospital you attended, and which doctor saw you. The same patient with the same condition might receive completely different treatments in Leeds and London. Different hospitals. Different countries. Wildly different outcomes.
EBM changed this. By anchoring clinical decisions to the same body of evidence (the same trials, the same systematic reviews, the same guidelines), it gave medicine a common language and a common standard. Today, a patient presenting with an MI in Bradford and a patient presenting with an MI in Bristol should receive essentially the same evidence-based care. Not because doctors are identical, but because the treatment is driven by the evidence, not by individual preference.
In the UK, this is operationalised through NICE guidelines, the RCGP curriculum, QOF indicators, clinical audits, and MRCGP examinations, all of which require and assess evidence-based practice. When you sit the AKT, you are being tested on your ability to apply this framework.
A World Without EBM: The International Picture
The UK's commitment to EBM (through NICE, the NHS, and postgraduate training) is not universal. In many parts of the world, what you receive as a patient still depends heavily on who sees you, where you present, and how much you can pay. Understanding this helps you appreciate what EBM protects your patients from.
The variation described below reflects healthcare systems and structures, not the competence or dedication of individual doctors. Many brilliant, hard-working physicians practise in every country listed. The issue is the absence of the standardising infrastructure (guidelines, oversight, training frameworks) that EBM provides. Individual doctors cannot overcome systemic problems alone.
| Country / Region | How Practice Varies Without Strong EBM Frameworks |
|---|---|
| India (private sector) | A 2018 Lancet study found C-section rates of 40-58% in private hospitals compared to 10-14% in public facilities, often driven by financial incentives rather than clinical need. Over-investigation and polypharmacy are widely documented in the private sector. The same cancer patient may receive dramatically different treatment based on where they present and what they can afford. |
| Pakistan | Significant variation in adherence to antibiotic guidelines, with one of the highest antibiotic prescription rates in South Asia. Drug-resistant TB and antimicrobial resistance are direct consequences. Access to specialist care and standardised management pathways is highly dependent on geography and income. |
| Nigeria / Ghana | Magnesium sulphate is the WHO-recommended, evidence-based, inexpensive treatment for eclampsia. Studies show it reduces maternal mortality significantly. Yet availability and actual use in Nigerian and Ghanaian facilities varies enormously depending on hospital resources and clinician training, meaning whether a woman with eclampsia lives or dies may depend on which facility she reaches. |
| Sudan / Iraq | Prolonged conflict and instability have devastated healthcare infrastructure. In Iraq after 2003, public health services collapsed, guideline implementation stalled, and access to basic drugs became geography-dependent. In Sudan, conflict has disrupted vaccination programmes, maternal health services, and chronic disease management. Practice variation in such environments is not a matter of preference; it is a matter of what is available. |
| Egypt / Iran | Both countries have medical schools producing skilled physicians and have published EBM guidelines, but implementation is inconsistent between public and private sectors, and between urban and rural areas. In Iran, international sanctions have affected drug availability, forcing adaptations that diverge from evidence-based protocols. |
| Romania | Romania has a well-documented practice of plicul (the "envelope"): informal cash payments to doctors and nurses to ensure care. Officially illegal, widely practised. Parliamentary enquiries and investigative journalism have confirmed that the quality of surgical care can depend on what a patient can pay privately, regardless of their official NHS-equivalent entitlement. Brain drain has removed an estimated 14,000+ doctors since EU accession. |
| United States | The most expensive healthcare in the world (over $12,000 per person per year), yet outcomes are often no better than in the UK. The Dartmouth Atlas project has documented enormous geographic variation in clinical practice: the same patient in Miami may receive twice as many investigations and procedures as the same patient in Minneapolis, with no difference in outcomes. A 2019 JAMA study estimated that up to $935 billion, roughly a quarter of all US healthcare spending, is wasted, including on unnecessary care. The opioid crisis was partly fuelled by pharmaceutical companies influencing prescribing practices outside of EBM frameworks. |
| France | France has excellent healthcare, but antibiotic prescribing rates have historically been among the highest in Europe, driven partly by cultural expectations that a consultation should always end with a prescription. Campaigns to reduce this ("Antibiotics are not automatic") have helped, but the pattern illustrates how cultural and commercial pressures can override evidence-based guidance even in sophisticated systems. |
| Italy | Healthcare quality in Northern Italy (Milan, Bologna) is among the best in Europe. In parts of Southern Italy, the picture is very different: longer waiting times for cancer surgery, less consistent application of screening programmes, lower adherence to guideline-based care. Geographic origin within the same country can significantly affect outcomes. |
| Greece / Spain | Greece's austerity crisis (2010-2015) led to healthcare spending cuts of over 25%, causing documented shortages of medicines, staff reductions, and quality deterioration. Over 35,000 healthcare workers emigrated. In Spain, significant regional variation in cancer survival rates has been documented: the tumour you develop may behave differently depending on which region you happen to live in, not because the biology differs, but because the system's application of evidence-based treatment does. |
In healthcare systems without robust EBM frameworks or universal entitlements, the relationship between payment and treatment quality is often direct and documented:
- In some Indian private hospitals, a patient presenting with chest pain who can pay for private care may receive immediate catheterisation and stenting. The same patient in the public system may wait hours for an ECG.
- In parts of Sub-Saharan Africa, whether a child with severe malaria receives artemisinin-based combination therapy (the evidence-based standard) or an older, less effective drug depends on which facility they reach and what their family can pay.
- In countries without universal drug access, cancer chemotherapy agents may be available only to those who can pay out-of-pocket, meaning identical cancers have dramatically different outcomes based solely on income.
- In Romania and some other Eastern European countries, the quality of a surgical procedure (the surgeon's diligence, the quality of anaesthesia monitoring, even the availability of post-operative nursing) has been documented to depend on informal payment, not clinical need.
Every time you board a commercial flight, you benefit from one of the most effective safety systems humans have ever built. Not because individual pilots are exceptionally talented, though they are. But because aviation is built around standardised, evidence-tested protocols. Every pilot, every airline, every country follows the same pre-flight checklists, the same landing procedures, the same emergency protocols. The system protects you regardless of which individual pilot you get.
Medicine without EBM is aviation without checklists. Your safety depends entirely on whether you happen to get a good pilot, whether that pilot trained recently enough, whether they are having a good day, and whether they've heard the latest thinking from someone they trust.
EBM replaces luck with systems. It replaces "in my experience" with "in 15,000 trials involving 2 million patients." It replaces the opinion of whoever happens to be most senior in the room with the accumulated evidence of humanity's collective clinical experience. Wouldn't you prefer that for your patients? Wouldn't your patients prefer it for themselves?
- Every NICE guideline you follow is the product of systematic evidence review: someone has done the work of ensuring that what you do is supported by the best available science
- Every AKT question on statistics and research methods is testing your ability to critically appraise evidence: to be an active consumer of EBM, not a passive follower of instructions
- When you explain an NNT to a patient, or discuss the limitations of a screening test, or refuse to prescribe an antibiotic that isn't indicated, you are practising EBM: consciously, explicitly, and judiciously
- And when a drug rep sits across from you and tells you their new medication reduces cardiovascular events by 35%, your first question, "35% relative or absolute?", is the question that EBM taught medicine to ask
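To see why "relative or absolute?" matters, here is a minimal Python sketch that applies the same 35% relative risk reduction to three baseline risks. The baseline risks are invented for illustration, and `absolute_effect` is a hypothetical helper, not a standard function.

```python
def absolute_effect(baseline_risk, rrr):
    """Turn a relative risk reduction into an ARR and NNT for one baseline risk."""
    arr = baseline_risk * rrr   # absolute risk reduction
    nnt = 1 / arr               # number needed to treat (before rounding up)
    return arr, nnt


# The same "35% reduction" applied to high-, medium- and low-risk patients.
# The baseline risks below are illustrative assumptions, not trial data.
for baseline in (0.20, 0.02, 0.002):
    arr, nnt = absolute_effect(baseline, rrr=0.35)
    print(f"baseline risk {baseline:.1%}: ARR {arr:.2%}, NNT about {nnt:.0f}")
```

The relative reduction is identical in all three rows, but the NNT runs from about 14 in the high-risk group to over 1,400 in the low-risk group. The headline percentage alone tells you very little.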
Study Designs & Hierarchy of Evidence
Before you interpret any result, you need to know where it came from. Different study designs answer different questions, generate different statistics, and carry different levels of reliability.
The Evidence Pyramid
Strongest evidence at the top; weakest at the bottom
| Study Design | Direction | Best For | Generates | Key Weakness |
|---|---|---|---|---|
| Systematic Review / Meta-Analysis | N/A | Best overall evidence on a question | Pooled effect size | Only as good as underlying studies; heterogeneity |
| RCT (Randomised Controlled Trial) | Forward | Does treatment X work? | RR, ARR, NNT | Expensive, artificial setting, ethical issues |
| Cohort Study | Forward (prospective) or backward (retrospective) | Does exposure cause outcome? Incidence? | Relative Risk (RR), Incidence | Attrition; expensive over time; confounding |
| Case-Control Study | Backward | Rare diseases; risk factors | Odds Ratio (OR) | Recall bias; cannot calculate incidence directly |
| Cross-Sectional Study | Single snapshot | How common is X right now? (prevalence) | Prevalence | Cannot establish causation; temporal ambiguity |
| Case Report / Expert Opinion | N/A | Hypothesis generation; rare events | Description only | Highly susceptible to bias; not generalisable |
Qualitative vs Quantitative Research
Quantitative research: Answers "how many" or "how much." Uses numbers, statistics, and structured data. Examples: RCTs, cohort studies, surveys with numerical outcomes. Generates p-values, CIs, NNTs.
Qualitative research: Answers "why" or "how." Uses words, themes, and interviews. Examples: focus groups, ethnographic studies, grounded theory. Explores patient experiences and beliefs.
Systematic Review vs Meta-Analysis: Not the Same Thing
Systematic Review: A rigorous, structured literature search that identifies, selects, and critically appraises all relevant studies on a question. The result is a qualitative summary of the evidence.
Meta-Analysis: A systematic review that goes one step further by mathematically pooling the quantitative results from multiple studies into a single combined estimate. Not all systematic reviews include a meta-analysis (e.g., if studies are too heterogeneous to combine).
The PICO Framework
PICO is the standard framework for structuring a clinical research question. It is used to search the literature and design studies.
| Letter | Stands For | Example |
|---|---|---|
| P | Population / Patient | Adults with type 2 diabetes |
| I | Intervention | SGLT2 inhibitors |
| C | Comparison | Metformin alone |
| O | Outcome | Cardiovascular events at 5 years |
RCT Design Features (commonly tested)
| Feature | What It Means | Why It Matters |
|---|---|---|
| Randomisation | Participants allocated to groups by chance | Eliminates selection bias; balances confounders |
| Single blind | Participants don't know their allocation | Reduces placebo effect and participant bias |
| Double blind | Neither participants nor investigators know allocation | Eliminates observer bias AND participant bias |
| Triple blind | Participants, investigators, AND data analysts blinded | Maximum bias reduction |
| Intention to Treat (ITT) | Analysed in their original group regardless of adherence | Preserves randomisation; reflects real-world use |
| Per Protocol | Analysed only if they completed the protocol | Shows biological efficacy but overestimates real-world benefit |
| Crossover design | Participants receive both treatments in sequence | Each person acts as their own control; needs washout period |
| Allocation concealment | The person recruiting participants cannot see which group the next participant will be assigned to until after they have been enrolled | Prevents the recruiter from subconsciously (or deliberately) allocating healthier patients to the treatment group, a form of selection bias that randomisation alone does not prevent |
| Cluster RCT | Whole groups (e.g. GP practices, wards, schools) are randomised rather than individuals | Used when individual randomisation is impractical (e.g. testing a new consultation style). Requires larger sample sizes and statistical adjustment for clustering effect |
Intention to treat gives a more conservative (lower) estimate of effectiveness because it includes non-adherent participants. This is the preferred analysis for clinical decisions. Per protocol overestimates effectiveness but is useful for understanding biological mechanism.
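The difference between the two analyses is easy to see on a toy dataset. The sketch below uses an entirely made-up trial (the numbers, the `Participant` class and the helper functions are all assumptions for illustration) in which 20 of 100 treatment-arm patients stopped the drug and had more events.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    arm: str       # group assigned at randomisation: "treatment" or "control"
    adhered: bool  # completed the protocol as allocated
    event: bool    # had the outcome the trial is trying to prevent


def make_group(arm, adhered, n, events):
    """Build n hypothetical participants, the first `events` of whom had the outcome."""
    return [Participant(arm, adhered, i < events) for i in range(n)]


def event_rate(participants, arm, per_protocol=False):
    """Event rate in one arm; per_protocol=True drops non-adherent participants."""
    group = [
        p for p in participants
        if p.arm == arm and (p.adhered or not per_protocol)
    ]
    return sum(p.event for p in group) / len(group)


# Made-up trial: 20 of 100 treatment patients stopped the drug and did worse.
trial = (
    make_group("treatment", adhered=True, n=80, events=8)
    + make_group("treatment", adhered=False, n=20, events=6)
    + make_group("control", adhered=True, n=100, events=20)
)

for label, pp in (("Intention to treat", False), ("Per protocol", True)):
    arr = event_rate(trial, "control", pp) - event_rate(trial, "treatment", pp)
    print(f"{label}: ARR = {arr:.0%}")
```

In this toy example the intention-to-treat ARR is about 6%, while the per-protocol ARR is about 10%. Dropping the patients who could not stay on the drug makes it look better than it would perform in real-world use.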
Superiority, Non-Inferiority & Equivalence Trials
Not all trials ask the same question. The AKT occasionally tests whether you understand what a trial was actually designed to show, and why that matters when interpreting its results.
| Trial Type | The Question Being Asked | Common Context |
|---|---|---|
| Superiority trial | "Is the new treatment better than the comparator?" | Most standard RCTs: testing a genuinely new drug or approach |
| Non-inferiority trial | "Is the new treatment no worse than the comparator by more than a pre-specified small margin?" | New drug with fewer side effects, lower cost, or easier to administer; the aim is to show it's "good enough" |
| Equivalence trial | "Are the two treatments essentially the same?" | Biosimilar drugs; generic medicines; different routes of administration |
A new anticoagulant might be shown to be "non-inferior" to warfarin for stroke prevention (not better, but not meaningfully worse) while being easier to use (no INR monitoring). That's a clinically valuable finding even if the drug didn't "beat" warfarin. The AKT may ask you to interpret a non-inferiority trial result correctly.
A non-inferiority trial that shows "no significant difference" is not the same as a superiority trial that shows "no significant difference." In a superiority trial, a non-significant result means you failed to prove the new drug works better. In a non-inferiority trial, finding no meaningful difference is the aim, but non-inferiority is only demonstrated when the confidence interval for the difference stays within the pre-specified margin. A non-significant p-value on its own does not establish it.
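A rough sketch of how a non-inferiority result is read, assuming a simple Wald (normal approximation) 95% confidence interval for the risk difference and a hypothetical margin of 2 percentage points. The event counts are invented, and real trials pre-specify both the margin and the analysis method.

```python
import math


def risk_difference_ci(events_new, n_new, events_old, n_old, z=1.96):
    """Wald 95% CI for the risk difference (new treatment minus comparator)."""
    p_new = events_new / n_new
    p_old = events_old / n_old
    diff = p_new - p_old
    se = math.sqrt(p_new * (1 - p_new) / n_new + p_old * (1 - p_old) / n_old)
    return diff - z * se, diff + z * se


def is_non_inferior(ci_upper, margin):
    """For a bad outcome, non-inferiority needs the whole CI below the margin."""
    return ci_upper < margin


MARGIN = 0.02  # hypothetical: "no more than 2 percentage points worse"

# Two made-up trials with the same event rates (5.0% vs 5.2%) but different sizes.
for label, counts in (("large trial", (50, 1000, 52, 1000)),
                      ("small trial", (25, 500, 26, 500))):
    lower, upper = risk_difference_ci(*counts)
    print(f"{label}: CI {lower:+.3f} to {upper:+.3f}, "
          f"non-inferior: {is_non_inferior(upper, MARGIN)}")
```

Both trials show "no significant difference" because each interval includes zero. Only the larger one has an interval narrow enough to sit entirely below the margin, so only the larger one demonstrates non-inferiority.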
Research Bias, Validity & Reliability
| Type of Bias | Definition | Which Study Designs? | How to Reduce It |
|---|---|---|---|
| Selection Bias | Participants are not representative of the target population | All types | Randomisation; careful sampling |
| Recall Bias | Cases (people with disease) remember past exposures more vividly than controls | Case-control studies | Objective data sources; standardised questioning |
| Publication Bias | Positive/significant studies are published more often than negative ones | Meta-analyses (detected via funnel plot) | Trial registration; grey literature search |
| Attrition Bias | Loss of participants to follow-up distorts results (dropouts differ from completers) | Cohort studies, RCTs | Intention-to-treat analysis; minimise dropout |
| Lead Time Bias | Screening gives the illusion of improved survival by detecting disease earlier | Screening studies | Use disease-specific mortality, not survival from diagnosis |
| Length Bias | Screening detects more slow-growing (less aggressive) disease | Screening studies | RCTs with disease-specific outcomes |
| Observer / Assessment Bias | Knowledge of treatment allocation affects outcome assessment | RCTs | Blinding (single, double, triple) |
| Hawthorne Effect | Participants change their behaviour because they know they are being observed | All types, especially observational | Control groups; blind observers |
| Verification Bias | Only patients with a positive test result get the gold-standard confirmatory test, so sensitivity appears falsely high and specificity falsely low | Diagnostic test studies | Ensure all patients (positive and negative) receive the gold-standard test |
| Confounding | A third variable is associated with both exposure and outcome, distorting the apparent relationship | Observational studies | Randomisation (RCTs); stratification; multivariate analysis |
Confounding: When Two Things Look Linked But Aren't
Confounding is one of the most important concepts in research methodology, and one of the most common reasons why apparently convincing observational findings turn out to be wrong. The key idea is simple: two things can appear to be linked not because they directly affect each other, but because both are linked to a hidden third variable.
A confounder (or confounding variable) is a third variable that is independently associated with both the exposure and the outcome. It creates a spurious (false) association between the exposure and outcome you are studying, or masks a real one.
Classic Examples
| Apparent Link | The Confounder | Why It Explains Everything |
|---|---|---|
| Ice cream sales and drowning deaths | Hot weather (summer) | Hot weather causes both more ice cream eating AND more swimming, and so more drownings. Ice cream doesn't cause drowning. |
| Coffee drinking and lung cancer | Smoking | Smokers drink more coffee on average. Early studies linked coffee to cancer, until smoking was controlled for. |
| Carrying a lighter and lung cancer | Smoking | Smokers carry lighters. The lighter has no biological effect; smoking does. |
| Grey hair and heart disease | Age | Both grey hair and heart disease increase with age. Age is the confounder, not hair colour. |
| Shoe size and reading ability (in children) | Age | Older children have bigger feet AND read better. Age explains both. |
A variable is a confounder if it meets all three of these:
- It is associated with the exposure (the thing you're studying)
- It is associated with the outcome (the result you're measuring)
- It is not on the causal pathway between exposure and outcome (it's a separate third variable, not a step in between)
Confounding can be controlled in several ways:
- Randomisation (in RCTs): distributes known and unknown confounders equally between groups. This is the strongest protection.
- Restriction: only enrol participants who are similar on the confounder (e.g. only non-smokers in the coffee study)
- Matching: pair cases and controls on the confounder variable
- Stratification: analyse results separately for each level of the confounder
- Multivariate statistical adjustment: statistically account for multiple confounders simultaneously
In a well-conducted RCT, randomisation distributes confounders (both known and unknown) equally between groups on average, removing confounding as a likely explanation for differences. In cohort and case-control studies, you can only adjust for confounders you know about and have measured. Unmeasured confounders always remain a potential explanation for any observed association, which is why observational studies can never definitively prove causation.
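Stratification is the easiest of these methods to see in numbers. The sketch below uses a hypothetical cohort for the coffee and lung cancer example; every count is invented so that coffee has no effect at all within smokers or within non-smokers.

```python
def relative_risk(events_exposed, n_exposed, events_unexposed, n_unexposed):
    """Risk in the exposed group divided by risk in the unexposed group."""
    return (events_exposed / n_exposed) / (events_unexposed / n_unexposed)


# Hypothetical cohort. Exposure = coffee drinking, outcome = lung cancer.
# Each tuple: (cancers in coffee drinkers, coffee drinkers,
#              cancers in non-drinkers, non-drinkers)
strata = {
    "smokers": (80, 800, 20, 200),
    "non-smokers": (2, 200, 8, 800),
}

# Crude analysis: ignore smoking and pool everyone together.
pooled = [sum(column) for column in zip(*strata.values())]
crude_rr = relative_risk(*pooled)
print(f"Crude RR: {crude_rr:.2f}")

# Stratified analysis: look at smokers and non-smokers separately.
stratum_rr = {name: relative_risk(*counts) for name, counts in strata.items()}
for name, rr in stratum_rr.items():
    print(f"RR among {name}: {rr:.2f}")
```

Pooled together, coffee drinkers appear to have nearly three times the risk. Within each smoking stratum the relative risk is 1.0. The apparent link comes entirely from smokers being more likely to drink coffee.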
Validity & Reliability: What's the Difference?
Internal validity: Does the study measure what it claims to measure within the study population? Are the results of this study trustworthy? Threatened by bias and confounding.
External validity (generalisability): Can the results be applied to other populations or real-world settings? A highly controlled RCT in a specialist centre may not reflect what happens in primary care.
Reliability: Does the test produce consistent results when repeated under the same conditions? Measured by inter-rater reliability (kappa statistic) or test-retest reliability.
Kappa (κ) statistic: Measures agreement between two raters beyond chance. κ = 1 (perfect agreement); κ = 0 (agreement no better than chance); κ < 0 (worse than chance).
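As an illustration of "agreement beyond chance", here is a small sketch of Cohen's kappa for two raters giving yes/no ratings. The function name and the counts are made up for this example.

```python
def cohens_kappa(both_yes, a_yes_b_no, a_no_b_yes, both_no):
    """Cohen's kappa for two raters making yes/no judgements on the same cases."""
    n = both_yes + a_yes_b_no + a_no_b_yes + both_no
    observed = (both_yes + both_no) / n           # how often they actually agree
    a_yes = (both_yes + a_yes_b_no) / n           # rater A's overall "yes" rate
    b_yes = (both_yes + a_no_b_yes) / n           # rater B's overall "yes" rate
    expected = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)  # agreement by chance
    # Undefined (division by zero) if both raters give the same answer every time.
    return (observed - expected) / (1 - expected)


# Hypothetical: two GPs each rate the same 100 skin lesions as suspicious or not.
kappa = cohens_kappa(both_yes=40, a_yes_b_no=10, a_no_b_yes=5, both_no=45)
print(f"kappa = {kappa:.2f}")
```

Here the two GPs agree on 85% of lesions, but half of that agreement would be expected by chance alone, so kappa works out at 0.70 rather than 0.85.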
Measuring Risk & Treatment Effect
This is the most heavily tested area in AKT statistics. You need to know these formulas cold and be able to apply them to trial data tables under exam conditions.
The Key Metrics
NNT Interpretation: What Do The Numbers Actually Mean?
NNT tells you how many patients you need to treat for one to benefit. So the fewer patients you need to treat to get one benefit, the more effective the treatment. NNT = 2 means 1 in every 2 patients benefits, which is excellent. NNT = 100 means only 1 in 100 benefits, a much weaker effect.
| NNT Range | Rough Interpretation | Example Context |
|---|---|---|
| < 10 | Very effective | Antibiotics for certain infections; some acute treatments |
| 10-50 | Moderate effect | Many common preventative medications |
| > 100 | Weak effect | Some population-level preventive strategies |
These thresholds are not from NICE or RCGP โ they are rough teaching aids only. The "right" NNT is always context-dependent:
- An NNT of 100 might still be worthwhile if the outcome prevented is death or serious irreversible harm
- An NNT of 5 might not be acceptable if the treatment has frequent or serious side effects
One-line rule: NNT tells you how many patients you treat for one to benefit; lower = stronger effect. But always weigh it against the severity of the outcome and the burden of treatment.
If 60 more patients out of every 100 benefit on treatment than on control, that is an ARR of 60% (= 0.6), giving NNT = 1 ÷ 0.6 ≈ 1.7 (rounded up to 2), an excellent result. The NNT is not the number who benefit; it is the number you treat to get one benefit.
Odds Ratio & Hazard Ratio: When Are These Used?
| Measure | Used In | Interpretation |
|---|---|---|
| Relative Risk (RR) | Cohort studies, RCTs | Directly compares risk in two groups. More intuitive than OR. |
| Odds Ratio (OR) | Case-control studies | Compares odds of exposure in cases vs controls. Approximates RR when disease is rare. |
| Hazard Ratio (HR) | Survival analysis (time-to-event) | Like RR but accounts for when events occur over time. HR <1 = reduced hazard in treatment group. |
For common diseases, the OR overstates the size of the effect compared with the RR (it lies further from 1). For rare diseases (<10% prevalence), they are approximately equal. A case-control study generates an OR; you cannot directly calculate incidence or RR from a case-control study.
Worked Example: Calculating NNT from Trial Data
Scenario: Statin Trial
A 5-year RCT shows that among patients with high cardiovascular risk: 6% in the placebo group had a heart attack, compared to 4% in the statin group.
- ARR = CER − EER = 6% − 4% = 2%
- NNT = 1 ÷ ARR = 1 ÷ 0.02 = 50. Treat 50 patients for 5 years to prevent 1 heart attack.
- RRR = ARR ÷ CER = 2% ÷ 6% ≈ 33%. Sounds impressive! But the absolute benefit is only 2%.
- RR = EER ÷ CER = 4% ÷ 6% ≈ 0.67. The statin group had 67% of the risk of the placebo group, a 33% relative reduction.
When a patient asks "Will this statin help me?", use NNT. "If 50 people like you take this tablet for 5 years, 1 heart attack will be prevented. For you personally, it's a 2% absolute benefit." That's far more honest than "It reduces your risk by 33%."
Communicating Risk to Patients (AKT Favourite)
The AKT often tests how you would explain statistical information to patients. There are four main formats:
| Format | Example | Best For |
|---|---|---|
| Natural frequency | "5 out of every 100 people" | Easiest for patients to understand |
| Percentage | "5% of people" | Widely used but can mislead |
| NNT | "Treat 20 to prevent 1 event" | Communicates absolute benefit clearly |
| Cates plot | Visual grid of 100 faces | Best visual aid for shared decision-making |
Saying "this reduces your risk by 33%" without giving the baseline risk is misleading. A 33% RRR from a baseline of 0.3% means your absolute benefit is 0.1%. Always pair RRR with baseline risk or give ARR/NNT instead.
The Pharma Rep Trick: How Drug Companies Spin Statistics
Drug company representatives are trained to present trial data in the most favourable light. They use a simple but effective statistical sleight of hand:
- Benefits → quoted as Relative Risk Reduction (RRR), because it sounds bigger and more impressive
- Harms → quoted as Absolute Risk Increase (ARI), because it sounds smaller and less concerning
Worked Example: a fictional statin rep visit
A new statin reduces heart attacks from 2% to 1% over 5 years, but increases myopathy from 1% to 2%.
| What the rep says | What it actually means |
|---|---|
| "This drug reduces heart attacks by 50%" | RRR = 50%, but ARR is only 1%. NNT = 100. Treat 100 people for 5 years to prevent 1 heart attack. |
| "The myopathy risk increases by only 1%" | ARI = 1% (NNH = 100), but that's actually a doubling of the myopathy risk (Relative Risk Increase = 100%). |
The antidote: always ask "What is the absolute difference?" When a rep quotes a relative figure, convert it yourself: ARR = CER − EER, then NNT = 1 ÷ ARR. This applies equally to benefits and harms.
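The conversion can be sketched in a few lines of Python. This is an illustrative helper, not an official calculator: the name `both_framings` is made up here, and it simply applies the same arithmetic to a benefit (risk falls) and a harm (risk rises) so the two framings sit side by side.

```python
from fractions import Fraction
from math import ceil


def both_framings(risk_without_pct, risk_with_pct):
    """Return (relative change, absolute change, number needed).

    Risks are percentages (2 means 2%). The same arithmetic covers a
    benefit (risk falls: RRR, ARR, NNT) and a harm (risk rises: RRI, ARI, NNH).
    """
    # Fractions keep the arithmetic exact, so rounding up is trustworthy
    without = Fraction(str(risk_without_pct)) / 100
    with_drug = Fraction(str(risk_with_pct)) / 100
    absolute = abs(with_drug - without)
    relative = absolute / without
    number_needed = ceil(1 / absolute)
    return float(relative), float(absolute), number_needed


# Figures from the fictional rep visit
print(both_framings(2, 1))  # heart attacks 2% -> 1%: (0.5, 0.01, 100)
print(both_framings(1, 2))  # myopathy 1% -> 2%:      (1.0, 0.01, 100)
```

Seen this way, the benefit and the harm have the same absolute size (1%, number needed = 100); only the choice of framing makes one sound large and the other small.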
Diagnostic Testing & Screening: The 2×2 Table
The 2×2 contingency table is the foundation of all diagnostic statistics. If you can build and read this table, you can answer most diagnostic AKT questions.
The 2×2 Table
| Test result | Actual status: disease present | Actual status: disease absent |
|---|---|---|
| Test Positive | True Positive (TP): test says YES, has disease | False Positive (FP): test says YES, no disease |
| Test Negative | False Negative (FN): test says NO, has disease | True Negative (TN): test says NO, no disease |
SnNout & SpPin: The Memory Aids (and Why They Work)
SnNout: A highly Sensitive test, if Negative, rules the disease Out.
If sensitivity is very high, a negative test result means the disease is very unlikely (low false-negative rate). Use a sensitive test to screen and rule out.
SpPin: A highly Specific test, if Positive, rules the disease In.
If specificity is very high, a positive test result means disease is very likely (low false-positive rate). Use a specific test to confirm diagnosis.
The Effect of Prevalence on PPV & NPV: Critical AKT Topic
This is one of the most important and most tested concepts in diagnostic statistics. Sensitivity and specificity are fixed properties of the test. But PPV and NPV change dramatically depending on how common the disease is in the population you're testing.
| Scenario | Prevalence | PPV | NPV |
|---|---|---|---|
| Screening the general population for a rare disease (e.g. HIV in low-risk) | 1% | LOW (~16%) | HIGH (~99.9%) |
| Testing in a high-risk specialist clinic | 50% | HIGH (~95%) | HIGH (~95%) |
- As prevalence ↑ → PPV ↑, NPV ↓
- As prevalence ↓ → PPV ↓, NPV ↑
- In a low-prevalence population, most positive results are false positives (low PPV), even with a highly specific test
- A negative test in a high-prevalence population may still miss disease (lower NPV)
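The prevalence effect is easy to check numerically. The sketch below assumes a test with 95% sensitivity and 95% specificity; the text does not state the test's accuracy, so that figure is an assumption, chosen because it reproduces the approximate values in the table. The function name `predictive_values` is likewise illustrative.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    # Expected proportions of the tested population in each 2x2 cell
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    fp = (1 - specificity) * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)


# Assumed test accuracy: 95% sensitive, 95% specific
for prevalence in (0.01, 0.50):
    ppv, npv = predictive_values(0.95, 0.95, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.1%}, NPV {npv:.1%}")
# prevalence 1%: PPV 16.1%, NPV 99.9%
# prevalence 50%: PPV 95.0%, NPV 95.0%
```

The test itself never changes between the two runs; only the population does, and the PPV moves from roughly 1 in 6 to 19 in 20.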
Likelihood Ratios: When You Want to Go Further
Likelihood ratios (LRs) combine sensitivity and specificity into a single number that tells you how much a test result shifts the probability of disease: LR+ = sensitivity ÷ (1 − specificity) for a positive result, and LR− = (1 − sensitivity) ÷ specificity for a negative result. More advanced than PPV/NPV, but useful, and occasionally tested in AKT.
| Likelihood Ratio | Effect on Post-Test Probability |
|---|---|
| >10 | Large and often conclusive increase in probability |
| 5–10 | Moderate increase |
| 2–5 | Small increase |
| 1 | No change (test is useless) |
| 0.1–0.5 | Small to moderate decrease |
| <0.1 | Large decrease: strong negative rule-out |
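For anyone who wants to see the arithmetic behind a likelihood ratio, here is a hedged Python sketch. The function names and the 90%/90% test with a 20% pre-test probability are invented for illustration; the steps are the standard ones (convert probability to odds, multiply by the LR, convert back), which is what Fagan's nomogram does graphically.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) from a test's sensitivity and specificity."""
    lr_positive = sensitivity / (1 - specificity)
    lr_negative = (1 - sensitivity) / specificity
    return lr_positive, lr_negative


def post_test_probability(pre_test_probability, likelihood_ratio):
    """Shift a pre-test probability by a likelihood ratio, working in odds."""
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


# Hypothetical test: 90% sensitive, 90% specific; pre-test probability 20%
lr_pos, lr_neg = likelihood_ratios(0.90, 0.90)
print(round(lr_pos, 1), round(lr_neg, 2))           # 9.0 0.11
print(round(post_test_probability(0.20, 9), 2))     # 0.69 after a positive result
```

An LR+ of 9 sits in the "moderate increase" band of the table, and it lifts a 20% suspicion to roughly 69%.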
Population Statistics & Epidemiology
Standardised Mortality Ratio (SMR)
| SMR Value | Interpretation |
|---|---|
| = 100 | Mortality same as reference population |
| > 100 | Excess mortality (higher than expected) |
| < 100 | Lower mortality than expected |
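The table does not show how an SMR is obtained. By the usual definition it is observed deaths divided by the deaths expected if the reference population's rates applied, multiplied by 100. A minimal sketch, with a made-up function name and made-up numbers:

```python
def standardised_mortality_ratio(observed_deaths, expected_deaths):
    """SMR on the conventional scale where 100 = same as the reference population."""
    return 100 * observed_deaths / expected_deaths


# Hypothetical town: 60 deaths observed, 50 expected from national age-specific rates
print(standardised_mortality_ratio(60, 50))  # 120.0, i.e. 20% excess mortality
```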
Screening: Lead Time & Length Bias (Key AKT Traps)
Lead time bias: Screening detects a disease earlier, making it appear that survival has improved, even if the patient dies at the same point in time. The "survival time from diagnosis" has simply been extended by early detection, not by actual treatment benefit. The patient doesn't live longer; they just know for longer.
Length bias: Screening programmes are more likely to detect slow-growing, indolent disease (which has a longer "detectable preclinical phase") than aggressive disease that progresses quickly. This makes screening look more effective than it really is for severe disease.
Wilson & Jungner Screening Criteria
Before a screening programme is introduced, it should satisfy the Wilson & Jungner criteria (originally published 1968, still the standard framework). The AKT tests both knowledge of these criteria and application of them to scenarios.
| # | Criterion | What It Means In Practice |
|---|---|---|
| 1 | Important health problem | The condition has significant morbidity or mortality, so it is worth the effort of screening |
| 2 | Accepted treatment available | No point detecting disease you cannot treat |
| 3 | Facilities for diagnosis & treatment exist | Infrastructure must be in place before launching |
| 4 | Recognisable latent or early stage | The disease must have a detectable pre-symptomatic phase |
| 5 | Suitable test available | Test must be simple, safe, and reasonably accurate |
| 6 | Test acceptable to the population | People must be willing to undergo it; invasive or uncomfortable tests may deter uptake |
| 7 | Natural history adequately understood | Must know how the disease progresses if left untreated |
| 8 | Agreed policy on who to treat | Clear protocols needed: not just detection but what happens next |
| 9 | Cost-effective | Cost of finding each case must be balanced against benefit |
| 10 | Continuous process, not one-off | Screening must be ongoing, because disease incidence continues |
PSA screening for prostate cancer is not part of the NHS national screening programme precisely because it struggles with criteria 5 and 7: PSA is not a sufficiently accurate test (low specificity, so many false positives), and the natural history of many low-grade prostate cancers means they would never cause symptoms in the patient's lifetime. This links directly to overdiagnosis (see below).
Overdiagnosis: Finding Problems That Wouldn't Have Caused Problems
Overdiagnosis occurs when a real disease is detected (one that truly exists) but that disease would never have caused symptoms, harm, or death during the patient's lifetime if left undetected. It is not a false positive (the disease is real); it is a true positive that did not need to be found.
Overdiagnosis converts well people into patients. It exposes them to the anxiety, side effects, and risks of treatment for a condition that would never have harmed them. It is one of the most important harms of screening programmes, and one the AKT tests directly.
| Condition | Overdiagnosis Example |
|---|---|
| Prostate cancer | Many low-grade cancers detected by PSA would never progress or cause symptoms; men die with them, not from them |
| Thyroid cancer | Ultrasound finds tiny papillary thyroid cancers that are almost universally indolent; detection has soared but mortality is unchanged |
| DCIS (breast) | Ductal carcinoma in situ detected by mammography; some would never become invasive cancer |
Overdiagnosis = finding a disease that didn't need finding. Overtreatment = treating a disease that didn't need treating (which may follow overdiagnosis, or may occur independently). They are related but distinct concepts.
Age Standardisation
When comparing disease rates or mortality between different populations (e.g. different countries, different time periods), you need to adjust for the fact that those populations may have different age distributions. Older populations will naturally have higher mortality even if their health is equally good.
Age standardisation is a statistical technique that removes the distorting effect of different age distributions when comparing health outcomes between populations. It produces a rate that would be observed if both populations had the same age structure (the "standard population").
Cancer Statistics: AKT Must-Knows
The AKT occasionally tests specific cancer statistics. You do not need exhaustive oncology knowledge, but these headline figures come up and are worth knowing.
| Cancer | Key Statistic | Why It Matters |
|---|---|---|
| Testicular cancer | >98% 10-year survival | One of the most treatable cancers; important to know for counselling young men |
| Lung cancer | Leading cause of cancer death (UK) | Despite not being the most common cancer, it kills more people than any other |
A cancer can have a high incidence (common) but low mortality (treatable), like breast cancer. Or it can have a lower incidence but very high mortality, like pancreatic cancer. The AKT may test your ability to interpret cancer statistics correctly, so don't assume the most common cancer is the deadliest.
Health Inequalities
Health inequalities describe unfair, avoidable differences in health between different groups of people. They are a significant focus of UK public health policy and appear in the AKT in the context of epidemiology and social determinants of health.
Health inequalities are unfair, avoidable differences in health status or in the distribution of health determinants between different population groups. They are "avoidable" because they stem from social, economic, or environmental conditions that could in principle be changed, not from random chance or biological variation.
| Type | Examples in the UK |
|---|---|
| Socioeconomic | Lower life expectancy in deprived areas; higher rates of cardiovascular disease, diabetes, and mental illness in poorer communities |
| Geographic | "North-South divide": poorer health outcomes in parts of Northern England compared to the South |
| Ethnic | Higher rates of type 2 diabetes in South Asian populations; higher cardiovascular risk in Black populations |
| Gender | Men have lower life expectancy but women have more years of ill health (morbidity) |
Data Distribution & Statistical Significance
Measures of Central Tendency & Spread
| Measure | Definition | Best Used When | Watch Out |
|---|---|---|---|
| Mean | Sum of all values ÷ number of values | Data is normally distributed | Easily skewed by outliers |
| Median | Middle value when sorted in order. If there is an even number of values, the median is the average of the two middle numbers. | Skewed data (e.g. income, hospital stay length) | Ignores the actual values at extremes |
| Mode | Most frequently occurring value | Categorical data; bimodal distributions | Can be meaningless with continuous data |
| Range | Maximum − Minimum | Quick sense of spread | Entirely determined by the two most extreme values |
| IQR (Interquartile Range) | 75th percentile − 25th percentile | Paired with median for skewed data | Ignores upper and lower 25% |
| Standard Deviation (SD) | Average spread from the mean | Normally distributed data | Misleading if data is not normally distributed |
Types of Data: Nominal, Ordinal, Interval, Ratio
Understanding what type of data you have determines which summary statistics and which statistical tests are appropriate. The AKT tests this, usually by presenting a dataset and asking which test or measure to use.
| Type | Definition | Examples | Analogy |
|---|---|---|---|
| Nominal | Categories with no natural order | Blood group (A, B, AB, O); sex; eye colour; cause of death | A fruit bowl: apples and bananas are just different, neither is "more" |
| Ordinal | Ordered categories, but the gaps between them are not necessarily equal | NYHA heart failure class (I–IV); pain scale (mild/moderate/severe); Likert scales | Race positions: 1st, 2nd, 3rd. We know the order, but the gap between 1st and 2nd may be very different to the gap between 2nd and 3rd |
| Interval | Ordered with equal gaps between values, but no true zero | Temperature in °C; calendar dates; IQ scores | A thermometer: 0°C doesn't mean "no temperature." You can't say 20°C is "twice as warm" as 10°C in any absolute sense |
| Ratio | Ordered, equal gaps, AND a true absolute zero | Height, weight, blood pressure, age, income, drug dose | Money: £0 means you genuinely have nothing. £40 is twice as much as £20 |
- Nominal/Ordinal data → use non-parametric tests, mode or median for averages
- Interval/Ratio data (normally distributed) → can use mean, SD, parametric tests
- You cannot meaningfully calculate a mean for nominal data (e.g. "mean blood group" is nonsense) or make proportional statements with interval data (you can't say someone with an IQ of 120 is "twice as clever" as someone with 60)
Parametric vs Non-Parametric Tests
Statistical tests fall into two families depending on the assumptions they make about your data. The AKT tests which type of test is appropriate for a given scenario; you do not need to perform the calculations, just know when to use which.
Parametric tests assume the data is normally distributed (or the sample is large enough that this doesn't matter much) and that the data is at least interval-level. Non-parametric tests make no such assumptions; they work with ranks or categories and are suitable for skewed data, small samples, or ordinal/nominal data.
| Purpose | Parametric Test | Non-Parametric Equivalent |
|---|---|---|
| Compare means of 2 independent groups | Independent t-test | Mann-Whitney U test |
| Compare means: same group, 2 time points | Paired t-test | Wilcoxon signed-rank test |
| Compare means of 3 or more groups | ANOVA (Analysis of Variance) | Kruskal-Wallis test |
| Correlation between two continuous variables | Pearson correlation | Spearman rank correlation |
| Compare proportions / categorical data | None | Chi-squared test (χ²) |
- Data is skewed or not normally distributed → use non-parametric
- Small sample size (and normality uncertain) → use non-parametric
- Data is ordinal (e.g. pain scores, Likert) → use non-parametric
- Comparing proportions or categories → chi-squared test
- Large sample, continuous, approximately normal → parametric test is fine
Normal Distribution: The 68-95-99.7 Rule
A normal distribution is a symmetrical bell-shaped curve. The mean, median, and mode are all equal and sit at the centre.
| Range | % of Values Included |
|---|---|
| Mean ± 1 SD | 68% |
| Mean ± 2 SD | 95% |
| Mean ± 3 SD | 99.7% |
With positively skewed data (e.g. income, GP waiting times, serum bilirubin in a ward), the mean is pulled to the right by a few high outliers. In this case, use the median, which better represents the typical value. The AKT loves testing this distinction.
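A quick way to feel the difference is to compute both on a small skewed sample. The waiting times below are invented for illustration; the functions come from Python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical waiting times in days: most are short, one is very long
waits = [2, 3, 3, 4, 4, 5, 6, 45]

print(mean(waits))    # 9   (dragged upwards by the single 45-day wait)
print(median(waits))  # 4.0 (much closer to what a typical patient experienced)
```

Seven of the eight patients waited 6 days or fewer, so quoting a "mean wait of 9 days" would describe almost nobody.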
P-Values & Confidence Intervals: What They Really Mean
P-Values
The p-value is the probability of observing results at least as extreme as those seen if the null hypothesis is true (i.e. if there is no real effect). It is not the probability that the null hypothesis is correct.
| p-value | Interpretation |
|---|---|
| p < 0.05 | Statistically significant: less than a 5% probability of seeing a result at least this extreme if there were no real effect |
| p > 0.05 | Not statistically significant: the result is compatible with chance alone |
| p = 0.01 | 1% chance of getting a result at least this extreme if there is no real effect |
Confidence Intervals (CIs)
A 95% CI means: if you repeated the study 100 times and calculated a 95% CI each time, about 95 of those intervals would contain the true population value.
- For a ratio (RR, OR, HR): if 95% CI includes 1.0 → not statistically significant
- For a difference (mean difference, ARR): if 95% CI includes 0 → not statistically significant
- If the CI is entirely above 1 (for ratios) → significantly increased risk
- If the CI is entirely below 1 (for ratios) → significantly reduced risk
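The rules in this list reduce to one check: does the interval contain the "no effect" value? A small illustrative sketch follows (the function name `ci_is_significant` is an assumption for this example, not a standard library call):

```python
def ci_is_significant(lower, upper, measure="ratio"):
    """True if a confidence interval excludes the 'no effect' value.

    measure="ratio" (RR, OR, HR) uses 1.0; measure="difference"
    (mean difference, ARR) uses 0.
    """
    if measure == "ratio":
        null_value = 1.0
    elif measure == "difference":
        null_value = 0.0
    else:
        raise ValueError("measure must be 'ratio' or 'difference'")
    return not (lower <= null_value <= upper)


print(ci_is_significant(1.4, 1.6))                  # True: entirely above 1
print(ci_is_significant(0.6, 3.8))                  # False: crosses 1
print(ci_is_significant(-0.5, 2.0, "difference"))   # False: crosses 0
```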
Type I & Type II Errors: Crying Wolf vs Missing the Wolf
Type I error (α): Rejecting the null hypothesis when it is actually true. In other words: concluding that a treatment works when it doesn't. The false positive rate. Conventionally acceptable at 5% (α = 0.05, p < 0.05).
"Crying wolf": sounding the alarm when there's no wolf.
Type II error (β): Failing to reject the null hypothesis when it is actually false. In other words: missing a true treatment effect. The false negative rate. Conventionally acceptable at 20% (β = 0.2, power = 80%).
"Missing the wolf": saying no wolf when there is one.
Power = 1 − β. It is the probability of correctly detecting a true effect. Higher power = less likely to miss a real effect. Power increases with larger sample sizes. A well-designed trial needs ≥80% power.
Statistical Significance ≠ Clinical Importance
This is one of the most important and most tested nuances in AKT statistics, and one that many trainees miss.
A result can be statistically significant (p < 0.05) while being clinically meaningless. Statistical significance only tells you the result is unlikely to be due to chance; it says nothing about whether the effect is large enough to matter in practice.
| Scenario | Statistically Significant? | Clinically Important? |
|---|---|---|
| BP drops 1 mmHg, n=50,000, p=0.001 | Yes | No |
| BP drops 15 mmHg, n=30, p=0.08 | No | Probably yes |
| BP drops 12 mmHg, n=500, p=0.02 | Yes | Yes |
Always look at the effect size (ARR, NNT, mean difference) alongside p-values and CIs. A narrow confidence interval that excludes zero or one is more meaningful than a p-value alone, because it tells you both the direction and the precision of the effect.
- Narrow CI → precise estimate → high confidence the true value is close to the point estimate (usually from a large sample)
- Wide CI → uncertain estimate → true value could be anywhere in a broad range (usually from a small or heterogeneous sample)
Example: RR = 1.5 (95% CI 1.4–1.6) → precise, convincing. RR = 1.5 (95% CI 0.6–3.8) → wide, uncertain, and crossing 1.0 so not even significant.
Regression to the Mean: The Hidden Confounder of Clinical Practice
Regression to the mean is the statistical tendency for an extreme measurement to be closer to the average on a second measurement, regardless of any intervention. It is one of the most under-recognised sources of misleading conclusions in medicine.
Patients are often investigated or treated precisely when their symptoms or measurements are at their worst. Natural variation means those measurements are likely to improve anyway on re-testing, not necessarily because of your intervention. Without a control group, it is impossible to distinguish regression to the mean from genuine treatment effect.
| Scenario | What Looks Like Treatment Effect | What May Actually Be Happening |
|---|---|---|
| Patient has very high BP on one reading → started on medication → BP lower at next visit | Drug is working | First reading may have been an outlier; BP would have been lower anyway on repeat measurement |
| Pupil scores very poorly on a test → gets extra tuition → scores better next time | Tuition helped | The poor score may have been unrepresentative; natural performance tends toward their average |
| Patient with severe flare of eczema starts a new cream → flare improves | Cream is effective | Severe flares naturally improve over time regardless of treatment |
- Take multiple baseline measurements and use the average before starting treatment
- Use a control group: regression to the mean affects both groups equally, so any difference between groups is more likely to be a real treatment effect
- This is one of the strongest arguments for RCTs over uncontrolled before-and-after studies
Statistical Graphs: What to Look For
The AKT regularly presents graphs and asks you to interpret them. Learn to spot the key feature in each graph type; do not try to read everything. One targeted observation is all that's needed.
| Graph Type | Primary Use | The One Thing to Look For |
|---|---|---|
| Forest Plot | Meta-analysis results | Does the diamond cross the vertical line of no effect? |
| Funnel Plot | Publication bias detection / GP outlier monitoring | Is the plot asymmetrical? (Gap = missing unpublished studies) |
| Cates Plot | Communicating NNT visually to patients | Count the coloured "benefit" faces; NNT = 100 ÷ those faces |
| L'Abbé Plot | Exploring heterogeneity in meta-analyses | Dot on the diagonal line = zero treatment effect |
| Box-and-Whisker Plot | Data distribution and spread | Middle line = median (not mean); dots beyond whiskers = outliers |
| Fagan's Nomogram | Pre-test → post-test probability | Draw a line from pre-test probability through LR to read post-test probability |
| Stem-and-Leaf Plot | Distribution that preserves original values | Like a histogram but shows individual data points |
| Kaplan-Meier Curve | Survival analysis (time to event) | Steps drop when events occur; curves that diverge early and stay apart suggest sustained treatment benefit |
| Histogram | Distribution of continuous data | Shape of curve: symmetrical = normal distribution; skewed = use median not mean |
Forest Plots: In Detail
- Each square = one study. The size of the square = the study's weight (usually driven by sample size)
- Horizontal lines = 95% confidence interval for that study
- The diamond at the bottom = the pooled overall estimate. Its width = the pooled 95% CI
- Vertical line at 1.0 = "line of no effect" (for ratios). If a CI line or the diamond crosses this line, that result is not statistically significant
I² measures how much variation between studies is due to true heterogeneity (real differences) rather than chance.
I² < 25% = low heterogeneity
I² = 25–50% = moderate heterogeneity
I² > 50% = substantial heterogeneity → pooled result is less reliable
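The text gives the thresholds but not where the number comes from. As a hedged sketch: the I² statistic is conventionally derived from Cochran's Q and its degrees of freedom (number of studies minus 1) as (Q − df) ÷ Q × 100, floored at zero. The function name and the Q values below are hypothetical, and you would not be asked to calculate this in the AKT.

```python
def i_squared(q_statistic, number_of_studies):
    """I-squared (as a percentage) from Cochran's Q, floored at zero."""
    degrees_of_freedom = number_of_studies - 1
    if q_statistic <= 0:
        return 0.0
    return max(0.0, (q_statistic - degrees_of_freedom) / q_statistic) * 100


# Hypothetical meta-analyses of 11 studies each
print(i_squared(20.0, 11))  # 50.0: half the variation is beyond chance
print(i_squared(8.0, 11))   # 0.0:  no more variation than chance predicts
```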
Funnel Plots: Publication Bias & GP Outlier Monitoring
Funnel plots appear in two entirely different contexts in the AKT, so make sure you recognise both.
In meta-analysis (publication bias): Plots effect size (x-axis) against study precision or size (y-axis). A symmetric inverted funnel = no bias. An asymmetric funnel with a gap at the bottom-left = missing small negative studies → publication bias.
In performance monitoring (GP outliers): Compares GP practices on metrics (e.g. referral rates, mortality). Data points outside the funnel lines are statistical outliers warranting investigation, not necessarily proof of poor performance.
Cates Plots: How to Extract NNT Visually
A Cates plot (sometimes called a "smiley face plot") uses a grid of 100 faces to help patients visualise absolute benefit and harm.
- Each of the 100 faces represents one person per 100 treated
- Yellow/green faces = people who benefited (events prevented)
- Red faces = people who experienced harm
- Grey/plain faces = no effect either way
Kaplan-Meier Survival Curves: How to Read Them
Kaplan-Meier (KM) curves show the probability of surviving (or remaining event-free) over time. They appear in AKT questions about cancer trials, cardiovascular studies, and any research tracking time to an event.
| Feature | What It Means |
|---|---|
| The Y-axis | Probability of survival (or event-free survival); starts at 1.0 (100%) and falls over time |
| The X-axis | Time (days, months, years) |
| Each downward step | An event has occurred (e.g. death, relapse). Larger steps = more events at that point |
| Tick marks on the line | Censored patients (lost to follow-up or study ended). Not events. |
| Two curves diverging early and staying apart | Suggests sustained treatment benefit throughout follow-up |
| Curves crossing | Suggests hazard is not proportional over time, which complicates interpretation |
| Median survival | The time point where the survival curve crosses 50%, i.e. the point where half the patients have had the event |
The log-rank test compares two KM curves statistically. The hazard ratio (HR) summarises the overall difference between curves. HR < 1 = treatment group has fewer events over time. If the curves overlap substantially, the HR will be close to 1 (no benefit).
Histograms: Distribution at a Glance
A histogram displays the distribution of a continuous variable (e.g. age, blood pressure, BMI) by grouping values into intervals (bins) and showing how many observations fall in each. Unlike a bar chart, the bars touch, because the data is continuous.
| Shape | Meaning | Use Mean or Median? |
|---|---|---|
| Symmetrical bell shape | Normal distribution: mean, median, mode all equal | Either, but conventionally mean ± SD |
| Right (positive) skew | Long tail to the right: a few very high values pulling mean up | Median (more representative) |
| Left (negative) skew | Long tail to the left: a few very low values pulling mean down | Median (more representative) |
| Bimodal (two peaks) | Two distinct subgroups in the data | Report both peaks separately |
Quality Improvement & Clinical Audit
| Tool | Purpose | Key Features |
|---|---|---|
| Clinical Audit | Measure practice against explicit standards; identify and close gaps | Requires a defined standard. Uses PDSA cycle. Involves closing the loop (re-audit). Not research: no hypothesis, no new knowledge generated. |
| Significant Event Analysis (SEA) | Systematic multidisciplinary review of a single significant event | Blame-free. Focuses on system learning, not individual fault. Shared within the team. Documents learning and action. |
| Root Cause Analysis (RCA) | In-depth investigation of serious incidents | More structured than SEA. Uses "5 Whys" technique. Identifies contributory and root causes. Often for never events or serious harm. |
| QOF Exception Reporting | Appropriate exclusion of patients from QOF indicators | Clinically appropriate for: maximal tolerated therapy, informed patient dissent, extreme frailty, or clinical contraindication. |
Clinical Audit vs Research: Key Differences
| Feature | Clinical Audit | Research |
|---|---|---|
| Purpose | Improve care by comparing with standards | Generate new knowledge |
| Hypothesis | None; compares against existing standard | Always has a hypothesis |
| Ethics approval | Usually not required | Usually required |
| Consent | Usually not required | Usually required |
| Randomisation | Never | May include RCTs |
| End result | Service improvement; action plan | New evidence; publication |
PDSA Cycle
The Plan-Do-Study-Act (PDSA) cycle is the continuous improvement framework used in clinical audit and quality improvement.
| Stage | What Happens |
|---|---|
| Plan | Define the question, set the standard, plan data collection |
| Do | Implement the change or measure current practice |
| Study | Analyse results; compare against the standard; identify gaps |
| Act | Implement improvements; plan re-audit to close the loop |
Clinical Calculations (AKT Formula Bank)
These formulas crop up in AKT calculation questions. Learn each one and know the clinical cut-offs that trigger action.
ABPI Worked Example
Scenario: Mrs T, 72, with leg pain on walking
Alcohol Unit Worked Examples
| Drink | Volume (ml) | ABV% | Calculation | Units |
|---|---|---|---|---|
| Large glass wine | 250ml | 13% | 250 × 13 ÷ 1000 | 3.25 |
| Bottle of wine | 750ml | 12% | 750 × 12 ÷ 1000 | 9.0 |
| Pint lager (strong) | 568ml | 5% | 568 × 5 ÷ 1000 | 2.84 |
| Single spirit measure | 25ml | 40% | 25 × 40 ÷ 1000 | 1.0 |
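The calculation column in the table is the same formula each time (volume in ml × ABV% ÷ 1000), so it can be wrapped in a one-line helper. The function name `alcohol_units` is just a label for this sketch.

```python
def alcohol_units(volume_ml, abv_percent):
    """UK alcohol units: volume (ml) x ABV (%) / 1000."""
    return volume_ml * abv_percent / 1000


# The four drinks from the table
print(alcohol_units(250, 13))  # 3.25 (large glass of wine)
print(alcohol_units(750, 12))  # 9.0  (bottle of wine)
print(alcohol_units(568, 5))   # 2.84 (pint of strong lager)
print(alcohol_units(25, 40))   # 1.0  (single spirit measure)
```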
Worked Examples
Example 1: Calculating All Risk Metrics from a Trial Table
An RCT randomises patients to receive either Drug X or placebo. After 3 years: 80 out of 500 placebo patients had a stroke; 40 out of 500 Drug X patients had a stroke.
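One way to work through this scenario is sketched below in Python. The helper `risk_metrics` is illustrative, not a standard function; it uses exact fractions so that the "always round NNT up" step is not thrown off by floating-point error, and it assumes the treated group has the lower event rate.

```python
from fractions import Fraction
from math import ceil


def risk_metrics(control_events, control_total, treated_events, treated_total):
    """CER, EER, ARR, RR, RRR and NNT from raw trial counts.

    Assumes the treated group has fewer events (ARR > 0).
    """
    cer = Fraction(control_events, control_total)   # control event rate
    eer = Fraction(treated_events, treated_total)   # experimental event rate
    arr = cer - eer                                 # absolute risk reduction
    return {
        "CER": float(cer),
        "EER": float(eer),
        "ARR": float(arr),
        "RR": float(eer / cer),
        "RRR": float(arr / cer),
        "NNT": ceil(1 / arr),                       # always rounded up
    }


# Placebo: 80 strokes in 500 patients; Drug X: 40 strokes in 500 patients
result = risk_metrics(80, 500, 40, 500)
print(result)
# {'CER': 0.16, 'EER': 0.08, 'ARR': 0.08, 'RR': 0.5, 'RRR': 0.5, 'NNT': 13}
```

So Drug X halves the relative risk of stroke (RRR 50%), the absolute benefit is 8%, and 1 ÷ 0.08 = 12.5 rounds up to an NNT of 13 over 3 years.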
Example 2: Constructing a 2×2 Table and Calculating Sensitivity & Specificity
A new test is applied to 200 patients, 100 of whom have the disease. The test correctly identifies 85 of the 100 with disease, and correctly identifies 80 of the 100 without disease.
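Filling in the 2×2 table for this scenario gives TP = 85, FN = 15, TN = 80, FP = 20. The sketch below (with an illustrative function name) applies the four standard formulas to those cells.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV and NPV from the four cells of a 2x2 table."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# 100 with disease: 85 detected, 15 missed. 100 without: 80 cleared, 20 false alarms.
metrics = diagnostic_metrics(tp=85, fp=20, fn=15, tn=80)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
# sensitivity: 85.0%
# specificity: 80.0%
# ppv: 81.0%
# npv: 84.2%
```

Note that the PPV and NPV here only hold for this sample, where half the patients have the disease; in a lower-prevalence population the PPV would be lower.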
Example 3: Paediatric Dose Calculation
A child requires 300mg of amoxicillin. The available suspension is 250mg/5ml. What volume should be given?
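The question above is the classic "what you want ÷ what you've got × the volume it's in" calculation. A minimal sketch, with an illustrative function name:

```python
def volume_to_give_ml(dose_mg, strength_mg, strength_volume_ml):
    """Volume of liquid needed: dose wanted x stock volume / stock strength."""
    return dose_mg * strength_volume_ml / strength_mg


# 300 mg needed from a 250 mg / 5 ml suspension
print(volume_to_give_ml(300, 250, 5))  # 6.0 ml
```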
Formulas Cheat Sheet & Memory Aids
Risk & Treatment Formulas
Diagnostic Testing
Population & Clinical Formulas
- SnNout: Sensitive test → Negative rules OUT disease (screening)
- SpPin: Specific test → Positive rules IN disease (confirmation)
- CI crosses 1.0 (ratio) or 0 (difference) → not significant
- Forest plot diamond crosses line → not significant
- Funnel asymmetry → publication bias (or GP outlier if performance context)
- Box plot middle line → MEDIAN (not mean)
- 68-95-99.7 → ±1SD, ±2SD, ±3SD in normal distribution
- I² >50% → substantial heterogeneity
- OR from case-control | RR from cohort | Prevalence from cross-sectional
- Audit ≠ Research (audit measures vs standard; research generates new knowledge)
- Case-control → OR (Odds Ratio) → looking BACK at exposures
- Cohort → RR (Relative Risk) → going FORWARD from exposure
- Cross-sectional → Prevalence → SNAPSHOT in time
- RCT / SR → Gold standard for treatment questions
Trainer & Teaching Pearls
- Trainees often learn NNT as a formula without understanding what it actually means clinically; ensure they can explain it in a sentence a patient would understand
- The distinction between sensitivity/specificity (test properties) and PPV/NPV (influenced by prevalence) is poorly understood; the pregnancy test analogy works well
- Many trainees know what a forest plot looks like but cannot explain what to look at; focus on the diamond and whether it crosses the line of no effect
- Lead time bias is frequently confused with length bias; use diagrams or timelines to illustrate the distinction
- Trainees routinely confuse clinical audit with research; use the RCGP's own examples from assessments
- "Here's the summary table from a drug trial โ can you tell me the NNT and whether you'd prescribe it for this patient?"
- "Look at this forest plot. Is the intervention effective? How confident are you in that answer?"
- "Your patient has a positive FIT test. The PPV in a low-risk population is about 3%. How do you explain this to him?"
- "We're seeing a lot of high PSA results. What's the issue with using PSA as a screening test?"
- "A colleague wants to compare our sepsis outcomes with the Trust's. Is that audit or research? Does it need ethics?"
- "How would you explain a 1-in-50 chance of a side effect to a patient who asks 'Is it safe?'"
- How do you currently explain risk to patients? Do you use absolute or relative figures?
- Can you recall a recent clinical guideline that cited a significant NNT? What was it, and how did it influence your practice?
- Have you ever ordered a test and not known its sensitivity/specificity? How did you interpret the result?
- Why might a drug company choose to present their trial results as RRR rather than ARR?
AKT High-Yield Tips
These are the patterns that repeatedly appear in AKT papers. Memorise these and you will score marks.
Always convert percentages to decimals first. ARR = 5% → NNT = 1 ÷ 0.05 = 20. Always round up to the nearest whole number. A lower NNT = more effective treatment.
For ratios (RR, OR, HR): CI crosses 1.0 = not significant. For differences: CI crosses 0 = not significant. This comes up in nearly every forest plot question.
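The rule above reduces to one question: does the interval contain the "no effect" value? A hedged sketch follows; the function name `ci_excludes_null` and the example intervals are made up for illustration and do not come from any real trial.

```python
def ci_excludes_null(lower, upper, measure="ratio"):
    """True if the CI excludes the no-effect value (1.0 for ratios, 0 for differences)."""
    null_value = {"ratio": 1.0, "difference": 0.0}[measure]
    # An interval that merely touches the null value still includes it.
    return not (lower <= null_value <= upper)

print(ci_excludes_null(0.65, 0.98))                       # True:  RR CI stays below 1.0
print(ci_excludes_null(0.65, 1.05))                       # False: RR CI crosses 1.0
print(ci_excludes_null(0.70, 1.00))                       # False: touching 1.0 counts as including it
print(ci_excludes_null(-4.5, 0.5, measure="difference"))  # False: difference CI crosses 0
```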
Rare disease → case-control → OR. New exposure going forward → cohort → RR. Prevalence snapshot → cross-sectional. Best evidence for treatment → RCT or SR/meta-analysis.
Screening test → want high sensitivity (don't want to miss cases: SnNout). Confirmatory test → want high specificity (don't want false positives: SpPin).
Even a highly specific test (99%) gives poor PPV in a low-prevalence setting. Most positive results in population screening are false positives. This is why we don't screen everyone for everything.
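To see how prevalence drives PPV, here is a rough sketch using the 99% specificity quoted above. The 90% sensitivity and the three prevalence values are assumptions picked for illustration, and the function name `ppv` is hypothetical.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from test properties and disease prevalence."""
    true_pos = sensitivity * prevalence               # diseased AND test positive
    false_pos = (1 - specificity) * (1 - prevalence)  # healthy AND test positive
    return true_pos / (true_pos + false_pos)

for prevalence in (0.001, 0.01, 0.10):
    result = ppv(sensitivity=0.90, specificity=0.99, prevalence=prevalence)
    print(f"prevalence {prevalence:.1%}: PPV {result:.0%}")
# prevalence 0.1%: PPV 8%
# prevalence 1.0%: PPV 48%
# prevalence 10.0%: PPV 91%
```

With the same test, fewer than 1 in 10 positives are true positives at 0.1% prevalence, while about 9 in 10 are at 10% prevalence.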
If the diamond (pooled estimate) crosses the vertical line of no effect → overall result is NOT statistically significant. If I² > 50% → substantial heterogeneity → pooled result is less reliable.
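A gap in the bottom-left of a funnel plot suggests publication bias: small negative studies were not published. (Which side the gap falls on depends on how the axis is drawn; look for the bottom corner where small studies showing no benefit are missing.) This inflates the apparent effect of a treatment in the meta-analysis.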
The line inside the box is the median. The box = IQR (middle 50%). Dots or circles beyond the whiskers = outliers.
Pharmaceutical companies love quoting RRR because it sounds bigger. A 50% RRR sounds amazing, until you know the baseline risk was only 2% (→ ARR = 1%, NNT = 100). Always ask: what was the baseline risk?
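A short sketch of why baseline risk matters: the same 50% RRR is applied to three baseline risks. The 2% figure is the one quoted above; 20% and 0.2% are added purely for contrast, and the function name `arr_and_nnt` is illustrative.

```python
import math

def arr_and_nnt(baseline_risk, rrr):
    """Absolute risk reduction and NNT implied by a relative risk reduction."""
    arr = baseline_risk * rrr             # ARR = CER x RRR
    nnt = math.ceil(round(1 / arr, 9))    # round up; inner round() guards against float noise
    return arr, nnt

for baseline in (0.20, 0.02, 0.002):
    arr, nnt = arr_and_nnt(baseline, rrr=0.50)
    print(f"baseline {baseline:.1%}: ARR {arr:.1%}, NNT {nnt}")
# baseline 20.0%: ARR 10.0%, NNT 10
# baseline 2.0%: ARR 1.0%, NNT 100
# baseline 0.2%: ARR 0.1%, NNT 1000
```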
Skewed distributions (income, hospital stay, serum bilirubin) → use median not mean. The mean is pulled by outliers; the median is not.
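A quick illustration with made-up lengths of hospital stay (the data are invented for the example): one long-stay patient drags the mean well above what is typical, while the median barely notices.

```python
import statistics

# Hypothetical lengths of stay in days; one outlier at 30 days.
stays = [1, 2, 2, 3, 3, 4, 30]

mean_stay = statistics.mean(stays)
median_stay = statistics.median(stays)
print(f"mean {mean_stay:.1f} days, median {median_stay} days")
# mean 6.4 days, median 3 days
```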
Audit: measures against an existing standard; no ethics needed; no hypothesis. Research: generates new knowledge; needs ethics approval; has a hypothesis. A key distinction the AKT tests repeatedly.
ITT analysis includes all randomised participants, analysed in the group they were allocated to, regardless of adherence. This gives a conservative estimate of effectiveness that is more realistic for clinical practice. Per-protocol analysis tends to overestimate the effect.
ABPI < 0.9 = peripheral arterial disease. ABPI > 1.3 = non-compressible (calcified) vessels → unreliable result. Compression bandaging is contraindicated if ABPI < 0.8 (check with your compression guidelines).
In a Cates plot of 100 faces, count the benefit faces (usually yellow or green). NNT = 100 ÷ (number of benefit faces). 5 yellow faces → NNT = 20.
Common Mistakes & Trainee Traps
These are the errors that appear repeatedly across AKT marking schemes. Every one of these is a real mark lost by real candidates.
- Forgetting to convert percentages to decimals before calculating NNT (e.g., ARR = 5% → must use 0.05, not 5, to get NNT = 20, not 0.2)
- Rounding NNT down rather than up (NNT = 12.5 → answer is 13, not 12)
- Confusing RRR with ARR and quoting the more impressive-sounding relative figure as the clinical benefit
- Saying a result is "significant" when the CI just touches 1.0; it must not include 1.0 to be significant
- Stating the box-and-whisker plot middle line is the mean; it is always the median
- Confusing sensitivity with PPV: sensitivity is a fixed property of the test; PPV depends on prevalence
- Thinking a highly specific test in a low-prevalence population will give a reliable positive result; it won't (low PPV)
- Confusing an OR with an RR: an OR only approximates the RR when the outcome is rare, and exaggerates it when the outcome is common
- Saying a case-control study generates RR; it generates OR, because you start with cases and controls, not an exposed cohort
- Confusing clinical audit with research, e.g. claiming an audit needs ethical approval
- Misinterpreting lead time bias as meaning a screening programme genuinely improves survival
- Forgetting that I² >50% in a forest plot raises concerns about the validity of the pooled result
- Using the mean to describe skewed data (e.g. income, hospital stay length); use the median
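The OR versus RR trap in the list above is easier to remember with numbers. This sketch uses two invented cohorts (all counts are hypothetical) with the same relative risk, one with a rare outcome and one with a common outcome; the function name `rr_and_or` is illustrative.

```python
def rr_and_or(a, b, c, d):
    """a/b = exposed with/without outcome; c/d = unexposed with/without outcome."""
    relative_risk = (a / (a + b)) / (c / (c + d))
    odds_ratio = (a / b) / (c / d)
    return relative_risk, odds_ratio

rare = rr_and_or(a=2, b=98, c=1, d=99)       # outcome in 2% vs 1%
common = rr_and_or(a=60, b=40, c=30, d=70)   # outcome in 60% vs 30%

print(f"rare outcome:   RR {rare[0]:.2f}, OR {rare[1]:.2f}")
print(f"common outcome: RR {common[0]:.2f}, OR {common[1]:.2f}")
# rare outcome:   RR 2.00, OR 2.02
# common outcome: RR 2.00, OR 3.50
```

Both cohorts have a doubling of risk, but only in the rare-outcome cohort is the OR a fair stand-in for the RR.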
Final Take-Home Points
- NNT = 1 ÷ ARR. Always convert percentages to decimals. Always round up. Lower NNT = better treatment.
- ARR is clinically honest. RRR sounds impressive but can mislead. Always pair RRR with baseline risk.
- Sensitivity and specificity are fixed properties of a test. PPV and NPV change with disease prevalence.
- SnNout: sensitive tests rule OUT when negative. SpPin: specific tests rule IN when positive.
- Forest plot diamond crosses the line of no effect → result not statistically significant. I² >50% → heterogeneity concerns.
- Funnel plot asymmetry → publication bias. Points outside funnel limits in performance monitoring → outlier practices.
- CI for a ratio that includes 1.0 → not significant. CI for a difference that includes 0 → not significant.
- Case-control → OR. Cohort → RR. Cross-sectional → prevalence. RCT/SR → gold standard for treatment.
- Skewed data → use median, not mean. Box plot middle line = median. Dots beyond whiskers = outliers.
- Clinical audit measures against standards: no ethics needed, no hypothesis. Research generates new knowledge: ethics required.
Statistics questions in the AKT are among the most reliably learnable marks in the paper. A few hours with this page and a handful of practice questions will repay the time many times over.