Professor of Genetic Epidemiology and
Statistical Genetics, University of Edinburgh
also Honorary Consultant in Public Health, NHS Lothian and (since April 2020) Public Health Scotland.
This is my personal web page. Larger files, including software packages and public datasets, can be found on my research group’s home page on my server.
For the Usher Institute, information about research is on this page, information about taught postgraduate courses is on this page, and information about PhDs is on this page. I am a listed supervisor for the UKRI Centre for Doctoral Training in Biomedical Artificial Intelligence, which offers a four-year training programme: the first year is spent working for a Master’s degree through taught courses and projects, followed by a three-year PhD programme.
In the Usher Institute I am a member of the Molecular Epidemiology Group. I am a member of the Executive Group of the Centre for Statistics based in the School of Mathematics.
Since 2019 I have been on the Medical Research Council’s Methodology Research Programme Panel. Since June 2020 I have been on a UKRI MRC COVID-19 Research & Innovation Panel reviewing applications submitted under the COVID-19 Rapid Response Rolling Call.
Usher Institute of Population Health
Sciences and Informatics
University of Edinburgh Medical School,
Teviot Place, Edinburgh EH8 9AG
Phone +44 131 650 4556
PGP public key associated with this email address:
1024D/B2D6769A 2013-07-11 Paul McKeigue <email@example.com> Fingerprint=683E 7E3B E8B3 83BB 8F80 363A A034 3F3B B2D6 769A
For secure email, the most convenient method is to use a ProtonMail account to send messages to my ProtonMail address, shown below.
Alternatively, you can use the PGP public key associated with this ProtonMail address:
2048R/E082D1A7 2019-01-31 "firstname.lastname@example.org" <email@example.com> Fingerprint=53CB E853 7843 8894 638D E226 41D0 FAC2 E082 D1A7
These PGP public keys can be imported from keyservers such as http://keys.gnupg.net/
If you need to transfer large data files securely, I can set up an SFTP account for you on my server. You will need to use SSH public key authentication. Instructions for setting this up on a Windows PC are here.
My research focuses on methods for molecular and genetic epidemiology, with applications in clinical prediction and personalized medicine. These methods make use of Bayesian and computationally-intensive statistical methods, and machine learning methods for constructing predictors. I work closely with Helen Colhoun’s research group at the Centre for Genomic and Experimental Medicine. This collaboration includes the development of an analysis platform based on deidentified electronic health records and the use of this platform to study drug safety and complications of diabetes.
My group’s current research includes
Development of biomarker-based predictions of diabetic complications in the SUMMIT European consortium and the Scottish Diabetes Research Network Type 1 Bioresource
Development of a platform (GENOSCORES) for constructing genotypic predictors from summary results of genome-wide association studies
The relation of C-peptide persistence in Type 1 diabetes to the genetic architecture of Type 1 and Type 2 diabetes
Development of statistical methods for learning to classify with high-dimensional biomarker panels and quantifying the incremental contribution of each biomarker or panel to predictive performance.
Risk stratification for colorectal cancer (with Evropi Theodoratou). I am also planning a new project to study the relationship of colorectal cancer risk to the colonic microbiome profile, based on establishing a consented bioresource of faecal samples from the Scottish bowel screening programme
A list of my research grants is here. In addition, Helen Colhoun and I have recently been awarded 1.4 million euros from the EU for the University of Edinburgh as a partner in the Hypo-RESOLVE project studying hypoglycaemia and its impact in diabetes.
Associations of severe COVID-19 with polypharmacy in the REACT-SCOT case-control study Paul M McKeigue, Sharon Kennedy, Amanda Weir, Jen Bishop, Stuart J McGurnaghan, David McAllister, Chris Robertson, Rachael Wood, Nazir Lone, Janet Murray, Thomas M Caparrotta, Alison Smith-Palmer, David Goldberg, Jim McMenamin, Colin Ramsay, Bruce Guthrie, Sharon Hutchinson, Helen M Colhoun. medRxiv 2020.07.23.20160747; doi: https://doi.org/10.1101/2020.07.23.20160747
McGurnaghan, Stuart J and Weir, Amanda and Bishop, Jen and Kennedy, Sharon and Blackbourn, Luke AK and Hutchinson, Sharon and Caparrotta, Thomas M and Mellor, Joseph and Jeyam, Anita and O’Reilly, Joseph E and Wild, Sarah and Hatam, Sara and Höhn, Andreas and Colombo, Marco and Robertson, Chris and Lone, Nazir I. and Murray, Janet and Butterly, Elaine and Petrie, John and Kennon, Brian and McCrimmon, Rory and Lindsay, Robert and Pearson, Ewan and Sattar, Naveed and McKnight, John and Samuel, Ashirwad Philip and Collier, Andrew and McMenamin, Jim and Smith-Palmer, Alison and Goldberg, David and McKeigue, Paul M and Colhoun, Helen M and Group, Public Health Scotland COVID-19 Health Protection Study and Group, Scottish Diabetes Research Network Epidemiology, COVID-19 Disease in People with Diabetes in Scotland: Incidence, Severity and Risk Stratification Using Matched Case-Control and Prospective Cohort Studies (6/29/2020). Available at SSRN: https://ssrn.com/abstract=3640560 or http://dx.doi.org/10.2139/ssrn.3640560
Rapid Epidemiological Analysis of Comorbidities and Treatments as risk factors for COVID-19 in Scotland (REACT-SCOT): a population-based case-control study Paul M McKeigue, Amanda Weir, Jen Bishop, Stuart McGurnaghan, Sharon Kennedy, David McAllister, Chris Robertson, Rachael Wood, Nazir Lone, Janet Murray, Thomas Caparrotta, Alison Smith-Palmer, David Goldberg, Jim McMenamin, Colin Ramsay, Sharon Hutchinson, Helen M Colhoun. medRxiv 2020.05.28.20115394; doi: https://doi.org/10.1101/2020.05.28.20115394 – now accepted by PLoS Medicine
Evaluation of "stratify and shield" as a policy option for ending the COVID-19 lockdown in the UK. Paul M. McKeigue, Helen M Colhoun medRxiv 2020.04.25.20079913; doi: https://doi.org/10.1101/2020.04.25.20079913
Quantifying performance of a diagnostic test as the expected information for discrimination: relation to the C-statistic (Statistical Methods for Medical Research 2018). This paper proposes that the expected information for discrimination (expected weight of evidence) should supplant the C-statistic (area under the ROC curve) for quantifying the performance of a diagnostic test or risk predictor, and for evaluating the incremental contribution of a new biomarker. This page demonstrates the statistical methods on publicly available datasets, and links to R scripts for running these analyses. This paper has had wide media coverage because it draws on unpublished results obtained by Alan Turing while working on the Banburismus procedure at Bletchley Park, later described and extended by his assistant Jack Good. This poster describes the ideas.
Other publications for which the final accepted manuscript is available under open access can be found by searching Edinburgh Research Explorer.
These pages include slide presentations, together with tutorials and notes prepared when advising colleagues, students and other researchers
Statistical modelling of risk factors for hypoglycemic attacks - HypoRESOLVE meeting, Nijmegen (May 2019)
Quantifying predictive performance using the distributions of weight of evidence
Methodology and misunderstandings in precision medicine
Methods for protection of privacy in the 2015 Charter for Safe Havens in Scotland for handling unconsented data from NHS patient records: a critical look – Dealing With Data conference 2017
Stratified medicine as a statistical modelling problem: learning finite mixture models for disease subtyping – Centre for Statistics launch 2017
Using GWAS summary statistics to construct polygenic scores for hypothesis testing and prediction
Statistical methods for learning to classify with biomarker panels: insights from cryptanalysis
Weighing evidence in an information war – Ethics Forum, Transatlantic Seminar series, School of Social and Political Sciences 2019
O'Reilly JE, Blackbourn LAK, Caparrotta TM, Jeyam A, Kennon B, Leese GP, Lindsay RS, McCrimmon RJ, McGurnaghan SJ, McKeigue PM, McKnight JA, Petrie JR, Philip S, Sattar N, Wild SH, Colhoun HM; Scottish Diabetes Research Network Epidemiology Group. Time trends in deaths before age 50 years in people with type 1 diabetes: a nationwide analysis from Scotland 2004-2017. Diabetologia. 2020 Aug;63(8):1626-1636. doi: 10.1007/s00125-020-05173-w. Epub 2020 May 26. PMID: 32451572; PMCID: PMC7351819.
Caparrotta TM, Blackbourn LAK, McGurnaghan SJ, Chalmers J, Lindsay R, McCrimmon R, McKnight J, Wild S, Petrie JR, Philip S, McKeigue PM, Webb DJ, Sattar N, Colhoun HM; Scottish Diabetes Research Network–Epidemiology Group. Prescribing Paradigm Shift? Applying the 2019 European Society of Cardiology-Led Guidelines on Diabetes, Prediabetes, and Cardiovascular Disease to Assess Eligibility for Sodium-Glucose Cotransporter 2 Inhibitors or Glucagon-Like Peptide 1 Receptor Agonists as First-Line Monotherapy (or Add-on to Metformin Monotherapy) in Type 2 Diabetes in Scotland. Diabetes Care. 2020 Jun 24:dc200120. doi: 10.2337/dc20-0120. Epub ahead of print. PMID: 32581068.
Li X, Timofeeva M, Spiliopoulou A, McKeigue P, He Y, Zhang X, Svinti V, Campbell H, Houlston RS, Tomlinson IPM, Farrington SM, Dunlop MG, Theodoratou E. Prediction of colorectal cancer risk based on profiling with common genetic variants. Int J Cancer. 2020 Jul 7. doi: 10.1002/ijc.33191. Epub ahead of print. PMID: 32638365.
Cherlin S, Lewis MJ, Plant D, Nair N, Goldmann K, Tzanis E, Barnes MR, McKeigue P, Barrett JH, Pitzalis C, Barton A; MATURA Consortium, Cordell HJ. Investigation of genetically regulated gene expression and response to treatment in rheumatoid arthritis highlights an association between IL18RAP expression and treatment response. Ann Rheum Dis. 2020 Jul 30:annrheumdis-2020-217204. doi: 10.1136/annrheumdis-2020-217204. Epub ahead of print. PMID: 32732242.
Jeyam A, McGurnaghan SJ, Blackbourn LAK, McKnight JM, Green F, Collier A, McKeigue PM, Colhoun HM; SDRNT1BIO Investigators. Diabetic Neuropathy Is a Substantial Burden in People With Type 1 Diabetes and Is Strongly Associated With Socioeconomic Disadvantage: A Population-Representative Study From Scotland. Diabetes Care. 2020 Jan 23:dc191582. doi: 10.2337/dc19-1582. Epub ahead of print. PMID: 31974100.
Colombo M, McGurnaghan SJ, Blackbourn LAK, Dalton RN, Dunger D, Bell S, Petrie JR, Green F, MacRury S, McKnight JA, Chalmers J, Collier A, McKeigue PM, Colhoun HM; Scottish Diabetes Research Network (SDRN) Type 1 Bioresource Investigators. Comparison of serum and urinary biomarker panels with albumin/creatinine ratio in the prediction of renal function decline in type 1 diabetes. Diabetologia. 2020 Jan 8:10.1007/s00125-019-05081-8. doi: 10.1007/s00125-019-05081-8. Epub ahead of print. PMID: 31915892.
Colombo M, McGurnaghan SJ, Bell S, MacKenzie F, Patrick AW, Petrie JR, McKnight JA, MacRury S, Traynor J, Metcalfe W, McKeigue PM, Colhoun HM; Scottish Diabetes Research Network (SDRN) Type 1 Bioresource Investigators and the Scottish Renal Registry. Predicting renal disease progression in a large contemporary cohort with type 1 diabetes mellitus. Diabetologia. 2020 Mar;63(3):636-647. doi: 10.1007/s00125-019-05052-z. Epub 2019 Dec 5. PMID: 31807796.
Li X, Meng X, He Y, Spiliopoulou A, Timofeeva M, Wei WQ, Gifford A, Yang T, Varley T, Tzoulaki I, Joshi P, Denny JC, Mckeigue P, Campbell H, Theodoratou E. Genetically determined serum urate levels and cardiovascular and other diseases in UK Biobank cohort: A phenome-wide mendelian randomization study. PLoS Med. 2019 Oct 18;16(10):e1002937. doi: 10.1371/journal.pmed.1002937. PMID: 31626644; PMCID: PMC6799886.
Ochs A, McGurnaghan S, Black MW, Leese GP, Philip S, Sattar N, Styles C, Wild SH, McKeigue PM, Colhoun HM; Scottish Diabetes Research Network Epidemiology Group and the Diabetic Retinopathy Screening Collaborative. Use of personalised risk-based screening schedules to optimise workload and sojourn time in screening programmes for diabetic retinopathy: A retrospective cohort study. PLoS Med. 2019 Oct 17;16(10):e1002945. doi: 10.1371/journal.pmed.1002945. PMID: 31622334; PMCID: PMC6797087.
Meng X, Li X, Timofeeva MN, He Y, Spiliopoulou A, Wei WQ, Gifford A, Wu H, Varley T, Joshi P, Denny JC, Farrington SM, Zgaga L, Dunlop MG, McKeigue P, Campbell H, Theodoratou E. Phenome-wide Mendelian-randomization study of genetically determined vitamin D on multiple health outcomes using the UK Biobank study. Int J Epidemiol. 2019 Oct 1;48(5):1425-1434. doi: 10.1093/ije/dyz182. PMID: 31518429; PMCID: PMC6857754.
Salem RM, Todd JN, Sandholm N, Cole JB, Chen WM, Andrews D, Pezzolesi MG, McKeigue PM, Hiraki LT, Qiu C, Nair V, Di Liao C, Cao JJ, Valo E, Onengut-Gumuscu S, Smiles AM, McGurnaghan SJ, Haukka JK, Harjutsalo V, Brennan EP, van Zuydam N, Ahlqvist E, Doyle R, Ahluwalia TS, Lajer M, Hughes MF, Park J, Skupien J, Spiliopoulou A, Liu A, Menon R, Boustany-Kari CM, Kang HM, Nelson RG, Klein R, Klein BE, Lee KE, Gao X, Mauer M, Maestroni S, Caramori ML, de Boer IH, Miller RG, Guo J, Boright AP, Tregouet D, Gyorgy B, Snell-Bergeon JK, Maahs DM, Bull SB, Canty AJ, Palmer CNA, Stechemesser L, Paulweber B, Weitgasser R, Sokolovska J, Rovīte V, Pīrāgs V, Prakapiene E, Radzeviciene L, Verkauskiene R, Panduru NM, Groop LC, McCarthy MI, Gu HF, Möllsten A, Falhammar H, Brismar K, Martin F, Rossing P, Costacou T, Zerbini G, Marre M, Hadjadj S, McKnight AJ, Forsblom C, McKay G, Godson C, Maxwell AP, Kretzler M, Susztak K, Colhoun HM, Krolewski A, Paterson AD, Groop PH, Rich SS, Hirschhorn JN, Florez JC; SUMMIT Consortium, DCCT/EDIC Research Group, GENIE Consortium. Genome-Wide Association Study of Diabetic Kidney Disease Highlights Biology Involved in Glomerular Basement Membrane Collagen. J Am Soc Nephrol. 2019 Oct;30(10):2000-2016. doi: 10.1681/ASN.2019030218. Epub 2019 Sep 19. PMID: 31537649; PMCID: PMC6779358.
McKeigue PM, Spiliopoulou A, McGurnaghan S, Colombo M, Blackbourn L, McDonald TJ, Onengut-Gomuscu S, Rich SS, A Palmer CN, McKnight JA, J Strachan MW, Patrick AW, Chalmers J, Lindsay RS, Petrie JR, Thekkepat S, Collier A, MacRury S, Colhoun HM. Persistent C-peptide secretion in Type 1 diabetes and its relationship to the genetic architecture of diabetes. BMC Med. 2019 Aug 23;17(1):165. doi: 10.1186/s12916-019-1392-8. PubMed PMID: 31438962.
Floyd JS, Bloch KM, Brody JA, Maroteau C, Siddiqui MK, Gregory R, Carr DF, Molokhia M, Liu X, Bis JC, Ahmed A, Liu X, Hallberg P, Yue QY, Magnusson PKE, Brisson D, Wiggins KL, Morrison AC, Khoury E, McKeigue P, Stricker BH, Lapeyre-Mestre M, Heckbert SR, Gallagher AM, Chinoy H, Gibbs RA, Bondon-Guitton E, Tracy R, Boerwinkle E, Gaudet D, Conforti A, van Staa T, Sitlani CM, Rice KM, Maitland-van der Zee AH, Wadelius M, Morris AP, Pirmohamed M, Palmer CAN, Psaty BM, Alfirevic A; PREDICTION-ADR Consortium and EUDRAGENE. Pharmacogenomics of statin-related myopathy: Meta-analysis of rare variants from whole-exome sequencing. PLoS One. 2019 Jun 26;14(6):e0218115. doi: 10.1371/journal.pone.0218115. eCollection 2019. PubMed PMID: 31242253; PubMed Central PMCID: PMC6594672.
Colombo M, Valo E, McGurnaghan SJ, Sandholm N, Blackbourn LAK, Dalton RN, Dunger D, Groop PH, McKeigue PM, Forsblom C, Colhoun HM; FinnDiane Study Group and the Scottish Diabetes Research Network (SDRN) Type 1 Bioresource Collaboration. Biomarker panels associated with progression of renal disease in type 1 diabetes. Diabetologia. 2019 Jun 20. doi: 10.1007/s00125-019-4915-0. [Epub]
Mair C, Wulaningsih W, Jeyam A, McGurnaghan S, Blackbourn L, Kennon B, Leese G, Lindsay R, McCrimmon RJ, McKnight J, Petrie JR, Sattar N, Wild SH, Conway N, Craigie I, Robertson K, Bath L, McKeigue PM, Colhoun HM; Scottish Diabetes Research Network (SDRN) Epidemiology Group. Glycaemic control trends in people with type 1 diabetes in Scotland 2004-2016. Diabetologia. 2019 Aug;62(8):1375-1384. doi: 10.1007/s00125-019-4900-7. Epub 2019 May 18. PubMed PMID: 31104095.
Spiliopoulou A, Colombo M, Plant D, Nair N, Cui J, Coenen MJ, Ikari K, Yamanaka H, Saevarsdottir S, Padyukov L, Bridges SL Jr, Kimberly RP, Okada Y, van Riel PLC, Wolbink G, van der Horst-Bruinsma IE, de Vries N, Tak PP, Ohmura K, Canhão H, Guchelaar HJ, Huizinga TW, Criswell LA, Raychaudhuri S, Weinblatt ME, Wilson AG, Mariette X, Isaacs JD, Morgan AW, Pitzalis C, Barton A, McKeigue P. Association of response to TNF inhibitors in rheumatoid arthritis with quantitative trait loci for CD40 and CD39. Ann Rheum Dis. 2019 Apr 29. doi: 10.1136/annrheumdis-2018-214877. [Epub ahead of print] PubMed PMID: 31036624.
Hensor EMA, McKeigue P, Ling SF, Colombo M, Barrett JH, Nam JL, Freeston J, Buch MH, Spiliopoulou A, Agakov F, Kelly S, Lewis MJ, Verstappen SMM, MacGregor AJ, Viatte S, Barton A, Pitzalis C, Emery P, Conaghan PG, Morgan AW. Validity of a two-component imaging-derived disease activity score for improved assessment of synovitis in early rheumatoid arthritis. Rheumatology (Oxford). 2019 Mar 1. doi: 10.1093/rheumatology/kez049. [Epub ahead of print] PubMed PMID: 30824919.
McGurnaghan SJ, Brierley L, Caparrotta TM, McKeigue PM, Blackbourn LAK, Wild SH, Leese GP, McCrimmon RJ, McKnight JA, Pearson ER, Petrie JR, Sattar N, Colhoun HM; Scottish Diabetes Research Network Epidemiology Group. The effect of dapagliflozin on glycaemic control and other cardiovascular disease risk factors in type 2 diabetes mellitus: a real-world observational study. Diabetologia. 2019 Apr;62(4):621-632. doi: 10.1007/s00125-018-4806-9. Epub 2019 Jan 10. PubMed PMID: 30631892.
Colombo M, Looker HC, Farran B, Hess S, Groop L, Palmer CNA, Brosnan MJ, Dalton RN, Wong M, Turner C, Ahlqvist E, Dunger D, Agakov F, Durrington P, Livingstone S, Betteridge J, McKeigue PM, Colhoun HM; SUMMIT Investigators. Serum kidney injury molecule 1 and β(2)-microglobulin perform as well as larger biomarker panels for prediction of rapid decline in renal function in type 2 diabetes. Diabetologia. 2018 Oct 5. doi: 10.1007/s00125-018-4741-9. [Epub ahead of print] PubMed PMID: 30288572.
Cherlin S, Plant D, Taylor JC, Colombo M, Spiliopoulou A, Tzanis E, Morgan AW, Barnes MR, McKeigue P, Barrett JH, Pitzalis C, Barton A, MATURA Consortium, Cordell HJ. Prediction of treatment response in rheumatoid arthritis patients using genome-wide SNP data. Genetic Epidemiology 2018: Dec;42(8):754-771. doi: 10.1002/gepi.22159
Massey J, Plant D, Hyrich K, Morgan AW, Wilson AG, Spiliopoulou A, Colombo M, McKeigue P, Isaacs J, Cordell H, Pitzalis C, Barton A; BRAGGSS, MATURA Consortium. Genome-wide association study of response to tumour necrosis factor inhibitor therapy in rheumatoid arthritis. Pharmacogenomics J. 2018 Aug 31. doi: 10.1038/s41397-018-0040-6. [Epub ahead of print] PubMed PMID: 30166627.
McKeigue P. Quantifying performance of a diagnostic test as the expected information for discrimination: Relation to the C-statistic. Stat Methods Med Res. 2018 Jan 1:962280218776989. doi: 10.1177/0962280218776989. [Epub ahead of print] PubMed PMID: 29978758.
Colombo M, Looker HC, Farran B, Agakov F, Brosnan MJ, Welsh P, Sattar N, Livingston SJ, Durrington PN, Betteridge DJ, McKeigue PM, Colhoun HM. Apolipoprotein CIII and N-Terminal Prohormone B-type Natriuretic Peptide as Independent Predictors for Cardiovascular Disease in Type 2 Diabetes. Atherosclerosis 2018, May 21;274:182-190
van Zuydam NR, Ahlqvist E, Sandholm N, Deshmukh H, Rayner NW, Abdalla M, Ladenvall C, Ziemek D, Fauman E, Robertson NR, McKeigue PM, Valo E, Forsblom C, Harjutsalo V; FINNDIANE Study centres, Perna A, Rurali E, Marcovecchio ML, Igo RP Jr, Salem RM, Perico N, Lajer M, Käräjämäki A, Imamura M, Kubo M, Takahashi A, Sim X, Liu J, van Dam RM, Jiang G, Tam CHT, Luk AOY, Lee HM, Lim CKP, Szeto CC, So WY, Chan JCN; Hong Kong Diabetes Registry TRS Project Group, Ang SF, Dorajoo R, Wang L, Hua Clara TS, McKnight AJ, Duffy S; Warren 3/UK GoKinD Study Group, Pezzolesi MG, Consortium G, Marre M, Gyorgy B, Hadjadj S, Hiraki LT; DCCT/EDIC group, Ahluwalia TS, Almgren P, Schulz CA, Orho-Melander M, Linneberg A, Christensen C, Witte DR, Grarup N, Brandslund I, Melander O, Paterson AD, Tregouet D, Maxwell AP, Lim SC, Ma RCW, Tai ES, Maeda S, Lyssenko V, Tuomi T, Krolewski AS, Rich SS, Hirschhorn JN, Florez JC, Dunger D, Pedersen O, Hansen T, P, Remuzzi G; SUMMIT Consortium, Brosnan MJ, Palmer CNA, Groop PH, Colhoun HM, Groop LC, McCarthy MI. A Genome-Wide Association Study of Diabetic Kidney Disease in Subjects With Type 2 Diabetes. Diabetes. 2018 Apr 27. pii: db170914. doi: 10.2337/db17-0914. PubMed PMID: 29703844.
Morgan A, Taylor J, Bongartz T, Massey J, Mifsud B, Spiliopoulou A, Scott I, Wang J, Morgan M, Plant D, Colombo M, Orchard P, Twigg S, McInnes I, Porter D, Freeston J, Nam J, Cordell H, Isaacs J, Strathdee J, Arnett D, de Hair M, Tak P, Aslibekyan S, Padyukov L, Bridges SL Jr, Pitzalis C, Cope A, Verstappen S, Emery P, Barnes M, Agakov F, McKeigue P, Mushiroda T, Kubo M, Weinshilboum R, Barton A, Barrett J. Genome-wide Association Study of Response to Methotrexate in Early Rheumatoid Arthritis Patients. Pharmacogenomics Journal 2018, May 25. doi: 10.1038/s41397-018-0025-5.
Meng X, Spiliopoulou A, Timofeeva M, Wei W-Q, Gifford A, Shen X, He Y, Varley T, McKeigue P, Tzoulaki I, Wright A F, Joshi P, Denny J C, Campbell H, Theodoratou E. MR-PheWAS: exploring the causal effect of SUA level on multiple disease outcomes by using genetic instruments in UK Biobank. Ann Rheum Dis 2018; in press. doi:10.1136.
McKeigue P. Sample size requirements for learning to classify with high-dimensional biomarker panels. Stat Methods Med Res. 2018 Jan 1:962280217738807. doi: 10.1177/0962280217738807. [Epub ahead of print] PubMed PMID: 29179643
To calculate the required sample size for learning to predict from high-dimensional biomarker panels, classical statistical power calculations are not very useful. What researchers really need is a learning curve, showing how the expected predictive performance of the trained model depends on the size of the training sample from which this model is learned.
I have described a simple method for calculating the sample size required to learn to classify with a high-dimensional biomarker panel, based on the asymptotic distribution of the weight of evidence (log Bayes factor) in a recent paper: Sample size requirements for learning to classify with high-dimensional biomarker panels (Statistical Methods for Medical Research 2017, final version now on PubMed). The assumptions underlying this method are:-
There are no redundant biomarkers in the sense that none of them can be calculated as weighted sums of the others (covariance matrix is of full rank)
The effects of the biomarkers can be approximated by a linear discriminant model in which the class-conditional distributions of the biomarkers are gaussian with the same covariance in cases and noncases.
The class-conditional correlations between the biomarkers are the same as the correlations between their prior effect sizes.
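As a sketch of why the linear discriminant assumption makes the weight of evidence tractable, the derivation below uses notation introduced here for illustration (it is not taken from the paper): $\mu_1$ and $\mu_0$ are the class means in cases and noncases, $\Sigma$ is the shared covariance matrix, $W$ is the weight of evidence in natural log units, and $\Lambda$ is its expectation in cases. It describes the optimal classifier with known parameters, not a model learned from a finite sample.

```latex
\begin{align*}
W(x) &= \log\frac{p(x \mid \text{case})}{p(x \mid \text{noncase})}
      = \delta^{\top}\Sigma^{-1}\Bigl(x - \tfrac{1}{2}(\mu_1 + \mu_0)\Bigr),
      \qquad \delta = \mu_1 - \mu_0, \\
\Lambda &= \mathbb{E}[W \mid \text{case}] = \tfrac{1}{2}\,\delta^{\top}\Sigma^{-1}\delta, \\
W \mid \text{case} &\sim \mathcal{N}(\Lambda,\, 2\Lambda), \qquad
W \mid \text{noncase} \sim \mathcal{N}(-\Lambda,\, 2\Lambda), \\
C &= \Pr\bigl(W_{\text{case}} > W_{\text{noncase}}\bigr) = \Phi\bigl(\sqrt{\Lambda}\bigr).
\end{align*}
```

Dividing $\Lambda$ by $\ln 2$ converts it from natural log units to bits.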
The method is implemented in an online sample size calculator, written with the R shiny package and deployed at https://pmckeigue.shinyapps.io/sampsizeapp/.
To use it, move the sliders to specify:-
the performance of the optimal classifier that could be learned from a training sample of infinite size. This is specified as the C-statistic (area under the ROC curve). C-statistics of 0.80, 0.88 and 0.925 are equivalent to expected information for discrimination of 1, 2 and 3 bits respectively.
The proportion of biomarkers that have nonzero effect sizes, based on a spike-and-slab mixture model for the distribution of effect sizes. For instance, specifying 0.1 is equivalent to specifying that 90% of biomarkers have zero effect size (the spike component), and the remaining 10% have a gaussian distribution of effect sizes with mean zero (the slab component). The app will plot the learning curve based on the proportion that you specify, and also based on a model in which the biomarkers have a gaussian distribution of effect sizes. If the biomarkers in your panel are unselected (for instance a genome-wide profile of gene transcription levels) you may expect that only a small proportion of these biomarkers contain predictive information. If your biomarker panel has been preselected for relevance (for instance markers previously reported to be associated with the outcome under study), you may expect the proportion with nonzero effects to be relatively high.
Then click the “Submit” button.
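To make the spike-and-slab setting concrete, here is a minimal Python sketch that draws effect sizes from such a mixture. It is an illustration only, not the calculator's code: the function name `sample_effect_sizes` and the slab standard deviation `slab_sd` are assumptions introduced here, since this page does not state the slab variance.

```python
import random


def sample_effect_sizes(n_biomarkers, proportion_nonzero, slab_sd=1.0, seed=None):
    """Draw effect sizes from a spike-and-slab mixture (illustrative sketch).

    With probability `proportion_nonzero` an effect is drawn from the slab,
    a gaussian with mean zero and standard deviation `slab_sd` (an assumed
    value); otherwise it is exactly zero (the spike).
    """
    if not 0.0 <= proportion_nonzero <= 1.0:
        raise ValueError("proportion_nonzero must lie in [0, 1]")
    rng = random.Random(seed)
    return [
        rng.gauss(0.0, slab_sd) if rng.random() < proportion_nonzero else 0.0
        for _ in range(n_biomarkers)
    ]


if __name__ == "__main__":
    effects = sample_effect_sizes(10_000, proportion_nonzero=0.1, seed=1)
    share = sum(e != 0.0 for e in effects) / len(effects)
    print(f"share of biomarkers with nonzero effect: {share:.3f}")
```

Setting `proportion_nonzero=1.0` gives the purely gaussian model of effect sizes that the app plots for comparison.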
The table generated by the app has three columns:
the expectation of the information for discrimination extracted by the trained model, as a percentage of the information for discrimination extracted by the optimal model that would be learned from a sample of infinite size. This is scaled from 25% to 90%.
the C-statistic corresponding to these values of expected information for discrimination.
the ratio of cases to variables, assuming a balanced study design with equal numbers of cases and controls.
The app also plots a learning curve, showing how the expected information for discrimination and the C-statistic obtained with the trained model depend on the ratio of cases to variables. In this example, where the optimal model has an expected information for discrimination of 1 bit (equivalent to a C-statistic of 0.80) and 1% of biomarkers have nonzero effects, a sample size of at least 0.1 cases per variable is required to learn a model whose expected information for discrimination is 60% of that obtained with the optimal model.
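The correspondence between bits and the C-statistic quoted on this page can be checked with a short Python sketch. It assumes the equal-covariance gaussian model, under which the weight of evidence is gaussian with mean Λ and variance 2Λ in cases (in natural log units), so that C = Φ(√Λ). The function names are hypothetical and are not part of the calculator.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard gaussian, mean 0 and sd 1


def c_statistic_from_bits(bits):
    """C-statistic implied by an expected information for discrimination in bits.

    Assumes a gaussian weight of evidence with variance equal to twice its
    mean (in natural log units), which gives C = Phi(sqrt(Lambda)).
    """
    if bits < 0:
        raise ValueError("bits must be non-negative")
    nats = bits * math.log(2)  # convert bits to natural log units
    return _STD_NORMAL.cdf(math.sqrt(nats))


def bits_from_c_statistic(c):
    """Inverse mapping: expected information (bits) implied by a C-statistic."""
    if not 0.5 <= c < 1.0:
        raise ValueError("c must lie in [0.5, 1)")
    z = _STD_NORMAL.inv_cdf(c)
    return z * z / math.log(2)


if __name__ == "__main__":
    for bits in (1, 2, 3):
        print(bits, "bit(s) ->", round(c_statistic_from_bits(bits), 3))
```

Under this assumption, 1, 2 and 3 bits map to C-statistics that round to the 0.80, 0.88 and 0.925 quoted in the slider instructions.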