
Model-based inference in the life sciences, or how to get more from your data

As in other areas of science, mathematical models can be used to make inferences about complex dynamical systems when they are quantitatively fitted to data. This approach allows us to formally test and compare competing hypotheses, and to make quantitative predictions for empirical testing. It is the most powerful and rapid way of culling wrong hypotheses.

The greatest challenge of modelling is that a vast number of mathematical models are consistent with the qualitative understanding we currently have of biological processes. So the fit of a particular model to data need not be that informative. This challenge is the core of my work: the iteration of hypothesis generation and testing (mathematically and empirically) in order to understand biological processes and their interactions.

Experimental biology is generating large quantities of high-quality dynamical data. Conventional statistical analyses of such data ignore its dynamical nature, because methodologies for fitting nonlinear mechanistic models to noisy multivariate data have, until recently, been poorly developed. This is an inefficient use of data that is often costly and time-consuming to collect. Recent advances in Bayesian population-based McMC allow us to fit nonlinear dynamical models to multivariate time-series data. This means we can make inferences about the dynamical nature of the data, quantitatively compare competing hypotheses, and extract far more information than conventional statistical analyses allow. This can be done rapidly and cheaply, thus helping to reduce, refine and replace animal experiments as well as allowing the reuse of existing data.

These methods do, however, generate a new set of challenges. The most significant of these is the complexity of the computer algorithms required to adequately characterise the posterior density. We have spent the last three years researching, implementing and tuning a new algorithm to work efficiently on multicore processors, whilst allowing flexibility in model coding so that it can be used on a wide range of problems.
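To make quantitative hypothesis comparison concrete, here is a small, self-contained Python sketch (a toy conjugate-Gaussian model, not our code; the data, prior and temperature ladder are illustrative only) that estimates the marginal likelihood by thermodynamic integration over power posteriors, in the manner of Friel and Pettitt (2008), and compares the estimate with the exact closed-form answer:

```python
import math
import random

random.seed(2)

# Toy data: y_i ~ N(theta, 1) with conjugate prior theta ~ N(0, 1).
y = [1.2, 0.7, 1.9, 0.4, 1.1]
n, s, ss = len(y), sum(y), sum(v * v for v in y)

def log_lik(theta):
    return -0.5 * n * math.log(2 * math.pi) - 0.5 * sum((v - theta) ** 2 for v in y)

# Temperature ladder from prior (t = 0) to posterior (t = 1); Friel and
# Pettitt recommend concentrating the rungs near t = 0.
temps = [(i / 30) ** 5 for i in range(31)]

# E_t[log L] estimated by sampling each power posterior, which by
# conjugacy is Gaussian here: precision 1 + t*n, mean t*s / (1 + t*n).
means = []
for t in temps:
    prec = 1.0 + t * n
    mu, sd = t * s / prec, 1.0 / math.sqrt(prec)
    draws = [random.gauss(mu, sd) for _ in range(2000)]
    means.append(sum(log_lik(th) for th in draws) / len(draws))

# Thermodynamic integration: log Z = integral over [0,1] of E_t[log L] dt
# (trapezoid rule over the ladder).
log_z = sum(0.5 * (means[i] + means[i + 1]) * (temps[i + 1] - temps[i])
            for i in range(len(temps) - 1))

# Exact log marginal likelihood for this conjugate model, for comparison.
exact = (-0.5 * n * math.log(2 * math.pi) - 0.5 * math.log(n + 1)
         - 0.5 * (ss - s * s / (n + 1)))
print(round(log_z, 2), round(exact, 2))
```

In real problems the power posteriors cannot be sampled directly, and each temperature rung would instead be one chain of the McMC population; the integral itself is unchanged.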

Unfortunately the code is not yet ready for public consumption. Perhaps it will be in the future, if I can find someone to fund the development of a user-friendly interface. In the meantime, if you want to collaborate, either in developing the code (particularly recoding it for use on GPUs) or in applying it to your own data, I'd be very happy to hear from you.

There are two critical issues that make model-based inference of biological systems challenging: the strong nonlinearity of the dynamical systems and the high-dimensional parameter space. These have several important consequences for nonlinear parameter estimation and biological inference:
  • analytical solutions rarely exist, so time-consuming numerical solution is required,
  • the likelihood function can have complex features (for example, multiple local maxima and ridges) that often trap search or sampling algorithms in suboptimal regions of parameter space,
  • multiple solutions may exist all of which need to be visited by the search or sampling algorithm,
  • searching or sampling parameter space is very slow for high dimensional systems.
If these issues are not adequately addressed then a good characterisation of the posterior density is unlikely. The consequences are compromised model assessment, compromised model comparison and a lack of confidence in the validity of the biological inferences. A common method of model fitting, maximum likelihood, suffers from all of these problems to such an extent that it becomes almost unusable for nonlinear, multivariate dynamical systems. Bayesian population-based McMC, on the other hand, overcomes all of them.
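The multimodality problem, and the way a tempered population of chains deals with it, can be illustrated with a minimal parallel-tempering sketch in Python (a deliberately simple bimodal target with a flat prior; the temperature ladder and tuning values are illustrative only, and this is not our implementation):

```python
import math
import random

random.seed(1)

def log_likelihood(x):
    # Toy bimodal "likelihood": a mixture of two well-separated Gaussians,
    # the kind of surface that traps a single-chain sampler in one mode.
    return math.log(0.5 * math.exp(-0.5 * (x - 3.0) ** 2)
                    + 0.5 * math.exp(-0.5 * (x + 3.0) ** 2))

temps = [1.0, 0.5, 0.25, 0.1]   # tempering ladder: t = 1.0 is the target
chains = [0.0 for _ in temps]   # one chain per temperature
samples = []                    # draws from the cold (target) chain

for it in range(20000):
    # Within-chain Metropolis update at each temperature: the tempered
    # target is proportional to L(x)**t, flattening the barriers.
    for i, t in enumerate(temps):
        prop = chains[i] + random.gauss(0.0, 1.0)
        log_a = t * (log_likelihood(prop) - log_likelihood(chains[i]))
        if math.log(random.random()) < log_a:
            chains[i] = prop
    # Propose to swap states between a random adjacent pair of chains;
    # hot chains roam freely and feed distant modes down to the cold one.
    i = random.randrange(len(temps) - 1)
    log_a = (temps[i] - temps[i + 1]) * (log_likelihood(chains[i + 1])
                                         - log_likelihood(chains[i]))
    if math.log(random.random()) < log_a:
        chains[i], chains[i + 1] = chains[i + 1], chains[i]
    samples.append(chains[0])

# The cold chain should spend time in both modes (around -3 and +3).
print(sum(1 for x in samples if x > 0) / len(samples))
```

A single Metropolis chain started at one mode of this target would almost never cross to the other; the population version visits both, which is exactly the property needed for the multimodal likelihoods above.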

Here are some books and papers that I have found useful in developing our ideas and algorithms.

Jeffreys, H. (1998). Theory of Probability, 3rd edition, OUP.
Lindley, D. V. (2006). Understanding Uncertainty, Wiley-Blackwell.
Gelman, A., Carlin, J. B., Stern, H. S. and Rubin, D. B. (2003). Bayesian Data Analysis, 2nd edition, Chapman & Hall.
Gilks, W. R., Richardson, S. and Spiegelhalter, D. (1995). Markov Chain Monte Carlo in Practice, Chapman & Hall.
Friel, N. and Pettitt, A. N. (2008). Marginal likelihood estimation via power posteriors, J. R. Statist. Soc. B 70, 589-607.
Girolami, M. (2008). Bayesian inference for differential equations, Theor. Comput. Sci. 408, 4-16.
Lartillot, N. and Philippe, H. (2006). Computing Bayes factors using thermodynamic integration, Syst. Biol. 55, 195-207.
Liang, F. and Wong, W. H. (2001). Real-parameter evolutionary Monte Carlo with applications to Bayesian mixture models, J. Am. Stat. Assoc. 96, 653-666.
Fearnhead, P. (2008). Editorial: Special issue on adaptive Monte Carlo methods, Stat. Comput. 18, 341-342.
MacKay, D. J. C. (1992). Bayesian interpolation, Neural Computation 4, 415-447.

And a word of warning about computing the harmonic mean estimate of the marginal likelihood: its variance is typically infinite, so the estimate is dominated by a handful of rare, low-likelihood draws and can be wildly over-optimistic. I once used it myself, and soon learnt my lesson.
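To see the failure for yourself, here is a small Python sketch (a toy conjugate-Gaussian model with a vague prior; the data and prior are illustrative only) comparing the harmonic mean estimate with the exact marginal likelihood. Because the low-likelihood draws that control the estimate are almost never sampled from the posterior, the estimate typically overshoots the truth:

```python
import math
import random

random.seed(3)

# Toy conjugate model: y_i ~ N(theta, 1), vague prior theta ~ N(0, 10^2),
# chosen so the exact log marginal likelihood is available in closed form.
y = [1.2, 0.7, 1.9, 0.4, 1.1]
n, s, ss = len(y), sum(y), sum(v * v for v in y)
tau2 = 100.0               # prior variance
prec = n + 1.0 / tau2      # posterior precision

def log_lik(theta):
    return -0.5 * n * math.log(2 * math.pi) - 0.5 * sum((v - theta) ** 2 for v in y)

# Exact log marginal likelihood, for comparison.
exact = (-0.5 * n * math.log(2 * math.pi) - 0.5 * math.log(tau2 * prec)
         - 0.5 * ss + s * s / (2.0 * prec))

# Harmonic mean estimate: 1/Z approximated by the average of 1/L over
# exact posterior draws (conjugacy again), via a log-sum-exp for stability.
mu, sd = s / prec, 1.0 / math.sqrt(prec)
draws = [random.gauss(mu, sd) for _ in range(5000)]
log_inv = [-log_lik(th) for th in draws]
m = max(log_inv)
log_hm = -(m + math.log(sum(math.exp(v - m) for v in log_inv) / len(log_inv)))

print(round(log_hm, 2), round(exact, 2))  # the harmonic mean overshoots
```

The vaguer the prior, the larger the overshoot, because the true marginal likelihood falls while the harmonic mean estimate barely moves; the power-posterior and thermodynamic-integration papers above describe estimators that do not suffer from this.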