I am the Chief Research Officer at Synteny Biotechnology. Our organisation uses modern AI methodologies combined with high-throughput experimental techniques to understand the specificities of T-cells, and in doing so, enable a new generation of T-cell based therapies and diagnostics. The company was founded by Lilly Wollman and Jamie Blundell, and I joined in January 2022 to build the organisation from scratch. By working at Synteny, I am able to focus on researching the fascinating natural computation that our bodies perform to identify pathogens and dysfunctional cells.
Before Synteny, I was a Principal Scientist and Research Manager at Microsoft Research Cambridge and project lead for Station B. Before that, I was a PhD student at the University of Cambridge, where I worked on circadian timing in plants in the laboratory of Alex Webb.
During my career, I have always operated at the intersection of biological data and computational analysis. I have made biological discoveries using a wide range of computational techniques. The majority of my earlier work used ordinary differential equation (ODE) models and stochastic chemical kinetics (essentially continuous-time Markov chains). I have also developed techniques for parameter inference and parameter synthesis with dynamical models. In my later years at Microsoft Research, I became interested in probabilistic machine learning models and active learning approaches, including Bayesian optimization. At Synteny, the primary observations are amino acid sequences, and so I have become interested in methods that can classify functional properties of those sequences, or of repertoires (sets) of T-cell receptor and antigen sequences.
At Synteny, I have been able to reignite my interest in immunology, which started when I first joined Andrew Phillips’ research group at Microsoft in 2009. Together, we developed some of the first dynamical models of antigen presentation by class I molecules of the major histocompatibility complex (MHC), collaborating with Tim Elliott, who at the time was at the University of Southampton.
PhD in Plant Sciences, 2009
University of Cambridge
MMath in Mathematics, 2005
University of Oxford
Estimation of parameters in differential equation models can be achieved by applying learning algorithms to quantitative time-series data. However, sometimes it is only possible to measure qualitative changes of a system in response to a controlled condition. In dynamical systems theory, such change points are known as bifurcations and lie on a function of the controlled condition called the bifurcation diagram. In this work, we propose a gradient-based semi-supervised approach for inferring the parameters of differential equations that produce a user-specified bifurcation diagram. The cost function contains a supervised error term that is minimal when the model bifurcations match the specified targets and an unsupervised bifurcation measure which has gradients that push optimisers towards bifurcating parameter regimes. The gradients can be computed without the need to differentiate through the operations of the solver that was used to compute the diagram. We demonstrate parameter inference with minimal models which explore the space of saddle-node and pitchfork diagrams and the genetic toggle switch from synthetic biology. Furthermore, the cost landscape allows us to organise models in terms of topological and geometric equivalence.
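To make the notion of a bifurcation diagram concrete, here is a minimal sketch that locates change points along a controlled condition by counting equilibria. It is not the gradient-based method described in the abstract. The cubic model dx/dt = p + theta*x - x^3, the function names, and the grid are all assumptions chosen for illustration. For this model with theta > 0, the saddle-node bifurcations lie analytically at p = ±2(theta/3)^(3/2), which gives a reference value to compare against.

```python
import numpy as np

def count_equilibria(p, theta, tol=1e-6):
    """Count real roots of p + theta*x - x**3 = 0 (equilibria of the toy model)."""
    roots = np.roots([-1.0, 0.0, theta, p])
    return int(np.sum(np.abs(roots.imag) < tol))

def locate_bifurcations(theta, p_grid):
    """Return midpoints of grid intervals where the number of equilibria changes."""
    counts = np.array([count_equilibria(p, theta) for p in p_grid])
    change = np.nonzero(np.diff(counts))[0]
    return 0.5 * (p_grid[change] + p_grid[change + 1])

# Sweep the controlled condition p for a fixed (hypothetical) parameter theta.
theta = 3.0
p_grid = np.linspace(-3.0, 3.0, 600)
found = locate_bifurcations(theta, p_grid)

# Analytic saddle-node locations for the cubic normal form, for comparison.
analytic = 2.0 * (theta / 3.0) ** 1.5
print("detected:", found, "analytic: +/-", analytic)
```

A supervised error term of the kind the abstract mentions could then compare such detected locations with user-specified targets. How the gradients are obtained is specific to the paper and is not reproduced here.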
During development, cells gain positional information through the interpretation of dynamic morphogen gradients. A proposed mechanism for interpreting opposing morphogen gradients is mutual inhibition of downstream transcription factors, but isolating the role of this specific motif within a natural network remains a challenge. Here, we engineer a synthetic morphogen-induced mutual inhibition circuit in E. coli populations and show that mutual inhibition alone is sufficient to produce stable domains of gene expression in response to dynamic morphogen gradients, provided the spatial average of the morphogens falls within the region of bistability at the single cell level. When we add sender devices, the resulting patterning circuit produces theoretically predicted self-organised gene expression domains in response to a single gradient. We develop computational models of our synthetic circuits parameterised to timecourse fluorescence data, providing both a theoretical and experimental framework for engineering morphogen-induced spatial patterning in cell populations.
We introduce a flexible, scalable Bayesian inference framework for nonlinear dynamical systems characterised by distinct and hierarchical variability at the individual, group, and population levels. Our model class is a generalisation of nonlinear mixed-effects (NLME) dynamical systems, the statistical workhorse for many experimental sciences. We cast parameter inference as stochastic optimisation of an end-to-end differentiable, block-conditional variational autoencoder. We specify the dynamics of the data-generating process as an ordinary differential equation (ODE) such that both the ODE and its solver are fully differentiable. This model class is highly flexible: the ODE right-hand sides can be a mixture of user-prescribed or "white-box" sub-components and neural network or "black-box" sub-components. Using stochastic optimisation, our amortised inference algorithm could seamlessly scale up to massive data collection pipelines (common in labs with robotic automation). Finally, our framework supports interpretability with respect to the underlying dynamics, as well as predictive generalization to unseen combinations of group components (also called "zero-shot" learning). We empirically validate our method by predicting the dynamic behaviour of bacteria that were genetically engineered to function as biosensors.
Tapasin, a component of the major histocompatibility complex (MHC) I peptide loading complex, edits the repertoire of peptides that is presented at the cell surface by MHC I and thereby plays a key role in shaping the hierarchy of CD8+ T-cell responses to tumors and pathogens. We have developed a system that allows us to tune the level of tapasin expression and independently regulate the expression of competing peptides of different off-rates. By quantifying the relative surface expression of peptides presented by MHC I molecules, we show that peptide editing by tapasin can be measured in terms of “tapasin bonus,” which is dependent on both peptide kinetic stability (off-rate) and peptide abundance (peptide supply). Each peptide has therefore an individual tapasin bonus fingerprint. We also show that there is an optimal level of tapasin expression for each peptide in the immunopeptidome, dependent on its off-rate and abundance. This is important, as the level of tapasin expression can vary widely during different stages of the immune response against pathogens or cancer and is often the target for immune escape.
Hebbian theory seeks to explain how the neurons in the brain adapt to stimuli to enable learning. An interesting feature of Hebbian learning is that it is an unsupervised method and, as such, does not require feedback, making it suitable in contexts where systems have to learn autonomously. This paper explores how molecular systems can be designed to show such protointelligent behaviors and proposes the first chemical reaction network (CRN) that can exhibit autonomous Hebbian learning across arbitrarily many input channels. The system emulates a spiking neuron, and we demonstrate that it can learn statistical biases of incoming inputs. The basic CRN is a minimal, thermodynamically plausible set of microreversible chemical equations that can be analyzed with respect to their energy requirements. However, to explore how such chemical systems might be engineered de novo, we also propose an extended version based on enzyme-driven compartmentalized reactions. Finally, we show how a purely DNA system, built upon the paradigm of DNA strand displacement, can realize neuronal dynamics. Our analysis provides a compelling blueprint for exploring autonomous learning in biological settings, bringing us closer to realizing real synthetic biological intelligence.
Cell-free gene expression systems have emerged as a promising platform for field-deployed biosensing and diagnostics. When combined with programmable toehold switch-based RNA sensors, these systems can be used to detect arbitrary RNAs and freeze-dried for room temperature transport to the point-of-need. These sensors, however, have been mainly implemented using reconstituted PURE cell-free protein expression systems that are difficult to source in the Global South due to their high commercial cost and cold-chain shipping requirements. Based on preliminary demonstrations of toehold sensors working on lysates, we describe the fast prototyping of RNA toehold switch-based sensors that can be produced locally and reduce the cost of sensors by two orders of magnitude. We demonstrate that these in-house cell lysates provide sensor performance comparable to commercial PURE cell-free systems. We further optimize these lysates with a CRISPRi strategy to enhance the stability of linear DNAs by knocking-down genes responsible for linear DNA degradation. This enables the direct use of PCR products for fast screening of new designs. As a proof-of-concept, we develop novel toehold sensors for the plant pathogen Potato Virus Y (PVY), which dramatically reduces the yield of this important staple crop. The local implementation of low-cost cell-free toehold sensors could enable biosensing capacity at the regional level and lead to more decentralized models for global surveillance of infectious disease.
Targeted high-throughput DNA sequencing is a primary approach for genomics and molecular diagnostics, and more recently has served as a readout for DNA information storage. Oligonucleotide probes used to enrich gene loci of interest have different hybridization kinetics, resulting in non-uniform coverage that increases sequencing costs and decreases sequencing sensitivities. Here, we present a deep learning model (DLM) for predicting Next-Generation Sequencing (NGS) depth from DNA probe sequences. Our DLM includes a bidirectional recurrent neural network that takes as input both DNA nucleotide identities and the calculated probability of each nucleotide being unpaired. We apply our DLM to three different NGS panels: a 39,145-plex panel for human single nucleotide polymorphisms (SNP), a 2000-plex panel for human long non-coding RNA (lncRNA), and a 7373-plex panel targeting non-human sequences for DNA information storage. In cross-validation, our DLM predicts sequencing depth to within a factor of 3 with 93% accuracy for the SNP panel, and 99% accuracy for the non-human panel. In independent testing, the DLM predicts the lncRNA panel with 89% accuracy when trained on the SNP panel. The same model is also effective at predicting the measured single-plex kinetic rate constants of DNA hybridization and strand displacement.
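One plausible reading of "within a factor of 3" is that the ratio between predicted and observed depth lies between 1/3 and 3. The sketch below computes the fraction of probes that meet this criterion. The function name `within_factor_accuracy` and the toy numbers are hypothetical and are not taken from the paper, and the paper's exact metric definition may differ.

```python
import numpy as np

def within_factor_accuracy(predicted, observed, factor=3.0):
    """Fraction of predictions whose ratio to the observed value is within `factor`."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    # Comparing absolute log-ratios treats over- and under-prediction symmetrically.
    log_ratio = np.abs(np.log(predicted) - np.log(observed))
    return float(np.mean(log_ratio <= np.log(factor)))

# Made-up read depths for four probes (illustration only).
observed = [100.0, 100.0, 100.0, 100.0]
predicted = [50.0, 290.0, 310.0, 20.0]
accuracy = within_factor_accuracy(predicted, observed)
print(accuracy)
```

Working on the log scale means a prediction that is three times too high and one that is three times too low are penalised equally, which suits depths that span orders of magnitude.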