BIGS is a joint team of Inria, CNRS and University of Lorraine, within the Institut Élie Cartan of Lorraine (IECL), UMR 7502 CNRS-UL laboratory in mathematics, of which Inria is a strong partner. One member of BIGS, T. Bastogne, comes from the Research Center of Automatic Control of Nancy (CRAN), with which BIGS has strong relations in the domain “Health-Biology-Signal”. Our research is mainly focused on stochastic modeling and statistics but also aims at a better understanding of biological systems. BIGS involves applied mathematicians whose research interests mainly concern probability and statistics. More precisely, our attention is directed at (1) stochastic modeling, (2) estimation and control for stochastic processes, (3) regression and machine learning, and (4) statistical learning and application in health. The main objective of BIGS is to exploit these skills in applied mathematics to provide a better understanding of issues arising in life sciences, with a special focus on (1) tumor growth and heterogeneity, (2) gene networks, (3) telomere length dynamics, (4) epidemiology and e-health.
We give here the main lines of our research. For clarity, we made the choice to structure them in four items. Note that all of these items deal with stochastic modeling and inference, therefore they are all interconnected.
Our aim is to propose relevant stochastic frameworks for the modeling and the understanding of biological systems. Stochastic processes are particularly suitable for this purpose. Among them, Markov processes provide a first framework for the modeling of populations of cells 84, 64. Piecewise deterministic processes are non-diffusion processes that are also frequently used in the biological context 51, 63, 52. Among Markov models, we developed strong expertise in processes derived from Brownian motion and Stochastic Differential Equations 79, 62. For instance, knowledge about Brownian or random walk excursions 83, 78 helps to analyse genetic sequences and to develop inference about them. We also have strong expertise in stochastic modeling of complex biological populations using individual-based models. These models can be used either from the point of view of asymptotic stochastic analysis 48, e.g. to study the long term Darwinian evolution of populations, or from the point of view of numerical analysis of biological phenomena 58, 39. We also develop mathematical tools for the analysis of the long-time behavior of stochastic population processes accounting for possible extinction of (sub)populations 49.
We develop inference about the stochastic processes that we use for modeling. Control of stochastic processes is also a way to optimise administration (dose, frequency) of therapy, such as targeted therapies in cancer. Our team has solid expertise in the inference of the jump rate and the kernel of piecewise-deterministic Markov processes (PDMP) 43, 42, 2, but many directions remain to be explored. For instance, previous work made the assumption of a complete observation of jumps and mode, which is unrealistic in practice. We also tackle the problem of inference of “hidden PDMP”. For example, in pharmacokinetics modeling inference, we want to account for the presence of timing noise and identification from longitudinal data. We have expertise on these subjects 44, and we also use mixed models to estimate tumor growth or heterogeneity 45.
We consider the control of stochastic processes within the framework of Markov Decision Processes 76 and their generalization known as multi-player stochastic games, with a particular focus on infinite-horizon problems. In this context, we are interested in the complexity analysis of standard algorithms, as well as the design and analysis of approximate numerical schemes for large problems in the spirit of 46. Regarding complexity, a central topic of research is the analysis of the Policy Iteration algorithm, on which significant progress has been made in recent years 86, 75, 60, 82, but which is still not fully understood. For large problems, we have extensive experience in sensitivity analysis of approximate dynamic programming algorithms for Markov Decision Processes 80, 67, 81, and we currently investigate whether/how similar ideas may be adapted to multi-player stochastic games.
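As an illustration only, the Policy Iteration algorithm mentioned above can be sketched on a tiny discounted Markov Decision Process. The two-state model, its transition probabilities, rewards and discount factor are all made up for the example, and policy evaluation is done by simple fixed-point iteration rather than by solving a linear system.

```python
"""Policy Iteration on a tiny, hypothetical discounted MDP (illustrative sketch)."""

# Hypothetical 2-state, 2-action MDP (all numbers are assumptions).
# P[s][a] lists (next_state, probability); R[s][a] is the expected reward.
P = {
    0: {0: [(0, 0.9), (1, 0.1)], 1: [(0, 0.2), (1, 0.8)]},
    1: {0: [(0, 0.5), (1, 0.5)], 1: [(1, 1.0)]},
}
R = {0: {0: 0.0, 1: -1.0}, 1: {0: 1.0, 1: 2.0}}
GAMMA = 0.9  # discount factor of the infinite-horizon criterion


def q_value(s, a, v):
    """One-step lookahead value of playing action a in state s."""
    return R[s][a] + GAMMA * sum(p * v[t] for t, p in P[s][a])


def evaluate(policy, tol=1e-10):
    """Policy evaluation by in-place fixed-point iteration."""
    v = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            new = q_value(s, policy[s], v)
            delta = max(delta, abs(new - v[s]))
            v[s] = new
        if delta < tol:
            return v


def policy_iteration():
    """Alternate evaluation and greedy improvement until the policy is stable."""
    policy = {s: 0 for s in P}
    while True:
        v = evaluate(policy)
        improved = {s: max(P[s], key=lambda a: q_value(s, a, v)) for s in P}
        if improved == policy:
            return policy, v
        policy = improved


if __name__ == "__main__":
    best_policy, values = policy_iteration()
    print(best_policy, values)
```

Each improvement step is greedy with respect to the current value function; the complexity questions mentioned above concern how many such steps may be needed in the worst case.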
Recently, our group has focused its attention on modeling and inference for graph data. A graph data structure consists of a set of nodes, together with a set of pairs of these nodes called edges. This type of data is frequently used in biology because it provides a mathematical representation of many concepts such as biological networks of relationships in a population or between genes in a cell.
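To fix ideas, a graph in this sense can be stored as a set of nodes and a set of node pairs. The short sketch below uses hypothetical gene names and builds an adjacency map for an undirected graph; it is only meant to make the definition concrete.

```python
# Hypothetical node labels, for illustration only.
nodes = {"geneA", "geneB", "geneC", "geneD"}
# Each edge is a pair of nodes; the graph is treated as undirected here.
edges = {("geneA", "geneB"), ("geneB", "geneC")}


def adjacency(nodes, edges):
    """Map each node to the set of its neighbours."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


adj = adjacency(nodes, edges)
# Degree of a node = number of neighbours; isolated nodes have degree 0.
degrees = {n: len(adj[n]) for n in nodes}
```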
Network inference is the process of making inference about the link between two variables, by taking into account the information about other variables. Reference 85 gives a very good introduction and many references about network inference and mining. Many methods are available to infer and test edges in Gaussian graphical models 85, 69, 57, 59. However, the Gaussian assumption does not hold when dealing with typical “zero-inflated” abundance data, and we want to develop inference in this case.
Concerning gene networks, most studies have been based on population-averaged data: now that technologies enable us to observe mRNA levels in individual cells, a revolution in terms of precision, the network reconstruction problem paradoxically becomes more challenging than ever. Indeed, the traditional way of seeing a gene regulatory network as a deterministic system with some small external noise is being challenged by the probabilistic, bursty nature of gene expression revealed at single-cell level. Our objective is to propose dynamical models and inference methods that fully exploit the particular time structure of single-cell data. We described a promising strategy in which the network inference problem is seen as a calibration procedure for a new PDMP model that is able to acceptably reproduce real single-cell data 61, 77.
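To illustrate what a bursty PDMP description of gene expression looks like, here is a minimal sketch of the classical two-state promoter model for a single gene. It is not the network model of 61, 77: the function name, the parameters `k_on`, `k_off`, `s`, `d` and the absence of any gene interaction are assumptions made for the example. The promoter switches on and off at exponential times, and between switches the mRNA level follows a deterministic linear flow, which is integrated exactly.

```python
import math
import random


def simulate_two_state(k_on, k_off, s, d, t_end, seed=0):
    """Exact simulation of a two-state promoter PDMP on [0, t_end].

    Promoter off: dx/dt = -d * x.  Promoter on: dx/dt = s - d * x.
    Returns a list of (time, mRNA level, promoter_on) recorded at each jump.
    """
    rng = random.Random(seed)
    t, x, on = 0.0, 0.0, False
    path = [(t, x, on)]
    while True:
        rate = k_off if on else k_on          # rate of the next promoter switch
        t_next = t + rng.expovariate(rate)
        reached_end = t_next >= t_end
        if reached_end:
            t_next = t_end
        target = s / d if on else 0.0         # level the flow relaxes towards
        x = target + (x - target) * math.exp(-d * (t_next - t))
        t = t_next
        if reached_end:
            path.append((t, x, on))
            return path
        on = not on                           # promoter jump
        path.append((t, x, on))


if __name__ == "__main__":
    for record in simulate_two_state(0.5, 2.0, 10.0, 1.0, 20.0, seed=1)[:5]:
        print(record)
```

When `k_off` is much larger than `k_on`, the trajectory consists of short bursts of production separated by long decay phases, which is the qualitative behavior referred to above.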
Among graphs, trees play a special role because they offer a powerful model for many biological concepts, from RNA to plant structures and phylogenetic trees in heterogeneous tumors. Our research deals with several aspects of tree data. In particular, we work on statistical inference for this type of data under a given stochastic model. We also work on lossy compression of trees via directed acyclic graphs. These methods enable us to compute distances between tree data faster than from the original structures and with a high accuracy.
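The idea of representing a tree by a directed acyclic graph can be sketched as follows. This example shows only the exact (lossless) reduction, in which identical subtrees are stored once; the lossy variant mentioned above is not reproduced here. Trees are encoded as nested tuples of children and are treated as unordered, both of which are assumptions of the sketch.

```python
def count_nodes(tree):
    """Number of nodes of a tree given as a nested tuple of children."""
    return 1 + sum(count_nodes(child) for child in tree)


def dag_compress(tree, table=None):
    """Share identical subtrees: return (root_id, table).

    `table` maps the sorted tuple of children ids of a DAG node to its id,
    so two isomorphic (unordered) subtrees receive the same id.
    """
    if table is None:
        table = {}
    child_ids = tuple(sorted(dag_compress(child, table)[0] for child in tree))
    if child_ids not in table:
        table[child_ids] = len(table)
    return table[child_ids], table


# A leaf is the empty tuple; this tree has two identical cherries and one leaf.
example = (((), ()), ((), ()), ())
root_id, dag = dag_compress(example)
```

On this example the 8 nodes of the tree collapse to 3 DAG nodes (leaf, cherry, root), which is why computations on the DAG can be much cheaper than on the original structure.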
Regression models and machine learning aim at inferring statistical links between a variable of interest and covariates. In biological studies, it is always important to develop adapted learning methods both in the context of standard data and also for data of high dimension (sometimes with few observations) and very massive or online data.
Many methods are available to estimate conditional quantiles and test dependencies 74, 65. Among them we have developed nonparametric estimation by local analysis via kernel methods 55, 56 and we want to study properties of this estimator in order to derive a measure of risk based e.g. on confidence bands and tests. We also study other regression models such as survival analysis and spatio-temporal models with covariates. Among the multiple regression models, we want to develop omnibus tests that examine several assumptions together.
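As a rough illustration of kernel-based conditional quantile estimation, the sketch below inverts a Nadaraya-Watson type estimate of the conditional distribution function. The Gaussian kernel, the bandwidth `h` and the function name are assumptions of the example; the estimators studied in 55, 56 may differ (for instance by using local polynomial weights).

```python
import math


def conditional_quantile(xs, ys, x0, alpha, h):
    """Kernel estimate of the alpha-quantile of Y given X = x0.

    Weights w_i = K((x_i - x0) / h) with a Gaussian kernel K; the estimated
    conditional CDF is F(y | x0) = sum_i w_i 1{y_i <= y} / sum_i w_i, and the
    quantile is the smallest observed y with F(y | x0) >= alpha.
    """
    weights = [math.exp(-0.5 * ((x - x0) / h) ** 2) for x in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no observation close enough to x0 for this bandwidth")
    cumulative = 0.0
    for y, w in sorted(zip(ys, weights)):
        cumulative += w
        if cumulative >= alpha * total:
            return y
    return max(ys)  # guards against rounding when alpha is close to 1
```

The bandwidth controls the usual bias-variance trade-off: a small `h` uses only observations near `x0`, a large `h` approaches the unconditional empirical quantile.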
Concerning the analysis of high dimensional data, our view on the topic relies on the French data analysis school, specifically on Factorial Analysis. In this context, stochastic approximation is an essential tool 66, which allows one to approximate eigenvectors in a stepwise manner 71, 70, 73. We aim at performing accurate classification or clustering by taking advantage of the possibility of updating the information "online" using stochastic approximation algorithms 47. We focus on several incremental procedures for regression and data analysis like linear and logistic regressions and PCA (Principal Component Analysis).
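A classical example of such a stepwise scheme is Oja's stochastic approximation process for the leading eigenvector of a covariance matrix. The sketch below is a textbook version with step size 1/n and renormalization at each step, run on a synthetic two-dimensional stream; the data model, the step size and the function names are assumptions and do not reproduce the specific processes of 71, 70, 73.

```python
import math
import random


def oja_first_eigenvector(stream, w0):
    """Online estimate of the leading eigenvector of E[x x^T] (centered data)."""
    w = list(w0)
    for n, x in enumerate(stream, start=1):
        proj = sum(xi * wi for xi, wi in zip(x, w))           # x^T w
        w = [wi + (1.0 / n) * proj * xi for wi, xi in zip(w, x)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        w = [wi / norm for wi in w]                           # keep unit norm
    return w


def synthetic_stream(n, seed=0):
    """Centered 2D data with standard deviation 3 along (1, 1) and 0.5 along (1, -1)."""
    rng = random.Random(seed)
    r = 1.0 / math.sqrt(2.0)
    for _ in range(n):
        a = 3.0 * rng.gauss(0.0, 1.0)
        b = 0.5 * rng.gauss(0.0, 1.0)
        yield (r * (a + b), r * (a - b))


if __name__ == "__main__":
    print(oja_first_eigenvector(synthetic_stream(5000), (1.0, 0.0)))
```

Each observation is used once and then discarded, so memory usage does not grow with the length of the stream.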
We also focus on the biological context of high-throughput bioassays in which several hundreds or thousands of biological signals are measured for subsequent analysis. We have to account for the inter-individual variability within the modeling procedure. We aim at developing a new solution based on an ARX (Auto Regressive model with eXternal inputs) model structure using the EM (Expectation-Maximisation) algorithm for the estimation of the model parameters.
We want to propose stochastic processes to model the appearance of mutations and the evolution of their frequencies in tumor samples, through new collaborations with clinicians who measure a particular quantity called circulating tumor DNA (ctDNA). The final purpose is to use ctDNA as an early biomarker of the resistance to a targeted therapy: this is the aim of the project funded by ITMO Cancer that we coordinate. In the ongoing work on low-grade gliomas, a local database of 400 patients will soon be available to construct models. We plan to extend it through national and international collaborations (Montpellier CHU, Montreal CRHUM). Our aim is to build a decision-aid tool for personalised medicine.
We already mentioned in Section 3.4 our interest in the modeling and inference of transcriptomic bursting in gene regulatory networks from single-cell data. We are also currently working on the prediction and identification of therapeutic targets for chronic lymphocytic leukemia from gene expression data. Our goal is to propose new models able to predict the outcome of gene silencing experiments. Inference will be performed on gene expression data from cells of patients suffering from different forms of chronic lymphocytic leukemia. The goal is to identify therapeutic targets which could be silenced to reduce cell proliferation.
In the context of personalized medicine, we have many ongoing projects with CHU Nancy. They deal with biomarkers research, prognostic value of quantitative variables and events, scoring, and adverse events. We also want to develop our expertise in change-point detection in a project with APHP (Assistance Publique Hôpitaux de Paris) for the detection of adverse events, earlier than the clinical signs and symptoms. The clinical relevance of predictive analytics is obvious for high-risk patients such as those with solid organ transplantation or severe chronic respiratory disease. The main challenge is change-point detection in multivariate and heterogeneous signals (for instance daily measures of electrocardiogram, body temperature, spirometry parameters, sleep duration, etc.). Other collaborations with clinicians concern foetopathology, and we want to use our work on conditional distribution functions to explain fetal and child growth. To that end, we use data from the “Service de fœtopathologie et de placentologie” of the “Maternité Régionale Universitaire” (CHU Nancy).
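The detection of an abrupt change (rupture) in a signal can be illustrated in its simplest form: a single change in the mean of a univariate sequence, located with the CUSUM statistic. This is a textbook sketch under strong simplifying assumptions (one signal, one change, offline data); the multivariate and heterogeneous setting described above is considerably harder. The function name is hypothetical.

```python
def cusum_changepoint(x):
    """Return k in 1..n-1 maximising |sum_{i<=k} (x_i - mean(x))|.

    The estimated change in mean occurs between observations k and k+1
    (1-based), i.e. the first segment is x[:k].
    """
    n = len(x)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(x) / n
    best_k, best_stat, partial = 1, -1.0, 0.0
    for k in range(1, n):
        partial += x[k - 1] - mean
        if abs(partial) > best_stat:
            best_stat, best_k = abs(partial), k
    return best_k
```

For instance, on a sequence of ten zeros followed by ten fives the statistic peaks after the tenth observation, so the function returns 10.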
Telomeres are disposable buffers at the ends of chromosomes that are truncated during cell division, so that the telomere ends become shorter over time. In this way, they are markers of aging. Through a collaboration with Pr A. Benetos, geriatrician at CHU Nancy, we recently obtained data on the distribution of the length of telomeres from blood cells 9. We want to work in three connected directions: (1) refine methodology for the analysis of the available data; (2) propose a dynamical model for the lengths of telomeres and study its mathematical properties (long term behavior, quasi-stationarity, etc.); and (3) use these properties to develop new statistical methods.
D. Villemonais has been granted a delegation at Institut Universitaire de France from September 2023 to August 2028.
The aim of this collaboration is to better understand how living cells make decisions (e.g., differentiation of a stem cell into a particular specialized type), seeing decision-making as an emergent property of an underlying complex molecular network. Indeed, it is now proven that cells react probabilistically to their environment: cell types do not correspond to fixed states, but rather to “potential wells” of a certain energy landscape (representing the energy of the possible states of the cell) that we are trying to reconstruct. The achievement of last year was to show that the same mathematical model driven by transcriptional bursting can be used simultaneously as an inference tool, to reconstruct biologically relevant networks, and as a simulation tool, to generate realistic transcriptional profiles emerging from gene interactions: the article presenting these results is now published 25. In addition, the paper proposing a landscape reconstruction method with application to several datasets has also been published this year 22.
These results form the starting point of M. Gaillard's thesis work, which will focus on making links with interpretable dimension reduction for single-cell RNA-seq data. Finally, we are working with software engineer N. Seyler on a refactoring of the “Harissa” Python package used in 10 for stochastic simulation and inference of gene regulatory networks, with the aim of making it modular and scalable. The latest stable version is available on PyPI and is presented in a dedicated tool paper 30.
We are continuing our research on quasi-stationary distributions (QSD), that is, distributions of Markov stochastic processes with absorption, which are stationary conditionally on non-absorption. For models of biological populations, absorption usually corresponds to extinction of a (sub-)population. QSDs are fundamental tools to describe the population state before extinction and to quantify the large-time behavior of the probability of extinction.
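For reference, the definition above can be written as follows. The notation (state space E, cemetery point, absorption time, rate) is introduced here for the purpose of the formula and is an assumption, since the text does not fix any.

```latex
% Notation (assumed): (X_t)_{t \ge 0} is a Markov process on E \cup \{\partial\},
% absorbed at \partial at time \tau_\partial = \inf\{t \ge 0 : X_t = \partial\}.
% A probability measure \nu on E is a quasi-stationary distribution (QSD) if
\[
  \mathbb{P}_\nu\left(X_t \in A \mid t < \tau_\partial\right) = \nu(A)
  \qquad \text{for all } t \ge 0 \text{ and all measurable } A \subset E .
\]
% In that case the absorption time is exponentially distributed under \nu:
\[
  \mathbb{P}_\nu\left(t < \tau_\partial\right) = e^{-\lambda_0 t}
  \qquad \text{for some } \lambda_0 \ge 0 .
\]
```

The second identity is what links a QSD to the large-time decay of the survival probability mentioned above.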
Thanks to the previous general result of the team in 50, together with B. Cloez (INRAE), we proved in 16 the exponential convergence of a chemostat model, whose dynamics are highly degenerate due to a deterministic part, towards a unique quasi-stationary distribution.
We also finalized an important work 15 that provides general criteria for the exponential convergence of conditional distributions of absorbed Markov processes when the convergence is not uniform with respect to the initial distribution. Our results allow us to characterize a large subset of the domain of attraction of the minimal QSD and apply to a large range of stochastic processes, including diffusion processes and perturbed dynamical systems.
In collaboration with E. Strickler (Univ. Lorraine), we also studied in 34 the convergence of general penalized Markov processes with soft killing in
In this collaborative study, we delve into the dynamics of measure-valued Pólya processes (MVPPs), commonly known as Pólya urns with infinitely-many colours. Our study introduces the first second-order results in the literature on MVPPs, extending classical fluctuation outcomes from finitely-many-colour Pólya urns to the infinite colour space scenario. The nature of fluctuations in MVPPs is intricately linked to the “spectral gap”, adding a layer of sophistication to our understanding of these processes.
By framing MVPPs as stochastic approximations operating within the set of measures on a measurable space
We continued our study of parameter scalings of individual-based models of biological populations under mutation and selection, taking into account the influence of negligible but non-extinct populations. In a work within the ERC SINGER 14, we were able to give an individual-based justification of the Hamilton-Jacobi equation of adaptive dynamics (see e.g. 68), with a specific parameter scaling that is promising for the study of local (in space) extinction of sub-populations. The analysis of models allowing for such an extinction is the next step of this project. We also wrote an article 26 for the proceedings of the International Congress of Mathematicians (ICM 2022) where S. Méléard gave an invited talk on several large population scalings that can be used in evolutionary biology.
We also worked on general evolutionary models of adaptive dynamics under an assumption of large population and small mutations. We obtained in 13 existence, uniqueness and ergodicity results for a centered version of the Fleming-Viot process of population genetics, which are key steps to recover variants of the canonical equation of adaptive dynamics, which describes the long time evolution of the dominant phenotype in the population, under less stringent biological assumptions than in previous works such as 48. We completed this second step in 33.
In this collaboration, our focus is on investigating the large population limit of a binary branching particle system with Moran type interactions. The novel model introduced in this paper features particles that evolve, reproduce, and die independently. It encompasses branching models and fixed size Moran type interacting particle systems. The death of a particle may trigger the reproduction of another, while a branching event may, in turn, lead to the demise of another particle. Our study 17 aims to elucidate the intricate dynamics of this model. We explore diverse applications of our model, including its relevance to the neutron transport equation and population size dynamics. We focus on the occupation measure of the new model, explicitly connecting it to the Feynman-Kac semigroup of the underlying Markov evolution. Additionally, we quantify the
Asexual multi-type Galton-Watson branching processes as well as single-type bisexual processes have been studied in the literature. In particular, survival conditions for these processes are well known in both cases. However, until now, multi-type bisexual branching processes have only been studied in very specific situations and no general mathematical description has been established yet.
In 21, we studied general multi-type bisexual branching processes with superadditive mating function. We exhibited a necessary and sufficient condition for almost sure extinction, we proved a law of large numbers for our model and we studied the long-time convergence of the rescaled process.
In this study, we construct and analyze an individual-based model capturing the evolution of telomere length in a population across multiple generations 32. The model, a continuous-time typed branching process, incorporates individual characteristics such as gamete mean telomere length and age. Our investigation delves into the Malthusian behavior of the model, and we complement our findings with numerical simulations to elucidate the impact of biologically relevant parameters on telomere length dynamics on an evolutionary time scale.
Lung exposure to various types of particles, such as those present in cigarette smoke, can lead to chronic obstructive pulmonary disease (COPD). COPD bronchi are an area of intense immunological activity and tissue remodeling, as evidenced by the extensive immune cell infiltration and changes in tissue structures. This allows persistent contact between resident cells and stimulated immune cells. Our hypothesis is that the contact between cells is a major cause of chronic destructive or fibrotic manifestations. We aim to analyze the potential cell-cell interactions in situ in human tissues, to characterize in vitro the dynamics of the interplay, and to define a computational model with intercellular interactions which fits experimental measurements and explains the macroscopic properties of cell populations. The effects of potential therapeutic drugs modulating local intercellular interactions will be tested by simulations. A paper has been submitted this year 19 (see also 54).
In a collaboration with A. Lejay (Inria PASTA team) and their PhD student A. Anagnostakis, D. Villemonais proposed a method for approximating general, singular diffusions by discrete time and state space processes 11. One of its main advantages over existing methods is that the main computational cost is incurred upstream and thus represents a fixed cost, independent of the number of simulations performed afterwards.
Many goodness-of-fit tests have been developed to assess the different assumptions of a (possibly heteroscedastic) regression model. Most of them are “directional” in that they detect departures from a given assumption of the model. Other tests are “global” (or “omnibus”) in that they assess whether a model fits a dataset on all its assumptions. We focus on the task of choosing the structural part of the regression and the variance functions because they contain easily interpretable information about the studied relationship. We consider two nonparametric “directional” tests and one nonparametric “global” test, all based on generalizations of the Cramér-von Mises statistic.
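As background, the classical one-sample Cramér-von Mises statistic, which the tests above generalize to the regression setting, can be computed as in the sketch below. This is the textbook statistic for testing a fully specified continuous distribution, not the regression versions implemented in the team's software; the function name is an assumption.

```python
def cramer_von_mises(sample, cdf):
    """Classical statistic W^2 = 1/(12n) + sum_i (F(x_(i)) - (2i - 1)/(2n))^2.

    `cdf` is the hypothesised continuous distribution function F and
    x_(1) <= ... <= x_(n) are the ordered observations.
    """
    n = len(sample)
    if n == 0:
        raise ValueError("empty sample")
    ordered = sorted(sample)
    return 1.0 / (12 * n) + sum(
        (cdf(x) - (2 * i - 1) / (2.0 * n)) ** 2
        for i, x in enumerate(ordered, start=1)
    )


def uniform_cdf(x):
    """Distribution function of the uniform law on [0, 1]."""
    return min(max(x, 0.0), 1.0)
```

Large values of the statistic indicate that the empirical distribution function is far from the hypothesised one; the minimum possible value, 1/(12n), is attained when each ordered observation sits exactly at the quantile (2i - 1)/(2n).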
To perform these goodness-of-fit tests, we have developed the R package cvmgof 40, an easy-to-use tool for practitioners, available from the Comprehensive R Archive Network (CRAN). The package was updated in 2022 (this is its third version) 41. This latest version currently allows testing the “regression function” part of the model. In 2023, we worked to enrich the package by allowing the user to test the homoskedasticity/heteroskedasticity of the model. This new version will be submitted to CRAN in 2024 and an associated article is currently being written.
To complete this work, we plan to assess the other assumptions of a regression model such as the additivity of the random error term. The implementation of these directional tests would enrich the cvmgof package and offer a complete easy-to-use tool for validating regression models. Another perspective of this work would be to develop a similar tool for other statistical models widely used in practice such as generalized linear models.
The estimation of the probability density function underlying a finite set of observations is a fundamental problem that covers a broad range of applications including machine learning. We propose a new nonparametric method to estimate this function that combines both the Schwartz distribution theory and the possibility theory. It is an extension of the kernel density estimator that leads to imprecise estimation, based on a new type of kernel called maxitive kernel. The form of the obtained estimation is an interval. In collaboration with B. Nehme, S. Ferrigno demonstrated several theoretical properties of the imprecise estimator. We implement this method using very low complexity algorithms and illustrate some theoretical properties of the proposed imprecise density estimation as well as a comparative analysis with other estimation intervals. An associated article is currently being written.
A tool for analyzing streaming data is stochastic approximation, introduced by Robbins and Monro in 1951, which can be used for example to estimate online the parameters of a regression function 53 or the centers of clusters in unsupervised classification 47. Another type of stochastic approximation process was introduced by Benzécri in 1969 for estimating eigenvectors and eigenvalues of the unknown
In the article 24, we establish an almost sure convergence theorem of an extension of the stochastic approximation process of Oja for estimating eigenvectors of the unknown
In the article 36, after recalling an almost sure convergence theorem of an extended Oja process 24, we present the canonical correlation analysis (CCA) of two random vectors
The large application potential of microbiomes has led to a great need for mixed culture methods. However, microbial interactions can compromise the maintenance of biodiversity during cultivation in a reactor. In particular, competition among species can lead to a strong disequilibrium in favor of the fittest microorganism. The aim of this study was to evaluate the potential of single invert emulsions to alleviate competition during the culture of antagonistic microorganisms and therefore to maintain diversity in a more complex mixed culture. Experimental data obtained in this study were analyzed using a two-way analysis of variance with a fixed effects model, followed by Tukey's HSD test. For the droplet size distributions of the invert emulsions, the factors were the presence or absence of bacteria and the incubation of invert emulsions. For bacterial enumerations, the factors were the cultivation system used and the incubation. In community cultivation experiments, differences in Shannon diversity index between groups of samples were tested using one-way analysis of variance, followed by Tukey's HSD test. An article 18 on this work was published in 2023.
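For readers outside ecology, the Shannon diversity index of a sample is H = -sum_i p_i log(p_i), where p_i is the relative abundance of species i. The sketch below computes it from raw counts; the use of the natural logarithm and the function name are assumptions, as the text does not specify the base.

```python
import math


def shannon_index(counts):
    """Shannon diversity H = -sum p_i * ln(p_i) from species counts.

    Species with zero count contribute nothing (0 * ln 0 is taken as 0).
    """
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one positive value")
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)
```

The index is 0 when a single species dominates completely and reaches its maximum, the logarithm of the number of species, when all species are equally abundant.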
In this collaboration, we work on the inference of dynamical gene networks from RNAseq and proteome data. The goal is to infer a model of gene expression able to predict gene expression in cells where the expression of specific genes is silenced (e.g. using siRNA), in order to select the silencing experiments which are most likely to reduce cell proliferation. We expect the selected genes to provide new therapeutic targets for the treatment of chronic lymphocytic leukemia. This year, we have developed a new method for predicting the effect of gene silencing, based on the re-exploitation of expression data of genes not influenced by the silenced gene 27. We have also developed the package MultiRNAflow (see Section 6.1.2) for the statistical analysis of temporal gene expression datasets with several biological conditions (in particular for exploratory analysis and the detection of differentially expressed genes). The package is described in the application note 35.
The start-up EMOSIS develops blood tests relying on flow cytometry in order to improve in vitro diagnosis of vascular thrombosis. This technology leads to multiparametric measurements on tens of thousands of cells collected from each blood sample. Manual methods of analysis classically used in flow cytometry are based on data visualization by means of histograms or scatter plots. A computational approach that would automate and deepen the search for differences or similarities between cell subpopulations could thus increase the quality of diagnosis.
Recent progress in the active area of computational methods for dimension reduction suggests many directions of improvement of the classical approaches for the analysis of flow cytometry data. The approach that we considered is information geometry, whose principle is to lower the dimensionality of multiparametric observations by considering the subspace of the parameters of the statistical model describing the observation, whose points are probability density functions, and which is equipped with a special geometrical structure. The objective of the reported study is to use an algorithm belonging to the field of information geometry known as Fisher Information Non-parametric Embedding (FINE) to analyze flow cytometry data in the context of the specific severe disorder called heparin-induced thrombocytopenia. This work led to two communications in conferences 28, 29.
Unfortunately, the start-up EMOSIS no longer exists, which put an end to our collaboration.
Endometriosis is a chronic disease characterized by growth of endometrial tissue outside the uterine cavity which could affect 200 million women worldwide. One of the most common symptoms of endometriosis is pelvic chronic pain associated with fatigue. This pain can cause psychological distress and interpersonal difficulties. As for several chronic diseases, adapted physical activity could help to manage the physical and psychological symptoms.
We are participating in both design and statistical analysis of a randomized-controlled trial, led by G. Escriva-Boulley, to investigate the potential effects of a videoconference-based adapted physical activity combined with endometriosis-based education program 20. This study is one of the first trials to test the effects of a combined adapted physical activity and education program for improving endometriosis symptoms and physical activity.
As part of the French “Plan de relance”, we obtained funds for a 2-year engineering contract with the start-up EMOSIS based in Strasbourg (from October 1, 2022). Project MOSAiC: MultidimensiOnal Statistical Analysis of Information for Clinical use. Unfortunately, EMOSIS was ordered to file for bankruptcy in 2024 and the project was stopped.
N. Champagnat is scientific collaborator of the ERC SINGER (AdG 101054787) on Stochastic dynamics of sINgle cells, coordinated by S. Méléard (Ecole Polytechnique). He is involved in the research axes “From stochastic processes to singular Hamilton-Jacobi equations” and “Lineages and time reversed trajectories” of this project.
A. Gégout-Petit is one of the two PIs of the interdisciplinary program “Life Travel” of the I-Site “Lorraine Université d'Excellence” on life trajectories and longevity (under construction).
BIGS faculty members have teaching obligations at Univ. Lorraine and teach at least 192 hours each year. They teach probability and statistics at different levels (Licence, Master, Engineering school). Many of them have pedagogical responsibilities.