The set of available drugs for neurological diseases is both aging and lacking in effectiveness. There remains a very high unmet medical need for treatments in neurology, despite heavy historical investment in the field 25, 56. A typical drug development process comprises a series of successive stages (fig. 1): the effects of the candidate drug are evaluated first in vitro on cell culture models, then in vivo on animal models (pre-clinical), after which the mechanistic origins of the candidate drug effect can be assessed in animals (sometimes in humans). The next stage consists of studies in humans based on clinical trials, themselves structured in successive phases: phase I to test for safety, phase II to determine the effect of the candidate on a small set of patients, and phase III, which includes large cohorts of patients and control subjects in a randomized setting. Each stage of this process carries a significant probability of failure that halts the development of the candidate drug. With recent public health issues (HIV, Covid-19), the general public has mostly become aware of failures between the successive phases of the clinical trial stage. However, in the field of neurology, the difficulties in developing drug candidates are mainly due to a high failure rate in the clinic: the activity of a drug candidate in in vitro cell cultures or in animal models is very often not confirmed in humans 40, 44.
In recent years, numerical approaches have been proposed in the field, either with mechanistic modeling to predict the response of the cell to the candidate molecule (quantitative systems biology/pharmacology) 51, 57 or with machine learning to identify the impacted (sub)cellular systems or the effects of the candidate drug 65, 46. However, these approaches are still insufficient to meet the above challenges because they often address a single scale or modality of interest (e.g., molecular, cellular, preclinical) and lose their predictive power at other scales (e.g., clinical, i.e. the patient). The main methodological objective of AIstroSight is to develop quantitative systems biology and Artificial Intelligence (AI) approaches able to embrace several of these scales.
Our overall goal is to develop innovative numerical methods for neuropharmacology that will provide us with levers to accelerate and derisk the early stages of drug design. As a main deliverable and proof of concept of the efficiency of these methods, our ambition for the first four years of the project is to identify a handful (2 to 4) of new candidate drugs against neurological diseases.
To improve the probability of success of drug candidates in neurology, we integrate complementary information offered by data harvested at different spatio-temporal scales (fig. 2): from the inside of the cell (molecular and cellular biology) to the whole brain (imaging) and even to a population of patients (hospital data), using numerical tools coupling mechanistic models with dedicated AI approaches. In a way, our strategy is to break down the classical silos of Fig. 1, in which literature search, in vitro cell culture, in vivo preclinical studies and in vivo clinical studies are viewed as a sequential multi-stage process. Instead, we propose an integrated machine learning framework in which all these data are combined to predict the effect of a candidate drug molecule.
AIstroSight develops innovative numerical approaches to integrate these information sources into a coherent stream of data and expert knowledge, combining the analysis of experimental observations with reasoning (of different kinds). Currently, these tasks are carried out in isolation and their reconciliation is left to biologists and physicians. The originality of the AIstroSight contributions lies in approaches that automatically carry out this reconciliation to assist biologists and physicians.
Since AI algorithms are often black-box tools, we also develop mechanistic modeling approaches (multiscale quantitative systems biology/pharmacology) to produce explanations for the predictions of the AI algorithms that can be rooted in neurobiology. Another important aspect of AIstroSight is to widen the focus of neuropharmacology beyond neurons, which constitute only about half of the nerve cells in the brain, and also to take into account the other half, made up of glial cells and their interactions with neurons. In particular, we consider the pharmacology of astrocytes 67, a major subtype of glial cells, in interaction with the pharmacology of neurons.
To accelerate cross-fertilization between digital science and medical research, AIstroSight will be located on the East Hospital Campus of the Lyon University Hospital, the “Hospices Civils de Lyon” (HCL), from 2024. We will also benefit from our strong association with CERMEP, the preclinical and clinical in vivo imaging platform of the HCL. In 2024, the whole team is indeed expected to move to Lyon's neurology hospital, located just across the street from CERMEP.
CERMEP is also affiliated with University Claude Bernard Lyon 1, Inserm and CNRS. This provides us with an exceptional environment for the engineering of brain biochemical imaging methods that allow the study of the effect of molecules on the whole brain (fMRI, PET, fUS), and for the analysis methodologies for these images. CERMEP also hosts the BIORAN team of the “Centre de Recherche en Neurosciences de Lyon” (CRNL) laboratory, whose expertise ranges from the chemistry of candidate molecules to their biochemical assays, from radiolabelling to animal PET/MRI imaging, and from preclinical models to first-in-man studies in patients. Modeling expertise on the binding between candidate molecules and receptors (structural biology, docking) is also present at CERMEP.
As a joint team with the HCL, part of the technology developed by AIstroSight is intended to be integrated into the hospital information system developed during the last decade by the HCL for patient management. This is in particular the case of the development of “Multi-patient query for care pathway characterization and clinical trials”. Beyond participating in the HCL's mission as an innovation leader in digital health, AIstroSight also represents an opportunity for the HCL to reinforce its infrastructure for the organization of clinical trials, for instance in cooperation with pharma/biotech companies like Theranexus. Like other teams of Inria Lyon, AIstroSight is heavily involved in the “AI innovation department” (“Pôle de Développement IA”) that Inria Lyon and the HCL are supporting.
Finally, to ensure the impact of our work on pharmacology and provide it with potential industrial exit routes, the AIstroSight partnership also includes an industrial partner, Theranexus, a French biopharmaceutical company (an SME) that develops drugs for the treatment of nervous system diseases with an original focus on both neurons and astrocytes. Theranexus is listed on the Euronext Growth market in Paris and its registered headquarters are located in Lyon. As a biotech company, its expertise lies entirely in the experimental aspects, not in the digital ones. Theranexus brings to AIstroSight experimental data (experimental cell biology and brain imaging for pharmacology) and provides its expertise for the development of the digital tools needed to analyze these data. In return, the objective is that the output of these digital tools reveals novel drug targets or novel candidate molecules that Theranexus may decide to use to develop new treatments, starting with the necessary clinical trials. Importantly, the fact that these candidate drugs have been selected through an innovative numerical approach strongly consolidates the credibility of their development on the pharmaceutical market. In addition, Theranexus brings to AIstroSight its know-how and industrial expertise on the development of drug candidates up to the market and its strategic knowledge of the neuropharmacology industry. In this operating scheme, Theranexus is therefore the preferred partner for the early-phase transfer of the molecules that AIstroSight could identify.
Independently from AIstroSight, Theranexus and BIORAN have a longstanding collaboration, in particular in the framework of an ANR- and AURA Region-funded joint laboratory (LabCom) called “NeuroImaging for Drug Discovery” (NI2D), which aims at developing gliopharmacology using preclinical neuroimaging techniques (in animals, mainly fMRI, PET and fUS, i.e. functional ultrasound) for the evaluation of drug candidates. This LabCom is also hosted within the CERMEP premises. AIstroSight develops numerical methods capable of integrating data at multiple scales for pharmacology, data that include imaging but also molecular data (intracellular signaling, omics data) or clinical data (biology, treatments, medico-administrative). We thus benefit from the imaging data and methodologies of NI2D. The two projects are therefore complementary, especially since both share a strong interest in neuron/astrocyte interactions.
Drug screening, either in vitro or in silico, generally does not provide an explanation of the mechanism by which the identified drug acts at the cellular level. However, this information is crucial (e.g., with respect to patients or health agencies), and the algorithms used for the screening must be made explainable. Our goal here is to use mathematical models and their hybridization with machine learning to provide explanations of the mechanisms of action of a candidate molecule.
We develop mechanistic models of regulatory networks or intracellular signaling pathways specific to the action of the candidate drugs identified by the screening. These models predict the spatio-temporal evolution of the concentrations of the molecular species involved in the modelled pathways using classical reaction terms from biochemical kinetics and mass-action laws (first-order reactions, bi-molecular reactions, Michaelis-Menten, Hill kinetics…). Depending on the importance of intracellular spatial gradients and biochemical noise, space and stochasticity are accounted for, resulting in models based on reaction-diffusion equations, stochastic or ordinary differential equations, or other related formalisms (Gillespie algorithm, flux analysis). These models allow us to simulate, in time and/or across the space of the cell, the mechanisms that govern the dynamics of the implicated molecules and how these dynamics are altered by the selected drug. The aim is to use such mechanistic modeling to produce explanations for the predictions made by the statistical learning techniques that are used in the other sections. It is unlikely that these mechanistic models will by themselves allow us to decipher the totality of the molecular mechanisms involved, but they provide critical information to properly adjust the laboratory and clinical experiments.
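As an illustration of the kind of mechanistic model involved, the sketch below simulates a minimal phosphorylation cycle with Michaelis-Menten kinetics and represents a candidate drug as a reduction of a kinase's maximal rate. All species, parameter names and values are hypothetical, chosen only to show the modelling pattern:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Toy phosphorylation cycle: a kinase phosphorylates substrate S into Sp
# (Michaelis-Menten), a phosphatase dephosphorylates it. The candidate
# drug is modelled here as lowering the kinase Vmax; a competitive
# inhibitor would instead raise the apparent Km.
def cycle(t, y, vmax_kin, km_kin, vmax_pho, km_pho):
    s, sp = y
    phosphorylation = vmax_kin * s / (km_kin + s)
    dephosphorylation = vmax_pho * sp / (km_pho + sp)
    return [dephosphorylation - phosphorylation,
            phosphorylation - dephosphorylation]

def steady_sp(vmax_kin):
    # Integrate long enough to reach steady state; total mass = 1.
    sol = solve_ivp(cycle, (0.0, 500.0), [1.0, 0.0],
                    args=(vmax_kin, 0.5, 1.0, 0.5),
                    rtol=1e-8, atol=1e-10)
    return sol.y[1, -1]  # steady-state phosphorylated fraction

baseline = steady_sp(vmax_kin=1.0)   # no drug
inhibited = steady_sp(vmax_kin=0.4)  # drug lowers kinase activity
print(baseline, inhibited)
```

The same pattern, with a realistic reaction list and calibrated parameters, is what scales to the pathway models described above.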
To be efficient, this approach demands that we maintain an effective knowledge base and expertise on the fundamental molecular mechanisms at play at these spatial scales and on their modelling. To build and maintain this expertise, we rely on existing long-term collaborations between AIstroSight members and experimental neuroscientists (electrophysiologists) or neuropharmacologists on the intracellular signaling networks at play in neuron function or neuron-astrocyte interactions.
- Agonist bias in GPCR: G protein-coupled receptors (GPCR) are currently the largest family of molecular targets for potential new drugs 22. GPCR are cell-membrane receptors ubiquitously found in all mammalian cells, and in particular in brain cells (neurons and astrocytes), where they control a large repertoire of neuronal and astrocytic responses to a variety of external stimuli and molecules. Biased agonism refers to the observation that two ligands of the same GPCR can activate very different cell responses 47. This phenomenon is still not understood, but it is one promising path for new drug discovery, and its exploration has already been proposed, in particular by members of AIstroSight (B. Vidal, L. Zimmer) 68, 55. Our objective here is to build realistic mechanistic models of GPCR-based cell signaling in the neuronal intracellular space. We then plan to use these models to propose molecular mechanisms explaining the experimentally observed biases. A first idea to explore is the hypothesis of a local subcellular compartmentalization of the signaling molecules at and close to the cell membrane (so-called nanodomains). Experimental validation of the main model predictions is then to be performed using the brain imaging modalities available at CERMEP (PET, MRI, fUS).
- Synaptic plasticity: Synaptic plasticity, the long-term adaptation of the efficacy of a synapse according to the activity of the neurons and astrocytes composing this synapse, is thought to underlie learning and memory at the cellular scale 64. We have been enjoying a very fruitful collaboration with Laurent Venance's lab (INSERM U1050, CIRB, Collège de France, Paris) on the subcellular mechanisms at play in learning and memory formation by synaptic plasticity 32, 34, 35, 33, 41, 70, 45. Current work focuses on the control of synaptic plasticity mechanisms by endocannabinoids and its implication in fast learning, and on the metabolic regulation of synaptic plasticity by astrocytes. This collaboration is funded by the ongoing ANR project EngFlea (see below).
- Calcium signaling in astrocytes: Calcium signaling in the terminal branchlets of astrocytes is thought to be crucial for astrocytic functions and neuron-astrocyte interactions 24. We are studying the local dynamics of calcium signaling in terminal branchlets of astrocytes and their interaction with synaptic activity in collaboration with U. Valentin Nägerl's lab (CNRS UMR 5297, Bordeaux) for experimental (subcellular) data obtained with super-resolution microscopy 36. Recently, a collaboration with Erik De Schutter's lab (Okinawa Institute of Science and Technology, Japan) has also been set up to develop new efficient modelling tools (stochastic reaction-diffusion systems) in realistic 3D geometric meshes based on the simulation framework they develop, STEPS 23, 37.
- Multiscale modelling of the effects of a candidate drug on neurons and astrocytes: To model the cellular effect of a candidate drug, the main molecular systems impacted by the drug, which must therefore be accounted for in the mechanistic model, are isolated from the cellular signature data and literature exploration. Imaging data, by specifying the brain areas and structures mainly targeted by the candidate drug, help refine these models using specific parameters. Whereas in the first models the observation (and modelling) scale corresponds to a subcellular domain (one synapse, possibly with a dendrite or the main astrocytic process in the neighborhood), we seek to progressively scale up these mechanistic models from the intracellular scale of a single cell to the scale of a population of interacting brain cells, neurons and astrocytes. To do so, we explore model simplification/reduction methods, including those combining machine learning and dynamical systems modelling (see below). In the long run, this large-scale mathematical model will produce a digital twin of the pathology that will allow us to explain why the candidate drug has a positive effect on the disease. Calibration is based on fUS and fMRI imaging data in rodents obtained in the framework of the NI2D LabCom. These data provide us with quantitative measurements of the effects of microscopic perturbations by pharmacological agents or by external stimuli (e.g., visual) on the variation, correlation and spreading of cortical activity over the whole brain.
- Astrocyte roles in brain imaging signals: Although it is now widely accepted that astrocytes play a role in brain processes and pathologies, the exact perimeter of their roles remains to be delimited. For instance, variations of the signals measured by brain imaging methods (fMRI, PET, fUS) are still largely interpreted as variations of neuronal activity. Available experimental data however indicate that astrocytes also impact these signals, but it is not yet clear how. A precise and quantitative answer to this question would allow us to use brain imaging to monitor not only the local activity of the neurons but also that of the astrocytes. Such a feature would be invaluable in our framework of astrocyte pharmacology, but it demands the development of new mathematical models. Existing models of fMRI signals, for instance, are either too crude to incorporate a separate astrocyte action (balloon models 43) or are limited to the role of astrocytes as energy suppliers of the neurons (astrocyte-neuron lactate shuttle 49). Our objective here is to start from a microscopic and mechanistic model of neuron-astrocyte-blood vessel interactions and use multi-scale modelling methodologies to obtain a large-scale model of the astrocyte-neuron impact on a subset of brain imaging techniques (fMRI, fUS), with explicit parametrization of local neuronal and astrocytic activities. Here again, these models are calibrated using fUS and fMRI imaging data in rodents, in particular using pharmacological agents that are known to specifically silence the astrocytic population or a neuronal population in a given brain area. A crucial step is the development of a detailed, microscopic model of the astrocyte endfeet, the specialized astrocyte processes that respond to and control local vascular diameters. This model will provide us with causal mechanisms able to interlink neuronal electrical activity, astrocytic calcium activity and local blood flow. It is to be seen as a first stage towards understanding the implication of astrocytes in variations of neuroimaging signals.
Methodological challenges: The biological systems to be considered to explain drug effects on pathologies are not only very complex but also only partly understood by neurobiologists themselves. Therefore, the available biological knowledge on these systems is constantly evolving. Since we cannot know in advance which systems are affected by the candidate drug, a major difficulty for modelers is preparedness, i.e., maintaining a level of expertise on the biology and the modelling state of the art of a wide range of these systems. This is the reason why the first three projects above are crucial to the success of our proposal.
The most important challenge we face is that of multiscale modelling: causal data are mainly molecular, but many observations are macroscopic (e.g. brain imaging). Traditionally, linking these two scales requires the development of new theories (e.g., homogenization, population limits), a slow and rather uncertain process. The availability of ever larger computing resources makes it possible to consider "brute force" approaches in which all scales of time and space are numerically simulated (cf. the Blue Brain Project). But the results are often as difficult to interpret as the animal experiments that these simulations emulate. Instead, we consider recent advances in hybrid numerical-AI systems (physics-informed neural networks 60), in particular equation-discovery methodologies 59. These methods usually use sparse regression techniques to select, from a library of nonlinear terms and operators, those that, when composed, provide the best description of the data 28. Our idea is to generate a large number of numerical simulations at the microscopic scale of the kinetics of the biochemical reactions concerned, for example with the spatial Gillespie algorithm, and then to aggregate them at a higher spatio-temporal scale using, for example, averages at a space and time grain much coarser than the spatio-temporal resolution of the initial microscopic simulations. The idea is then to use equation-discovery algorithms to infer a set of differential equations (and associated parameters) capable of describing these higher-scale space-time kinetics. The resulting reduced model is then replicated in each cell of the cell population model. If successful, this model reduction process can even be reiterated at the next scale up to simulate the effect of the molecule on large brain areas. Of course, the risky and difficult nature of this objective makes it a long-term goal. If need be, alternative meta-modelling techniques will also be considered when applicable (RKHS, model-order reduction).
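The equation-discovery step can be illustrated by a minimal sketch of sequentially thresholded least squares, the core idea behind SINDy-type methods 59, 28. The "aggregated" trajectory here is synthetic logistic kinetics standing in for coarse-grained microscopic simulations; the library, threshold and data are all illustrative:

```python
import numpy as np

# Synthetic aggregated trajectory: logistic kinetics dx/dt = x - x^2
# (analytic solution with x0 = 0.1), standing in for the averaged
# output of microscopic Gillespie simulations.
t = np.linspace(0.0, 6.0, 600)
x = 0.1 * np.exp(t) / (1.0 - 0.1 + 0.1 * np.exp(t))
dxdt = x * (1.0 - x)  # in practice, estimated numerically from the data

# Library of candidate terms: 1, x, x^2, x^3.
library = np.column_stack([np.ones_like(x), x, x**2, x**3])
names = ["1", "x", "x^2", "x^3"]

# Sequentially thresholded least squares: fit, zero out small
# coefficients, refit on the surviving terms.
coef, *_ = np.linalg.lstsq(library, dxdt, rcond=None)
for _ in range(5):
    small = np.abs(coef) < 0.05
    coef[small] = 0.0
    active = ~small
    coef[active], *_ = np.linalg.lstsq(library[:, active], dxdt, rcond=None)

recovered = {n: round(c, 3) for n, c in zip(names, coef) if c != 0.0}
print(recovered)
```

On this clean synthetic example the procedure recovers the two active terms x and x^2; on real aggregated simulations, derivative estimation and noise handling become the hard part.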
Real-world data, especially the data routinely collected by hospitals (medical reports, hospital records…), provide rich information about possible links between patient information (demographic, pathological, lifestyle), drug exposures and health events. In the context of drug development, these data can be useful at three stages. During the search for a new drug, they can be used to enrich cell culture data or imaging data. In this case, one can query patients who have been treated for the pathology in question and integrate their clinical data in the in silico screening. This approach is presented below in the framework of data integration.
Electronic Health Records query algorithms:
Efficient patient query can also be used at the very initial stage of drug discovery: the assessment of the feasibility of drug development projects. Indeed, part of the pathologies we target are rare diseases. In this context, one has to make sure at the very early stages that the pathology in question is not so rare that the number of patients is too low to allow clinical trials, and that its description in terms of physiopathology is mature enough for the clinician to be able to diagnose it with good probability. We thus develop patient query algorithms on clinical data from hospitals (electronic health records, EHR), in particular of the HCL, that allow us to characterize the care pathways of the patients before and after diagnosis. These algorithms provide us with answers to many questions related to the clinical picture of the pathology: its genetic underpinnings, its prevalence rate, the typical care pathway of a patient with this pathology, the diagnostic delay, the frequency of diagnostic errors, etc. Answers to these questions are crucial to determine early on whether the drug discovery project is feasible. We aim at developing query algorithms and software pipelines for EHR that can provide us with tools able to answer these questions efficiently.
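To fix ideas, here is a minimal sketch of such a care-pathway query on a hypothetical flat event table. The column names and ICD-10 codes are purely illustrative (not the HCL schema); the query computes the diagnostic delay per patient:

```python
import pandas as pd

# Hypothetical flat EHR event table: one row per coded event.
events = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "date": pd.to_datetime(["2018-03-01", "2018-09-15", "2019-02-10",
                            "2020-01-05", "2020-03-20"]),
    "code": ["R27.0", "G11.1", "E75.2", "R27.0", "E75.2"],  # ICD-10
})

TARGET = "E75.2"  # e.g., a sphingolipidosis diagnosis code

# First recorded contact and first target diagnosis per patient.
first_contact = events.groupby("patient_id")["date"].min()
diagnosed = events[events["code"] == TARGET]
first_diagnosis = diagnosed.groupby("patient_id")["date"].min()

# Diagnostic delay (days) for patients who reached the target diagnosis.
delay = (first_diagnosis - first_contact).dt.days.rename("delay_days")
print(delay)
```

A production pipeline additionally has to reconcile structured codes with unstructured reports, handle multi-site data, and enforce privacy constraints, which is precisely where the workflows discussed below come in.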
Efficient EHR query algorithms are also very useful at the final stage, the clinical trial itself (Fig. 2), where they can be used to finely select which patients should be included in the trial. Indeed, a major paradigm change in medicine in recent years is the recognition that the response of a group of patients to a drug treatment exhibits strong variability. The sources of this variability are diverse 69. The definition of the pathology itself as a unique coherent class can be misleading and actually cover a range of different sub-classes of pathologies/disorders. The response to a drug also depends on how the patient's body affects the drug before it reaches its target organ/cells (pharmacokinetics). At the cellular level, the response can also vary because of inter-individual differences in gene sequence and receptor/protein structure (pharmacodynamics). Therefore, individual drug responses depend on the patient's genes (pharmacogenetics) but also on more social factors (age, sex, medical history, lifestyle, habits, exposure to pollution…). In any case, the strength of this variability is believed to be a major cause of failure for clinical trials, in particular in neurology and psychiatry 56, 50. The goal of “stratified medicine” in this perspective is to subdivide the available group of patients into a number of subgroups so that the response of each subgroup is less variable than that of the whole 38. Our objective is to develop computational tools and software packages able to stratify hospital data to assist in the selection of patients to be included in an evaluation protocol for a clinical trial or in the building of a research cohort.
Computational phenotyping:
The task of querying patients according to a predefined criterion from a large population of EHRs is sometimes referred to as “computational phenotyping” 61. It remains a time-consuming and challenging task for complex criteria, because the query must be addressed within multiple document types and across multiple data points, in EHRs that usually comprise both structured and unstructured data. The computational challenges raised by patient query with complex criteria are therefore considerable (integration, query, analysis, privacy). Software tools (i2b2, ACE 29) have been proposed to query patients for cohorts or clinical trials based on EHRs, but they can hardly be used by most physicians because they require advanced knowledge of the data in computer-science terms (format, encoding). Moreover, our objective is to provide clinicians with tools able to manipulate these complex data together with medical concepts (e.g., exposure to a drug, treatment, or occurrence of a pathology). Data abstraction capabilities must therefore be integrated to automatically enrich the data using phenotype libraries that can be intuitively mobilized by the clinician. In analogy with bioinformatics workflows, we create workflows for computational phenotyping.
In cases where we already know how to stratify, the issue is not a learning problem but rather a query problem. When this is not the case, we have to develop methods to build these homogeneous subgroups, and the problem becomes one of (unsupervised) learning: the training criterion becomes a measure of cluster homogeneity. Two competing approaches can be considered to create the building blocks of the workflow: 1) machine learning approaches that allow the construction of abstract patient phenotypes from massive data; 2) approaches inspired by both timed systems modeling and knowledge reasoning that rely on formal descriptions of computational phenotypes to enrich the data. The interest of formal descriptions is the ability to represent the whole data transformation formally. This abstract representation of the construction of a cohort facilitates its understanding by users and its reproducibility (FAIR principles). Formal descriptions also make it possible to exploit the formalized knowledge of the domain in depth, and they become objects that can be manipulated by reasoning tools. Semantic web technologies can therefore be an interesting way to represent data, knowledge and their processing, in order to propose query tools that guide the clinician through the knowledge.
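A minimal sketch of the unsupervised variant follows, on synthetic patient features (the feature set, values and planted subgroups are made up for illustration): strata are built with k-means and the number of subgroups is chosen by a cluster-homogeneity score, here the silhouette:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic patient features (age, a severity score, a lab value):
# two latent subgroups with different profiles, standing in for
# abstract phenotypes derived from EHR data.
group_a = rng.normal([35.0, 2.0, 1.0], 0.5, size=(50, 3))
group_b = rng.normal([60.0, 6.0, 3.0], 0.5, size=(50, 3))
features = StandardScaler().fit_transform(np.vstack([group_a, group_b]))

# Choose the number of strata by cluster homogeneity (silhouette).
scores = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    scores[k] = silhouette_score(features, labels)

best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On real EHR-derived features, the hard part is upstream of this step: building features that are clinically meaningful, which is exactly what the formal phenotype descriptions aim to support.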
Methodological challenge:
The challenge is to make these formal descriptions highly expressive while ensuring efficient processing of massive data. In the long run, we plan to take inspiration from the approach called “Ontology-Mediated Query Answering”, which consists in using ontologies to mediate queries over a database 27. In this context, a computational phenotype is seen as a query. The difficulty encountered with observational data is the semantic gap between the available data and the medical concepts that are interesting to manipulate. This gap may be bridged by automatic reasoning that exploits expert knowledge to relate different abstraction levels.
Since computational phenotypes are difficult to formalize, the challenge is to support clinicians in defining them. In other words, the challenge becomes to abstract phenotypes from clinical data. We plan to combine automatic reasoning methods and data analysis. The first research direction we propose is the exploration of a symbolic approach parallel to the work by Tijl de Bie 26 or by Silberschatz 63 on the notion of "subjective measure of interestingness". This approach was developed to identify user-relevant statistical analysis results by means of a statistical model that evaluates the novelty of the extracted patterns (a priori knowledge model). Symbolic approaches can be combined in a similar way by using symbolic data analysis methods such as pattern mining, and by relying on formal models of the system as a priori knowledge. Patterns that are not "explainable" by the formal model are potentially new or of particular interest to the user and will thus be extracted. This approach offers an original entry point to deeply integrate knowledge-based reasoning into pattern extraction methods. The research challenge here lies in combining formalized knowledge with experimental data. It may be implemented using the declarative pattern mining paradigm, which uses solvers to address the pattern mining task. Proofs of concept on the notion of novelty will open the way to more complex reasoning, such as planning, that can be used to integrate complex behaviors in biological systems, such as interaction networks. The second research direction we propose is based on recent machine learning techniques. Unsupervised ML has been applied to patient phenotyping, i.e. the discovery of phenotypes from EHR data, including temporal phenotyping 71. Our objective is to combine such algorithms with semantic knowledge to guide the discovery toward meaningful computational phenotypes. Indeed, data embedding techniques can integrate ontologies to enhance data semantics 42.
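The "subjective interestingness" idea can be sketched in a few lines: mine frequent event pairs from toy care pathways and keep only those that occur markedly more often than an a priori model (here, independence of events) predicts. All codes, counts and thresholds below are illustrative:

```python
from itertools import combinations
from collections import Counter

# Toy care pathways: one event set per patient stay (codes are made up).
pathways = [
    {"MRI", "antiepileptic", "EEG"},
    {"MRI", "antiepileptic", "genetics"},
    {"MRI", "EEG"},
    {"physio", "genetics"},
    {"MRI", "antiepileptic", "EEG", "genetics"},
]
n = len(pathways)

single = Counter(e for p in pathways for e in p)
pairs = Counter(frozenset(c) for p in pathways for c in combinations(sorted(p), 2))

# A priori model: events co-occur independently. Patterns occurring far
# more often than the model predicts are "subjectively interesting".
MIN_SUPPORT, MIN_LIFT = 2, 1.2
interesting = {}
for pair, count in pairs.items():
    a, b = tuple(pair)
    expected = single[a] * single[b] / n
    if count >= MIN_SUPPORT and count / expected >= MIN_LIFT:
        interesting[tuple(sorted(pair))] = round(count / expected, 2)

print(interesting)
```

The approach described in the text replaces the independence model with a formal model of the domain, so that only patterns the formalized knowledge cannot explain are surfaced.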
In the long run, the methods developed above may be unified to address the problem of drug discovery at both the biological and the body scale. This justifies the coherence of the methodological approaches (Semantic Web and machine learning) developed in the two objectives.
The list of pathologies that are of interest for Theranexus in the framework of AIstroSight is given in the stand-alone “convention d'équipe-projet commune” of AIstroSight. It comprises roughly 30 rare diseases of the central nervous system, including lysosomal pathologies, neurological genetic diseases, rare diseases due to
Many of the diseases that are of interest for AIstroSight are rare diseases. This means that the volume of experimental data and the basic understanding of the pathology at the (sub-)cellular level may be too limited for the machine learning or mechanistic modeling tools that we plan to use. For example, it is known that the NPC mutation in Niemann-Pick type C induces morbid cholesterol accumulation in cells but the molecular function of NPC in cholesterol metabolism is not clearly understood 58. Similarly, MeCP2, the gene mutated in Rett syndrome, is an epigenetic regulatory factor (DNA methylation) whose mutation theoretically impacts the expression of a large number of genes but it is not clear which ones are most involved in the symptoms of the disease 48. Although molecular (omic) studies have been published for both diseases 31, 62, their molecular contexts are still unclear.
Our goal here is to generate additional preclinical molecular and imaging data to better delineate the perturbations that these diseases cause at the cellular and tissue levels. We introduce into cultured cells the same deficits as those observed in patients. Transcriptomic analysis of the effect of this manipulation gives us information on the implicated molecular networks and on the major molecular consequences of the perturbation. In parallel, we induce these same perturbations in vivo in rodents. Observing these animals using brain imaging techniques (fMRI and fUS, possibly PET) gives us a more macroscopic view of the effect of the mutation (affected brain areas, nature and amplitude of the modifications, changes in response to treatments or stimuli, etc.; see below).
Methodological challenges: Developing experimental models of pathologies can be a very difficult task for pathologies that are due to the conjunction of multiple factors, when the molecular alterations at the origin of the pathologies have effects over a very large range of cellular processes, or when the comparison of the phenotype of the experimental model with its human counterpart is ill-defined (psychiatric diseases, for instance). To mitigate this risk, we develop experimental models only for pathologies that are well-defined in molecular terms, such as Rett syndrome or Niemann-Pick type C to start with. We use viral vector strategies (mostly shRNA-mediated gene silencing or possibly CRISPR-based gene editing via adeno-associated viruses, AAV) to manipulate the sequence or expression of the target gene. We start with cell lines that are easy to grow and analyze using omics approaches, and then use neurons and astrocytes differentiated from human pluripotent stem cells. This approach is also used in vivo by locally injecting the viral vector into a given brain region of an animal model, to genetically modify a particular cell type by using a specific promoter. We should therefore be able to control the area of the brain in which the genetic manipulation will be induced (e.g. visual cortex or cerebellum) as well as the type of cells targeted (neurons vs. astrocytes, for example). Of course, like all experimental models, each model taken separately has its limitations: the genes expressed by cells in culture are not necessarily those expressed by these same cells in vivo, the effects of gene silencing in a rodent are not necessarily transposable to humans, etc. However, our hypothesis is that by combining these different modalities and scales of data (see above), it should be possible to better predict the effect of a potential treatment.
The molecular and cellular biology technologies to be mobilized here (in vitro and in vivo mutagenesis, cell culture, proteomics) are tools routinely used by Theranexus. The expertise on the use of medical imaging to observe the effects at the brain level is provided by CERMEP and benefits from the advances of the NI2D LabCom.
Recently, Theranexus changed its pharmacological strategy, moving from an approach mainly based on the repositioning of pre-existing drugs to a technology based on antisense oligonucleotide drugs. These technologies rely on the ability to design on demand short RNA sequences that specifically bind the mRNA of a gene target, and either knock it down after recognition by the RNase H1 enzyme present in all cells, or modulate its translation or splicing via steric hindrance. Pharmacological intervention thus consists of identifying a gene target whose modulation would correct the molecular perturbation caused by the disease, and synthesizing an antisense oligonucleotide able to specifically bind this target gene. Note that the technology currently in clinical use does not (yet) provide ways to specifically target a cell type or a brain region.
Our first objective is to develop digital tools to model the molecular networks perturbed by the pathology of interest, and to use these models to identify a gene or protein in the network whose modulation would correct the perturbation caused by the pathology. These models are based on molecular data, in particular transcriptomics and metabolomics data. The data set includes data derived from cell cultures as described above, which we augment with molecular data from the literature related to the pathology, or with more generic public, open-access databases of transcriptomic responses to perturbing molecules, such as CMap or the LINCS L1000 data repository. The latter, for instance, currently includes the effect of close to 40,000 small perturbing molecules on 12,000+ genes of more than 200 cell types. We aggregate these data and use them to infer the gene interaction network, the metabolic network and/or the signaling network impacted by the pathology. Metabolic networks are important, for instance, for Niemann-Pick type C, to reconcile perturbations of lipid metabolism with those of the gene expression network. This provides us with a view of the pathology at the molecular scale.
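To illustrate the first step of this pipeline, the sketch below infers a tiny co-expression network from a transcriptomic matrix by connecting genes whose expression profiles correlate strongly across perturbation conditions. This is only a minimal illustration of the principle: the gene names, expression values and correlation threshold are hypothetical, and real network inference would use more robust methods and much larger LINCS-scale data.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def infer_network(expression, threshold=0.8):
    """Connect two genes when their profiles correlate strongly (|r| >= threshold)."""
    edges = {}
    for (g1, p1), (g2, p2) in combinations(expression.items(), 2):
        r = pearson(p1, p2)
        if abs(r) >= threshold:
            edges[(g1, g2)] = r
    return edges


# Toy expression profiles across 5 perturbation conditions (illustrative values).
expression = {
    "NPC1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "SREBF2": [2.1, 4.0, 5.9, 8.2, 10.1],  # tracks NPC1 -> an edge is expected
    "GAPDH": [3.0, 1.0, 4.0, 1.0, 5.0],    # uncorrelated housekeeping gene
}
edges = infer_network(expression)
```

The resulting edge set would then be combined with curated pathway knowledge before any target search.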
Integration of neuroimaging data:
A major objective of AIstroSight is to augment these molecular data with medical data, in particular brain imaging data and hospital data. We complement molecular data with data coming from the analysis of brain imaging (fMRI, PET, functional ultrasound brain imaging), i.e., with functional networks between the brain areas targeted by the molecule or quantitative measures of radioligand binding. Most of this imaging is done in rodents (preclinical, see above), but a subset of human imaging data is also used. These imaging data are obtained by our collaborators from the CERMEP platform.
These different neuroimaging methods provide meaningful and complementary information for understanding the functional or molecular effects of drugs in the brain.
Integration of clinical data:
We also plan to integrate hospital data from the Hospices Civils de Lyon, according to availability and pathologies. Hospital data provide access to rich information on possible links between patient information (demographic, pathological), drug exposures, health events and biological sample analyses (e.g., blood markers). Our goal is to integrate brain imaging and hospital data with cellular signatures to enrich them with information at the individual scale, in a form that can be analyzed with machine learning (clustering, classification) or data mining (pattern matching) methods.
Methodological challenges:
A first challenge resides in the mode of action of antisense oligonucleotides, which typically act by knock-down/loss of function. It is not straightforward to design such a strategy for a pathology caused by a mutation that has already suppressed the function of a gene. That is precisely where numerical models of the involved gene expression and metabolic networks are important: they can be systematically assessed for the effect of gene suppression, thus providing a quick in silico screening of the potential targets. However, part of this program implies typical bioinformatics processing steps: analysis of transcriptomic networks, network reconstruction, reconciliation between transcriptomic and metabolic networks… We currently do not have this expertise in the team. We therefore leverage collaborations with local experts in the field to acquire the necessary operational knowledge, including experts in brain transcriptomics analysis (MeLiS lab in Lyon).
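The in silico screening idea can be sketched on a toy Boolean network: each candidate gene is forced off (mimicking an antisense knock-down) and the steady state is inspected to see whether a disease-silenced readout gene is restored. The three-gene network, its rules and the disease state below are entirely hypothetical and serve only to show the screening loop.

```python
def step(state, rules):
    """One synchronous update of a Boolean gene network."""
    return {g: rule(state) for g, rule in rules.items()}


def steady_state(state, rules, max_iter=50):
    """Iterate updates until a fixed point is reached (or give up)."""
    for _ in range(max_iter):
        nxt = step(state, rules)
        if nxt == state:
            return state
        state = nxt
    return state


# Illustrative 3-gene network: A activates B, B represses C.
rules = {
    "A": lambda s: s["A"],      # external input, held constant
    "B": lambda s: s["A"],      # A activates B
    "C": lambda s: not s["B"],  # B represses C
}


def knockout_screen(disease_state, rules, target_gene, readout="C"):
    """Force one gene off (mimicking an antisense knock-down) and
    report the readout gene in the resulting steady state."""
    ko_rules = dict(rules)
    ko_rules[target_gene] = lambda s: False
    return steady_state(dict(disease_state), ko_rules)[readout]


# Disease: A aberrantly ON, so C is silenced. Screening asks which
# knock-down restores C.
disease = {"A": True, "B": True, "C": False}
results = {g: knockout_screen(disease, rules, g) for g in rules}
```

Here the screen flags B (and its upstream activator A) as targets whose suppression restores C, while knocking down C itself obviously cannot.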
Another difficulty lies in the heterogeneity of these multiscale data, their highly categorical character, the large dimension of the corresponding variable space and, often, the small number of observations. Moreover, cellular signature data are intrinsically very noisy and can have low reproducibility 52, a caveat that feature selection may mitigate, at least in part 39. Class imbalance can also be strong. Finally, each type of available observation (molecular networks, imaging, hospital) gives a partial, fragmented and incomplete view of an abstract complex biological system. The view is partial because each type of observation provides data at a given spatio-temporal scale, for a certain locus. It is fragmented because the data will be collected from different patients, and even from very different living systems (cell cultures, animals, patients): each patient contributes to the description of the abstract system through only a few types of observations. It is incomplete because many gaps will have to be bridged between the different kinds of information related to the functioning of the studied biological system.
To reach our objective, we explore the use of Semantic Web (SW) formalisms, which attract a lot of interest in bioinformatics, to formalize knowledge and data. Data are observations of biological systems acquired within controlled experiments or in real life. Formalized knowledge is a representation of facts and rules acquired in a scientific domain, here medicine or the life sciences. Applying machine learning techniques to data supports knowledge discovery, but it is only one particular source of knowledge. The methodological challenge is first to formalize the different types of available data within an abstract model of the biological system, and to integrate into the model formalized knowledge coming from the medical literature and our medical expertise, including imaging or hospital data. By gathering a wide range of formalized data and knowledge within the same tool, we aim to create a kind of abstract numerical twin that can be queried to infer new knowledge to assist drug design or drug repositioning.
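The core idea of this formalization can be sketched without any Semantic Web machinery: every observation, whatever its modality, is expressed as subject-predicate-object triples, and queries are pattern matches over the triple set. The entities and predicates below are invented for illustration; a production system would rely on a real RDF store and SPARQL engine (e.g., via rdflib) rather than this toy stand-in.

```python
# Heterogeneous observations (diagnosis, imaging, cell model) as triples.
triples = {
    ("patient:P1", "hasDiagnosis", "disease:NPC"),
    ("patient:P1", "hasImagingFinding", "region:cerebellum_atrophy"),
    ("disease:NPC", "perturbs", "pathway:cholesterol_metabolism"),
    ("cellmodel:NPC1_shRNA", "models", "disease:NPC"),
    ("cellmodel:NPC1_shRNA", "showsUpregulation", "gene:SREBF2"),
}


def query(triples, s=None, p=None, o=None):
    """Match triples against a pattern; None acts as a wildcard
    (a tiny stand-in for a SPARQL triple pattern)."""
    return [(ts, tp, to) for ts, tp, to in triples
            if (s is None or ts == s)
            and (p is None or tp == p)
            and (o is None or to == o)]


# Which experimental models exist for the diagnosis of patient P1?
diagnosis = query(triples, s="patient:P1", p="hasDiagnosis")[0][2]
models = [s for s, _, _ in query(triples, p="models", o=diagnosis)]
```

Chaining such patterns is what lets the numerical twin bridge patient-level and cell-model-level facts in a single query.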
In the longer run, the second challenge is to develop query answering at the abstract level based on fragmented data. The objective is to answer queries about the numerical twin by exploiting the data coming from multiple patients. One of the difficulties is to detect groups of patients whose numerical twins are “similar to each other” (in a sense that remains to be defined). The Semantic Web offers a natural framework for querying formalized data with multiple facets, but may be limited by the time-efficiency of query engines over large numbers of patients. In such a context, numerical approaches (embeddings) are more time-efficient but may lack accuracy. The challenge is to construct numerical representations that embed the data in a space in which distances are both efficient to compute and semantically consistent with the applied notion of “similarity”. Numerical machine learning techniques are an interesting avenue to address this challenge 54. Recent research on advanced machine learning, such as representation learning, offers new perspectives on this problem. Our objective is to initiate collaborations with teams having strong backgrounds in machine learning (e.g., the Ockham Inria team) to propose innovative solutions. Another important point is the need for logic programming methodologies able to express complex queries, especially on heterogeneous or multimodal data. For neuroimaging, the availability of Neurolang for logic programming with heterogeneous data or NeuroQuery for query-result consolidation based on automatic literature meta-analysis should be very useful.
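The embedding alternative can be sketched in a few lines: each patient's formalized data is mapped to a vector, and "similar numerical twins" become neighbours under cosine similarity. The embeddings and the similarity threshold below are purely illustrative; how the coordinates are learned from the triples (e.g., by representation learning) is precisely the open question.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two patient embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


# Toy embeddings: coordinates might encode diagnosis codes, imaging
# findings and lab values (hypothetical vectors).
embeddings = {
    "P1": [1.0, 0.9, 0.0],
    "P2": [0.9, 1.0, 0.1],  # close to P1 -> similar "numerical twins"
    "P3": [0.0, 0.1, 1.0],  # distinct clinical profile
}


def neighbours(patient, embeddings, threshold=0.9):
    """Patients whose embedding is close (cosine >= threshold) to the query."""
    q = embeddings[patient]
    return [p for p, v in embeddings.items()
            if p != patient and cosine(q, v) >= threshold]
```

The trade-off discussed above is visible even here: the distance is cheap to compute, but its semantic validity depends entirely on how faithfully the embedding reflects the formalized data.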
In most cases, the methodologies that we use to reach the above objectives relate to knowledge management/mining, formal reasoning, data mining or learning. Machine learning or deep learning approaches are probably less useful here, mainly because of the volume of available data. For rare diseases such as Niemann-Pick type C, the low prevalence means that only 5 to 10 new patients are diagnosed in France each year, a number far too low for deep neural networks. However, advances in transfer learning might be helpful here. For instance, a large number of brain pathologies come with dysfunction of intracellular cholesterol metabolism and storage. This is the case of multiple sclerosis 53, for which large cohorts and databases are available worldwide. As a long-term project, an interesting idea would be to leverage the large volume of data on multiple sclerosis to identify biomarkers of cholesterol dysfunction, e.g., in neuroimaging, and use transfer learning to adapt the resulting network to Niemann-Pick type C patients.
Hydronaut is a Python framework for machine and deep learning that makes it easy to use Hydra for hyperparameter configuration and MLflow for experiment tracking and result distribution. The user only needs to create a single YAML configuration file and a subclass of Hydronaut.Experiment to use the framework.
Hydra allows the user to systematically sweep all hyperparameter combinations or to optimize them using different strategies via plugins for libraries such as Optuna.
MLflow provides a web interface, command-line interface and Python API for exploring and sharing the results.
The framework is fully compatible with PyTorch Lightning and provides a custom subclass to facilitate its use.
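To make the workflow concrete, a configuration file along these lines could drive a hyperparameter sweep. The keys shown are illustrative, not Hydronaut's actual schema; only the `hydra.sweeper.params` block follows Hydra's own sweep syntax.

```yaml
# Illustrative Hydra-style configuration (parameter names are hypothetical).
experiment:
  name: transcriptomics_classifier
model:
  learning_rate: 1e-3
  hidden_units: 128
hydra:
  sweeper:
    params:
      model.learning_rate: 1e-4,1e-3,1e-2  # values swept combinatorially
      model.hidden_units: 64,128,256
```

Each of the nine resulting runs would then appear as a tracked experiment in the MLflow interface.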
We have developed two software tools to meet the internal needs of the team regarding in silico drug discovery. Both have been published online under an open-source licence (MIT) for the wider scientific community and are available on Software Heritage. Both were presented at the seminar for digital health engineers at Inria Rennes in 2023.
Two tools have been developed for the management of machine- or deep-learning models, including the exploration of hyperparameters. Here again, both have been published online under an open-source licence (MIT) for the wider scientific community and are available on Software Heritage. Both were presented at the seminar for digital health engineers at Inria Rennes in 2022.
Tensor decomposition has recently been gaining attention in the machine learning community for the analysis of individual traces, such as Electronic Health Records (EHR). However, this task becomes significantly more difficult when the data follows complex temporal patterns.
Our paper 10 introduces the notion of a temporal phenotype as an arrangement of features over time and proposes SWoTTeD (Sliding Window for Temporal Tensor Decomposition), a novel method to discover hidden temporal patterns. SWoTTeD integrates several constraints and regularizations to enhance the interpretability of the extracted phenotypes. We validate our proposal using both synthetic and real-world datasets, and we present an original use case using data from the Greater Paris University Hospital. The results show that SWoTTeD achieves a reconstruction at least as accurate as recent state-of-the-art tensor decomposition models, and extracts temporal phenotypes that are meaningful for clinicians.
The implementation of SWoTTeD has been released (see Section 5). This work has been submitted to a journal (see pre-print arxiv.org/abs/2310.01201) and is now under revision.
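The sliding-window principle underlying such a decomposition can be illustrated with the reconstruction step: a temporal phenotype is a small features-by-duration matrix, and the patient's record is rebuilt by placing a weighted copy of it at each possible start time. This is only a sketch of the idea, not the SWoTTeD implementation; the phenotype, feature labels and weights are invented.

```python
def reconstruct(phenotype, weights, horizon):
    """Rebuild a (features x time) patient matrix from one temporal phenotype:
    a weighted copy of the phenotype starts at each time step (sliding window)."""
    n_feat, duration = len(phenotype), len(phenotype[0])
    X = [[0.0] * horizon for _ in range(n_feat)]
    for start, w in enumerate(weights):  # when and how strongly the phenotype occurs
        for f in range(n_feat):
            for dt in range(duration):
                if start + dt < horizon:
                    X[f][start + dt] += w * phenotype[f][dt]
    return X


# Hypothetical phenotype: drug exposure on day 0, lab anomaly on day 1.
phenotype = [[1.0, 0.0],   # feature "drug A"
             [0.0, 1.0]]   # feature "abnormal marker"
weights = [1.0, 0.0, 1.0, 0.0, 0.0]  # the phenotype occurs at t=0 and t=2
X = reconstruct(phenotype, weights, horizon=5)
```

Decomposition runs this forward model in reverse: it optimizes the phenotypes and weights so that the reconstruction matches the observed records.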
This research line is intended as an introduction to the versatile model of chronicles for temporal data. Chronicles have been studied in the context of two analysis problems for temporal sequences: recognizing situations in temporal sequences and abstracting a set of temporal sequences. The first challenge benefits from the simple but expressive formalism to specify temporal behavior to match in a temporal sequence. The second challenge aims to abstract a collection of sequences by chronicles with the objective to extract characteristic behaviors.
Chronicles are closely related to temporal constraint networks. Not only do they share a similar graphical representation, they also have in common a notion of constraints in the timed succession of events. However, chronicles are definitely oriented towards fairly specific tasks in handling temporal data, by making explicit certain aspects of temporal data such as repetitions of an event.
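The recognition task above can be sketched concretely: a chronicle is a set of event types plus interval constraints on the delays between them, and recognition searches for one occurrence of each event type satisfying every constraint. The events and the "fever then rash" chronicle below are invented for illustration.

```python
def matches(chronicle, sequence):
    """Does a chronicle occur in a timestamped sequence?
    chronicle: (event_types, constraints), where constraints maps an
    ordered pair of positions to an allowed delay interval (lo, hi)."""
    events, constraints = chronicle
    # candidate timestamps for each event type of the chronicle
    candidates = [[t for e, t in sequence if e == ev] for ev in events]

    def assign(i, times):
        # try every assignment of one occurrence per event type
        if i == len(events):
            return all(lo <= times[b] - times[a] <= hi
                       for (a, b), (lo, hi) in constraints.items())
        return any(assign(i + 1, times + [t]) for t in candidates[i])

    return assign(0, [])


# Chronicle: "fever followed by rash, 1 to 3 days later" (illustrative).
chron = (["fever", "rash"], {(0, 1): (1, 3)})
seq = [("fever", 0), ("cough", 1), ("rash", 2)]
```

The same matcher, run over a collection of patient sequences, is the building block for the abstraction task: counting in how many sequences each candidate chronicle occurs.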
We published a book 12 that first proposes a formal account of chronicles. It then exhibits an original lattice structure on the space of chronicles and proposes a new approach for counting multiple occurrences of a chronicle. The book also proposes a new approach for frequent temporal pattern mining using pattern structures. This latter proposal has been extended in 2 to address the problem of the interpretability of chronicle mining.
Counterfactual explanations have become a mainstay of the explainable AI field. This particularly intuitive form of explanation allows the user to understand what small but necessary changes would have to be made to a given situation in order to change a model's prediction. The quality of a counterfactual depends on several criteria: realism, actionability, validity, robustness, etc. In the paper published at ECML 2023 5, we are interested in the notion of robustness of a counterfactual.
More precisely, we focus on robustness to counterfactual input changes. This form of robustness is particularly challenging as it involves a trade-off between the robustness of the counterfactual and the proximity with the example to explain. We propose a new framework, CROCO, that generates robust counterfactuals while managing effectively this trade-off, and guarantees the user a minimal robustness. An empirical evaluation on tabular datasets confirms the relevance and effectiveness of our approach.
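The trade-off between proximity and robustness can be illustrated on a linear classifier: the closest counterfactual sits exactly on the decision boundary, so adding a margin pushes it slightly further from the original example but keeps it valid under small input changes. This toy model and margin are illustrative only and are not the CROCO algorithm.

```python
def predict(x, w, b):
    """Linear classifier: positive class when w.x + b > 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b > 0


def counterfactual(x, w, b, margin=0.5):
    """Smallest translation along w that flips the prediction, plus an extra
    margin: the new score equals `margin`, not 0, so the counterfactual
    stays valid under small perturbations (robustness/proximity trade-off)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    norm2 = sum(wi * wi for wi in w)
    step = (-score + margin) / norm2  # cross the boundary, then margin further
    return [xi + step * wi for wi, xi in zip(w, x)]


w, b = [1.0, 2.0], -4.0
x = [1.0, 1.0]                   # score = -1 -> negative class
x_cf = counterfactual(x, w, b)   # minimally modified, robustly positive
```

Increasing `margin` makes the explanation more robust but moves `x_cf` further from `x`, which is exactly the tension the paper manages.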
In addition, 6, 9 present an interactive visualization tool that exhibits counterfactual explanations to explain model decisions. Each individual sample is assessed to identify the set of changes needed to flip the output of the model. These explanations aim to provide end-users with personalized actionable insights with which to understand automated decisions. An interactive method is also provided so that users can explore various solutions. The functionality of the tool is demonstrated by its application to a customer retention dataset. The tool is compatible with any counterfactual explanation generator and decision model for tabular data.
Many transient processes in cells arise from the binding of cytosolic proteins to membranes. Quantifying this membrane binding and its associated diffusion in the living cell is therefore of primary importance. Dynamic photonic microscopies, e.g., single/multiple particle tracking, fluorescence recovery after photobleaching, and fluorescence correlation spectroscopy (FCS), enable non-invasive measurement of molecular mobility in living cells and their plasma membranes. However, FCS with a single beam waist is of limited applicability to complex, non-Brownian, motions. Recently, the development of FCS diffusion law methods has given access to the characterization of these complex motions, although none of them is applicable to the membrane-binding case at the moment. In 3, we combined computer simulations and FCS experiments to propose an FCS diffusion law for membrane binding. First, we generated computer simulations of spot-variation FCS (svFCS) measurements for a membrane binding process combined with 2D and 3D diffusion at the membrane and in the bulk/cytosol, respectively. Then, using these simulations as a learning set, we derived an empirical diffusion law with three free parameters: the apparent binding constant KD, the diffusion coefficient on the membrane D2D, and the diffusion coefficient in the cytosol, D3D. Finally, we monitored, using svFCS, the dynamics of retroviral Gag proteins and associated mutants during their binding to supported lipid bilayers of different lipid compositions or at the plasma membranes of living cells, and we quantified KD and D2D in these conditions using our empirical diffusion law. Based on these experiments and numerical simulations, we conclude that this new approach enables correct estimation of the membrane partitioning and membrane diffusion properties (KD and D2D) of peripheral membrane molecules.
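The simulated process can be caricatured in a few lines: a single particle alternates between slow 2D diffusion while membrane-bound and fast 3D diffusion in the cytosol, with binding and unbinding probabilities controlling the occupancy. All parameter values below are arbitrary; the published simulations are far more detailed and are fitted against actual svFCS measurements.

```python
import random


def simulate(n_steps, p_bind=0.3, p_unbind=0.1, d2d=0.05, d3d=0.5, seed=0):
    """Toy trajectory of a protein switching between membrane-bound (2D)
    and cytosolic (3D) diffusion; step sizes scale with sqrt(D)."""
    rng = random.Random(seed)
    x = y = z = 0.0
    bound = False
    bound_steps = 0
    for _ in range(n_steps):
        if bound:
            if rng.random() < p_unbind:
                bound = False
        elif z <= 0.1 and rng.random() < p_bind:  # near membrane -> may bind
            bound, z = True, 0.0
        d = d2d if bound else d3d
        x += rng.gauss(0.0, d ** 0.5)
        y += rng.gauss(0.0, d ** 0.5)
        if not bound:
            z = abs(z + rng.gauss(0.0, d ** 0.5))  # membrane at z=0 reflects
        bound_steps += bound
    # fraction of time bound, which relates to the apparent binding constant
    return bound_steps / n_steps


occupancy = simulate(10_000)
```

Sweeping the binding parameters in such a simulation and recording the apparent diffusion time per beam waist is, in spirit, how a learning set for an empirical diffusion law can be built.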
AIstroSight is a joint project-team with the biotech company Theranexus. As a full “tutelle” of the team, Theranexus brings its research expertise in in vitro cell culture, disease modelling and imaging, both in terms of research workforce and data. The stand-alone “convention d'équipe-projet commune” of AIstroSight lists a group of 30 rare diseases of the central nervous system that are of direct interest to Theranexus and that are associated with a specific regimen in terms of IP and legal affairs. AIstroSight members are nevertheless allowed to work on pathologies outside this list without restriction, but under a different legal regimen vis-à-vis Theranexus.
PhD Students
Master Students