Data collection is ubiquitous nowadays, providing our society with tremendous volumes of knowledge about human, environmental, and industrial activity. This ever-increasing stream of data holds the keys to new discoveries, both in industrial and scientific domains. However, those keys will only be accessible to those who can make sense out of such data. This is a hard problem: it requires a good understanding of the data at hand, proficiency with the available analysis tools and methods, and good deductive skills. All these skills have been grouped under the umbrella term “Data Science”, and universities have put a lot of effort into training professionals in this field. “Data Scientist” is currently an extremely sought-after job, as the demand far exceeds the number of competent professionals. Despite its boom, data science is still mostly a “manual” process: current data analysis tools still require a significant amount of human effort and know-how. This makes data analysis a lengthy and error-prone process, even for data science experts, and current approaches remain mostly out of reach of non-specialists.
The objective of the team LACODAM is to facilitate the process of making sense out of (large) amounts of data.
This can serve the purpose of deriving knowledge and insights for better decision-making.
Our approaches are mostly dedicated to providing novel tools to data scientists: tools that either perform tasks not addressed by any other tool, or improve performance on existing tasks (for instance by reducing execution time, improving accuracy, or better handling imbalanced data).
LACODAM is a research team on data science methods and applications, composed of researchers with a background in symbolic AI, data mining, databases, and machine learning. Our research is organized along the three following research axes:
LACODAM's core symbolic expertise is in methods for exploring efficiently large combinatorial spaces. Such expertise is used in three main research areas:
In the pattern mining domain, the team is well known for tackling problems where the data and the expected patterns have a temporal component. The data considered are usually timestamped event logs, a ubiquitous type of data nowadays. The patterns extracted can be subsequences of varying complexity, but also patterns exhibiting temporal periodicity.
A well-known problem in pattern mining is pattern explosion: due to either underspecified constraints or the combinatorial nature of the search space, pattern mining approaches may produce millions of patterns of mixed interest. The current best approach to limiting the number of output patterns is to produce a small pattern set that optimizes some quality criterion. The best pattern set methods so far are based on information theory and rely on the Minimum Description Length (MDL) principle. LACODAM is the leading French team on MDL-based pattern mining, especially for complex patterns.
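As a toy illustration of the two-part MDL idea behind such methods (a sketch only, not the team's actual algorithms: the symbol encoding and the single code word `P` for the pattern are simplifying assumptions), one can measure how many bits a candidate pattern saves when it is used to rewrite the data:

```python
import math
from collections import Counter

def code_length(symbols):
    """Shannon code length of a symbol sequence under its own
    empirical distribution: -sum log2 p(s)."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return sum(-c * math.log2(c / total) for c in counts.values())

def mdl_gain(sequence, pattern):
    """Two-part MDL gain of a pattern: bits saved by rewriting the data
    with a code word 'P' for the pattern, minus the cost of describing
    the pattern itself."""
    rewritten, i = [], 0
    while i < len(sequence):
        if tuple(sequence[i:i + len(pattern)]) == tuple(pattern):
            rewritten.append("P")
            i += len(pattern)
        else:
            rewritten.append(sequence[i])
            i += 1
    return code_length(sequence) - (code_length(rewritten)
                                    + code_length(pattern))

seq = list("abcabcabcxyabc")
# The frequent pattern "abc" compresses the data far better than "xy".
print(mdl_gain(seq, "abc"), mdl_gain(seq, "xy"))
```

A pattern set method would greedily select patterns with the highest gains until the total description length stops decreasing.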
After integrating Peggy Cellier, the main French expert in MDL-based pattern mining, in 2021, the team was joined in April 2022 by Sébastien Ferré, who is also an expert in this area, especially for graph patterns.
The contribution of the team in the Semantic Web domain focuses on different problems related to knowledge graphs (KGs) – usually extracted (semi-)automatically from the Web. These include applications such as mining and reasoning, as well as data management tasks such as provenance and archiving. Reasoning can resort to either symbolic methods such as Horn rules or numeric approaches such as KG embeddings that can be explained via post-hoc explainability modules. The integration of Sébastien Ferré (former SemLIS team leader) further strengthens the Semantic Web axis by extending our expertise on general graph mining, relation extraction, and semantic data exploration.
Skyline queries are a research topic from the database community, closely related to multi-criteria optimization. In transactional data, one may want to optimize over several attributes of equal importance, which amounts to discovering a Pareto front (the "skyline").
The team has expertise in skyline queries over traditional databases, as well as in their application to pattern mining (the extraction of skypatterns). Recently, the team started to tackle the extraction of skyline groups, i.e., groups of records that together optimize multiple criteria.
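The Pareto-front semantics of a skyline query can be sketched with a toy example (hypothetical hotel data; both criteria are minimized here):

```python
def skyline(records, keys):
    """Return the skyline (Pareto front) of `records`: the records not
    dominated on the given criteria. Every criterion is minimized here;
    r dominates s if r <= s on all keys and r < s on at least one."""
    def dominates(r, s):
        return (all(r[k] <= s[k] for k in keys)
                and any(r[k] < s[k] for k in keys))
    return [r for r in records
            if not any(dominates(s, r) for s in records)]

hotels = [
    {"name": "A", "price": 50, "distance": 8},
    {"name": "B", "price": 80, "distance": 2},
    {"name": "C", "price": 90, "distance": 7},
    {"name": "D", "price": 50, "distance": 3},
]
# D dominates A and C; B and D are incomparable, so both are in the skyline.
print([h["name"] for h in skyline(hotels, ["price", "distance"])])  # → ['B', 'D']
```

This quadratic nested-loop evaluation is only illustrative; actual skyline algorithms use sorting or partitioning to avoid comparing every pair of records.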
Making Machine Learning more interpretable is one of the greatest challenges for the AI community nowadays. LACODAM contributes to the main areas of explainable AI (XAI):
More generally, LACODAM is interested in the study of the interpretability-accuracy trade-off. Our studies may be able to answer questions such as “how much accuracy can a model lose (or perhaps gain) by becoming more interpretable?”. Such a goal requires us to define interpretability in a more principled way, a challenge that the community has only recently begun to address and has not yet overcome.
LACODAM's research work is firmly rooted in applications. On the one hand, the data science tools proposed in our fundamental work need to prove their value at solving actual problems. On the other hand, working with practitioners allows us to better understand their needs and the limitations of existing approaches w.r.t. those needs. This can open new and fruitful (fundamental) research directions.
Our objective, in that axis, is to work on challenging problems with interesting and pertinent partners. We target problems where off-the-shelf data science approaches either cannot be applied or do not give satisfactory results: such problems are the most likely to lead to new and meaningful research in our field. For some problems, collaborative research may not necessarily lead to fundamental breakthroughs, but can still allow making progress in the practitioners' field. We also value such work, which contributes to the discovery of new knowledge and helps industrial partners innovate.
Due to the team expertise in handling temporal data, a lot of our applicative collaborations revolve around the analysis of time series or event logs. Naturally, our work on interpretability is also present in most of our collaborations, as experts want accurate models, but also want to understand the decisions of those models.
The precise application domains are described in more detail in the next section (Section 4).
The current period is extremely favorable for teams working in Data Science and Artificial Intelligence, and LACODAM is no exception. We are eager to see our work applied in real-world applications, and thus devote significant effort to maintaining strong ties with industrial partners concerned with marketing and energy, as well as public partners working on health, agriculture, and the environment.
We present below our industrial collaborations. Some are well-established partnerships, while others are more recent collaborations with local industries that wish to reinforce their Data Science R&D with us.
Two main axes characterize the bulk of LACODAM's environmental impact: work trips and the use of computing resources.
While the COVID-19 health crisis drastically cut the number of work trips of the team, recent years have seen an increase in physical participation in conferences and various committees. However, compared to the pre-COVID period, the majority of trips are national or at most European, with very few trips outside of Europe, and most of them use trains (rather than planes). In general, the possibility of attending meetings by videoconference seems to have removed many “low added value trips”. This is a first step in reducing our carbon footprint in a meaningful way, while preserving the trips that are important for the scientific as well as human aspects of our work.
LACODAM contributed a new server (abacus12) to the Igrida computing platform in 2020. As a team specialized in data science and machine learning, a recurrent task in LACODAM is to run CPU-intensive algorithms on large data collections, for example to train deep neural networks. Some of our recent PhD research topics (e.g., the theses of Simon Corbillé and Simon Felton) concern deep learning technologies, and the important place of eXplainable AI in our research program has made our team highly reliant on Igrida (notably for the PhDs of Julien Delaunay and Victor Guyomard). The discontinuation of Igrida services and the transition towards Grid'5000 and Jean Zay has reduced our access to easily available computation resources, making it harder to perform experiments during the transition and requiring us to learn new ways of operating. However, it arguably has a positive effect on energy consumption, as we now use national infrastructures that benefit from even better sharing between users than Igrida (which was already heavily used).
We estimate that the research work can have actual impact in three different ways:
Francesco Bariatti received the "prix de thèse EGC 2023" for his PhD thesis on "Mining tractable sets of graph patterns with the minimum description length principle", co-supervised by Sébastien Ferré and Peggy Cellier and defended on 23/11/2021.
Maëva Durand (supervised by
Dexteris is a low-code tool for data exploration and transformation. It works as an interactive data-oriented query builder with JSONiq as the target query language. It uses JSON as the pivot data format but it can read from and write to a few other formats: text, CSV, and RDF/Turtle (to be extended to other formats).
Dexteris is very expressive, as JSONiq is Turing-complete, and supports a varied set of data processing features:
- reading JSON files, and CSV as JSON (one object per row, one field per column),
- string processing (split, replace, match, ...),
- arithmetics, comparison, and logics,
- accessing and creating JSON data structures,
- iterations, grouping, filtering, aggregates and ordering (FLWOR operators),
- local function definitions.
The built JSONiq programs are high-level, declarative, and concise. Intermediate results are shown at every step so that users can stay focused on their data and on the transformations they want to apply.
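The CSV-as-JSON pivot convention mentioned above (one object per row, one field per column) can be sketched in a few lines of Python; this is only an illustration of the convention, not Dexteris's implementation:

```python
import csv
import io
import json

def csv_as_json(text, delimiter=","):
    """Read CSV text as a list of JSON-style objects:
    one object per row, one field per column (the header row gives the keys)."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [dict(row) for row in reader]

sample = "name,city\nAlice,Rennes\nBob,Brest\n"
print(json.dumps(csv_as_json(sample)))
# → [{"name": "Alice", "city": "Rennes"}, {"name": "Bob", "city": "Brest"}]
```

Once rows are in this object form, generic JSON operators (filtering, grouping, field access) apply uniformly to CSV and native JSON inputs.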
We organize the scientific results of the research conducted at LACODAM according to the axes described in our research program (Section 3).
Remark about the “Participants” boxes: for each subsection, we automatically compiled the list of co-authors of the papers that make up this year's “New Results”. This obviously does not mean that other members of the team do not work on the listed topics; it only means that they did not publish on that topic this year.
Knowledge graphs and other forms of relational data have become a widespread kind of data, and powerful methods to analyze and learn from them are needed. Formal Concept Analysis (FCA) is a mathematical framework for the analysis of symbolic datasets, which has been extended to graphs and relational data, like Graph-FCA. It encompasses various tasks such as pattern mining or machine learning, but its application generally relies on the computation of a concept lattice whose size can be exponential with the number of instances. We propose to follow an instance-based approach where the learning effort is delayed until a new instance comes in, and an inference task is set. This is the approach adopted in k-Nearest Neighbors, and this relies on a distance between instances. We define a conceptual distance based on FCA concepts, and from there the notion of concepts of neighbors, which can be used as a basis for instance-based reasoning. Those definitions are given for both classical FCA and Graph-FCA. We provide efficient algorithms for computing concepts of neighbors, and we demonstrate their inference capabilities by presenting three different applications: query relaxation, knowledge graph completion, and relation extraction.
This chapter illustrates the basic concepts of fault localization using a data mining technique. It uses the Trityp program to illustrate the general method. Formal concept analysis and association rules are two well-known methods for symbolic data mining. In their original inception, they both consider data in the form of an object-attribute table. The chapter considers a debugging process in which a program is tested against different test cases. Two attributes, PASS and FAIL, represent the outcome of each test case. The chapter extends the data mining analysis of fault localization to multiple-fault situations. It also addresses how data mining can be further applied to fault localization for GUI components. Unlike traditional software, GUI test cases are usually event sequences, and each individual event has a unique corresponding event handler.
A poster presenting scikit-mine, a Python library for pattern mining. This library proposes an open-source implementation of recent MDL-based pattern mining algorithms, with a focus on their ease of use.
To get a good understanding of a dynamical system, it is convenient to have an interpretable and versatile model of it. Timed discrete event systems are a kind of model that meets these requirements. However, such models can be inferred from timestamped event sequences but not directly from numerical data. To solve this problem, a discretization step must identify events or symbols in the time series. Persist is a discretization method that aims to create persisting symbols by using a measure called the persistence score. This mitigates the risk of undesirable symbol changes that would lead to an overly complex model. After studying the persistence score, we point out that it tends to favor extreme cases, causing it to miss interesting persisting symbols. To correct this behavior, we replace the metric used in the persistence score, the Kullback-Leibler divergence, with the Wasserstein distance. Experiments show that the improved persistence score enhances Persist's ability to capture the information of the original time series and makes it better suited for discrete event systems learning.
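A heavily simplified sketch of the persistence idea is shown below. This is our own toy formulation, not the paper's exact score: we discretize by quantile binning, and the 1-D Wasserstein distance between the two Bernoulli distributions involved reduces to |p - q|, to which we attach a sign so that persisting symbols score positively:

```python
def quantile_discretize(series, n_symbols):
    """Quantile binning: a simplified stand-in for the discretization
    candidates that Persist scores."""
    ranked = sorted(series)
    bounds = [ranked[int(len(ranked) * i / n_symbols)]
              for i in range(1, n_symbols)]
    return [sum(v >= b for b in bounds) for v in series]

def persistence(symbols, n_symbols):
    """For each symbol, compare its self-transition probability p with
    its marginal probability q. The 1-D Wasserstein distance between
    Bernoulli(p) and Bernoulli(q) reduces to |p - q|; we keep the sign
    of (p - q) so that persisting symbols score positively."""
    scores = []
    for s in range(n_symbols):
        q = symbols.count(s) / len(symbols)
        visits = sum(1 for a in symbols[:-1] if a == s)
        if visits == 0:
            continue
        stays = sum(1 for a, b in zip(symbols, symbols[1:])
                    if a == s and b == s)
        scores.append(stays / visits - q)
    return sum(scores) / len(scores)

smooth = [0.0] * 20 + [1.0] * 20   # slowly varying: persisting symbols
noisy = [0.0, 1.0] * 20            # rapid alternation: no persistence
print(persistence(quantile_discretize(smooth, 2), 2))
print(persistence(quantile_discretize(noisy, 2), 2))
```

On the slowly varying signal the score is positive (symbols persist), while rapid alternation yields a negative score; Persist selects the discretization maximizing such a score.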
The ARC (Abstraction and Reasoning Corpus) challenge has been proposed to push AI research towards more generalization capability rather than ever more performance. It is a collection of unique tasks about generating colored grids, specified by a few examples only. We propose object-centered models analogous to the natural programs produced by humans. The MDL (Minimum Description Length) principle is exploited for an efficient search in the vast model space. We obtain encouraging results with a class of simple models: various tasks are solved and the learned models are close to natural programs.
The signature approach considers a sequence of itemsets and, given a number k, returns a segmentation of the sequence into k segments such that the number of items occurring in all segments is maximized. The limitation of this approach is that it requires manually setting k, which fixes the temporal granularity at which the data is analyzed. We propose the sky-signature model, which removes this requirement and allows the results to be examined at multiple levels of granularity, while keeping a compact output. We also propose efficient algorithms to mine sky-signatures, as well as an experimental validation on real data from both the retail domain and natural language processing (political speeches).
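For a fixed k, the signature problem can be illustrated with a toy brute-force search (hypothetical purchase data; real signature algorithms avoid this exponential enumeration of breakpoints):

```python
from itertools import combinations

def signature(sequence, k):
    """Split a sequence of itemsets into k consecutive segments so that
    the number of items occurring in *every* segment is maximized.
    Brute force over breakpoints: exponential, for illustration only."""
    n = len(sequence)
    best_items, best_cuts = set(), None
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0,) + cuts + (n,)
        segments = [sequence[a:b] for a, b in zip(bounds, bounds[1:])]
        # Items present in a segment = union of its itemsets;
        # the signature = items common to all segments.
        common = set.intersection(*(set().union(*seg) for seg in segments))
        if len(common) > len(best_items):
            best_items, best_cuts = common, cuts
    return best_items, best_cuts

# Hypothetical purchase log: one itemset per customer visit.
visits = [{"milk", "bread"}, {"milk"}, {"beer", "milk"},
          {"bread", "milk"}, {"milk", "eggs"}]
items, cuts = signature(visits, 2)
print(sorted(items))  # → ['bread', 'milk']
```

The sky-signature model then avoids fixing k by keeping the Pareto-optimal results across granularities.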
Some documents presented elsewhere also contribute to this research domain: 20.
A number of extensions have been proposed for Formal Concept Analysis (FCA). Among them, Pattern Structures (PS) bring complex descriptions on objects, as an extension to sets of binary attributes; while Graph-FCA brings n-ary relationships between objects, as well as n-ary concepts. We have introduced a novel extension named Graph-PS that combines the benefits of PS and Graph-FCA. In conceptual terms, Graph-PS can be seen as the meet of PS and Graph-FCA, seen as sub-concepts of FCA. We have demonstrated how it can be applied to RDFS graphs, handling hierarchies of classes and properties, and patterns on literals such as numbers and dates.
Some previously presented documents also contribute to this research domain: 34.
We have proposed the use of controlled natural language as a target for knowledge graph question answering (KGQA) semantic parsing via language models as opposed to using formal query languages directly. Controlled natural languages are close to (human) natural languages, but can be unambiguously translated into a formal language such as SPARQL. Our research hypothesis is that the pre-training of large language models (LLMs) on vast amounts of textual data leads to the ability to parse into controlled natural language for KGQA with limited training data requirements. We devise an LLM-specific approach for semantic parsing to study this hypothesis. To conduct our study, we created a dataset that allows the comparison of one formal and two different controlled natural languages. Our analysis shows that training data requirements are indeed substantially reduced when using controlled natural languages, which is of relevance since collecting and maintaining high-quality KGQA semantic parsing training data is very expensive and time-consuming.
In recent years, research in RDF archiving has gained traction due to the ever-growing nature of semantic data and the emergence of community-maintained knowledge bases. Several solutions have been proposed to manage the history of large RDF graphs, including approaches based on independent copies, time-based indexes, and change-based schemes. In particular, aggregated changesets have been shown to be relatively efficient at handling very large datasets. However, ingestion time can still become prohibitive as the revision history increases. To tackle this challenge, we propose a hybrid storage approach based on aggregated changesets, snapshots, and multiple delta chains. We evaluate different snapshot creation strategies on the BEAR benchmark for RDF archives, and show that our techniques can speed up ingestion time up to two orders of magnitude while keeping competitive performance for version materialization and delta queries. This allows us to support revision histories of lengths that are beyond reach with existing approaches.
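The aggregated-changeset storage idea underlying this hybrid approach can be sketched as follows (a toy single-snapshot, single-delta-chain illustration under our own simplifying assumptions, not the paper's actual system):

```python
class RDFArchive:
    """Toy change-based RDF archive: one snapshot plus aggregated
    changesets, i.e., for each version, all triples added and deleted
    relative to the snapshot."""
    def __init__(self, snapshot):
        self.snapshot = frozenset(snapshot)
        self.changesets = []  # one (added, deleted) pair per version

    def commit(self, added=(), deleted=()):
        prev_add, prev_del = (self.changesets[-1] if self.changesets
                              else (frozenset(), frozenset()))
        # Aggregation keeps every delta relative to the snapshot, so
        # materializing any version needs a single changeset lookup.
        self.changesets.append((
            (prev_add | frozenset(added)) - frozenset(deleted),
            (prev_del | frozenset(deleted)) - frozenset(added),
        ))

    def materialize(self, version):
        """Return the full graph at `version` (0 = the snapshot)."""
        if version == 0:
            return set(self.snapshot)
        added, deleted = self.changesets[version - 1]
        return (set(self.snapshot) | added) - deleted

archive = RDFArchive({("ex:s", "ex:p", "ex:o")})
archive.commit(added={("ex:s", "ex:p", "ex:o2")})    # version 1
archive.commit(deleted={("ex:s", "ex:p", "ex:o")})   # version 2
print(sorted(archive.materialize(2)))  # → [('ex:s', 'ex:p', 'ex:o2')]
```

Because aggregated changesets grow with the revision history, adding fresh snapshots and multiple delta chains, as proposed above, bounds their size and hence the ingestion time.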
The dynamicity of semantic data has propelled research on RDF archiving, i.e., the task of storing and making accessible the full history of a large RDF dataset. That said, existing archiving techniques fail to scale when confronted with very large RDF archives and complex SPARQL queries. In this demonstration, we showcase GLENDA, a system capable of running full SPARQL 1.1 compliant queries over large RDF archives. We achieve this through a multi-snapshot change-based storage architecture that we interface using the Comunica query engine. Thanks to this integration, we demonstrate that fast SPARQL query processing over multiple versions is possible. Moreover, our demonstration provides different statistics about the history of RDF datasets, giving insights into the evolution dynamics of the data.
The advancements and popularity of Semantic Web technologies in the last decades have led to an exponential adoption and availability of Web-accessible datasets. While most solutions consider such datasets to be static, they often evolve over time. Hence, efficient archiving solutions are needed to meet the users' and maintainers' needs. While some solutions to these challenges already exist, standardized benchmarks are needed to systematically test the different capabilities of existing solutions and identify their limitations. Unfortunately, the development of new benchmarks has not kept pace with the evolution of RDF archiving systems. In this paper, we therefore identify the current state of the art in RDF archiving benchmarks and discuss to what degree such benchmarks reflect the current needs of real-world use cases and their requirements. Through this empirical assessment, we highlight the need for the development of more advanced and comprehensive benchmarks that align with the evolving landscape of RDF archiving.
Data exploration and transformation remain a challenging prerequisite to the application of data analysis methods. The desired transformations are often ad-hoc so that existing end-user tools may not suffice, and plain programming may be necessary. We propose a guided query builder approach to reconcile expressivity and usability, i.e. to support the exploration of data, and the design of ad-hoc transformations, through data-user interaction only. This approach is available online as a client-side web application, named Dexteris. Its strengths and weaknesses are evaluated on a representative use case, and compared to plain programming and ChatGPT-assisted programming.
The ubiquity of machine learning has highlighted the importance of explainability algorithms. Among these, model-agnostic methods generate artificial examples by slightly modifying original data and then observing changes in the model's decision-making on these artificial examples. However, such methods require initial examples and provide explanations only for the decisions based on these examples. To address these issues, we propose Therapy, the first model-agnostic explainability method for language models that does not require input data. This method generates texts that follow the distribution learned by the classifier to be explained through cooperative generation. Not depending on initial examples allows for applicability in situations where no data is available (e.g., for privacy reasons) and offers explanations on the global functioning of the model instead of multiple local explanations, thus providing an overview of the model's workings. Our experiments show that even without input data, Therapy provides instructive insights into the text features used by the classifier, which are competitive with those provided by methods using data.
We present a strategy, called Seq2Seg, to reach both precise and accurate recognition and segmentation for children handwritten words. Reaching such high performance for both tasks is necessary to give personalized feedback to children who are learning how to write. The first contribution is to combine the predictions of an accurate Seq2Seq model with the predictions of a R-CNN object detector. The second one is to refine the bounding box predictions provided by the detector with a segmentation lattice computed from the online signal. An ablation study shows that both contributions are relevant, and their combination is efficient enough for immediate feedback and achieves state-of-the-art results even compared to more informed systems.
Surrogate explanations approximate a complex model by training a simpler model over an interpretable space. Among these simpler models, we identify three kinds of surrogate methods: (a) feature-attribution, (b) example-based, and (c) rule-based explanations. Each surrogate approximates the complex model differently, and we hypothesise that this can impact how users interpret the explanation. Despite the numerous calls for introducing explanations for all, no prior work has compared the impact of these surrogates on specific user roles (e.g., domain expert, developer). In this article, we outline a study design to assess the impact of these three surrogate techniques across different user roles.
Knowledge graphs (KGs) are key tools in many AI-related tasks such as reasoning or question answering. This has, in turn, propelled research in link prediction in KGs, the task of predicting missing relationships from the available knowledge. Solutions based on KG embeddings have shown promising results in this matter. On the downside, these approaches are usually unable to explain their predictions. While some works have proposed to compute post-hoc rule explanations for embedding-based link predictors, these efforts have mostly resorted to rules with unbounded atoms, e.g., bornIn(x,y)->residence(x,y), learned on a global scope, i.e., the entire KG. None of these works has considered the impact of rules with bounded atoms such as nationality(x,England)->speaks(x,English), or the impact of learning from regions of the KG, i.e., local scopes. We therefore study the effects of these factors on the quality of rule-based explanations for embedding-based link predictors. Our results suggest that more specific rules and local scopes can improve the accuracy of the explanations. Moreover, these rules can provide further insights about the inner-workings of KG embeddings for link prediction.
Knowledge graphs (KGs) are vast collections of machine-readable information, usually modeled in RDF and queried with SPARQL. KGs have opened the door to a plethora of applications such as Web search or smart assistants that query and process the knowledge contained in those KGs. An important, but often disregarded, aspect of querying KGs is query provenance: explanations of the data sources and transformations that made a query result possible. In this article we demonstrate, through a Web application, the capabilities of SPARQLprov, an engine-agnostic method that annotates query results with how-provenance annotations. To this end, SPARQLprov resorts to query rewriting techniques, which make it applicable to already deployed SPARQL endpoints. We describe the principles behind SPARQLprov and discuss perspectives on visualizing how-provenance explanations for SPARQL queries.
In this paper, we present an interactive visualization tool that exhibits counterfactual explanations to explain model decisions. Each individual sample is assessed to identify the set of changes needed to flip the output of the model. These explanations aim to provide end-users with personalized actionable insights with which to understand automated decisions. An interactive method is also provided so that users can explore various solutions. The functionality of the tool is demonstrated by its application to a customer retention dataset. The tool is compatible with any counterfactual explanation generator and decision model for tabular data.
Counterfactual explanations have become a mainstay of the XAI field. This particularly intuitive kind of statement lets the user understand what small but necessary changes would have to be made to a given situation in order to change a model's prediction. The quality of a counterfactual depends on several criteria: realism, actionability, validity, robustness, etc. In this paper, we are interested in the robustness of a counterfactual, and more precisely in its robustness to changes in the counterfactual input. This form of robustness is particularly challenging, as it involves a trade-off between the robustness of the counterfactual and its proximity to the example to explain. We propose a new framework, CROCO, that generates robust counterfactuals while effectively managing this trade-off, and guarantees the user a minimal level of robustness. An empirical evaluation on tabular datasets confirms the relevance and effectiveness of our approach.
Academic advising brings numerous benefits to the mission of Higher Education Institutions. One of the main actors is the advisors who support students in defining appropriate academic roadmaps. One central and challenging duty of academic advisors is course recommendation for term planning. This task requires both knowledge of the study programs and a thorough analysis of the students' unique circumstances. If we add limited time and a large student population, the task becomes overwhelming. As a result, an important body of research has sought to expedite term planning via data-oriented decision-support tools. While the impact of such tools on students has been extensively studied, the advisors' perspective remains largely unexplored. We contribute to filling this gap by studying how a grade prediction tool shapes advisors' course recommendation strategies. Our observations suggest that while the advisors' usual strategies tend to prevail, their recommendations are mostly affected by the advisee's historical performance.
Some previously presented documents also contribute to this research domain: 31.
We propose a new visual servoing method that controls a robot's motion in a latent space. We aim to extract the best properties of two previously proposed servoing methods: we seek to obtain the accuracy of photometric methods such as Direct Visual Servoing (DVS), as well as the behavior and convergence of pose-based visual servoing (PBVS). Photometric methods suffer from limited convergence area due to a highly non-linear cost function, while PBVS requires estimating the pose of the camera which may introduce some noise and incurs a loss of accuracy. Our approach relies on shaping (with metric learning) a latent space, in which the representations of camera poses and the embeddings of their respective images are tied together. By leveraging the multimodal aspect of this shared space, our control law minimizes the difference between latent image representations thanks to information obtained from a set of pose embeddings. Experiments in simulation and on a robot validate the strength of our approach, showing that the sought-out benefits are effectively found.
Context. We study the benefits of using a large public neuroimaging database composed of fMRI statistic maps, in a self-taught learning framework, for improving brain decoding on new tasks. First, we leverage the NeuroVault database to train, on a selection of relevant statistic maps, a convolutional autoencoder to reconstruct these maps. Then, we use this trained encoder to initialize a supervised convolutional neural network to classify tasks or cognitive processes of unseen statistic maps from large collections of the NeuroVault database. Results. We show that such a self-taught learning process always improves the performance of the classifiers, but the magnitude of the benefits strongly depends on the number of samples available for both pre-training and fine-tuning the models, and on the complexity of the targeted downstream task. Conclusion. The pre-trained model improves the classification performance and displays more generalizable features, less sensitive to individual differences.
Functional magnetic resonance imaging analytical workflows are highly flexible with no definite consensus on how to choose a pipeline. While methods have been developed to explore this analytical space, there is still a lack of understanding of the relationships between the different pipelines. We use community detection algorithms to explore the pipeline space and assess its stability across different contexts. We show that there are subsets of pipelines that give similar results, especially those sharing specific parameters (e.g., number of motion regressors, software packages, etc.), with relative stability across groups of participants. By visualizing the differences between these subsets, we describe the effect of pipeline parameters and derive general relationships in the analytical space.
Results of functional Magnetic Resonance Imaging (fMRI) studies can be impacted by many sources of variability, including differences due to the sampling of the participants, differences in acquisition protocols and material, but also different analytical choices in the processing of the fMRI data. While variability across participants or across acquisition instruments has been extensively studied in the neuroimaging literature, the root causes of analytical variability remain an open question. Here, we share the HCP multi-pipeline dataset, including the resulting statistic maps for 24 typical fMRI pipelines on 1,080 participants of the HCP-Young Adults dataset. We share both individual and group results (for 1,000 groups of 50 participants) over 5 motor contrasts. We hope that this large dataset, covering a wide range of analysis conditions, will provide new opportunities to study analytical variability in fMRI.
Real-time combination of observed growth and feed intake performance with performance simulated by InraPorc® to apply precision feeding to growing pigs. Precision feeding (PF) of growing pigs requires methods for real-time analysis of performance and prediction of nutritional requirements. Two calculation methods for reducing nutrient intake were evaluated. Individual daily kinetics of body weight (BW) and feed intake (FI) of 285 pigs, reared from 81 to 156 days of age (ad libitum feeding), were used. The PF1 approach (from the Feed-a-Gene project) used the Holt-Winters and MARS methods to predict individual daily FI and BW, respectively. The standardised digestible lysine (dLys) requirement was calculated daily from the predicted performance using the factorial method. The PF2 method used 2,200 virtual pigs whose performance was simulated using InraPorc®. By comparing the FI and BW dynamics of real and virtual pigs, the 10 closest virtual pigs were selected daily for each real pig. Individual daily performance and expected dLys requirements were obtained by averaging the InraPorc® data of these 10 virtual pigs. PF was then simulated for each real pig. For each method, the blend proportions of two diets (A and B; 9.7 MJ net energy (NE)/kg; crude protein: 16.9% and 9.3%; dLys: 1.0 and 0.4 g/MJ NE, respectively) were calculated daily to achieve the calculated requirements. Nitrogen (N) and dLys intake and N excretion were calculated individually. A two-phase (2-P) feeding strategy was also simulated (A:B = 83:17 until 65 kg BW, 50:50 afterwards). Compared to 2-P, N intake, dLys intake and N excretion were reduced by, respectively, 6.7%, 9.7% and 11.9% with PF1, and by 9.2%, 13.3% and 16.3% with PF2. The PF2 method provided better day-to-day stability of performance predictions, leading to a more regular decrease in N and dLys intakes during growth. The potential of this new method needs to be confirmed under field conditions.
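As an illustration of the daily blend computation described above, the following sketch (with hypothetical values; the actual PF1/PF2 implementations are more involved) derives the proportion of diet A in an A/B blend needed to reach a target dLys concentration, assuming both diets share the same net energy content so that blending is linear:

```python
def blend_proportion(target_dlys, dlys_a=1.0, dlys_b=0.4):
    """Fraction of diet A needed so that the A/B blend reaches the
    target dLys concentration (g dLys per MJ NE).  Both diets are
    assumed to have the same NE content (9.7 MJ/kg), so the blend's
    dLys concentration is linear in the proportion of A."""
    p = (target_dlys - dlys_b) / (dlys_a - dlys_b)
    return min(max(p, 0.0), 1.0)  # clamp to a feasible blend

# e.g. a pig whose estimated requirement is 0.7 g dLys / MJ NE
# would receive (roughly) a 50:50 blend of diets A and B
print(round(blend_proportion(0.7), 3))
```

In the studies above, such a proportion would be recomputed daily per pig from the requirement estimated either by the factorial method (PF1) or from the closest virtual pigs (PF2).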
The relational database SOWELL was created to better understand the behaviour and individual responses of gestating sows facing different induced short-term events: a competitive situation for feed, hot and cold thermal conditions, a sound event, an enrichment of the pen (straw, ropes and bags available) and an impoverishment of the pen (no straw, no objects). The data were collected on 102 crossbred sows equipped with activity sensors, group-housed in video-recorded pens (16–18 sows per pen), with access to automatons. Feeding and drinking behaviours were extracted from the electronic feeders' and drinkers' recordings. Social behaviours, physical activities and locations in the pen were recorded through manual video labelling at the individual scale. Accelerometers fixed on the sows' ears also recorded individual physical activity. Physical activity was also determined at the group scale by automatic video analysis using deep learning techniques. BW, back fat thickness and body condition (cleanliness, body damage) were recorded weekly during the whole gestation. Finally, gestation room data regarding environmental conditions (temperature, humidity, noise level) were recorded using automatic sensors. The database can serve different research purposes: sows' nutrition, for example to better calculate energy requirements with respect to environmental factors, but also welfare or health during gestation, by providing suitable indicators.
Background and Objectives. Precision feeding aims to define the right feeding strategy according to each individual's nutrient requirements, in order to improve health and reduce feed cost. Usually, the nutrient requirements of gestating sows are provided by a mechanistic nutritional model requiring input data such as age and body status. This paper proposes to predict the daily nutritional requirements using only data measured by sensors. According to various digital farm configurations, we explore and evaluate Machine Learning (ML) methods to predict the nutrient requirements of gestating sows. Material and Methods. Behavioural data of gestating sows were extracted from sensor data collected on 73 sows from parities 1 to 9. Their nutrient requirements concerned metabolisable energy (ME, in MJ/d) and standardised ileal digestible lysine (SID Lys, in g/d). Various digital farm configurations were proposed, from low-cost to more expensive equipment (electronic feeder and drinker, connected weight scale, accelerometers and video analysis software), producing various data at different levels of detail on sow behaviour. Nine ML algorithms were trained on these 23 scenarios to predict the daily energy and lysine requirements of each sow. The results proposed by the ML algorithms were compared with the outputs of the nutritional model InraPorc. Results. Using a Random Forest algorithm, the RMSE were higher with feeder data alone (1.22 MJ/d for ME and 0.53 g/d for SID Lys; 2.4% and 4.02% mean absolute error, respectively) than with combined data from feeders and accelerometers (1.01 MJ/d and 0.29 g/d; 1.9% and 2.1%). The inclusion of the sows' characteristics reduced the RMSE, on average, by 20% for ME and by 35% for Lys. Discussion and Conclusion. This study highlights that the daily requirements of gestating sows can be predicted accurately from behavioural data provided by sensors. It paves the way to simpler solutions for the application of precision feeding on farms.
Precision feeding is a strategy for supplying an amount and composition of feed as close as possible to each animal's nutrient requirements, with the aim of reducing feed costs and environmental losses. Usually, the nutrient requirements of gestating sows are provided by a nutrition model that requires input data such as sow and herd characteristics, but also an estimation of future farrowing performance. New sensors and automatons, such as automatic feeders and drinkers, have been deployed on pig farms over the last decade, and have produced large amounts of data. This study evaluated machine-learning methods for predicting the daily nutrient requirements of gestating sows, based only on sensor data, according to various configurations of digital farms. The data of 73 gestating sows were recorded using sensors such as electronic feeder and drinker stations, connected weight scales, accelerometers, and cameras. Nine machine-learning algorithms were trained on various dataset scenarios corresponding to different digital farm configurations (one or two sensors), in order to predict the daily metabolizable energy and standardized ileal digestible lysine requirements for each sow. The prediction results were compared to those of the InraPorc model, a mechanistic model for the precision feeding of gestating sows. The scenario predictions were also evaluated with or without the housing conditions and sow characteristics at artificial insemination usually integrated into the InraPorc model. Adding housing and sow characteristics to sensor data improved the mean average percentage error by 5.58% for lysine and by 2.22% for energy. The highest correlation coefficient values for lysine (0.99) and for energy (0.95) were obtained for scenarios involving an automatic feeder system only (daily duration and number of visits, with or without consumption). The scenarios including an automatic feeder combined with another sensor also gave good performance results.
For the scenarios using sow and housing characteristics and automatic feeder only, the root mean square error was lower with Gradient Tree Boosting (0.91 MJ/d for energy and 0.08 g/d for lysine) compared with those obtained using linear regression (2.75 MJ/d and 1.07 g/d). The results of this study show that the daily nutrient requirements of gestating sows can be predicted accurately using data provided by sensors and machine-learning methods. It paves the way to simpler solutions for precision feeding.
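For reference, the error metrics used in these two studies (RMSE and mean absolute percentage error) can be sketched as follows; the numbers below are made up and only illustrate how ML predictions would be compared against InraPorc outputs:

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between reference and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred):
    """Mean absolute percentage error (in %)."""
    return 100 * sum(abs(t - p) / t for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical daily ME requirements (MJ/d): InraPorc reference vs ML prediction
me_inraporc = [31.2, 33.5, 30.8, 32.1]
me_predicted = [30.5, 34.0, 31.0, 31.5]
print(round(rmse(me_inraporc, me_predicted), 3), round(mape(me_inraporc, me_predicted), 3))
```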
Estimating the welfare status of individual animals on the farm is a current challenge for improving livestock monitoring. New technologies offer opportunities to analyse livestock behaviour with machine learning and sensors. The aim of this study was to estimate some components of the welfare status of gestating sows based on machine learning methods and behavioural data. The dataset used was a combination of individual and group measures of behaviour (activity, social and feeding behaviours). A clustering method was used to estimate the welfare status of 69 sows (housed in four groups) during different periods (sum of 2 days per week) of gestation (between 6 and 10 periods, depending on the group). Three clusters were identified and labelled (scapegoat, gentle and aggressive). Environmental conditions and the sows' health influenced the proportion of sows in each cluster, contrary to the characteristics of the sows (age, body weight or body condition). The results also confirmed the importance of group behaviour for the welfare of each individual. A decision tree was then learned and used to classify the sows into the three welfare categories derived from the clustering step. This classification relied on data obtained from an automatic feeder and automated video analysis, achieving an accuracy rate exceeding 72%. This study shows the potential of an automatic decision support system to categorize welfare based on the behaviour of each gestating sow and of the group of sows.
This paper presents CAWET, a hybrid worst-case program timing estimation technique. CAWET identifies the longest execution path using static techniques, whereas the worst-case execution time (WCET) of basic blocks is predicted using an advanced language processing technique called Transformer-XL. By employing Transformer-XL in CAWET, the execution context formed by previously executed basic blocks is taken into account, allowing the microarchitecture of the processor pipeline to be considered without explicit modeling. Through a series of experiments on the TacleBench benchmarks, using different target processors (Arm Cortex M4, M7, and A53), our method is demonstrated to never underestimate WCETs and is shown to be less pessimistic than its competitors.
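The hybrid principle behind CAWET can be sketched as follows: a static longest-path search over an acyclic control-flow graph, with per-block costs that would, in CAWET, come from the Transformer-XL predictor (here, hypothetical fixed cycle counts stand in for the predictions):

```python
from functools import lru_cache

def longest_path(cfg, cost, entry, exit_):
    """WCET bound of an acyclic CFG: the cost of the most expensive
    entry-to-exit path, computed by memoized depth-first search."""
    @lru_cache(maxsize=None)
    def wcet(block):
        if block == exit_:
            return cost[block]
        return cost[block] + max(wcet(succ) for succ in cfg[block])
    return wcet(entry)

# Toy if/then/else CFG with predicted per-block WCETs (in cycles)
cfg = {"entry": ["then", "else"], "then": ["exit"], "else": ["exit"], "exit": []}
cost = {"entry": 5, "then": 12, "else": 7, "exit": 3}
print(longest_path(cfg, cost, "entry", "exit"))  # 5 + 12 + 3 = 20
```

Real WCET analyses must additionally handle loops (via bounds) and context-dependent block costs, which is precisely where the Transformer-XL predictions come in.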
This book chapter illustrates the basic concepts of fault localization using a data mining technique, with the Trityp program as a running example. Formal concept analysis and association rules are two well-known methods for symbolic data mining. In their original inception, both consider data in the form of an object-attribute table. The chapter considers a debugging process in which a program is tested against different test cases. Two attributes, PASS and FAIL, represent the outcome of each test case. The chapter extends the data mining analysis of fault localization to multiple-fault situations. It also addresses how data mining can be applied to fault localization for GUI components. Unlike traditional software, GUI test cases are usually event sequences, and each individual event has a unique corresponding event handler.
ORANGE - Univ. Rennes
Contract amount: 30k€ + PhD salary
Context. This project is a collaboration with Orange Labs Lannion on interpretable machine learning. Orange aims to develop the use of machine learning algorithms to enhance the services it proposes to its customers (for instance, credit acceptance or attribution prediction). This entails the development of generic approaches for providing interpretable decisions to customers and client managers.
Objective. The GDPR, implemented by the EU in 2018, stipulates a right to explanation for EU citizens with regard to decisions made from their personal data. In a society where many of those decisions are computer-assisted via machine learning algorithms, interpretable ML is crucial. A promising way to convey explanations for the outcomes of ML models is counterfactual explanations. The focus of the PhD thesis financed by this project is the generation of usable and actionable counterfactual explanations for ML classifiers, which are intensively used by Orange within their services.
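To give an intuition of what a counterfactual explanation is (this toy sketch is not Orange's nor the thesis's method), the following code brute-forces the smallest single-feature change that flips the decision of a hypothetical credit classifier:

```python
def classify(x):
    """Stand-in black-box model: refuse credit when debt exceeds
    half of the income (both features are hypothetical)."""
    income, debt = x
    return "accepted" if debt <= 0.5 * income else "rejected"

def counterfactual(x, feature_grids):
    """Return the closest single-feature change of x (smallest absolute
    change) that flips the model's decision, or None if none is found."""
    original = classify(x)
    best, best_dist = None, float("inf")
    for i, grid in enumerate(feature_grids):
        for v in grid:
            candidate = list(x)
            candidate[i] = v
            if classify(candidate) != original:
                dist = abs(v - x[i])
                if dist < best_dist:
                    best, best_dist = candidate, dist
    return best

x = [1000, 700]  # rejected: debt is 70% of income
cf = counterfactual(x, [range(0, 3001, 100), range(0, 1001, 50)])
print(cf)  # [1000, 500]: "had your debt been 500, you would have been accepted"
```

The research challenge addressed by the thesis is to make such explanations usable and actionable (plausible feature changes, realistic instances), which a brute-force grid search does not guarantee.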
Additional remarks. This contract finances the PhD of Victor GUYOMARD by Orange.
Stellantis - Univ. Rennes
Contract amount: 70k€ + PhD salary
Context. This project is a collaboration with Stellantis and focuses on the development of interpretable machine learning models for multivariate time series data. Utilizing a range of sensors integrated within vehicles, these models are designed to make real-time decisions. Providing drivers with clear explanations of these decisions is a key aspect. We specifically concentrate on counterfactual explanations, which not only clarify why a particular decision was made but also illustrate how alternative scenarios might have led to different outcomes.
Objective. Current approaches providing counterfactual explanations for time series models are limited to univariate time series. In this project, we aim to develop approaches to handle multivariate time series, which requires capturing the correlations between the series.
Additional remarks. This is the doctoral contract for the PhD of Paul Sévellec (Thèse CIFRE).
Enedis - Univ. Rennes
Contract amount: 45k€ + PhD salary
Context. The collection of electrical consumption time series through smart meters grows with ambitious nationwide smart grid programs. This data is both highly sensitive and highly valuable: strong laws about personal data protect it while laws about open data aim at making it public after a privacy-preserving data publishing process.
Objective. We are interested in privacy-preserving data-sharing. We study the uniqueness of large-scale real-life fine-grained electrical consumption time-series, the potential privacy threats, and their mitigation.
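The notion of uniqueness studied here can be illustrated on synthetic data (not the actual Enedis series): the sketch below measures how often a consumption series is uniquely identified by a small number of its readings, which is the basic building block of a re-identification risk analysis:

```python
import random

random.seed(0)
n_households, n_readings = 1000, 48  # e.g. 48 half-hourly readings per day
series = [tuple(random.choice(range(0, 50)) for _ in range(n_readings))
          for _ in range(n_households)]  # synthetic consumption levels

def uniqueness(series, k):
    """Fraction of series uniquely pinned down by k fixed time points."""
    points = random.sample(range(n_readings), k)
    keys = [tuple(s[t] for t in points) for s in series]
    counts = {}
    for key in keys:
        counts[key] = counts.get(key, 0) + 1
    return sum(1 for key in keys if counts[key] == 1) / len(keys)

for k in (1, 2, 3):
    print(k, uniqueness(series, k))
```

On real fine-grained consumption data, a handful of readings typically suffices to single out a household, which motivates the mitigation strategies studied in the project.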
Additional remarks. This is the doctoral contract for the PhD of Antonin Voyez (Thèse CIFRE).
ORANGE - Univ. Rennes
Contract amount: 45k€ + PhD salary
Context. Tabular data generation is paramount when dealing with privacy-sensitive data and with missing values, which are frequent cases in the real (industrial) world and particularly at Orange. It is also used for data augmentation, a pre-processing step often needed when training data-hungry deep learning models (for example to detect anomalies in networks, study customer profiles, ...).
Objective. We study methods to tackle heterogeneous tabular data generation with deep generative models. We are particularly interested in problems where the tabular data are heterogeneous (numerical and symbolic) and when new tables should be generated from scratch based on a human prompt.
Additional remarks. This is the doctoral contract for the PhD of Charbel Kindji (Thèse CIFRE).
Elodie Germani (PhD student supervised by Elisa Fromont with EMPENN) spent 3 months at Concordia University (Montreal, Canada), funded by a Mitacs Globalink Research Award (GRA) in partnership with the Centre National de la Recherche Scientifique (CNRS), for the project "Improving rs-fMRI-derived biomarkers of Parkinson's Disease".
Elisa Fromont, Alexandre Termier and Luis Galárraga are all members (within Inria) of the project H2020 ICT-48 TAILOR "Foundations of Trustworthy AI - Integrating Reasoning, Learning and Optimization". Elisa Fromont is responsible for Tasks 3.7 and 3.8 (roadmap and synergies with industry) in WP3.
HyAIAI: Hybrid Approaches for Interpretable AI
The Inria Project Lab HyAIAI is a consortium of Inria teams (Sequel, Magnet, Tau, Orpailleur, Multispeech, and LACODAM) that work together towards the development of novel machine learning methods combining numerical and symbolic approaches. The goal is to develop new machine learning algorithms such that (i) they are as efficient as the current best approaches, (ii) they can be guided by means of human-understandable constraints, and (iii) their decisions can be better understood. The project ended in June 2023.
#DigitAg: Digital Agriculture
#DigitAg is a “Convergence Institute” dedicated to the increasing importance of digital techniques in agriculture. Its goal is twofold: first, conducting innovative research on the use of digital techniques in agriculture, in order to improve competitiveness, preserve the environment, and offer decent living conditions to farmers; second, preparing future farmers and agricultural policy makers to successfully exploit such technologies. While #DigitAg is based in Montpellier, Rennes hosts a satellite of the institute focused on cattle farming.
LACODAM is involved in the “data mining” challenge of the institute, which A. Termier co-leads. He is also the representative of Inria in the steering committee of the institute. The interest for the team is to design novel methods to analyze and represent agricultural data, which are challenging because they are both heterogeneous and multi-scale (both spatially and temporally).
PEPR WAIT 4
The WAIT 4 project is part of the “Agroecology and Digital Technology” PEPR. The goal of this project is to provide the scientific basis for significant improvements in the well-being of farm animals. Up to now, animal well-being has been evaluated with indicators of the means deployed (e.g., available space, method to control building temperature, time spent outside...).
The goal of WAIT 4 is to provide the tools required to move to result indicators: can some guarantees be given on the well-being of animals? Can this well-being (or its absence) be correlated with management actions of the farmer, or with the animals' general living conditions?
This requires a much finer understanding of the animals' mental as well as physiological state. The project is led by Inrae (Florence Gondret), which brings animal science specialists, ranging from biologists to ethologists. CEA provides expertise on blood sensors, to measure molecules linked to stress, while Inria and Insa Lyon provide computer science expertise for tools to analyse the data. More precisely, LACODAM will deal first with analyzing time series of numerical sensor data (e.g. temperature, activity), and second with categorical sequences of events produced by annotation tools from the analysis of videos. Both will help to better model animal behavior, and to determine which behaviors are “normal” and which are anomalous and may be linked to bad conditions for the animals.
Bourse IUF - Elisa FROMONT
This project supports the work of Elisa Fromont with both a reduction of her teaching load and some research money (15k€/year for 5 years). Elisa is currently working on designing effective data mining and machine learning algorithms for real-life data (which are scarce, heterogeneous, multimodal, imbalanced, temporal, …). For the next few years, Elisa would like to focus on the interpretability of the results obtained by these algorithms. In pattern mining, her goal is to design algorithms that can directly mine a small number of relevant patterns. In the case of black-box machine learning models (e.g. deep neural nets), Elisa would like to design methods to help the end user understand the decisions taken by the model.
Scikit-mine (F-WIN project of PNR-IA)
Scikit-mine (SKM for short) is a Python library of pattern mining algorithms, designed to be compatible with the well-known scikit-learn library. It allows practitioners to use state-of-the-art pattern mining algorithms through a library that has the same usage interface as scikit-learn and exploits the same data types. SKM was developed by CNRS AI engineers in the context of the F-WIN project of the PNR-IA program of CNRS, whose general goal is to improve the development of AI software in research teams of CNRS labs.
FAbLe: Framework for Automatic Interpretability in Machine Learning
Participants: L. Galárraga (holder), C. Largouët
How can we fully automatically choose the best explanation for a given use case in classification?
Answering this question is the raison d’être of the JCJC ANR project FAbLe. By “best explanation” we mean an explanation that is both understandable by humans and faithful, among a universe of possible explanations. We focus on local explanations, i.e., explaining the answer of a black-box model for a given use case, which we call the “target instance”. We argue that the choice of the best explanation depends on (i) the data, namely the model, the explanation technique, the target instance, etc., and (ii) the recipients of the explanations. Hence our research is focused on two main questions: “What makes an explanation suitable (interpretable and faithful) for a particular instance and model?” and “What is the effect of the different AI-based explanation techniques and visual representations on users' comprehension and trust?”
Answering these questions will help us understand and automate the selection of a particular explanation style based on the use case. Our ultimate goal is to produce a suite of algorithms that will compute suitable explanations for ML algorithms based on our insights of what is interpretable. User studies on different explanation settings (methods and visual representations) will allow us to characterize the features of explanations that make them acceptable (i.e., understandable and trustworthy) by users.
SmartFCA: A Smart Tool for Analyzing Complex Data with Formal Concept Analysis
Period: 01/01/2022 – 31/12/2025
Budget: 143k€ (Univ Rennes)
Formal Concept Analysis (FCA) is a mathematical framework based on lattice theory and aimed at data analysis and classification. FCA, which is closely related to pattern mining in knowledge discovery (KD), can be used for data mining purposes in many application domains, e.g. life sciences and linked data. Moreover, FCA is human-centered and provides means for visualization and interaction with data and patterns. Indeed, it is now possible to deal with complex data such as intervals, sequences, trajectories, trees, and graphs. Research in FCA is dynamic, but there is still room for extensions of the original formalism, and many theoretical and practical challenges remain. In particular, no consensual platform currently offers the necessary components for analyzing real-life data. This is precisely the objective of the SmartFCA project: to develop the theory and practice of FCA and its extensions, to make the related components interoperable, and to implement a usable and consensual platform offering the necessary services and workflows for KD.
In particular, to best satisfy the needs of experts in many application domains, SmartFCA will offer a “Knowledge as a Service” (KaaS) component for making domain knowledge operable and reusable on demand.
MeKaNo: Search the Web with Things
Period: 01/10/2022 – 29/09/2026
Budget: 143k€ (Univ Rennes)
In MeKaNo, we aim to search the web with things, in order to get more accurate results over a wide diversity of sources. Traditional web search engines search the web with strings. However, keyword search often returns many irrelevant documents, pushing users to refine their keyword list through a trial-and-error process. To overcome such limitations, major companies now allow searching for things, not strings. When you ask your vocal assistant for the age of “James Cameron”, it locates in a Knowledge Graph (KG) a Person matching “James Cameron” whose property “age” is set to 66 years, i.e. the Thing “James Cameron”. While searching for Things is tremendous progress and delivers exact answers, the search is done over a Knowledge Graph and not over the Web. Consequently, many answers may exist on the web that are not part of the knowledge graph.
To summarize, searching with strings over the web offers diversity at the expense of noise. Searching for Things delivers exact answers, but we lose diversity. In MeKaNo, we aim at searching the web with Things to get diversity and avoid noisy results. To search the web with Things, we face three main scientific challenges:
As part of the scientific animation of the DKM (D7) research department at IRISA, Elisa Fromont (head of the department) co-organises monthly seminars which have featured, in 2023: Damien Eveillard, Frederic Jurie, Colin de la Higuera, Sihem Amer-Yahia, Meghyn Bienvenu, Hendrik Blockeel, Aurélien Bellet.
Romaric Gaudel was co-program chair of CAP 2023 (the French conference on Machine Learning) in Strasbourg.
Apart from
PhD Students
Internships
Engineer