The STACK team addresses challenges related to the management and advanced uses of the Cloud to IoT continuum (infrastructures spanning the Cloud, Fog, Edge, and IoT). More specifically, the team is interested in delivering appropriate system abstractions to operate and use massively geo-distributed infrastructures, from the lowest to the highest levels of abstraction (i.e., system to application development, respectively), and in addressing crosscutting dimensions such as energy or security. These infrastructures are critical for the emergence of new kinds of applications related to the digitalization of the industry and the public sector (a.k.a. the Industrial Internet, smart cities, e-medicine, etc.).
Initially proposed to interconnect computers worldwide, the Internet has significantly evolved to become, in two decades, a key element in almost all our activities. This (r)evolution mainly relies on the progress achieved in the computation and communication fields, which in turn has led to the well-known and widespread Cloud Computing paradigm. Nowadays most Internet exchanges occur between endpoints ranging from small-scale devices, such as smart-phones, to large-scale facilities, i.e., cloud computing platforms, in charge of hosting modern information systems.
With the emergence of the Internet of Things (IoT), stakeholders expect a new revolution that will push, once again, the limits of the Internet, in particular by favouring the convergence between physical and virtual worlds into an augmented world or cyber-physical world.
This convergence is about to be made possible thanks to the development of minimalist sensors as well as complex industrial physical machines that can be connected to the Internet through edge computing infrastructures. Edge computing is an extension of the cloud computing model that consists in deploying a federation (or cooperation) of smaller data centers at the edge of the network, thus closer to sensors, devices, machines, and end-users that produce and consume data 76, 74.
This new kind of digital infrastructure, which covers resources from the “center” to the extreme edge of the network, is expected to improve almost all aspects of daily life and the decision processes in various domains such as industry, transportation, health, training and education. The corresponding applications target the control and optimization of the business processes of most companies thanks to the intensive use of ICT systems and real-time data collected by geo-distributed physical devices (video, sensors, ...).
Among the obstacles to this new generation of Internet services is the development of a convenient and powerful software stack, i.e., a framework that allows operators and DevOps to manage the life-cycle of both the digital infrastructure and the applications deployed on top of it, throughout the Cloud to IoT continuum. This includes operations such as the initial configuration, but also all the reconfigurations that may be required in response to particular events (maintenance operations, equipment failures, application load variations, user mobility, energy shortages, etc.).
The existing software stacks that have been proposed to manage Cloud Computing platforms are not appropriate for handling the specifics of the next generation of digital infrastructures (in terms of scale, heterogeneity, dynamicity, security threats, and energy opportunities). For example, these infrastructures have to be operated remotely and as automatically as possible, since it will be impossible to have a human presence at every location. Because of the number of sites, it will be necessary to allow operations not on a single site but on sets of sites defined on the fly as needed. Moreover, the management mechanisms must be designed to cope with intermittent network access to the sites, that is, they must offer safety properties on the one hand and, on the other hand, enough autonomy for each site to remain as operational as possible in the event of network partitioning. Finally, existing interfaces (APIs) should be extended to turn location into a first-class citizen. In particular, locality aspects should be reified from the core system building blocks up to the high-level application programming interfaces.
The STACK activities cover the full Cloud to IoT continuum, including recent challenges related to the network dimension and urgent computing. This enlargement of STACK's core activities started with Assoc. Prof. Koutsiamanis, Assoc. Prof. Piamrat and Dr. Balouek, who joined the team in 2021, 2022 and 2023 respectively, and will be further strengthened by the arrival of Orange members expected in 2024.
Through the ongoing integration of Orange members, STACK consolidates its expertise in distributed systems, networks, cyber-physical systems, IoT, device management, and software programming, while combining significant skills in the design, practical development and evaluation of large-scale systems. More precisely, our research activities mainly rely on a set of scientific foundations detailed below.
We aim at strengthening the knowledge in these different areas through two kinds of contributions: first, scientific articles, as a regular project team, and second, concrete pieces of software that can be transferred to major open-source communities.
STACK activities have been focused on the management and programming of geo-distributed data centers with a work program defined around four research topics as depicted in Figure 1a.
The first two are related to the resource management mechanisms and the programming support that are mandatory to operate and use geo-distributed ICT resources (compute, storage, network). They are transversal to the three software layers that generally compose a software stack (System/Middleware/Application in Figure 1b) and nurture each other (i.e., the resource management mechanisms leverage abstractions/concepts proposed by the programming support axis, and reciprocally). The third and fourth research topics are related to the Energy and Security dimensions (both also crosscutting the three software layers). Although they could have been merged with the first two axes, we identified them as independent research directions due to their critical aspects with respect to the societal challenges they represent.
This scientific roadmap to address challenges related to the management and programming of geo-distributed infrastructures applies to the Cloud to IoT continuum and continues to have significant scientific and socio-economic impact. Hence, STACK organizes its activities around these four crosscutting lines that form a unique approach.
Additionally, our activities extend to the management of IoT devices with the ultimate goal of covering the entire Cloud-to-IoT continuum through a common software stack.
Our vision is to base this computing continuum software stack on control loops following the MAPE-K model, which can be seen as an infinite loop that monitors the infrastructure as well as the state of the applications in order to maintain, in an autonomous manner, the expected objectives (in terms of performance, robustness, etc.).
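To make the idea concrete, the following minimal Python sketch illustrates the skeleton of such an autonomic loop; the metric names, thresholds and scaling action are purely hypothetical placeholders, not the actual STACK controllers.

```python
import time

# Illustrative MAPE-K skeleton: metric names, thresholds and the scaling action
# are hypothetical; a real controller would plug in the probes and actuators of
# the targeted orchestrator (Kubernetes, OpenStack, IoT gateways, ...).
knowledge = {"cpu_threshold": 0.8, "history": []}

def monitor():
    """Collect raw observations from the infrastructure and the applications."""
    return {"cpu_load": 0.85, "replicas": 3}          # stubbed values

def analyze(observations, kb):
    """Compare observations against the objectives stored in the knowledge base."""
    kb["history"].append(observations)
    return observations["cpu_load"] > kb["cpu_threshold"]

def plan(observations):
    """Decide on a reconfiguration; here, a naive scale-out rule."""
    return {"action": "scale_out", "replicas": observations["replicas"] + 1}

def execute(decision):
    """Apply the decision through the orchestrator's API (stubbed here)."""
    print(f"applying {decision}")

for _ in range(3):            # a real controller loops indefinitely
    obs = monitor()
    if analyze(obs, knowledge):
        execute(plan(obs))
    time.sleep(1)             # monitoring period
```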
Although the MAPE-K model is widely adopted in Cloud orchestrators such as Kubernetes, delivering a MAPE-K software stack for the computing continuum faces multiple challenges. The first one is related to the diversity of resources to consider. KubeEdge 79, for instance, proposes extensions to integrate servers and IoT devices under the same framework. However, the supported operations are rather limited, as they only cover communication between software components running on servers at the edge and the connected devices. In other words, the IoT devices are not considered in the control loops. From our viewpoint, this weak integration is linked to an incomplete understanding of the needs that such a system must take into account (in particular on the IoT device management side). To favor the integration of both dimensions into a common system, we aim to identify major structural properties as well as management operations necessary at the operator and DevOps levels, and to implement them when needed. A second important challenge is related to the geo-distribution property (and thus the intermittent network connectivity) of this type of infrastructure. This increased complexity implies revising the way control loops are designed in order to handle frequent disconnections that can occur at any time (for instance due to the low energy level of IoT devices). Here, our approach is to combine autonomous loops with well-adapted formal methods to guarantee their verification or to synthesize correct-by-design decisions. Additionally, we study their performance models to be able to adapt Cloud-to-IoT infrastructures and their hosted applications automatically, safely, efficiently and in a timely manner according to different objectives (performance, energy, security, etc.). Finally, we are working towards partitioning a Cloud-to-IoT infrastructure into several areas and delivering the illusion of a single system through a federated approach: each area is managed by an independent controller, and collaborations between areas are done through dedicated middleware. The innovative aspect lies in the way this middleware is developed so that it is reliable in spite of increasing scale and faults (collaborations are triggered only on demand, without maintaining a global knowledge base of the entire infrastructure).
Some of the research questions we address in the medium term are:
All the aforementioned research questions are addressed through several application fields: telecommunications operators and smart buildings in the first place, through the privileged partnership with the Orange colleagues who are going to join this new team, but also health, in particular biomedical research, in order to allow the execution of currently emerging analyses in large-scale geo-distributed environments.
Industrial/Tactile Internet/Cyber-Physical applications highlight the importance of the computing continuum model. Hence, the use-cases of STACK activities are driven and nurtured by these application domains. Besides, it is noteworthy to mention that Telecom operators such as Orange have been among the first to advocate the deployment of Fog/Edge infrastructures. The initial reason is that a geo-distributed infrastructure enables them to virtualize a large part of their resources and thus reduce capital and operational costs. As an example, several researchers have been investigating, through the IOLab, the joint lab between Orange and Inria, how 5G networks can be managed. We highlight that while our expertise does partially include the network side, the main focus is rather on how we can deploy, locate and reconfigure the software components that are mandatory to operate the next generation of network/computing infrastructures. The main challenges are related to the high dynamicity of the infrastructure, the way of defining the Quality of Service of applications, and how it can be guaranteed. We expect our contributions will deliver advances in location-based services, optimized local content distribution (data-caching) and Mobile Edge Computing 2. In addition to bringing resources close to end-users, massively geo-distributed infrastructures should favor the development of more advanced networks as well as mobile services.
Supporting industrial actors and open-source communities in building an advanced software management stack is a key element to favor the advent of new kinds of information systems as well as web applications. Augmented reality, telemedicine and e-health services, smart-city, smart-factory, smart-transportation and remote security applications are under investigation. Although STACK does not intend to address directly the development of such applications, understanding their requirements is critical to identify how the next generation of ICT infrastructures should evolve and what the appropriate software abstractions for operators, developers and end-users are. STACK team members have been exchanging since 2015 with a number of industrial groups (notably Orange Labs and Airbus), a few medical institutes (public and private ones) and several telecommunication operators in order to identify both opportunities and challenges in each of these domains, described hereafter.
The Industrial Internet domain gathers applications related to the
convergence between the physical and the virtual world. This convergence
has been made possible by the development of small, lightweight
and cheap sensors as well as complex industrial physical machines that
can be connected to the Internet. It is expected to improve most
processes of daily life and decision processes in all societal
domains, affecting all corresponding actors, be they individuals and
user groups, large companies, SMEs or public institutions.
The corresponding applications cover: the improvement of business
processes of companies and the management of institutions (e.g., accounting, marketing, cloud manufacturing, etc.); the development of
large “smart” applications handling large amounts of geo-distributed
data and a large set of resources (video analytics, augmented
reality, etc.); the advent of future medical prevention and treatment
techniques thanks to the intensive use of ICT systems, etc.
We expect our contributions to favor the rise of efficient, correct
and sustainable massively geo-distributed infrastructures that are
mandatory to design and develop such applications.
The Internet of Skills is an extension of the Industrial Internet to
human activities. It can be seen as the ability to deliver physical
experiences remotely (i.e., via the Tactile Internet). Its main
supporters advocate that it will revolutionize the way we teach,
learn, and interact with pervasive resources. As most applications of
the Internet of Skills are related to real-time experiences, latency
may be even more critical than for the Industrial Internet, making
the locality of computations and resources a priority. In addition
to identifying how a Utility Computing infrastructure can cope with
this requirement, it is important to determine
how the quality of service of such applications should be defined and how
latency and bandwidth constraints can be guaranteed at the
infrastructure level.
The e-Health domain constitutes an important societal application domain of the two previous areas. The STACK team is investigating distribution, security and privacy issues in the fields of systems and personalized (a.k.a. precision) medicine. The overall goal in these fields is the development of medication and treatment methods that are tailored towards small groups or even individual patients.
We have been working on different projects since the beginning of STACK (e.g., PrivGen CominLabs). In general, we are applying and developing corresponding techniques for the medical domains of genomics, immunobiology and transplantology (see Section 10).
The STACK team continues to contribute to the e-Health domain by harnessing advanced architectures, applications and infrastructures for the Fog/Edge.
Telecom operators have been among the first to advocate the deployment
of massively geo-distributed infrastructure, in particular through
working groups such as the Mobile Edge Computing group at the European
Telecommunications Standards Institute.
The initial reason is that a geo-distributed infrastructure enables
Telecom operators to virtualize a large part of their resources and
thus reduces capital and operational costs. As an example, we are
investigating, through the I/O Lab, the joint lab between Orange and
Inria, how Centralized Radio Access Networks (a.k.a. C-RAN or
Cloud-RAN) can be supported for 5G networks. We highlight that our
expertise is not on the network side but rather on where and how we
can deploy, allocate and reconfigure software components, which are
mandatory to operate a C-RAN infrastructure, in order to guarantee
the quality of service expected by the end-users.
Finally, working with actors from the network community is a valuable
advantage for a distributed system research group such as STACK.
Indeed, achievements made within one of the two communities serve the
other.
Urgent Computing refers to a class of time-critical scientific applications that leverage distributed data sources to facilitate important decision-making in a timely manner. The overall goal of Urgent Computing is to predict the outcome of scenarios early enough to prevent critical situations or to mitigate their negative consequences. Motivating use cases refer to rapid-response scenarios across the Cloud-to-IoT continuum, such as natural disaster management, which require gathering the local state of each device, transforming it into global knowledge of the network, characterizing the observed phenomenon according to an applied model, and, finally, triggering appropriate actions. The STACK team investigates Urgent Computing through the realization of a fluid ecosystem where distributed computing resources and services are aggregated on demand to support delay-sensitive and data-driven workflows.
In addition to international travel, the environmental footprint of our research activities is linked to our intensive use of large-scale testbeds such as Grid'5000 (STACK members are often among the top 10 largest consumers). Although access to such facilities is critical to move forward in our research roadmap, it is important to recognize that they have a strong environmental impact, as described in the next paragraph.
The environmental impact of digital technology is a major scientific and societal challenge. Even though software may look like a virtual object, it is executed on very real hardware that contributes to the carbon footprint. This impact materializes during the manufacturing and destruction of hardware infrastructure (estimated at 45% of digital consumption in 2018 by the Shift Project) and during the software use phase via terminals, networks and data centers (estimated at 55%). STACK members have been studying various approaches for several years to reduce the energy footprint of digital infrastructures during the use phase. The work carried out revolves around two main axes: (i) reducing the energy footprint of infrastructures and (ii) adapting the software applications hosted by these infrastructures according to the energy available. More precisely, this second axis investigates possible improvements that could be made by the end-users of the software themselves. At scale, involving end-users in decision-making processes concerning energy consumption would lead to more frugal Cloud computing.
In 2023, the team has taken part in two Inria Challenges:
Last but not least, STACK started in 2022 a new platform project to design an innovative hardware infrastructure for the scientific study of the cross-cutting issues of computing infrastructures supporting artificial intelligence and their energy autonomy (see the SAMURAI project in Section 10).
Regarding scientific contributions, the team has produced major results on the management of large-scale infrastructures, in particular one publication in the prestigious journal ACM Computing Surveys 9. The team also continued to consolidate its presence in the network community (e.g., in Elsevier Computer Networks 11).
On the software side, the team has pursued its efforts on the development of the EnosLib library and the resulting artifacts to help researchers perform experiment campaigns with direct contributions from several research engineers of the team.
Finally, on the platform side, we continued our effort and took part in the different actions around SLICES and SLICES-FR; see Section 10.
The team has received the 3rd Best paper award at the 16th IEEE/ACM International Conference on Utility and Cloud Computing (UCC 2023) for its paper 24.
Finally, it is noteworthy to mention the new position of Adrien Lebre as the leader of the Cloud, Network and system program at Inria since December 2023 as well as the co-leader of the French Cloud PEPR action.
Enos workflow:
A typical experiment using Enos is a sequence of several phases:
- enos up: Enos reads the configuration file, gets machines from the resource provider and prepares the next phase.
- enos os: Enos deploys OpenStack on the machines. This phase relies heavily on Kolla deployment.
- enos init-os: Enos bootstraps the OpenStack installation (default quotas, security rules, ...).
- enos bench: Enos runs a list of benchmarks. Enos supports Rally and Shaker benchmarks.
- enos backup: Enos backs up the gathered metrics, logs and configuration files from the experiment.
EnOSlib is a library to help you with your distributed application experiments. The main parts of your experiment logic are made reusable by the following EnOSlib building blocks:
- Reusable infrastructure configuration: the provider abstraction allows you to run your experiment on different environments (locally with Vagrant, Grid’5000, Chameleon and more).
- Reusable software provisioning: in order to configure your nodes, EnOSlib exposes different APIs with different levels of expressivity.
- Reusable experiment facilities: tasks help you organize your experimentation workflow.
EnOSlib is designed for experimentation purposes: benchmarking in a controlled environment, academic validation, etc.
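As an illustration, the sketch below shows how a small experiment could be expressed with EnOSlib's Grid'5000 provider. The class and method names follow EnOSlib's provider/task API as we understand it and may differ between versions; the job name and cluster are placeholders.

```python
import enoslib as en

# Minimal EnOSlib experiment sketch on Grid'5000 (names are indicative).
en.init_logging()

# Reusable infrastructure configuration: describe the resources to reserve.
conf = (
    en.G5kConf.from_settings(job_name="stack-demo", walltime="00:30:00")
    .add_machine(roles=["control"], cluster="paravance", nodes=1)
    .add_machine(roles=["compute"], cluster="paravance", nodes=2)
)
provider = en.G5k(conf)
roles, networks = provider.init()        # reserve and provision the nodes

# Reusable software provisioning: run a command on all "compute" nodes.
results = en.run_command("nproc", roles=roles["compute"])
print(results)                           # inspect stdout/stderr per node

provider.destroy()                       # release the reservation
```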
OpenStack is the de facto open-source management system to operate and use Cloud Computing infrastructures. Started in 2012, the OpenStack foundation gathers 500 organizations, including groups such as Intel, AT&T, RedHat, etc. The software platform relies on tens of services with a 6-month development cycle. It is composed of more than 2 million lines of code, mainly in Python, just for the core services. While these aspects make the whole ecosystem evolve quite swiftly, they are also good signs of maturity for this community.
We created and animated, between 2016 and 2018, the Fog/Edge/Massively Distributed (FEMDC) Special Interest Group and have been contributing to the Performance working group since 2015. The former investigates how OpenStack can address Fog/Edge Computing use cases whereas the latter addresses scalability, reactivity and high-availability challenges. In addition to releasing white papers and guidelines 54, the major result from the academic viewpoint is the aforementioned EnOS solution, a holistic framework to conduct performance evaluations of OpenStack (control and data plane). In May 2018, the FEMDC SIG turned into a larger group under the control of the OpenStack foundation. This group gathers large companies such as Verizon, AT&T, etc. Although our involvement has been less important since 2020, our participation is still significant. For instance, we co-signed the second white paper delivered by the edge working group in 2020 55 and are still taking part in the working group, investigating new use-cases as well as the resulting challenges.
Grid'5000 is a large-scale and versatile testbed for experiment-driven research in all areas of computer science, with a focus on parallel and distributed computing including Cloud, HPC and Big Data. It provides access to a large amount of resources: 12000 cores and 800 compute-nodes grouped in homogeneous clusters, featuring various technologies (GPU, SSD, NVMe, 10G and 25G Ethernet, Infiniband, Omni-Path) and advanced monitoring and measurement features for trace collection of networking and power consumption, providing a deep understanding of experiments. It is highly reconfigurable and controllable. STACK members are strongly involved in the management and the supervision of the testbed, notably through the steering committee or the SeDuCe testbed described hereafter.
The SeDuCe Project aims to deliver a research testbed dedicated to holistic research studies on the energy aspects of datacenters. Part of the Grid'5000 Nantes site, this infrastructure is composed of probes that measure the power consumption of each server, each switch and each cooling system, and also measure the temperature at the front and the back of each server. These sensors enable research covering the full spectrum of the energy aspects of datacenters, such as cooling and power consumption depending on experimental conditions.
The testbed is connected to renewable energy sources
(solar panels). This “green” datacenter enables researchers
to perform real experiment-driven studies on fields such as
temperature-based scheduling or “green”-aware software (i.e., software that takes into account renewable energy and weather
conditions).
In 2023, we consolidated the development of PiSeDuCe, a deployment and reservation system for Edge Computing infrastructures composed of multiple Raspberry Pi clusters, started in 2020. Typically, a cluster of 8 Raspberry Pis costs less than 900 euros and only needs an electrical outlet and a Wi-Fi connection for its installation and configuration. Funded by the CNRS through the Kabuto project, and in connection with the SLICES-FR initiative, we have extended PiSeDuCe to propose a device-to-cloud deployment system (from devices on FIT IoT-Lab to servers in Grid’5000). PiSeDuCe and SeDuCe led us to submit the SAMURAI CPER proposal.
In 2022, the SAMURAI (Sustainable And autonoMoUs gReen computing for AI) project was accepted, as a part of the energy and digital transition theme. The project aims at reinforcing an innovative hardware infrastructure for the scientific study of the intersecting problems of computing infrastructure that supports artificial intelligence and its energy autonomy. SAMURAI will extend SeDuCe towards energy autonomy by adding a smart and clean energy storage system. Additionally, it will extend the capabilities of the platform by adding AI computing nodes (GPUs) for the scientific study of AI tools. Finally, it will also add new sensor nodes within the Nantes connected object platform (Nantes nodes of the national FIT IoT-Lab platform) to support future work on embedded AI. In 2023, we built the IoT platform and its network integration with the SeDuCe and G5K platforms.
STACK members are involved in the definition and bootstrapping of the SLICES-FR infrastructure. This infrastructure can be seen as a merge of the Grid'5000 and FIT testbeds with the goal of providing a common platform for experimental Computer Science (Next Generation Internet, Internet of Things, clouds, HPC, big data, etc.). In 2022, STACK contributions were mainly related to SLICES, the related European initiative. In particular, members have taken part in the SLICES-DS project (Design Study) as well as the SLICES-PP project (Preparatory Phase) of the SLICES-RI action.
The evolution of the cloud computing paradigm in the last decade has amplified the access to on-demand services (in an economically attractive, easy-to-use manner, etc.). However, the current model, built upon a few large datacenters (DCs), may not be suited to guarantee the needs of new use cases, notably the boom of the Internet of Things (IoT). To better respond to the new requirements (in terms of delay, traffic, etc.), compute and storage resources should be deployed closer to the end-user, forming, together with the national and regional data centers, a new computing continuum. The question is then how to manage such a continuum to provide end-users the same services that made the current cloud computing model so successful. In 2023, we continued our effort to answer this question and delivered multiple contributions. We also started new activities around Urgent Computing (see Section 4).
The team has continued activities covering the Cloud-to-IoT continuum. First, an architecture that covers the whole continuum has been proposed in 14. It deploys hierarchical federated learning (HFL) to cope with the limitations of traditional FL, which suffers from several issues such as (i) robustness, due to a single point of failure, (ii) latency, as it requires a significant amount of communication resources, and (iii) convergence, due to system and statistical heterogeneity. HFL adds the edge servers as an intermediate layer for sub-model aggregation: several iterations are performed before the global aggregation at the cloud server takes place, thus making the overall process more efficient, especially with non-independent and identically distributed (non-IID) data. Moreover, using traditional Artificial Neural Networks (ANNs) with HFL consumes a significant amount of energy, further hindering the application of decentralized FL on energy-constrained mobile devices. Therefore, this paper presents HFedSNN: an energy-efficient and fast-convergence model obtained by incorporating Spiking Neural Networks (SNNs) within HFL. SNNs are a new generation of neural networks, which promise tremendous energy and computation efficiency improvements. Taking advantage of HFL and SNNs, numerical results demonstrate that HFedSNN outperforms FL with SNN (FedSNN) in terms of accuracy and communication overhead by 4.48% and 26×, respectively. Furthermore, HFedSNN significantly reduces energy consumption by 4.3× compared to FL with ANN (FedANN).
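The efficiency argument rests on performing several cheap aggregations at the edge before each costly aggregation at the cloud. The sketch below illustrates this two-level weighted averaging (hierarchical FedAvg) on plain NumPy parameter vectors; it is a didactic illustration under simplified assumptions, not the HFedSNN implementation of 14.

```python
import numpy as np

def fedavg(models, weights):
    """Weighted average of model parameter vectors (FedAvg)."""
    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()
    return sum(w * m for w, m in zip(weights, models))

def hierarchical_round(edges, local_update, edge_iters=5):
    """One global round: several edge-level aggregations, then one cloud-level one.
    `edges` maps an edge server to its clients: {edge_id: [(model, n_samples), ...]}."""
    edge_models, edge_sizes = [], []
    for clients in edges.values():
        for _ in range(edge_iters):                        # sub-model aggregation at the edge
            clients = [(local_update(m), n) for m, n in clients]
            agg = fedavg([m for m, _ in clients], [n for _, n in clients])
            clients = [(agg.copy(), n) for _, n in clients]  # push the edge model back to clients
        edge_models.append(agg)
        edge_sizes.append(sum(n for _, n in clients))
    return fedavg(edge_models, edge_sizes)                 # global aggregation at the cloud

# Toy usage: 2 edge servers, 2 clients each, 4-parameter models (hypothetical numbers).
edges = {e: [(np.random.rand(4), 100 + 50 * c) for c in range(2)] for e in range(2)}
global_model = hierarchical_round(edges, local_update=lambda m: m + 0.01)
print(global_model)
```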
Moreover, as the complexity and scale of modern networks continue to grow, the need for efficient, secure management, and optimization becomes increasingly vital. Digital twin (DT) technology has emerged as a promising approach to address these challenges by providing a virtual representation of the physical network, enabling analysis, diagnosis, emulation, and control. The emergence of Software-defined network (SDN) has facilitated a holistic view of the network topology, enabling the use of Graph neural network (GNN) as a data-driven technique to solve diverse problems in future networks. The survey 10 explores the intersection of GNNs and Network digital twins (NDTs), providing an overview of their applications, enabling technologies, challenges, and opportunities. We discuss how GNNs and NDTs can be leveraged to improve network performance, optimize routing, enable network slicing, and enhance security in future networks. Additionally, we highlight certain advantages of incorporating GNNs into NDTs and present two case studies. Finally, we address the key challenges and promising directions in the field, aiming to inspire further advancements and foster innovation in GNN-based NDTs for future networks.
With an orientation specific to 5G networks and cloud computing, in 18 we introduce the Mobile Edge Slice Broker (MESB), a novel solution for orchestrating the deployment and operation of mobile edge slices in multi-cloud environments. We focus on reconciling the challenges faced by mobile edge slice tenants in selecting Cloud Infrastructure Providers (CIPs), ensuring seamless integration of network functions, services, and applications while maintaining quality of service (QoS). Our MESB architectural design enables dynamic monitoring, reconfiguration, and optimization of mobile edge slices to meet evolving requirements. This work represents a first step in enhancing the efficiency and effectiveness of resource management in the convergence of telecom and cloud services in 5G networks.
In line with the previous works in network resource management, in 11 we continue to focus on efficient and secure network management by applying federated learning specifically to 5G cellular traffic prediction. In this work we emphasize the viability of various neural network models, including RNNs and CNNs, in a federated setting for time-series forecasting, while also exploring the effectiveness of different federated aggregation functions in handling non-identically distributed data. Additionally, we demonstrate that preprocessing techniques at local base stations significantly improve prediction accuracy, and that federated learning offers substantial benefits in reducing carbon emissions, computational demands, and communication costs, making it a viable and efficient approach for large-scale and intelligent resource management in network systems.
19 introduces CloudFactory, an IaaS workload generator, as a solution to address the representativeness problem. This tool includes a library for sharing IaaS statistics and a generator for producing realistic VM workloads based on these statistics. CloudFactory is offered as open-source software for use by Cloud providers and researchers to facilitate the evaluation of new contributions. To demonstrate the relevance of CloudFactory, we discuss an analysis of the scheduling evolution for different IaaS workload intensities of two different cloud providers (Microsoft Azure and Chameleon). OVHcloud statistics, computed from CloudFactory, are also reported, and a comparison is made with other Cloud providers.
Continuum Computing encompasses the integration of diverse infrastructures, including cloud, edge, and fog, to facilitate the seamless migration of applications based on their specific needs, ensuring optimal satisfaction of their requirements. In particular, Urgent applications aim at processing distributed data sources while considering the uncertainty of the infrastructure and the timeliness constraints of data-driven workflows. The primary obstacles in this particular context mostly pertain to the inability to promptly respond to changes in the environment or in the quality of service (QoS) constraints of the application, as well as the inability to maintain an application in a stateless manner, hence impeding its relocation without the risk of data loss. This line of work is highly motivated by natural disaster case studies. In 17, we leverage three different platforms to combine smoke detection at the edge using camera imagery, wildfire simulations in the cloud, and finally heavy HPC models for generating pollution concentration maps. The resulting workflow is a prime example of a Cloud-to-IoT continuum use case that monitors and manages the air quality impacts of remote wildfires. In this context, we frame challenges and discuss how the R-Pulsar programming system can support urgent data-processing pipelines that trade off the content of data, the cost of computation, and the urgency of the results.
In 24, we tackle the aforementioned issues through the introduction of a framework based on a Function-as-a-Service (FaaS) and event-driven architecture. This framework enables the decomposition, localization, and relocation of applications inside a Continuum infrastructure, facilitated by a rule engine that is both system- and data-aware. Results highlight the use of the execution history as a basis for determining the optimal location for executing a function, hence guaranteeing the desired Quality of Service (QoS). The evaluation was carried out by simulating a smoke detection use case as a representative scenario for Urgent Computing. We extended this approach in the context of the TEMA Horizon Europe project that aims at addressing Natural Disaster Management 25. We introduced the Urgent Function Enabler (UFE) platform, a fully distributed architecture able to define, spread, and manage FaaS functions, using local IoT data and a computing infrastructure composed of mobile and stable nodes.
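As a simplified illustration of this history-driven placement idea (and not the actual rule engine of 24 or of the UFE platform), one can select where to run a function by comparing its past latencies on each candidate site against the QoS deadline; all site names and numbers below are hypothetical.

```python
from statistics import mean

# Hypothetical execution history: per site, observed end-to-end latencies (s)
# for a given function.
history = {
    "edge-camera-gw": [0.9, 1.1, 1.0],
    "regional-dc":    [0.5, 0.6, 0.7],
    "central-cloud":  [1.8, 2.0, 1.9],
}

def place(function_history, qos_deadline_s):
    """Pick the site whose average past latency best fits the QoS deadline."""
    averages = {site: mean(lat) for site, lat in function_history.items()}
    eligible = {s: l for s, l in averages.items() if l <= qos_deadline_s}
    if not eligible:                                  # no site meets the deadline:
        return min(averages, key=averages.get)        # degrade gracefully
    return min(eligible, key=eligible.get)

print(place(history, qos_deadline_s=1.0))             # -> "regional-dc"
```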
This approach presents significant potential in numerous fields. We are currently exploring a climate risk management framework based on real data from the Danish Meteorological and Hydrological Institutes (DMI and DHI), coupled with road condition (pavement bearing capacity) data from authorities 23. The aim of this project is to enhance a risk assessment model for evaluating pavement structure functionality during extreme weather events.
In 6, we propose a detailed overview of the current state-of-the-art in terms of Fog modeling languages. We relied on our long-term experience in Cloud Computing and Cloud Modeling to contribute a feature model describing what we believe to be the most important characteristics of Fog modeling languages. We also performed a systematic scientific literature search and selection process to obtain a list of already existing Fog modeling languages. Then, we evaluated and compared these Fog modeling languages according to the characteristics expressed in our feature model. As a result, we discuss in this paper the main capabilities of these Fog modeling languages and propose a corresponding set of open research challenges in this area.
The size, complexity, and heterogeneity of Fog systems throughout their life cycle pose challenges and potential costs. To address this, 15 proposes a generic model-based approach called VeriFog, utilizing a customizable Fog Modeling Language (FML) for verifying Fog systems during the design phase. The approach was tested with three use cases from different application domains, considering various non-functional properties for verification (energy, performance, security). Developed in collaboration with the industrial partner Smile, this approach represents a crucial step towards comprehensive model-based support for the entire life cycle of Fog systems.
In the context of the OTPaaS project, we have started to study the anatomy of configuration languages with the objective of unveiling the building blocks of such languages. In particular, we have focused on a very expressive and declarative new configuration language called CUE (CUE stands for Configure Unify Execute). CUE is open source, inspired by previous work at Google. It represents configurations in a way inspired by typed feature structures, a concept inherited from computational linguistics. This is not the first time that a language has been built on top of this concept: LIFE, built in the 80s, revolves around psi-terms, a variant of typed feature structures. Whereas CUE is a domain-specific language, LIFE is a generic language, which combines logic programming, functional programming and object-oriented programming. As such, it potentially provides a general framework to relate configuration as data and computations. We have presented our initial findings on the relationship between CUE and LIFE at CONFLANG '23, a SPLASH workshop 29.
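The core intuition behind typed feature structures, and behind CUE-style configuration merging, is unification: two partial descriptions are combined by filling in missing fields, keeping agreeing values, and failing on conflicts. The following Python sketch captures only this intuition over nested dictionaries; it reflects neither CUE's nor LIFE's actual semantics (no types, constraints or defaults).

```python
class Conflict(Exception):
    """Raised when two feature structures cannot be unified."""

def unify(a, b):
    """Unify two feature structures represented as nested dicts / atomic values.
    Missing fields are filled in, equal atoms are kept, conflicting atoms fail,
    which is the basic intuition behind CUE-style configuration merging."""
    if isinstance(a, dict) and isinstance(b, dict):
        out = dict(a)
        for key, value in b.items():
            out[key] = unify(a[key], value) if key in a else value
        return out
    if a == b:
        return a
    raise Conflict(f"cannot unify {a!r} with {b!r}")

# Hypothetical configuration fragments being merged.
base    = {"service": {"replicas": 3, "image": "nginx"}}
overlay = {"service": {"replicas": 3, "port": 80}}
print(unify(base, overlay))
# {'service': {'replicas': 3, 'image': 'nginx', 'port': 80}}
```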
For a few years, the team has been working on deployment and dynamic reconfiguration of distributed software systems through the tool Concerto 42. By offering a programmable lifecycle of resources (services, devices etc.), the team has shown multiple times that the level of parallelism and asynchronism when executing reconfigurations can be increased, thus reducing the execution time of reconfigurations. The accumulated knowledge on component-based distributed software reconfiguration has made possible the publication in 2023 of a survey in ACM Computing Surveys 9, a great success. In particular, this survey analyzes the state of the art on reconfiguration through the spectrum of verification. Indeed, given the complexity of distributed software and the adverse effects of incorrect reconfigurations, a suitable methodology is needed to ensure the correctness of reconfigurations in component-based systems. This survey gives the reader a global perspective over existing formal techniques that pursue this goal. It distinguishes different ways in which formal methods can improve the reliability of reconfigurations, and lists techniques that contribute to solving each of these particular scientific challenges.
In addition to this survey, last year the team worked on the decentralized version of Concerto, namely Concerto-D. Such a decentralized reconfiguration tool is very useful when building global knowledge on the state of the system is difficult, impossible, or very costly because of faults, disconnections or scale. While a publication on Concerto-D itself is not yet available, the usage of Concerto-D has been explored and published through the three use cases described below.
First, Cyber-Physical Systems (CPS) deployed in scarce-resource environments like the Arctic Tundra (AT) face extreme conditions. Nodes deployed in such environments have to carefully manage a limited energy budget, forcing them to alternate long sleeping and brief uptime periods. During uptimes, nodes can collaborate for data exchanges or computations by providing services to other nodes. Deploying or updating such services on nodes requires coordination to prevent failures (e.g., sending new/updated APIs, waiting for service activation/deactivation, etc.). In 20, we study analytically and experimentally to what extent using relay nodes for communications reduces the deployment and update durations (reconfiguration), and which factors influence this reduction. Intuitively, when dealing with sleeping nodes with low chances of uptime overlaps (no uptime synchronization), a coordinated deployment or update takes very long to finish if nodes have to communicate directly. In 21 we evaluate and study nodes' energy consumption during the coordination of reconfiguration tasks according to different CPS configurations (i.e., number of nodes, uptime duration, radio technology, or relay node availability). Results show high energy consumption in scenarios where nodes wake up specifically to deploy/update. It is shown that it is beneficial to perform adaptation tasks while overlapping with existing uptimes (i.e., those reserved for observation activities). This paper also evaluates and studies how nodes' uptime duration and relay node availability influence energy consumption. Increasing the uptime duration can reduce energy consumption by up to 12%.
Second, Concerto-D has been studied and used in the context of the SeMaFoR ANR project. The context of this project is Fog Computing, a paradigm aiming to decentralize the Cloud by geographically distributing computation, storage and network resources as well as related services. This notably reduces bottlenecks and data movement. However, managing Fog resources is a major challenge because the targeted systems are large, geographically distributed, unreliable and very dynamic. Cloud systems are generally managed via centralized autonomic controllers automatically optimizing both application QoS and resource usage. To enable the self-management of Fog resources, we propose to orchestrate a fleet of autonomic controllers in a decentralized manner, each with a local view of its own resources. In 13, we present our SeMaFoR (Self-Management of Fog Resources) vision that aims at collaboratively operating Fog resources. SeMaFoR is a generic approach made of three cornerstones: an Architecture Description Language for the Fog, a collaborative and consensual decision-making process, and an automatic coordination mechanism for reconfiguration (based on Concerto-D).
Finally, the usage of Concerto-D has been studied in the particular case of Urgent Computing. This field considers sudden situations of danger such as climate disasters, fires and air pollution, which occur without warning, possibly within a short time frame, and put people's lives in danger. In such situations, applications that help resolve the situation have to be dynamically deployed, or existing applications have to be updated in a short period of time and according to new, unanticipated constraints. The study in 16 provides a justification for the necessity of dynamic adaptation in applications that are time-sensitive. Furthermore, we present our viewpoint on time-sensitive applications and undertake a thorough analysis of the underlying principles and challenges that need to be resolved in order to accomplish this goal.
Federated Learning (FL) collaboratively trains machine learning models on the data of local devices without having to move the data itself: a central server aggregates models, with privacy and performance benefits but also scalability and resilience challenges. We have presented FDFL (de Lacour et al.), a new fully decentralized FL model and architecture that improves standard FL scalability and resilience with no loss of convergence speed. FDFL provides an aggregator-based model that enables scalability benefits and features an election process to tolerate node failures. Simulation results show that FDFL scales well with network size in terms of computing, memory, and communication compared to related FL approaches.
The activities on this axis are mainly related to the design, development and deployment of the SAMURAI project (Section 7.2.5), a testbed that will allow researchers to investigate energy-related challenges over the computing continuum (from the Cloud to IoT devices/cyber-physical systems).
Additionally, in 22, we aim to address the sustainability of machine learning (ML) models in the context of cellular traffic forecasting, specifically in the emerging 5G networks. We introduce a novel sustainability indicator to assess the trade-off between energy consumption and predictive accuracy of deep learning (DL) models in a federated learning (FL) framework. We then use it to demonstrate that while larger, more complex models offer slightly better accuracy, their significantly higher energy consumption and environmental impact make simpler models more viable for real-world applications, highlighting the importance of energy-aware computing in the development of distributed AI technologies for network management.
This year the STACK team has provided new results on security and privacy issues in the networking and biomedical domains. We have developed new AI-based methods that support new means of property analysis in both domains.
The rapid development of network communication along with the drastic increase in the number of smart devices has triggered a surge in network traffic, which can contain private data and in turn affect user privacy. Recently, Federated Learning (FL) has been proposed in Intrusion Detection Systems (IDS) to ensure attack detection, privacy preservation, and cost reduction, which are crucial issues in traditional centralized machine-learning-based IDS. However, FL-based approaches still exhibit vulnerabilities that can be exploited by adversaries to compromise user data. At the same time, meta-models (including blending models) have been recognized as one of the solutions to improve generalization for attack detection and classification, since they enhance generalization and predictive performance by combining multiple base models. Therefore, we have proposed in 7 a Federated Blending model-driven IDS framework for the Internet of Things (IoT) and Industrial IoT (IIoT), called F-BIDS, in order to further protect the privacy of existing ML-based IDS. It consists of a Decision Tree (DT) and a Random Forest (RF) as base classifiers that first produce the meta-data. Then, the meta-classifier, which is a Neural Network (NN) model, uses the meta-data during the federated training step and finally makes the final classification on the test set. Specifically, in contrast to classical FL approaches, the federated meta-classifier is trained on the meta-data (composite data) instead of user-sensitive data to further enhance privacy. To evaluate the performance of F-BIDS, we used the most recent open cyber-security datasets, called Edge-IIoTset (published in 2022) and InSDN (published in 2020). We chose these datasets because they are recent and contain a large amount of network traffic, including both malicious and benign traffic.
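To give the flavor of the blending stage only (F-BIDS trains the meta-classifier in a federated manner, which the sketch below does not reproduce), the following centralized Python sketch builds DT and RF base classifiers, derives meta-data from their predicted class probabilities, and trains a small neural network on that meta-data. The dataset is synthetic and, for brevity, the base and meta models share the same training split, whereas a real blending setup would use held-out predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for network traffic records (benign vs. malicious).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Base classifiers trained on the (locally held) raw traffic.
base_models = [DecisionTreeClassifier(random_state=0),
               RandomForestClassifier(n_estimators=50, random_state=0)]
for model in base_models:
    model.fit(X_train, y_train)

def meta_features(X):
    # The "meta-data": per-class probabilities of each base model, not raw data.
    return np.hstack([m.predict_proba(X) for m in base_models])

# Meta-classifier (a small neural network) trained on the meta-data only;
# in F-BIDS this training step is the one performed in a federated fashion.
meta_clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
meta_clf.fit(meta_features(X_train), y_train)
print("blended accuracy:", meta_clf.score(meta_features(X_test), y_test))
```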
In Wilmer Garzon's PhD thesis 28, he has shown that researchers require new tools and techniques to address the restrictions and needs of global scientific collaborations over geo-distributed biomedical data. He has identified three kinds of constraints related to global collaborations, namely technical, legal, and socioeconomic constraints.
He has proposed Fully Distributed Collaborations (FDC), which are research endeavors that provide means to analyze massive biomedical information collaboratively while respecting legal and socioeconomic restrictions.
A new random-forest-based AI algorithm, MuSiForest, improves over other existing AI approaches by reducing computation time and the amount of shared data, while having a training precision close to that of centralized random forest techniques.
Finally, he has proposed FeDeRa, a language to specify, deploy, and execute FDC-compliant multi-site scientific workflows.
The ArchOps 2 Chair (for Architecture, Deployment and Administration of Agile IT Infrastructures) is an industrial chair of IMT Atlantique, in partnership with Kelio, an SME specialized in solutions for time and attendance management. It is dedicated to all IMT Atlantique students in the field of IT. It is also a channel for the transfer of high-level skills between researchers, experts and industry.
In 2023, several activities were conducted, such as a hackathon with 100 participants, an event promoting the integration of international students at IMT Atlantique and preliminary discussions on the topic of a joint PhD thesis with the Stack team.
In 2020, during the preparation of the ANR SeMaFoR project, we started a cooperation with Alterway/Smile, an SME specialized in Cloud and DevOps technologies. This cooperation resulted in a joint PhD thesis (called Cifre) entitled "A model-based approach for dynamic, multi-scale distributed systems" started in Nov. 2021.
The first publication to emerge from this collaboration will be presented at next year's SAC conference 15.
In 2021, Inria and OVHcloud signed an agreement to jointly study the problem of a more frugal Cloud. They identified three axes: (i) Software eco-design of Cloud services and applications; (ii) Energy efficiency leverages; (iii) Impact reduction and support for Cloud users. The Stack team obtained a PhD grant and a 24-month post-doc grant.
Pierre Jacquet started his PhD in October 2021, under a co-supervision with the Spirals team with the subject "Fostering the Frugal Design of Cloud Native Applications". Pierre presented a poster, ”Sampling Resource Requirements to Optimize Virtual Machines Sizing”, at the EuroSys conference in April 2022.
Yasmina Bouziem started her postdoc in September 2022 for two years, under a co-supervision with the Avalon team in Lyon with the subject "pulling the energy cost of a reconfiguration execution within reconfiguration decisions".
Since 2022, Orange Labs and the Stack team have launched several PhD grants.
Dan Freeman Mahoro initiated a new collaboration on the theme of "digital twins". This cooperation led to a joint doctoral thesis entitled "Digital twins of complex cyber-physical systems", which began in April 2022, co-supervised with the Orange team in Rennes, and resulted in a first publication 26. However, Dan Freeman Mahoro stopped his PhD in October for several reasons, preferring to join industry as an engineer.
Paul Bori started his PhD in January 2023, with the subject "Container application security: a programmable OS-level approach to monitoring network flows and process executions".
Duc-Thinh Ngo started his PhD in December 2022, on the subject "Dynamic graph learning algorithms for the digital twin in edge-cloud continuum", under a co-supervision with the Orange team in Rennes.
Divi de Lacour started his PhD in January 2022, on the subject "Architecture et services pour la protection des données pour systèmes coopératifs autonomes" (architecture and services for data protection in autonomous cooperative systems), under a co-supervision with the Orange team in Chatillon (Paris south region).
Samia Boutalbi started her PhD in January 2022, on the subject "Secure deployment of micro-services in a shared Cloud RAN/MEC environment", under a co-supervision with the Ericsson team in Paris.
SLICES-RI (Research Infrastructure), which was recently included in the 2021 ESFRI roadmap, aims at building the large infrastructure needed for experimental research on various aspects of distributed computing, networking, IoT and 5G/6G networks. It will provide the resources needed to continuously design, experiment, operate and automate the full lifecycle management of digital infrastructures, data, applications, and services. Based on the two preceding projects within SLICES-RI, SLICES-DS (Design Study) and SLICES-SC (Starting Community), the SLICES-PP (Preparatory Phase) project will validate the requirements to engage into the implementation phase of the RI lifecycle. It will set the policies and decision processes for the governance of SLICES-RI: i.e., the legal and financial frameworks, the business model, the required human resource capacities and training programme. It will also settle the final technical architecture design for implementation. It will engage member states and stakeholders to secure the commitment and funding needed for the platform to operate. It will position SLICES as an impactful instrument to support European advanced research, industrial competitiveness and societal impact in the digital era.
The involvement of the group is rather low if we consider the allocated budget (4K€), but the group is strongly involved at the national level, taking part in different discussions related to SLICES-FR, the French node of SLICES.
Decentralised systems face challenges from sophisticated cyber-attacks that evolve and propagate to disrupt different parts. Additionally, communication overhead makes implementing authentication and access control complex. Existing approaches are unlikely to provide effective access control and multi-stage attack detection due to limited event capture and information correlation. This project offers a framework to improve the security and privacy of decentralised systems through cross-domain access control, collaborative intrusion detection, and dynamic risk management considering resource consumption. It facilitates subsystem collaboration to prevent widespread disruption from attacks and to share threat awareness. The project will develop methods and prototypes utilizing blockchain, federated learning, and multi-agent architectures to enhance access control, detection, risk management, and response capabilities.
The consortium is composed of four partners: Nantes University, LUT University (Finland), Universidad de Castilla - La Mancha (Spain), and Firat University (Turkey). LUT is the coordinator of the project.
DI4SPDS was accepted in July 2023 and will run for 36 months from March 2024, with an allocated budget of 874k€ (230k€ for Stack).
Fog Computing is a paradigm that aims to decentralize the Cloud at the edge of the network to geographically distribute computing/storage resources and their associated services. It reduces bottlenecks and data movement. But managing a Fog is a major challenge: the system is larger, unreliable, highly dynamic and does not offer a global view for decision making. The objective of the SeMaFoR project is to model, design and develop a generic and decentralized solution for the self-management of Fog resources.
The consortium is composed of three partners: LS2N-IMT Atlantique (Stack, NaoMod, TASC), LIP6-Sorbonne Université (Delys), and Alter way/Smile (SME). The Stack team supervises the project. SeMaFoR is running for 42 months (starting in March 2021) with an allocated budget of 506k€ (230k€ for Stack). See the SeMaFoR web site for more information.
The main results of the year 2023 were the survey "A feature-based survey of Fog modeling languages" 6 in the FGCS journal and a position paper 1 in the SEAMS conference ranked 'A'.
Large dataset transfers from one datacenter to another are still an open issue. Currently, the most efficient solution is the exchange of a hard drive with an express carrier, as proposed by Amazon with its SnowBall offer. Recent evolutions regarding datacenter interconnects announce bandwidths from 100 to 400 Gb/s. The contention point is not the network anymore, but the applications, which centralize data transfers and do not exploit the parallelism capacities of datacenters, which include many servers (and especially many network interface cards, NICs). The PicNic project addresses this issue by allowing applications to exploit the network cards available in a datacenter, remotely, in order to optimize transfers (hence the acronym PicNic). The objective is to design a set of system services for massive data transfer between datacenters, exploiting the distribution and parallelization of network flows.
The consortium is composed of several partners: Laboratoire d'Informatique du Parallélisme, Institut de Cancérologie de l’Ouest / Informatique, Institut de Recherche en Informatique de Toulouse, Laboratoire des Sciences du Numérique de Nantes, Laboratoire d'Informatique de Grenoble, and Nutanix France.
PiCNiC will be running for 42 months (starting in Sept 2021 with an allocated budget of 495k€, 170k€ for STACK).
The OTPaaS project targets the design and development of a complete software stack to administrate and use edge infrastructures for the industry sector. The consortium brings together national and user technology suppliers from major groups (Atos / Bull, Schneider Electric, Valeo) and SMEs / ETIs (Agileo Automation, Mydatamodels, Dupliprint, Solem, Tridimeo, Prosyst, Soben), with a strong support from major French research institutes (CEA, Inria, IMT, CAPTRONIC). The project started in October 2021 for a period of 36 months with an overall budget of 56M€ (1.2M€ for STACK).
The OTPaaS platform objectives are:
The SAMURAI (Sustainable And autonoMoUs gReen computing for AI) infrastructure aims to design an innovative hardware infrastructure for the scientific study of the cross-cutting issues of computing infrastructures supporting artificial intelligence and their energy autonomy.
This project paves the way toward a larger infrastructure at the national level in the context of the SLICES-FR initiative.
The project started in 2022 for a period of 5 years with an overall budget of 730K€ (500K€ for STACK).
SysMics is an integrated cluster of research that is part of the Nantes Excellence Initiative in Medicine and Engineering. Its main objective is the development of new methods for precision medicine, in particular based on genomic analyses. In this context, we have worked on new large-scale distributed biomedical analyses and provided several results on how to distribute popular statistical analyses, such as FAMD-based and EM-based analyses.
A joint collaboration between Inria and the OVHcloud company on the challenge of a frugal cloud was launched in October 2021 with a budget of 2 M€. It addresses several scientific challenges on the eco-design of cloud frameworks and services for large-scale energy and environmental impact reduction, across three axes: i) Software eco-design of services and applications; ii) Efficiency leverages; iii) Reducing the impact and supporting users of the Cloud.
The joint challenge between Inria and Qarnot computing is called PULSE, for "PUshing Low-carbon Services towards the Edge". It aims to develop and promote best practices in geo-distributed hardware and software infrastructures for more environmentally friendly intensive computing.
The challenge is structured around two complementary research axes to address this technological and environmental issue:
Axis 1: "Holistic analysis of the environmental impact of intensive computing".
Axis 2: "Implementing more virtuous edge services"
The STACK group is mainly involved in the second axis, addressing the challenges related to data management.
As a team mainly composed of Associate Professors and Professors, the amount of teaching activities is significant (around 150 hours per person). We present here only the management activities.
Thomas Ledoux was co-leader of the Ecolog initiative. This national project is the result of an association between the Institut du Numérique Responsable (INR) and the IT sector in Nantes, which includes the University of Nantes, IMT Atlantique and Centrale Nantes. Its ambition was to create a body of training courses available in open source and dedicated to sustainable computing. A webinar presenting the results was held on March 23, 2023 (ecolog.isit-europe.org).
Thomas Ledoux took part in a round table related to digital sustainability/green computing organized by ADN Ouest (Nov. 14, 2023).