In TADaaM, we propose a new approach where we allow the
application to explicitly express its resource needs regarding its execution. The
application needs to express its behavior, but in a different way from
the compute-centric approach, as the additional information is not necessarily
focused on computation and on instruction execution, but follows high-level
semantics (need for large memory for some processes, start of a
communication phase, need to refine the granularity, beginning of a
storage access phase, description of data affinity, etc.). These needs will be
expressed to a service layer through an API. The service layer will be
system-wide (able to gather a global knowledge) and stateful
(able to take decisions based on the current request but also on previous
ones). The API shall enable the application to access this service layer through
a well-defined set of functions, based on carefully designed abstractions.
Hence, the goal of TADaaM is to design a stateful
system-wide service layer for HPC systems, in order to optimize
the execution of applications according to their needs.
This layer will abstract low-level details of the architecture and the software stack, and will allow applications to register their needs. Then, according to these requests and to the environment characteristics, this layer will feature an engine to optimize the execution of the applications at system-scale, taking into account the gathered global knowledge and previous requests.
This approach exhibits several key characteristics:
Firstly, in order for applications to make the best possible use of the available resources, it is not feasible to expose all the low-level details of the hardware to the program, as doing so would make portability impossible to achieve. Hence, the standard approach is to add intermediate layers (programming models, libraries, compilers, runtime systems, etc.) to the software stack so as to bridge the gap between the application and the hardware. With this approach, optimizing the application requires expressing its parallelism (within the imposed programming model), organizing the code, scheduling and load-balancing the computations, etc. In other words, in this approach, the way the code is written and the way it is executed and interpreted by the lower layers drive the optimization. In any case, this approach is centered on how computations are performed. Such an approach is therefore no longer sufficient, as the way an application executes depends less and less on the organization of computation and more and more on the way its data is managed.
Secondly, modern large-scale parallel platforms comprise tens to hundreds of thousands of nodes 1. However, very few applications use the whole machine. In general, an application runs only on a subset of the nodes 2. Therefore, most of the time, an application shares the network, the storage and other resources with other applications running concurrently during its execution. Depending on the allocated resources, it is not uncommon for the execution of one application to interfere with the execution of a neighboring one.
Lastly, even if an application is running alone, each element of
the software stack often performs its own optimization
independently. For instance, when considering a hybrid MPI/OpenMP
application, one may realize that threads are concurrently used within the
OpenMP runtime system, within the MPI library for communication
progression, and possibly within the computation library (BLAS) and
even within the application itself (pthreads). However, none of these
different classes of threads are aware of the existence of the others.
Consequently, the way they are executed, scheduled or prioritized does
not depend on their relative roles, their location in the software
stack, nor on the state of the application.
The above remarks show that in order to go beyond the state-of-the-art, it is necessary to design a new set of mechanisms allowing cross-layer and system-wide optimizations so as to optimize the way data is allocated, accessed and transferred by the application.
In TADaaM, we will tackle the problem of efficiently
executing an application, at system-scale, on an HPC machine. We
assume that the application is already optimized (efficient data
layout, use of effective libraries, usage of state-of-the-art
compilation techniques, etc.). Nevertheless, even a statically
optimized application will not be able to be executed at scale without
considering the following dynamic constraints: machine
topology, allocated resources, data movement and contention, other
running applications, access to storage, etc. Thanks to the proposed
layer, we will provide a simple and efficient way for already existing
applications, as well as new ones, to express their needs in terms of
resource usage, locality and topology, using high-level semantics.
It is important to note that we target the optimization of each application independently, but also of several applications at the same time and at system-scale, taking into account their resource requirements, their network usage or their storage accesses. Furthermore, dealing with code-coupling applications is an intermediate use case that will also be considered.
Several issues have to be considered. The first one consists in providing
relevant abstractions and models to describe the topology of the
available resources and the application behavior.
Therefore, the first question we want to answer is: “How to build
scalable models and efficient abstractions that enable us to
understand the impact of data movement, topology and locality
on performance?”
These models must be sufficiently precise to capture reality, tractable enough
to enable efficient solutions and algorithms, and simple enough to remain
usable by non-hardware experts. We will work on
(1) better describing the memory hierarchy, considering new memory
technologies;
(2) providing an integrated view of the nodes, the network and the storage;
(3) exhibiting qualitative knowledge;
(4) providing ways to express the multi-scale properties of the machine.
Concerning abstractions, we will work on providing general concepts to
be integrated at the application or programming model layers.
The goal is to offer means for the application to
express its high-level requirements in terms of data access, locality and
communication, by providing abstractions on the notion of hierarchy, mesh,
affinity, traffic metrics, etc.
In addition to the aforementioned abstractions and models, we need
to define a clean, expressive and scalable API, in
order for applications to express their needs (memory usage, affinity,
network, storage access, model refinement, etc.).
Therefore, the second question we need to answer is: “How to
build a system-scale, stateful, shared layer that can gather
applications' needs expressed with high-level semantics?”. This work
will require not only defining a clean API through which applications will
express their needs, but also defining how such a layer will be
shared across applications and will scale on future systems. The
API will provide a simple yet effective way to express different needs
such as: memory usage of a given portion of the code; start of a
compute-intensive part; phase where the network is accessed
intensively; topology-aware affinity management; usage of storage
(in read and/or write mode); change of the data layout after mesh
refinement, etc. From an engineering point of view, the layer will
have a hierarchical design matching the hardware hierarchy, so as to
achieve scalability.
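To make this concrete, a purely hypothetical C sketch of what such need-expression calls could look like is given below; every name in it is invented for illustration and does not correspond to an existing interface.

    /* Hypothetical sketch of a need-expression API; all names are
     * invented for illustration and do not refer to an existing
     * implementation. */
    #include <stddef.h>

    typedef enum {
        TDM_PHASE_COMPUTE,    /* compute-intensive part            */
        TDM_PHASE_NETWORK,    /* intensive network access          */
        TDM_PHASE_STORAGE_R,  /* storage access phase (read mode)  */
        TDM_PHASE_STORAGE_W   /* storage access phase (write mode) */
    } tdm_phase_t;

    /* Declare the expected memory usage of the upcoming portion of code. */
    int tdm_hint_memory_usage(size_t bytes);

    /* Notify the service layer that a phase begins or ends. */
    int tdm_phase_begin(tdm_phase_t phase);
    int tdm_phase_end(tdm_phase_t phase);

    /* Declare that the data layout changed, e.g. after a mesh refinement. */
    int tdm_hint_layout_changed(void);

A stateful, system-wide service layer would aggregate such calls across all running applications before taking its decisions.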
Once this has been done, the service layer will have all the
information about the environment characteristics and application
requirements. We therefore need to design a set of mechanisms to
optimize the execution of applications: communication, mapping, thread
scheduling, data partitioning / mapping / movement, etc.
Hence, the last scientific question we will address is: “How
to design fast and efficient algorithms, mechanisms and tools to enable
execution of applications at system-scale, in a full HPC ecosystem,
taking into account topology and locality?”
A first set of research is related to thread and process placement according to
the topology and the affinity. Another large field of study is related to data
placement, allocation and partitioning: optimizing the way data is accessed and
processed especially for mesh-based applications. The issues of transferring
data across the network will also be tackled, thanks to the global knowledge we
have on the application behavior and the data layout. Concerning the interaction
with other applications, several directions will be explored. Among these
directions, we will deal with matching process placement with the resource
allocation given by the batch scheduler, or with storage
management, switching from a best-effort, application-centric strategy
to a global optimization scheme.
TADaaM targets scientific simulation applications on large-scale
systems, as these applications present huge challenges in terms of
performance, locality, scalability, parallelism and data management.
Many of these HPC applications use meshes as the basic model for their
computation. For instance, PDE-based simulations using finite
difference, finite volume, or finite element methods operate on meshes
that describe the geometry and the physical properties of the
simulated objects.
Mesh-based applications not only represent the majority of HPC applications running on existing supercomputing systems, but also feature properties that should be taken into account to achieve scalability and performance on future large-scale systems. These properties are the following:
All these features make mesh-based applications a very interesting and challenging use-case for the research we want to carry out in this project. Moreover, we believe that our proposed approach and solutions will contribute to enhance these applications and allow them to achieve the best possible usage of the available resources of future high-end systems.
Team members make common use of small to large-scale high performance computing platforms, which are energy consuming.
For this reason, recent research in the team 6
leveraged an existing consolidated simulation tool — SimGrid — for
the bulk of experiments, using an experimental platform for validation
only. For comparison, the validation experiments required
The digital sector is an ever-growing consumer of energy. Hence, it is of the utmost importance to increase the efficiency of use of digital tools. Our work on performance optimization, whether for high-end, energy consuming supercomputers, or more modest systems, aims at reducing the footprint of computations.
Because the aim of these machines is to be used at their maximum capacity, given the high production cost to amortize, we consider that our research results will not lead to a decrease in the overall use of computer systems; however, we expect them to lead to a better modeling of the energy consumption of applications and hence a better usage of their energy, resulting in “more science per watt”. Of course, the real impact is always hard to evaluate, as a possible rebound effect is that more users run on these machines, or that users decide to run extra experiments “because it is possible”.
Members of the team participated in the writing of the Inria global Action
plan on F/M professional equality for 2021-2024.
Philippe Swartvagher received an accessit (honorable mention) for the “prix
de thèse GDR RSD – Édition 2023”.
The I/O Performance Evaluation Suite (IOPS) is a tool being developed in the TADaaM team to simplify the process of benchmark execution and results analysis in HPC systems. It uses benchmark tools to run experiments with different parameters. The goal of IOPS is to automate the performance evaluation process described in 37, where we first explored the number of nodes, processes and file size to find a configuration that reaches the system's peak performance, and then used these parameters to study the impact of the number of OSTs.
Hardware Locality (hwloc) is a library and set of tools aiming at discovering and exposing the topology of machines, including processors, cores, threads, shared caches, NUMA memory nodes and I/O devices. It builds a widely-portable abstraction of these resources and exposes it to applications so as to help them adapt their behavior to the hardware characteristics. They may consult the hierarchy of resources, their attributes, and bind tasks or memory to them.
hwloc targets many types of high-performance computing applications, from thread scheduling to placement of MPI processes. Most existing MPI implementations, several resource managers and task schedulers, and multiple other parallel libraries already use hwloc.
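As a minimal illustration of this usage, the following C sketch (error handling omitted) discovers the topology, consults it, and binds the current thread to the first core; all calls belong to the public hwloc API.

    #include <hwloc.h>
    #include <stdio.h>

    int main(void)
    {
        hwloc_topology_t topology;

        /* Discover the topology of the machine. */
        hwloc_topology_init(&topology);
        hwloc_topology_load(topology);

        /* Consult the hierarchy of resources. */
        int ncores = hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_CORE);
        int nnuma  = hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_NUMANODE);
        printf("%d cores, %d NUMA nodes\n", ncores, nnuma);

        /* Bind the current thread to the first core. */
        hwloc_obj_t core = hwloc_get_obj_by_type(topology, HWLOC_OBJ_CORE, 0);
        if (core != NULL)
            hwloc_set_cpubind(topology, core->cpuset, HWLOC_CPUBIND_THREAD);

        hwloc_topology_destroy(topology);
        return 0;
    }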
NewMadeleine is the fourth incarnation of the Madeleine communication library. The new architecture aims at enabling the use of a much wider range of communication flow optimization techniques. Its design is entirely modular: drivers and optimization strategies are dynamically loadable software components, allowing experimentation with multiple approaches or on multiple issues with regard to processing communication flows.
The optimizing scheduler SchedOpt targets applications with irregular, multi-flow communication schemes such as found in the increasingly common application conglomerates made of multiple programming environments and coupled pieces of code, for instance. SchedOpt itself is easily extensible through the concepts of optimization strategies (what to optimize for, what the optimization goal is) expressed in terms of tactics (how to optimize to reach the optimization goal). Tactics themselves are made of basic communication flows operations such as packet merging or reordering.
The communication library is fully multi-threaded through its close integration with PIOMan. It manages concurrent communication operations from multiple libraries and from multiple threads. Its MPI implementation MadMPI fully supports the MPI_THREAD_MULTIPLE multi-threading level.
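For instance, a hybrid application can request this threading level with the standard MPI initialization call below and then issue communications concurrently from several threads (standard MPI code, not specific to MadMPI):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int provided;

        /* Request full multi-threading support from the MPI library. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE)
            fprintf(stderr, "MPI_THREAD_MULTIPLE not supported\n");

        /* ... application threads may now call MPI routines concurrently ... */

        MPI_Finalize();
        return 0;
    }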
TopoMatch embeds a set of algorithms to map application processes onto processors/cores in order to minimize the communication cost of the application.
Important features are the following:
- the number of processors can be greater than the number of application processes;
- it assumes that the topology is a tree and does not require valuation of the topology (e.g., communication speeds);
- it implements different placement algorithms that are switched according to the input size;
- some core algorithms are multithreaded and parallel, to speed up the execution;
- it can optionally embed Scotch for fixed-vertex mapping, and can perform an exhaustive search if required;
- several mapping metrics are computed;
- it allows for oversubscribing of resources.
TopoMatch is integrated into various software such as the Charm++ programming environment as well as in both major open-source MPI implementations: Open MPI and MPICH2.
SCOTCH has many interesting features:
- Its capabilities can be used through a set of stand-alone programs as well as through the libSCOTCH library, which offers both C and Fortran interfaces.
- It provides algorithms to partition graph structures, as well as mesh structures defined as node-element bipartite graphs and which can also represent hypergraphs.
- The SCOTCH library dynamically takes advantage of POSIX threads to speed-up its computations. The PT-SCOTCH library, used to manage very large graphs distributed across the nodes of a parallel computer, uses the MPI interface as well as POSIX threads.
- It can map any weighted source graph onto any weighted target graph. The source and target graphs may have any topology, and their vertices and edges may be weighted. Moreover, both source and target graphs may be disconnected. This feature allows for the mapping of programs onto disconnected subparts of a parallel architecture made up of heterogeneous processors and communication links.
- It computes amalgamated block orderings of sparse matrices, for efficient solving using BLAS routines.
- Its running time is linear in the number of edges of the source graph, and logarithmic in the number of vertices of the target graph for mapping computations.
- It can handle indifferently graph and mesh data structures created within C or Fortran programs, with array indices starting from 0 or 1.
- It offers extended support for adaptive graphs and meshes through the handling of disjoint edge arrays.
- It is dynamically parametrizable thanks to strategy strings that are interpreted at run-time.
- It uses system memory efficiently, to process large graphs and meshes without incurring out-of-memory faults.
- It is highly modular and documented. Since it has been released under the CeCILL-C free/libre software license, it can be used as a testbed for the easy and quick development and testing of new partitioning and ordering methods.
- It can be easily interfaced with other programs.
- It provides many tools to build, check, and display graphs, meshes and matrix patterns.
- It is written in C and uses the POSIX interface, which makes it highly portable.
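A minimal sketch of graph partitioning through the libSCOTCH C interface could look as follows (simplified and without error handling; the exact usage is described in the SCOTCH documentation):

    #include <stdio.h>
    #include <stdlib.h>
    #include <scotch.h>

    int main(void)
    {
        SCOTCH_Graph graph;
        SCOTCH_Strat strat;
        SCOTCH_Num   vertnbr, edgenbr;

        SCOTCH_graphInit(&graph);
        SCOTCH_stratInit(&strat);                 /* default strategy */

        /* Load a graph stored in the Scotch graph format. */
        FILE *f = fopen("graph.grf", "r");
        SCOTCH_graphLoad(&graph, f, -1, 0);
        fclose(f);

        /* Partition the graph into 4 parts. */
        SCOTCH_graphSize(&graph, &vertnbr, &edgenbr);
        SCOTCH_Num *parttab = malloc(vertnbr * sizeof(SCOTCH_Num));
        SCOTCH_graphPart(&graph, 4, &strat, parttab);

        free(parttab);
        SCOTCH_stratExit(&strat);
        SCOTCH_graphExit(&graph);
        return 0;
    }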
Raisin has been designed to solve the problem of circuit placement onto multi-FPGA architectures. It models the circuit to map as a set of red-black, directed, acyclic hypergraphs (DAHs). Hypergraph vertices can be either red vertices (which represent registers and external I/O ports) or black vertices (which represent internal combinatorial circuits). Vertices bear multiple weights, which define the types of resources needed to map the circuit (e.g., registers, ALUs, etc.). Every hyper-arc comprises a unique source vertex, all other ends of the hyper-arcs being sinks (which models the transmission of signals through circuit wiring). A circuit is consequently represented as a set of DAHs that share some of their red vertices.
Target architectures are described by their number of target parts, the maximum resource capacities within each target part, and the connectivity between target parts.
The main metric to minimize is the length of the longest path between two red vertices, that is, the critical path that signals have to traverse during a circuit compute cycle, which correlates to the maximum frequency at which the circuit can operate on the given target architecture.
Raisin computes a partition in which resource capacity constraints are respected and the critical path length is kept as small as possible, while reducing the number of cut hyper-arcs. It produces an assignment list, which describes, for each vertex of the hypergraphs, the part to which the vertex is assigned.
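The critical-path metric can be pictured on a plain DAG: with vertices numbered in topological order, the longest path is obtained in linear time by relaxing arcs in that order, as in the generic sketch below (Raisin itself operates on red-black DAHs, where only paths between red vertices matter).

    #include <stdio.h>

    #define NV 5

    int main(void)
    {
        /* adj[u][v] != 0 iff there is an arc u -> v; vertices are
         * numbered in topological order (arcs go from lower to higher). */
        int adj[NV][NV] = {{0}};
        adj[0][1] = adj[0][2] = adj[1][3] = adj[2][3] = adj[3][4] = 1;

        /* dist[v] = length (in arcs) of the longest path ending at v. */
        int dist[NV] = {0};
        for (int v = 0; v < NV; v++)
            for (int u = 0; u < v; u++)
                if (adj[u][v] && dist[u] + 1 > dist[v])
                    dist[v] = dist[u] + 1;

        printf("critical path length: %d\n", dist[NV - 1]); /* 3 here */
        return 0;
    }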
Raisin has many interesting features:
- It can map any weighted source circuit (represented as a set of red-black DAHs) onto any weighted target graph.
- It is based on a set of graph algorithms, including a multi-level scheme and local optimization methods of the “Fiduccia-Mattheyses” kind.
- It contains two greedy initial partitioning algorithms that have a computation time that is linear in the number of vertices. Each algorithm can be used for a particular type of topology, which can make them both complementary and efficient, depending on the problem instances.
- It takes advantage of the properties of DAHs to model path lengths with a weighting scheme based on the computation of local critical paths. This weighting scheme allows constraining the clustering algorithms to achieve better results in less time.
- It can combine several of its algorithms to create dedicated mapping strategies, suited to specific types of circuits.
- It provides many tools to build, check and convert red-black DAHs to other hypergraph and graph formats.
- It is written in C.
PlaFRIM is an experimental platform for research in modeling, simulations and high performance computing. This platform was set up in 2009 under the leadership of Inria Bordeaux Sud-Ouest, in collaboration with the local computer science and mathematics laboratories, LaBRI and IMB respectively, with strong support from the Aquitaine region.
It aggregates different kinds of computational resources for research and development purposes. The latest technologies in terms of processors, memories and architecture are added when they are available on the market. As of 2023, it contains more than 6,000 cores, 50 GPUs and several large memory nodes that are available for all research teams of Inria Bordeaux, LaBRI and IMB.
Brice Goglin has been in charge of PlaFRIM since June 2021.
Not applicable for the team
Over the past decades, the performance gap between the memory subsystem and compute capabilities has continued to widen. However, scientific applications and simulations show increasing demand for both memory speed and capacity. To tackle these demands, new technologies such as high-bandwidth memory (HBM) or non-volatile memory (NVM) emerged, which are usually combined with classical DRAM. The resulting architecture is a heterogeneous memory system in which no single memory is “best”. HBM is smaller but offers higher bandwidth than DRAM, whereas NVM provides larger capacity than DRAM at a reasonable cost and lower energy consumption. Despite that, in several cases, DRAM still offers the best latency out of all three technologies.
In order to use different kinds of memory, applications typically have to be modified to a great extent. Consequently, vendor-agnostic solutions are desirable. First, they should offer the functionality to identify kinds of memory, and second, to allocate data on it. In addition, because memory capacities may be limited, decisions about data placement regarding the different memory kinds have to be made. Finally, in making these decisions, changes over time in data that is accessed, and the actual access pattern, should be considered for initial data placement and be respected in data migration at run-time.
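As an illustration of kind-based allocation, the sketch below uses the memkind library, one existing (Intel-originated) solution in this space, to place one buffer preferably in HBM and another in DRAM; the trait-based approach studied here goes further by letting heuristics make such placement decisions.

    #include <memkind.h>
    #include <stdio.h>

    int main(void)
    {
        size_t n = 1 << 20;

        /* Prefer high-bandwidth memory, falling back to DRAM if absent. */
        double *hot  = memkind_malloc(MEMKIND_HBW_PREFERRED, n * sizeof(double));
        /* Regular DRAM allocation for capacity-bound data. */
        double *cold = memkind_malloc(MEMKIND_DEFAULT, n * sizeof(double));

        if (!hot || !cold) { fprintf(stderr, "allocation failed\n"); return 1; }

        /* ... bandwidth-sensitive data goes to hot, the rest to cold ... */

        memkind_free(MEMKIND_HBW_PREFERRED, hot);
        memkind_free(MEMKIND_DEFAULT, cold);
        return 0;
    }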
In this paper, we introduce a new methodology that aims to provide portable tools and methods for managing data placement in systems with heterogeneous memory. Our approach allows programmers to provide traits (hints) for allocations that describe how data is used and accessed. Combined with characteristics of the platforms’ memory subsystem, these traits are exploited by heuristics to decide where to place data items. We also discuss methodologies for analyzing and identifying memory access characteristics of existing applications, and for recommending allocation traits.
In our evaluation, we conduct experiments with several kernels and two proxy applications on Intel Knights Landing (HBM + DRAM) and Intel Ice Lake with Intel Optane DC Persistent Memory (DRAM + NVM) systems. We demonstrate that our methodology can bridge the performance gap between slow and fast memory by applying heuristics for initial data placement.
This work 14 is performed in collaboration with RWTH Aachen and Université de Reims Champagne-Ardenne in the context of the H2M ANR-DFG project.
Heterogeneous memory will be involved in several upcoming platforms on the way to exascale. Combining technologies such as HBM, DRAM and/or NVDIMM makes it possible to tackle the needs of different applications in terms of bandwidth, latency or capacity. And new memory interconnects such as CXL bring easy ways to attach these technologies to the processors. High-performance computing developers must prepare their runtimes and applications for these architectures, even before they are actually available. Hence, we survey software solutions for emulating them. First, we list many ways to modify the performance of platforms so that developers may test their code under different memory performance profiles. This is required to identify kernels and data buffers that are sensitive to memory performance. Then, we present several techniques for exposing fake heterogeneous memory information to the software stack. This is useful for adapting runtimes and applications to heterogeneous memory so that different kinds of memory are detected at runtime and so that buffers are allocated in the appropriate one.
This work 10 is performed in collaboration with RWTH Aachen in the context of the H2M ANR-DFG project.
In HPC, networks are programmed directly from user space, since system calls have a significant cost with low-latency networks. Usually, the user performs polling: the network is polled at regular intervals to check whether a new message has arrived. However, polling wastes resources. Another solution is to rely on interrupts instead of polling, but since interrupts are managed by the kernel, they involve the very system calls we are trying to avoid.
Intel introduced user-level interrupts on its latest Sapphire Rapids CPUs, allowing interrupts to be used from user space. These user-space interrupts may be a viable alternative to polling, providing interrupts without the cost of system calls.
We have performed preliminary work 23 using these user-space interrupts for inter-process intra-node communication in NewMadeleine. We have added a driver that relies on such user-space interrupts, and have extended the NewMadeleine core to allow a driver to perform upcalls. The preliminary results are encouraging.
As future work, we will extend the Atos BXI network so that it triggers user-space interrupts, so as to benefit from uintr in inter-node communications.
With the addition of interrupt-based communication in NewMadeleine, synchronization issues have emerged in some data structures. NewMadeleine relies on lock-free queues for many of its activities: progression through PIOMan, submission queue, completion queue, deferred tasks. However, our implementation of lock-free queues was not truly non-blocking and was not suitable for use in an interrupt handler.
Other implementations found in the literature target scalability but exhibit high latency in the uncontended case. We have shown that, since the latencies of the network and of the queues differ by several orders of magnitude, even highly contended network operations do not impose a high pressure on queues.
We have proposed a new non-blocking queue algorithm that is optimized for low contention, while degrading gracefully under higher contention. We have shown that it exhibits the best performance in NewMadeleine when compared to 15 other queue designs on four different architectures.
This work has been submitted for publication in the ACM
Symposium on Parallelism in Algorithms and Architectures.
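For context, the sketch below shows a classical lock-free enqueue in the style of Michael and Scott, written with C11 atomics; it is not the algorithm proposed in this work, merely an illustration of the compare-and-swap loops such queues rely on.

    #include <stdatomic.h>
    #include <stdlib.h>

    typedef struct node {
        void *value;
        _Atomic(struct node *) next;
    } node_t;

    typedef struct {
        _Atomic(node_t *) head;   /* points to a dummy node */
        _Atomic(node_t *) tail;
    } queue_t;

    void enqueue(queue_t *q, void *value)
    {
        node_t *n = malloc(sizeof *n);
        n->value = value;
        atomic_init(&n->next, NULL);

        for (;;) {
            node_t *tail = atomic_load(&q->tail);
            node_t *next = atomic_load(&tail->next);
            if (tail != atomic_load(&q->tail))
                continue;                     /* tail moved, retry */
            if (next == NULL) {
                /* Try to link the new node after the current tail. */
                node_t *expected = NULL;
                if (atomic_compare_exchange_weak(&tail->next, &expected, n)) {
                    /* Swing the tail pointer (may fail: others help). */
                    atomic_compare_exchange_weak(&q->tail, &tail, n);
                    return;
                }
            } else {
                /* Tail is lagging behind: help advance it. */
                atomic_compare_exchange_weak(&q->tail, &tail, next);
            }
        }
    }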
Parallel runtime systems such as MPI or task-based libraries provide models to manage both computation and communication by allocating cores, scheduling threads, executing communication algorithms. Efficiently implementing such models is challenging due to their interplay within the runtime system. In 38, 44, 43, 45, we assess interferences between communications and computations when they run side by side. We study the impact of communications on computations, and conversely the impact of computations on communication performance. We consider two aspects: CPU frequency, and memory contention. We have designed benchmarks to measure these phenomena. We show that CPU frequency variations caused by computation have a small impact on communication latency and bandwidth. However, we have observed on Intel, AMD and ARM processors, that memory contention may cause a severe slowdown of computation and communication when they occur at the same time. We have designed a benchmark with a tunable arithmetic intensity that shows how interferences between communication and computation actually depend on memory pressure of the application. Finally we have observed up to 90 % performance loss on communications with common HPC kernels such as the conjugate gradient and general matrix multiplication.
Then we proposed 7 a model to predict memory bandwidth for computations and for communications when they are executed side by side, according to data locality and taking contention into account. Elaborating the model allowed us to better understand where bottlenecks lie in the memory system and what strategies the memory system applies in case of contention. The model was evaluated on many platforms with different characteristics, and showed an average prediction error lower than 4%.
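The tunable arithmetic-intensity benchmark mentioned above can be pictured with the minimal sketch below (not the exact benchmark from the papers): each loaded element receives a configurable number of floating-point operations, so memory pressure decreases as flops grows.

    #include <stddef.h>

    /* Sweep over n doubles and apply `flops` operations per element:
     * small `flops` => memory-bound (high memory pressure),
     * large `flops` => compute-bound (low memory pressure). */
    double kernel(const double *a, size_t n, int flops)
    {
        double acc = 0.0;
        for (size_t i = 0; i < n; i++) {
            double x = a[i];                 /* one memory load */
            for (int k = 0; k < flops; k++)  /* tunable compute per load */
                x = x * 1.000001 + 0.000001;
            acc += x;
        }
        return acc;
    }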
Fine-tuning MPI meta-parameters is a critical task for HPC systems, but measuring the impact of each parameter takes a lot of time. Leveraging the LLVM infrastructure, this tool addresses the issue by automatically extracting a standalone mini-app (called a skeleton) from any MPI application. The skeleton preserves the communication pattern while removing other compute instructions, allowing it to faithfully represent the original program's communication behavior while being significantly faster. It can then be used as a proxy during the optimization phase, reducing its duration by 95%. When paired with a generic optimization tool called ShaMAN 42, it allows generating an MPI tuning configuration that exhibits the same performance as the configuration obtained through exhaustive benchmarking.
Given the complexity of current supercomputers and applications, being able to trace application executions to understand their behaviour is not a luxury. As constraints, tracing systems have to be as unintrusive as possible in terms of application code and performance, while being precise enough in the collected data.
We present 8 how we set up a tracing system
to be used with the task-based runtime system StarPU. We study the
different sources of performance overhead coming from the tracing system and
how to reduce these overheads. Then, we evaluate the accuracy of distributed
traces with different clock synchronization techniques. Finally, we summarize
our experiments and conclusions with the lessons we learned to efficiently
trace applications, and the list of characteristics each tracing system should
feature to be competitive.
The reported experiments and implementation details provide feedback on integrating state-of-the-art techniques into a task-based runtime system to efficiently and precisely trace application executions. We highlight the points every application developer or end-user should be aware of to seamlessly integrate a tracing system or just trace application executions.
In April 2023, F. Zanon-Boito participated in a Dagstuhl seminar about improving HPC infrastructures by using monitored data. From this seminar, a group (informally called WAFVR) has been formed, with a mailing list, a channel on a chat system, and regular Zoom meetings. We have also published a position paper 22. Our goal is to advertise to the community our vision of a smart HPC system that can adapt and help applications achieve the best performance, while detecting and handling problems. We are in a position to do so because the group consists of many researchers from all over the world, including people from industry (such as Paratools and HPE) and from many large HPC infrastructures.
In high-performance computing, concurrent applications share the same file system. However, the bandwidth providing access to the storage is limited. Therefore, too many I/O operations performed at the same time lead to conflicts and performance loss due to contention. This scenario will become more common as applications become more data-intensive. To avoid congestion, job schedulers have to play an important role in selecting which applications run concurrently. However, I/O-aware mapping strategies need to be simple, robust and fast. Hence, in this work 12, we discussed two plain and practical strategies to mitigate I/O congestion. They are based on the idea of scheduling I/O accesses so as not to exceed some prescribed I/O bandwidth. More precisely, we compared two approaches: one grouping applications into packs that will be run independently (i.e., pack scheduling), the other scheduling applications greedily using a predefined order (i.e., list scheduling). Results showed that performance depends heavily on the I/O load and the homogeneity of the underlying workload. Finally, we introduced the notion of characteristic time, which represents the average time between consecutive I/O transfers. We showed that it could be important to the design of schedulers, and we expect it to be easily obtained by analysis tools.
I/O scheduling strategies try to decide algorithmically which application(s) are prioritized (e.g. first-come-first-served or semi-round-robin) when accessing the shared PFS.
Previous work 41 thoroughly demonstrated that existing approaches based on either exclusivity or fair-sharing heuristics showed inconsistent results, with exclusivity sometimes outperforming fair-sharing for particular cases, and vice versa. Based on these observations, in 6 we researched an approach capable of combining both by grouping applications according to their I/O frequency. As a result, we proposed IO-Sets, a novel method for I/O management in HPC systems.
In IO-Sets, applications are categorized into sets based on their characteristic time, representing the mean time between I/O phases. Applications within the same set perform I/O exclusively, one at a time. However, applications from different sets can simultaneously access the PFS and share the available bandwidth. Each set is assigned a priority determining the portion of the I/O bandwidth applications receive when performing I/O concurrently. In 6, we present the potential of IO-Sets through a scheduling heuristic called SET-10, which is simple and requires only minimal information. Our extensive experimental campaign shows the importance of IO-Sets and the robustness of SET-10 under various workloads. We also provide insights on using our proposal in practice.
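A minimal sketch of the set-assignment rule, as we understand it from 6 (sets defined by the order of magnitude, in base 10, of the characteristic time):

    #include <math.h>

    /* Applications whose characteristic times (mean time between I/O
     * phases, in seconds) share the same order of magnitude fall into
     * the same set: e.g. 30 s and 80 s map to set 1, 500 s to set 2. */
    int set10_set(double characteristic_time)
    {
        return (int)floor(log10(characteristic_time));
    }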
IO-Sets was proposed in 2022 and published in 2023 in TPDS. From the original proposition, we have added two new contributions: firstly, an extensive test campaign based on simulation and on a prototype; and secondly, a study on the viability of IO-Sets based on one year of I/O traces of a real platform representing 4,088 applications (or jobs). The viability study is discussed in [6, Section 8] and is also available as supplementary material here. To summarize, this study demonstrated that:
Therefore, this study shows that the base assumption of IO-Sets, namely that concurrently running applications usually belong to different sets, is supported by the analyzed data. Moreover, we used the applications' data to generate other simulations, and we demonstrated that SET-10 achieves better results even when considering execution cases with more jobs and more sets.
Parallel file systems cut files into fixed-size stripes and distribute them across a number of storage targets (OSTs) for parallel access. Moreover, a layer of I/O nodes is often placed between compute nodes and the PFS. In this context, it is important to notice both OST and I/O nodes are potentially shared by running applications, which may lead to contention and low I/O performance.
Contention-mitigation approaches usually see the shared I/O infrastructure as a single resource capable of a certain bandwidth, whereas in practice it is a distributed set of resources from which each application can use a subset. In addition, using X% of the OSTs, for example, does not grant a job X% of the PFS’ peak performance. Indeed, depending on their characteristics, each application will be impacted differently by the number of used I/O resources.
We conducted a comprehensive study of the problem of scheduling shared I/O resources — I/O nodes, OSTs, etc — to HPC applications. We tackled this problem by proposing heuristics to answer two questions: 1) how many resources should we give each application (allocation heuristics), and 2) which resources should be given to each application (placement heuristics). These questions are not independent, as using more resources often means sharing them. Nonetheless, our two-step approach allows for simpler heuristics that would be usable in practice.
In addition to overhead, an important aspect that impacts how “implementable” algorithms are is their input regarding applications’ characteristics, since this information is often not available or at least imprecise. Therefore, we proposed heuristics that use different input and studied their robustness to inaccurate information.
This work was submitted to CCGrid 2024 and is currently under review 30.
As evidenced by the work on IO-Sets, discussed in Section 8.10, knowing the periodicity of applications' I/O phases is useful to improve I/O performance and mitigate contention. However, describing the temporal I/O behavior in terms of I/O phases is a challenging task. Indeed, the HPC I/O stack only sees a stream of issued requests and does not provide any I/O behavior characterization. On the contrary, the notion of an I/O phase is often purely logical, as it may consist of a set of independent I/O requests, issued by one or more processes and threads during a particular time window, and popular APIs do not require applications to explicitly group them.
Thus, a major challenge is to draw the borders of an I/O phase. Consider, for example, an application with 10 processes that writes 10 GB by generating a sequence of two 512 MB write requests per process, then performs computation and communication for a certain amount of time, after which it writes again 10 GB. How do we assert that the first 20 requests correspond to the first I/O phase and the last 20 to a second one? An intuitive approach is to compare the time between consecutive requests with a given threshold to determine whether they belong to the same phase. Naturally, the suitable threshold should depend on the system. The reading or writing method can make this an even more complex challenge, as accesses can occur, e.g., during computational phases in the absence of barriers. Hence, the threshold would not only be system dependent but also application dependent, making this intuitive approach more complicated than initially expected.
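A minimal sketch of this intuitive threshold-based grouping, with the threshold left as the system- and application-dependent parameter discussed above:

    #include <stdio.h>

    /* Group sorted request timestamps (in seconds) into I/O phases:
     * a gap larger than `threshold` starts a new phase. */
    void detect_phases(const double *t, int n, double threshold)
    {
        if (n == 0)
            return;
        int phase = 0;
        printf("phase %d starts at %.3f\n", phase, t[0]);
        for (int i = 1; i < n; i++)
            if (t[i] - t[i - 1] > threshold)
                printf("phase %d starts at %.3f\n", ++phase, t[i]);
    }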
Even assuming that one is able to find the boundaries of various I/O phases, this might still not be enough. Consider for example an application that periodically writes large checkpoints with all processes. In addition, a single process writes, at a different frequency, only a few bytes to a small log file. Although both activities clearly constitute I/O, only the period of the checkpoints is relevant to contention-avoidance techniques. If we simply see I/O activity as belonging to I/O phases, we may observe a profile that does not reflect the behavior of interest very well.
In this research 34, we proposed FTIO, a tool for characterizing the temporal I/O behavior of an application using frequency techniques such as DFT and autocorrelation. FTIO generates only a modest amount of information and hence imposes minimal overhead. We also proposed metrics that quantify the confidence in the obtained results and further characterize the I/O behavior based on the identified period.
This work, which is currently under review for IPDPS 2024, is a collaboration with Ahmad Tarraf and Felix Wolf from the Technical University of Darmstadt, Germany, in the context of the ADMIRE project.
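One of the frequency techniques FTIO builds on, autocorrelation, can be pictured with the sketch below: given a sampled I/O-activity signal (e.g. bytes transferred per time step), the lag maximizing the autocorrelation is a candidate period. This is an illustration of the principle only; FTIO itself combines DFT and autocorrelation and adds confidence metrics.

    /* Return the lag (in samples) maximizing the autocorrelation of an
     * I/O activity signal; this lag is a candidate I/O period. */
    int dominant_period(const double *x, int n)
    {
        double mean = 0.0;
        for (int i = 0; i < n; i++)
            mean += x[i];
        mean /= n;

        int best_lag = 0;
        double best = 0.0;
        for (int lag = 1; lag < n / 2; lag++) {
            double c = 0.0;
            for (int i = 0; i + lag < n; i++)
                c += (x[i] - mean) * (x[i + lag] - mean);
            if (c > best) { best = c; best_lag = lag; }
        }
        return best_lag;   /* 0 if no periodicity is detected */
    }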
This work 39, 11 introduces and assesses novel strategies to schedule firm real-time jobs on an overloaded server. The jobs are released periodically and have the same relative deadline. Job execution times obey an arbitrary probability distribution and can take unbounded values (no WCET). We introduce three control parameters to decide when to start or interrupt a job. We couple this dynamic scheduling with several admission policies and investigate several optimization criteria, the most prominent being the Deadline Miss Ratio (DMR). Then we derive a Markov model and use its stationary distribution to determine the best value of each control parameter. Finally, we conduct an extensive simulation campaign with 14 different probability distributions; the results nicely demonstrate how the new control parameters help improve system performance compared with traditional approaches. In particular, we show that (i) the best admission policy is to admit all jobs; (ii) the key control parameter is to upper bound the start time of each job; (iii) the best scheduling strategy decreases the DMR by up to 0.35 over traditional competitors.
The parallelization of the graph partitioning algorithms implemented
in branch v7.0 of the Scotch software has been
pursued. This cumulative work, implemented in version v7.0.3,
has been presented in 17.
The work of Julien Rodriguez concerns the placement of
digital circuits onto a multi-FPGA platform, in the context of a PhD
directed by François Pellegrini, in collaboration with
François Galea and Lilia Zaourar at CEA Saclay. Its
aim is to design and implement mapping algorithms that do not minimize
the cut, as it is the case in most partitioning toolboxes, but the
length of the longest path between sets of vertices. This metric
strongly correlates to the critical path that signals have to traverse
during a circuit compute cycle, hence to the maximum frequency at
which a circuit can operate.
To address this problem, we defined a dedicated hypergraph model, in the form of red-black Directed Acyclic Hypergraphs (DAHs). Subsequently, a hypergraph partitioning framework has been designed and implemented, consisting of initial partitioning and refinement algorithms 21.
A common procedure for partitioning very large circuits is to apply the most expensive algorithms to smaller instances that are assumed to be representative of the larger initial problem.
One of the most widely used methods for partitioning graphs and hypergraphs is the multilevel scheme, in which a hypergraph is successively coarsened into hypergraphs of smaller sizes, after which an initial partition is computed on the smallest hypergraph, and the initial solution is successively prolonged to each finer graph and locally refined, up to the initial hypergraph. In this context, we have studied the computation of exact solutions for the initial partitioning of the coarsest hypergraph, by way of linear programming 15. These results are promising, but evidence the risk of information loss during the coarsening stage. Indeed, coarsening can result in the creation of paths that did not exist in the initial hypergraph, which can mislead the linear programming algorithm. Hence, clustering algorithms must be specifically designed to avoid distorting the linear program.
Circuit clustering is a more direct method, in which bigger clusters
(merging more than two vertices) can be created by a single round of
the algorithm. We have studied clustering algorithms such as heavy
edge matching, for which we have developed a new weighting function
that favors the grouping of vertices along the critical path,
i.e., the longest path in the red-black hypergraph. We also
developed our own clustering algorithm 25,
which gives better results than heavy edge matching. In fact, since
heavy edge matching groups vertices by pairs, it is less efficient
than the direct grouping approach we propose.
All the aforementioned algorithms have been integrated into the
Raisin software 7.2.7.
With the recent availability of Noisy Intermediate-Scale Quantum
(NISQ) devices, quantum variational and annealing-based methods have
received increased attention. To evaluate the efficiency of these
methods, we compared Quantum Annealing (QA) and the Quantum
Approximate Optimization Algorithm (QAOA) for solving Higher Order
Binary Optimization (HOBO) problems 20. This
case study considered the hypergraph partitioning problem, which is
used to generate custom HOBO problems. Our experiments show that
D-Wave systems quickly reach limits when solving dense HOBO
problems. Although the QAOA algorithm exhibits better performance on
exact simulations, noisy simulations evidence that gate error rates
should remain below
However, the qubit interconnections of a quantum chip are typically limited, and finding a good mapping of the Ising problem onto the quantum chip can be challenging. In fact, even defining what constitutes a high-quality embedding is not trivial. In 40, we presented a brief review of existing embedding methods, and we proposed several experiments in order to identify important criteria to consider when mapping problems onto quantum annealers.
The balance between performance and energy consumption is a critical challenge in HPC systems. This study focuses on this challenge by exploring and modeling different MPI parameters (e.g., number of processes, process placement across NUMA nodes) across different code patterns (e.g., stencil pattern, memory footprint, communication protocol, strong/weak scalability). A key takeaway is that optimizing MPI codes for time performance can lead to poor energy consumption: the energy consumption of the MiniGhost proto-application could be improved by a factor of more than five by considering different execution options.
A correct evaluation of scheduling algorithms and a good understanding of their optimization criteria are key components of resource management in HPC. In 19, 31, we discuss biases and limitations of the most frequent optimization metrics from the literature. We provide elements on how to evaluate performance when studying HPC batch scheduling. We experimentally demonstrate these limitations by focusing on two use-cases: a study on the impact of runtime estimates on scheduling performance, and the reproduction of a recent high-impact work that designed an HPC batch scheduler based on a network trained with reinforcement learning. We demonstrate that focusing on a quantitative optimization criterion (“our work improves the literature by X%”) may hide extremely important caveats, to the point that the results obtained are opposed to the actual goals of the authors. Key findings show that mean bounded slowdown and mean response time are irrelevant objectives in the context of HPC. Despite some limitations, mean utilization appears to be a good objective. We propose to complement it with its standard deviation in some pathological cases. Finally, we argue for a larger use of the area-weighted response time, which we find to be a very relevant objective.
The main objective of the ADMIRE project is the creation of an active I/O stack that dynamically adjusts computation and storage requirements through intelligent global coordination, the elasticity of computation and I/O, and the scheduling of storage resources along all levels of the storage hierarchy, while offering quality-of-service (QoS), energy efficiency, and resilience for accessing extremely large data sets in very heterogeneous computing and storage environments. We have developed a framework prototype that is able to dynamically adjust computation and storage requirements through intelligent global coordination, separated control, and data paths, the malleability of computation and I/O, the scheduling of storage resources along all levels of the storage hierarchy, and scalable monitoring techniques. The leading idea in ADMIRE is to co-design applications with ad-hoc storage systems that can be deployed with the application and adapt their computing and I/O 16.
High-performance computing is not only a race towards the fastest supercomputers but also the science of using such massive machines productively to acquire valuable results, outlining the importance of performance modelling and optimization. However, it appears that more than punctual optimization is required for current architectures, with users having to choose between multiple intertwined parallelism possibilities, dedicated accelerators, and I/O solutions. Witnessing this challenging context, our paper establishes an automatic feedback loop between how applications run and how they are launched, with a specific focus on I/O. One goal is to optimize how applications are launched through moldability (launch-time malleability). As a first step in this direction, we proposed in 18 a new, always-on measurement infrastructure based on state-of-the-art cloud technologies adapted for HPC. We presented the measurement infrastructure and associated design choices. Moreover, we leverage an existing performance modelling tool to generate I/O performance models. We outline sample modelling capabilities, as derived from our measurement chain, showing the critical importance of measurement in future HPC systems, especially concerning resource configurations. Thanks to this precise performance model infrastructure, we can improve moldability and malleability on HPC systems.
Intel granted $30k and provided information about future many-core
platforms and memory architectures to ease the design and development
of the hwloc software with early support for next generation hardware.
ADMIRE project on cordis.europa.eu
The growing need to process extremely large data sets is one of the main drivers for building exascale HPC systems today. However, the flat storage hierarchies found in classic HPC architectures no longer satisfy the performance requirements of data-processing applications. Uncoordinated file access in combination with limited bandwidth make the centralised back-end parallel file system a serious bottleneck. At the same time, emerging multi-tier storage hierarchies come with the potential to remove this barrier. But maximising performance still requires careful control to avoid congestion and balance computational with storage performance. Unfortunately, appropriate interfaces and policies for managing such an enhanced I/O stack are still lacking.
The main objective of the ADMIRE project is to establish this control by creating an active I/O stack that dynamically adjusts computation and storage requirements through intelligent global coordination, malleability of computation and I/O, and the scheduling of storage resources along all levels of the storage hierarchy. To achieve this, we will develop a software-defined framework based on the principles of scalable monitoring and control, separated control and data paths, and the orchestration of key system components and applications through embedded control points.
Our software-only solution will allow the throughput of HPC systems and the performance of individual applications to be substantially increased – and consequently energy consumption to be decreased – by taking advantage of fast and power-efficient node-local storage tiers using novel, European ad-hoc storage systems and in-transit/in-situ processing facilities. Furthermore, our enhanced I/O stack will offer quality-of-service (QoS) and resilience. An integrated and operational prototype will be validated with several use cases from various domains, including climate/weather, life sciences, physics, remote sensing, and deep learning.
Emmanuel Jeannot is the leader of WP6, concerned with the design and the implementation of the “intelligent controller”, an instantiation of the service layer envisioned at the beginning of the project. Clément Barthélémy was hired in August 2021 as a research engineer to work specifically on this task. He has taken part in different ADMIRE activities, meetings and workshops, remotely and in person, including general assemblies in Torino (Italy) in June 2023 and Barcelona (Spain) in December 2023. The intelligent controller has been extended to use the Redis database more thoroughly, including its message-queue capability. Communication with the monitoring modules developed in WP5 has been refined and extended with an alert interface. The Slurm command-line interface developed in collaboration with WP4 has been improved and moved under the supervision of partner BSC.
The EUPEX pilot brings together academic and commercial stakeholders to co-design a European modular Exascale-ready pilot system. Together, they will deploy a pilot hardware and software platform integrating the full spectrum of European technologies, and will demonstrate the readiness and scalability of these technologies, and particularly of the Modular Supercomputing Architecture (MSA), towards Exascale.
EUPEX’s ambition is to support actively the European industrial ecosystem around HPC, as well as to prepare applications and users to efficiently exploit future European exascale supercomputers.
Abstract:
Though significant efforts have been devoted to the implementation and optimization of several crucial parts of a typical HPC software stack, most HPC experts agree that exascale supercomputers will raise new challenges, mostly because the trend in exascale compute-node hardware is toward heterogeneity and scalability: compute nodes of future systems will have a combination of regular CPUs and accelerators (typically GPUs), along with a diversity of GPU architectures. Meeting the needs of complex parallel applications and the requirements of exascale architectures raises numerous challenges which are still left unaddressed. As a result, several parts of the software stack must evolve to better support these architectures. More importantly, the links between these parts must be strengthened to form a coherent, tightly integrated software suite. Our project aims at consolidating the exascale software ecosystem by providing a coherent, exascale-ready software stack featuring breakthrough research advances enabled by multidisciplinary collaborations between researchers. The main scientific challenges we intend to address are: productivity, performance portability, heterogeneity, scalability and resilience, performance and energy efficiency.
Abstract:
The advent of future Exascale supercomputers raises multiple data-related challenges. To enable applications to fully leverage the upcoming infrastructures, a major challenge concerns the scalability of techniques used for data storage, transfer, processing and analytics. Additional key challenges emerge from the need to adequately exploit emerging technologies for storage and processing, leading to new, more complex storage hierarchies. Finally, it now becomes necessary to support more and more complex hybrid workflows involving at the same time simulation, analytics and learning, running at extreme scales across supercomputers interconnected to clouds and edge-based systems. The Exa-DoST project will address most of these challenges, organized in 3 areas: 1. Scalable storage and I/O; 2. Scalable in situ processing; 3. Scalable smart analytics. As part of the NumPEx program, Exa-DoST will address the major data challenges by proposing operational solutions co-designed and validated in French and European applications. This will allow filling the gap left by previous international projects to ensure that French and European needs are taken into account in the roadmaps for building the data-oriented Exascale software stack.
Emmanuel Jeannot jointly with Olivier Beaumont from
Topal, organized
the 15th JLESC
workshop
in Talence from March 21st to March 23rd. It gathered 128 participants from
the different JLESC institutions (Inria, BSC, Jülich, Riken, ANL,
U. Tennessee, NCSA). It featured discussions and exchanges on:
Artificial intelligence, Big Data, I/O and in-situ visualization,
Numerical methods and algorithms, Resilience, Performance tools,
Programming Languages, Advanced architectures, among others.
TADaaM attended the MPI Forum meetings on behalf of Inria (the forum where the MPI standard for communication
in parallel applications is developed and maintained). The team is involved in the Topologies
working group, which now encompasses both physical and virtual topologies, participates in several
other working groups, and is also represented among the editors of the MPI Standard.
This year, the proposals made in previous years were discussed, modified and finally voted into the 4.1 revision
of the MPI standard. The additions are the following:
TADaaM is a member of the Administrative Steering Committee of the PMIx standard, which is
focused on orchestration of application launch and execution.
Members of the TADaaM project gave hundreds of hours of teaching at
Université de Bordeaux and the Bordeaux INP engineering school, covering a
wide range of topics from basic use of computers, introduction to algorithmics
and C programming to advanced topics such as probabilities and statistics,
scheduling, computer networks, computer architecture, operating systems, big data, parallel programming and
high-performance runtime systems, as well as software law and personal
data law.