Introduction
Hello, I'm Hiroki Ohtsuji from the Computing Laboratory at Fujitsu Research. Today, I would like to share my report on attending the ACM/IEEE SC24 (officially known as The International Conference for High Performance Computing, Networking, Storage, and Analysis) held in Atlanta, Georgia. This conference is the world's largest international conference on supercomputing, where we showcased and presented technologies such as the AI Computing Broker (blog post, video, PDF) and Interactive HPC (video, PDF).
What is ACM/IEEE SC24?
ACM/IEEE SC24 is the largest international conference on supercomputing, held annually in North America. It is the venue where various supercomputing rankings are announced, and it features not only paper presentations but also a vast exhibition space. This year the event was more vibrant than ever, drawing over 18,000 participants, its highest attendance to date.
Trends at the Conference
Ranking Trends
In this year's Top500 ranking, the newly introduced El Capitan from Lawrence Livermore National Laboratory in the United States took the top spot. Equipped with AMD's new MI300A accelerator, El Capitan recorded an HPL benchmark performance of 1.742 EFlop/s at a power consumption of 29.6 MW. From Japan, SoftBank's CHIE-2 and CHIE-3 debuted in 16th and 17th place, and Miyabi-G from JCAHPC (a joint center of the University of Tokyo and the University of Tsukuba) entered at 28th. Fugaku, which debuted at number one in June 2020, ranked sixth this time.
Exhibition Trends
The boom in generative AI has led to an increase in exhibits related to power and cooling systems. Accelerators used for generative AI can each consume nearly 1,000 W, and servers equipped with multiple such accelerators are becoming common. As a result, the power and heat density per rack is rising, which calls for advanced power delivery and cooling technologies. Consequently, manufacturers of power supplies and of water-cooling systems and components have become more prominent in recent years.
Additionally, the application of CXL (Compute Express Link) as a memory expansion technology was on display. CXL is an open standard for connecting various devices over a cache-coherent interconnect, and its use for memory expansion is gaining attention. As generative AI models grow larger, memory sharing is expected to become increasingly important for addressing memory capacity challenges.
Fujitsu's Technology Exhibits and Presentations
Introduction to the Fujitsu Booth
At the Fujitsu booth, we showcased ACB (AI Computing Broker), a middleware technology for efficient GPU utilization; Interactive HPC, a technology for immediate execution of large-scale jobs on supercomputers; Fujitsu's Arm-based processor MONAKA; and our quantum computing initiatives. We displayed a half-size mockup of our quantum computer, which attracted many visitors who stopped to take photos.
Introduction to Computing Infrastructure Technology
Let me introduce the computing infrastructure technology that I explained during the exhibition.
- Interactive HPC
Generative AI, including LLMs, is rapidly scaling up to improve performance. As a result, even AI researchers and developers unfamiliar with large-scale computing environments increasingly need parallel computing systems, including supercomputers. However, supercomputers typically execute jobs in order (batch processing), which inevitably introduces waiting times; this makes them less user-friendly for new users accustomed to dynamically securing resources in the cloud. To address this, I have been researching and developing Interactive HPC technology, which enables immediate execution of large-scale jobs, with the aim of making supercomputers efficiently accessible to a wider audience. At the exhibition, we demonstrated time-sharing operation of multiple parallel programs on a 1,000-node cluster and showed how digital twin applications can be accelerated. The technology was well received, with supercomputer operators expressing high expectations for the benefits of deploying Interactive HPC.
Exhibition content: video, PDF
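The core idea of time-sharing parallel jobs can be illustrated with a small sketch. This is not Fujitsu's Interactive HPC implementation; it is a minimal, hypothetical illustration on a single machine that suspends and resumes ordinary OS processes with SIGSTOP/SIGCONT so several long-running jobs share the same resources in rotation (a real system would coordinate this across many nodes through the scheduler):

```python
import signal
import subprocess
import time

def run_time_shared(commands, slice_sec=0.2, rounds=4):
    """Rotate a time slice among several jobs on shared resources.

    Conceptual single-node sketch only: each job is a subprocess that
    is paused (SIGSTOP) while others hold the slice and resumed
    (SIGCONT) when its turn comes.
    """
    procs = [subprocess.Popen(cmd) for cmd in commands]
    # Start with every job paused.
    for p in procs:
        p.send_signal(signal.SIGSTOP)
    for _ in range(rounds):
        for p in procs:
            if p.poll() is not None:
                continue  # this job already finished
            p.send_signal(signal.SIGCONT)  # grant the time slice
            time.sleep(slice_sec)
            if p.poll() is None:
                p.send_signal(signal.SIGSTOP)  # slice over, pause again
    # Let any remaining jobs run to completion.
    for p in procs:
        if p.poll() is None:
            p.send_signal(signal.SIGCONT)
        p.wait()
    return [p.returncode for p in procs]

if __name__ == "__main__":
    # Two placeholder "jobs" sharing the machine in alternating slices.
    print(run_time_shared([["sleep", "0.5"], ["sleep", "0.5"]]))
```

The point of the sketch is that jobs need not queue behind one another to completion: with preemptive sharing, a newly submitted job can start making progress immediately.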
- ACB
Once a GPU is allocated to an application, it is generally not released until the application ends. However, the GPU is rarely utilized at 100% throughout execution, which leaves idle time. ACB is a middleware technology that enables GPU sharing at the AI-framework level, such as PyTorch, to put this idle time to use. The demo showed how multiple processes, such as AlphaFold2 and LLM inference, can be executed efficiently with a limited number of GPUs. Given that GPU shortages are a challenge for many companies and computing centers, we received numerous questions and inquiries about this technology.
Exhibition content: video, PDF
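The sharing principle behind this can be sketched in a few lines. The example below is a hypothetical stand-in, not ACB itself: it models a GPU as a shared lock that each process holds only during its compute burst, so another process can use the device during CPU-side phases such as data loading (ACB achieves this transparently by working at the framework level, e.g. inside PyTorch, rather than requiring explicit leases as here):

```python
import multiprocessing as mp
import time

class GpuLease:
    """Hold a shared 'GPU' (modeled as a lock) only during a burst.

    Conceptual stand-in for framework-level GPU sharing: the device is
    released between bursts so other processes can use the idle time.
    """
    def __init__(self, lock):
        self.lock = lock
    def __enter__(self):
        self.lock.acquire()  # wait until the GPU is free
        return self
    def __exit__(self, *exc):
        self.lock.release()  # hand the GPU to the next process

def worker(name, gpu_lock, log):
    for step in range(3):
        time.sleep(0.01)  # CPU-side work (preprocessing): GPU stays free
        with GpuLease(gpu_lock):
            log.append((name, step))  # GPU burst happens here
            time.sleep(0.01)

if __name__ == "__main__":
    mgr = mp.Manager()
    log = mgr.list()
    gpu_lock = mp.Lock()
    ps = [mp.Process(target=worker, args=(n, gpu_lock, log)) for n in ("A", "B")]
    for p in ps:
        p.start()
    for p in ps:
        p.join()
    # Bursts from both processes interleave on the one shared "GPU".
    print(list(log))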
Industry-Academia Collaboration Initiatives
- Presentation at Science Tokyo Booth
At the Science Tokyo booth, I gave a presentation on Interactive HPC technology and its application to running digital twin applications on TSUBAME. I explained the details of each technology and presented a case study in which Interactive HPC was deployed on Science Tokyo's TSUBAME 4.0 to run digital twin applications. This demonstrated that immediate execution of processes on a supercomputer is possible, enabling more efficient strategy exploration than before.
- Presentation at PDSW24 and Exhibition at the University of Tsukuba Booth
As AI continues to scale up, the demands on data input and output are also increasing. To meet these demands, we are conducting joint research with the University of Tsukuba on high-performance data stores (joint research details). As part of this initiative, we presented our research on high-speed RPC (Remote Procedure Call) technology at the PDSW (Parallel Data Systems Workshop) 2024. High-performance data stores are composed of multiple computers and rely on the network both for external interactions and for internal data operations. We believe a new mechanism that tightly integrates communication and data operations is necessary to achieve a data store with performance far beyond conventional systems, so we are researching and developing high-speed RPC technology built on RDMA (Remote Direct Memory Access) and highly parallel processing. Additionally, at the University of Tsukuba's booth, we displayed a poster on our joint research, highlighting our industry-academia collaboration and achievements in high-performance data store systems.
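To make the RPC-to-data-store relationship concrete, here is a deliberately simplified sketch: a key-value store exposed through a toy request/response protocol (JSON over TCP). This is illustrative only and does not reflect our actual design; in particular, the joint research targets RDMA and highly parallel processing, which a plain socket loop like this does not capture:

```python
import json
import socket
import threading

STORE = {}  # the in-memory key-value data held by the store node

def handle(conn):
    """Serve one RPC: decode the request, run the data operation, reply."""
    with conn:
        req = json.loads(conn.recv(65536).decode())
        if req["op"] == "put":
            STORE[req["key"]] = req["value"]
            resp = {"ok": True}
        else:  # "get"
            resp = {"ok": True, "value": STORE.get(req["key"])}
        conn.sendall(json.dumps(resp).encode())

def serve(sock):
    """Accept and serve RPCs one at a time (no parallelism here)."""
    while True:
        conn, _ = sock.accept()
        handle(conn)

def rpc(port, req):
    """Client side: one connection, one request, one response."""
    with socket.create_connection(("127.0.0.1", port)) as c:
        c.sendall(json.dumps(req).encode())
        return json.loads(c.recv(65536).decode())

if __name__ == "__main__":
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen()
    port = srv.getsockname()[1]
    threading.Thread(target=serve, args=(srv,), daemon=True).start()
    rpc(port, {"op": "put", "key": "x", "value": 42})
    print(rpc(port, {"op": "get", "key": "x"}))
```

Every operation here pays for kernel-mediated socket I/O and serialization on both sides; replacing that path with RDMA and overlapping many such requests in parallel is precisely where the performance headroom we are pursuing lies.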
Conclusion
The spread of generative AI technology has led to an increase in exhibits not only related to computing itself but also in areas such as cooling and power supply, giving us more chances to see the wide range of technologies involved. The rankings continue to see changes at the top, reflecting ongoing active research and development. Japan continues to maintain its presence in the field of supercomputing, and I will continue to contribute to this field through my ongoing research and development efforts in computing infrastructure.