fltech - Fujitsu Research Tech Blog

A technical blog where researchers at Fujitsu Research write about a wide range of topics

We Built an AMD GPU Cluster for Large-Scale AI and LLM Research and Development

Introduction

I'm Hiroki Ohtsuji, working on parallel computing infrastructure technology at Fujitsu Research. Today, I'd like to introduce our GPU cluster "Ashitaka," designed for AI and LLM research and development.

AI technology is evolving rapidly, and securing computational resources is crucial to staying competitive in research and development. GPUs (Graphics Processing Units) are indispensable for training and inference of AI models. NVIDIA GPUs are the best known and are widely used in AI research and development at Fujitsu as well, but we have also built a cluster using AMD GPUs. In this post, I will discuss why we chose AMD GPUs for AI research and development and the technical work behind that decision.

Importance of GPU Computational Resources

In AI research and development, GPUs provide the high computational performance needed to train models on large datasets. While compute performance tends to get the attention, overall performance also depends on the capacity and bandwidth of the memory that holds the model. Larger memory capacity allows larger models to be trained and served, and higher memory bandwidth improves inference performance in particular.
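
As a rough, illustrative calculation (the model sizes and byte counts below are generic assumptions, not figures from our cluster), the memory needed just to hold a model's weights grows linearly with parameter count, which is why capacity bounds the models a GPU can serve:

```python
# Rough GPU memory estimate for holding an LLM's weights (illustrative only).
def weights_gib(n_params_billion: float, bytes_per_param: int = 2) -> float:
    """Memory for model weights in GiB; FP16/BF16 uses 2 bytes per parameter."""
    return n_params_billion * 1e9 * bytes_per_param / 2**30

for size in (13, 70, 180):
    print(f"{size}B params ≈ {weights_gib(size):.0f} GiB of weights")
# 13B ≈ 24 GiB, 70B ≈ 130 GiB, 180B ≈ 335 GiB; activations and the KV cache
# come on top of this, so memory capacity directly limits the deployable model.
```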

Reasons for Choosing AMD GPUs

We adopted AMD GPUs for our AI system primarily because of their superior memory specifications. At the time of planning, the AMD GPUs offered 36% more memory capacity and 10% higher memory bandwidth than NVIDIA's H200-generation GPUs. This is a significant advantage: large models can be trained and served with fewer GPUs.
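
To make the "fewer GPUs" point concrete, here is a back-of-the-envelope sketch. The per-GPU capacities are assumptions chosen only to match the 36% ratio mentioned above (141 GB is NVIDIA's published H200 capacity; 192 GB is 36% more); they are not a specification of our system:

```python
import math

def gpus_needed(model_gb: float, gpu_mem_gb: float, overhead: float = 0.2) -> int:
    """Minimum GPUs to hold the weights plus a 20% runtime overhead allowance."""
    return math.ceil(model_gb * (1 + overhead) / gpu_mem_gb)

model_gb = 140  # e.g. a ~70B-parameter model in FP16
for name, mem_gb in [("141 GB GPU", 141), ("192 GB GPU (+36%)", 192)]:
    print(f"{name}: {gpus_needed(model_gb, mem_gb)} GPU(s)")
# 141 GB: 2 GPUs; 192 GB: 1 GPU. More memory per GPU means fewer GPUs
# (and less inter-GPU communication) for the same model.
```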

Leveraging Supercomputer Construction Expertise for GPU Cluster Development

AMD GPUs are less common in Japan's supercomputer market than overseas. Internationally, however, AMD has built a strong track record, as shown by the TOP500 supercomputer rankings announced in the United States last month: AMD-based systems took five of the top ten spots, including newly capturing first place. Fujitsu, for its part, has a proven record of building supercomputers from cutting-edge commodity architectures. Drawing on the expertise gained from projects such as JCAHPC's Oakforest-PACS and its successor Miyabi, as well as AIST's ABCI 1.0/2.0, Fujitsu can build practical systems from state-of-the-art hardware.

We have already achieved sufficient performance for large-scale Japanese LLM inference on the AMD GPU system, paving the way for a cluster system suitable for large-scale AI models.

Middleware Technology for Mixing Training and On-Demand Inference Processing

In AI research and development, training jobs do not always use 100% of the available resources. By introducing a middleware environment based on Fujitsu's proprietary technology, we aim to maximize resource utilization by co-locating latency-sensitive inference processes with batch-style training processes.

Specifically, by combining our ACB and Interactive HPC technologies, we can share a limited number of GPUs among multiple programs and allocate GPU resources dynamically. This lets applications whose combined footprint exceeds GPU memory capacity run in parallel, enabling efficient execution of large-scale AI applications.
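
This post does not describe the internals of ACB or Interactive HPC, so the sketch below is only a conceptual illustration of the co-location idea: priority-based sharing of a fixed GPU pool, where latency-sensitive inference jobs are dispatched ahead of queued batch training jobs. All class and job names are hypothetical.

```python
import heapq
import itertools
from dataclasses import dataclass, field

@dataclass(order=True)
class Job:
    priority: int  # 0 = interactive inference, 1 = batch training
    seq: int       # tie-breaker: FIFO within the same priority level
    name: str = field(compare=False)
    gpus: int = field(compare=False)

class GpuPool:
    """Toy scheduler: interactive jobs run ahead of queued batch jobs."""
    def __init__(self, total_gpus: int):
        self.free = total_gpus
        self.queue: list[Job] = []
        self._seq = itertools.count()

    def submit(self, name: str, gpus: int, interactive: bool) -> None:
        priority = 0 if interactive else 1
        heapq.heappush(self.queue, Job(priority, next(self._seq), name, gpus))
        self._dispatch()

    def release(self, gpus: int) -> None:
        self.free += gpus
        self._dispatch()

    def _dispatch(self) -> None:
        # Launch the highest-priority queued job while enough GPUs are free.
        while self.queue and self.queue[0].gpus <= self.free:
            job = heapq.heappop(self.queue)
            self.free -= job.gpus
            print(f"running {job.name} on {job.gpus} GPU(s); {self.free} free")

pool = GpuPool(total_gpus=8)
pool.submit("train-llm-a", gpus=6, interactive=False)    # batch job starts first
pool.submit("train-llm-b", gpus=6, interactive=False)    # queued: not enough GPUs
pool.submit("chat-inference", gpus=2, interactive=True)  # jumps ahead of the queued batch job
```

The actual technologies go further, for example by dynamically reallocating GPUs among running programs and letting applications that exceed GPU memory capacity run in parallel as described above; the toy queue only captures the basic trade-off of mixing the two workload types.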

Moving forward, we will use these technologies to optimize computational resources across AI research, development, and operations, delivering more advanced AI solutions.

[Photo: The network supporting fast parallel processing. Each cable carries several hundred Gbps.]