What's new with HPC and AI infrastructure at Google Cloud
Annie Ma-Weaver
Group Product Manager, HPC, Google Cloud
Wyatt Gorman
Solutions Manager, HPC & AI Infrastructure, Google Cloud
At Google Cloud, we're rapidly advancing our high-performance computing (HPC) capabilities, providing researchers and engineers with powerful tools and infrastructure to tackle the most demanding computational challenges. Here's a look at some of the key developments driving HPC innovation on Google Cloud, as well as our presence at Supercomputing 2024.
You can also stay apprised of our HPC and AI advances by joining the new Google Cloud Advanced Computing Community (details below).
Next-generation HPC VMs
We began our H series with H3 VMs, specifically designed to meet the needs of demanding HPC workloads. Now, we're excited to share some key features of the next generation of the H family, bringing even more innovation and performance to the table. The upcoming VMs will feature:
- Improved workload scalability via RDMA-enabled 200 Gbps networking
- Native support to directly provision full, tightly-coupled HPC clusters on demand
- Dynamic Workload Scheduler to provision fixed-lifetime clusters now or in the future
- Titanium technology that delivers superior performance, reliability, and security
We provide system blueprints for setting up turnkey, pre-configured HPC clusters on our H series VMs.
The next generation of H series is coming in early 2025.
Parallelstore: World's first fully-managed DAOS offering
Parallelstore is a fully managed, scalable, high-performance storage solution based on next-generation DAOS technology, designed for demanding HPC and AI workloads. It is now generally available and provides:
- Up to 6x greater read throughput performance compared to competitive Lustre scratch offerings
- Low latency (<0.5 ms at p50) and high throughput (>1 GiB/s per TiB) to access data with minimal delays, even at massive scale
- High IOPS (30K IOPS per TiB) for metadata operations
- Simplified management that reduces operational overhead with a fully managed service
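Because the throughput and IOPS figures above are quoted per TiB, a rough capacity-planning estimate follows from simple multiplication. The sketch below is illustrative only: it assumes the quoted figures scale linearly with provisioned capacity, which this post does not state, and the helper names (`estimate_performance`, `capacity_for_throughput`) are hypothetical rather than part of any Google Cloud API.

```python
from dataclasses import dataclass

# Per-TiB figures quoted in the list above, treated here as planning numbers.
THROUGHPUT_GIB_S_PER_TIB = 1.0  # ">1 GiB/s per TiB"
IOPS_PER_TIB = 30_000           # "30K IOPS per TiB"


@dataclass(frozen=True)
class ParallelstoreEstimate:
    capacity_tib: float
    throughput_gib_s: float
    iops: int


def estimate_performance(capacity_tib: float) -> ParallelstoreEstimate:
    """Hypothetical helper: scale the quoted per-TiB figures linearly.

    Linear scaling is an assumption made for illustration only.
    """
    if capacity_tib <= 0:
        raise ValueError("capacity_tib must be positive")
    return ParallelstoreEstimate(
        capacity_tib=capacity_tib,
        throughput_gib_s=capacity_tib * THROUGHPUT_GIB_S_PER_TIB,
        iops=int(capacity_tib * IOPS_PER_TIB),
    )


def capacity_for_throughput(target_gib_s: float) -> float:
    """Capacity in TiB needed for a target throughput, under the same assumption."""
    if target_gib_s <= 0:
        raise ValueError("target_gib_s must be positive")
    return target_gib_s / THROUGHPUT_GIB_S_PER_TIB


if __name__ == "__main__":
    print(estimate_performance(100))
    print(capacity_for_throughput(50))
```

Treat the result as a starting point for sizing; measured performance depends on the workload's I/O pattern and client configuration.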
Parallelstore is great for applications requiring fast access to large datasets, such as:
- Analyzing massive genomic datasets for personalized medicine
- Training large language models (LLMs) and other AI applications efficiently
- Running complex HPC simulations with rapid data access
A3 Ultra VMs with NVIDIA H200 Tensor Core GPUs
For GPU-based HPC workloads, we recently announced A3 Ultra VMs, which feature NVIDIA H200 Tensor Core GPUs. A3 Ultra VMs offer a significant leap in performance over previous generations. They are built on servers with our new Titanium ML network adapter, optimized to deliver a secure, high-performance cloud experience for AI workloads, and powered by NVIDIA ConnectX-7 networking. Combined with our datacenter-wide 4-way rail-aligned network, A3 Ultra VMs deliver non-blocking 3.2 Tbps of GPU-to-GPU traffic with RDMA over Converged Ethernet (RoCE).
Compared with A3 Mega, A3 Ultra offers:
- 2x the GPU-to-GPU networking bandwidth, powered by Google Cloud's Titanium ML network adapter and backed by our Jupiter data center network
- Up to 2x higher LLM inferencing performance with nearly double the memory capacity and 1.4x more memory bandwidth
- Ability to scale to tens of thousands of GPUs in a dense, performance-optimized cluster for large AI and HPC workloads
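The networking figures above can be related to each other with a few lines of arithmetic. The sketch below uses only the 3.2 Tbps figure and the 2x ratio quoted in this post. It reads 3.2 Tbps as a per-VM number and assumes 8 GPUs per VM; neither is stated here, so both are labeled as assumptions, and all function names are hypothetical.

```python
import math

A3_ULTRA_GPU_TO_GPU_TBPS = 3.2       # quoted in the text above
ULTRA_VS_MEGA_BANDWIDTH_RATIO = 2.0  # "2x the GPU-to-GPU networking bandwidth"
ASSUMED_GPUS_PER_VM = 8              # assumption: not stated in this post


def implied_a3_mega_tbps() -> float:
    """Bandwidth implied for A3 Mega by the quoted 2x ratio."""
    return A3_ULTRA_GPU_TO_GPU_TBPS / ULTRA_VS_MEGA_BANDWIDTH_RATIO


def tbps_to_gigabytes_per_s(tbps: float) -> float:
    """Convert terabits per second to gigabytes per second (decimal units)."""
    return tbps * 1000 / 8


def per_gpu_gbps(vm_tbps: float, gpus_per_vm: int = ASSUMED_GPUS_PER_VM) -> float:
    """Even split of per-VM bandwidth across GPUs, in gigabits per second."""
    if gpus_per_vm <= 0:
        raise ValueError("gpus_per_vm must be positive")
    return vm_tbps * 1000 / gpus_per_vm


if __name__ == "__main__":
    print(f"Implied A3 Mega bandwidth: {implied_a3_mega_tbps():.1f} Tbps")
    print(f"A3 Ultra in GB/s: {tbps_to_gigabytes_per_s(A3_ULTRA_GPU_TO_GPU_TBPS):.0f}")
    print(f"Per GPU (assumed 8 GPUs): {per_gpu_gbps(A3_ULTRA_GPU_TO_GPU_TBPS):.0f} Gbps")
```

These are line-rate conversions, not measured application throughput.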
With system blueprints, available through Cluster Toolkit, customers can quickly and easily create turnkey, pre-configured HPC clusters with Slurm support on A3 VMs.
A3 Ultra VMs will also be available through Google Kubernetes Engine (GKE), which provides an open, portable, extensible, and highly-scalable platform for large-scale training and serving of AI workloads.
Trillium: Ushering in a new era of TPU performance for AI
Tensor Processing Units, or TPUs, power our most advanced AI models such as Gemini, popular Google services like Search, Photos, and Maps, as well as scientific breakthroughs like AlphaFold 2, which led to a Nobel Prize this year!
We recently announced that Trillium, our sixth-generation TPU, is available to Google Cloud customers in preview.
Compared with TPU v5e, Trillium delivers:
- Over 4x improvement in training performance
- Up to 3x increase in inference throughput
- 67% increase in energy efficiency
- 4.7x increase in peak compute performance per chip
- Double the high bandwidth memory capacity
- Double the interchip interconnect bandwidth
Cluster Toolkit: Streamlining HPC deployments
We continue to improve Cluster Toolkit, providing open-source tools for deploying and managing HPC environments on Google Cloud. Recent updates include:
- Slurm-gcp V6 is now generally available, providing faster deployments and robust reconfiguration among other benefits.
- Google Cloud Customer Care is now available for Toolkit. You can find more information here on how to get support via the Cloud Customer Care console.
- HPC VM Image Rocky Linux 8 is now generally available, making it easy to build an HPC-ready VM instance, incorporating our best practices running HPC on Google Cloud.
GKE: Container orchestration with scale and performance
GKE continues to lead the way for containerized workloads with support for the largest Kubernetes clusters in the industry. With support for up to 65,000 nodes, we believe GKE offers more than 10x larger scale than the other two largest public cloud providers.
At the same time, we continue to invest in automating and simplifying the building of HPC and AI platforms, with:
- Secondary boot disk, which provides faster workload startups through container image caching
- Fully-managed DCGM metrics for improved accelerator monitoring
- Custom compute classes, offering greater control over compute resource allocation and scaling
- Extensive innovations in Kueue.sh, which is becoming the de facto standard for job queueing on Kubernetes with topology-aware scheduling, priority and fairness in queueing, multi-cluster support (see demo by Google and CERN engineers), and more
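To make the job-queueing item above concrete, here is a minimal sketch of how a batch Job is typically handed to Kueue: the Job carries the `kueue.x-k8s.io/queue-name` label naming a LocalQueue and is created suspended, so Kueue can admit it when quota is available. The sketch builds the manifest as a plain Python dictionary and prints it as JSON. The function name, queue name, job name, and container image are all hypothetical placeholders, and a LocalQueue with that name is assumed to exist in the target namespace.

```python
import json

# Label Kueue uses to associate a Job with a LocalQueue.
QUEUE_LABEL = "kueue.x-k8s.io/queue-name"


def build_queued_job(name, local_queue, image, command,
                     parallelism=1, cpu="1", memory="1Gi"):
    """Hypothetical helper: build a batch/v1 Job manifest for Kueue admission."""
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name, "labels": {QUEUE_LABEL: local_queue}},
        "spec": {
            # Created suspended; Kueue unsuspends the Job once it is admitted.
            "suspend": True,
            "parallelism": parallelism,
            "completions": parallelism,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "main",
                        "image": image,
                        "command": list(command),
                        # Resource requests are what Kueue counts against quota.
                        "resources": {"requests": {"cpu": cpu, "memory": memory}},
                    }],
                }
            },
        },
    }


if __name__ == "__main__":
    # All values below are placeholders for illustration.
    job = build_queued_job(
        name="sample-sim",
        local_queue="user-queue",
        image="example.com/hpc/sample-sim:latest",
        command=["./run_simulation"],
        parallelism=4,
    )
    print(json.dumps(job, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, since Kubernetes accepts JSON manifests as well as YAML.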
Customer success stories: Atommap and beyond
Atommap, a company specializing in atomic-scale materials design, is using Google Cloud HPC to accelerate its research and development efforts. With H3 VMs and Parallelstore, Atommap has achieved:
- Significant speedup in simulations: Reduced time-to-results by more than half, enabling faster innovation
- Improved scalability: Easily scaled resources for thousands to tens of thousands of molecular simulations to meet growing computational demands
- Better cost-effectiveness: Optimized infrastructure costs, with savings of up to 80%, while achieving high performance
Atommap's success story highlights the transformative potential of Google Cloud HPC for organizations pushing the boundaries of scientific discovery and technological advancement.
Looking ahead
Google Cloud is committed to continuous innovation for HPC. Expect further enhancements to HPC VMs, Parallelstore, Cluster Toolkit, Slurm-gcp, and other HPC products and solutions. With a focus on performance, scalability, compatibility, and ease of use, we're empowering researchers and engineers to tackle the world's most complex computational challenges.
Google Cloud Advanced Computing Community
We're excited to announce the launch of the Google Cloud Advanced Computing Community, a new kind of community of practice for sharing and growing HPC, AI, and quantum computing expertise, innovation, and impact.
This community of practice will bring together thought leaders and experts from Google, its partners, and HPC, AI, and quantum computing organizations around the world for engaging presentations and panels on innovative technologies and their applications. The Community will also leverage Google's powerful, comprehensive, and cloud-native tools to create an interactive, dynamic, and engaging forum for discussion and collaboration.
The Community launches now, with meetings starting in December 2024 and a full rollout of learning and collaboration resources in early 2025. To learn more, register here.
Google Cloud at Supercomputing 2024
The annual Supercomputing Conference series brings together the global HPC community to showcase the latest advancements in HPC, networking, storage and data analysis. Google Cloud is excited to return to Supercomputing 2024 in Atlanta with our largest presence ever.
Visit Google Cloud at booth #1730 to jump in and learn about our HPC, AI infrastructure, and quantum solutions. The booth will feature a Trillium TPU board, NVIDIA H200 GPU and ConnectX-7 NIC, hands-on labs, a full schedule of talks, a comfortable lounge space, and plenty of great swag!
The booth theater will include talks from ARM, Altair, Ansys, Intel, NAG, SchedMD, Siemens, Sycomp, Weka, and more. Booth labs will get you deploying Slurm clusters to fine-tune the Llama2 model or run GROMACS, using Cloud Batch to run microbenchmarks or quantum simulations, and more.
We're also involved in several parts of SC24's technical program, including BoFs, User Groups, and Workshops. Googlers will participate in the following technical sessions:
- Converged HPC and Cloud Computing in the Era of Generative AI (Bill Magro speaking)
- HPC & Cloud Convergence: drivers, triggers, and constraints (Felix Schürmann speaking)
- DAOS User Group (DUG) '24 (Dean Hildebrand speaking)
- DAOS BoF (Dean Hildebrand speaking)
- 9th International Parallel Data Systems Workshop (PDSW) (Dean Hildebrand speaking)
- IO500: The High-Performance Storage Community BoF (Dean Hildebrand speaking)
- High-Performance Object Storage: I/O for the Exascale Era Tutorial (Dean Hildebrand speaking)
Google is also hosting or sponsoring the following exciting events during SC24. We're looking forward to seeing you there!
Finally, we'll be holding private meetings and roadmap briefings with our HPC leadership throughout the conference. To schedule a meeting, please contact [email protected].