high performance computing on graphics processing units: hgpu.org

Posts

Jan, 6

Finding Missed Code Size Optimizations in Compilers using LLMs

Compilers are complex, and significant effort has been expended on testing them. Techniques such as random program generation and differential testing have proved highly effective and have uncovered thousands of bugs in production compilers. The majority of effort has been expended on validating that a compiler produces correct code for a given input, while less […]

Jan, 6

Enhancing Deployment-Time Predictive Model Robustness for Code Analysis and Optimization

Supervised machine learning techniques have shown promising results in code analysis and optimization problems. However, a learning-based solution can be brittle because minor changes in hardware or application workloads — such as facing a new CPU architecture or code pattern — may jeopardize decision accuracy, ultimately undermining model robustness. We introduce Prom, an open-source library […]

OpenCL

Jan, 6

Debunking the CUDA Myth Towards GPU-based AI Systems

With the rise of AI, NVIDIA GPUs have become the de facto standard for AI system design. This paper presents a comprehensive evaluation of Intel Gaudi NPUs as an alternative to NVIDIA GPUs for AI model serving. First, we create a suite of microbenchmarks to compare Intel Gaudi-2 with NVIDIA A100, showing that Gaudi-2 achieves […]

CUDA

Jan, 6

Performant Automatic BLAS Offloading on Unified Memory Architecture with OpenMP First-Touch Style Data Movement

BLAS is a fundamental building block of advanced linear algebra libraries and many modern scientific computing applications. GPUs are known for their strong arithmetic computing capabilities and are highly suited for BLAS operations. However, porting code to GPUs often requires significant effort, especially for large, complex codes or legacy codes, even for BLAS-heavy applications. While […]

CUDA

Jan, 6

A comparison of HPC-based quantum computing simulators using Quantum Volume

This paper compares quantum computing simulators running on a single CPU or GPU-based HPC node using the Quantum Volume benchmark commonly proposed for comparing NISQ systems. As simulators do not suffer from noise, the metric used in the comparison is the time required to simulate a set Quantum Volume. The results are important to estimate […]

CUDA • OpenCL

Dec, 29

Scalable Access-Pattern Aware I/O Acceleration and Multi-Tiered Data Management for HPC and AI Workloads

The exponential growth of data-intensive scientific simulations and deep learning workloads presents significant challenges for high-performance computing (HPC) systems. These workloads generate massive data volumes at unprecedented velocities, straining the capabilities of existing memory hierarchies, I/O subsystems, and scheduling mechanisms. This dissertation addresses critical challenges in data management and workload scheduling to enhance performance, scalability, and […]

CUDA

Dec, 29

Asynchronous-Many-Task Systems: Challenges and Opportunities – Scaling an AMR Astrophysics Code on Exascale machines using Kokkos and HPX

Dynamic and adaptive mesh refinement is pivotal in high-resolution, multi-physics, multi-model simulations, necessitating precise physics resolution in localized areas across expansive domains. Today’s supercomputers’ extreme heterogeneity presents a significant challenge for dynamically adaptive codes, highlighting the importance of achieving performance portability at scale. Our research focuses on astrophysical simulations, particularly stellar mergers, to elucidate early […]

CUDA

Dec, 29

TorchQC – A framework for efficiently integrating machine and deep learning methods in quantum dynamics and control

Machine learning has been revolutionizing our world over the last few years and is also increasingly exploited in several areas of physics, including quantum dynamics and control. The need for a framework that brings together machine learning models and quantum simulation methods has been quite high within the quantum control field, with the ultimate goal of exploiting these […]

Dec, 29

A survey on FPGA-based accelerator for ML models

This paper thoroughly surveys the acceleration of machine learning (ML) algorithms on hardware accelerators, focusing on Field-Programmable Gate Arrays (FPGAs). It reviews 287 out of 1138 papers from the past six years, sourced from four top FPGA conferences. Such selection underscores the increasing integration of ML and FPGA technologies and their mutual importance in technological advancement. Research […]

Dec, 29

Development of a new framework for high performance volunteer computing

The majority of Volunteer Computing (VC) projects are based on the Berkeley Open Infrastructure for Network Computing (BOINC) framework. BOINC is an open-source middleware system designed to support a variety of volunteer computing projects across multiple scientific disciplines, including molecular biology, mathematics, cryptography, linguistics, and astrophysics. However, it is worth noting that BOINC primarily supports […]

Dec, 24

Utilizing Tensor Cores in Futhark

Modern hardware has become more heterogeneous, and with the AI boom, specialized hardware designed specifically for matrix multiplication has become readily available. In NVIDIA graphics processing units (GPUs), Tensor Cores allow for efficient execution of matrix multiplication routines that can significantly speed up AI and deep learning operations, as well as other programs containing matrix […]

CUDA

Dec, 24

CPPJoules: An Energy Measurement Tool for C++

With the increasing complexity of modern software and the demand for high performance, energy consumption has become a critical factor for developers and researchers. While much of the research community is focused on evaluating the energy consumption of machine learning and artificial intelligence systems — often implemented in Python — there is a gap when […]

CUDA

PROM: an open-source software framework to enhance the robustness and performance of predictive models against such changes during deployment

Enhancing Deployment-Time Predictive Model Robustness for Code Analysis and Optimization

DeepSpeed: a deep learning optimization library that makes distributed training and inference easy, efficient, and effective

Scalable Access-Pattern Aware I/O Acceleration and Multi-Tiered Data Management for HPC and AI Workloads

Reproducible Study and Performance Analysis of GPU Programming Paradigms: OpenACC vs. CUDA in Key Linear Algebra Computations

SW#SYCL

Analyzing the Performance Portability of SYCL across CPUs, GPUs, and Hybrid Systems with Protein Database Search

tdg-benchs: benchmarks used to test the performance of taskgraph

Leveraging the potential of task-based programming with OpenMP task graphs


Recent source codes

PROM: an open-source software framework to enhance the robustness and performance of predictive models against such changes during deployment

DeepSpeed: a deep learning optimization library that makes distributed training and inference easy, efficient, and effective

HPX: a C++ Standard Library for Concurrency and Parallelism

TorchQC: Quantum Dynamics and Machine Learning

Matrix multiplication using Tensor Cores in CUDA

CPP Joules: Energy Measurement tool for CPP/CUDA programs

HPC-Coder-v2

Reproducible Study and Performance Analysis of GPU Programming Paradigms: OpenACC vs. CUDA in Key Linear Algebra Computations

SW#SYCL

tdg-benchs: benchmarks used to test the performance of taskgraph

Most viewed papers (last 30 days)