High-performance Triton-based GPU kernels for accelerating core deep learning operations, from matrix multiplication to convolutions and activation functions.
Modern deep learning frameworks rely on highly optimized GPU operations for training and inference efficiency. However, understanding how these operators work under the hood remains essential for researchers and engineers aiming to innovate in custom model architectures, hardware acceleration, and optimization research.
This project implements custom GPU kernels using Triton — a language designed for writing efficient GPU programs with minimal effort. Each kernel in this repository re-implements a fundamental deep learning operation (e.g., convolution, ReLU, softmax) using first principles and mathematical rigor, bridging the gap between theoretical understanding and practical GPU implementation.
| Module | Description |
|---|---|
| `1d_convolution.py` | Implements 1D convolution using a sliding-window, dot-product formulation. |
| `2d_convolution.py` | Extends convolution to 2D spatial data, enabling basic image filtering. |
| `matrix_multiplication.py` | Performs block-wise matrix multiplication, the foundation of linear layers. |
| `matrix_transpose.py` | Implements optimized transposition using tile-based memory access. |
| `matrix_copy.py` | GPU memory copy operation using block-parallel execution. |
| `relu.py` | Standard ReLU activation using elementwise thresholding. |
| `leaky_relu.py` | Parametric activation with a controllable negative slope. |
| `silu.py` | Sigmoid Linear Unit activation: $\text{SiLU}(x) = x \cdot \sigma(x)$. |
| `softmax.py` | Numerically stable softmax using a running maximum and exponential normalization. |
Each operator is derived from the core mathematical formulations of deep learning operations:
The 1D convolution kernel implements a sliding-window approach over the input $x$, multiplying and summing with the filter $w$ efficiently across GPU threads.
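
A minimal sketch of this sliding-window formulation is shown below. It computes a "valid", unit-stride cross-correlation, as is standard in deep learning frameworks; the function names, the `conv1d(x, w)` signature, and the block size are illustrative assumptions rather than the repository's exact API.

```python
import torch
import triton
import triton.language as tl

# Illustrative sketch: names, signature, and BLOCK size are assumptions,
# not necessarily those used in 1d_convolution.py.
@triton.jit
def conv1d_kernel(x_ptr, w_ptr, y_ptr, n_out, K: tl.constexpr, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)   # output positions owned by this program
    mask = offs < n_out                        # guard the final partial block
    acc = tl.zeros((BLOCK,), dtype=tl.float32)
    for k in range(K):                         # slide the window: y[i] = sum_k x[i + k] * w[k]
        xk = tl.load(x_ptr + offs + k, mask=mask, other=0.0)
        acc += xk * tl.load(w_ptr + k)
    tl.store(y_ptr + offs, acc, mask=mask)

def conv1d(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    n_out = x.numel() - w.numel() + 1          # "valid" output length: no padding, unit stride
    y = torch.empty(n_out, device=x.device, dtype=torch.float32)
    grid = (triton.cdiv(n_out, 128),)
    conv1d_kernel[grid](x, w, y, n_out, K=w.numel(), BLOCK=128)
    return y
```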
For 2D convolution, each output pixel is computed as a localized inner product, parallelized across GPU blocks with memory tiling to improve cache reuse and throughput.
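
The same idea extends to two dimensions: in the sketch below, each Triton program owns one output tile and accumulates the `KH x KW` window as an inner product. This is a hedged sketch for a single-channel, contiguous input with unit stride and no padding; the names and tile sizes are assumptions, not the exact contents of `2d_convolution.py`.

```python
import torch
import triton
import triton.language as tl

# Illustrative sketch: single-channel, contiguous input, unit stride, no padding.
# Names and tile sizes are assumptions, not necessarily those in 2d_convolution.py.
@triton.jit
def conv2d_kernel(x_ptr, w_ptr, y_ptr, W, OH, OW,
                  KH: tl.constexpr, KW: tl.constexpr,
                  BLOCK_H: tl.constexpr, BLOCK_W: tl.constexpr):
    pid_h = tl.program_id(0)
    pid_w = tl.program_id(1)
    oh = pid_h * BLOCK_H + tl.arange(0, BLOCK_H)      # output rows of this tile
    ow = pid_w * BLOCK_W + tl.arange(0, BLOCK_W)      # output cols of this tile
    mask = (oh[:, None] < OH) & (ow[None, :] < OW)
    acc = tl.zeros((BLOCK_H, BLOCK_W), dtype=tl.float32)
    for i in range(KH):
        for j in range(KW):
            # each output pixel accumulates one term of its KH x KW inner product
            xi = tl.load(x_ptr + (oh[:, None] + i) * W + (ow[None, :] + j),
                         mask=mask, other=0.0)
            acc += xi * tl.load(w_ptr + i * KW + j)
    tl.store(y_ptr + oh[:, None] * OW + ow[None, :], acc, mask=mask)

def conv2d(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    H, W = x.shape
    KH, KW = w.shape
    OH, OW = H - KH + 1, W - KW + 1                   # "valid" output size
    y = torch.empty((OH, OW), device=x.device, dtype=torch.float32)
    grid = (triton.cdiv(OH, 16), triton.cdiv(OW, 16))
    conv2d_kernel[grid](x, w, y, W, OH, OW, KH=KH, KW=KW, BLOCK_H=16, BLOCK_W=16)
    return y
```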
For matrix multiplication, a block-level GEMM (General Matrix Multiplication) algorithm is implemented using shared-memory tiling, minimizing global memory reads and maximizing compute density.
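
A block-tiled GEMM in Triton typically looks like the sketch below: each program owns a `BLOCK_M x BLOCK_N` tile of C and marches along the K dimension in `BLOCK_K` steps, with Triton placing the loaded tiles in shared memory automatically. The signature, strides, and block sizes here are illustrative assumptions rather than the repository's exact kernel.

```python
import torch
import triton
import triton.language as tl

# Illustrative sketch of a block-tiled GEMM; block sizes and the wrapper
# signature are assumptions, not necessarily those in matrix_multiplication.py.
@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)      # rows of C owned by this program
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)      # cols of C owned by this program
    rk = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + rm[:, None] * stride_am + rk[None, :] * stride_ak
    b_ptrs = b_ptr + rk[:, None] * stride_bk + rn[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):                    # march along the K dimension
        a = tl.load(a_ptrs, mask=(rm[:, None] < M) & (rk[None, :] + k < K), other=0.0)
        b = tl.load(b_ptrs, mask=(rk[:, None] + k < K) & (rn[None, :] < N), other=0.0)
        acc += tl.dot(a, b)                           # one BLOCK_K slice of the tile product
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + rm[:, None] * stride_cm + rn[None, :] * stride_cn
    tl.store(c_ptrs, acc, mask=(rm[:, None] < M) & (rn[None, :] < N))

def matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float32)
    grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
    matmul_kernel[grid](a, b, c, M, N, K,
                        a.stride(0), a.stride(1), b.stride(0), b.stride(1),
                        c.stride(0), c.stride(1),
                        BLOCK_M=64, BLOCK_N=64, BLOCK_K=32)
    return c
```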
| Function | Formula | Description |
|---|---|---|
| ReLU | $f(x) = \max(0, x)$ | Introduces non-linearity by zeroing negative values. |
| Leaky ReLU | $f(x) = \max(\alpha x, x)$ | Retains a small gradient for negative values. |
| SiLU | $f(x) = x \cdot \sigma(x)$ | Smooth variant of ReLU with sigmoid gating. |
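
All three activations reduce to the same elementwise pattern: load a block, apply the formula, store under a mask. The sketch below shows SiLU; ReLU would replace the gating line with `tl.maximum(x, 0.0)`. Names and block size are illustrative assumptions, not necessarily those in `silu.py`.

```python
import torch
import triton
import triton.language as tl

# Illustrative sketch: names and block size are assumptions,
# not necessarily those used in silu.py.
@triton.jit
def silu_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n                        # guard the final partial block
    x = tl.load(x_ptr + offs, mask=mask)
    y = x * tl.sigmoid(x)                  # SiLU(x) = x * sigma(x); ReLU would use tl.maximum(x, 0.0)
    tl.store(y_ptr + offs, y, mask=mask)

def silu(x: torch.Tensor) -> torch.Tensor:
    y = torch.empty_like(x)
    n = x.numel()
    silu_kernel[(triton.cdiv(n, 1024),)](x, y, n, BLOCK=1024)
    return y
```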
The softmax kernel ensures numerical stability by subtracting the row maximum before exponentiation, a critical step that prevents floating-point overflow on GPUs.
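
A common single-pass form of this idea, assuming a contiguous 2D input where each row fits in one block, is sketched below; the repository's kernel may instead keep a running maximum for longer rows, and the names here are illustrative.

```python
import torch
import triton
import triton.language as tl

# Illustrative sketch for a contiguous 2D input where each row fits in one block;
# names are assumptions, not necessarily those used in softmax.py.
@triton.jit
def softmax_kernel(x_ptr, y_ptr, n_cols, stride_row, BLOCK: tl.constexpr):
    row = tl.program_id(0)                            # one program per row
    cols = tl.arange(0, BLOCK)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * stride_row + cols, mask=mask, other=-float("inf"))
    x = x - tl.max(x, axis=0)                         # subtract the row max to prevent exp overflow
    num = tl.exp(x)
    tl.store(y_ptr + row * stride_row + cols, num / tl.sum(num, axis=0), mask=mask)

def softmax(x: torch.Tensor) -> torch.Tensor:
    n_rows, n_cols = x.shape
    y = torch.empty_like(x)
    BLOCK = triton.next_power_of_2(n_cols)            # whole row handled by a single program
    softmax_kernel[(n_rows,)](x, y, n_cols, x.stride(0), BLOCK=BLOCK)
    return y
```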
- **Tiling Strategy:** Operations like matrix multiplication and 2D convolution are block-tiled, meaning each GPU block handles a tile of the output to maximize data locality.
- **Memory Coalescing:** Access patterns are designed to align with the GPU memory architecture, ensuring minimal latency during global memory fetches.
- **Masking & Edge Handling:** Triton’s mask mechanism is leveraged to safely compute boundary elements, preventing out-of-bounds memory access (see the sketch after this list).
- **Numerical Stability:** The softmax kernel implements the log-sum-exp trick, an essential numerical stabilization used in real-world neural networks.
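
The masking pattern referenced above is easiest to see in the simplest kernel, a block-parallel copy in the spirit of `matrix_copy.py`; the names and block size below are illustrative assumptions.

```python
import torch
import triton
import triton.language as tl

# Illustrative sketch: names and block size are assumptions,
# not necessarily those used in matrix_copy.py.
@triton.jit
def copy_kernel(src_ptr, dst_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n                                    # False for lanes past the end of the tensor
    x = tl.load(src_ptr + offs, mask=mask, other=0.0)  # masked lanes get the fallback, never touch memory
    tl.store(dst_ptr + offs, x, mask=mask)             # masked lanes write nothing

def copy(src: torch.Tensor) -> torch.Tensor:
    dst = torch.empty_like(src)
    n = src.numel()
    copy_kernel[(triton.cdiv(n, 1024),)](src, dst, n, BLOCK=1024)
    return dst
```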