Playing around "Less Slow" coding practices in C++ 20, C, CUDA, PTX, & Assembly, from numerics & SIMD to coroutines, ranges, exception handling, networking and user-space IO
Updated Dec 23, 2025 - C++
row-major matmul optimization
🚀🚀🚀 This repository lists some awesome public CUDA, cuda-python, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR, PTX and High Performance Computing (HPC) projects.
Energinet's Model Testbench. Automate grid-compliance studies in PSCAD and PowerFactory.
A simple profiler to count Nvidia PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis.
Binary Ninja plugin for reverse engineering PTX -- the virtual instruction set architecture of CUDA-based GPUs.
This is my 🔥 100 Days of GPU — a wild, hands-on journey through CUDA/CUTLASS kernels, Triton spells, and PTX sorcery.
Web knowledge is fragmented — duplicated across fonts, embeddings, metadata, and renderings. Humans see pixels, AI sees tokens, neither shares the source. Knowledge3D: a sovereign GPU-native reference implementation for W3C PM-KR, where humans and AI consume the same procedural knowledge from one source.
GoPTX: Fine-grained GPU Kernel Fusion by PTX-level Instruction Flow Weaving
Set of examples written for hardware acceleration via TornadoVM
Inline PTX Assembly in CUDA example