Blog

Technical writeups on GPU programming, ML systems, and low-level engineering.

CUDA Shared Memory Tiling for Matrix Multiplication

A deep dive into how shared memory tiling dramatically reduces global memory bandwidth and achieves near-cuBLAS throughput on NVIDIA A100 GPUs.

How the new WebGPU API exposes compute shaders in the browser and why it matters for running ML workloads client-side.