Technical writeups on GPU programming, ML systems, and low-level engineering.
A deep dive into how shared memory tiling dramatically reduces global memory bandwidth and achieves near-cuBLAS throughput on NVIDIA A100 GPUs.
How the new WebGPU API exposes compute shaders in the browser and why it matters for running ML workloads client-side.