LAMB go brrr
Updated Apr 11, 2024 · Python
High-performance CUDA implementation of LayerNorm for PyTorch, achieving a 1.46x speedup through kernel fusion. Optimized for large language models (4K-8K hidden dimensions) with vectorized memory access, warp-level primitives, and mixed-precision support. A drop-in replacement for nn.LayerNorm with a 25% memory reduction.
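As a sketch of how a fused LayerNorm forward pass can combine these techniques, the kernel below computes the row statistics and the normalization in a single launch, uses `__shfl_down_sync` for warp-level reductions, and reads the input through 128-bit `float4` loads. This is a minimal illustration, not the repository's actual kernel: it assumes fp32 tensors, one thread block per row, and a hidden size divisible by 4, and all names (`fused_layernorm_fwd`, `warp_reduce_sum`) are hypothetical.

```cuda
// Illustrative fused LayerNorm forward kernel (not the repo's real code).
// Assumes fp32, one block per row, hidden % 4 == 0, blockDim.x <= 1024.
#include <cuda_runtime.h>

__inline__ __device__ float warp_reduce_sum(float val) {
    // Tree reduction within a warp using shuffle primitives.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

__global__ void fused_layernorm_fwd(const float* __restrict__ x,
                                    const float* __restrict__ gamma,
                                    const float* __restrict__ beta,
                                    float* __restrict__ y,
                                    int hidden, float eps) {
    const int row = blockIdx.x;
    const float4* x4 = reinterpret_cast<const float4*>(x + (size_t)row * hidden);
    float4* y4 = reinterpret_cast<float4*>(y + (size_t)row * hidden);
    const int n4 = hidden / 4;

    // Single pass over the row: accumulate sum and sum of squares together,
    // so mean, variance, and normalization share one kernel launch.
    float sum = 0.f, sq = 0.f;
    for (int i = threadIdx.x; i < n4; i += blockDim.x) {
        float4 v = x4[i];  // 128-bit vectorized load
        sum += v.x + v.y + v.z + v.w;
        sq  += v.x * v.x + v.y * v.y + v.z * v.z + v.w * v.w;
    }

    // Warp-level reduction, then one shared-memory step across warps.
    __shared__ float s_sum[32], s_sq[32];
    sum = warp_reduce_sum(sum);
    sq  = warp_reduce_sum(sq);
    const int lane = threadIdx.x & 31, wid = threadIdx.x >> 5;
    if (lane == 0) { s_sum[wid] = sum; s_sq[wid] = sq; }
    __syncthreads();
    const int nwarps = (blockDim.x + 31) / 32;
    if (wid == 0) {
        sum = (lane < nwarps) ? s_sum[lane] : 0.f;
        sq  = (lane < nwarps) ? s_sq[lane]  : 0.f;
        sum = warp_reduce_sum(sum);
        sq  = warp_reduce_sum(sq);
        if (lane == 0) {
            float mean = sum / hidden;
            s_sum[0] = mean;
            s_sq[0]  = rsqrtf(sq / hidden - mean * mean + eps);  // 1 / std
        }
    }
    __syncthreads();
    const float mean = s_sum[0], rstd = s_sq[0];

    // Normalize and apply the affine parameters in the same kernel, with no
    // extra round trip to global memory for the statistics.
    const float4* g4 = reinterpret_cast<const float4*>(gamma);
    const float4* b4 = reinterpret_cast<const float4*>(beta);
    for (int i = threadIdx.x; i < n4; i += blockDim.x) {
        float4 v = x4[i], g = g4[i], b = b4[i];
        v.x = (v.x - mean) * rstd * g.x + b.x;
        v.y = (v.y - mean) * rstd * g.y + b.y;
        v.z = (v.z - mean) * rstd * g.z + b.z;
        v.w = (v.w - mean) * rstd * g.w + b.w;
        y4[i] = v;
    }
}
```

Fusing the statistics pass and the normalization pass into one launch avoids materializing intermediate mean/variance tensors and re-reading the input from global memory, which is plausibly where the reported speedup and memory savings come from.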