deepseek-ai

FlashMLA

AI · DeepSeek · Attention · CUDA · PyTorch · LLM
View on GitHub
12,550 ★ (+340)

// summary

FlashMLA is DeepSeek's library of optimized attention kernels, built to power its V3 and V3.2-Exp models. It provides specialized implementations of both sparse and dense attention for the prefill and decoding stages, including support for an FP8 KV cache, and targets high-performance execution on the SM90 (Hopper) and SM100 (Blackwell) GPU architectures.
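To make "dense attention for the decoding stage" concrete, here is a minimal NumPy reference of what such a kernel computes: one new query token attending over the cached keys and values. This is a simplified sketch, not FlashMLA's fused CUDA implementation; the array layout and function name are assumptions for illustration.

```python
import numpy as np

def decode_attention(q, k_cache, v_cache, cache_len):
    """One decoding step: a single query token attends over the KV cache.

    q:         (num_heads, head_dim)           query for the new token
    k_cache:   (max_len, num_heads, head_dim)  cached keys
    v_cache:   (max_len, num_heads, head_dim)  cached values
    cache_len: number of valid entries in the cache
    """
    k = k_cache[:cache_len]                      # (L, H, D)
    v = v_cache[:cache_len]                      # (L, H, D)
    scale = 1.0 / np.sqrt(q.shape[-1])
    # scores[h, l] = q[h] . k[l, h], scaled by 1/sqrt(head_dim)
    scores = np.einsum("hd,lhd->hl", q, k) * scale
    # numerically stable softmax over the cache length
    scores -= scores.max(axis=1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=1, keepdims=True)
    # out[h] = sum_l w[h, l] * v[l, h]
    return np.einsum("hl,lhd->hd", w, v)
```

A production decode kernel fuses these steps, reads the cache through a block table, and splits work across SMs; the math above is the part that stays the same.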

// use cases

01
Token-level sparse attention for efficient prefill and decoding stages
02
Dense attention kernels for standard prefill and decoding operations
03
FP8 KV cache support to optimize memory usage and performance during decoding
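Use case 03 stores the KV cache in 8-bit floating point, halving cache memory versus FP16. The sketch below emulates OCP e4m3 rounding (4 exponent bits, 3 mantissa bits, max 448) with one scale per block of tokens; it illustrates the general technique only — the block size, scaling granularity, and helper names are assumptions, not FlashMLA's actual cache format.

```python
import numpy as np

def round_to_e4m3(x):
    """Round values to the nearest e4m3-representable number (emulated in fp32)."""
    x = np.clip(x, -448.0, 448.0)          # e4m3 max normal is 448
    m, e = np.frexp(x)                     # x = m * 2**e with 0.5 <= |m| < 1
    e = np.maximum(e, -5)                  # below 2**-6, fall into the subnormal grid
    quantum = 2.0 ** (e - 4)               # 4 significant bits -> spacing 2**(e-4)
    return np.round(x / quantum) * quantum

def quantize_kv(kv, block=64):
    """Quantize a (tokens, dim) KV-cache slab in blocks of `block` tokens.

    Returns e4m3-rounded values (kept in fp32 here for illustration) and one
    fp32 scale per block; a real kernel would pack the 8-bit payload instead.
    """
    n = kv.shape[0]
    assert n % block == 0
    blocks = kv.reshape(n // block, block, -1)
    scales = np.abs(blocks).max(axis=(1, 2)) / 448.0   # map each block into e4m3 range
    scales = np.maximum(scales, 1e-12)                 # guard all-zero blocks
    q = round_to_e4m3(blocks / scales[:, None, None])
    return q, scales

def dequantize_kv(q, scales):
    blocks = q * scales[:, None, None]
    return blocks.reshape(-1, blocks.shape[-1])
```

With 3 mantissa bits the worst-case round-trip error is a few percent of each block's largest magnitude, which is why per-block (rather than per-tensor) scales are the common choice for FP8 KV caches.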