What is Flash Attention?
Flash Attention is an optimized attention algorithm that computes exact attention with O(N) memory instead of O(N²), enabling longer context lengths and 2-4x faster training and inference on modern GPUs.
The Attention Bottleneck
Standard attention has two major issues:
- Memory: Stores the full N×N attention matrix (quadratic memory)
- Speed: Memory bandwidth becomes the bottleneck, not compute
For a 32K context, the full attention matrix holds over a billion entries: about 2 GB in fp16, per attention head, per layer!
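The quadratic growth is easy to verify with a quick calculation (per head, per layer; fp16, i.e. 2 bytes per element, assumed):

```python
# Memory for the full N x N attention score matrix, per head per layer.
def attention_matrix_bytes(seq_len: int, bytes_per_element: int = 2) -> int:
    """fp16 assumed (2 bytes per element)."""
    return seq_len * seq_len * bytes_per_element

for n in (2_048, 8_192, 32_768):
    gib = attention_matrix_bytes(n) / 2**30
    print(f"N={n:>6}: {gib:.2f} GiB")
```

Doubling the context length quadruples this cost, which is why long-context models hit memory limits long before they hit compute limits.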
How Flash Attention Works
Flash Attention uses tiling and recomputation:
- Tiling: Process attention in blocks that fit in GPU SRAM
- Kernel Fusion: Combine operations to minimize memory transfers
- Recomputation: Recompute attention values on the fly during the backward pass instead of storing the N×N matrix
This achieves the same mathematical result with dramatically less memory.
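The tiling idea can be sketched in NumPy: process K/V in blocks, maintaining only a running row max and softmax normalizer per query (the "online softmax" from v1). This is a minimal single-head illustration, not the fused GPU kernel; block size and shapes are arbitrary:

```python
import numpy as np

def naive_attention(Q, K, V):
    """Materializes the full N x N score matrix: O(N^2) memory."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def tiled_attention(Q, K, V, block_size=16):
    """Processes K/V in blocks, keeping only O(N) running statistics."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((N, d))
    m = np.full(N, -np.inf)   # running row max
    l = np.zeros(N)           # running softmax denominator
    for start in range(0, N, block_size):
        Kb = K[start:start + block_size]
        Vb = V[start:start + block_size]
        S = Q @ Kb.T * scale                    # N x B score tile
        m_new = np.maximum(m, S.max(axis=-1))   # updated row max
        P = np.exp(S - m_new[:, None])          # tile softmax numerator
        correction = np.exp(m - m_new)          # rescale old accumulators
        l = l * correction + P.sum(axis=-1)
        O = O * correction[:, None] + P @ Vb
        m = m_new
    return O / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((64, 32)) for _ in range(3))
assert np.allclose(naive_attention(Q, K, V), tiled_attention(Q, K, V))
```

On a GPU, each tile lives in fast SRAM and the accumulators are the only per-query state written back to HBM, which is where the memory and bandwidth savings come from.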
Performance Impact
| Metric | Standard Attention | Flash Attention |
|---|---|---|
| Memory | O(N²) | O(N) |
| Speed | 1x | 2-4x faster |
| Max Context | ~8K | 128K+ |
| Training Throughput | Baseline | 2-3x higher |
Flash Attention Versions
| Version | Key Features |
|---|---|
| v1 | Tiling, online softmax |
| v2 | Better parallelism, 2x faster |
| v3 | Hopper GPU optimizations (H100) |
Enabling Flash Attention
Most modern frameworks support Flash Attention:
```python
# Hugging Face Transformers
from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    "model_name",
    torch_dtype=torch.float16,  # Flash Attention requires fp16 or bf16
    attn_implementation="flash_attention_2",
)
```
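PyTorch 2.x also exposes fused attention directly through `torch.nn.functional.scaled_dot_product_attention`, which dispatches to a Flash Attention kernel when the hardware, dtype, and mask allow it, and otherwise falls back to a math implementation. A minimal sketch (shapes here are illustrative):

```python
import torch
import torch.nn.functional as F

# Batch of 2, 8 heads, sequence length 1024, head dimension 64.
q = torch.randn(2, 8, 1024, 64)
k = torch.randn(2, 8, 1024, 64)
v = torch.randn(2, 8, 1024, 64)

# Dispatches to a fused (Flash) kernel when supported by the backend
# and dtype; otherwise falls back to the math implementation.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 1024, 64])
```

This is often the simplest way to benefit from Flash Attention without installing a separate package.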
Requirements
- GPU: NVIDIA Ampere (A100) or newer recommended
- CUDA: 11.6+
- PyTorch: 2.0+
Related Optimizations
| Technique | Description |
|---|---|
| PagedAttention | Used in vLLM for serving |
| Ring Attention | Distributes attention across GPUs |
| Sliding Window | Limits attention to local context |
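As a concrete illustration of the last row, a sliding-window scheme restricts each query to a local, causal band of keys; the window size below is arbitrary (NumPy sketch):

```python
import numpy as np

def sliding_window_mask(n: int, window: int) -> np.ndarray:
    """True where query i may attend to key j: causal, within `window`."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (i - j < window)

mask = sliding_window_mask(6, 3)  # each row has at most 3 True entries
```

Because each row of the mask has at most `window` active keys, attention cost grows linearly in sequence length rather than quadratically.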
Related Concepts
- Context Length - Input sequence limits
- Transformer - Architecture overview
- VRAM - GPU memory management