
🎓 What is GGUF?

GGUF (GPT-Generated Unified Format) is a file format for storing quantized LLMs optimized for efficient CPU and GPU inference.

Overview

GGUF is the de facto standard format for running LLMs locally. Created by Georgi Gerganov for llama.cpp, it enables efficient inference on CPUs and Apple Silicon.

Key Benefits

| Feature         | Benefit                                  |
|-----------------|------------------------------------------|
| CPU Inference   | Run models without expensive GPUs        |
| Apple Silicon   | Optimized for M1/M2/M3 chips             |
| Multiple Quants | Choose quality vs. memory tradeoff       |
| Wide Support    | Works with Ollama, LM Studio, llama.cpp  |

Quantization Levels

| Quant  | Bits per weight | Quality | Memory (7B) |
|--------|-----------------|---------|-------------|
| Q8_0   | 8.5             | Best    | ~8 GB       |
| Q5_K_M | 5.7             | Great   | ~5 GB       |
| Q4_K_M | 4.8             | Good    | ~4 GB       |
| Q2_K   | 2.6             | Lossy   | ~3 GB       |
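As a rough sanity check on the table above, a quantized model's weight footprint is approximately parameters × bits per weight ÷ 8, ignoring the KV cache and runtime overhead (an assumption for illustration; real files also carry metadata). A minimal sketch, where `estimate_gb` is a hypothetical helper, not part of any library:

```python
def estimate_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight-only memory in GB: params * bpw / 8 bits per byte."""
    return n_params * bits_per_weight / 8 / 1e9

# Rough figures for a 7B-parameter model at the quant levels above.
for quant, bpw in [("Q8_0", 8.5), ("Q5_K_M", 5.7), ("Q4_K_M", 4.8), ("Q2_K", 2.6)]:
    print(f"{quant}: ~{estimate_gb(7e9, bpw):.1f} GB")
```

The estimates land slightly below the table's values, which is expected since the table rounds up to leave headroom for context and overhead.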

How to Use

With Ollama:

ollama run model-name

With llama.cpp:

./llama-cli -m model.gguf -p "Your prompt"
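Under the hood, every GGUF file begins with a small binary header: the 4-byte magic b"GGUF" followed by a little-endian uint32 format version. A minimal sketch of validating that header in Python (`read_gguf_header` is an illustrative helper, not a real library function):

```python
import struct

def read_gguf_header(data: bytes) -> int:
    """Check the GGUF magic bytes and return the format version."""
    magic, version = struct.unpack_from("<4sI", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return version

# Usage with an in-memory example header (format version 3):
header = b"GGUF" + struct.pack("<I", 3)
print(read_gguf_header(header))  # prints 3
```

Tools like Ollama and LM Studio perform this kind of check before loading a model, which is why a truncated or renamed file fails fast rather than producing garbage output.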

Related Concepts

  • Model Quantization
  • VRAM Requirements
  • Local Inference

πŸ•ΈοΈ Knowledge Mesh

🧬 Grounded Entities