Sandwich: Separating Prefill-Decode Compilation for Efficient CPU LLM Serving

Jan 1, 2025

Juntao Zhao

Jiuru Li

Chuan Wu



Abstract

CPUs are critical for LLM serving due to their availability, cost efficiency, and edge applicability. However, efficient CPU serving is hindered by conflicting prefill/decode resource demands under non-disaggregated deployment constraints. Existing solutions fail to avoid cross-phase interference, ignore sub-NUMA hardware structures, and deliver suboptimal dynamic-shape kernel performance. We propose Sandwich, a full-stack CPU LLM serving system with three core innovations that address these challenges: (1) seamless phase-wise plan switching to eliminate cross-phase interference; (2) TopoTree, a tree-based hardware abstraction for automated substructure-aware (e.g., LLC slices) partial core allocation; (3) fast-start-then-finetune dynamic-shape tensor program generation. Evaluated on five x86/ARM CPU platforms, Sandwich achieves an average 2.01× end-to-end speedup and up to 3.40× latency reduction over state-of-the-art (SOTA) systems. Its kernels match static compiler performance at three orders of magnitude lower tuning cost.

Type

Conference paper

Publication

DAC 2026

Last updated on Jan 1, 2025


