SplitQuant: Resource-Efficient LLM Offline Serving on Heterogeneous GPUs via Phase-Aware Model Partition and Adaptive Quantization
Jan 1, 2025 · Juntao Zhao, Borui Wan, Chuan Wu, Yanghua Peng, Haibin Lin