LLM-PQ: Serving LLM on heterogeneous clusters with phase-aware partition and adaptive quantization

Jan 1, 2024 · Juntao Zhao, Borui Wan, Yanghua Peng, Haibin Lin, Chuan Wu


Last updated on Jan 1, 2024
