Publications

(2026). MegaScale-Data: Scaling DataLoader for Multisource Large Foundation Model Training. EuroSys 2026.
(2026). Efficient LLM Serving on Hybrid Real-time and Best-effort Requests. IEEE INFOCOM 2026.
(2025). SplitQuant: Resource-Efficient LLM Offline Serving on Heterogeneous GPUs via Phase-Aware Model Partition and Adaptive Quantization. PPoPP 2024 Poster; IEEE Cluster 2025.
(2025). Sandwich: Separating Prefill-Decode Compilation for Efficient CPU LLM Serving. DAC 2026.
(2024). QSync: Quantization-Minimized Synchronous Distributed Training Across Hybrid Devices. 38th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2024).
(2024). QSpec: Speculative Decoding with Complementary Quantization Schemes. EMNLP 2025 Main.
(2024). LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization. arXiv preprint arXiv:2403.01136.
(2024). CDMPP: A Device-Model Agnostic Framework for Latency Prediction of Tensor Programs. Proceedings of the Nineteenth European Conference on Computer Systems (EuroSys 2024).
(2023). Adaptive Message Quantization and Parallelization for Distributed Full-Graph GNN Training. Proceedings of Machine Learning and Systems.
(2022). Synchronization in Games Sound: An Audiovisual Study on Player Experience and Performance. Proceedings of the 2nd Workshop on Games Systems.