Ada-MK: Adaptive MegaKernel Optimization via Automated DAG-based Search for LLM Inference · arXiv AI Infra · 2026-05-12 · AI Infra · GPU, HBM, MLIR, TensorRT-LLM, memory, vLLM · 100 importance · 65 popularity · 91 overall
Blink: CPU-Free LLM Inference by Delegating the Serving Stack to GPU and SmartNIC · arXiv AI Infra · 2026-04-08 · AI Infra · GPU, NPU, RDMA, SGLang, TensorRT-LLM, memory · 100 importance · 65 popularity · 91 overall
Comparative Analysis of Large Language Model Inference Serving Systems: A Performance Study of vLLM and HuggingFace TGI · arXiv AI Infra · 2025-11-17 · AI Infra · GPU, PagedAttention, TGI, memory, vLLM, inference systems · 100 importance · 65 popularity · 91 overall
TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference · arXiv AI Infra · 2025-05-16 · AI Infra · GPU, NVLink, SGLang, TensorRT-LLM, vLLM · 100 importance · 65 popularity · 91 overall
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving · arXiv AI Infra · 2025-01-02 · AI Infra · DiT, FlashInfer, GPU, SGLang, memory, vLLM · 100 importance · 65 popularity · 91 overall
COMET: Towards Partical W4A4KV4 LLMs Serving · arXiv AI Infra · 2024-10-16 · AI Infra · DiT, GPU, KV cache, TensorRT-LLM, memory, quantization · 100 importance · 65 popularity · 91 overall
SCOOT: SLO-Oriented Performance Tuning for LLM Inference Engines · arXiv AI Infra · 2024-08-08 · AI Infra · TensorRT-LLM, vLLM · 100 importance · 65 popularity · 91 overall
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving · arXiv AI Infra · 2024-05-07 · AI Infra · DiT, GPU, KV cache, TensorRT-LLM, memory, quantization · 100 importance · 65 popularity · 91 overall
SGLang: Efficient Execution of Structured Language Model Programs · arXiv AI Infra · 2023-12-12 · AI Infra · KV cache, NPU, SGLang · 100 importance · 65 popularity · 91 overall