stereoplegic's Collections: KV Cache
S^3: Increasing GPU Utilization during Generative Inference for Higher Throughput (arXiv 2306.06000)
PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference (arXiv 2405.12532)
SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget (arXiv 2404.04793)
MiniCache: KV Cache Compression in Depth Dimension for Large Language Models (arXiv 2405.14366)
GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM (arXiv 2403.05527)
KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation (arXiv 2405.05329)
Effectively Compress KV Heads for LLM (arXiv 2406.07056)
SinkLoRA: Enhanced Efficiency and Chat Capabilities for Long-Context Large Language Models (arXiv 2406.05678)
Retaining Key Information under High Compression Ratios: Query-Guided Compressor for LLMs (arXiv 2406.02376)
GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression (arXiv 2407.12077)
RazorAttention: Efficient KV Cache Compression Through Retrieval Heads (arXiv 2407.15891)
Beyond KV Caching: Shared Attention for Efficient LLMs (arXiv 2407.12866)
Beyond Uniform Query Distribution: Key-Driven Grouped Query Attention (arXiv 2408.08454)
Efficient LLM Training and Serving with Heterogeneous Context Sharding among Attention Heads (arXiv 2407.17678)
Post-Training Sparse Attention with Double Sparsity (arXiv 2408.07092)
Palu: Compressing KV-Cache with Low-Rank Projection (arXiv 2407.21118)
InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management (arXiv 2406.19707)
Inference-Friendly Models With MixAttention (arXiv 2409.15012)