Low-Rank Adapters Meet Neural Architecture Search for LLM Compression Paper • 2501.16372 • Published Jan 23 • 9
TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models Paper • 2501.16937 • Published Jan 28 • 6
Identifying Sensitive Weights via Post-quantization Integral Paper • 2503.01901 • Published Feb 28 • 8
EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test Paper • 2503.01840 • Published Mar 3 • 5
SEAP: Training-free Sparse Expert Activation Pruning Unlock the Brainpower of Large Language Models Paper • 2503.07605 • Published Mar 10 • 69
DistiLLM-2: A Contrastive Approach Boosts the Distillation of LLMs Paper • 2503.07067 • Published Mar 10 • 32
Efficient Distillation of Classifier-Free Guidance using Adapters Paper • 2503.07274 • Published Mar 10 • 4
RayFlow: Instance-Aware Diffusion Acceleration via Adaptive Flow Trajectories Paper • 2503.07699 • Published Mar 10 • 5
QuartDepth: Post-Training Quantization for Real-Time Depth Estimation on the Edge Paper • 2503.16709 • Published Mar 20
LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation Paper • 2503.19950 • Published Mar 25 • 11
AdaptiVocab: Enhancing LLM Efficiency in Focused Domains through Lightweight Vocabulary Adaptation Paper • 2503.19693 • Published Mar 25 • 75
Harnessing the Reasoning Economy: A Survey of Efficient Reasoning for Large Language Models Paper • 2503.24377 • Published Mar 31 • 17
ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers Paper • 2504.00502 • Published Apr 1 • 22
Distillation and Refinement of Reasoning in Small Language Models for Document Re-ranking Paper • 2504.03947 • Published Apr 4 • 4
HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference Paper • 2504.05897 • Published Apr 8 • 18
Accelerate Parallelizable Reasoning via Parallel Decoding within One Sequence Paper • 2503.20533 • Published Mar 26 • 12
SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning Paper • 2504.07891 • Published Apr 10 • 5
PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters Paper • 2504.08791 • Published Apr 7 • 132
70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float Paper • 2504.11651 • Published Apr 15 • 28
MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention Paper • 2504.16083 • Published Apr 22 • 9
Taming the Titans: A Survey of Efficient LLM Inference Serving Paper • 2504.19720 • Published Apr 28 • 10
A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency Paper • 2505.01658 • Published May 3 • 35
ReplaceMe: Network Simplification via Layer Pruning and Linear Transformations Paper • 2505.02819 • Published May 5 • 24
RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale Paper • 2505.03005 • Published May 5 • 31
A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone Paper • 2505.12781 • Published May 19 • 2
QVGen: Pushing the Limit of Quantized Video Generative Models Paper • 2505.11497 • Published May 16 • 4
WINA: Weight Informed Neuron Activation for Accelerating Large Language Model Inference Paper • 2505.19427 • Published May 26 • 9
KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction Paper • 2505.23416 • Published May 29 • 9
CLaSp: In-Context Layer Skip for Self-Speculative Decoding Paper • 2505.24196 • Published May 30 • 13
Accelerating Diffusion LLMs via Adaptive Parallel Decoding Paper • 2506.00413 • Published Jun 1 • 6
DLP: Dynamic Layerwise Pruning in Large Language Models Paper • 2505.23807 • Published May 29 • 4
POSS: Position Specialist Generates Better Draft for Speculative Decoding Paper • 2506.03566 • Published Jun 4 • 6