Low-Rank Adapters Meet Neural Architecture Search for LLM Compression Paper • 2501.16372 • Published Jan 23 • 9
TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models Paper • 2501.16937 • Published Jan 28 • 6
Identifying Sensitive Weights via Post-quantization Integral Paper • 2503.01901 • Published Feb 28 • 8
EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test Paper • 2503.01840 • Published Mar 3 • 5
SEAP: Training-free Sparse Expert Activation Pruning Unlock the Brainpower of Large Language Models Paper • 2503.07605 • Published Mar 10 • 69
DistiLLM-2: A Contrastive Approach Boosts the Distillation of LLMs Paper • 2503.07067 • Published Mar 10 • 32
Efficient Distillation of Classifier-Free Guidance using Adapters Paper • 2503.07274 • Published Mar 10 • 4
RayFlow: Instance-Aware Diffusion Acceleration via Adaptive Flow Trajectories Paper • 2503.07699 • Published Mar 10 • 5
QuartDepth: Post-Training Quantization for Real-Time Depth Estimation on the Edge Paper • 2503.16709 • Published Mar 20
LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation Paper • 2503.19950 • Published Mar 25 • 11
AdaptiVocab: Enhancing LLM Efficiency in Focused Domains through Lightweight Vocabulary Adaptation Paper • 2503.19693 • Published Mar 25 • 75
Harnessing the Reasoning Economy: A Survey of Efficient Reasoning for Large Language Models Paper • 2503.24377 • Published Mar 31 • 17
ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers Paper • 2504.00502 • Published Apr 1 • 22
Distillation and Refinement of Reasoning in Small Language Models for Document Re-ranking Paper • 2504.03947 • Published Apr 4 • 4
HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference Paper • 2504.05897 • Published Apr 8 • 18
Accelerate Parallelizable Reasoning via Parallel Decoding within One Sequence Paper • 2503.20533 • Published Mar 26 • 12
SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning Paper • 2504.07891 • Published Apr 10 • 5
PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters Paper • 2504.08791 • Published Apr 7 • 132
70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float Paper • 2504.11651 • Published Apr 15 • 28
MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention Paper • 2504.16083 • Published Apr 22 • 9
Taming the Titans: A Survey of Efficient LLM Inference Serving Paper • 2504.19720 • Published Apr 28 • 10
A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency Paper • 2505.01658 • Published May 3 • 35
ReplaceMe: Network Simplification via Layer Pruning and Linear Transformations Paper • 2505.02819 • Published May 5 • 24
RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale Paper • 2505.03005 • Published May 5 • 31
A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone Paper • 2505.12781 • Published May 19 • 2
QVGen: Pushing the Limit of Quantized Video Generative Models Paper • 2505.11497 • Published May 16 • 4
WINA: Weight Informed Neuron Activation for Accelerating Large Language Model Inference Paper • 2505.19427 • Published May 26 • 9
KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction Paper • 2505.23416 • Published May 29 • 9
CLaSp: In-Context Layer Skip for Self-Speculative Decoding Paper • 2505.24196 • Published May 30 • 13
Accelerating Diffusion LLMs via Adaptive Parallel Decoding Paper • 2506.00413 • Published Jun 1 • 6
DLP: Dynamic Layerwise Pruning in Large Language Models Paper • 2505.23807 • Published May 29 • 4
POSS: Position Specialist Generates Better Draft for Speculative Decoding Paper • 2506.03566 • Published Jun 4 • 6