CLaSp: In-Context Layer Skip for Self-Speculative Decoding
Abstract
CLaSp, an in-context layer-skipping strategy for self-speculative decoding, accelerates Large Language Model decoding without additional modules or training, achieving a 1.3x to 1.7x speedup on LLaMA3 models.
Speculative decoding (SD) is a promising method for accelerating the decoding process of Large Language Models (LLMs). The efficiency of SD primarily hinges on the consistency between the draft model and the verify model. However, existing drafting approaches typically require additional modules to be trained, which can be challenging to implement and to keep compatible across various LLMs. In this paper, we propose CLaSp, an in-context layer-skipping strategy for self-speculative decoding. Unlike prior methods, CLaSp requires no additional drafting modules or extra training. Instead, it employs a plug-and-play mechanism that skips intermediate layers of the verify model to construct a compressed draft model. Specifically, we develop a dynamic programming algorithm that optimizes the layer-skipping process, using the complete hidden states from the last verification stage as an objective. This enables CLaSp to dynamically adjust its layer-skipping strategy after each verification stage, without relying on pre-optimized sets of skipped layers. Experimental results across diverse downstream tasks demonstrate that CLaSp achieves a speedup of 1.3x to 1.7x on LLaMA3-series models without altering the original distribution of the generated text.
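The abstract leaves the dynamic program implicit, so below is a minimal sketch of how such a layer-skip DP might look. Everything here is an assumption of the sketch rather than the paper's implementation: `layers` are the verify model's transformer blocks treated as plain hidden-state-to-hidden-state callables, `verify_hiddens` are the per-layer hidden states cached during the last verification stage, `num_skip` is a fixed skip budget, and cosine similarity is used as the layer-wise objective.

```python
import torch

def clasp_select_skip_layers(layers, h0, verify_hiddens, num_skip):
    """Illustrative DP for choosing which layers a self-drafting pass skips.

    layers         -- transformer blocks of the verify model, assumed to be
                      callables mapping a hidden state to a hidden state
    h0             -- hidden state entering layer 0 at the token position
                      used as the optimization target
    verify_hiddens -- verify_hiddens[i]: hidden state after layer i, cached
                      from the last (full-model) verification stage
    num_skip       -- number of layers the draft pass should skip

    Returns the set of layer indices to skip. All names and the
    cosine-similarity objective are assumptions of this sketch.
    """
    L = len(layers)
    # dp[j] = (hidden state, skipped-layer list) for the best draft path
    # found so far that skips exactly j of the layers processed.
    # Before any layer is processed, only j = 0 is reachable.
    dp = [(h0, [])] + [None] * num_skip

    for i in range(L):
        target = verify_hiddens[i]  # verify model's state after layer i
        new_dp = [None] * (num_skip + 1)
        for j in range(num_skip + 1):
            candidates = []
            if dp[j] is not None:                 # option 1: run layer i
                h, skipped = dp[j]
                candidates.append((layers[i](h), skipped))
            if j > 0 and dp[j - 1] is not None:   # option 2: skip layer i
                h, skipped = dp[j - 1]
                candidates.append((h, skipped + [i]))
            if candidates:
                # Keep the candidate whose hidden state stays closest to
                # the verify model's hidden state after layer i.
                new_dp[j] = max(
                    candidates,
                    key=lambda c: torch.cosine_similarity(
                        c[0], target, dim=-1
                    ).item(),
                )
        dp = new_dp

    return set(dp[num_skip][1])
```

In this sketch each DP cell costs one block forward, so one update is on the order of L x num_skip block evaluations; consistent with the abstract, such an update would run once after each verification stage (reusing its cached hidden states) rather than on every drafted token.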
Community
great work!
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- DEL: Context-Aware Dynamic Exit Layer for Efficient Self-Speculative Decoding (2025)
- KNN-SSD: Enabling Dynamic Self-Speculative Decoding via Nearest Neighbor Layer Set Optimization (2025)
- Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design (2025)
- SpecExtend: A Drop-in Enhancement for Speculative Decoding of Long Sequences (2025)
- PipeSpec: Breaking Stage Dependencies in Hierarchical LLM Decoding (2025)
- Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs (2025)
- Think Before You Accept: Semantic Reflective Verification for Faster Speculative Decoding (2025)