# ReplaceMe
Pruning with a training-free approach
ReplaceMe is a training-free method for transformer model compression. It prunes a contiguous block of layers and replaces it with a single linear transformation (LT) estimated on a small calibration set, either in closed form via least squares (LS) or by optimizing a cosine-similarity objective. The LT is then merged into the adjacent weights, so the pruned model keeps the original architecture and requires no retraining.
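A minimal sketch of the estimation step, assuming calibration activations `X` (inputs to the pruned block) and `Y` (its outputs) have already been collected; the function name is illustrative and not part of the `replaceme` package API:

```python
import torch

def estimate_lt_least_squares(X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    """Estimate a linear transform T minimizing ||X @ T - Y||_F.

    X: (num_tokens, hidden_dim) activations entering the pruned block.
    Y: (num_tokens, hidden_dim) activations leaving the pruned block.
    Returns T with shape (hidden_dim, hidden_dim).
    """
    # Closed-form least-squares solution (the "LS" variant in the table below);
    # the "Cosine" variant instead optimizes a cosine-similarity objective
    # numerically over the same activation pairs. Keep activations in
    # float32/float64, since the solver does not support half precision.
    return torch.linalg.lstsq(X, Y).solution
```

Because the replacement is a single matrix, it adds no inference overhead once folded into the neighboring weights.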
| Method | Approach | num_pruned_layers | Dataset | State | race (acc) | winogrande (acc) | piqa (acc_norm) | boolq (acc) | openbookqa (acc_norm) | sciq (acc_norm) | lambada_openai (acc) | ppl | Avg-acc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama 3 70B (baseline) | - | - | - | - | 0.470 | 0.834 | 0.822 | 0.875 | 0.426 | 0.942 | 0.759 | 2.731 | 0.750 |
| ReplaceMe* (LS) | LS | 20 | slim_orca | no training | 0.455 | 0.792 | 0.777 | **0.874** | **0.404** | 0.894 | 0.535 | 9.277 | 0.724 |
| ReplaceMe (ours) | Cosine | 20 | slim_orca | no training | **0.467** | **0.792** | **0.779** | 0.872 | 0.394 | **0.918** | **0.634** | **5.232** | **0.727** |

**Bold** marks the best result among the pruned models. `acc` = accuracy, `acc_norm` = length-normalized accuracy, `ppl` = perplexity (lower is better).
Our training-free methods achieve 96.6% of baseline performance, while other approaches require expensive retraining!
## INSTALLATION
```bash
pip install replaceme
# or install from source
git clone https://github.com/mts-ai/ReplaceMe
cd ReplaceMe
pip install -e .
```
## USAGE
```bash
# LSTSQ method (recommended)
run_replaceme --config ./reproduce/Replace_Me_pipeline_lstsq.yaml

# Cosine similarity method
run_replaceme --config ./reproduce/Replace_Me_pipeline_cosine.yaml
```
There are many parameters you can play with; visit our repo to discover them.
Since the estimated LTs are merged into the original transformer weights, the pruned checkpoint loads and runs like any other Hugging Face model, as shown below.
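For intuition, here is the folding algebra for a preceding `torch.nn.Linear`. This is a sketch assuming the LT acts as `y = x @ T` right after that layer; the exact merge point used by ReplaceMe may differ:

```python
import torch

@torch.no_grad()
def fold_lt_into_linear(linear: torch.nn.Linear, T: torch.Tensor) -> None:
    """Fold y = x @ T into the preceding linear layer, in place.

    The layer computes x @ W.T + b; composing with T gives
    x @ (T.T @ W).T + b @ T, so only the stored parameters change
    and no extra module is needed at inference time.
    """
    linear.weight.copy_(T.T @ linear.weight)
    if linear.bias is not None:
        linear.bias.copy_(linear.bias @ T)
```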
## EXAMPLE
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "MTSAIR/Llama3-53B-ReplaceMe"

# Load the pruned model exactly like any other Hugging Face checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "What is the ReplaceMe pruning method?"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

output = model.generate(
    **model_inputs,
    max_new_tokens=512,
)
response = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
print(response)
```
## CITATION
If you use ReplaceMe in your research, please cite our paper:
```bibtex
@article{shopkhoev2025replaceme0,
  title   = {ReplaceMe: Network Simplification via Layer Pruning and Linear Transformations},
  author  = {Dmitriy Shopkhoev and Ammar Ali and Magauiya Zhussip and Valentin Malykh and Stamatios Lefkimmiatis and Nikos Komodakis and Sergey Zagoruyko},
  year    = {2025},
  journal = {arXiv preprint arXiv:2505.02819}
}
```