# Scaling Reasoning without Attention

## Overview
PromptCoT-Mamba establishes the first attention-free foundation model capable of surpassing strong Transformer baselines across a broad suite of competition-level math and code reasoning tasks. Built on the Mamba-2 architecture and trained through a structured, two-stage curriculum using the PromptCoT pipeline, it delivers high accuracy with constant-memory inference, eliminating the need for KV caching.
## Key Results

### General Performance
| Model | MATH-500 | AIME 24 | AIME 25 | OlympiadBench | HumanEval | HumanEval+ | LiveCodeBench |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PromptCoT-Mamba-7B | 84.6 | 35.2 | 24.6 | 50.7 | 81.7 | 75.0 | 29.9 |
| Gemma3-27B | 89.0 | 32.6 | 24.0 | 54.2 | 86.0 | 78.0 | 26.9 |
| Gemma3-12B | 83.8 | 22.9 | 19.2 | 49.9 | 81.1 | 73.2 | 22.2 |
| Sky-T1-7B | 85.0 | 19.2 | 19.2 | 49.2 | 41.5 | 37.2 | 18.3 |
| S1.1-7B | 82.0 | 19.2 | 17.5 | 43.1 | 64.0 | 56.7 | 13.3 |
| Bespoke-Stratos-7B | 81.2 | 18.3 | 16.3 | 45.0 | 73.2 | 68.3 | 8.6 |
| Nemotron-H-8B | 77.6 | -- | -- | -- | 79.3 | 74.4 | -- |
| M1-3B | 81.7 | 23.0 | 22.0 | 43.6 | -- | -- | -- |
PromptCoT-Mamba-7B consistently outperforms all 7B-scale Transformer and hybrid Mamba-Transformer baselines across every task.

### Math Specialization vs. Generalist
| Model | MATH-500 | AIME 24 | AIME 25 | OlympiadBench | HumanEval | HumanEval+ | LiveCodeBench |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PromptCoT-Mamba-Math-7B | 88.0 | 42.9 | 30.8 | 52.1 | 71.3 | 66.5 | 20.3 |
| PromptCoT-Mamba-7B | 84.6 | 35.2 | 24.6 | 50.7 | 81.7 | 75.0 | 29.9 |
The math-specialized variant improves AIME 24 by +7.7 points and AIME 25 by +6.2 points, at the cost of a modest drop in code performance.
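To try the math-specialized variant, only the model name in the quick-start code below needs to change. The repository id used here is an assumption inferred from the model name in the table; check the Hugging Face Hub for the exact identifier.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id for the math-specialized checkpoint; verify on the Hub before use.
math_model_name = "xl-zhao/PromptCoT-Mamba-Math-7B"
tokenizer = AutoTokenizer.from_pretrained(math_model_name)
model = AutoModelForCausalLM.from_pretrained(math_model_name).to("cuda")
```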
## Inference Efficiency

Using vLLM under constrained memory, PromptCoT-Mamba-7B demonstrates substantial speedups over the S1.1-7B Transformer baseline:

- 3.66× faster at long-sequence generation on a 24 GB GPU
- 1.69× faster under a 72 GB memory budget
Practical for cost-sensitive or long-context inference workloads at scale.
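The exact speedups depend on hardware, memory budget, and generation length. Below is a minimal timing sketch using vLLM to compare long-sequence generation between two models; the baseline repo id, token budget, and `gpu_memory_utilization` value are illustrative assumptions, not the paper's benchmark setup.

```python
# Illustrative timing sketch (not the paper's benchmark harness).
# Run once per model, e.g.:  python time_generate.py xl-zhao/PromptCoT-Mamba-7B
import sys
import time

from vllm import LLM, SamplingParams

model_name = sys.argv[1]  # pass the model repo id on the command line

prompt = (
    "<|im_start|>user\nProve that the sum of two even integers is even."
    "\nPlease reason step by step, and put your final answer within \\boxed{}.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

# Long generations are where constant-memory inference pays off.
sampling_params = SamplingParams(temperature=0.8, max_tokens=8192)

llm = LLM(model=model_name, gpu_memory_utilization=0.9)  # memory budget is an assumption
start = time.time()
llm.generate([prompt], sampling_params)
print(f"{model_name}: {time.time() - start:.1f} s for one long generation")
```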
## Quick Start

### Install Requirements

```bash
pip install transformers vllm torch accelerate
```
### Load and Run the Model
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "xl-zhao/PromptCoT-Mamba-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda")

problem_statement = (
    "A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?"
)

# Chat-style prompt; the \boxed{} instruction makes the final answer easy to locate.
prompt = (
    f"<|im_start|>user\n{problem_statement}\nPlease reason step by step, and put your final answer within \\boxed{{}}.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    # do_sample=True so that the temperature setting actually takes effect.
    output = model.generate(**inputs, max_length=65536, temperature=0.8, do_sample=True)

generated_solution = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_solution)
```
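Because the prompt asks the model to put its final answer inside `\boxed{...}`, a small helper can recover it from the generated text. This is a convenience sketch (the `extract_boxed_answer` helper is not part of the released code) and only handles answers without nested braces.

```python
import re

def extract_boxed_answer(text):
    """Return the contents of the last \\boxed{...} in the text, or None.

    Only handles answers without nested braces (e.g. \\boxed{3}).
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None

print(extract_boxed_answer(generated_solution))  # expected: "3" for the robe problem
```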
### Fast Inference with vLLM
```python
from vllm import LLM, SamplingParams

model_name = "xl-zhao/PromptCoT-Mamba-7B"
llm = LLM(model=model_name, tensor_parallel_size=1)

problem_statement = (
    "A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?"
)

prompt = (
    f"<|im_start|>user\n{problem_statement}\nPlease reason step by step, and put your final answer within \\boxed{{}}.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

# Generous token budget to leave room for long chain-of-thought solutions.
sampling_params = SamplingParams(temperature=0.8, max_tokens=65536)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```
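vLLM also accepts a batch of prompts in a single `generate` call, which is convenient when evaluating many problems at once. The sketch below reuses `llm` and `sampling_params` from the snippet above; the `build_prompt` helper and the second problem are ours, added for illustration.

```python
# Reuses `llm` and `sampling_params` from the snippet above.
problems = [
    "A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?",
    "What is the smallest positive integer divisible by both 6 and 10?",
]

def build_prompt(problem):
    # Same chat template as the single-prompt example.
    return (
        f"<|im_start|>user\n{problem}\nPlease reason step by step, "
        "and put your final answer within \\boxed{}.<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

outputs = llm.generate([build_prompt(p) for p in problems], sampling_params)
for problem, out in zip(problems, outputs):
    print(problem)
    print(out.outputs[0].text)
```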
## Citation
```bibtex
@article{zhao2025scaling,
  author  = {Xueliang Zhao and Wei Wu and Lingpeng Kong},
  title   = {Scaling Reasoning without Attention},
  journal = {arXiv preprint arXiv:2505.22425},
  year    = {2025},
  url     = {https://arxiv.org/abs/2505.22425}
}
```