Scaling Reasoning without Attention


🚀 Overview

PromptCoT-Mamba establishes the first attention-free foundation model capable of surpassing strong Transformer baselines across a broad suite of competition-level math and code reasoning tasks. Built on the Mamba-2 architecture and trained through a structured, two-stage curriculum using the PromptCoT pipeline, it delivers high accuracy with constant-memory inference, eliminating the need for KV caching.
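
Because the Mamba-2 backbone keeps a fixed-size recurrent state per layer instead of a key/value cache that grows with every generated token, decoding memory stays flat however long the reasoning chain gets. The sketch below illustrates the contrast using assumed, illustrative Transformer hyperparameters (not this model's actual configuration):

def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # 2x accounts for storing both keys and values, per layer and per token.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

for seq_len in (4_096, 32_768, 65_536):
    gb = kv_cache_bytes(seq_len) / 1e9
    print(f"{seq_len:>6} tokens -> {gb:.2f} GB of KV cache")
# An attention-free model's recurrent state stays the same size regardless of
# how many tokens have been generated.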


📈 Key Results

🔹 General Performance

| Model | MATH-500 | AIME 24 | AIME 25 | OlympiadBench | HumanEval | HumanEval+ | LiveCodeBench |
|---|---|---|---|---|---|---|---|
| PromptCoT-Mamba-7B | 84.6 | 35.2 | 24.6 | 50.7 | 81.7 | 75.0 | 29.9 |
| Gemma3-27B | 89.0 | 32.6 | 24.0 | 54.2 | 86.0 | 78.0 | 26.9 |
| Gemma3-12B | 83.8 | 22.9 | 19.2 | 49.9 | 81.1 | 73.2 | 22.2 |
| Sky-T1-7B | 85.0 | 19.2 | 19.2 | 49.2 | 41.5 | 37.2 | 18.3 |
| S1.1-7B | 82.0 | 19.2 | 17.5 | 43.1 | 64.0 | 56.7 | 13.3 |
| Bespoke-Stratos-7B | 81.2 | 18.3 | 16.3 | 45.0 | 73.2 | 68.3 | 8.6 |
| Nemotron-H-8B | 77.6 | -- | -- | -- | 79.3 | 74.4 | -- |
| M1-3B | 81.7 | 23.0 | 22.0 | 43.6 | -- | -- | -- |

🔍 PromptCoT-Mamba-7B outperforms the 7B-scale Transformer and hybrid Mamba-Transformer baselines on nearly every task above.


🔹 Math Specialization vs. Generalist

| Model | MATH-500 | AIME 24 | AIME 25 | OlympiadBench | HumanEval | HumanEval+ | LiveCodeBench |
|---|---|---|---|---|---|---|---|
| PromptCoT-Mamba-Math-7B | 88.0 | 42.9 | 30.8 | 52.1 | 71.3 | 66.5 | 20.3 |
| PromptCoT-Mamba-7B | 84.6 | 35.2 | 24.6 | 50.7 | 81.7 | 75.0 | 29.9 |

🎯 The math-specialized variant improves AIME 24 by 7.7 points and AIME 25 by 6.2 points, at the cost of a modest drop in code performance (e.g. LiveCodeBench falls from 29.9 to 20.3).


⚑ Inference Efficiency

Using vLLM under constrained memory, PromptCoT-Mamba-7B demonstrates substantial speedups over the S1.1-7B Transformer baseline:

  • 💡 3.66× faster at long-sequence generation on a 24GB GPU
  • 💡 1.69× faster under a 72GB memory budget

⚙️ Practical for cost-sensitive or long-context inference workloads at scale.
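
A minimal way to probe this yourself is vLLM's gpu_memory_utilization knob, which caps the memory the engine may use. Everything below (the memory fraction, prompt, and token budget) is an illustrative assumption rather than the paper's exact benchmark setup:

import time
from vllm import LLM, SamplingParams

# Cap vLLM's memory budget to mimic a smaller GPU (illustrative fraction).
llm = LLM(model="xl-zhao/PromptCoT-Mamba-7B", gpu_memory_utilization=0.5)

prompt = (
    "<|im_start|>user\nProve that the sum of any two even integers is even.\n"
    "Please reason step by step, and put your final answer within \\boxed{}.<|im_end|>\n"
    "<|im_start|>assistant\n"
)
sampling_params = SamplingParams(temperature=0.8, max_tokens=8192)

start = time.perf_counter()
outputs = llm.generate([prompt], sampling_params)
elapsed = time.perf_counter() - start

n_tokens = len(outputs[0].outputs[0].token_ids)
print(f"Generated {n_tokens} tokens in {elapsed:.1f}s ({n_tokens / elapsed:.1f} tok/s)")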


🧪 Quick Start

🔧 Install Requirements

pip install transformers vllm torch accelerate

🧠 Load and Run the Model

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "xl-zhao/PromptCoT-Mamba-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load in bfloat16 (the precision of the released weights) to halve memory use.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to("cuda")

problem_statement = (
    "A robe takes 2 bolts of blue fiber and half that much white fiber.  How many bolts in total does it take?"
)

# Chat-style prompt: the model is trained to reason step by step and to place
# its final answer inside \boxed{}.
prompt = (
    f"<|im_start|>user\n{problem_statement}\nPlease reason step by step, and put your final answer within \\boxed{{}}.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.no_grad():
    # do_sample=True is required for the temperature setting to take effect.
    output = model.generate(**inputs, max_length=65536, do_sample=True, temperature=0.8)

generated_solution = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_solution)
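
The prompt instructs the model to wrap its final answer in \boxed{...}, so a simple regex (not part of the official pipeline, just a convenience) is enough to extract it from the decoded text:

import re

# Take the last \boxed{...} occurrence: the final answer comes after the
# model's chain of thought.
matches = re.findall(r"\\boxed\{([^{}]*)\}", generated_solution)
final_answer = matches[-1] if matches else None
print("Final answer:", final_answer)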

⚑ Fast Inference with vLLM

from vllm import LLM, SamplingParams

model_name = "xl-zhao/PromptCoT-Mamba-7B"
llm = LLM(model=model_name, tensor_parallel_size=1)

problem_statement = (
    "A robe takes 2 bolts of blue fiber and half that much white fiber.  How many bolts in total does it take?"
)

prompt = (
    f"<|im_start|>user\n{problem_statement}\nPlease reason step by step, and put your final answer within \\boxed{{}}.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

sampling_params = SamplingParams(temperature=0.8, max_tokens=65536)
outputs = llm.generate([prompt], sampling_params)

print(outputs[0].outputs[0].text)
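
Since llm.generate accepts a list of prompts, several problems can be batched into a single call. Continuing from the snippet above (the second problem is just an illustrative placeholder):

problems = [
    problem_statement,
    "What is the remainder when 2^10 is divided by 7?",  # illustrative extra problem
]
prompts = [
    f"<|im_start|>user\n{p}\nPlease reason step by step, and put your final "
    f"answer within \\boxed{{}}.<|im_end|>\n<|im_start|>assistant\n"
    for p in problems
]
outputs = llm.generate(prompts, sampling_params)
for out in outputs:
    # Each element corresponds to one prompt, in order.
    print(out.outputs[0].text.strip()[:300])
    print("---")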

📜 Citation

@article{zhao2025scaling,
  author    = {Xueliang Zhao and Wei Wu and Lingpeng Kong},
  title     = {Scaling Reasoning without Attention},
  journal   = {arXiv preprint arXiv:2505.22425},
  year      = {2025},
  url       = {https://arxiv.org/abs/2505.22425}
}