Elastic model: Mistral-Small-3.1-24B-Instruct-2503. The fastest and most flexible models for self-hosting.

Elastic models are produced by TheStage AI ANNA (Automated Neural Networks Accelerator). ANNA lets you control model size, latency, and quality with a simple slider movement. For each model, ANNA produces a series of optimized versions:

  • XL: Mathematically equivalent neural network, optimized with our DNN compiler.

  • L: Near-lossless model, with less than 1% degradation on the corresponding benchmarks.

  • M: Faster model, with accuracy degradation less than 1.5%.

  • S: The fastest model, with accuracy degradation less than 2%.

Goals of elastic models:

  • Provide flexibility in the cost-vs-quality trade-off for inference
  • Provide clear quality and latency benchmarks
  • Provide the interface of the HF libraries transformers and diffusers with a single line of code changed
  • Provide models supported on a wide range of hardware, pre-compiled and requiring no JIT compilation
  • Provide the best models and service for self-hosting

Note that the actual quality degradation varies from model to model; an S model, for instance, may show only 0.5% degradation. The tier is selected with a single argument, as sketched below.
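
As a preview of the interface shown in the Inference section below, here is a minimal sketch of tier selection (the placeholder token and the chosen tier are purely illustrative):

import torch
from elastic_models.transformers import AutoModelForCausalLM

model_name = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"

# The elastic tier is chosen via the `mode` argument; valid values
# correspond to the list above: "XL", "L", "M", "S".
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    token="<YOUR_HF_TOKEN>",  # placeholder, not a real token
    torch_dtype=torch.bfloat16,
    mode="S",  # swap for "XL", "L" or "M" to trade speed for accuracy
).to(torch.device("cuda"))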

Performance Graph

Inference

To run inference with our models, you just need to replace the transformers import with elastic_models.transformers:

import torch
from transformers import AutoTokenizer
from elastic_models.transformers import AutoModelForCausalLM

# Your HF token is currently required because we load the original
# weights for some layers as well as the model configuration.
model_name = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
hf_token = ''
device = torch.device("cuda")

# Create the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(
    model_name, token=hf_token
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    token=hf_token,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    mode='S'
).to(device)
model.generation_config.pad_token_id = tokenizer.eos_token_id

# Inference is as simple as with the transformers library
prompt = "Describe basics of DNNs quantization."
messages = [
  {
    "role": "system",
    "content": "You are a search bot, answer on user text queries."
  },
  {
    "role": "user",
    "content": prompt
  }
]

chat_prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

inputs = tokenizer(chat_prompt, return_tensors="pt").to(device)

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_length=500)

input_len = inputs['input_ids'].shape[1]
generate_ids = generate_ids[:, input_len:]
output = tokenizer.batch_decode(
    generate_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]

# Print the question and the generated answer
print(f"# Q:\n{prompt}\n")
print(f"# A:\n{output}\n")

System requirements:

  • GPUs: H100, L40S
  • CPU: AMD, Intel
  • Python: 3.10-3.12

To work with our models, just run these lines in your terminal:

pip install thestage
pip install elastic_models[nvidia]\
 --index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple\
 --extra-index-url https://pypi.nvidia.com\
 --extra-index-url https://pypi.org/simple

pip install flash_attn==2.7.3 --no-build-isolation
pip uninstall apex
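
A quick sanity check after installation, assuming the packages above resolved correctly, is to import them and confirm the GPU is visible:

import torch
import flash_attn
import elastic_models

print("torch:", torch.__version__)
print("flash_attn:", flash_attn.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))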

Then go to app.thestage.ai, log in, and generate an API token from your profile page. Set up the API token as follows:

thestage config set --api-token <YOUR_API_TOKEN>

Congrats, now you can use accelerated models!


Benchmarks

Benchmarking is one of the most important procedures during model acceleration. We aim to provide clear performance metrics for models accelerated with our algorithms. The "W8A8, int8" column indicates that W8A8 quantization with the int8 data type was applied to all linear layers, using the same calibration data as ANNA. The S model achieves practically identical speed but much higher quality, because ANNA knows how to preserve quantization quality on sensitive layers.
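
For readers unfamiliar with the baseline, here is a rough sketch of generic per-tensor symmetric W8A8 int8 quantization of a single linear layer. This is only an illustration of the scheme the column name refers to, not TheStage AI's or ANNA's actual implementation, and the int8 matmul is emulated in float for simplicity:

import torch

def quantize_int8(x: torch.Tensor):
    # Symmetric per-tensor quantization: x ≈ scale * q, with q in int8.
    scale = x.abs().max() / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

def w8a8_linear(activation: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    # Quantize both the weights (W8) and the activations (A8),
    # multiply, then rescale the result back to floating point.
    qa, sa = quantize_int8(activation)
    qw, sw = quantize_int8(weight)
    # Real kernels accumulate in int32; we emulate with float here.
    acc = qa.to(torch.float32) @ qw.to(torch.float32).T
    return acc * (sa * sw)

x = torch.randn(2, 16)
w = torch.randn(8, 16)
err = (w8a8_linear(x, w) - x @ w.T).abs().max()
print(f"max abs error vs. fp32: {err:.4f}")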

Quality benchmarks

Metric/Model    S      M      L      XL     Original  W8A8, int8
arc_challenge   65.30  66.30  66.70  66.80  66.80     51.10
gsm8k           87.70  88.40  87.70  88.86  88.86     13.49
mmlu            79.00  79.40  79.70  80.20  80.20     60.45
piqa            82.90  83.10  82.60  83.00  83.00     75.35
winogrande      78.20  79.40  79.30  79.50  79.50     71.19

  • MMLU: Evaluates general knowledge across 57 subjects including science, humanities, engineering, and more. Shows model's ability to handle diverse academic topics.
  • PIQA: Evaluates physical commonsense reasoning through questions about everyday physical interactions. Shows model's understanding of real-world physics concepts.
  • Arc Challenge: Evaluates grade-school level multiple-choice questions requiring reasoning. Shows model's ability to solve complex reasoning tasks.
  • Winogrande: Evaluates commonsense reasoning through sentence completion tasks. Shows model's capability to understand context and resolve ambiguity.
  • GSM8K (Grade School Math 8K): Evaluates grade-school math word problems on a dataset of 8.5K high-quality, linguistically diverse problems. Shows model's ability to perform multi-step mathematical reasoning.
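
The quality numbers above come from standard benchmarks, so they can be approximated with off-the-shelf tooling. Below is a hedged sketch using the EleutherAI lm-evaluation-harness (the lm_eval package and its HFLM wrapper are assumptions here, not tooling this card prescribes), reusing the model and tokenizer from the inference example:

import lm_eval
from lm_eval.models.huggingface import HFLM

# Wrap the already loaded elastic model for the harness.
lm = HFLM(pretrained=model, tokenizer=tokenizer, batch_size=8)

results = lm_eval.simple_evaluate(
    model=lm,
    tasks=["arc_challenge", "gsm8k", "mmlu", "piqa", "winogrande"],
)
print(results["results"])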

Latency benchmarks

Performance by Context Size

The tables below show performance (tokens per second) for different input context sizes across different GPU models and batch sizes:

H100:

Batch Size 1:

Context  Input Tokens  S     M     L     XL    Original
Small    256           90.3  82.5  72.2  54.4  41.2
Medium   1024          90.1  82.2  71.8  -     38.8
Large    4096          88.2  81.0  70.4  -     33.8

Batch Size 8:

Context  Input Tokens  S     M     L     XL    Original
Small    256           86.5  79.9  69.1  -     36.7
Medium   1024          80.3  74.9  65.1  -     29.0
Large    4096          63.3  59.5  53.1  -     15.5

Batch Size 16:

Context  Input Tokens  S     M     L     XL    Original
Small    256           84.7  78.1  68.0  -     32.2
Medium   1024          79.8  73.3  64.1  -     21.8
Large    4096          62.5  58.1  52.7  -     9.7

L40S:

Batch Size 1:

Context  Input Tokens  S     M     L     XL    Original
Small    256           26.0  24.0  21.0  -     -
Medium   1024          25.8  23.8  20.9  -     -
Large    4096          25.1  23.3  20.5  -     -

Batch Size 8:

Context  Input Tokens  S     M     L     XL    Original
Small    256           25.2  23.2  20.4  -     -
Medium   1024          24.3  22.4  19.8  -     -
Large    4096          -     -     -     -     -

Batch Size 16:

Context  Input Tokens  S     M     L     XL    Original
Small    256           24.5  22.6  19.9  -     -
Medium   1024          22.8  20.9  -     -     -
Large    4096          -     -     -     -     -

Note: Results show tokens per second (TPS) for text generation with 100 new output tokens. Performance varies with GPU model, context size, and batch size.
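
For reference, here is a minimal sketch of how tokens per second can be estimated with the model and inputs from the inference example. The exact harness behind the tables above is not specified here, so treat this only as an approximation:

import time
import torch

def measure_tps(model, inputs, new_tokens: int = 100) -> float:
    # Warm-up run so one-time initialization does not skew the timing.
    with torch.inference_mode():
        model.generate(**inputs, max_new_tokens=new_tokens, min_new_tokens=new_tokens)
    torch.cuda.synchronize()

    start = time.perf_counter()
    with torch.inference_mode():
        model.generate(**inputs, max_new_tokens=new_tokens, min_new_tokens=new_tokens)
    torch.cuda.synchronize()

    # Tokens per second for a single sequence in the batch.
    return new_tokens / (time.perf_counter() - start)

print(f"{measure_tps(model, inputs):.1f} tokens/s")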
