QwQ-32B-ArliAI-RpR-v4 GGUF Models

Model Generation Details

This model was generated using llama.cpp at commit f5cd27b7.

Ultra-Low-Bit Quantization with IQ-DynamicGate (1-2 bit)

Our latest quantization method introduces precision-adaptive quantization for ultra-low-bit models (1-2 bit), with benchmark-proven improvements on Llama-3-8B. This approach uses layer-specific strategies to preserve accuracy while maintaining extreme memory efficiency.

Benchmark Context

All tests conducted on Llama-3-8B-Instruct using:

  • Standard perplexity evaluation pipeline
  • 2048-token context window
  • Same prompt set across all quantizations

Method

  • Dynamic Precision Allocation:
    • First/Last 25% of layers β†’ IQ4_XS (selected layers)
    • Middle 50% β†’ IQ2_XXS/IQ3_S (increase efficiency)
  • Critical Component Protection:
    • Embeddings/output layers use Q5_K
    • Reduces error propagation by 38% vs standard 1-2bit

Quantization Performance Comparison (Llama-3-8B)

Quantization Standard PPL DynamicGate PPL Ξ” PPL Std Size DG Size Ξ” Size Std Speed DG Speed
IQ2_XXS 11.30 9.84 -12.9% 2.5G 2.6G +0.1G 234s 246s
IQ2_XS 11.72 11.63 -0.8% 2.7G 2.8G +0.1G 242s 246s
IQ2_S 14.31 9.02 -36.9% 2.7G 2.9G +0.2G 238s 244s
IQ1_M 27.46 15.41 -43.9% 2.2G 2.5G +0.3G 206s 212s
IQ1_S 53.07 32.00 -39.7% 2.1G 2.4G +0.3G 184s 209s

Key:

  • PPL = Perplexity (lower is better)
  • Ξ” PPL = Percentage change from standard to DynamicGate
  • Speed = Inference time (CPU avx2, 2048 token context)
  • Size differences reflect mixed quantization overhead

Key Improvements:

  • πŸ”₯ IQ1_M shows massive 43.9% perplexity reduction (27.46 β†’ 15.41)
  • πŸš€ IQ2_S cuts perplexity by 36.9% while adding only 0.2GB
  • ⚑ IQ1_S maintains 39.7% better accuracy despite 1-bit quantization

Tradeoffs:

  • All variants have modest size increases (0.1-0.3GB)
  • Inference speeds remain comparable (<5% difference)

When to Use These Models

πŸ“Œ Fitting models into GPU VRAM

βœ” Memory-constrained deployments

βœ” Cpu and Edge Devices where 1-2bit errors can be tolerated

βœ” Research into ultra-low-bit quantization

Choosing the Right Model Format

Selecting the correct model format depends on your hardware capabilities and memory constraints.

BF16 (Brain Float 16) – Use if BF16 acceleration is available

  • A 16-bit floating-point format designed for faster computation while retaining good precision.
  • Provides similar dynamic range as FP32 but with lower memory usage.
  • Recommended if your hardware supports BF16 acceleration (check your device's specs).
  • Ideal for high-performance inference with reduced memory footprint compared to FP32.

πŸ“Œ Use BF16 if:
βœ” Your hardware has native BF16 support (e.g., newer GPUs, TPUs).
βœ” You want higher precision while saving memory.
βœ” You plan to requantize the model into another format.

πŸ“Œ Avoid BF16 if:
❌ Your hardware does not support BF16 (it may fall back to FP32 and run slower).
❌ You need compatibility with older devices that lack BF16 optimization.


F16 (Float 16) – More widely supported than BF16

  • A 16-bit floating-point high precision but with less of range of values than BF16.
  • Works on most devices with FP16 acceleration support (including many GPUs and some CPUs).
  • Slightly lower numerical precision than BF16 but generally sufficient for inference.

πŸ“Œ Use F16 if:
βœ” Your hardware supports FP16 but not BF16.
βœ” You need a balance between speed, memory usage, and accuracy.
βœ” You are running on a GPU or another device optimized for FP16 computations.

πŸ“Œ Avoid F16 if:
❌ Your device lacks native FP16 support (it may run slower than expected).
❌ You have memory limitations.


Quantized Models (Q4_K, Q6_K, Q8, etc.) – For CPU & Low-VRAM Inference

Quantization reduces model size and memory usage while maintaining as much accuracy as possible.

  • Lower-bit models (Q4_K) β†’ Best for minimal memory usage, may have lower precision.
  • Higher-bit models (Q6_K, Q8_0) β†’ Better accuracy, requires more memory.

πŸ“Œ Use Quantized Models if:
βœ” You are running inference on a CPU and need an optimized model.
βœ” Your device has low VRAM and cannot load full-precision models.
βœ” You want to reduce memory footprint while keeping reasonable accuracy.

πŸ“Œ Avoid Quantized Models if:
❌ You need maximum accuracy (full-precision models are better for this).
❌ Your hardware has enough VRAM for higher-precision formats (BF16/F16).


Very Low-Bit Quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)

These models are optimized for extreme memory efficiency, making them ideal for low-power devices or large-scale deployments where memory is a critical constraint.

  • IQ3_XS: Ultra-low-bit quantization (3-bit) with extreme memory efficiency.

    • Use case: Best for ultra-low-memory devices where even Q4_K is too large.
    • Trade-off: Lower accuracy compared to higher-bit quantizations.
  • IQ3_S: Small block size for maximum memory efficiency.

    • Use case: Best for low-memory devices where IQ3_XS is too aggressive.
  • IQ3_M: Medium block size for better accuracy than IQ3_S.

    • Use case: Suitable for low-memory devices where IQ3_S is too limiting.
  • Q4_K: 4-bit quantization with block-wise optimization for better accuracy.

    • Use case: Best for low-memory devices where Q6_K is too large.
  • Q4_0: Pure 4-bit quantization, optimized for ARM devices.

    • Use case: Best for ARM-based devices or low-memory environments.

Summary Table: Model Format Selection

Model Format Precision Memory Usage Device Requirements Best Use Case
BF16 Highest High BF16-supported GPU/CPUs High-speed inference with reduced memory
F16 High High FP16-supported devices GPU inference when BF16 isn't available
Q4_K Medium Low Low CPU or Low-VRAM devices Best for memory-constrained environments
Q6_K Medium Moderate CPU with more memory Better accuracy while still being quantized
Q8_0 High Moderate CPU or GPU with enough VRAM Best accuracy among quantized models
IQ3_XS Very Low Very Low Ultra-low-memory devices Extreme memory efficiency and low accuracy
Q4_0 Low Low ARM or low-memory devices llama.cpp can optimize for ARM devices

Included Files & Details

QwQ-32B-ArliAI-RpR-v4-bf16.gguf

  • Model weights preserved in BF16.
  • Use this if you want to requantize the model into a different format.
  • Best if your device supports BF16 acceleration.

QwQ-32B-ArliAI-RpR-v4-f16.gguf

  • Model weights stored in F16.
  • Use if your device supports FP16, especially if BF16 is not available.

QwQ-32B-ArliAI-RpR-v4-bf16-q8_0.gguf

  • Output & embeddings remain in BF16.
  • All other layers quantized to Q8_0.
  • Use if your device supports BF16 and you want a quantized version.

QwQ-32B-ArliAI-RpR-v4-f16-q8_0.gguf

  • Output & embeddings remain in F16.
  • All other layers quantized to Q8_0.

QwQ-32B-ArliAI-RpR-v4-q4_k.gguf

  • Output & embeddings quantized to Q8_0.
  • All other layers quantized to Q4_K.
  • Good for CPU inference with limited memory.

QwQ-32B-ArliAI-RpR-v4-q4_k_s.gguf

  • Smallest Q4_K variant, using less memory at the cost of accuracy.
  • Best for very low-memory setups.

QwQ-32B-ArliAI-RpR-v4-q6_k.gguf

  • Output & embeddings quantized to Q8_0.
  • All other layers quantized to Q6_K .

QwQ-32B-ArliAI-RpR-v4-q8_0.gguf

  • Fully Q8 quantized model for better accuracy.
  • Requires more memory but offers higher precision.

QwQ-32B-ArliAI-RpR-v4-iq3_xs.gguf

  • IQ3_XS quantization, optimized for extreme memory efficiency.
  • Best for ultra-low-memory devices.

QwQ-32B-ArliAI-RpR-v4-iq3_m.gguf

  • IQ3_M quantization, offering a medium block size for better accuracy.
  • Suitable for low-memory devices.

QwQ-32B-ArliAI-RpR-v4-q4_0.gguf

  • Pure Q4_0 quantization, optimized for ARM devices.
  • Best for low-memory environments.
  • Prefer IQ4_NL for better accuracy.

πŸš€ If you find these models useful

❀ Please click "Like" if you find this useful!
Help me test my AI-Powered Network Monitor Assistant with quantum-ready security checks:
πŸ‘‰ Free Network Monitor

πŸ’¬ How to test:
Choose an AI assistant type:

  • TurboLLM (GPT-4o-mini)
  • HugLLM (Hugginface Open-source)
  • TestLLM (Experimental CPU-only)

What I’m Testing

I’m pushing the limits of small open-source models for AI network monitoring, specifically:

  • Function calling against live network services
  • How small can a model go while still handling:
    • Automated Nmap scans
    • Quantum-readiness checks
    • Network Monitoring tasks

🟑 TestLLM – Current experimental model (llama.cpp on 2 CPU threads):

  • βœ… Zero-configuration setup
  • ⏳ 30s load time (slow inference but no API costs)
  • πŸ”§ Help wanted! If you’re into edge-device AI, let’s collaborate!

Other Assistants

🟒 TurboLLM – Uses gpt-4o-mini for:

πŸ”΅ HugLLM – Latest Open-source models:

  • 🌐 Runs on Hugging Face Inference API

πŸ’‘ Example commands to you could test:

  1. "Give me info on my websites SSL certificate"
  2. "Check if my server is using quantum safe encyption for communication"
  3. "Run a comprehensive security audit on my server"
  4. '"Create a cmd processor to .. (what ever you want)" Note you need to install a Free Network Monitor Agent to run the .net code from. This is a very flexible and powerful feature. Use with caution!

QwQ-32B-ArliAI-RpR-v4

clickbait

Image generated using Arli AI Image Generation https://www.arliai.com/image-generation

RpR v4 Changes:

The best RP/creative model from ArliAI yet again.

  • Reduced repetitions and impersonation

    To add to the creativity and out of the box thinking of RpR v3, a more advanced filtering method was used in order to remove examples where the LLM repeated similar phrases or talked for the user. Any repetition or impersonation cases that happens will be due to how the base QwQ model was trained, and not because of the RpR dataset.

  • Increased training sequence length

    The training sequence length was increased to 16K in order to help awareness and memory even on longer chats.

RpR Series Overview: Building on RPMax with Reasoning

RpR (RolePlay with Reasoning) is a new series of models from ArliAI. This series builds directly upon the successful dataset curation methodology and training methods developed for the RPMax series.

RpR models use the same curated, deduplicated RP and creative writing dataset used for RPMax, with a focus on variety to ensure high creativity and minimize cross-context repetition. Users familiar with RPMax will recognize the unique, non-repetitive writing style unlike other finetuned-for-RP models.

With the release of QwQ as the first high performing open-source reasoning model that can be easily trained, it was clear that the available instruct and creative writing reasoning datasets contains only one response per example. This is type of single response dataset used for training reasoning models causes degraded output quality in long multi-turn chats. Which is why Arli AI decided to create a real RP model capable of long multi-turn chat with reasoning.

In order to create RpR, we first had to actually create the reasoning RP dataset by re-processing our existing known-good RPMax dataset into a reasoning dataset. This was possible by using the base QwQ Instruct model itself to create the reasoning process for every turn in the RPMax dataset conversation examples, which is then further refined in order to make sure the reasoning is in-line with the actual response examples from the dataset.

Another important thing to get right is to make sure the model is trained on examples that present reasoning blocks in the same way as it encounters it during inference. Which is, never seeing the reasoning blocks in it's context. In order to do this, the training run was completed using axolotl with manual template-free segments dataset in order to make sure that the model is never trained to see the reasoning block in the context. Just like how the model will be used during inference time.

The result of training QwQ on this dataset with this method are consistently coherent and interesting outputs even in long multi-turn RP chats. This is as far as we know the first true correctly-trained reasoning model trained for RP and creative writing.

You can access the model at https://arliai.com and we also have a models ranking page at https://www.arliai.com/models-ranking

Ask questions in our new Discord Server https://discord.com/invite/t75KbPgwhk or on our subreddit https://www.reddit.com/r/ArliAI/

Model Description

QwQ-32B-ArliAI-RpR-v4 is the third release in the RpR series. It is a 32-billion parameter model fine-tuned using the RpR dataset based on the curated RPMax dataset combined with techniques to maintain reasoning abilities in long multi-turn chats.

Recommended Samplers

  • RpR models does not work well with repetition penalty type of samplers, even more advanced ones such as XTC or DRY.
  • It works best with simple sampler settings and also being allowed to reason for a long time (high max tokens).
  • You can download the ST master export uploaded in the files section of this repo as well.

Recommended to first start with:

  • Temperature: 1.0
  • MinP: 0.02
  • TopK: 40
  • Response Tokens: 2048+

Specs

  • Base Model: QwQ-32B
  • Max Context Length: Max 128K with Yarn (Same as base QwQ it is Natively 32K)
  • Parameters: 32B
  • Reasoning Model: Yes

Training Details

  • Sequence Length: 16384
  • Epochs: 1 epoch training (Inherited from RPMax methods)
  • Fine-tuning Method: RS-QLORA+ (Rank-Stabilized LoRA + LoRA Plus 8x)
  • Rank/Alpha: 128-rank 128-alpha
  • Learning Rate: 0.00001
  • Scheduler: Rex
  • Gradient accumulation: 32

Very Nice Training graphs :)

Train Loss Eval Loss

Quantization

How to use reasoning models correctly in ST

RpR ST Settings

For any reasoning models in general, you need to make sure to set:

  • Prefix is set to ONLY <think> and the suffix is set to ONLY </think> without any spaces or newlines (enter)

  • Reply starts with <think>

  • Always add character names is unchecked

  • Include names is set to never

  • As always the chat template should also conform to the model being used

Note: Reasoning models work properly only if include names is set to never, since they always expect the eos token of the user turn followed by the <think> token in order to start reasoning before outputting their response. If you set include names to enabled, then it will always append the character name at the end like "Seraphina:<eos_token>" which confuses the model on whether it should respond or reason first.

The rest of your sampler parameters can be set as you wish as usual.

If you don't see the reasoning wrapped inside the thinking block, then either your settings is still wrong and doesn't follow my example or that your ST version is too old without reasoning block auto parsing.

If you see the whole response is in the reasoning block, then your <think> and </think> reasoning token suffix and prefix might have an extra space or newline. Or the model just isn't a reasoning model that is smart enough to always put reasoning in between those tokens.

If you set everything up correctly, it should look like this:

RpR example response
Details: The RPMax Foundation (Dataset & Training Philosophy)

The following sections detail the core philosophy behind the dataset and training methodology originally developed for RPMax, which serves as the foundation for the RpR series.

The Goal: Reduced Repetition and Higher Creativity

The goal of the dataset curation used for both RPMax and RpR is to reduce repetitions and increase the models ability to creatively write in different situations presented to it. What this means is it is a model that will output responses very differently without falling into predictable tropes across different situations.

What is repetition and creativity?

First of all, creativity should mean the variety in output that the model is capable of creating. You should not confuse creativity with writing prose. When a model writes in a way that can be said to be pleasant like writers would write in a novel, this is not creative writing. This is just a model having a certain pleasant type of writing prose. So a model that writes nicely is not necessarily a creative model.

Repetition and creativity are essentially intertwined with each other, so if a model is repetitive then a model can also be said to be un-creative as it cannot write new things and can only repeat similar responses that it has created before. For repetition there are actually two very different forms of repetition.

In-context repetition: When people mention a model is repetitive, this usually mean a model that likes to repeat the same phrases in a single conversation. An example of this is when a model says that a character "flicks her hair and...." and then starts to prepend that "flicks her hair and..." into every other action that character does.

It can be said that the model is boring, but even in real people's writing it is possible that this kind of repetition could be intentional to subtly prove a point or showcase a character's traits in some scenarios. So this type of repetition is not always bad and completely discouraging a model from doing this does not always lead to improve a model's writing ability.

In this regard, RPMax and RpR is not yet focused on eliminating this type of repetition so there might be some in-context repetition that can be seen in the outputs. Eliminating this will be the next big step of the RPMax and RpR series of models.

Cross-context repetition: A second worse type of repetition is a model's tendency to repeat the same phrases or tropes in very different situations. An example is a model that likes to repeat the infamous "shivers down my spine" phrase in wildly different conversations that don't necessarily fit with that phrase.

This type of repetition is ALWAYS bad as it is a sign that the model has over-fitted into that style of "creative writing" that it has often seen in the training dataset. A model's tendency to have cross-context repetition is also usually visible in how a model likes to choose similar repetitive names when writing stories. Such as the infamous "elara" and "whispering woods" names.

The primary goal of the dataset curation for RPMax and RpR is to create a highly creative model by reducing cross-context repetition, as that is the type of repetition that follows you through different conversations. This is combated by making sure the dataset does not have repetitions of the same situations or characters in different example entries.

Dataset Curation

The success of models trained on this dataset (including RPMax and now RpR) is thanks to the training method and the unique dataset created for fine-tuning. It contains as many open source creative writing and RP datasets that can be found (all from Hugging Face), from which have been curated to weed out datasets that are purely synthetic generations as they often only serve to dumb down the model and make the model learn GPT-isms (slop) rather than help.

Then Llama 3.1 8B (or a similarly capable model) is used to create a database of the characters and situations that are portrayed in these datasets, which is then used to de-dupe these datasets to make sure that there is only a single entry of any character or situation.

The Golden Rule of Fine-Tuning

Unlike the initial pre-training stage where the more data you throw at it the better it becomes for the most part, the golden rule for fine-tuning models isn't quantity, but instead quality over quantity. So the dataset used here is actually orders of magnitude smaller than it would be if it included repeated characters and situations in the dataset, but the end result is a model that does not feel like just another "in-breed" of another creative writing/RP model.

Training Parameters and Unconventional Approach

The usual way is to have a low learning rate and high gradient accumulation for better loss stability, and then run multiple epochs of the training run until the loss is acceptable.

The RPMax and RpR methodology, however, uses only one single epoch, a low gradient accumulation, and a higher than normal learning rate. The loss curve during training is actually unstable and jumps up and down a lot, but if it is smoothed out, it is steadily decreasing over time. The theory is that this allows the models to learn from each individual example in the dataset much more, and by not showing the model the same example twice using multiple epochs, it stops the model from latching on and reinforcing a single character or story trope.

The jumping up and down of loss during training is because as the model gets trained on a new entry from the dataset, the model will have never seen a similar example before and therefore can't really predict an answer similar to the example entry. While the relatively high end loss of 1.0 or slightly above is actually acceptable because the goal was never to create a model that can output exactly like the dataset that is being used to train it. Rather to create a model that is creative enough to make up it's own style of responses.

This is different from training a model in a particular domain and needing the model to reliably be able to output like the example dataset, such as when training a model on a company's internal knowledge base.


Try It Out!

Model preference is subjective, so please do try QwQ-32B-ArliAI-RpR-v4 for yourself. Your feedback both good and bad is always valueable and will help us improve the future RPMax and RpR models.

Downloads last month
1,019
GGUF
Model size
32.8B params
Architecture
qwen2
Hardware compatibility
Log In to view the estimation

1-bit

2-bit

3-bit

4-bit

5-bit

6-bit

8-bit

16-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for Mungert/QwQ-32B-ArliAI-RpR-v4-GGUF

Base model

Qwen/Qwen2.5-32B
Finetuned
Qwen/QwQ-32B
Quantized
(165)
this model