---
library_name: transformers
base_model: Dans-DiscountModels/Mistral-Nemo-Base-2407-ChatML-Mod
tags:
- axolotl
- generated_from_trainer
datasets:
- PocketDoc/Dans-Reasoningmaxx-NaturalReasoning
- PocketDoc/Dans-Reasoningmaxx-WebInstruct
- PocketDoc/Dans-Benchmaxx-COT
- PocketDoc/Dans-Logicmaxx-SAT-AP
- PocketDoc/Dans-Assistantmaxx-Opus-Merge
- PocketDoc/Dans-Assistantmaxx-sonnetorca-subset
model-index:
- name: 12b-mn-dans-reasoning-test-4
  results: []
---

[Built with Axolotl](https://github.com/axolotl-ai-cloud/axolotl)
<details><summary>See axolotl config</summary>

axolotl version: `0.8.0.dev0`
```yaml
base_model: Dans-DiscountModels/Mistral-Nemo-Base-2407-ChatML-Mod
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

trust_remote_code:

# wandb configuration
wandb_project: 12b-mn-dans-reasoning-test
wandb_watch:
wandb_run_id: V0.0.3-1-3 # V{Version}-{Run Number}-{Attempt Number}
wandb_log_model:

# push checkpoints to hub
hub_model_id: Dans-DiscountModels/12b-mn-dans-reasoning-test-4
# how to push checkpoints to hub
# https://huggingface.co/docs/transformers/v4.31.0/en/main_classes/trainer#transformers.TrainingArguments.hub_strategy
hub_strategy: "every_save"
# Whether to use hf `use_auth_token` for loading datasets. Useful for fetching private datasets
# Required to be true when used in combination with `push_dataset_to_hub`
hf_use_auth_token: true

# where to save the finished model to
output_dir: ./12b-mn-dans-reasoning-test

save_safetensors: true

# dataset settings (local or huggingface repo)
datasets:
  - path: PocketDoc/Dans-Reasoningmaxx-NaturalReasoning
    type: dan-chat-advanced
  - path: PocketDoc/Dans-Reasoningmaxx-WebInstruct
    type: dan-chat-advanced
  - path: PocketDoc/Dans-Benchmaxx-COT
    type: dan-chat-advanced
  - path: PocketDoc/Dans-Logicmaxx-SAT-AP
    type: dan-chat-advanced
  - path: PocketDoc/Dans-Assistantmaxx-Opus-Merge
    type: dan-chat-advanced
  - path: PocketDoc/Dans-Assistantmaxx-sonnetorca-subset
    type: dan-chat-advanced

plugins:
  - axolotl.integrations.liger.LigerPlugin
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin

liger_rope: true
liger_rms_norm: true
liger_layer_norm: true
liger_glu_activation: true
liger_fused_linear_cross_entropy: false
cut_cross_entropy: true

load_in_8bit: false
load_in_4bit: false
strict: false

adapter:
lora_model_dir:

lora_r: 128
lora_alpha: 128
lora_dropout: 0.1
lora_target_linear: True
lora_target_modules:
lora_modules_to_save:
  - embed_tokens
  - lm_head
lora_fan_in_fan_out:
peft_use_rslora: true

dataset_prepared_path: ./12b-mn-dans-reasoning-test-data
val_set_size: 0.005

sequence_len: 8192
sample_packing: true
eval_sample_packing: true
pad_to_sequence_len: true

gradient_checkpointing: true
# gradient_checkpointing_kwargs:
#   use_reentrant: false

gradient_accumulation_steps: 1
micro_batch_size: 4
num_epochs: 2

optimizer: came_pytorch
lr_scheduler: rex
learning_rate: 0.0000015
cosine_min_lr_ratio: 0.1

weight_decay: 0.1
max_grad_norm: 0.1

train_on_inputs: false
group_by_length: true

bf16: true
fp16: false
tf32: false

early_stopping_patience:
resume_from_checkpoint:
auto_resume_from_checkpoints: true
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_ratio: 0.05
evals_per_epoch: 16
eval_table_size:
eval_max_new_tokens:
saves_per_epoch: 8
save_total_limit: 1
debug: false

deepspeed: deepspeed_configs/zero3_bf16.json

fsdp:
fsdp_config:

special_tokens:
```

</details>
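For reference, the effective global batch size follows from the batch settings in this config together with the 4-GPU DeepSpeed ZeRO-3 setup reported in the training hyperparameters below. A minimal sketch of the arithmetic (the device count is taken from that hyperparameters section, not from the config itself):

```python
# Sketch: effective global batch size implied by the config above.
micro_batch_size = 4             # per-device batch size from the config
gradient_accumulation_steps = 1  # from the config
num_devices = 4                  # from the "Training hyperparameters" section below

total_train_batch_size = micro_batch_size * gradient_accumulation_steps * num_devices
print(total_train_batch_size)    # 16, matching the reported total_train_batch_size

# warmup_ratio 0.05 over the ~776 optimizer steps of the two-epoch run
# likewise corresponds to the 38 warmup steps reported below.
```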

# 12b-mn-dans-reasoning-test-4

This model is a fine-tuned version of [Dans-DiscountModels/Mistral-Nemo-Base-2407-ChatML-Mod](https://huggingface.co/Dans-DiscountModels/Mistral-Nemo-Base-2407-ChatML-Mod) on the PocketDoc/Dans-Reasoningmaxx-NaturalReasoning, PocketDoc/Dans-Reasoningmaxx-WebInstruct, PocketDoc/Dans-Benchmaxx-COT, PocketDoc/Dans-Logicmaxx-SAT-AP, PocketDoc/Dans-Assistantmaxx-Opus-Merge, and PocketDoc/Dans-Assistantmaxx-sonnetorca-subset datasets.
It achieves the following results on the evaluation set:
- Loss: 0.6204

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1.5e-06
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- distributed_type: multi-GPU
- num_devices: 4
- total_train_batch_size: 16
- total_eval_batch_size: 16
- optimizer: AdamW (OptimizerNames.ADAMW_HF) with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 38
- num_epochs: 2.0

### Training results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 0.8049        | 0.0026 | 1    | 0.8338          |
| 0.8006        | 0.0644 | 25   | 0.7551          |
| 0.7197        | 0.1289 | 50   | 0.7009          |
| 0.7165        | 0.1933 | 75   | 0.6810          |
| 0.7183        | 0.2577 | 100  | 0.6697          |
| 0.6671        | 0.3222 | 125  | 0.6620          |
| 0.6406        | 0.3866 | 150  | 0.6567          |
| 0.6656        | 0.4510 | 175  | 0.6528          |
| 0.6539        | 0.5155 | 200  | 0.6489          |
| 0.634         | 0.5799 | 225  | 0.6460          |
| 0.6606        | 0.6443 | 250  | 0.6428          |
| 0.6815        | 0.7088 | 275  | 0.6401          |
| 0.6082        | 0.7732 | 300  | 0.6385          |
| 0.6754        | 0.8376 | 325  | 0.6364          |
| 0.6284        | 0.9021 | 350  | 0.6347          |
| 0.6517        | 0.9665 | 375  | 0.6326          |
| 0.5583        | 1.0309 | 400  | 0.6340          |
| 0.5716        | 1.0954 | 425  | 0.6328          |
| 0.5799        | 1.1598 | 450  | 0.6323          |
| 0.5957        | 1.2242 | 475  | 0.6316          |
| 0.589         | 1.2887 | 500  | 0.6300          |
| 0.6007        | 1.3531 | 525  | 0.6289          |
| 0.5751        | 1.4175 | 550  | 0.6284          |
| 0.5627        | 1.4820 | 575  | 0.6275          |
| 0.5689        | 1.5464 | 600  | 0.6267          |
| 0.5098        | 1.6108 | 625  | 0.6259          |
| 0.5623        | 1.6753 | 650  | 0.6248          |
| 0.5608        | 1.7397 | 675  | 0.6237          |
| 0.5552        | 1.8041 | 700  | 0.6233          |
| 0.554         | 1.8686 | 725  | 0.6228          |
| 0.5696        | 1.9330 | 750  | 0.6217          |
| 0.5537        | 1.9974 | 775  | 0.6204          |

### Framework versions

- Transformers 4.49.0
- Pytorch 2.4.1+cu124
- Datasets 3.2.0
- Tokenizers 0.21.1
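
## Inference example

A minimal inference sketch with 🤗 Transformers is shown below. It is an illustration rather than an official usage guide: it assumes the uploaded tokenizer ships a ChatML-style chat template (the base model is a ChatML-modified Mistral-Nemo base), and the generation settings are placeholders, not recommended values.

```python
# Minimal inference sketch; assumes a ChatML-style chat template in the tokenizer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Dans-DiscountModels/12b-mn-dans-reasoning-test-4"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the model was trained in bf16
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a careful reasoning assistant."},
    {"role": "user", "content": "A train travels 120 km in 1.5 hours. What is its average speed?"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
# Strip the prompt tokens and print only the generated continuation.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```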