---
library_name: transformers
base_model: Dans-DiscountModels/Mistral-Nemo-Base-2407-ChatML-Mod
tags:
- axolotl
- generated_from_trainer
datasets:
- PocketDoc/Dans-Reasoningmaxx-NaturalReasoning
- PocketDoc/Dans-Reasoningmaxx-WebInstruct
- PocketDoc/Dans-Benchmaxx-COT
- PocketDoc/Dans-Logicmaxx-SAT-AP
- PocketDoc/Dans-Assistantmaxx-Opus-Merge
- PocketDoc/Dans-Assistantmaxx-sonnetorca-subset
model-index:
- name: 12b-mn-dans-reasoning-test-4
  results: []
---

[Built with Axolotl](https://github.com/axolotl-ai-cloud/axolotl)
<details><summary>See axolotl config</summary>

axolotl version: `0.8.0.dev0`
```yaml
base_model: Dans-DiscountModels/Mistral-Nemo-Base-2407-ChatML-Mod
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

trust_remote_code:

# wandb configuration
wandb_project: 12b-mn-dans-reasoning-test
wandb_watch:
wandb_run_id: V0.0.3-1-3 # V{Version}-{Run Number}-{Attempt Number}
wandb_log_model:

# push checkpoints to hub
hub_model_id: Dans-DiscountModels/12b-mn-dans-reasoning-test-4
# how to push checkpoints to hub
# https://huggingface.co/docs/transformers/v4.31.0/en/main_classes/trainer#transformers.TrainingArguments.hub_strategy
hub_strategy: "every_save"
# Whether to use hf `use_auth_token` for loading datasets. Useful for fetching private datasets
# Required to be true when used in combination with `push_dataset_to_hub`
hf_use_auth_token: true

# where to save the finished model to
output_dir: ./12b-mn-dans-reasoning-test

save_safetensors: true

# dataset settings (local or huggingface repo)
datasets:
  - path: PocketDoc/Dans-Reasoningmaxx-NaturalReasoning
    type: dan-chat-advanced
  - path: PocketDoc/Dans-Reasoningmaxx-WebInstruct
    type: dan-chat-advanced
  - path: PocketDoc/Dans-Benchmaxx-COT
    type: dan-chat-advanced
  - path: PocketDoc/Dans-Logicmaxx-SAT-AP
    type: dan-chat-advanced
  - path: PocketDoc/Dans-Assistantmaxx-Opus-Merge
    type: dan-chat-advanced
  - path: PocketDoc/Dans-Assistantmaxx-sonnetorca-subset
    type: dan-chat-advanced

plugins:
  - axolotl.integrations.liger.LigerPlugin
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin

liger_rope: true
liger_rms_norm: true
liger_layer_norm: true
liger_glu_activation: true
liger_fused_linear_cross_entropy: false
cut_cross_entropy: true

load_in_8bit: false
load_in_4bit: false
strict: false

adapter:
lora_model_dir:

lora_r: 128
lora_alpha: 128
lora_dropout: 0.1
lora_target_linear: True
lora_target_modules:
lora_modules_to_save:
  - embed_tokens
  - lm_head
lora_fan_in_fan_out:
peft_use_rslora: true

dataset_prepared_path: ./12b-mn-dans-reasoning-test-data
val_set_size: 0.005

sequence_len: 8192
sample_packing: true
eval_sample_packing: true
pad_to_sequence_len: true

gradient_checkpointing: true
# gradient_checkpointing_kwargs:
#   use_reentrant: false

gradient_accumulation_steps: 1
micro_batch_size: 4
num_epochs: 2

optimizer: came_pytorch
lr_scheduler: rex
learning_rate: 0.0000015
cosine_min_lr_ratio: 0.1

weight_decay: 0.1
max_grad_norm: 0.1

train_on_inputs: false
group_by_length: true

bf16: true
fp16: false
tf32: false

early_stopping_patience:
resume_from_checkpoint:
auto_resume_from_checkpoints: true
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_ratio: 0.05
evals_per_epoch: 16
eval_table_size:
eval_max_new_tokens:
saves_per_epoch: 8
save_total_limit: 1
debug: false

deepspeed: deepspeed_configs/zero3_bf16.json

fsdp:
fsdp_config:

special_tokens:
```

</details>
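For reference, the effective global batch size follows from the batch settings in this config together with the 4-GPU DeepSpeed ZeRO-3 setup reported in the training hyperparameters below. A minimal sketch of the arithmetic (the device count is taken from that hyperparameters section, not from the config itself):

```python
# Sketch: effective global batch size implied by the config above.
micro_batch_size = 4             # per-device batch size from the config
gradient_accumulation_steps = 1  # from the config
num_devices = 4                  # from the "Training hyperparameters" section below

total_train_batch_size = micro_batch_size * gradient_accumulation_steps * num_devices
print(total_train_batch_size)    # 16, matching the reported total_train_batch_size

# warmup_ratio 0.05 over the ~776 optimizer steps of the two-epoch run
# likewise corresponds to the 38 warmup steps reported below.
```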

# 12b-mn-dans-reasoning-test-4

This model is a fine-tuned version of [Dans-DiscountModels/Mistral-Nemo-Base-2407-ChatML-Mod](https://huggingface.co/Dans-DiscountModels/Mistral-Nemo-Base-2407-ChatML-Mod) on the PocketDoc/Dans-Reasoningmaxx-NaturalReasoning, PocketDoc/Dans-Reasoningmaxx-WebInstruct, PocketDoc/Dans-Benchmaxx-COT, PocketDoc/Dans-Logicmaxx-SAT-AP, PocketDoc/Dans-Assistantmaxx-Opus-Merge, and PocketDoc/Dans-Assistantmaxx-sonnetorca-subset datasets.
It achieves the following results on the evaluation set:
- Loss: 0.6204

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1.5e-06
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- distributed_type: multi-GPU
- num_devices: 4
- total_train_batch_size: 16
- total_eval_batch_size: 16
- optimizer: AdamW (OptimizerNames.ADAMW_HF) with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 38
- num_epochs: 2.0

### Training results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 0.8049        | 0.0026 | 1    | 0.8338          |
| 0.8006        | 0.0644 | 25   | 0.7551          |
| 0.7197        | 0.1289 | 50   | 0.7009          |
| 0.7165        | 0.1933 | 75   | 0.6810          |
| 0.7183        | 0.2577 | 100  | 0.6697          |
| 0.6671        | 0.3222 | 125  | 0.6620          |
| 0.6406        | 0.3866 | 150  | 0.6567          |
| 0.6656        | 0.4510 | 175  | 0.6528          |
| 0.6539        | 0.5155 | 200  | 0.6489          |
| 0.634         | 0.5799 | 225  | 0.6460          |
| 0.6606        | 0.6443 | 250  | 0.6428          |
| 0.6815        | 0.7088 | 275  | 0.6401          |
| 0.6082        | 0.7732 | 300  | 0.6385          |
| 0.6754        | 0.8376 | 325  | 0.6364          |
| 0.6284        | 0.9021 | 350  | 0.6347          |
| 0.6517        | 0.9665 | 375  | 0.6326          |
| 0.5583        | 1.0309 | 400  | 0.6340          |
| 0.5716        | 1.0954 | 425  | 0.6328          |
| 0.5799        | 1.1598 | 450  | 0.6323          |
| 0.5957        | 1.2242 | 475  | 0.6316          |
| 0.589         | 1.2887 | 500  | 0.6300          |
| 0.6007        | 1.3531 | 525  | 0.6289          |
| 0.5751        | 1.4175 | 550  | 0.6284          |
| 0.5627        | 1.4820 | 575  | 0.6275          |
| 0.5689        | 1.5464 | 600  | 0.6267          |
| 0.5098        | 1.6108 | 625  | 0.6259          |
| 0.5623        | 1.6753 | 650  | 0.6248          |
| 0.5608        | 1.7397 | 675  | 0.6237          |
| 0.5552        | 1.8041 | 700  | 0.6233          |
| 0.554         | 1.8686 | 725  | 0.6228          |
| 0.5696        | 1.9330 | 750  | 0.6217          |
| 0.5537        | 1.9974 | 775  | 0.6204          |

### Framework versions

- Transformers 4.49.0
- Pytorch 2.4.1+cu124
- Datasets 3.2.0
- Tokenizers 0.21.1
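
## Inference example

A minimal inference sketch with 🤗 Transformers is shown below. It is an illustration rather than an official usage guide: it assumes the uploaded tokenizer ships a ChatML-style chat template (the base model is a ChatML-modified Mistral-Nemo base), and the generation settings are placeholders, not recommended values.

```python
# Minimal inference sketch; assumes a ChatML-style chat template in the tokenizer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Dans-DiscountModels/12b-mn-dans-reasoning-test-4"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the model was trained in bf16
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a careful reasoning assistant."},
    {"role": "user", "content": "A train travels 120 km in 1.5 hours. What is its average speed?"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
# Strip the prompt tokens and print only the generated continuation.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```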