gijs/audsemthinker-qa-grpo

qwen2_5_omni_thinker


gijs committed on May 22

Commit 01a7248 (verified) · 1 parent: 19d9241

Update README.md

Files changed (1)

README.md +2 -0

README.md CHANGED

@@ -19,6 +19,8 @@ datasets:
 # AudSemThinker-QA-GRPO
+Corresponding paper: https://arxiv.org/abs/2505.14142
 ## Model Description
 `AudSemThinker-QA-GRPO` is an advanced variant of `AudSemThinker`, fine-tuned using Group Relative Policy Optimization (GRPO) with Verifiable Rewards (RLVR). This approach enhances reasoning capabilities and allows for controlled thinking budget during generation. It leverages the structured reasoning framework of `AudSemThinker` (thinking, semantic elements, answer phases) but is specifically optimized for multiple-choice audio question answering. This model is designed to produce accurate answers while maintaining a controlled reasoning length in its `<think>` section.
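The model description combines three ideas that a short sketch can make concrete. The first is a verifiable reward, here a rule-based check of the chosen option against the gold label. The second is a thinking-budget term on the `<think>` section. The third is GRPO's group-relative advantage, where each sampled completion's reward is standardized against the other completions drawn for the same prompt. The Python below is an illustrative sketch only, not the authors' training code. The `<answer>` tag, the word-count budget, the tolerance, and the equal weighting of the two rewards are all assumptions; the model card confirms only the `<think>` section.

```python
import re
import statistics

# Assumption: only the <think> section is confirmed by the model card;
# the <answer> tag name is hypothetical.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def accuracy_reward(completion: str, correct_choice: str) -> float:
    """Verifiable reward: 1.0 if the extracted option matches the gold option, else 0.0."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return 0.0
    predicted = match.group(1).strip().upper()
    return 1.0 if predicted == correct_choice.strip().upper() else 0.0


def length_reward(completion: str, budget_words: int, tolerance: int = 10) -> float:
    """Hypothetical budget term: 1.0 if the <think> word count is within tolerance of the budget."""
    match = THINK_RE.search(completion)
    if match is None:
        return 0.0
    n_words = len(match.group(1).split())
    return 1.0 if abs(n_words - budget_words) <= tolerance else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: standardize each reward against its own sampled group.

    Uses the population standard deviation; some implementations use the sample std.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Two sampled completions for one multiple-choice question whose gold option is "B".
    group = [
        "<think>the dog barks twice then a car passes</think><answer>B</answer>",
        "<think>sounds like rain</think><answer>C</answer>",
    ]
    rewards = [
        accuracy_reward(c, "B") + length_reward(c, budget_words=8, tolerance=2)
        for c in group
    ]
    advantages = group_relative_advantages(rewards)
    print(rewards)  # [2.0, 0.0]
    print([round(a, 3) for a in advantages])  # [1.0, -1.0]
```

Standardizing within the group is what lets GRPO drop a learned value function: a completion is reinforced only to the extent that it scores better than its siblings for the same prompt, so a correct answer with a reasoning length near the budget gets the largest push.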