---
license: apache-2.0
---

# Semantically-Aware Rewards for Open-Ended R1 Training in Free-Form Generation

[[📖 Paper](https://arxiv.org/abs/2506.15068)] [[GitHub](https://github.com/zli12321/long_form_rl)]

## About Open-Ended R1 Training

As open-ended long-form generation gains traction, reliably judging the quality of multi-sentence and paragraph-length outputs has become a major hurdle. Traditional overlap-based metrics such as ROUGE-L and BERTScore often miss nuances of coherence, style, and relevance, and can be skewed by pretraining biases. This leaves a critical gap in evaluation methods for guiding and training models that produce lengthy, free-form text.

## 🏅 🔥 Reward Model

- RewardBert is designed specifically for free-form GRPO training, where answers cannot be judged by simple correctness.
- We finetune [ModernBERT](https://huggingface.co/docs/transformers/en/model_doc/modernbert) on [MOCHA](https://arxiv.org/abs/2010.03636), [Prometheus Preference-Collection](https://huggingface.co/datasets/prometheus-eval/Preference-Collection), and [PEDANTS](https://arxiv.org/abs/2402.11161) to evaluate free-form text generations. We use RewardBert as the reward in GRPO finetuning; a minimal sketch of wiring it into a reward function follows the API reference below.

### Installation

```bash
# For more evaluation metrics, refer to https://github.com/zli12321/qa_metrics
pip install qa-metrics
```

#### Method: `compute_score`

**Parameters**

- `reference_answer` (str): The gold (correct) answer to the question.
- `candidate_answer` (str): The candidate answer to be evaluated.

**Returns**

- `tuple`: A tuple of the normalized score and the raw score.

```python
from qa_metrics.RewardBert import RewardBert

rb = RewardBert(device='cuda')
reference_answer = "The Frog Prince"
candidate_answer = "The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\""
rb.compute_score(reference_answer, candidate_answer)
# (0.29113227128982544, 2.1645290851593018)
```

#### Method: `compute_batch_scores`

**Parameters**

- `reference_answers` (list of str): A list of gold (correct) answers to the questions.
- `candidate_answer` (list of str): A list of candidate answers to be evaluated.
- `batch_size` (int): The batch size used for prediction (default 1).

**Returns**

- `tuple`: A tuple of two lists: the normalized scores and the raw scores.

```python
from qa_metrics.RewardBert import RewardBert

rb = RewardBert(device='cuda')
reference_answers = ["The Frog Prince"]
candidate_answers = ["The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\""]
rb.compute_batch_scores(reference_answers, candidate_answers, batch_size=1)
# ([0.29113227128982544], [2.1645290851593018])
```
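### Using RewardBert as a GRPO reward

RewardBert's scores can be plugged into a GRPO loop as the per-completion reward. Below is a minimal sketch, assuming a trainer (e.g., TRL's `GRPOTrainer`) that calls reward functions with the batch of completions plus dataset columns as keyword arguments and expects one float per completion. The function name, the `reference_answer` dataset column, and the use of the normalized (rather than raw) score are illustrative assumptions, not the released training code.

```python
from qa_metrics.RewardBert import RewardBert

# Load the reward model once and reuse it across rollouts.
rb = RewardBert(device='cuda')

def rewardbert_reward(completions, reference_answer, **kwargs):
    """Return one reward per generated completion.

    `completions`: generated answers for the current batch of rollouts.
    `reference_answer`: matching gold answers from the dataset (one per rollout).
    Both argument names are illustrative; adapt them to your trainer's convention.
    """
    normalized_scores, _raw_scores = rb.compute_batch_scores(
        reference_answer, completions, batch_size=len(completions)
    )
    # GRPO computes group-relative advantages from these rewards,
    # so the normalized score can be returned directly.
    return normalized_scores

# Standalone usage, e.g. two rollouts for the same prompt:
rewards = rewardbert_reward(
    completions=["The Frog Prince", "A fairy tale about a princess"],
    reference_answer=["The Frog Prince", "The Frog Prince"],
)
print(rewards)  # two floats; higher means closer to the reference
```

With TRL, such a function could be passed via `reward_funcs=[rewardbert_reward]` to `GRPOTrainer`, provided the training dataset exposes a matching `reference_answer` column; this wiring is an assumption about the setup, not the authors' exact pipeline.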
## Acknowledgements

We sincerely appreciate the contributions of the open-source community. The related projects are as follows:
[R1-V](https://github.com/Deep-Agent/R1-V), [DeepSeek-R1](https://github.com/deepseek-ai/DeepSeek-R1), [Video-R1](https://github.com/tulerfeng/Video-R1), [Qwen-2.5-VL](https://arxiv.org/abs/2502.13923)

## Citations

If you find our work helpful for your research, please consider citing our work.

```
@misc{li2025semanticallyawarerewardsopenendedr1,
      title={Semantically-Aware Rewards for Open-Ended R1 Training in Free-Form Generation},
      author={Zongxia Li and Yapei Chang and Yuhang Zhou and Xiyang Wu and Zichao Liang and Yoo Yeon Sung and Jordan Lee Boyd-Graber},
      year={2025},
      eprint={2506.15068},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.15068},
}
```

## VLMs that use RewardBert as an evaluator

```
@misc{li2025videohalluevaluatingmitigatingmultimodal,
      title={VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations for Synthetic Videos},
      author={Zongxia Li and Xiyang Wu and Yubin Qin and Guangyao Shi and Hongyang Du and Dinesh Manocha and Tianyi Zhou and Jordan Lee Boyd-Graber},
      year={2025},
      eprint={2505.01481},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.01481},
}
```