HINT-lab/llama3-8b-final-ppo-c-v0.3

	---
	library_name: transformers
	tags: []
	---

	# Model Card for llama3-8b-final-ppo-c-v0.3

	<!-- Provide a quick summary of what the model is/does. -->
	PPO-C (PPO with Calibrated Reward Calculation) is an RLHF algorithm that mitigates verbalized overconfidence in RLHF-trained large language models.
	PPO-C adjusts standard reward model scores during PPO training: it maintains a running average of past reward scores as a dynamic threshold for
	classifying responses, and adjusts each reward score based on the confidence the model verbalizes.
	Please refer to our preprint ([Taming Overconfidence in LLMs: Reward Calibration in RLHF](https://arxiv.org/abs/2410.09724)) and [repo](https://github.com/SeanLeng1/Reward-Calibration) for more details.
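
The reward adjustment described above can be sketched roughly as follows. This is a minimal illustration, not the training code from the repo: the class name `CalibratedReward`, the smoothing factor `alpha`, the scaling `weight`, and the exact form of the adjustment are assumptions chosen to match the description (a running-average threshold, with the reward shifted according to verbalized confidence). Confidence is assumed to be rescaled to the range [0, 1]. Refer to the paper for the actual formulation and hyperparameters.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CalibratedReward:
    """Illustrative PPO-C style reward adjustment (hypothetical names and defaults)."""

    alpha: float = 0.1       # assumed smoothing factor for the running average
    weight: float = 0.5      # assumed strength of the confidence adjustment
    threshold: float = 0.0   # running average of past reward scores
    initialized: bool = False

    def adjust(self, rewards: List[float], confidences: List[float]) -> List[float]:
        """Update the running threshold, then shift each reward by confidence.

        rewards:     raw reward model scores for one batch
        confidences: verbalized confidence per response, assumed scaled to [0, 1]
        """
        batch_mean = sum(rewards) / len(rewards)
        if not self.initialized:
            # The first batch seeds the threshold directly.
            self.threshold = batch_mean
            self.initialized = True
        else:
            # Exponential moving average of batch-mean rewards.
            self.threshold = self.alpha * batch_mean + (1 - self.alpha) * self.threshold

        adjusted = []
        for reward, confidence in zip(rewards, confidences):
            # Above-threshold responses gain from high confidence,
            # below-threshold responses are penalized for it.
            margin = reward - self.threshold
            adjusted.append(reward + self.weight * margin * (confidence - 0.5))
        return adjusted
```

With this form, a confidently stated response that scores above the threshold is rewarded more, while a confidently stated response that scores below it is rewarded less, so the policy is pushed to express high confidence only on responses the reward model rates well.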



	## Model Details


	### Model Description

	<!-- Provide a longer summary of what this model is. -->

	We train [OpenRLHF/Llama-3-8b-sft-mixture](https://huggingface.co/OpenRLHF/Llama-3-8b-sft-mixture) with PPO-C on our prompt collection, [HINT-lab/prompt-collections-final-v0.3](https://huggingface.co/datasets/HINT-lab/prompt-collections-final-v0.3),
	using a vanilla reward model, [OpenRLHF/Llama-3-8b-rm-mixture](https://huggingface.co/OpenRLHF/Llama-3-8b-rm-mixture).

	- Developed by: Jixuan Leng, Chengsong Huang, Banghua Zhu, Jiaxin Huang
	- Finetuned from model: [OpenRLHF/Llama-3-8b-sft-mixture](https://huggingface.co/OpenRLHF/Llama-3-8b-sft-mixture)
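
Since the card lists `transformers` as the library, loading the checkpoint could look roughly like the sketch below. The prompt wording, the `generate` and `parse_confidence` helpers, and the assumption that the model states its confidence as a number after the word "Confidence" are illustrative and not taken from the repo. `device_map="auto"` additionally assumes the `accelerate` package is installed.

```python
import re
from typing import Optional

MODEL_ID = "HINT-lab/llama3-8b-final-ppo-c-v0.3"


def generate(prompt: str, max_new_tokens: int = 256) -> str:
    """Load the checkpoint and greedily decode a completion for `prompt`."""
    # Imported lazily so the parsing helper below works without torch installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,
        device_map="auto",  # requires `accelerate`
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Keep only the newly generated tokens, not the echoed prompt.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


def parse_confidence(response: str) -> Optional[float]:
    """Extract a verbalized confidence score such as 'Confidence: 8' (assumed format)."""
    match = re.search(r"confidence\s*[:=]?\s*(\d+(?:\.\d+)?)", response, re.IGNORECASE)
    return float(match.group(1)) if match else None


if __name__ == "__main__":
    # Hypothetical prompt; the prompt format used in training may differ.
    question = (
        "What is the capital of France? "
        "Give your answer and then your confidence on a scale from 0 to 10."
    )
    answer = generate(question)
    print(answer)
    print("Parsed confidence:", parse_confidence(answer))
```

The parsing helper is kept separate from generation so that confidence extraction can be checked on its own, without downloading the 8B checkpoint.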

	### Model Sources

	<!-- Provide the basic links for the model. -->

	- Repository: [Our repo](https://github.com/SeanLeng1/Reward-Calibration)
	- Paper: [Taming Overconfidence in LLMs: Reward Calibration in RLHF](https://arxiv.org/abs/2410.09724)
	<!-- - Demo [optional]: [More Information Needed] -->