drula eric
esselte974
39 followers · 311 following
AI & ML interests
None yet
Recent Activity
reacted to Kseniase's post with ❤️
26 days ago
11 Alignment and Optimization Algorithms for LLMs
When we need to align models' behavior with the desired objectives, we rely on specialized algorithms that support helpfulness, accuracy, reasoning, safety, and alignment with user preferences. Much of a model’s usefulness comes from post-training optimization methods.
Here are the main optimization algorithms (both classic and new) in one place; a minimal code sketch for each follows the list:
1. PPO (Proximal Policy Optimization) -> https://huggingface.co/papers/1707.06347
Clips the probability ratio to prevent the new policy from diverging too far from the old one, which helps keep training stable.
2. DPO (Direct Preference Optimization) -> https://huggingface.co/papers/2305.18290
A non-RL method in which the LM acts as an implicit reward model. It uses a simple loss to boost the preferred answer's probability over the less preferred one.
3. GRPO (Group Relative Policy Optimization) -> https://huggingface.co/papers/2402.03300
An RL method that compares a group of model outputs for the same input and updates the policy based on their relative rankings, so it doesn't need a separate critic model.
Its latest application is Flow-GRPO, which adds online RL to flow matching models -> https://huggingface.co/papers/2505.05470
4. DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization) -> https://huggingface.co/papers/2503.14476
Decouples the clipping bounds for flexibility and introduces four key techniques: clip-higher (to maintain exploration), dynamic sampling (to ensure useful gradient updates), token-level loss (to balance learning across long outputs), and overlong reward shaping (to handle long, truncated answers).
5. Supervised Fine-Tuning (SFT) -> https://huggingface.co/papers/2203.02155
Often the first post-pretraining step: the model is fine-tuned on a dataset of high-quality human-written input-output pairs to directly teach desired behaviors.
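To make the clipping in PPO (item 1) concrete, here is a minimal PyTorch sketch of the clipped policy loss; the function name, tensor shapes, and the 0.2 clip value are illustrative assumptions, not any particular library's API:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped PPO policy loss over a batch of sampled tokens/actions."""
    ratio = torch.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Pessimistic (elementwise minimum) objective, negated for gradient descent
    return -torch.min(unclipped, clipped).mean()
```

Taking the minimum of the clipped and unclipped terms means the update gains nothing from pushing the probability ratio outside the clip range, which is what keeps the new policy close to the old one.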
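For DPO (item 2), a minimal sketch of the preference loss, assuming per-sequence log-probabilities have already been computed for both the policy and a frozen reference model (function and argument names are hypothetical):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss: raise the preferred answer's likelihood relative to a frozen reference."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    # -log(sigmoid(beta * (policy_margin - ref_margin)))
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```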
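For GRPO (item 3), the core idea is the group-relative advantage: rewards for several completions of the same prompt are standardized against each other, replacing a learned critic. A minimal sketch, with names and the epsilon constant as illustrative choices:

```python
import torch

def grpo_advantages(group_rewards):
    """Standardize rewards within a group of completions sampled for one prompt,
    so no separate critic/value model is needed."""
    mean = group_rewards.mean()
    std = group_rewards.std()
    return (group_rewards - mean) / (std + 1e-8)
```

These advantages then feed a PPO-style clipped update, typically with a KL penalty toward a reference model.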
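For DAPO (item 4), the clip-higher technique decouples the two clipping bounds of the PPO-style loss. The sketch below assumes the same inputs as the PPO example above; the bound values are illustrative defaults, not prescribed settings:

```python
import torch

def dapo_clip_loss(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.28):
    """PPO-style loss with decoupled ('clip-higher') bounds: a looser upper bound
    lets low-probability tokens grow (preserving exploration) while the lower
    bound stays tight."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high) * advantages
    return -torch.min(unclipped, clipped).mean()
```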
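For SFT (item 5), training is plain next-token cross-entropy on curated demonstrations. A minimal sketch assuming transformer-style logits and label tensors, with prompt tokens masked via an ignore index (names are illustrative):

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, labels, ignore_index=-100):
    """Next-token cross-entropy on human-written input-output pairs.
    Prompt tokens can be excluded by setting their labels to ignore_index."""
    # Shift so each position predicts the following token
    logits = logits[:, :-1, :].contiguous()
    labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        ignore_index=ignore_index,
    )
```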
More in the comments 👇
If you liked it, also subscribe to the Turing Post: https://www.turingpost.com/subscribe
Organizations
None yet