- Bootstrapping Language Models with DPO Implicit Rewards
  Paper • 2406.09760 • Published • 41
- sail/Llama-3-Base-8B-DICE-Iter1
  Text Generation • 8B • Updated • 3 • 2
- sail/Llama-3-Base-8B-DICE-Iter2
  Text Generation • 8B • Updated • 4 • 3
- sail/Zephyr-7B-DICE-Iter1
  Text Generation • 7B • Updated • 4
Collections
Collections including paper arxiv:2406.09760
- Bootstrapping Language Models with DPO Implicit Rewards
  Paper • 2406.09760 • Published • 41
- DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence
  Paper • 2406.11931 • Published • 66
- Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs
  Paper • 2406.14544 • Published • 36
- Instruction Pre-Training: Language Models are Supervised Multitask Learners
  Paper • 2406.14491 • Published • 95
- WPO: Enhancing RLHF with Weighted Preference Optimization
  Paper • 2406.11827 • Published • 15
- Self-Improving Robust Preference Optimization
  Paper • 2406.01660 • Published • 20
- Bootstrapping Language Models with DPO Implicit Rewards
  Paper • 2406.09760 • Published • 41
- BPO: Supercharging Online Preference Learning by Adhering to the Proximity of Behavior LLM
  Paper • 2406.12168 • Published • 7
- Suppressing Pink Elephants with Direct Principle Feedback
  Paper • 2402.07896 • Published • 11
- Policy Improvement using Language Feedback Models
  Paper • 2402.07876 • Published • 9
- Direct Language Model Alignment from Online AI Feedback
  Paper • 2402.04792 • Published • 34
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
  Paper • 2401.01335 • Published • 68
- Instruction Pre-Training: Language Models are Supervised Multitask Learners
  Paper • 2406.14491 • Published • 95
- Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
  Paper • 2405.21060 • Published • 68
- Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models
  Paper • 2405.20541 • Published • 24
- MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
  Paper • 2406.01574 • Published • 51
- Bootstrapping Language Models with DPO Implicit Rewards
  Paper • 2406.09760 • Published • 41
- BPO: Supercharging Online Preference Learning by Adhering to the Proximity of Behavior LLM
  Paper • 2406.12168 • Published • 7
- WPO: Enhancing RLHF with Weighted Preference Optimization
  Paper • 2406.11827 • Published • 15
- Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs
  Paper • 2406.18629 • Published • 43
- Understanding the performance gap between online and offline alignment algorithms
  Paper • 2405.08448 • Published • 20
- Self-Exploring Language Models: Active Preference Elicitation for Online Alignment
  Paper • 2405.19332 • Published • 23
- Offline Regularised Reinforcement Learning for Large Language Models Alignment
  Paper • 2405.19107 • Published • 15
- Show, Don't Tell: Aligning Language Models with Demonstrated Feedback
  Paper • 2406.00888 • Published • 34
- DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search
  Paper • 2408.08152 • Published • 60
- ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
  Paper • 2402.15220 • Published • 23
- Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
  Paper • 2402.19427 • Published • 57
- Simple linear attention language models balance the recall-throughput tradeoff
  Paper • 2402.18668 • Published • 21
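The paper's title refers to the implicit reward that a DPO-trained policy defines relative to its reference model, r(x, y) = β(log πθ(y|x) − log πref(y|x)); DICE reuses this quantity to label the model's own generations as preference pairs for further DPO rounds. Below is a minimal, unofficial sketch (not the authors' code) of how such an implicit reward could be computed, using one of the DICE checkpoints listed above as the policy. The reference model ID, β value, and helper names are assumptions made for illustration.

```python
# Unofficial sketch: DPO implicit reward r(x, y) = beta * (log pi_theta(y|x) - log pi_ref(y|x)).
# The policy checkpoint is one of the DICE models listed on this page; the reference
# model and beta below are assumptions, not values taken from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

POLICY_ID = "sail/Llama-3-Base-8B-DICE-Iter1"   # listed on this page
REF_ID = "meta-llama/Meta-Llama-3-8B"           # assumed reference model
BETA = 0.1                                      # assumed DPO temperature

tokenizer = AutoTokenizer.from_pretrained(POLICY_ID)
policy = AutoModelForCausalLM.from_pretrained(POLICY_ID, torch_dtype=torch.bfloat16)
reference = AutoModelForCausalLM.from_pretrained(REF_ID, torch_dtype=torch.bfloat16)


def response_logprob(model, prompt: str, response: str) -> torch.Tensor:
    """Sum of log-probabilities of the response tokens given the prompt.

    Assumes the tokenization of `prompt` is a prefix of the tokenization of
    `prompt + response`, which is a common approximation.
    """
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Position t predicts token t + 1, so shift logits and targets by one.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_logprobs = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the response tokens (positions >= prompt_len in full_ids).
    return token_logprobs[:, prompt_len - 1:].sum()


def implicit_reward(prompt: str, response: str) -> float:
    """DPO implicit reward of `response` for `prompt`."""
    delta = response_logprob(policy, prompt, response) - response_logprob(reference, prompt, response)
    return (BETA * delta).item()
```

In DICE, rewards of this form are used to rank the policy's own sampled responses into chosen/rejected pairs, which then drive another round of DPO training (hence the iterated checkpoints Iter1 and Iter2 above).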