- Bootstrapping Language Models with DPO Implicit Rewards
  Paper • 2406.09760 • Published • 41
- sail/Llama-3-Base-8B-DICE-Iter1
  Text Generation • 8B • Updated • 3 • 2
- sail/Llama-3-Base-8B-DICE-Iter2
  Text Generation • 8B • Updated • 4 • 3
- sail/Zephyr-7B-DICE-Iter1
  Text Generation • 7B • Updated • 4
Collections
Collections including paper arxiv:2406.09760
- Bootstrapping Language Models with DPO Implicit Rewards
  Paper • 2406.09760 • Published • 41
- DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence
  Paper • 2406.11931 • Published • 66
- Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs
  Paper • 2406.14544 • Published • 36
- Instruction Pre-Training: Language Models are Supervised Multitask Learners
  Paper • 2406.14491 • Published • 95
- WPO: Enhancing RLHF with Weighted Preference Optimization
  Paper • 2406.11827 • Published • 15
- Self-Improving Robust Preference Optimization
  Paper • 2406.01660 • Published • 20
- Bootstrapping Language Models with DPO Implicit Rewards
  Paper • 2406.09760 • Published • 41
- BPO: Supercharging Online Preference Learning by Adhering to the Proximity of Behavior LLM
  Paper • 2406.12168 • Published • 7
- Suppressing Pink Elephants with Direct Principle Feedback
  Paper • 2402.07896 • Published • 11
- Policy Improvement using Language Feedback Models
  Paper • 2402.07876 • Published • 9
- Direct Language Model Alignment from Online AI Feedback
  Paper • 2402.04792 • Published • 34
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
  Paper • 2401.01335 • Published • 68
- Instruction Pre-Training: Language Models are Supervised Multitask Learners
  Paper • 2406.14491 • Published • 95
- Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
  Paper • 2405.21060 • Published • 68
- Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models
  Paper • 2405.20541 • Published • 24
- MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
  Paper • 2406.01574 • Published • 51
- Bootstrapping Language Models with DPO Implicit Rewards
  Paper • 2406.09760 • Published • 41
- BPO: Supercharging Online Preference Learning by Adhering to the Proximity of Behavior LLM
  Paper • 2406.12168 • Published • 7
- WPO: Enhancing RLHF with Weighted Preference Optimization
  Paper • 2406.11827 • Published • 15
- Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs
  Paper • 2406.18629 • Published • 43
- Understanding the performance gap between online and offline alignment algorithms
  Paper • 2405.08448 • Published • 20
- Self-Exploring Language Models: Active Preference Elicitation for Online Alignment
  Paper • 2405.19332 • Published • 23
- Offline Regularised Reinforcement Learning for Large Language Models Alignment
  Paper • 2405.19107 • Published • 15
- Show, Don't Tell: Aligning Language Models with Demonstrated Feedback
  Paper • 2406.00888 • Published • 34
- DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search
  Paper • 2408.08152 • Published • 60
- ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
  Paper • 2402.15220 • Published • 23
- Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
  Paper • 2402.19427 • Published • 57
- Simple linear attention language models balance the recall-throughput tradeoff
  Paper • 2402.18668 • Published • 21
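The paper's title refers to the implicit reward that a DPO-trained policy defines relative to its reference model, r(x, y) = β(log πθ(y|x) − log πref(y|x)); DICE reuses this quantity to label the model's own generations as preference pairs for further DPO rounds. Below is a minimal, unofficial sketch (not the authors' code) of how such an implicit reward could be computed, using one of the DICE checkpoints listed above as the policy. The reference model ID, β value, and helper names are assumptions made for illustration.

```python
# Unofficial sketch: DPO implicit reward r(x, y) = beta * (log pi_theta(y|x) - log pi_ref(y|x)).
# The policy checkpoint is one of the DICE models listed on this page; the reference
# model and beta below are assumptions, not values taken from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

POLICY_ID = "sail/Llama-3-Base-8B-DICE-Iter1"   # listed on this page
REF_ID = "meta-llama/Meta-Llama-3-8B"           # assumed reference model
BETA = 0.1                                      # assumed DPO temperature

tokenizer = AutoTokenizer.from_pretrained(POLICY_ID)
policy = AutoModelForCausalLM.from_pretrained(POLICY_ID, torch_dtype=torch.bfloat16)
reference = AutoModelForCausalLM.from_pretrained(REF_ID, torch_dtype=torch.bfloat16)


def response_logprob(model, prompt: str, response: str) -> torch.Tensor:
    """Sum of log-probabilities of the response tokens given the prompt.

    Assumes the tokenization of `prompt` is a prefix of the tokenization of
    `prompt + response`, which is a common approximation.
    """
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Position t predicts token t + 1, so shift logits and targets by one.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_logprobs = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the response tokens (positions >= prompt_len in full_ids).
    return token_logprobs[:, prompt_len - 1:].sum()


def implicit_reward(prompt: str, response: str) -> float:
    """DPO implicit reward of `response` for `prompt`."""
    delta = response_logprob(policy, prompt, response) - response_logprob(reference, prompt, response)
    return (BETA * delta).item()
```

In DICE, rewards of this form are used to rank the policy's own sampled responses into chosen/rejected pairs, which then drive another round of DPO training (hence the iterated checkpoints Iter1 and Iter2 above).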