Models
Datasets
Spaces
Docs
Enterprise
Pricing
Log In
Sign Up

Collections

Discover the best community collections!

Collections including paper arxiv:2308.06259

Synthetic Data Generation

Textbooks Are All You Need

Paper • 2306.11644 • Published Jun 20, 2023 • 146
Textbooks Are All You Need II: phi-1.5 technical report

Paper • 2309.05463 • Published Sep 11, 2023 • 88
TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

Paper • 2305.07759 • Published May 12, 2023 • 36
Scaling Synthetic Data Creation with 1,000,000,000 Personas

Paper • 2406.20094 • Published Jun 28, 2024 • 105

A collection of arXiv papers from Chip Huyen's AI Engineering organized by chapter and ordered by when each appears in the book.

Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning

Paper • 2211.04325 • Published Oct 26, 2022 • 1
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Paper • 1810.04805 • Published Oct 11, 2018 • 20
On the Opportunities and Risks of Foundation Models

Paper • 2108.07258 • Published Aug 16, 2021 • 1
Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks

Paper • 2204.07705 • Published Apr 16, 2022 • 2

Self-Alignment with Instruction Backtranslation

Paper • 2308.06259 • Published Aug 11, 2023 • 42
LLaMA-Omni: Seamless Speech Interaction with Large Language Models

Paper • 2409.06666 • Published Sep 10, 2024 • 59

Text Instruction Papers

Self-Instruct: Aligning Language Model with Self Generated Instructions

Paper • 2212.10560 • Published Dec 20, 2022 • 9
Principled Instructions Are All You Need for Questioning LLaMA-1/2, GPT-3.5/4

Paper • 2312.16171 • Published Dec 26, 2023 • 37
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

Paper • 2401.14196 • Published Jan 25, 2024 • 66
AlpaCare:Instruction-tuned Large Language Models for Medical Application

Paper • 2310.14558 • Published Oct 23, 2023 • 4

Dataset pruning/cleaning/dedup

AlpaGasus: Training A Better Alpaca with Fewer Data

Paper • 2307.08701 • Published Jul 17, 2023 • 23
The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

Paper • 2303.03915 • Published Mar 7, 2023 • 7
MADLAD-400: A Multilingual And Document-Level Large Audited Dataset

Paper • 2309.04662 • Published Sep 9, 2023 • 24
SlimPajama-DC: Understanding Data Combinations for LLM Training

Paper • 2309.10818 • Published Sep 19, 2023 • 11

Preference Optimization

Statistical Rejection Sampling Improves Preference Optimization

Paper • 2309.06657 • Published Sep 13, 2023 • 14
A General Theoretical Paradigm to Understand Learning from Human Preferences

Paper • 2310.12036 • Published Oct 18, 2023 • 16
Self-Alignment with Instruction Backtranslation

Paper • 2308.06259 • Published Aug 11, 2023 • 42

Synthetic (text) Dataset Generation

Papers about synthetic dataset generation

Better Synthetic Data by Retrieving and Transforming Existing Datasets

Paper • 2404.14361 • Published Apr 22, 2024 • 2
Generative AI for Synthetic Data Generation: Methods, Challenges and the Future

Paper • 2403.04190 • Published Mar 7, 2024 • 1
Best Practices and Lessons Learned on Synthetic Data for Language Models

Paper • 2404.07503 • Published Apr 11, 2024 • 32
A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models

Paper • 2404.14445 • Published Apr 20, 2024

Large Language Model Alignment: A Survey

Paper • 2309.15025 • Published Sep 26, 2023 • 2
Aligning Large Language Models with Human: A Survey

Paper • 2307.12966 • Published Jul 24, 2023 • 1
Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Paper • 2305.18290 • Published May 29, 2023 • 63
SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to RLHF

Paper • 2310.05344 • Published Oct 9, 2023 • 1

Dataset curation

From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning

Paper • 2308.12032 • Published Aug 23, 2023 • 1
Know thy corpus! Robust methods for digital curation of Web corpora

Paper • 2003.06389 • Published Mar 13, 2020 • 1
Self-Alignment with Instruction Backtranslation

Paper • 2308.06259 • Published Aug 11, 2023 • 42
The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation

Paper • 2305.06156 • Published May 9, 2023 • 2

Ensemble-Instruct: Generating Instruction-Tuning Data with a Heterogeneous Mixture of LMs

Paper • 2310.13961 • Published Oct 21, 2023 • 5
Fabricator: An Open Source Toolkit for Generating Labeled Training Data with Teacher LLMs

Paper • 2309.09582 • Published Sep 18, 2023 • 4
Auto-Instruct: Automatic Instruction Generation and Ranking for Black-Box Language Models

Paper • 2310.13127 • Published Oct 19, 2023 • 12
Evaluating the Robustness to Instructions of Large Language Models

Paper • 2308.14306 • Published Aug 28, 2023 • 1

Synthetic Data Generation

Textbooks Are All You Need

Paper • 2306.11644 • Published Jun 20, 2023 • 146
Textbooks Are All You Need II: phi-1.5 technical report

Paper • 2309.05463 • Published Sep 11, 2023 • 88
TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

Paper • 2305.07759 • Published May 12, 2023 • 36
Scaling Synthetic Data Creation with 1,000,000,000 Personas

Paper • 2406.20094 • Published Jun 28, 2024 • 105

Preference Optimization

Statistical Rejection Sampling Improves Preference Optimization

Paper • 2309.06657 • Published Sep 13, 2023 • 14
A General Theoretical Paradigm to Understand Learning from Human Preferences

Paper • 2310.12036 • Published Oct 18, 2023 • 16
Self-Alignment with Instruction Backtranslation

Paper • 2308.06259 • Published Aug 11, 2023 • 42

A collection of arXiv papers from Chip Huyen's AI Engineering organized by chapter and ordered by when each appears in the book.

Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning

Paper • 2211.04325 • Published Oct 26, 2022 • 1
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Paper • 1810.04805 • Published Oct 11, 2018 • 20
On the Opportunities and Risks of Foundation Models

Paper • 2108.07258 • Published Aug 16, 2021 • 1
Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks

Paper • 2204.07705 • Published Apr 16, 2022 • 2

Synthetic (text) Dataset Generation

Papers about synthetic dataset generation

Better Synthetic Data by Retrieving and Transforming Existing Datasets

Paper • 2404.14361 • Published Apr 22, 2024 • 2
Generative AI for Synthetic Data Generation: Methods, Challenges and the Future

Paper • 2403.04190 • Published Mar 7, 2024 • 1
Best Practices and Lessons Learned on Synthetic Data for Language Models

Paper • 2404.07503 • Published Apr 11, 2024 • 32
A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models

Paper • 2404.14445 • Published Apr 20, 2024

Self-Alignment with Instruction Backtranslation

Paper • 2308.06259 • Published Aug 11, 2023 • 42
LLaMA-Omni: Seamless Speech Interaction with Large Language Models

Paper • 2409.06666 • Published Sep 10, 2024 • 59

Large Language Model Alignment: A Survey

Paper • 2309.15025 • Published Sep 26, 2023 • 2
Aligning Large Language Models with Human: A Survey

Paper • 2307.12966 • Published Jul 24, 2023 • 1
Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Paper • 2305.18290 • Published May 29, 2023 • 63
SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to RLHF

Paper • 2310.05344 • Published Oct 9, 2023 • 1

Text Instruction Papers

Self-Instruct: Aligning Language Model with Self Generated Instructions

Paper • 2212.10560 • Published Dec 20, 2022 • 9
Principled Instructions Are All You Need for Questioning LLaMA-1/2, GPT-3.5/4

Paper • 2312.16171 • Published Dec 26, 2023 • 37
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

Paper • 2401.14196 • Published Jan 25, 2024 • 66
AlpaCare:Instruction-tuned Large Language Models for Medical Application

Paper • 2310.14558 • Published Oct 23, 2023 • 4

Dataset curation

From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning

Paper • 2308.12032 • Published Aug 23, 2023 • 1
Know thy corpus! Robust methods for digital curation of Web corpora

Paper • 2003.06389 • Published Mar 13, 2020 • 1
Self-Alignment with Instruction Backtranslation

Paper • 2308.06259 • Published Aug 11, 2023 • 42
The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation

Paper • 2305.06156 • Published May 9, 2023 • 2

Dataset pruning/cleaning/dedup

AlpaGasus: Training A Better Alpaca with Fewer Data

Paper • 2307.08701 • Published Jul 17, 2023 • 23
The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

Paper • 2303.03915 • Published Mar 7, 2023 • 7
MADLAD-400: A Multilingual And Document-Level Large Audited Dataset

Paper • 2309.04662 • Published Sep 9, 2023 • 24
SlimPajama-DC: Understanding Data Combinations for LLM Training

Paper • 2309.10818 • Published Sep 19, 2023 • 11

Ensemble-Instruct: Generating Instruction-Tuning Data with a Heterogeneous Mixture of LMs

Paper • 2310.13961 • Published Oct 21, 2023 • 5
Fabricator: An Open Source Toolkit for Generating Labeled Training Data with Teacher LLMs

Paper • 2309.09582 • Published Sep 18, 2023 • 4
Auto-Instruct: Automatic Instruction Generation and Ranking for Black-Box Language Models

Paper • 2310.13127 • Published Oct 19, 2023 • 12
Evaluating the Robustness to Instructions of Large Language Models

Paper • 2308.14306 • Published Aug 28, 2023 • 1

Previous
1
2
Next

Company

TOS Privacy About Jobs

Website

Models Datasets Spaces Pricing Docs