
Beagle

Update submodules:

git submodule update --init --recursive --progress

Generate dataset (use --dataset_generation.output_dir to specify an alternative output directory):

rm -rf output/datasets/ds_Llama-2-7b-chat-hf*
CUDA_VISIBLE_DEVICES=0 python -m beagle.data_gen gen_dataset --@llama2_7b_chat --ds_range 0,17500
CUDA_VISIBLE_DEVICES=1 python -m beagle.data_gen gen_dataset --@llama2_7b_chat --ds_range 17500,35000
CUDA_VISIBLE_DEVICES=2 python -m beagle.data_gen gen_dataset --@llama2_7b_chat --ds_range 35000,52500
CUDA_VISIBLE_DEVICES=3 python -m beagle.data_gen gen_dataset --@llama2_7b_chat --ds_range 52500,69999
python -m beagle.data_gen merge_datasets \
    output/datasets/ds_Llama-2-7b-chat-hf \
    output/datasets/ds_Llama-2-7b-chat-hf__range*
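
For reference, merging the range shards is conceptually a concatenation. The sketch below assumes each shard is a Hugging Face dataset saved with save_to_disk (an assumption for illustration only; it is not the actual merge_datasets implementation, and the output path here is made up):

# illustration only: concatenate range shards into one on-disk dataset
from glob import glob
from datasets import load_from_disk, concatenate_datasets

shard_dirs = sorted(glob("output/datasets/ds_Llama-2-7b-chat-hf__range*"))
merged = concatenate_datasets([load_from_disk(d) for d in shard_dirs])
merged.save_to_disk("output/datasets/ds_Llama-2-7b-chat-hf_merged")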

Since the generated dataset contains a lot of hidden states, it consumes ~680 GiB of disk space. I usually keep the data on external hard disks. For example:

sudo mount truenas:/mnt/tk-truenas-pool/pool-root/master-tree/sync/beagle_train_data ./mnt/
CUDA_VISIBLE_DEVICES=1 python -m beagle.data_gen gen_dataset --@exaone3.5_2.4B_instr --dataset_generation.output_dir ./mnt/

Generate self-distillation data (or an evaluation set to use during training):

# generate share-gpt inference outputs
CUDA_VISIBLE_DEVICES=4,5 python eval.py beagle --dataset share-gpt --@llama2_7b_chat --inference.mode baseline

# convert inference outputs to sharegpt format
python logs2dataset.py logs/baseline_experiments__llama2_7b_chat_mt-bench/
python -m beagle.data_gen push_dataset output/ds_baseline_experiments__llama2_7b_chat_mt-bench/ w32zhong/llama2_7b_chat_mt-bench

# generate local dataset with pre-computed hidden states
CUDA_VISIBLE_DEVICES=4,5 python -m beagle.data_gen gen_dataset --@llama2_7b_chat --dataset_generation.output_dir ./output --dataset_generation.sharegpt_path w32zhong/llama2_7b_chat_mt-bench --dataset_generation.ds_prefix self_ds_
python -m beagle.data_gen merge_datasets ./output/datasets/self_ds_Llama-2-7b-chat-hf ./output/datasets/self_ds_Llama-2-7b-chat-hf__range*

For ShareGPT self-distillation, we can use multiple processes and a speculative-decoding checkpoint to accelerate the process:

CUDA_VISIBLE_DEVICES=0 python eval.py beagle --dataset share-gpt --@llama2_7b_chat --modeling.ckpt_path w32zhong/charmed-sunset-finished --modeling.load_config_from_model_path --sample_range 0,17500 --run_name part0
CUDA_VISIBLE_DEVICES=1 python eval.py beagle --dataset share-gpt --@llama2_7b_chat --modeling.ckpt_path w32zhong/charmed-sunset-finished --modeling.load_config_from_model_path --sample_range 17500,35000 --run_name part1
CUDA_VISIBLE_DEVICES=2 python eval.py beagle --dataset share-gpt --@llama2_7b_chat --modeling.ckpt_path w32zhong/charmed-sunset-finished --modeling.load_config_from_model_path --sample_range 35000,52500 --run_name part2
CUDA_VISIBLE_DEVICES=3 python eval.py beagle --dataset share-gpt --@llama2_7b_chat --modeling.ckpt_path w32zhong/charmed-sunset-finished --modeling.load_config_from_model_path --sample_range 52500,69999 --run_name part3

Training:

wandb login
# multi-GPU training:
CUDA_VISIBLE_DEVICES=0,1,2,3 \
    torchrun --standalone --nnodes=1 --nproc-per-node=4 -m beagle.train \
    --@llama2_7b_chat --@rtx4070tis_bs4_ctx4096 --modeling.use_lower_layers 0 \
    --dataset.path /mnt/hg_cache/temp_llama_dataset/datasets/ds_Llama-2-7b-chat-hf \
    --modeling.save_loading --training.report_to wandb

# customized training:
CUDA_VISIBLE_DEVICES=0,1 \
    torchrun --standalone --nnodes=1 --nproc-per-node=2 -m beagle.train \
    --@exaone3.5_2.4B_instr --@rtx4070tis_dev_bs8 \
    --dataset.path /mnt/truenas_sync/beagle_train_data/datasets/ds_EXAONE-3.5-2.4B-Instruct \
    --modeling.use_lower_layers 5 --training.filter_out_shorts \
    --modeling.attention_wind \"\'1\'\" --modeling.use_state_distill --modeling.save_loading \
    --training.max_tti_wind 2 --training.learning_rate 5e-5

# single GPU debugging:
python -m beagle.train --@exaone3.5_2.4B_instr --@rtx4070tis_dev_bs8 \
    --training.report_to wandb --training.project debug \
    --dataset.path ../mnt/beagle_train_data/datasets/ds_EXAONE-3.5-2.4B-Instruct

# multi-GPU debugging
CUDA_VISIBLE_DEVICES=4,5 python -m beagle.train --@llama2_7b_chat --dataset.path example --modeling.save_loading --training.max_tti_wind 4 --modeling.load_config_from_model_path --modeling.ckpt_path w32zhong/giddy-aardvark-finished --training.debug --training.force_model_parallel --training.eval_steps 1 #--training.number_sampled_print debug

# resume a checkpoint:
python -m beagle.train --@llama2_7b_chat --@a10g_dev_bs16 \
    --training.resume_from_checkpoint \
    --training.run_name happy-yogurt-211 \
    --training.report_to wandb \
    --training.resume_wandb_runid \"\'73625e37\'\"

# mock EAGLE training using our data
python -m beagle.train --@llama2_7b_chat --@rtx4070tis_bs8_ctx2048 \
    --modeling.use_fc_eagle --modeling.use_state_distill \
    --modeling.use_lower_layers 0 --modeling.strictly_follow_eagle_decoder \
    --dataset.path /mnt/hg_cache/temp_llama_dataset/datasets/ds_Llama-2-7b-chat-hf

# mock EAGLE training using their data
CUDA_VISIBLE_DEVICES=4,5 torchrun --standalone --nnodes=1 --nproc-per-node=2 \
    -m beagle.train --@llama2_7b_chat --@rtx4070tis_bs4_ctx4096 \
    --dataset.read_eagle_format --dataset.path /mnt/wd_ssd/sharegpt_0_67999_mufp16 \
    --modeling.use_fc_eagle --modeling.strictly_follow_eagle_decoder \
    --modeling.use_lower_layers 0 --modeling.save_loading \
    --training.use_eagle_pipeline --modeling.use_state_distill \
    --training.model_init_ckpt /home/tk/Desktop/EAGLE-v1/eagle/train/ckpt/model_init/pytorch_model.bin

# avoid generating data and use more VRAM
python -m beagle.train --@exaone3.5_2.4B_instr --@a6000ada_dev_bs16_save_vram \
    --training.save_vram False --training.eval_strategy no --training.filter_out_shorts

# continuous training
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --standalone --nnodes=1 --nproc-per-node=4 -m beagle.train \
    --@llama2_7b_chat --@rtx4070tis_dev_bs4 \
    --dataset.path /mnt/hg_cache/temp_llama_dataset/datasets/ds_Llama-2-7b-chat-hf \
    --modeling.ckpt_path output/winter-snowball-223 --modeling.load_config_from_model_path \
    --training.report_to wandb

# dry run to get eval metrics of a checkpoint
CUDA_VISIBLE_DEVICES=5,6 python -m beagle.train --@llama2_7b_chat --dataset.path example --modeling.save_loading --training.max_tti_wind 5 --modeling.load_config_from_model_path --modeling.ckpt_path w32zhong/tough-night-506  --training.debug --training.force_model_parallel --dataset.eval_path /mnt/wd_ssd/datasets/self_ds_Llama-2-7b-chat-hf/ --dataset.path /mnt/hg_cache/temp_llama_dataset/datasets/ds_Llama-2-7b-chat-hf/ --training.eval_steps 1 --training.dry_run # --training.disable_sampled_print --training.number_sampled_print debug

Inference:

# load a decoder checkpoint on a llama2 base model:
CUDA_VISIBLE_DEVICES=0 python -m beagle.inference --@llama2_7b_chat \
    --modeling.ckpt_path w32zhong/easy-haze-finished \
    --modeling.use_lower_layers 0 --modeling.load_config_from_model_path \
    --inference.draft_tree_shape mc_sim_7b_63

# load other/older checkpoints that require converting state-dict keys:
python -m beagle.inference --@llama2_7b_chat --modeling.ckpt_path w32zhong/wild-blaze-41500 \
    --modeling.decoder_key_remap='{"decoder.cross_attn": "speculative_decoder.self_attn", "decoder": "speculative_decoder", "decoder.cross_attn_layernorm": "speculative_decoder.input_layernorm"}'
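
For reference, the remap is conceptually a longest-prefix-first substitution over the checkpoint's state-dict keys ("decoder.cross_attn" must match before the bare "decoder" prefix). A minimal sketch of the idea (illustration only; remap_keys is a hypothetical helper, not beagle's actual code):

# illustration only: prefix-based remapping of state-dict keys
def remap_keys(state_dict, remap):
    prefixes = sorted(remap, key=len, reverse=True)  # most specific prefix first
    out = {}
    for key, tensor in state_dict.items():
        new_key = key
        for old in prefixes:
            if key == old or key.startswith(old + "."):
                new_key = remap[old] + key[len(old):]
                break
        out[new_key] = tensor
    return out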

# quick test by simulating training
python -m beagle.inference --@llama2_7b_chat --modeling.ckpt_path w32zhong/ancient-haze-finished \
    --inference.mode simulate_training

# run baseline (using base model w/o speculative decoding)
python -m beagle.inference --@exaone3.5_2.4B_instr --inference.mode baseline

# simulate EAGLE
CUDA_VISIBLE_DEVICES=0 python -m beagle.inference --@llama2_7b_chat \
    --modeling.ckpt_path w32zhong/worthy-forest-finished \
    --modeling.use_fc_eagle --modeling.use_lower_layers 0

Docker build:

make beagle/mixin.c # deploy version
sudo docker build -f beagle_dockerfile -t beagle .
sudo docker image tag beagle ghcr.io/w32zhong/beagle:latest
sudo docker push ghcr.io/w32zhong/beagle:latest

# clean up
sudo docker container prune -f
sudo docker rmi $(sudo docker images -f "dangling=true" -q)

# run locally
mkdir ./output ./logs
sudo docker run --pull=always --gpus '"device=2,3"' --env HF_TOKEN=<hf_token> \
    --ipc=host --ulimit memlock=-1 \
    -v $HOME/.cache:/root/.cache -v /mnt/hg_cache/huggingface:/mnt/hg_cache/huggingface \
    -v `pwd`/output:/workspace/proj/output -v `pwd`/logs:/workspace/proj/logs \
    -it beagle \
    /bin/bash

Evaluation: Update tmux to at least 3.3a:

wget https://github.com/tmux/tmux/releases/download/3.3a/tmux-3.3a.tar.gz
tar xzf tmux-3.3a.tar.gz
cd tmux-3.3a/
sudo apt-get install ncurses-dev
./configure
make
sudo make install
sudo ln -sf /usr/local/bin/tmux /usr/bin/tmux
tmux -V # ensure the version is 3.3a
tmux kill-server # kill the old server

Then refer to experiments.sh for dataset evaluations.

Literature Review

A comprehensive list of papers can be found here: https://github.com/hemingkx/SpeculativeDecodingPapers

Although there are more recent survey papers, this one, to me, has the highest credibility: A Comprehensive Survey of Speculative Decoding.

When this paper came out, EAGLE was only at V1. But EAGLE-1 is still strong and, in my own experience, may still be the SoTA. There are several seemingly competitive alternatives:

  • EAGLE-2: it keeps the old EAGLE-1 draft model frozen without any further training, but uses logits to assess confidence so that it can dynamically construct an "optimally-structured draft tree". If we are not working on dynamic draft trees, we can fairly compare against EAGLE-1. (By the way, EAGLE-3 is reportedly on the way.)

  • CLLM: at the time of S3D, this was another SoTA in terms of speed, but its crucial disadvantage is that it trains, and thus changes, the base model. In my experience, its training objectives hurt the quality of the outputs.

Because we are now targeting lossless speculative decoding with a decoder on a frozen base model, I think the most relevant recent work is:

EAGLE v1 Replication

Set up environment and run an inference test:

# git clone --branch eagle-v1-save --depth 1 https://github.com/w32zhong/EAGLE.git EAGLE
git clone --branch v1 --depth 1 https://github.com/SafeAILab/EAGLE.git EAGLE-v1
cd EAGLE-v1
wget https://raw.githubusercontent.com/w32zhong/EAGLE/refs/heads/eagle-v1-save/application/test_v1.py -O eagle/application/test_v1.py
pip install -e .
pip install transformers==4.36.2
pip install accelerate==0.21.0
pip install datasets==3.2.0
cd eagle
CUDA_VISIBLE_DEVICES=0 python application/test_v1.py

Go to eagle/ge_data/allocation.py and change the GPU allocation for training-data generation (each inner list is the GPU group used by one data-generation process). For example, in my case,

gpus=[[0, 1],[2, 3]]

Then, in allocation.py, replace ge_data_all_vicuna.py with ge_data_all_llama2chat.py to generate data for the Llama 2 chat base model.

Go to eagle/ge_data/ge_data_all_llama2chat.py and change the following:

# bigname="/home/hongyanz/scratch/weights/llama2chat/13B"
bigname="meta-llama/Llama-2-7b-chat-hf"
...
# ds = load_dataset('json', data_files="/home/hongyanz/scratch/data/ShareGPT_V4.3_unfiltered_cleaned_split.json")
ds = load_dataset(
    path="Aeala/ShareGPT_Vicuna_unfiltered",
    data_files=["ShareGPT_V4.3_unfiltered_cleaned_split.json"],
    revision='8b0048ad6ae8c22f46a78c15559dec98feef5539'
)

Run the following to generate training data:

cd ge_data
python allocation.py --outdir /mnt/wd_ssd/

(/mnt/wd_ssd is my data storage directory)

This will take a few hours and consume 756 GiB of disk space.

Change directory to ../train and modify the wandb settings in 'main.py':

#wandb.init(project="ess", entity="yuhui-li", config=train_config)
wandb.init(project="beagle", config=train_config)

Importantly, change the list_files function to filter out empty training files (in my experience, about 0.5% of the inputs are empty), and skip all in-training tests to avoid potential division-by-zero errors.
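
A minimal sketch of such a filter, assuming list_files in main.py walks the data directory with os.walk (the surrounding code may differ slightly):

import os

def list_files(path):
    datapath = []
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # skip zero-byte training files to avoid load errors
            if os.path.getsize(file_path) == 0:
                continue
            datapath.append(file_path)
    return datapath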

Alternatively,

find /mnt/wd_ssd/sharegpt_0_67999_mufp16/ -size 0 -print -delete

Check out this patch for all detailed changes.

Now train the speculative decoder model:

CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch -m --mixed_precision=bf16 eagle.train.main \
    --tmpdir /mnt/wd_ssd/sharegpt_0_67999_mufp16/ --cpdir ./ckpt --configpath ./llama_2_chat_7B_config.json \
    --basepath ~/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-chat-hf/snapshots/f5db02db724555f92da89c216ac04704f23d4590 \
    --gradient-accumulation-steps 4 --bs 1

After training, use a simple test script to evaluate the speed of the saved model. For example, for the 10-epoch checkpoint, change ea_model_path to:

ea_model_path='../EAGLE-v1/eagle/train/ckpt/model_9'
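
For context, such a test script roughly follows the EaModel usage from EAGLE's README (a sketch only; the actual test_v1.py, module path, and prompt formatting may differ, and the checkpoint directory may additionally need the draft model's config.json):

import torch
from eagle.model.ea_model import EaModel  # module path may differ across EAGLE versions

model = EaModel.from_pretrained(
    base_model_path="meta-llama/Llama-2-7b-chat-hf",
    ea_model_path='../EAGLE-v1/eagle/train/ckpt/model_9',
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
)
model.eval()

prompt = "[INST] Tell me about speculative decoding. [/INST]"
input_ids = model.tokenizer(prompt, return_tensors="pt").input_ids.cuda()
output_ids = model.eagenerate(input_ids, temperature=0.5, max_new_tokens=256)
print(model.tokenizer.decode(output_ids[0]))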