Upload 2 files

- README.md: +28 -8
- requirements.txt: +75 -0
README.md (CHANGED)

@@ -14,7 +14,24 @@ pipeline_tag: text-to-image
 **BeamDiffusion** introduces a novel approach for generating coherent image sequences from text prompts by employing beam search in latent space. Unlike traditional methods that generate images independently, BeamDiffusion iteratively explores latent representations, ensuring smooth transitions and visual continuity across frames. A cross-attention mechanism efficiently scores and prunes search paths, optimizing both textual alignment and visual coherence.
 BeamDiffusion addresses the challenge of maintaining visual consistency in image sequences generated from text prompts. By leveraging a beam search strategy in the latent space, it refines the generation process to produce sequences with enhanced coherence and alignment with textual descriptions, as outlined in the [paper](https://arxiv.org/abs/2503.20429).
 
-
+---
+## 🛠️ Setup Instructions
+
+Before using BeamDiffusion, follow these steps to set up your environment:
+
+```bash
+# 1. Create a virtual environment (recommended)
+python3 -m venv beam_env
+
+# 2. Activate the virtual environment
+source beam_env/bin/activate   # On macOS/Linux
+# beam_env\Scripts\activate    # On Windows
+
+# 3. Install required dependencies
+pip install -r ./BeamDiffusionModel/requirements.txt
+```
+---
+## 🚀 Quickstart Guide
 
 Here's a basic example of how to use BeamDiffusion with the `transformers` library to generate an image sequence based on a series of text prompts:
 
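The quickstart snippet itself sits in unchanged context and is elided between these hunks; the hunk headers below show its final call, `sequence_imgs = pipe(input_data)`. A minimal sketch of that call's shape follows, using only the two keys documented in this card — the step strings are invented examples and the loading of `pipe` is assumed from the elided snippet, so this is not the card's verbatim code:

```python
# Minimal sketch of the elided quickstart call shape. How `pipe` is loaded
# is not visible in this diff, so it is taken as given from the card's full
# snippet; the step strings below are invented examples.

input_data = {
    # `steps`: one description per image in the generated sequence.
    "steps": [
        "Crack the eggs into a bowl and whisk until frothy.",
        "Fold the sifted flour into the egg mixture.",
        "Bake the batter until golden brown.",
    ],
    # `use_rand`: True for more varied results, False for deterministic output.
    "use_rand": True,
}

sequence_imgs = pipe(input_data)  # final call, as shown in the hunk context
```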
@@ -50,7 +67,7 @@ sequence_imgs = pipe(input_data)
 
 
 
-## Input Parameters Explained
+## 🔍 Input Parameters Explained
 
 - **`steps`** (`list of strings`): Descriptions for each step in the image generation process. The model generates one image per step, forming a sequence that aligns with these descriptions.
 
@@ -68,15 +85,18 @@ sequence_imgs = pipe(input_data)
 
 - **`use_rand`** (`bool`): Flag to introduce randomness in the inference process. If set to `True`, the model generates more varied and creative results; if `False`, it produces more deterministic outputs.
 
-## Citation
+## 📚 Citation
 
 If you use BeamDiffusion in your research or projects, please cite the following paper:
 
 ```
-@
-
-
-
-
+@misc{fernandes2025latentbeamdiffusionmodels,
+  title={Latent Beam Diffusion Models for Decoding Image Sequences},
+  author={Guilherme Fernandes and Vasco Ramos and Regev Cohen and Idan Szpektor and João Magalhães},
+  year={2025},
+  eprint={2503.20429},
+  archivePrefix={arXiv},
+  primaryClass={cs.CV},
+  url={https://arxiv.org/abs/2503.20429},
 }
 ```
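As a rough illustration of the "beam search in latent space" idea described at the top of the card: the sketch below is generic beam search with an abstract candidate proposer and scorer standing in for the per-step denoising runs and the cross-attention score. The names, types, and beam width are all assumptions, not the repository's implementation.

```python
# Illustrative sketch of beam search over per-step latent candidates
# (not the repository's code): keep the `beam_width` best latent paths,
# scored by a stand-in for the cross-attention coherence/alignment score.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Path:
    latents: List[int]   # stand-in for a sequence of latent tensors
    score: float         # cumulative coherence/alignment score

def beam_search(steps: List[str],
                propose: Callable[[str, Path], List[int]],
                score: Callable[[str, Path, int], float],
                beam_width: int = 3) -> Path:
    beams = [Path([], 0.0)]
    for step in steps:
        candidates = []
        for path in beams:
            # Expand each surviving path with several candidate latents
            # (e.g. different noise seeds for this step's denoising run).
            for latent in propose(step, path):
                candidates.append(
                    Path(path.latents + [latent],
                         path.score + score(step, path, latent)))
        # Prune: keep only the `beam_width` highest-scoring paths.
        beams = sorted(candidates, key=lambda p: p.score, reverse=True)[:beam_width]
    return beams[0]  # best-scoring full sequence of latents

# Toy usage: "latents" are ints, the proposer offers two seeds per step,
# and the scorer prefers latents close to the previous one (a coherence stand-in).
best = beam_search(
    ["step one", "step two"],
    propose=lambda step, path: [0, 1],
    score=lambda step, path, lat: -abs(lat - (path.latents[-1] if path.latents else 0)),
)
print(best.latents, best.score)
```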
|
requirements.txt (ADDED)

@@ -0,0 +1,75 @@
+--extra-index-url https://download.pytorch.org/whl/cu126
+accelerate==1.6.0
+aiohappyeyeballs==2.6.1
+aiohttp==3.11.16
+aiosignal==1.3.2
+annotated-types==0.7.0
+async-timeout==5.0.1
+attrs==25.3.0
+certifi==2025.1.31
+charset-normalizer==3.4.1
+click==8.1.8
+diffusers==0.32.2
+docker-pycreds==0.4.0
+eval_type_backport==0.2.2
+filelock==3.18.0
+frozenlist==1.5.0
+fsspec==2025.3.2
+gitdb==4.0.12
+GitPython==3.1.44
+huggingface-hub==0.30.1
+idna==3.10
+importlib_metadata==8.6.1
+Jinja2==3.1.3
+lightning==2.5.1
+lightning-utilities==0.14.3
+MarkupSafe==2.1.5
+mpmath==1.3.0
+multidict==6.3.2
+networkx==3.2.1
+numpy==2.0.2
+nvidia-cublas-cu12==12.6.4.1
+nvidia-cuda-cupti-cu12==12.6.80
+nvidia-cuda-nvrtc-cu12==12.6.77
+nvidia-cuda-runtime-cu12==12.6.77
+nvidia-cudnn-cu12==9.5.1.17
+nvidia-cufft-cu12==11.3.0.4
+nvidia-curand-cu12==10.3.7.77
+nvidia-cusolver-cu12==11.7.1.2
+nvidia-cusparse-cu12==12.5.4.2
+nvidia-cusparselt-cu12==0.6.3
+nvidia-nccl-cu12==2.21.5
+nvidia-nvjitlink-cu12==12.6.85
+nvidia-nvtx-cu12==12.6.77
+packaging==24.2
+pillow==11.0.0
+platformdirs==4.3.7
+propcache==0.3.1
+protobuf==5.29.4
+psutil==7.0.0
+pydantic==2.11.2
+pydantic_core==2.33.1
+pytorch-lightning==2.5.1
+PyYAML==6.0.2
+regex==2024.11.6
+requests==2.32.3
+safetensors==0.5.3
+sentry-sdk==2.25.1
+setproctitle==1.3.5
+six==1.17.0
+smmap==5.0.2
+sympy==1.13.1
+tokenizers==0.21.1
+torch==2.6.0+cu126
+torchaudio==2.6.0+cu126
+torchmetrics==1.7.0
+torchvision==0.21.0+cu126
+tqdm==4.67.1
+transformers==4.51.0
+triton==3.2.0
+typing-inspection==0.4.0
+typing_extensions==4.13.1
+urllib3==2.3.0
+wandb==0.19.9
+yarl==1.19.0
+zipp==3.21.0
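One note on the pins above: the `--extra-index-url` line points pip at PyTorch's CUDA 12.6 wheel index, which is what resolves the `+cu126` local-version pins (`torch==2.6.0+cu126`, `torchvision==0.21.0+cu126`, `torchaudio==2.6.0+cu126`). On a machine without a suitable CUDA driver, a plausible substitution (an assumption, not part of this commit) is the default PyPI builds:

```bash
# Hypothetical CPU-only variant (not in this commit): drop the extra index
# and the +cu126 suffixes, keeping the same matching version triplet.
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0
```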