Update - 6/6/2025

A further refined booru shunt has been uploaded, trained for considerably more captioned steps at batch size 1024.

  • t5-vit-l-14-dual_shunt_booru_51_200_000.safetensors

Roughly 51 million samples trained.

Training this variation with the refined methodology takes additional time, so progress on the L4 is slower than I'd like. Signal convergence is slower, but also more reliable with the modified loss formula.

Might move it to A100s, but it's probably not necessary. Just patience.

Update - 6/5/2025

This run uses a more refined tokenization system that correctly matches the exact tokens produced by the deterministic tokenizer, along with adjusted losses, noise chances, and additional cross-contamination processes for more careful selection.

The first booru signal expert is born - trained at batch size 1024 on nearly 13 million 77-token samples, and still fairly untested.

Instead of plain English, she learned 13 million variations drawn from over 1.2 million tags, artists, classifications, and non-deterministic rotational valuations. She was specifically trained on high-batch counted variations to introduce large amounts of variance per update.

The templates for booru are exceptionally different, so this vit-l-14-dual_shunt_booru should attend to exceptionally different kinds of information - while simultaneously being an expert at both positive and negative tokenizations.

  • t5-vit-l-14-dual_shunt_booru_13_000_000.safetensors

Simple Summary

This project provides an advanced text control system for any AI generator that uses ViT-L-14, also known as CLIP-L, as its basis.

It lets you “steer” how AI interprets your written prompts by adding a smart adapter between the text input and the image model. By fine-tuning how the prompt is understood, you get more accurate, creative, or controllable AI-generated images—especially in complex or multi-style models like Stable Diffusion XL.

More Technical Summary

This repository contains code, configuration, and weights for the Dual Shunt Adapter: a modular cross-attention prompt embedding controller designed for SDXL and multi-CLIP diffusion systems. The adapter bridges T5 (or other transformer) text encoders with CLIP-based pooled embedding spaces, providing delta, gate, log_sigma, anchor, and guidance outputs for per-token, per-field semantic modulation. Compatible with custom and parallel CLIP streams (e.g., SDXL’s CLIP-L/CLIP-G), the system enables targeted latent field steering, dynamic classifier-free guidance, and localized prompt injection for advanced generative workflows—including direct integration with ComfyUI and HuggingFace Diffusers.
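
As a rough sketch of how those heads could be combined, the snippet below modulates per-token CLIP embeddings with the adapter's outputs. The function name, tensor shapes, and the exact mixing formula are assumptions for illustration only; the actual recipe lives in model.py.

```python
import torch

def apply_shunt_outputs(clip_tokens, delta, gate, log_sigma, anchor, guidance_scale=1.0):
    """Hypothetical mixing of the adapter's heads (the real formula is in model.py).

    clip_tokens    : (B, 77, D) per-token CLIP embeddings (CLIP-L or CLIP-G stream)
    delta          : (B, 77, D) additive semantic shift predicted from the T5 stream
    gate           : (B, 77, 1) per-token modulation strength in [0, 1]
    log_sigma      : (B, 77, 1) learned log-scale for optional noise injection
    anchor         : (B, 77, D) reference field the modulated tokens are pulled toward
    guidance_scale : float      global strength of the modulation
    """
    # Gated additive shift of the original tokens.
    shifted = clip_tokens + gate * delta
    # Pull the shifted tokens toward the anchor field, again gated per token.
    anchored = shifted + gate * (anchor - shifted)
    # Optional stochastic jitter scaled by the learned sigma.
    anchored = anchored + log_sigma.exp() * torch.randn_like(anchored)
    # Blend between the untouched and the modulated embeddings.
    return clip_tokens + guidance_scale * (anchored - clip_tokens)
```

The modulated embeddings can then be handed to the downstream pipeline in place of the usual CLIP output, for example via the prompt_embeds argument of a Diffusers SDXL pipeline or an equivalent ComfyUI conditioning node.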

Code

The model code is present in model.py. Inference code will be available in the long-winded article.
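
Until that article is available, a minimal loading sketch is shown below. It assumes the checkpoint is hosted in this repository and that model.py exposes an adapter class; the class name and constructor arguments here are placeholders.

```python
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

# from model import DualShuntAdapter  # hypothetical class name; see model.py for the real one

REPO_ID = "AbstractPhil/t5-flan-base-vit-l-14-dual-stream-adapter"  # assumed host repo
FILENAME = "t5-vit-l-14-dual_shunt_booru_13_000_000.safetensors"

# Fetch the checkpoint and load it as a plain state dict of tensors.
ckpt_path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME)
state_dict = load_file(ckpt_path)

# adapter = DualShuntAdapter(...)      # constructor arguments are defined in model.py
# adapter.load_state_dict(state_dict)
# adapter.eval()

print(f"Loaded {len(state_dict)} tensors from {FILENAME}")
```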
