Model Summary

This model is a fine-tuned version of BLIP-2 with the flan-t5-xl language decoder, optimized for image captioning tasks. Fine-tuning was performed using LoRA (Low-Rank Adaptation) for parameter-efficient adaptation. The training objective was to generate high-quality, semantically rich captions for images from the Open Images dataset.

It was developed for a captioning competition evaluated using Fréchet GTE Distance (FGD), which uses GTE-small embeddings to assess the alignment of image and caption semantics.
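
As a rough reference, the Fréchet distance compares the means and covariances of two embedding sets. The sketch below is an illustration only: it assumes the thenlper/gte-small checkpoint via sentence-transformers, and the competition's exact FGD implementation may differ.

# Illustrative Fréchet distance between two sets of GTE-small embeddings.
# Assumes sentence-transformers and thenlper/gte-small; not the official scoring code.
import numpy as np
from scipy import linalg
from sentence_transformers import SentenceTransformer

def frechet_distance(emb_a, emb_b):
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_a @ cov_b, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical error
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

encoder = SentenceTransformer("thenlper/gte-small")
generated = encoder.encode(["a dog runs along the shore", "a red bus on a city street", "two people hiking a trail"])
reference = encoder.encode(["a brown dog running on the beach", "a double-decker bus downtown", "hikers walking up a mountain path"])
print(frechet_distance(generated, reference))  # lower is better

In practice the distance is computed over many captions so the covariance estimates are stable.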

Training Objective

  • Task: Image Captioning
  • Base Model: Salesforce/blip2-flan-t5-xl
  • Backbone: Frozen ViT-G + frozen Q-Former
  • Decoder: Fine-tuned flan-t5-xl with LoRA
  • Loss: Cross-entropy with optional GTE-aware auxiliary loss
  • Evaluation: Fréchet GTE Distance (FGD) between image and caption embeddings
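
The frozen/trainable split above could be set up roughly as in the sketch below; this is an assumed setup, not the exact training script.

# Sketch: freeze the ViT-G vision encoder and the Q-Former so that only the
# flan-t5-xl language model is updated (via LoRA). Assumed setup.
import torch
from transformers import Blip2ForConditionalGeneration

model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xl", torch_dtype=torch.bfloat16
)

for param in model.vision_model.parameters():
    param.requires_grad = False
for param in model.qformer.parameters():
    param.requires_grad = False
model.query_tokens.requires_grad = False  # learned queries feeding the Q-Former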

Dataset

  • Training Dataset: Subset of Open Images with curated image-caption pairs
  • Augmentation: Synthetic captions
  • Image Features: Preprocessed using BLIP-2’s frozen vision encoder
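
A data pipeline along these lines could look as follows; the record schema, instruction prompt, and sequence lengths are assumptions, since the actual preparation code is not part of this card.

# Sketch of a training dataset wrapper for curated (image, caption) pairs.
# The {"image_path", "caption"} schema and the max lengths are assumptions.
from PIL import Image
from torch.utils.data import Dataset
from transformers import Blip2Processor

processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xl")

class CaptionDataset(Dataset):
    def __init__(self, records):
        self.records = records  # list of {"image_path": ..., "caption": ...}

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        image = Image.open(rec["image_path"]).convert("RGB")
        # Encoder side: pixel values plus a fixed instruction prompt.
        enc = processor(
            images=image,
            text="Provide a detailed caption for this photo.",
            return_tensors="pt",
            padding="max_length",
            truncation=True,
            max_length=32,
        )
        item = {k: v.squeeze(0) for k, v in enc.items()}  # drop the batch dimension
        # Decoder side: the reference caption as labels, padding ignored in the loss.
        labels = processor.tokenizer(
            rec["caption"],
            return_tensors="pt",
            padding="max_length",
            truncation=True,
            max_length=64,
        ).input_ids.squeeze(0)
        labels[labels == processor.tokenizer.pad_token_id] = -100
        item["labels"] = labels
        return item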

Fine-Tuning Configuration

  • LoRA Rank: 128
  • Alpha: 128
  • Dropout: 0.05
  • Target Modules: q, k, and v projections in the attention blocks
  • Precision: bfloat16
  • Optimizer: AdamW
  • Learning Rate: 3e-5
  • Scheduler: Cosine with warmup
  • Batch Size: 32
  • Epochs: 4
  • Accumulation: 2 gradient accumulation steps
  • Logging: Weights & Biases
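
Expressed with peft and transformers, the configuration above would look roughly like the sketch below; the warmup fraction and output directory are assumptions, and the exact training script is not published.

# Sketch of the LoRA and trainer configuration listed above.
import torch
from peft import LoraConfig, get_peft_model
from transformers import Blip2ForConditionalGeneration, TrainingArguments

model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xl", torch_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    r=128,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q", "k", "v"],  # flan-t5-xl attention projections
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

training_args = TrainingArguments(
    output_dir="blip2-flan-t5-xl-lora-captioning",  # assumed
    learning_rate=3e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,  # assumed warmup fraction
    per_device_train_batch_size=32,
    gradient_accumulation_steps=2,
    num_train_epochs=4,
    bf16=True,
    optim="adamw_torch",
    report_to="wandb",
)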

Performance

  • Fréchet GTE Distance (↓): Achieved a competitive score in the competition
  • Caption Quality: High semantic alignment and fluency
  • OCR: Improved handling of text appearing in images

Usage

from transformers import Blip2Processor, Blip2ForConditionalGeneration
from PIL import Image
import torch

processor = Blip2Processor.from_pretrained("erayalp/blip2-flan-t5-xl-LoRA-image-captioning")
model = Blip2ForConditionalGeneration.from_pretrained(
    "erayalp/blip2-flan-t5-xl-LoRA-image-captioning",
    torch_dtype=torch.bfloat16,  # matches the bfloat16 training precision
).to("cuda")

img = Image.open("example.jpg").convert("RGB")
prompt = "Provide a detailed caption for this photo."
inputs = processor(images=img, text=prompt, return_tensors="pt").to("cuda", torch.bfloat16)
outputs = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(outputs[0], skip_special_tokens=True)

print(caption)
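
For batched inference, continuing from the variables above, pass matching lists of images and prompts (the file paths here are placeholders):

images = [Image.open(p).convert("RGB") for p in ["photo1.jpg", "photo2.jpg"]]  # placeholder paths
prompts = ["Provide a detailed caption for this photo."] * len(images)
inputs = processor(images=images, text=prompts, return_tensors="pt", padding=True).to("cuda", torch.bfloat16)
outputs = model.generate(**inputs, max_new_tokens=30)
captions = processor.batch_decode(outputs, skip_special_tokens=True)
print(captions)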

Limitations

  • May underperform on out-of-domain images or very abstract concepts
  • Quality of captions may vary depending on scene complexity
  • Not trained on video or temporal sequences