---
title: Image Captioning
emoji: 🖼️📝
colorFrom: pink
colorTo: purple
sdk: gradio
sdk_version: 5.31.0
app_file: app.py
pinned: false
license: apache-2.0
---

# Image Captioning App 🖼️📝

A web-based image captioning tool that automatically generates descriptive captions for uploaded images using a state-of-the-art vision-language model. Built with Gradio and deployed on Hugging Face Spaces.

![Demo Screenshot](image-captioning-logo.png)

## 🚀 Live Demo

Try the app: [Image-Captioning](https://huggingface.co/spaces/ashish-soni08/Image-Captioning)

## ✨ Features

- **Automatic Caption Generation**: Upload any image and get a descriptive caption instantly
- **Visual Understanding**: The model analyzes objects, scenes, and activities in images
- **Clean Interface**: Intuitive web UI built with Gradio for seamless image uploads
- **Responsive Design**: Works on desktop and mobile devices

## 🛠️ Technology Stack

- **Backend**: Python, Hugging Face Transformers
- **Frontend**: Gradio
- **Model**: [Salesforce/blip-image-captioning-base](https://huggingface.co/Salesforce/blip-image-captioning-base)
- **Deployment**: Hugging Face Spaces

## 🏃‍♂️ Quick Start

### Prerequisites

```bash
Python 3.8+
pip
```

### Installation

1. Clone the repository:

```bash
git clone https://github.com/Ashish-Soni08/image-captioning-app.git
cd image-captioning-app
```

2. Install dependencies:

```bash
pip install -r requirements.txt
```

3. Run the application:

```bash
python app.py
```

4. Open your browser and navigate to `http://localhost:7860`

## 📋 Usage

1. **Upload Image**: Click the "Upload image" button and select an image from your device
2. **Generate Caption**: The app automatically processes the image and generates a caption
3. **View Results**: The descriptive caption appears in the output textbox

### Example

**Input Image:**

```
[A photo of a golden retriever playing in a park]
```

**Generated Caption:**

```
"A golden retriever dog playing with a ball in a grassy park on a sunny day"
```

## 🧠 Model Information

This app uses **Salesforce/blip-image-captioning-base**, a vision-language model for image captioning:

- **Architecture**: BLIP with a ViT-Base vision backbone
- **Model Size**: ~990 MB (PyTorch model file)
- **Training Data**: COCO dataset, with captions bootstrapped from web data
- **Capabilities**: Both conditional and unconditional image captioning
- **Performance**: State-of-the-art image captioning results at release (+2.8% CIDEr reported in the BLIP paper)

A minimal sketch of how this model can be wired into the Gradio interface is included at the end of this README.

## 📁 Project Structure

```
image-captioning-app/
├── app.py              # Main Gradio application
├── requirements.txt    # Python dependencies
├── README.md           # Project documentation
└── example_images/     # Sample images for testing
```

## 📄 License

This project is licensed under the Apache License 2.0.

## 🙏 Acknowledgments

- [Hugging Face](https://huggingface.co/) for the Transformers library and model hosting
- [Gradio](https://gradio.app/) for the web interface framework
- [Salesforce Research](https://github.com/salesforce/BLIP) for the BLIP model

## 📞 Contact

Ashish Soni - ashish.soni2091@gmail.com

Project Link: [GitHub](https://github.com/Ashish-Soni08/image-captioning-app)
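
## 🧩 Implementation Sketch

For reference, this is a minimal sketch of how the captioning model can be combined with a Gradio interface using the Transformers `image-to-text` pipeline. It is an illustrative example rather than the exact contents of `app.py`; the function and variable names (`captioner`, `caption_image`) are assumptions, not the repository's actual code.

```python
# Minimal sketch of an image-captioning Gradio app.
# Illustrative only; the repository's app.py may differ in structure and naming.
import gradio as gr
from transformers import pipeline

# Load the BLIP base captioning model once at startup.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

def caption_image(image):
    """Generate a descriptive caption for an uploaded PIL image."""
    results = captioner(image)
    return results[0]["generated_text"]

demo = gr.Interface(
    fn=caption_image,
    inputs=gr.Image(type="pil", label="Upload image"),
    outputs=gr.Textbox(label="Caption"),
    title="Image Captioning App",
    description="Upload an image to generate a descriptive caption with BLIP.",
)

if __name__ == "__main__":
    demo.launch()  # serves on http://localhost:7860 by default
```

Keeping the pipeline object at module level means the model is downloaded and loaded only once when the Space starts, so each caption request only runs inference.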