---
title: Image Captioning
emoji: 🖼️📝
colorFrom: pink
colorTo: purple
sdk: gradio
sdk_version: 5.31.0
app_file: app.py
pinned: false
license: apache-2.0
---

# Image Captioning App 🖼️📝

A web-based image captioning tool that automatically generates descriptive captions for uploaded images using a state-of-the-art vision-language model. Built with Gradio and deployed on Hugging Face Spaces.

![Demo Screenshot](image-captioning-logo.png)

## 🚀 Live Demo

Try the app: [Image-Captioning](https://huggingface.co/spaces/ashish-soni08/Image-Captioning)

## ✨ Features

- **Automatic Caption Generation**: Upload any image and get a descriptive caption instantly
- **Visual Understanding**: The model analyzes objects, scenes, and activities in images
- **Clean Interface**: Intuitive web UI built with Gradio for seamless image uploads
- **Responsive Design**: Works on desktop and mobile devices

## 🛠️ Technology Stack

- **Backend**: Python, Hugging Face Transformers
- **Frontend**: Gradio
- **Model**: [Salesforce/blip-image-captioning-base](https://huggingface.co/Salesforce/blip-image-captioning-base)
- **Deployment**: Hugging Face Spaces

## 🏃‍♂️ Quick Start

### Prerequisites

```bash
Python 3.8+
pip
```

### Installation

1. Clone the repository:

```bash
git clone https://github.com/Ashish-Soni08/image-captioning-app.git
cd image-captioning-app
```

2. Install dependencies:

```bash
pip install -r requirements.txt
```

3. Run the application:

```bash
python app.py
```

4. Open your browser and navigate to `http://localhost:7860`

## 📋 Usage

1. **Upload Image**: Click the "Upload image" button and select an image from your device
2. **Generate Caption**: The app automatically processes the image and generates a caption
3. **View Results**: The descriptive caption appears in the output textbox

### Example

**Input Image:**

```
[A photo of a golden retriever playing in a park]
```

**Generated Caption:**

```
"A golden retriever dog playing with a ball in a grassy park on a sunny day"
```

## 🧠 Model Information

This app uses **Salesforce/blip-image-captioning-base**, a vision-language model for image captioning:

- **Architecture**: BLIP with a ViT-Base vision backbone
- **Model Size**: ~990 MB (PyTorch model file)
- **Training Data**: COCO dataset, with captions bootstrapped from web data
- **Capabilities**: Both conditional and unconditional image captioning
- **Performance**: State-of-the-art image captioning results at release (+2.8% CIDEr reported in the BLIP paper)

A minimal sketch of how this model can be wired into the Gradio interface is included at the end of this README.

## 📁 Project Structure

```
image-captioning-app/
├── app.py              # Main Gradio application
├── requirements.txt    # Python dependencies
├── README.md           # Project documentation
└── example_images/     # Sample images for testing
```

## 📄 License

This project is licensed under the Apache License 2.0.

## 🙏 Acknowledgments

- [Hugging Face](https://huggingface.co/) for the Transformers library and model hosting
- [Gradio](https://gradio.app/) for the web interface framework
- [Salesforce Research](https://github.com/salesforce/BLIP) for the BLIP model

## 📞 Contact

Ashish Soni - ashish.soni2091@gmail.com

Project Link: [GitHub](https://github.com/Ashish-Soni08/image-captioning-app)
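
## 🧩 Implementation Sketch

For reference, this is a minimal sketch of how the captioning model can be combined with a Gradio interface using the Transformers `image-to-text` pipeline. It is an illustrative example rather than the exact contents of `app.py`; the function and variable names (`captioner`, `caption_image`) are assumptions, not the repository's actual code.

```python
# Minimal sketch of an image-captioning Gradio app.
# Illustrative only; the repository's app.py may differ in structure and naming.
import gradio as gr
from transformers import pipeline

# Load the BLIP base captioning model once at startup.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

def caption_image(image):
    """Generate a descriptive caption for an uploaded PIL image."""
    results = captioner(image)
    return results[0]["generated_text"]

demo = gr.Interface(
    fn=caption_image,
    inputs=gr.Image(type="pil", label="Upload image"),
    outputs=gr.Textbox(label="Caption"),
    title="Image Captioning App",
    description="Upload an image to generate a descriptive caption with BLIP.",
)

if __name__ == "__main__":
    demo.launch()  # serves on http://localhost:7860 by default
```

Keeping the pipeline object at module level means the model is downloaded and loaded only once when the Space starts, so each caption request only runs inference.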