metadata

title: Mustalhim AI
emoji: 👁
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.18.0
app_file: app.py
pinned: false
license: apache-2.0

Mustalhim: Image to Audio Story Generator

Mustalhim (مستلهم), meaning "inspired" in Arabic, is an AI-powered application that turns an image into a narrated audio story. It chains three pre-trained components: an image captioning model, a text-generation model that writes the story, and a text-to-speech library that reads it aloud.

Features

Image Captioning: Generates a descriptive caption for an uploaded image using the Salesforce/blip-image-captioning-large model.
Story Generation: Creates a long, engaging story inspired by the image caption using the ALLaM-7B-Instruct-preview model.
Text-to-Speech: Converts the generated story into an audio file using the kokoro library.
Gradio Interface: Provides an easy-to-use web interface for uploading images and listening to the generated audio.
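
As a rough illustration of how the components listed above might be loaded, here is a minimal Python sketch. It assumes the `pipeline` helper from transformers for the two models and the `KPipeline` class from kokoro. The `ALLaM-AI/` organization prefix, the `lang_code` value, and all function names are assumptions made for illustration; they are not taken from app.py and should be checked against the installed library versions.

```python
# Model identifiers named in this README. The "ALLaM-AI/" prefix is an
# assumption based on the Acknowledgments section.
MODEL_IDS = {
    "captioner": "Salesforce/blip-image-captioning-large",
    "storyteller": "ALLaM-AI/ALLaM-7B-Instruct-preview",
}


def load_captioner():
    """Return an image-to-text pipeline (hypothetical helper)."""
    # Imported lazily so this module can be imported without the heavy dependency.
    from transformers import pipeline

    return pipeline("image-to-text", model=MODEL_IDS["captioner"])


def load_storyteller():
    """Return a text-generation pipeline (hypothetical helper)."""
    from transformers import pipeline

    return pipeline("text-generation", model=MODEL_IDS["storyteller"])


def load_tts(lang_code="a"):
    """Return a kokoro text-to-speech pipeline (hypothetical helper).

    Assumption: "a" selects American English, following kokoro's published
    usage example.
    """
    from kokoro import KPipeline

    return KPipeline(lang_code=lang_code)
```

Keeping the imports inside the loaders means nothing is downloaded until a loader is actually called, which can help when only one stage is being tested.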

How It Works

Image Captioning: The app uses a pre-trained image captioning model to generate a textual description of the uploaded image.
Story Generation: The caption is passed to a text-generation model, which creates a long, creative story inspired by the caption.
Text-to-Speech: The generated story is converted into an audio file using a text-to-speech library.
Output: The app returns the audio file, which can be played directly in the interface.
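
The flow above can be sketched as a plain function that chains three stages. This is a minimal sketch, not the code in app.py: the function name, the parameter names, and the returned dictionary are hypothetical, and each stage is passed in as a callable so the chain can be tried with stand-ins before any model is loaded.

```python
def image_to_audio_story(image, caption_fn, story_fn, speech_fn):
    """Chain caption -> story -> speech and return every intermediate result.

    caption_fn: image -> caption text
    story_fn:   caption text -> story text
    speech_fn:  story text -> path of the written audio file
    """
    caption = caption_fn(image)
    story = story_fn(caption)
    audio_path = speech_fn(story)
    return {"caption": caption, "story": story, "audio_path": audio_path}


# Stand-in stages for a dry run; real stages would wrap the actual models.
result = image_to_audio_story(
    image="photo.jpg",
    caption_fn=lambda img: "a cat sitting on a windowsill",
    story_fn=lambda cap: f"Once upon a time, there was {cap}.",
    speech_fn=lambda text: "story.wav",
)
```

Passing the stages as arguments keeps each one replaceable, so a different captioning model or voice can be swapped in without touching the chain itself.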

Demo

You can try the app live on Hugging Face Spaces.

Files

app.py: The main application script that defines the Gradio interface and integrates the image captioning, story generation, and text-to-speech functionalities.
requirements.txt: Lists the Python dependencies required for the project.
Dockerfile: Defines the environment for deploying the app on Hugging Face Spaces.
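
To give a sense of what the interface definition in app.py might look like, here is a minimal sketch using `gr.Interface` with an image input and an audio output. The function names `story_from_image` and `build_demo`, the labels, and the choice of `type="pil"` and `type="filepath"` are assumptions made for illustration, not the actual contents of app.py.

```python
def story_from_image(image):
    """Placeholder for the caption -> story -> speech chain.

    A real implementation would return the path of the generated audio file.
    """
    raise NotImplementedError("plug the three model stages in here")


def build_demo(fn=story_from_image):
    """Build (but do not launch) a Gradio interface around fn."""
    # Imported lazily so the sketch can be read and imported without Gradio.
    import gradio as gr

    return gr.Interface(
        fn=fn,
        inputs=gr.Image(type="pil", label="Upload an image"),
        outputs=gr.Audio(type="filepath", label="Generated story"),
        title="Mustalhim: Image to Audio Story Generator",
    )
```

Calling `build_demo().launch()` would then start the web interface locally.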

Requirements

Python 3.9+
Libraries:
- gradio
- transformers
- torch
- soundfile
- kokoro
- sentencepiece
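
A quick way to confirm that an environment meets the requirements above is to check the interpreter version and look each library up by import name. The sketch below uses only the standard library; the helper names are hypothetical, and it assumes each package is imported under the same name as it is listed here.

```python
import sys
from importlib.util import find_spec

# Import names of the libraries listed in this README.
REQUIRED_MODULES = [
    "gradio",
    "transformers",
    "torch",
    "soundfile",
    "kokoro",
    "sentencepiece",
]


def missing_modules(names=REQUIRED_MODULES):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


def python_is_supported(version_info=sys.version_info):
    """Return True when the interpreter is Python 3.9 or newer."""
    return tuple(version_info[:2]) >= (3, 9)
```

For example, `missing_modules()` returns an empty list once everything in requirements.txt has been installed.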

Example Usage

1. Upload an image through the Gradio interface.
2. The app generates a caption for the image.
3. The app writes a story based on the caption.
4. The app converts the story into an audio file, which you can play directly in the interface.

Screenshots

Example of the Mustalhim interface.

Contributing

Contributions are welcome! If you'd like to improve this project, please follow these steps:

1. Fork the repository.
2. Create a new branch (git checkout -b feature/YourFeature).
3. Commit your changes (git commit -m 'Add some feature').
4. Push to the branch (git push origin feature/YourFeature).
5. Open a pull request.

License

This project is licensed under the Apache License 2.0, matching the license: apache-2.0 field in the Space metadata. See the LICENSE file for details.

Acknowledgments

Hugging Face for providing the models and deployment platform.
Gradio for the easy-to-use interface.
Salesforce for the blip-image-captioning-large model.
ALLaM-AI for the ALLaM-7B-Instruct-preview model.

About the Name

Mustalhim (مستلهم) is an Arabic word meaning "inspired." This project is inspired by the power of AI to transform images into creative and engaging stories, bridging the gap between visual and auditory storytelling.

Contact

For questions or feedback, feel free to reach out:

Name: Mohammad Alkhatim
Email: your-email@example.com
GitHub: MoJaff
LinkedIn: Mohammad Alkhatim
Hugging Face: MoJaff

Experience the magic of Mustalhim and let your images inspire stories! 🚀

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference