title: Perceptual Copilot
emoji: 👁️
colorFrom: yellow
colorTo: purple
sdk: gradio
sdk_version: 5.33.1
app_file: main.py
pinned: false
license: mit
✨ What is Perceptual Copilot?
Perceptual Copilot is a prototype that demonstrates the integration of OpenAI agents with visual tools to process real-time video streams. This experimental platform showcases both the promising potential and current limitations of equipping agents with vision capabilities to understand and interact with live visual data.
Architecture Overview
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│     Webcam      │───▶│     Memory      │───▶│     Gradio      │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                │
                                ▼
                       ┌─────────────────┐    ┌─────────────────┐
                       │      Agent      │───▶│      Tools      │
                       └─────────────────┘    └─────────────────┘
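The Memory box in the diagram can be pictured as a small buffer that the webcam writes to and the agent reads from. The sketch below is one possible shape for such a buffer, not the project's actual implementation: the names `Frame` and `FrameMemory` and the `max_frames` parameter are hypothetical, and the image is treated as an opaque object.

```python
import threading
import time
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Deque, Optional


@dataclass
class Frame:
    """One captured frame plus the time it was stored."""
    image: Any  # e.g. a NumPy array from the webcam; any object works in this sketch
    timestamp: float = field(default_factory=time.time)


class FrameMemory:
    """Thread-safe ring buffer of recent frames (hypothetical helper)."""

    def __init__(self, max_frames: int = 30) -> None:
        # deque(maxlen=...) silently drops the oldest frame once full.
        self._frames: Deque[Frame] = deque(maxlen=max_frames)
        self._lock = threading.Lock()

    def add(self, image: Any) -> None:
        """Called by the capture side for every incoming frame."""
        with self._lock:
            self._frames.append(Frame(image))

    def latest(self) -> Optional[Frame]:
        """Called by the agent side; returns None if nothing has arrived yet."""
        with self._lock:
            return self._frames[-1] if self._frames else None

    def __len__(self) -> int:
        with self._lock:
            return len(self._frames)
```

A bounded buffer keeps memory use constant while the stream runs, and the lock lets the capture callback and the agent run on different threads.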
Available Tools
| Tool | Description | Output |
|---|---|---|
| caption | Generate detailed image descriptions | Rich visual descriptions |
| ocr | Extract text from images | Extracted text content |
| localize | Detect and locate objects | Bounding boxes with labels |
| qa | Answer questions about images | Contextual answers |
| time | Get current timestamp | Current date and time |
| More tools coming soon... | Additional capabilities in development | Various outputs |
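To illustrate how tools like those listed above might be exposed to an agent by name, here is a minimal registry sketch. It is an assumption about structure, not the project's code: the `tool` decorator, the `TOOLS` dictionary, and `call_tool` are hypothetical, and only the `time` tool is actually implemented. The `caption` entry is a placeholder, since the real tool would call a multimodal model.

```python
from datetime import datetime
from typing import Any, Callable, Dict

# Hypothetical registry mapping tool names to plain Python callables.
TOOLS: Dict[str, Callable[..., str]] = {}


def tool(name: str) -> Callable[[Callable[..., str]], Callable[..., str]]:
    """Decorator that registers a function under the given tool name."""
    def register(fn: Callable[..., str]) -> Callable[..., str]:
        TOOLS[name] = fn
        return fn
    return register


@tool("time")
def current_time() -> str:
    """Return the current local date and time as an ISO 8601 string."""
    return datetime.now().isoformat(timespec="seconds")


@tool("caption")
def caption(frame: Any) -> str:
    """Placeholder: a real version would send the frame to a multimodal model."""
    return "caption not implemented in this sketch"


def call_tool(name: str, *args: Any, **kwargs: Any) -> str:
    """Look up a tool by name and invoke it, failing loudly on unknown names."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](*args, **kwargs)
```

For example, `call_tool("time")` returns a timestamp string, while `call_tool("ocr", frame)` raises `KeyError` here because no `ocr` function is registered in the sketch.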
🚀 Quick Start
Prerequisites
- Webcam access
Installation
Install dependencies
pip install -r requirements.txt
Set up environment variables
export HF_TOKEN="your_huggingface_token"
export API_KEY="your_openai_api_key"
export END_LANG="your_llm_endpoint"
export END_TASK="your_task_endpoint"
export MODEL_AGENT="your_agent_model"
export MODEL_MLLM="your_multimodal_model"
export MODEL_LOC="your_localization_model"
Launch the application
python main.py
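A missing variable from the setup step is easier to diagnose before launch than as a failure in the middle of a session. The sketch below shows one way to check for them early. It assumes all seven variables are required, which the README does not state, and `load_settings` is a hypothetical helper rather than part of `main.py`.

```python
import os
from typing import Dict, Mapping

# Variable names taken from the setup step; treating all of them as
# mandatory is an assumption of this sketch.
REQUIRED_VARS = (
    "HF_TOKEN",
    "API_KEY",
    "END_LANG",
    "END_TASK",
    "MODEL_AGENT",
    "MODEL_MLLM",
    "MODEL_LOC",
)


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the required settings, or raise if any are unset or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Calling `load_settings()` at startup reports every missing name in a single error message. Passing a plain dictionary as `env` makes the check easy to exercise without touching the real environment.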
💡 Usage Examples
Basic Interaction
- User: "What do you see?"
- Assistant: Generates detailed caption of current view
OCR Functionality
- User: "Read the text in this document"
- Assistant: Extracts and returns all visible text
Object Detection
- User: "What objects are in front of me?"
- Assistant: Identifies and localizes objects with bounding boxes
Acknowledgments
- Built with Gradio for the interactive web interface
- Uses Supervision for frame annotation
- WebRTC integration via FastRTC