---
title: Perceptual Copilot
emoji: πŸ‘οΈ
colorFrom: yellow
colorTo: purple
sdk: gradio
sdk_version: 5.33.1
app_file: main.py
pinned: false
license: mit
---
## ✨ What is Perceptual Copilot?
Perceptual Copilot is a prototype that connects OpenAI agents to visual tools so they can process a real-time video stream from your webcam. This experimental platform demonstrates both the promise and the current limitations of equipping agents with vision capabilities to understand and interact with live visual data.
### Architecture Overview
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚     Webcam      │───▢│     Memory      │◀──▢│     Gradio      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                β”‚
                                β–Ό
                       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                       β”‚      Agent      │◀──▢│      Tools      β”‚
                       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
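The loop below is a minimal sketch of that flow, assuming an OpenCV webcam capture, a bounded frame memory, and a hypothetical `run_agent` helper standing in for the agent call. It is illustrative only, not the actual wiring in `main.py`.

```python
# Minimal sketch of the webcam -> memory -> agent loop (illustrative only).
# `run_agent` and MAX_FRAMES are hypothetical names, not this repo's API.
import time
from collections import deque

import cv2  # OpenCV, used here only for the sketch's webcam capture

MAX_FRAMES = 8                    # how many recent frames to keep (assumed)
memory = deque(maxlen=MAX_FRAMES)

def run_agent(question, frames):
    """Placeholder for the agent call that picks a tool (caption, ocr, ...)."""
    return f"(agent would answer {question!r} using {len(frames)} frame(s))"

cap = cv2.VideoCapture(0)         # default webcam
try:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        memory.append(frame)      # newest frame replaces the oldest
        # In the real app the question comes from the Gradio chat box.
        print(run_agent("What do you see?", list(memory)))
        time.sleep(1.0)           # throttle the sketch
finally:
    cap.release()
```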
### Available Tools
| Tool | Description | Output |
|------|-------------|---------|
| `caption` | Generate detailed image descriptions | Rich visual descriptions |
| `ocr` | Extract text from images | Extracted text content |
| `localize` | Detect and locate objects | Bounding boxes with labels |
| `qa` | Answer questions about images | Contextual answers |
| `time` | Get current timestamp | Current date and time |
| _More tools coming soon..._ | Additional capabilities in development | Various outputs |
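For illustration, the snippet below sketches how tools like these could be exposed to the agent as plain Python callables. The function bodies are hypothetical stubs; in the real app they call the configured vision endpoints.

```python
# Hypothetical tool registry. The names mirror the table above, but the
# bodies are stubs; the real tools call remote vision endpoints.
from datetime import datetime

def caption(image):
    """Return a detailed description of `image` (stub)."""
    return "a person sitting at a desk with a laptop"

def ocr(image):
    """Return text extracted from `image` (stub)."""
    return "EXIT"

def localize(image, query):
    """Return detections as (label, x1, y1, x2, y2) tuples (stub)."""
    return [("cup", 120, 80, 220, 200)]

def qa(image, question):
    """Answer `question` about `image` (stub)."""
    return "Yes, the door is open."

def now():
    """Return the current timestamp."""
    return datetime.now().isoformat(timespec="seconds")

TOOLS = {"caption": caption, "ocr": ocr, "localize": localize, "qa": qa, "time": now}
```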
## πŸš€ Quick Start
### Prerequisites
- Python 3 with `pip`
- Webcam access
### Installation
1. **Install dependencies**
   ```bash
   pip install -r requirements.txt
   ```
2. **Set up environment variables** (see the configuration sketch after these steps)
   ```bash
   export HF_TOKEN="your_huggingface_token"
   export API_KEY="your_openai_api_key"
   export END_LANG="your_llm_endpoint"
   export END_TASK="your_task_endpoint"
   export MODEL_AGENT="your_agent_model"
   export MODEL_MLLM="your_multimodal_model"
   export MODEL_LOC="your_localization_model"
   ```
3. **Launch the application**
   ```bash
   python main.py
   ```
## πŸ’‘ Usage Examples
### Basic Interaction
- **User**: "What do you see?"
- **Assistant**: *Generates detailed caption of current view*
### OCR Functionality
- **User**: "Read the text in this document"
- **Assistant**: *Extracts and returns all visible text*
### Object Detection
- **User**: "What objects are in front of me?"
- **Assistant**: *Identifies and localizes objects with bounding boxes*
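Since frame annotation is handled with Supervision (see Acknowledgments), the snippet below sketches how localization results could be drawn onto a frame with a recent version of that library. The boxes, labels, and confidences are made up for illustration.

```python
# Drawing localization results on a frame with Supervision (illustrative values).
import numpy as np
import supervision as sv

frame = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a webcam frame

detections = sv.Detections(
    xyxy=np.array([[120, 80, 220, 200], [300, 150, 420, 310]], dtype=float),
    confidence=np.array([0.91, 0.84]),
    class_id=np.array([0, 1]),
)

annotated = sv.BoxAnnotator().annotate(scene=frame.copy(), detections=detections)
annotated = sv.LabelAnnotator().annotate(
    scene=annotated, detections=detections, labels=["cup", "keyboard"]
)
```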
## Acknowledgments
- Built with [Gradio](https://gradio.app/) for the interactive web interface
- Uses [Supervision](https://supervision.roboflow.com/) for frame annotation
- WebRTC integration via [FastRTC](https://github.com/gradio-app/gradio)