---
title: Perceptual Copilot
emoji: 👁️
colorFrom: yellow
colorTo: purple
sdk: gradio
sdk_version: 5.33.1
app_file: main.py
pinned: false
license: mit
---
## ✨ What is Perceptual Copilot?
Perceptual Copilot is a prototype that demonstrates the integration of OpenAI agents with visual tools to process real-time video streams. This experimental platform showcases both the promising potential and current limitations of equipping agents with vision capabilities to understand and interact with live visual data.
### Architecture Overview
```
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│    Webcam    │─────▶│    Memory    │─────▶│    Gradio    │
└──────────────┘      └──────────────┘      └──────────────┘
                             │
                             ▼
                      ┌──────────────┐      ┌──────────────┐
                      │    Agent     │─────▶│    Tools     │
                      └──────────────┘      └──────────────┘
```
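Conceptually, webcam frames are pushed into a shared in-memory buffer, the agent pulls the most recent frame whenever it invokes a visual tool, and Gradio renders the chat and annotated stream. A minimal sketch of that frame buffer and callback (names such as `FrameMemory` and `on_frame` are illustrative, not the actual implementation in `main.py`):

```python
import threading
from collections import deque
from typing import Optional

import numpy as np


class FrameMemory:
    """Thread-safe buffer holding the most recent webcam frames."""

    def __init__(self, maxlen: int = 30):
        self._frames = deque(maxlen=maxlen)
        self._lock = threading.Lock()

    def push(self, frame: np.ndarray) -> None:
        with self._lock:
            self._frames.append(frame)

    def latest(self) -> Optional[np.ndarray]:
        with self._lock:
            return self._frames[-1] if self._frames else None


memory = FrameMemory()


def on_frame(frame: np.ndarray) -> np.ndarray:
    """Called for every incoming webcam frame: store it, return it for display."""
    memory.push(frame)
    return frame
```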
### Available Tools
| Tool | Description | Output |
|------|-------------|---------|
| `caption` | Generate detailed image descriptions | Rich visual descriptions |
| `ocr` | Extract text from images | Extracted text content |
| `localize` | Detect and locate objects | Bounding boxes with labels |
| `qa` | Answer questions about images | Contextual answers |
| `time` | Get current timestamp | Current date and time |
| _More tools coming soon..._ | Additional capabilities in development | Various outputs |
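One plausible way to expose these tools to the agent is as function tools (the OpenAI Agents SDK is assumed here; `main.py` may register them differently):

```python
from datetime import datetime

from agents import Agent, Runner, function_tool  # OpenAI Agents SDK (assumed)


@function_tool
def time() -> str:
    """Return the current date and time."""
    return datetime.now().isoformat()


@function_tool
def caption() -> str:
    """Describe the most recent webcam frame."""
    # Placeholder: the real tool would send the latest frame to the
    # multimodal endpoint (MODEL_MLLM) and return its description.
    return "caption of the latest frame"


agent = Agent(
    name="Perceptual Copilot",
    instructions="Answer questions about the live webcam feed, calling tools as needed.",
    tools=[time, caption],
)

result = Runner.run_sync(agent, "What time is it?")
print(result.final_output)
```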
## 🚀 Quick Start
### Prerequisites
- Webcam access
### Installation
1. **Install dependencies**
```bash
pip install -r requirements.txt
```
2. **Set up environment variables** (see the configuration sketch after these steps)
```bash
export HF_TOKEN="your_huggingface_token"
export API_KEY="your_openai_api_key"
export END_LANG="your_llm_endpoint"
export END_TASK="your_task_endpoint"
export MODEL_AGENT="your_agent_model"
export MODEL_MLLM="your_multimodal_model"
export MODEL_LOC="your_localization_model"
```
3. **Launch the application**
```bash
python main.py
```
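The variables from step 2 configure the endpoints and model names. A rough sketch of how they might be read at startup (the client wiring below is an assumption, not copied from `main.py`):

```python
import os

from openai import OpenAI

# Values come from the exports in step 2.
API_KEY = os.environ["API_KEY"]
END_LANG = os.environ["END_LANG"]        # OpenAI-compatible LLM endpoint
MODEL_AGENT = os.environ["MODEL_AGENT"]  # model that drives the agent loop

# Point an OpenAI-compatible client at the configured endpoint (assumed setup).
client = OpenAI(base_url=END_LANG, api_key=API_KEY)

response = client.chat.completions.create(
    model=MODEL_AGENT,
    messages=[{"role": "user", "content": "Hello from Perceptual Copilot"}],
)
print(response.choices[0].message.content)
```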
## 💡 Usage Examples
### Basic Interaction
- **User**: "What do you see?"
- **Assistant**: *Generates a detailed caption of the current view*
### OCR Functionality
- **User**: "Read the text in this document"
- **Assistant**: *Extracts and returns all visible text*
### Object Detection
- **User**: "What objects are in front of me?"
- **Assistant**: *Identifies and localizes objects with bounding boxes*
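The boxes returned by `localize` can be drawn back onto the frame with Supervision. A short sketch (the detection format shown here is an assumption about what the localization model returns):

```python
import numpy as np
import supervision as sv

# Assume the localization model returned these boxes for the current frame
# (xyxy pixel coordinates plus labels); the real output format may differ.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
xyxy = np.array([[50, 60, 200, 220], [300, 100, 420, 300]])
labels = ["cup", "laptop"]

detections = sv.Detections(xyxy=xyxy, class_id=np.array([0, 1]))

# Draw bounding boxes, then overlay the class labels.
annotated = sv.BoxAnnotator().annotate(scene=frame.copy(), detections=detections)
annotated = sv.LabelAnnotator().annotate(
    scene=annotated, detections=detections, labels=labels
)
```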
## Acknowledgments
- Built with [Gradio](https://gradio.app/) for the interactive web interface
- Uses [Supervision](https://supervision.roboflow.com/) for frame annotation
- WebRTC integration via [FastRTC](https://github.com/gradio-app/gradio)