---
title: Perceptual Copilot
emoji: πŸ‘οΈ
colorFrom: yellow
colorTo: purple
sdk: gradio
sdk_version: 5.33.1
app_file: main.py
pinned: false
license: mit
---

## ✨ What is Perceptual Copilot?

Perceptual Copilot is a prototype that connects an OpenAI agent to visual tools so it can reason over a real-time video stream. It is an experimental platform that showcases both the promise and the current limitations of giving agents vision capabilities to understand and interact with live visual data.


### Architecture Overview

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚      Webcam     │───▢│      Memory     │◀──▢│      Gradio     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                β”‚
                                β–Ό
                       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                       β”‚      Agent      │◀──▢│      Tools      β”‚
                       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
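
In this design, the webcam feed continuously updates a shared frame memory, while the Gradio chat UI forwards user messages to the agent; the agent in turn calls visual tools that read the most recent frame from memory. Below is a minimal sketch of that loop; the function and variable names are illustrative, not this repo's actual API:

```python
# Illustrative sketch of the frame/agent loop above; names are
# assumptions, not the actual API of main.py.
from collections import deque

memory = deque(maxlen=30)  # rolling buffer of the most recent webcam frames

def on_frame(frame):
    """WebRTC callback: store each incoming frame; do no heavy work here."""
    memory.append(frame)

def on_user_message(text, agent):
    """Gradio chat callback: let the agent answer, giving its tools
    access to the latest frame rather than the raw video stream."""
    latest = memory[-1] if memory else None
    return agent.run(text, frame=latest)
```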

### Available Tools

| Tool | Description | Output |
|------|-------------|---------|
| `caption` | Generate detailed image descriptions | Rich visual descriptions |
| `ocr` | Extract text from images | Extracted text content |
| `localize` | Detect and locate objects | Bounding boxes with labels |
| `qa` | Answer questions about images | Contextual answers |
| `time` | Get current timestamp | Current date and time |
| _More tools coming soon..._ | Additional capabilities in development | Various outputs |
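
Each tool is ultimately just a Python function the agent can call by name. As an illustration (not this repo's actual code), the simplest tool, `time`, could be registered with the OpenAI Agents SDK's `function_tool` decorator; the repo's real wiring may differ:

```python
# Hypothetical tool registration, assuming the OpenAI Agents SDK
# (pip install openai-agents); shown for illustration only.
from datetime import datetime

from agents import Agent, function_tool

@function_tool
def time() -> str:
    """Get the current date and time."""
    return datetime.now().strftime("%Y-%m-%d %H:%M:%S")

agent = Agent(
    name="Perceptual Copilot",
    instructions="Answer questions about the user's webcam view.",
    tools=[time],
)
```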

## πŸš€ Quick Start

### Prerequisites

- Python 3 with `pip`
- Webcam access

### Installation

1. **Install dependencies**
   ```bash
   pip install -r requirements.txt
   ```

2. **Set up environment variables** (a sketch of how these might be read follows the steps below)
   ```bash
   export HF_TOKEN="your_huggingface_token"
   export API_KEY="your_openai_api_key"
   export END_LANG="your_llm_endpoint"
   export END_TASK="your_task_endpoint"
   export MODEL_AGENT="your_agent_model"
   export MODEL_MLLM="your_multimodal_model"
   export MODEL_LOC="your_localization_model"
   ```

3. **Launch the application**
   ```bash
   python main.py
   ```
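
All of the variables from step 2 are plain process-environment settings. Here is a sketch of what reading them in `main.py` might look like; the roles in the comments are inferred from the variable names, not taken from the actual source:

```python
# Assumed reading of the step-2 environment variables; the roles are
# inferred from the names, so treat the comments as assumptions.
import os

HF_TOKEN = os.environ["HF_TOKEN"]        # Hugging Face access token
API_KEY = os.environ["API_KEY"]          # key for the OpenAI-compatible API
END_LANG = os.environ["END_LANG"]        # endpoint serving the language model
END_TASK = os.environ["END_TASK"]        # endpoint serving the vision tasks
MODEL_AGENT = os.environ["MODEL_AGENT"]  # model that drives the agent loop
MODEL_MLLM = os.environ["MODEL_MLLM"]    # multimodal model for caption/ocr/qa
MODEL_LOC = os.environ["MODEL_LOC"]      # model used by the `localize` tool
```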

## πŸ’‘ Usage Examples

### Basic Interaction
- **User**: "What do you see?"
- **Assistant**: *Generates detailed caption of current view*

### OCR Functionality
- **User**: "Read the text in this document"
- **Assistant**: *Extracts and returns all visible text*

### Object Detection
- **User**: "What objects are in front of me?"
- **Assistant**: *Identifies and localizes objects with bounding boxes*
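
Under the hood, `localize` returns bounding boxes that get drawn onto the video frame. A hedged sketch of how such boxes could be rendered with [Supervision](https://supervision.roboflow.com/); the box values and format below are made up for illustration:

```python
# Illustration only: annotating a frame with boxes given as
# [x_min, y_min, x_max, y_max]; the values below are fabricated.
import numpy as np
import supervision as sv

frame = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a webcam frame

detections = sv.Detections(
    xyxy=np.array([[50, 60, 200, 220], [300, 100, 420, 300]], dtype=float)
)
annotated = sv.BoxAnnotator().annotate(scene=frame.copy(), detections=detections)
```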


## Acknowledgments

- Built with [Gradio](https://gradio.app/) for the interactive web interface
- Uses [Supervision](https://supervision.roboflow.com/) for frame annotation
- WebRTC integration via [FastRTC](https://github.com/gradio-app/fastrtc)