---
title: Perceptual Copilot
emoji: 👁️
colorFrom: yellow
colorTo: purple
sdk: gradio
sdk_version: 5.33.1
app_file: main.py
pinned: false
license: mit
---

## ✨ What is Perceptual Copilot?

Perceptual Copilot is a prototype that connects OpenAI agents to visual tools so they can work over a real-time webcam stream. It is an experimental platform for exploring both the potential and the current limitations of giving agents vision: the ability to understand and interact with live visual data.

## Architecture Overview

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│      Webcam     │───▶│      Memory     │◀──▶│      Gradio     │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                │
                                ▼
                       ┌─────────────────┐    ┌─────────────────┐
                       │      Agent      │◀──▶│      Tools      │
                       └─────────────────┘    └─────────────────┘
```
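
The webcam pushes frames into a shared memory buffer, and the Gradio chat reads from that same buffer whenever the agent needs to look at the scene. Below is a minimal sketch of that loop; the names (`FrameMemory`, `handle_frame`, `answer`) are illustrative, not the actual implementation in `main.py`.

```python
import threading
import numpy as np

class FrameMemory:
    """Thread-safe holder for the most recent webcam frame."""
    def __init__(self):
        self._lock = threading.Lock()
        self._frame: np.ndarray | None = None

    def update(self, frame: np.ndarray) -> None:
        with self._lock:
            self._frame = frame

    def latest(self) -> np.ndarray | None:
        with self._lock:
            return None if self._frame is None else self._frame.copy()

memory = FrameMemory()

def handle_frame(frame: np.ndarray) -> np.ndarray:
    """Called by the WebRTC/Gradio stream for every incoming frame."""
    memory.update(frame)
    return frame  # echoed back (optionally annotated) to the browser

def answer(user_message: str) -> str:
    """Called by the chat UI; the agent decides which visual tool to run."""
    frame = memory.latest()
    if frame is None:
        return "No frame received yet - is the webcam streaming?"
    # ... hand `frame` and `user_message` to the agent and its tools here ...
    return "agent response"
```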

## Available Tools

| Tool | Description | Output |
|------|-------------|--------|
| `caption` | Generate detailed image descriptions | Rich visual descriptions |
| `ocr` | Extract text from images | Extracted text content |
| `localize` | Detect and locate objects | Bounding boxes with labels |
| `qa` | Answer questions about images | Contextual answers |
| `time` | Get current timestamp | Current date and time |
| More tools coming soon... | Additional capabilities in development | Various outputs |
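
Each tool is exposed to the agent as a callable function. A sketch of how a tool such as `ocr` might be registered is below, assuming the OpenAI Agents SDK (`openai-agents`) as the orchestration layer; `run_ocr_model` and the shared `memory` buffer are illustrative, not names from this repository.

```python
from agents import Agent, function_tool  # openai-agents SDK (assumed here)

@function_tool
def ocr() -> str:
    """Extract all visible text from the current webcam frame."""
    frame = memory.latest()        # shared frame buffer from the sketch above
    return run_ocr_model(frame)    # hypothetical call to the OCR endpoint

copilot = Agent(
    name="Perceptual Copilot",
    instructions="You can see through the user's webcam via the provided tools.",
    tools=[ocr],                   # caption, localize, qa, time are added the same way
)
```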

## 🚀 Quick Start

### Prerequisites

- Python environment with `pip` available
- Webcam access
- API keys and model endpoints for the agent and vision models (configured in step 2 below)

### Installation

1. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

2. Set up environment variables (a sketch of how these might be consumed follows these steps)

   ```bash
   export HF_TOKEN="your_huggingface_token"
   export API_KEY="your_openai_api_key"
   export END_LANG="your_llm_endpoint"
   export END_TASK="your_task_endpoint"
   export MODEL_AGENT="your_agent_model"
   export MODEL_MLLM="your_multimodal_model"
   export MODEL_LOC="your_localization_model"
   ```

3. Launch the application

   ```bash
   python main.py
   ```
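
How these variables are wired up is defined in `main.py`; since two endpoints and three model names are configured, a plausible reading looks like the sketch below, assuming the endpoints are OpenAI-compatible. The client/variable names on the left are illustrative only.

```python
import os
from openai import OpenAI  # assumes the endpoints speak the OpenAI API

# Hypothetical configuration wiring; the actual app may differ in detail.
agent_client = OpenAI(api_key=os.environ["API_KEY"], base_url=os.environ["END_LANG"])
task_client = OpenAI(api_key=os.environ["API_KEY"], base_url=os.environ["END_TASK"])

MODEL_AGENT = os.environ["MODEL_AGENT"]  # drives tool selection / chat
MODEL_MLLM = os.environ["MODEL_MLLM"]    # caption / qa / ocr
MODEL_LOC = os.environ["MODEL_LOC"]      # object localization
```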

## 💡 Usage Examples

### Basic Interaction

- **User:** "What do you see?"
- **Assistant:** Generates a detailed caption of the current view

### OCR Functionality

- **User:** "Read the text in this document"
- **Assistant:** Extracts and returns all visible text

### Object Detection

- **User:** "What objects are in front of me?"
- **Assistant:** Identifies and localizes objects with bounding boxes
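
When the `localize` tool returns boxes, the frame shown back in the browser can be annotated with them. Below is a sketch using Supervision (which the app uses for frame annotation); the `annotate` helper and the example box data are made up for illustration.

```python
import numpy as np
import supervision as sv

def annotate(frame: np.ndarray, boxes_xyxy: list[list[float]], labels: list[str]) -> np.ndarray:
    """Draw the localize tool's boxes and labels onto a frame."""
    detections = sv.Detections(
        xyxy=np.array(boxes_xyxy, dtype=float),
        class_id=np.arange(len(labels)),
    )
    frame = sv.BoxAnnotator().annotate(scene=frame.copy(), detections=detections)
    return sv.LabelAnnotator().annotate(scene=frame, detections=detections, labels=labels)

# Example: one detection on a blank 640x480 frame
annotated = annotate(np.zeros((480, 640, 3), dtype=np.uint8), [[100, 120, 300, 360]], ["cup"])
```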

## Acknowledgments

- Built with Gradio for the interactive web interface
- Uses Supervision for frame annotation
- WebRTC integration via FastRTC