Update vision_agent.py
vision_agent.py +96 -0
vision_agent.py
CHANGED
@@ -50,6 +50,13 @@ def save_screenshot(step_log: ActionStep, agent: CodeAgent) -> None:
 from smolagents import CodeAgent, OpenAIServerModel, DuckDuckGoSearchTool
 model = OpenAIServerModel(model_id="gpt-4o")
 
+############# OpenAIServerModel: Connects to any service that offers an OpenAI API interface.
+#model = OpenAIServerModel(
+# model_id="gpt-4o",
+# api_base="https://api.openai.com/v1",
+# api_key=os.environ["OPENAI_API_KEY"],
+#)
+
 agent = CodeAgent(
     tools=[DuckDuckGoSearchTool(), go_back, close_popups, search_item_ctrl_f],
     model=model,
@@ -58,3 +65,92 @@ agent = CodeAgent(
     max_steps=20,
     verbosity_level=2,
 )
+
+prompt_analysis="""Extract information from an image by analyzing and interpreting its visual elements to provide a detailed description or identify specific data.
+
+# Steps
+
+1. **Analyze the Image**: Identify key elements such as objects, text, colors, and any notable features or contexts.
+2. **Interpret Visual Elements**: Determine the significance or purpose of the elements identified. Consider relationships between objects, text recognition if applicable, and any context clues.
+3. **Synthesize Information**: Bring together the interpreted elements to form a coherent understanding or summary.
+4. **Verify Details**: Ensure accuracy by cross-referencing identifiable text or icons with known data or references, if relevant.
+
+# Output Format
+
+The output should be a detailed text description or a structured data response (such as a JSON) containing the identified elements and their interpretations. Each identified element should be clearly described along with its context or significance.
+
+# Example
+
+**Input**: (An image with a storefront displaying 'Bakery' sign and a variety of bread on display.)
+
+**Output**:
+
+Description:
+- **Storefront**: A bakery
+- **Signage Text**: "Bakery"
+- **Products**: Various types of bread
+
+JSON Example:
+```json
+{
+  "storeType": "Bakery",
+  "signText": "Bakery",
+  "products": ["bread", "baguette", "pastry"]
+}
+```
+
+# Notes
+
+- Consider optical character recognition (OCR) for text extraction.
+- Evaluate colors and objects for brand or function associations.
+- Provide a holistic overview rather than disjointed elements when possible."""
+
+
+prompt_deep_analysis="""Extract information from a video by analyzing and interpreting its audiovisual elements to provide a detailed description or identify specific data.
+
+You will have specific information to retrieve from the video. Adapt analysis steps to cater for motion, audio, and potential scene changes unique to video content.
+
+# Steps
+
+1. **Parse the Video**: Break down the video into manageable segments, focusing on scenes or timeframes relevant to the target information.
+2. **Identify Key Elements**: Within these segments, identify crucial visual and audio elements such as objects, text, dialogue, sounds, and any notable features or contexts.
+3. **Interpret Audiovisual Elements**: Determine the significance or purpose of the identified elements. Consider relationships between objects, text recognition, audio cues, and any context provided by the video.
+4. **Synthesize Information**: Integrate the interpreted elements to form a coherent understanding or summary.
+5. **Verify Details**: Ensure accuracy by cross-referencing identifiable text, icons, or audio snippets with known data or references, if relevant.
+
+# Output Format
+
+The output should be a detailed text description or a structured data response (such as a JSON) containing the identified elements and their interpretations. Each element should be described along with its context or significance within the video.
+
+# Examples
+
+**Input**: (A video of a cooking show with captions and background music.)
+
+**Output**:
+
+Description:
+- **Scene**: Cooking demonstration of a pasta dish
+- **Captions**: Step-by-step instructions
+- **Audio**: Background music, presenter dialogue
+- **Visual Elements**: Ingredients and cooking utensils
+
+JSON Example:
+```json
+{
+  "sceneType": "Cooking Demonstration",
+  "captions": ["Boil water", "Add pasta"],
+  "audio": {
+    "backgroundMusic": "light jazz",
+    "dialogue": ["Today we are making pasta..."]
+  },
+  "visualElements": ["pasta", "saucepan", "spoon"]
+}
+```
+
+# Notes
+
+- Consider using video timestamp and scene identification for accurate element referencing.
+- Evaluate both visual and audio elements for context comprehension.
+- Ensure that video dynamics like scene changes or motion are accounted for in the synthesis of information."""
+
+
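
Note on consuming the output: both prompts allow the model to answer with a fenced JSON block. The sketch below is not part of this commit. It shows one way calling code might pull that block out of a response, assuming the model fences its JSON the way the prompt examples do. The helper name `extract_json_block` and the sample response are hypothetical and introduced here for illustration only.

```python
import json
import re
from typing import Any, Optional

# Build the fence marker programmatically so this snippet can itself
# live inside a fenced block without confusing the renderer.
FENCE = "`" * 3

# Matches a fenced block tagged "json"; DOTALL lets "." span newlines.
_JSON_BLOCK = re.compile(FENCE + r"json\s*(.*?)" + FENCE, re.DOTALL)


def extract_json_block(response_text: str) -> Optional[Any]:
    """Return the first parseable JSON block in the text, or None.

    Hypothetical helper: assumes the response follows the
    'JSON Example' layout shown in the prompts.
    """
    for match in _JSON_BLOCK.finditer(response_text):
        try:
            return json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # malformed block: try the next one, if any
    return None


# Hypothetical response mirroring the bakery example from prompt_analysis.
sample_response = "\n".join([
    "Description:",
    "- **Storefront**: A bakery",
    "",
    "JSON Example:",
    FENCE + "json",
    "{",
    '  "storeType": "Bakery",',
    '  "signText": "Bakery",',
    '  "products": ["bread", "baguette", "pastry"]',
    "}",
    FENCE,
])

parsed = extract_json_block(sample_response)
print(parsed)
```

Returning `None` instead of raising keeps the text-only answer path usable, since the prompts permit either a prose description or JSON.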