RCaz committed
Commit 0f4fb47 · verified · 1 Parent(s): 8a93790

Update vision_agent.py

Files changed (1):
  1. vision_agent.py +96 -0
vision_agent.py CHANGED
@@ -50,6 +50,13 @@ def save_screenshot(step_log: ActionStep, agent: CodeAgent) -> None:
from smolagents import CodeAgent, OpenAIServerModel, DuckDuckGoSearchTool
model = OpenAIServerModel(model_id="gpt-4o")

+ ############# OpenAIServerModel: connects to any service that offers an OpenAI API interface.
+ # model = OpenAIServerModel(
+ #     model_id="gpt-4o",
+ #     api_base="https://api.openai.com/v1",
+ #     api_key=os.environ["OPENAI_API_KEY"],
+ # )
+
agent = CodeAgent(
    tools=[DuckDuckGoSearchTool(), go_back, close_popups, search_item_ctrl_f],
    model=model,
@@ -58,3 +65,92 @@ agent = CodeAgent(
    max_steps=20,
    verbosity_level=2,
)
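The commented-out alternative above targets any OpenAI-compatible endpoint. A minimal sketch of resolving those keyword arguments from the environment (the `resolve_model_kwargs` helper and the `OPENAI_API_BASE` override are illustrative assumptions, not part of smolagents; note the commented block would also need `import os`):

```python
import os

def resolve_model_kwargs(model_id: str = "gpt-4o") -> dict:
    """Build keyword arguments for an OpenAI-compatible model client.

    Hypothetical helper: any service exposing an OpenAI-style API works
    if api_base points at it (e.g. a local inference server).
    """
    return {
        "model_id": model_id,
        # OPENAI_API_BASE is an assumed override; fall back to the official endpoint.
        "api_base": os.environ.get("OPENAI_API_BASE", "https://api.openai.com/v1"),
        # Use an empty key rather than raising, so the sketch runs without credentials.
        "api_key": os.environ.get("OPENAI_API_KEY", ""),
    }

kwargs = resolve_model_kwargs()
```

A self-hosted OpenAI-compatible server would then only need `OPENAI_API_BASE` exported before calling `OpenAIServerModel(**kwargs)`.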
+
+ prompt_analysis = """Extract information from an image by analyzing and interpreting its visual elements to provide a detailed description or identify specific data.
+
+ # Steps
+
+ 1. **Analyze the Image**: Identify key elements such as objects, text, colors, and any notable features or contexts.
+ 2. **Interpret Visual Elements**: Determine the significance or purpose of the identified elements. Consider relationships between objects, text recognition if applicable, and any context clues.
+ 3. **Synthesize Information**: Bring together the interpreted elements to form a coherent understanding or summary.
+ 4. **Verify Details**: Ensure accuracy by cross-referencing identifiable text or icons with known data or references, if relevant.
+
+ # Output Format
+
+ The output should be a detailed text description or a structured data response (such as JSON) containing the identified elements and their interpretations. Each identified element should be clearly described along with its context or significance.
+
+ # Example
+
+ **Input**: (An image of a storefront displaying a 'Bakery' sign and a variety of bread on display.)
+
+ **Output**:
+
+ Description:
+ - **Storefront**: A bakery
+ - **Signage Text**: "Bakery"
+ - **Products**: Various types of bread
+
+ JSON Example:
+ ```json
+ {
+   "storeType": "Bakery",
+   "signText": "Bakery",
+   "products": ["bread", "baguette", "pastry"]
+ }
+ ```
+
+ # Notes
+
+ - Consider optical character recognition (OCR) for text extraction.
+ - Evaluate colors and objects for brand or function associations.
+ - Provide a holistic overview rather than disjointed elements when possible."""
+
+
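Since the prompt asks for a structured JSON response, the caller can validate the reply before acting on it. A small sketch (the `parse_analysis` helper and its required-key set are assumptions drawn from the prompt's example, not an existing API):

```python
import json

# Keys promised by the prompt's JSON example (assumed contract, not enforced by the model).
REQUIRED_KEYS = {"storeType", "signText", "products"}

def parse_analysis(raw: str) -> dict:
    """Parse the model's JSON reply and fail early if expected fields are missing."""
    data = json.loads(raw)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"analysis reply missing keys: {sorted(missing)}")
    return data

sample = '{"storeType": "Bakery", "signText": "Bakery", "products": ["bread"]}'
result = parse_analysis(sample)
```

Raising on missing keys keeps a malformed model reply from propagating silently into later agent steps.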
+ prompt_deep_analysis = """Extract information from a video by analyzing and interpreting its audiovisual elements to provide a detailed description or identify specific data.
+
+ You will have specific information to retrieve from the video. Adapt the analysis steps to account for motion, audio, and potential scene changes unique to video content.
+
+ # Steps
+
+ 1. **Parse the Video**: Break the video into manageable segments, focusing on scenes or timeframes relevant to the target information.
+ 2. **Identify Key Elements**: Within these segments, identify crucial visual and audio elements such as objects, text, dialogue, sounds, and any notable features or contexts.
+ 3. **Interpret Audiovisual Elements**: Determine the significance or purpose of the identified elements. Consider relationships between objects, text recognition, audio cues, and any context provided by the video.
+ 4. **Synthesize Information**: Integrate the interpreted elements to form a coherent understanding or summary.
+ 5. **Verify Details**: Ensure accuracy by cross-referencing identifiable text, icons, or audio snippets with known data or references, if relevant.
+
+ # Output Format
+
+ The output should be a detailed text description or a structured data response (such as JSON) containing the identified elements and their interpretations. Each element should be described along with its context or significance within the video.
+
+ # Example
+
+ **Input**: (A video of a cooking show with captions and background music.)
+
+ **Output**:
+
+ Description:
+ - **Scene**: Cooking demonstration of a pasta dish
+ - **Captions**: Step-by-step instructions
+ - **Audio**: Background music, presenter dialogue
+ - **Visual Elements**: Ingredients and cooking utensils
+
+ JSON Example:
+ ```json
+ {
+   "sceneType": "Cooking Demonstration",
+   "captions": ["Boil water", "Add pasta"],
+   "audio": {
+     "backgroundMusic": "light jazz",
+     "dialogue": ["Today we are making pasta..."]
+   },
+   "visualElements": ["pasta", "saucepan", "spoon"]
+ }
+ ```
+
+ # Notes
+
+ - Consider using video timestamps and scene identification for accurate element referencing.
+ - Evaluate both visual and audio elements for context comprehension.
+ - Ensure that video dynamics like scene changes or motion are accounted for when synthesizing information."""
+
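Because the example in `prompt_deep_analysis` wraps its JSON in a ```json fence, the model's reply may arrive fenced. A sketch of stripping the fence and checking the nested `audio` object (the helper names and key set are illustrative assumptions taken from the prompt's example):

```python
import json

def strip_fence(text: str) -> str:
    """Drop a surrounding ```json ... ``` fence if one is present."""
    text = text.strip()
    if text.startswith("```"):
        lines = text.splitlines()
        text = "\n".join(lines[1:-1])  # remove opening and closing fence lines
    return text

def parse_video_analysis(raw: str) -> dict:
    """Parse a (possibly fenced) video-analysis reply and verify its shape."""
    data = json.loads(strip_fence(raw))
    for key in ("sceneType", "captions", "audio", "visualElements"):
        if key not in data:
            raise ValueError(f"video reply missing key: {key}")
    # The nested audio object should carry both fields the prompt's example shows.
    if not {"backgroundMusic", "dialogue"} <= data["audio"].keys():
        raise ValueError("audio object incomplete")
    return data

fenced = ("```json\n"
          '{"sceneType": "Cooking Demonstration", "captions": ["Boil water"], '
          '"audio": {"backgroundMusic": "light jazz", "dialogue": ["Today we are making pasta..."]}, '
          '"visualElements": ["pasta"]}'
          "\n```")
video = parse_video_analysis(fenced)
```

Stripping the fence first means the same parser handles both bare JSON and fenced replies.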