Smolagent_to_GAIA_benchmark

RCaz committed on Jun 2

Commit 0f4fb47 (verified) · 1 parent: 8a93790

Update vision_agent.py

Files changed (1)

vision_agent.py +96 -0

@@ -50,6 +50,13 @@ def save_screenshot(step_log: ActionStep, agent: CodeAgent) -> None:
 from smolagents import CodeAgent, OpenAIServerModel, DuckDuckGoSearchTool
 model = OpenAIServerModel(model_id="gpt-4o")
+############# OpenAIServerModel: Connects to any service that offers an OpenAI API interface.
+#model = OpenAIServerModel(
+#    model_id="gpt-4o",
+#    api_base="https://api.openai.com/v1",
+#    api_key=os.environ["OPENAI_API_KEY"],
+#)
 agent = CodeAgent(
     tools=[DuckDuckGoSearchTool(), go_back, close_popups, search_item_ctrl_f],
     model=model,
@@ -58,3 +65,92 @@ agent = CodeAgent(
     max_steps=20,
     verbosity_level=2,
 )
+prompt_analysis="""Extract information from an image by analyzing and interpreting its visual elements to provide a detailed description or identify specific data.
+# Steps
+1. **Analyze the Image**: Identify key elements such as objects, text, colors, and any notable features or contexts.
+2. **Interpret Visual Elements**: Determine the significance or purpose of the elements identified. Consider relationships between objects, text recognition if applicable, and any context clues.
+3. **Synthesize Information**: Bring together the interpreted elements to form a coherent understanding or summary.
+4. **Verify Details**: Ensure accuracy by cross-referencing identifiable text or icons with known data or references, if relevant.
+# Output Format
+The output should be a detailed text description or a structured data response (such as JSON) containing the identified elements and their interpretations. Each identified element should be clearly described along with its context or significance.
+# Example
+**Input**: (An image with a storefront displaying 'Bakery' sign and a variety of bread on display.)
+**Output**:
+Description:
+- **Storefront**: A bakery
+- **Signage Text**: "Bakery"
+- **Products**: Various types of bread
+JSON Example:
+```json
+{
+  "storeType": "Bakery",
+  "signText": "Bakery",
+  "products": ["bread", "baguette", "pastry"]
+}
+```
+# Notes
+- Consider optical character recognition (OCR) for text extraction.
+- Evaluate colors and objects for brand or function associations.
+- Provide a holistic overview rather than disjointed elements when possible."""
+prompt_deep_analysis="""Extract information from a video by analyzing and interpreting its audiovisual elements to provide a detailed description or identify specific data.
+You will have specific information to retrieve from the video. Adapt analysis steps to cater for motion, audio, and potential scene changes unique to video content.
+# Steps
+1. **Parse the Video**: Break down the video into manageable segments, focusing on scenes or timeframes relevant to the target information.
+2. **Identify Key Elements**: Within these segments, identify crucial visual and audio elements such as objects, text, dialogue, sounds, and any notable features or contexts.
+3. **Interpret Audiovisual Elements**: Determine the significance or purpose of the identified elements. Consider relationships between objects, text recognition, audio cues, and any context provided by the video.
+4. **Synthesize Information**: Integrate the interpreted elements to form a coherent understanding or summary.
+5. **Verify Details**: Ensure accuracy by cross-referencing identifiable text, icons, or audio snippets with known data or references, if relevant.
+# Output Format
+The output should be a detailed text description or a structured data response (such as JSON) containing the identified elements and their interpretations. Each element should be described along with its context or significance within the video.
+# Examples
+**Input**: (A video of a cooking show with captions and background music.)
+**Output**:
+Description:
+- **Scene**: Cooking demonstration of a pasta dish
+- **Captions**: Step-by-step instructions
+- **Audio**: Background music, presenter dialogue
+- **Visual Elements**: Ingredients and cooking utensils
+JSON Example:
+```json
+{
+  "sceneType": "Cooking Demonstration",
+  "captions": ["Boil water", "Add pasta"],
+  "audio": {
+    "backgroundMusic": "light jazz",
+    "dialogue": ["Today we are making pasta..."]
+  },
+  "visualElements": ["pasta", "saucepan", "spoon"]
+}
+```
+# Notes
+- Consider using video timestamp and scene identification for accurate element referencing.
+- Evaluate both visual and audio elements for context comprehension.
+- Ensure that video dynamics like scene changes or motion are accounted for in the synthesis of information."""
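
Both prompts ask for either free text or a structured JSON response, so downstream code probably needs to cope with either. The sketch below is not part of this commit; it is one possible way to pull a fenced JSON object out of a model reply, using only the standard library. The helper name `extract_json_block` and the sample reply are hypothetical, and the regex assumes the model wraps its JSON in a triple-backtick fence as in the prompt examples.

```python
import json
import re
from typing import Optional

# Matches a triple-backtick fence (optionally tagged "json") that wraps a JSON object.
# The fence is written as `{3} so this snippet contains no literal triple backticks.
_JSON_FENCE = re.compile(r"`{3}(?:json)?\s*(\{.*?\})\s*`{3}", re.DOTALL)


def extract_json_block(response: str) -> Optional[dict]:
    """Return the first fenced JSON object in `response`, or None if absent or invalid."""
    match = _JSON_FENCE.search(response)
    if match is None:
        return None  # the model answered with a plain text description
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None  # fence found, but its content is not valid JSON


# Hypothetical reply shaped like the bakery example in prompt_analysis.
fence = "`" * 3
sample = (
    "Description:\n- **Storefront**: A bakery\n"
    f"JSON Example:\n{fence}json\n"
    '{"storeType": "Bakery", "signText": "Bakery", "products": ["bread"]}\n'
    f"{fence}\n"
)

parsed = extract_json_block(sample)
print(parsed["storeType"])  # prints "Bakery"
```

Returning `None` instead of raising keeps the caller free to fall back to the plain text description when the model does not emit JSON.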