Refactor benchmark functions to use updated client.predict API and improve prompt clarity 0581f43 Kunal Pai committed on May 15
QOL updates and refactoring. Also fixed the tool/agent budgeting 6900003 helloparthshah Kunal Pai harshil-21 committed on May 4
Refactor get_last_assistant_content function to improve response handling and support various response formats 81fafc1 Kunal Pai committed on May 3
Refactor benchmarking script to implement HLE dataset performance evaluation and improve response handling aa7e221 Kunal Pai committed on May 3