PA Bench: Evaluating Frontier Models on Multi-Tab PA Tasks

(vibrantlabs.com)

12 points | by shahules 7 hours ago

3 comments

abhijithneil 5 hours ago

Is there a way to automate computer use with multiple computer-use agents from different providers, plus some sort of routing setup so that the best course of action can be chosen without hitting failures (e.g., permission issues in OpenAI could be rerouted to Gemini)?

- shahules 4 hours ago
  
  There are a few agents, such as browser-use and Skyvern, that may provide this capability.
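
The routing idea in the question above could be sketched roughly as follows. This is only an illustration under stated assumptions, not how browser-use, Skyvern, or any provider SDK works: `Provider`, `ProviderError`, `route_with_fallback`, and the two stub agents are all hypothetical names, and each real agent would need to be wrapped in an adapter that raises `ProviderError` on a recoverable failure such as a permission refusal.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


class ProviderError(Exception):
    """Hypothetical error an adapter raises when its agent cannot finish a task."""


@dataclass
class Provider:
    name: str
    # Hypothetical adapter: takes a task description, returns a result string.
    run: Callable[[str], str]


def route_with_fallback(
    task: str, providers: Sequence[Provider]
) -> Tuple[str, str, List[str]]:
    """Try providers in priority order and return the first successful result.

    Returns (provider name, result, list of earlier failure messages).
    """
    failures: List[str] = []
    for provider in providers:
        try:
            return provider.name, provider.run(task), failures
        except ProviderError as exc:
            # Record the failure and fall through to the next provider.
            failures.append(f"{provider.name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(failures))


# Stub agents standing in for real computer-use agents (assumptions only).
def denied_agent(task: str) -> str:
    raise ProviderError("permission denied")


def echo_agent(task: str) -> str:
    return f"done: {task}"


if __name__ == "__main__":
    chain = [Provider("first", denied_agent), Provider("second", echo_agent)]
    print(route_with_fallback("open inbox", chain))
```

A fixed priority order is the simplest policy; choosing the "best" provider per task would need some scoring step in place of the plain loop.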
shahules 5 hours ago

Founder of Vibrant Labs here. We’re working on automating the synthesis of high-quality evals and RL data for LLM agents.
Some of the things we’re exploring:
1. Automated task and verifier generation
2. Synthesizing coherent worlds for evaluating and training agents
3. Continual learning setups for long-horizon agents
Would love to talk with anyone who's interested in learning more!