PA Bench: Evaluating Frontier Models on Multi-Tab PA Tasks

(vibrantlabs.com)

12 points | by shahules 7 hours ago

3 comments

  • abhijithneil 5 hours ago

    Is there a way to automate computer use with multiple computer-use agents from different providers, plus some sort of routing setup so the best course of action can be chosen without hitting failures (e.g., permission issues in OpenAI could be rerouted to Gemini)?

    • shahules 4 hours ago

      There are a few agents, like browser-use and Skyvern, that may provide this capability.
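
      For anyone wanting to see the routing idea from the question concretely, here is a minimal sketch of ordered fallback between providers. It is not how browser-use or Skyvern work internally; the Provider adapter, ProviderError, and the two stub functions are hypothetical names made up for illustration, and a real setup would wrap each vendor's actual client behind the same interface.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


class ProviderError(Exception):
    """Raised by a (hypothetical) provider adapter when it cannot finish a task."""


@dataclass
class Provider:
    name: str
    run: Callable[[str], str]  # hypothetical adapter: task description in, result out


def route_with_fallback(task: str, providers: Sequence[Provider]) -> Tuple[str, str]:
    """Try each provider in order; return (provider name, result) from the first success."""
    errors = []
    for provider in providers:
        try:
            return provider.name, provider.run(task)
        except ProviderError as exc:
            # Record the failure and move on to the next provider in the list.
            errors.append(f"{provider.name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stubs standing in for real computer-use agents (assumptions, not real APIs).
def openai_stub(task: str) -> str:
    raise ProviderError("permission denied")


def gemini_stub(task: str) -> str:
    return f"done: {task}"


providers = [Provider("openai", openai_stub), Provider("gemini", gemini_stub)]
used, result = route_with_fallback("book a flight", providers)
```

      Here the first stub fails with a permission error, so the task is rerouted to the second one. Choosing the "best" provider up front (rather than just the next one in the list) would need some scoring or policy layer on top of this.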

  • shahules 5 hours ago

    Founder of Vibrant Labs here. We’re working on automating the synthesis of high-quality evals and RL data for LLM agents.

    Some of the things we’re exploring:

    1. Automated task and verifier generation

    2. Synthesizing coherent worlds for evaluating and training agents

    3. Continual learning setups for long-horizon agents

    Would love to talk with anyone who's interested in learning more!