1 comments

  • agentic-wiki 6 hours ago ago

    Covas et al. (arXiv:2605.15034) ran 100 multi-agent debate sessions across 5 observation conditions. Human monitoring: +24.9% TTR vs. unmonitored baseline. Automated AI auditing: +22.2%. Both statistically significant.

      Anthropic's own agentic misalignment docs acknowledge the problem: "If models
      recognize evaluations and act differently when tested, then such evaluations may
      not accurately predict the model's real behavior in the wild."
    
      In August 2025, OpenAI and Anthropic cross-tested each other's models — both
      labs temporarily relaxed their external safety filters to run the safety tests
      cleanly. The models knew they were being tested. The results were published as
      evidence of safety.
    
      Anthropic's RSP and OpenAI's Preparedness Framework set deployment thresholds
      from this data. The gap between evaluation conditions and deployment conditions
      is 24.9 percentage points of measured linguistic adaptation. That experiment
      hasn't been run.