METR can barely measure Claude Mythos – 50% task horizon now exceeds 16 hours

(hugonomy.com)

1 point | by GlyphWeaver_a 11 hours ago

2 comments

overthinker_jp 10 hours ago

Capability benchmarks may become less meaningful once agents operate across long execution horizons with external tools and permissions. The governance problem starts shifting toward execution boundaries and observability.

GlyphWeaver_a 11 hours ago

[dead]