I used autoresearch to improve my AGENTS.md, measured against real tasks

(stet.sh)

8 points | by bisonbear 10 hours ago

6 comments

fuckinpuppers 7 hours ago

I had a blast having all the major models figure out the optimal strategy for themselves inside Cursor, using cursorrules, AGENTS.md, .cursor/rules/ .mdc files, or whatever, and learned some interesting things. For example, a model won't reliably follow every instruction even when it's told to.
Seems like the progressive disclosure approach is the best for context efficiency; I wound up with a somewhat tight generic AGENTS.md, and the .cursor/rules individual files with glob matching for file names. Cursor honored those well.
I must have spent a couple hundred on the company dime having the models rephrase or rewrite instructions, change where they lived, and decide what made sense as a skill vs. a rule, all while trying to keep things as portable as possible. At this point the Cursor-specific files would need to be ported if we moved to a different agent or framework, but the content should be pretty solid.
It was an interesting (and productive) exploration for me.
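
A minimal sketch of the glob-scoped idea described above, in Python. This is not how Cursor implements rule matching; the rule table, file names, and the `rules_for` helper are all hypothetical and only illustrate why scoped rules save context: a rule file is loaded only when the edited path matches its pattern.

```python
from fnmatch import fnmatchcase

# Hypothetical mapping of glob pattern -> rule file (names are made up).
SCOPED_RULES = {
    "*.py": ".cursor/rules/python-style",
    "*.test.ts": ".cursor/rules/testing",
    "migrations/*.sql": ".cursor/rules/migrations",
}

# The small, generic file that is always in context.
ALWAYS_LOADED = ["AGENTS.md"]


def rules_for(path: str) -> list[str]:
    """Return the instruction files relevant to one edited file."""
    matched = [
        rule
        for pattern, rule in SCOPED_RULES.items()
        if fnmatchcase(path, pattern)  # note: "*" here also matches "/"
    ]
    return ALWAYS_LOADED + matched
```

With this table, editing a Markdown file pulls in only `AGENTS.md`, while editing a Python file adds the one Python rule file on top of it.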

- bisonbear 2 hours ago
  
  > Seems like the progressive disclosure approach is the best for context efficiency; I wound up with a somewhat tight generic AGENTS.md, and the .cursor/rules individual files with glob matching for file names. Cursor honored those well.
  This is also generally where I've landed - keep the AGENTS.md super light, and link out to docs as needed. Same idea with skills as well. Basically, preserve the context window at all costs.
  The part I'm curious about is, when we're making the sorts of behavior changes you're describing on shared repos, how do we actually measure and quantify impact? It's one thing to tell the team that the agent should perform better, and it's another to say that you made the agent 5% better across a variety of tasks for every dev in the repo.
joshka 7 hours ago

If you look at the 95% CI on https://marginlab.ai/trackers/codex/ with N=50, it's still pretty huge (+/- 13-14% usually). I suspect it would be difficult to get a reasonable measure that numerically assesses whether an AGENTS.md is good. What you can observe, though, is whether the model paid attention to certain rules while editing, i.e., did the behavior you're steering away from or toward take place?
The hardest thing, I think, is judging whether your AGENTS.md is still good after each model release. OpenAI does release prompting guidance to help with this, however (and has added a skill to apply it to your prompts, IIRC).
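
As a rough check on the interval width mentioned above, here is a sketch using the normal-approximation (Wald) interval for a pass rate. How the tracker actually computes its interval is not stated here, so the formula choice and the helper names `ci_half_width` and `tasks_needed` are assumptions.

```python
import math


def ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of a Wald interval for a pass rate p measured on n tasks."""
    return z * math.sqrt(p * (1.0 - p) / n)


def tasks_needed(p: float, half_width: float, z: float = 1.96) -> int:
    """Smallest n whose Wald half-width is at most the requested value."""
    return math.ceil(z * z * p * (1.0 - p) / half_width**2)


# Worst case p = 0.5 with 50 tasks: about 0.139, i.e. roughly +/- 14 points.
print(round(ci_half_width(0.5, 50), 3))
# Tasks needed to shrink that to +/- 5 points at p = 0.5.
print(tasks_needed(0.5, 0.05))
```

Under these assumptions, N=50 gives a half-width near 14 points, which lines up with the quoted range, and getting down to 5 points takes several hundred tasks.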

- bisonbear 2 hours ago
  
  Yes, agreed that low n makes overclaiming a real risk with this sort of optimization loop. Low-n results can be useful directionally, but you can't claim superiority without expanding the dataset. If I were running this for a shared repo where improving AGENTS.md had real consequences and value, instead of just as an experiment, I would expand n several-fold for both training and holdout, depending on the expected variation across tasks.
  I'm also noticing similar patterns with needing to update AGENTS.md / skills per model release. E.g., from Opus 4.6 -> 4.7 the model became much more instruction-adherent, so instructions written for the prior model generation might cause unexpected behavior in the new one. I'm also convinced that an optimal AGENTS.md for Codex is not the same file as an optimal CLAUDE.md for Claude - the model personalities and behaviors are so different that we probably need to tune the instructions differently as well.
jauntywundrkind 6 hours ago

The fine tuning where we run tests/experiments again and again and again on our prompts, our set-ups: really looking forward to when we can start to compare our amalgamated rigs and harnesses and prompts, all these systems. We are guided by intuition, a desire for structure & clarity & direction we think we add. But we lack common tools to assess and compare.
And even when we do compare, the thermal values, the entropy of our systems: that alone can lead us down very different paths. Even when all the rigging is controlled. (Which implies we need multiple experiments to compare against.)
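
One way to picture the "multiple experiments" point above: repeat each setup several times and ask whether the gap between two setups is larger than the run-to-run noise. This is a sketch with hypothetical helper names, using a simple standard-error comparison rather than any particular benchmark's method; the threshold `k` is an arbitrary assumption.

```python
import math
import statistics


def summarize(pass_rates: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation over repeated runs of one setup."""
    return statistics.mean(pass_rates), statistics.stdev(pass_rates)


def gap_exceeds_noise(a: list[float], b: list[float], k: float = 2.0) -> bool:
    """True if the mean gap between setups a and b is more than k standard errors."""
    mean_a, sd_a = summarize(a)
    mean_b, sd_b = summarize(b)
    # Standard error of the difference between the two means.
    se = math.sqrt(sd_a**2 / len(a) + sd_b**2 / len(b))
    return abs(mean_a - mean_b) > k * se
```

Each list holds the pass rates from repeated runs of the same rig (at least two runs per setup, since a single run gives no noise estimate).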

- bisonbear 2 hours ago
  
  > we lack common tools to assess and compare
  This has been bothering me for a while - the entire dev community is running on vibes when talking about AI. We're operating in an old paradigm, thinking that smart and logical additions to AGENTS.md result in good agent behavior, when in fact agent behavior is such a black box that measurement is necessary.
  > Even when all the rigging is controlled. (Which implies we need multiple experiments to compare against.)
  Even the rigging is hard to control - Anthropic has an interesting piece on this: https://www.anthropic.com/engineering/infrastructure-noise