Show HN: Agent-skills-eval – Test whether Agent Skills improve outputs

(github.com)

28 points | by darkrishabh 5 hours ago

5 comments

ssgodderidge an hour ago

The example model in the documentation is 4o-mini; you might want to update that to a more recent model.
As an aside, 4o-mini came out months before agent skills were released… I’m curious how it performs at choosing to load skills in the first place?

- stingraycharles an hour ago
  
  It’s an artifact of the documentation being AI-generated; the generating models usually pick GPT-4-era models without giving it further thought.
  For Gemini it seems to always pick 2.5 despite 3.1 being the latest; for Claude, the 3.5-era models.
  Not sure what’s preventing AI labs from ensuring this stuff is refreshed during training.
- block_dagger an hour ago
  
  The skill is deterministically added to the prompt by the harness before the target model is invoked. There is no “choosing” to load a skill. You might be confusing skills with tools (MCP etc).
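
The harness-side injection described in the reply above can be sketched roughly as follows. This is a minimal illustration, not code from agent-skills-eval: `build_prompt`, `run_pair`, and the `model` callable are hypothetical names, and the real project may assemble prompts differently.

```python
from typing import Callable, Dict, Optional


def build_prompt(task: str, skill_text: Optional[str] = None) -> str:
    """Return the prompt sent to the model.

    Hypothetical helper: when a skill is supplied, the harness always
    prepends it. The model never decides whether to load it.
    """
    if skill_text is None:
        return task
    return f"{skill_text}\n\n{task}"


def run_pair(task: str, skill_text: str, model: Callable[[str], str]) -> Dict[str, str]:
    """Run the same task with and without the skill so outputs can be compared."""
    return {
        "baseline": model(build_prompt(task)),
        "with_skill": model(build_prompt(task, skill_text)),
    }


if __name__ == "__main__":
    # Stand-in "model" that echoes its prompt, to make the injection visible.
    echo = lambda prompt: prompt
    results = run_pair("Summarize the report.", "SKILL: Use bullet points.", echo)
    print(results["baseline"])
    print(results["with_skill"])
```

Because the skill text is concatenated before the model is called, any difference between the two outputs comes from the skill content itself rather than from the model's ability to select a skill.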
egeozcan 2 hours ago

Are there any published results gathered using this?
ianhxu an hour ago

How do you iterate on the judge prompt? Is there an auto rater?