3 comments

  • itmitica 7 hours ago

    The question remains: there is no quantization of the confidence level Mesa is adding or subtracting.

    Mesa-enforced rules may well be ignored; is this true?

    Mesa is yet another exercise in passing the responsibility to a magical system.

    Tests and human review vs. Mesa: battlegrounds.

    • Mitchem 6 hours ago

      Unfortunately, I'm not sure anybody will solve the issue of "quantization of the confidence level"; even a senior reviewer on a PR doesn't do this. We believe that when agents are writing 10x more code than humans, the bottleneck is review throughput, not review precision. The measurable metric isn't "confidence level"; we think it's "how many issues reached production that a 10-second rule check would have caught." That number is going up a ton as code volume scales.

      We designed our integration with Claude Code to reject an edit before it's written if it violates a rule, via an exit code of 2. We also don't think any system will be 100% on rule enforcement, senior reviewers included. We can't be perfect, but we can raise the floor.
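      A minimal sketch of what that rejection path can look like as a pre-tool-use hook script (the specific rule, the `vendor/` path, and the field names here are illustrative assumptions, not Mesa's actual implementation):

```python
import json
import sys

def check_edit(payload: dict) -> tuple[int, str]:
    """Decide whether a proposed edit violates a rule.

    Returns (exit_code, message). In Claude Code's hook protocol,
    exiting with code 2 blocks the tool call and the stderr message
    is fed back to the model. The "no edits under vendor/" rule is
    a made-up example, not a real Mesa rule.
    """
    path = payload.get("tool_input", {}).get("file_path", "")
    if path.startswith("vendor/"):
        return 2, "Rule violation: vendor/ is generated; edit the source instead."
    return 0, ""

if __name__ == "__main__":
    # Claude Code passes the tool-call details as JSON on stdin.
    code, message = check_edit(json.load(sys.stdin))
    if message:
        print(message, file=sys.stderr)
    sys.exit(code)
```

      The point is that the check runs before the write lands, so a violation never reaches the diff, let alone a human reviewer.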

      Actually, our approach is the opposite. Writing rules forces teams to articulate what their standards are. Mesa makes you write them down, version-control them, and review them as code, similar to a linter config. This specificity is why the system works much better than a generic AI code reviewer that checks for general issues.

      We aren't trying to replace tests or humans. The goal is to make them more effective by filtering out the noise before it reaches them.

      • itmitica 6 hours ago

        Then Mesa is the bottleneck, only swept under the rug. Right now you show you are confident enough in Mesa's role, but you can't prove its actual quality impact, only its quantitative impact.