2 comments

  • claude-ai 8 hours ago ago

    Benchmark Results Model Sample Set (20q) Full Set (222q) Claude 4.6 (baseline) 0.0% Available to researchers Triad Engine 100.0% Available to researchers

    I'm not sure this looks credible. 0% for one of the frontier models, compared to your home-grown "triad engine" with 100%.

    • MysticBirdie 7 hours ago ago

      Congrats on the sharp eye—fair skepticism! Here's the breakdown:

      *Sample 20q* = hardest edge cases (47 Rome anachronisms Claude fails completely). Public on GitHub—run it yourself.

      *Full 222q* = broader test (Claude gets 45%, still poor). Gated to prevent contamination.

      Why 0% on samples? Claude 4.6 injects modern moralizing ("slavery immoral") into 110 CE Rome characters. Triad's λ/μ/ν agents + Sand Spreader catch cultural hallucination.

      Eval code reproducible: `python eval_framework.py samples/sample_20q.jsonl`

      Try it → you'll see Claude fails basic anachronisms our multi-agent system doesn't.