Benchmark Results

These benchmarks demonstrate system capability under governance constraints—not capability maximization in isolation. Full methodology and outputs are published on Zenodo for independent verification.

HumanEval Code Generation

  • 99.4% pass@1
  • 163/164 problems solved
  • ~$20 total API cost
  • 0 GPUs, no model training

Benchmark: HumanEval is the standard measure of large language model code generation capability, consisting of 164 hand-written Python programming problems with function signatures and docstrings.

Architecture: Five commercially available language models (Claude, GPT-4, Grok, DeepSeek, Kimi) operating as specialized lanes, coordinated through a human Central Processing Node via the HyperNet SDC protocol.
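
For intuition, the lane-dispatch step can be pictured with a short Python sketch. Everything below is illustrative: the lane registry, the call_lane callables, and the dispatch helper are assumptions standing in for the HyperNet SDC protocol, and the human Central Processing Node's review appears only as a comment.

```python
# Illustrative sketch of dispatching one HumanEval problem to several
# model "lanes". Not the HyperNet SDC implementation; each lane would
# wrap a vendor API client in practice.

from typing import Callable, Dict

# Hypothetical lane type: takes a HumanEval prompt (signature + docstring)
# and returns a candidate completion as source code.
Lane = Callable[[str], str]

def dispatch(prompt: str, lanes: Dict[str, Lane]) -> Dict[str, str]:
    """Send one problem to every lane and collect candidate solutions.

    The human Central Processing Node reviews the collected candidates
    before anything is accepted (the governance step, not shown here).
    """
    candidates: Dict[str, str] = {}
    for name, call_lane in lanes.items():
        try:
            candidates[name] = call_lane(prompt)
        except Exception as err:  # one lane failing must not stall the run
            candidates[name] = f"# lane error: {err}"
    return candidates

if __name__ == "__main__":
    # Stub lanes so the sketch runs without API keys.
    stub_lanes: Dict[str, Lane] = {
        "claude": lambda p: "def add(a, b):\n    return a + b",
        "gpt4":   lambda p: "def add(a, b):\n    return a + b",
    }
    print(dispatch('def add(a, b):\n    """Return a + b."""\n', stub_lanes))
```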

Key Finding: Different models fail on different problems. By routing across five models, we capture the union of their capabilities rather than being limited by any single model's weaknesses. The Claude lane alone, at 97.0%, would already be state-of-the-art; the four additional lanes recovered 2 more problems.
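
The union arithmetic itself is simple enough to show directly. The per-lane solved sets below are toy placeholders rather than the reported results (those are in the Zenodo release); only the union_pass_rate calculation is the point.

```python
# Toy illustration of the "union of capabilities" arithmetic.
# The solved-problem sets are made up; the real per-lane results
# are published with the Zenodo methodology.

from typing import Dict, Set

def union_pass_rate(solved_by_lane: Dict[str, Set[int]], total: int) -> float:
    """Fraction of problems solved by at least one lane."""
    solved_union = set().union(*solved_by_lane.values())
    return len(solved_union) / total

if __name__ == "__main__":
    # Two lanes that overlap heavily but miss different problems.
    solved = {
        "lane_a": set(range(9)),        # solves 0-8, misses 9
        "lane_b": set(range(8)) | {9},  # solves 0-7 and 9, misses 8
    }
    print(union_pass_rate(solved, total=10))  # 1.0: the union covers all 10
```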

IMO 2025 Mathematical Reasoning

  • 50% accuracy
  • 3/6 problems correct
  • +18.4 pp vs. baseline
  • ~$5 total cost

Benchmark: The 2025 International Mathematical Olympiad, one of the most difficult mathematical reasoning benchmarks currently available.

Architecture: Six AI systems (Gemini, Claude, GPT-4, Grok, DeepSeek, Kimi) coordinated through a human-governed consensus protocol. Gemini served as the primary solver, with a "Family Game Night" fallback protocol that triggered parallel multi-AI reasoning whenever the primary model failed.
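
A rough Python sketch of that primary-plus-fallback flow is below, assuming each solver returns either an answer string or None. The solver signatures, the agreement threshold, and the solve_with_fallback helper are illustrative; the actual "Family Game Night" protocol routes disagreements to the human governor rather than resolving them purely by count.

```python
# Sketch of the primary-solver-plus-consensus-fallback idea.
# Solvers and the agreement rule are stand-ins, not the real protocol.

from collections import Counter
from typing import Callable, List, Optional

Solver = Callable[[str], Optional[str]]

def solve_with_fallback(problem: str,
                        primary: Solver,
                        fallback_panel: List[Solver],
                        min_agreement: int = 2) -> Optional[str]:
    """Try the primary solver; if it fails, poll the panel (sequentially
    here for simplicity) and keep an answer only if at least
    `min_agreement` independent solvers agree on it."""
    answer = primary(problem)
    if answer is not None:
        return answer

    answers = [solver(problem) for solver in fallback_panel]
    votes = Counter(a for a in answers if a is not None)
    if votes:
        top_answer, count = votes.most_common(1)[0]
        if count >= min_agreement:
            return top_answer
    return None  # no consensus: escalate to the human governor

if __name__ == "__main__":
    failing_primary: Solver = lambda p: None
    panel: List[Solver] = [lambda p: "4", lambda p: "4", lambda p: "5"]
    print(solve_with_fallback("2 + 2 = ?", failing_primary, panel))  # "4"
```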

Key Finding: Gemini alone solved 0/6 problems in our trials. All three correct answers emerged from cross-AI collaboration and consensus voting. Different AI systems fail differently—that's not a bug, it's the feature we're exploiting.

Methodology & Limitations

We selected HumanEval and IMO 2025 because they are publicly available, well-established in the field, and difficult enough to be meaningful. We do not claim state-of-the-art across all benchmarks.

We claim that governed multi-model coordination achieves competitive performance on rigorous tests—evidence that governance is not a capability tax.

Acknowledged Limitations:

  • Oracle Selection: We select the passing solution after testing. A production system would need heuristics to choose before knowing the answer (a minimal sketch of this distinction follows the list).
  • Latency: Running multiple models increases latency compared to a single model.
  • API Dependency: Results depend on third-party API availability and pricing.
  • Benchmark Specificity: Results may not generalize to all domains.
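
To make the oracle-selection caveat concrete, here is a minimal sketch contrasting post-hoc selection (what the benchmark run does) with a blind pre-hoc heuristic (what a production router would need). The candidate format, the test callable, and both helper functions are assumptions for illustration; HumanEval's official harness sandboxes real unit-test execution, which this sketch does not attempt.

```python
# Oracle selection vs. pre-hoc heuristic selection, in miniature.
# Candidates are plain Python source strings; `test` stands in for a
# sandboxed unit-test harness.

from typing import Callable, List, Optional

def oracle_select(candidates: List[str],
                  test: Callable[[str], bool]) -> Optional[str]:
    """What the benchmark run does: keep the candidate that passes the
    tests. This consults the answer key, so it is an upper bound on
    what a deployed router could achieve."""
    for source in candidates:
        if test(source):
            return source
    return None

def heuristic_select(candidates: List[str]) -> str:
    """What production needs: pick before any test runs.
    Placeholder heuristic (deliberately naive): prefer the longest candidate."""
    return max(candidates, key=len)

if __name__ == "__main__":
    cands = ["def f(x): return x", "def f(x): return x + 1"]
    toy_test = lambda src: "x + 1" in src   # toy stand-in for unit tests
    print(oracle_select(cands, toy_test))   # chooses with the answer key
    print(heuristic_select(cands))          # chooses blind
```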

Context

These results are published alongside full methodology for independent verification. The goal is not to claim supremacy but to demonstrate that governance architectures can achieve meaningful capability without sacrificing oversight.

The paradigm shift is not "train better models" but "route smarter across existing ones"—with human governance ensuring the routing serves human interests.