# Benchmarks — AI Agent Productivity
Traditional language benchmarks measure execution speed. For Kōdo, the question is different: how fast can an AI agent go from “code written” to “code verified and deployed”?
These are qualitative benchmarks comparing agent workflows. Quantitative measurements are in development.
## Benchmark 1: Error→Fix Loop Speed

**Scenario:** An AI agent generates code with 10 type errors. How fast can it reach a clean compilation?

### Python + mypy
```console
# Agent generates code → runs mypy
$ mypy main.py
main.py:12: error: Incompatible types in assignment (expression has type "str", variable has type "int")
main.py:25: error: Argument 1 to "process" has incompatible type...

# Agent must: parse prose → regex match error locations →
# guess the fix → rewrite → re-run mypy
# Some errors are ambiguous. Auto-fix rate: ~60%
```
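
Parsing that prose is itself work. A minimal sketch of the extraction step an agent needs, assuming mypy's default one-line diagnostic format (the helper name is ours):

```python
import re

# mypy's default diagnostic shape: "<file>:<line>: error: <message>"
DIAGNOSTIC = re.compile(r"^(?P<file>[^:]+):(?P<line>\d+): error: (?P<message>.*)$")

def parse_mypy_output(output: str) -> list[dict]:
    """Pull (file, line, message) triples out of mypy's prose diagnostics."""
    matches = (DIAGNOSTIC.match(line) for line in output.splitlines())
    return [m.groupdict() for m in matches if m]
```

Even with locations extracted, the message is still prose: the agent must infer what edit it implies, and the position is a line, not a byte range.
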
### Kōdo
```console
# Agent generates code → runs kodoc check --json-errors
$ kodoc check main.ko --json-errors
{
  "code": "E0201",
  "message": "Type mismatch: expected Int, found String",
  "span": { "file": "main.ko", "start": 142, "end": 155 },
  "fix_patch": {
    "replacement": "parse_int(value)",
    "start_byte": 142,
    "end_byte": 155,
    "confidence": "high"
  },
  "fix_difficulty": "auto"
}

# Agent applies fix_patch directly — no guessing
$ kodoc fix main.ko
Fixed 10 errors. 0 remaining.
```
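
Because the patch carries exact byte offsets and a replacement string, applying it is a mechanical splice. A minimal sketch against the schema shown above (error handling omitted):

```python
import json
from pathlib import Path

def apply_fix_patch(source_path: str, error_json: str) -> None:
    """Splice a FixPatch replacement into the file at its byte offsets."""
    patch = json.loads(error_json)["fix_patch"]
    src = Path(source_path).read_bytes()
    fixed = (src[:patch["start_byte"]]
             + patch["replacement"].encode("utf-8")
             + src[patch["end_byte"]:])
    Path(source_path).write_bytes(fixed)
```

With several patches in one file, an agent would apply them in descending `start_byte` order so earlier splices don't shift the offsets of later ones.
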
| Metric | Python + mypy | Kōdo |
|---|---|---|
| Error format | Prose (regex parsing needed) | Structured JSON |
| Fix mechanism | Agent guesses | FixPatch with byte offsets |
| Auto-fix rate | ~60% of type errors | 100% of errors with patches |
| Cycles to clean build | 2–5 | 1–2 |
## Benchmark 2: Correctness by Construction

**Scenario:** An AI agent generates a division function. How many bugs reach runtime?

### Python
```python
def divide(a, b):
    return a / b  # No compile-time check — ZeroDivisionError at runtime

# Agent can add a check, but nothing *enforces* it
def divide_safe(a, b):
    if b == 0:
        raise ValueError("division by zero")
    return a / b

# Still no guarantee callers handle the error
```
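
Nothing obliges call sites to cooperate, either; a hypothetical caller sails through the type checker and fails at runtime:

```python
denominator = 0  # e.g. unvalidated user input

divide(10, denominator)       # ZeroDivisionError at runtime
divide_safe(10, denominator)  # ValueError at runtime; nothing forced a handler
```
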
### Kōdo
```kodo
fn divide(a: Int, b: Int) -> Int
    requires { b != 0 }
    ensures  { result * b + (a % b) == a }
{
    return a / b
}

// Calling divide(10, 0) → compile-time error E0301:
//   "Precondition 'b != 0' cannot be satisfied:
//    argument 'b' is literal 0"
```
| Metric | Python | Kōdo |
|---|---|---|
| Division by zero | Runtime exception | Compile-time error (Z3 proves b != 0 is violated; see the sketch below) |
| Contract enforcement | None (convention only) | Grammar-level requires/ensures |
| Bugs reaching runtime | Possible | Zero for statically verified contracts |
| Agent behavior | Hope the tests catch it | Compiler blocks the build — agent must fix |
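
The Z3 row above can be made concrete. A sketch of the kind of query such a verifier might emit for the call `divide(10, 0)`, written with the z3-solver Python bindings (the encoding is illustrative, not Kōdo's actual one):

```python
# pip install z3-solver
from z3 import Int, Solver, unsat

b = Int("b")
s = Solver()
s.add(b == 0)   # the call site binds b to the literal 0
s.add(b != 0)   # the precondition: requires { b != 0 }

# The two constraints together are unsatisfiable, so the precondition
# can never hold for this call: report E0301 at compile time.
assert s.check() == unsat
```
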
## Benchmark 3: Trust Propagation

**Scenario:** A module has 5 functions. One is experimental with @confidence(0.6). How fast is the risk detected?

### Python
```python
# No mechanism to track confidence or authorship
def stable_function():  # Who wrote this? How confident? No idea.
    return process(experimental_helper())

def experimental_helper():  # Agent generated this at 60% confidence
    return risky_computation()

# Risk: experimental code is silently used in production
# Detection: manual code review, maybe never
```
### Kōdo
```kodo
@authored_by(agent: "claude")
@confidence(0.95)
fn stable_function() -> Int {
    return process(experimental_helper())
    // ↑ E0260: Calling function with confidence 0.6
    //   from function with confidence 0.95.
    //   Add @reviewed_by to acknowledge the risk.
}

@authored_by(agent: "claude")
@confidence(0.6)
fn experimental_helper() -> Int {
    return risky_computation()
}
```
| Metric | Python | Kōdo |
|---|---|---|
| Confidence tracking | None | @confidence scores on every function |
| Risk propagation | Invisible | Transitive — min confidence propagates through call chains (toy sketch below) |
| Detection time | Manual review (hours/days/never) | Compile-time (instant) |
| Policy enforcement | None | Build blocked until @reviewed_by is added |
| Audit trail | git blame | Build certificates (.ko.cert.json) with per-function scores |
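
To illustrate the "transitive" row, a toy model of min-confidence propagation over an acyclic call graph (this is our sketch, not Kōdo's algorithm; the 0.9 score for risky_computation is invented):

```python
from functools import lru_cache

declared = {"stable_function": 0.95,
            "experimental_helper": 0.6,
            "risky_computation": 0.9}   # invented score, for illustration
calls = {"stable_function": ["experimental_helper"],
         "experimental_helper": ["risky_computation"],
         "risky_computation": []}

@lru_cache(maxsize=None)
def effective_confidence(fn: str) -> float:
    """A function is only as trustworthy as the weakest thing it calls."""
    return min([declared[fn]] + [effective_confidence(c) for c in calls[fn]])

assert effective_confidence("stable_function") == 0.6  # the 0.6 helper dominates
```
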
## The closed-loop advantage
These benchmarks share a pattern: Kōdo moves verification left — from runtime to compile-time, from human review to automated checks.
For an AI agent operating in a tight loop:
```text
┌────────────────────────────────────────────┐
│ Agent writes code                          │
│        ↓                                   │
│ kodoc check --json-errors                  │
│        ↓                                   │
│ Parse JSON → apply FixPatch → recompile    │
│        ↓                                   │
│ All contracts verified by Z3               │
│        ↓                                   │
│ Confidence scores > threshold              │
│        ↓                                   │
│ Build certificate generated → deploy       │
└────────────────────────────────────────────┘
```
No human in the loop. No hoping tests catch it. No “it works on my machine.”
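
A minimal harness for that loop, assuming the CLI behavior shown above: `kodoc check` exits non-zero while errors remain and, for this sketch, emits its errors as a JSON array (the exact envelope isn't specified here). `apply_fix_patch` is the helper from Benchmark 1:

```python
import json
import subprocess

def build_loop(path: str, max_cycles: int = 5) -> bool:
    """Check → patch → recheck until clean. Returns True on a clean build."""
    for _ in range(max_cycles):
        proc = subprocess.run(["kodoc", "check", path, "--json-errors"],
                              capture_output=True, text=True)
        if proc.returncode == 0:
            return True  # all contracts verified; certificate can be emitted
        errors = json.loads(proc.stdout)  # assumed: errors arrive as a JSON array
        patchable = [e for e in errors if e.get("fix_patch")]
        # Apply patches back-to-front so earlier splices don't shift later offsets
        for err in sorted(patchable,
                          key=lambda e: e["fix_patch"]["start_byte"],
                          reverse=True):
            apply_fix_patch(path, json.dumps(err))
        # Errors without a fix_patch would go back to the model for regeneration
    return False
```
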
## Quantitative benchmarks
Reproducible quantitative benchmarks are in development, measuring:
- Cycles to clean compilation — agent attempts until zero errors
- Time to first successful build — wall clock from code generation to binary
- Contract coverage — percentage of functions with verified pre/post-conditions
- Fix patch hit rate — percentage of errors with machine-applicable patches (see the sketch after this list)
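
For the last two metrics, the computation is straightforward once the compiler emits structured records; a toy sketch, with record shapes of our own invention rather than a specified kodoc format:

```python
def contract_coverage(functions: list[dict]) -> float:
    """Share of functions whose pre/post-conditions were statically verified."""
    verified = sum(1 for f in functions if f.get("contracts_verified"))
    return verified / len(functions) if functions else 0.0

def fix_patch_hit_rate(errors: list[dict]) -> float:
    """Share of reported errors that carry a machine-applicable FixPatch."""
    return sum(1 for e in errors if e.get("fix_patch")) / len(errors) if errors else 0.0
```
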
Results will be published here as they become available. Want to contribute? Open an issue on GitHub.