Mulch vs Baseline (Self Improving Agent — Rank #2 on ClawHub)
Benchmark comparison — token efficiency, troubleshooting, style & memory
- Total efficiency gain over legacy: ~27.5% fewer chars. For a session that does a session start + 3 troubleshooting lookups + 6 style/memory lookups: 3792 chars (Mulch) vs 5233 chars (baseline), i.e. 1441 chars (~352 tokens) saved per full mix.
- Session (rem + ctx + ret): ~14% fewer chars.
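The headline arithmetic can be checked in a few lines (a sketch; only the two full-mix totals come from the benchmark output, everything else is derived):

```python
# Totals reported for the full session mix
# (session start + 3 troubleshooting lookups + 6 style/memory lookups).
MULCH_CHARS = 3792
BASELINE_CHARS = 5233

saved_chars = BASELINE_CHARS - MULCH_CHARS        # 1441 chars saved per full mix
pct_saved = 100 * saved_chars / BASELINE_CHARS    # ~27.5% fewer chars than baseline

print(f"saved {saved_chars} chars ({pct_saved:.1f}%)")  # saved 1441 chars (27.5%)
```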
Baseline (Self Improving Agent — Rank #2 on ClawHub) vs Mulch — what’s different
1. Token efficiency (session + retrieval)
2. Troubleshooting (3 error scenarios)
3. Style & memory (6 scenarios: Gmail/Twitter voice, addressing, standup)
4. Projected savings (chars → tokens ≈ ÷4)
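The chars → tokens projection in item 4 is the usual ~4-characters-per-token rule of thumb; a tiny helper makes the conversion explicit (real tokenizers vary, so treat the result as an estimate):

```python
def chars_to_tokens(chars: int) -> int:
    """Rough projection: ~4 characters per token (estimate only)."""
    return round(chars / 4)

# e.g. the baseline's full-mix cost of 5233 chars projects to roughly 1308 tokens
print(chars_to_tokens(5233))  # 1308
```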
How to run the benchmark
```shell
docker build -t mulch-self-improver-test .
docker run --rm mulch-self-improver-test benchmark
```
The run prints the token-efficiency, troubleshooting, and style & memory tables, and asserts that Mulch wins on reminder, retrieval, total, troubleshooting, and style/memory char counts.
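The win conditions the run checks can be sketched as follows (the `results` shape, the category keys, and all numbers except the 3792/5233 totals are hypothetical illustrations; the real harness may structure this differently):

```python
# Hypothetical (mulch_chars, baseline_chars) pairs per category;
# only the "total" pair comes from the benchmark numbers quoted above.
results = {
    "reminder": (410, 520),
    "retrieval": (980, 1240),
    "total": (3792, 5233),
    "troubleshooting": (1100, 1500),
    "style_memory": (1300, 1900),
}

def assert_mulch_wins(results: dict) -> None:
    """Fail loudly if Mulch does not use fewer chars in every category."""
    for category, (mulch, baseline) in results.items():
        assert mulch < baseline, f"Mulch lost on {category}: {mulch} >= {baseline}"

assert_mulch_wins(results)  # passes: Mulch uses fewer chars in every category
```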