Mulch vs Baseline (Self Improving Agent — Rank #2 on ClawHub)

Benchmark comparison — token efficiency, troubleshooting, style & memory

Total efficiency gain over baseline: ~27.5% fewer chars

For a session that does one session start, 3 troubleshooting lookups, and 6 style/memory lookups, Mulch uses 3792 chars vs 5233 chars for the baseline. That saves 1441 chars (~360 tokens at roughly 4 chars per token) per full mix.

Savings by category:
Session (reminder + context + retrieval): ~14%
Troubleshooting: ~54%
Style & memory: ~33%
Combined total: ~27.5%

Baseline (Self Improving Agent — Rank #2 on ClawHub) vs Mulch — what’s different
| Aspect | Self Improving Agent (Rank #2 on ClawHub) | Mulch Self Improver |
| --- | --- | --- |
| Store | .learnings/ (LEARNINGS.md, ERRORS.md, mixed PREFERENCES) | .mulch/ (typed records, domains) |
| Session start | Long reminder (632 chars) + full .learnings in context | Short reminder (452 chars) + mulch prime |
| Recording | Append to markdown (no types/domains) | `mulch record <domain> --type failure\|convention\|…` |
| Retrieval | Full file(s); grep/cat (932 chars for 2 queries) | mulch search / mulch query (330 chars) |
| Troubleshooting | Full ERRORS.md + LEARNINGS.md (1215 chars) | One mulch search per scenario (559 chars) |
| Style / preferences / memory | One mixed file; load full file (1136 chars) | Domains + targeted search (757 chars) |
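
To see why targeted lookup tends to cost fewer characters than loading whole files, here is a minimal Python sketch. It is not Mulch's implementation: the `Record` class, the sample records, and both helper functions are made up for illustration, standing in for the mixed `.learnings/` files and the typed, domain-scoped `.mulch/` records.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    """Hypothetical typed record; the field names are assumptions."""
    domain: str
    kind: str  # e.g. "failure" or "convention"
    text: str


# Made-up sample data, used only to illustrate the size difference.
RECORDS = [
    Record("docker", "failure", "Build fails when the cache is stale; rebuild with --no-cache."),
    Record("email", "convention", "Sign off emails with first name only."),
    Record("standup", "convention", "Keep standup notes to three bullets."),
]


def load_everything(records: list) -> str:
    """Baseline-style retrieval: put every stored note into context."""
    return "\n".join(r.text for r in records)


def targeted_search(records: list, term: str, domain: Optional[str] = None) -> str:
    """Targeted retrieval: return only records matching the term (and domain, if given)."""
    needle = term.lower()
    hits = [
        r.text
        for r in records
        if (domain is None or r.domain == domain) and needle in r.text.lower()
    ]
    return "\n".join(hits)


full = load_everything(RECORDS)
hit = targeted_search(RECORDS, "cache", domain="docker")
print(f"full load: {len(full)} chars, targeted: {len(hit)} chars")
```

The gap grows with the size of the store, since the full load scales with everything recorded while the targeted result scales only with the matches.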
1. Token efficiency (session + retrieval)
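| Metric | Baseline (chars) | Mulch (chars) | Winner |
| --- | --- | --- | --- |
| Reminder | 632 | 452 | Mulch |
| Session context | 1318 | 1694 | Baseline |
| Retrieval (2 queries) | 932 | 330 | Mulch (65% less) |
| Total (reminder + context + retrieval) | 2882 | 2476 | Mulch (14% less) |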
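
The winner column and the percentages can be re-derived from the raw character counts. The following Python sketch does that; the counts are copied from the table, while the `SESSION_ROWS` and `compare` names are assumptions for illustration and not part of the benchmark itself.

```python
# Character counts copied from the table: (baseline, mulch).
SESSION_ROWS = {
    "reminder": (632, 452),
    "session context": (1318, 1694),
    "retrieval (2 queries)": (932, 330),
}


def compare(baseline: int, mulch: int) -> tuple:
    """Return (winner, percent saved relative to the larger count)."""
    if baseline == mulch:
        return ("Tie", 0.0)
    winner = "Mulch" if mulch < baseline else "Baseline"
    larger = max(baseline, mulch)
    return (winner, 100.0 * abs(baseline - mulch) / larger)


# Column sums give the session totals for each side.
totals = tuple(sum(column) for column in zip(*SESSION_ROWS.values()))

for name, (baseline, mulch) in {**SESSION_ROWS, "total": totals}.items():
    winner, percent = compare(baseline, mulch)
    print(f"{name}: {baseline} vs {mulch} -> {winner} ({percent:.0f}% less)")
```

Note that Mulch loses the session-context row yet still wins the total, because the retrieval saving outweighs the larger context.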
2. Troubleshooting (3 error scenarios)
| Metric | Baseline | Mulch | Winner |
| --- | --- | --- | --- |
| Chars to get all 3 resolutions | 1215 | 559 | Mulch (54% less) |
| Resolutions found (of 3) | 3/3 | 2/3 or 3/3 | Tie or Baseline |
3. Style & memory (6 scenarios: Gmail/Twitter voice, addressing, standup)
| Metric | Baseline | Mulch | Winner |
| --- | --- | --- | --- |
| Chars to get all 6 answers | 1136 | 757 | Mulch (33% less) |
| Scenarios found (of 6) | 6/6 | 4/6 to 6/6 | Tie or Baseline |
4. Projected savings (chars → tokens ≈ ÷4)
| Scenario | Baseline (chars) | Mulch (chars) | Saving | Per 100 sessions |
| --- | --- | --- | --- | --- |
| Session (reminder + context + retrieval) | 2882 | 2476 | 406 chars (~100 tokens) | ~10k tokens |
| Troubleshooting (3 errors) | 1215 | 559 | 656 chars (~164 tokens) | ~16k tokens per 100 rounds |
| Full mix (session + troubleshoot + style) | 5233 | 3792 | 1441 chars (~360 tokens) | ~27.5% less |
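
As a rough check on the projection, the sketch below applies the chars ÷ 4 rule of thumb from the heading to each scenario and to the full mix. The character counts come from the tables in this document; `CHARS_PER_TOKEN`, `SCENARIOS`, and `saving` are illustrative names, and real tokenizers will deviate from a flat 4 chars per token.

```python
CHARS_PER_TOKEN = 4  # rule of thumb from the heading; an approximation only

# (baseline chars, mulch chars) per scenario, copied from the tables.
SCENARIOS = {
    "session": (2882, 2476),
    "troubleshooting": (1215, 559),
    "style & memory": (1136, 757),
}


def saving(baseline: int, mulch: int) -> dict:
    """Chars saved, approximate tokens saved, and percent saved vs baseline."""
    chars = baseline - mulch
    return {
        "chars": chars,
        "tokens": chars / CHARS_PER_TOKEN,
        "percent": 100.0 * chars / baseline,
    }


# The full mix is the sum of the three scenarios on each side.
full_mix = tuple(sum(column) for column in zip(*SCENARIOS.values()))

for name, (baseline, mulch) in {**SCENARIOS, "full mix": full_mix}.items():
    s = saving(baseline, mulch)
    print(
        f"{name}: {s['chars']} chars, ~{s['tokens']:.0f} tokens, "
        f"{s['percent']:.1f}% less, ~{100 * s['tokens']:.0f} tokens per 100 runs"
    )
```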
How to run the benchmark
docker build -t mulch-self-improver-test .
docker run --rm mulch-self-improver-test benchmark

The run prints token-efficiency, troubleshooting, and style & memory tables and asserts Mulch wins on reminder, retrieval, total chars, troubleshooting chars, and style/memory chars.
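
The assertions described above could look roughly like the following Python sketch. This is an assumption about their shape, not the benchmark's actual code: the `RESULTS` layout and the `check_mulch_wins` helper are hypothetical, and the numbers are the ones reported in this document.

```python
# Reported character counts for the five metrics the run asserts on.
RESULTS = {
    "reminder": {"baseline": 632, "mulch": 452},
    "retrieval": {"baseline": 932, "mulch": 330},
    "total": {"baseline": 2882, "mulch": 2476},
    "troubleshooting": {"baseline": 1215, "mulch": 559},
    "style_memory": {"baseline": 1136, "mulch": 757},
}


def check_mulch_wins(results: dict) -> list:
    """Assert Mulch uses strictly fewer chars on every metric; return the metrics checked."""
    checked = []
    for metric, chars in results.items():
        assert chars["mulch"] < chars["baseline"], (
            f"{metric}: Mulch used {chars['mulch']} chars, "
            f"baseline used {chars['baseline']}"
        )
        checked.append(metric)
    return checked


print("Mulch wins on:", ", ".join(check_mulch_wins(RESULTS)))
```

Session context is deliberately absent from the checked metrics, since the baseline is smaller on that row (1318 vs 1694 chars).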