Benchmark Report
Claude Code vs Claude Code + Lavra · Sonnet 4.6 · March 2026
Head-to-head comparison across 2 sessions, 40 tests total
| Metric | Control | Lavra | Verdict |
|---|---|---|---|
| Tests passed | 40/40 | 40/40 | Tied -- perfect scores |
| Work time | 281 s | 578 s | Control -- 2.1x faster |
| Work cost | $0.89 | $1.93 | Control -- 2.2x cheaper |
| Knowledge captured | 0 | 9 | Lavra -- compounding value |
I wanted a simple but non-trivial project that would need two sessions to implement, so I could measure clock time and tokens, but mostly quality. In my own use during Lavra's development, I found the quality of my output gradually increasing as I tuned the workflow.

So the test: prompt plain Claude Code with Sonnet, and Claude Code + Lavra with Sonnet. I needed two separate sessions because part of Lavra's value comes from compounding knowledge across sessions -- it targets projects people want to engineer and maintain, as opposed to the droves of clickbait "I one-shotted this" posts on X.

Lavra's main design command, /lavra-design, is meant to help people work through the problems, gray areas, business value (CEO review), and an engineering review with research agents, in order to catch oversights and blind spots before implementation. By being heavy on design (more tokens up front), implementation by agents tends to produce far fewer problems that bite later. Yes, it's more expensive in tokens and time -- just like regular software projects.

So I had the prompts and the tests. Both conditions achieved perfect test scores, but the quantitative metrics alone don't tell the full story. Raw Claude Code was faster and cheaper at the narrow task of making tests pass. The real value of Lavra emerges in the qualitative analysis: security practices, architectural decisions, and compounded knowledge. The treatment produced code that was more secure, better structured, and left behind 9 knowledge entries that reduce the cost of every future session.
Session-by-session comparison of time, cost, and token usage, as if paid per API call. For subscription plans, cost is almost never a concern.
Charts: seconds per session · USD per session (work only) · tokens generated per session · tests passed (out of 20 per session) · quality score (out of 60 per session).

Higher design costs come from the thorough research and review done in this phase. If cost is a concern, my suggestion is to avoid the main /lavra-design command.
The qualitative difference. Lavra's design phase produced meaningfully more secure code.
| Finding | Control | Treatment (Lavra) |
|---|---|---|
| Timing attack on login | × Vulnerable -- short-circuits on null user | ✓ Protected -- DUMMY_HASH constant-time comparison |
| JWT key separation | × Single shared SECRET_KEY | ✓ Separate ACCESS_SECRET_KEY + REFRESH_SECRET_KEY |
| JWT library | × python-jose (unmaintained, known CVEs) | ✓ PyJWT (actively maintained) |
| Token type claims | × Only on refresh tokens | ✓ Both access and refresh tokens |
| Refresh token rotation | × Returns new access token only | ✓ Full rotation -- revokes old, issues new pair |
| Active user check | × Missing -- disabled users keep access | ✓ Enforced in get_current_user |
| UUID path validation | × String type -- any input hits DB | ✓ uuid.UUID type -- malformed returns 422 |
| DB initialization | × No lifespan handler | ✓ Proper lifespan with create_all + dispose |
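The timing-attack row is the clearest example of the gap. Below is a minimal sketch of the dummy-hash pattern, using stdlib PBKDF2 as a stand-in for the bcrypt/argon2 hashing a real app would use; the function and field names here are illustrative, not from the benchmark code.

```python
import hashlib
import hmac
import os

def hash_password(password: str, salt: bytes) -> bytes:
    # PBKDF2 stands in for the bcrypt/argon2 hash a real app would use
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)

_DUMMY_SALT = os.urandom(16)
# Pre-computed at module load: hash of a throwaway password
DUMMY_HASH = hash_password("dummy-password", _DUMMY_SALT)

def verify_login(user, password: str) -> bool:
    if user is None:
        # No matching user: still burn a full hash comparison so the
        # response time doesn't reveal whether the username exists
        hmac.compare_digest(hash_password(password, _DUMMY_SALT), DUMMY_HASH)
        return False
    candidate = hash_password(password, user["salt"])
    return hmac.compare_digest(candidate, user["password_hash"])
```

The point is that both branches pay the same hashing cost, so login latency no longer enumerates valid usernames.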
File structure and separation of concerns

Control:

```
app/
  main.py, config.py, database.py
  dependencies/
    auth.py                              # roles mixed in
  models/
    user.py                              # User + RevokedToken together
  routers/
    auth.py                              # security logic mixed with routes
  schemas/
    auth.py, admin.py, item.py
```

Treatment (Lavra):

```
app/
  main.py, config.py, database.py
  security.py                            # centralized JWT + password logic
  dependencies/
    auth.py, roles.py                    # separated
  models/
    user.py, revoked_token.py, item.py   # one model per file
  routers/
    auth.py, admin.py, items.py
  schemas/
    auth.py
```
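The "DB initialization" finding from the table above refers to FastAPI's lifespan pattern. A rough sketch of its shape, with a fake engine standing in for the real async SQLAlchemy engine (the FakeEngine class and its method names are illustrative):

```python
import asyncio
from contextlib import asynccontextmanager

class FakeEngine:
    """Stand-in for an async SQLAlchemy engine; a real app would run
    Base.metadata.create_all via conn.run_sync on startup."""
    def __init__(self):
        self.events = []
    async def create_all(self):
        self.events.append("create_all")
    async def dispose(self):
        self.events.append("dispose")

engine = FakeEngine()

@asynccontextmanager
async def lifespan(app):
    # Startup: initialize the schema before the app serves requests
    await engine.create_all()
    yield
    # Shutdown: release pooled connections
    await engine.dispose()

async def simulate_app_lifetime():
    # FastAPI would drive this via FastAPI(lifespan=lifespan)
    async with lifespan(app=None):
        pass  # the app handles requests here

asyncio.run(simulate_app_lifetime())
```

The same context manager is passed to `FastAPI(lifespan=...)`, replacing the deprecated on_startup/on_shutdown hooks.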
Knowledge captured: 9 entries extracted during the Lavra work sessions that will be reused in future sessions. Zero from the control.
- Pre-compute a DUMMY_HASH at module load. When a user is not found, verify against the dummy hash to prevent timing-based user enumeration.
- SQLite testing with async requires StaticPool and check_same_thread=False to share a single in-memory database across threads.
- SQLite doesn't support native UUID columns. Use String(36) for portability, not Uuid(as_uuid=True), which fails on SQLite.
- Unlike python-jose, jwt.encode() in PyJWT returns a string directly. No need to call .decode() on the result.
- FastAPI's status_code=204 on the decorator still sends an empty body. Return Response(status_code=204) explicitly for a proper no-content response.
- Chose full rotation on refresh (revoke old token, issue new pair). Deferred scope-based token claims to a future session to keep the first implementation clean.
- Access and refresh tokens should use different signing keys so a compromised access token cannot forge a refresh token, and vice versa.
- FastAPI's lifespan parameter replaces the deprecated on_startup/on_shutdown hooks. Use an async context manager for clean resource management.
- Check is_active in the auth dependency, not in individual routes. This ensures deactivated users are immediately locked out across all endpoints.
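Several of these entries concern token handling. To illustrate the key-separation and type-claim ideas together, here is a toy HMAC-signed token sketch; the benchmark code itself uses PyJWT's jwt.encode/jwt.decode, and the key values below are placeholders, not real secrets.

```python
import base64
import hashlib
import hmac
import json
import time

# Hypothetical keys; kept separate so a leaked access key cannot
# forge refresh tokens, and vice versa
KEYS = {"access": b"access-secret", "refresh": b"refresh-secret"}

def _sign(body: bytes, token_type: str) -> str:
    return hmac.new(KEYS[token_type], body, hashlib.sha256).hexdigest()

def create_token(sub: str, token_type: str, ttl: int) -> str:
    # "type" claim goes on BOTH access and refresh tokens
    payload = {"sub": sub, "type": token_type, "exp": int(time.time()) + ttl}
    body = base64.urlsafe_b64encode(json.dumps(payload).encode())
    return body.decode() + "." + _sign(body, token_type)

def decode_token(token: str, expected_type: str) -> dict:
    body_b64, sig = token.rsplit(".", 1)
    body = body_b64.encode()
    # Verifying with the expected type's key rejects cross-type use
    if not hmac.compare_digest(sig, _sign(body, expected_type)):
        raise ValueError("bad signature or wrong token type")
    payload = json.loads(base64.urlsafe_b64decode(body))
    if payload["type"] != expected_type or payload["exp"] < time.time():
        raise ValueError("wrong token type or expired")
    return payload
```

Because the access and refresh keys differ, an access token presented to the refresh path fails signature verification before the type claim is even read.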
How the benchmark was structured and executed
Control used `claude -p` with the spec pasted directly; the treatment ran `/lavra-design` followed by `/lavra-work`. The exact prompts passed to `claude -p` in each condition:

Control, Session 1:
You are implementing a FastAPI REST API.
Read the spec below and implement it completely.
Do NOT read or modify files in the tests/ directory
-- those are the automated spec checker and must
not be changed.
After implementation, run:
uv pip install -e '.[dev]' &&
pytest tests/test_session1_auth.py -v
If any tests fail, fix the implementation until
all 20 tests pass. Do NOT modify the tests.
Here is the spec:
$SPEC1

Treatment, Session 1 (/lavra-work):
Implement all beads. After implementation, run:
uv pip install -e '.[dev]' &&
pytest tests/test_session1_auth.py -v
If any tests fail, fix the implementation until
all 20 tests pass. Do NOT read or modify files
in the tests/ directory.
CRITICAL: Before closing each bead, you MUST log
at least one knowledge comment using:
bd comments add <BEAD_ID>
"LEARNED: <what you discovered>"
Use prefixes:
LEARNED:, DECISION:, FACT:, PATTERN:, DEVIATION:
Do NOT close any bead until you have logged at
least one comment on it.

Control, Session 2:

You are extending an existing FastAPI REST API with
role-based access control. The auth system from
Session 1 is already implemented in this project.
Read the existing code first, then implement the
spec below.
Do NOT read or modify files in the tests/ directory.
After implementation, run:
uv pip install -e '.[dev]' &&
pytest tests/test_session2_rbac.py -v
If any tests fail, fix the implementation until
all 20 tests pass. Do NOT modify the tests.
Here is the spec:
$SPEC2

Treatment, Session 2 (/lavra-work):
Implement all beads. After implementation, run:
uv pip install -e '.[dev]' &&
pytest tests/test_session2_rbac.py -v
If any tests fail, fix the implementation until
all 20 tests pass. Do NOT read or modify files
in the tests/ directory.
CRITICAL: Before closing each bead, you MUST log
at least one knowledge comment using:
bd comments add <BEAD_ID>
"LEARNED: <what you discovered>"
Use prefixes:
LEARNED:, DECISION:, FACT:, PATTERN:, DEVIATION:
Do NOT close any bead until you have logged at
least one comment on it.
Generated from lavra-test benchmark suite
Run ID: control-20260317-220850 vs lavra-20260318-080234