Benchmark Report

Lavra Benchmark Results

Claude Code vs Claude Code + Lavra · Sonnet 4.6 · March 2026

Key Metrics

Head-to-head comparison across 2 sessions, 40 tests total

Tests Passed

Control

40/40

vs

Lavra

40/40

Tied -- perfect scores

Work Time

Control

281s

vs

Lavra

578s

Control -- 2.1x faster

Work Cost

Control

$0.89

vs

Lavra

$1.93

Control -- 2.2x cheaper

Knowledge Captured

Control

0

vs

Lavra

9

Lavra -- compounding value

The Story

I wanted a simple but non-trivial project that would need to be implemented across two sessions, so I could measure wall-clock time and token usage, but mostly quality. In my own use during development of Lavra, I found the quality of my output gradually increasing as I tuned the workflow.


So the test compares prompting plain Claude Code with Sonnet against Claude Code + Lavra with Sonnet. I needed two separate sessions because part of Lavra's value comes from compounding knowledge across sessions, for projects people intend to engineer and maintain, as opposed to the "I one-shotted this" clickbait that floods X.


Lavra's main design command, /lavra-design, is meant to help people work through the problems, gray areas, business value (CEO review), and engineering review with research agents, in order to catch oversights and blind spots before implementation. By being heavy on design (more tokens up front), implementation by agents tends to produce far fewer problems that will bite later. Yes, it's more expensive in tokens and time, just like regular software projects.


So I had the prompts and the tests. Both conditions achieved perfect test scores, but the quantitative metrics alone don't tell the full story. Raw Claude Code was faster and cheaper at the narrow task of making tests pass. The real value of Lavra emerges in the qualitative analysis: security practices, architectural decisions, and compounded knowledge. The treatment produced code that was more secure and better structured, and it left behind 9 knowledge entries that reduce the cost of every future session.

Performance Breakdown

Session-by-session comparison of time, cost, and token usage, priced as if each call were paid through the API. For subscription plans, cost is almost never a concern.

Wall Time

Seconds per session

Wall time comparison chart

Cost

USD per session (work only)

Cost comparison chart

Higher design costs come from thorough research and review in this phase. If cost is a concern, my suggestion is to avoid the heavyweight /lavra-design command.

Output Tokens

Tokens generated per session

Output tokens comparison chart

Tests Passing

Out of 20 per session

Tests passing comparison chart

Functional Scores

Score out of 60 per session

Functional scores comparison

Security Findings

The qualitative difference. Lavra's design phase produced meaningfully more secure code.

Finding · Control · Treatment (Lavra)

  • Timing attack on login -- Control: vulnerable, short-circuits on null user · Lavra: protected, DUMMY_HASH constant-time comparison
  • JWT key separation -- Control: single shared SECRET_KEY · Lavra: separate ACCESS_SECRET_KEY + REFRESH_SECRET_KEY
  • JWT library -- Control: python-jose (unmaintained, known CVEs) · Lavra: PyJWT (actively maintained)
  • Token type claims -- Control: only on refresh tokens · Lavra: both access and refresh tokens
  • Refresh token rotation -- Control: returns new access token only · Lavra: full rotation, revokes old and issues new pair
  • Active user check -- Control: missing, disabled users keep access · Lavra: enforced in get_current_user
  • UUID path validation -- Control: String type, any input hits the DB · Lavra: uuid.UUID type, malformed input returns 422
  • DB initialization -- Control: no lifespan handler · Lavra: proper lifespan with create_all + dispose

Architecture Comparison

File structure and separation of concerns

Control Baseline

app/

main.py, config.py, database.py

dependencies/

auth.py -- roles mixed in

models/

user.py -- User + RevokedToken together

routers/

auth.py -- security logic mixed with routes

schemas/

auth.py, admin.py, item.py

Treatment Lavra

app/

main.py, config.py, database.py

security.py -- centralized JWT + password logic

dependencies/

auth.py, roles.py -- separated

models/

user.py, revoked_token.py, item.py -- one model per file

routers/

auth.py, admin.py, items.py

schemas/

auth.py

Knowledge Captured

9 entries extracted during Lavra work sessions that will be reused in future sessions. Zero from the control.

PATTERN

DUMMY_HASH for timing attack prevention

Pre-compute a DUMMY_HASH at module load. When a user is not found, verify against the dummy hash to prevent timing-based user enumeration.
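A minimal sketch of the pattern, using stdlib pbkdf2 as a stand-in hasher (a real app would use bcrypt or argon2 via a library like passlib, which embeds a per-password salt; the shared salt and iteration count here are simplifications for illustration):

```python
import hashlib
import hmac
import os

_SALT = os.urandom(16)  # stand-in; bcrypt/argon2 manage salts per hash

def hash_password(password: str) -> bytes:
    return hashlib.pbkdf2_hmac("sha256", password.encode(), _SALT, 100_000)

# Pre-computed at module load so the "user not found" path still pays
# the full cost of a hash verification.
DUMMY_HASH = hash_password("dummy-password-never-matches")

def verify_password(password: str, stored: bytes) -> bool:
    return hmac.compare_digest(hash_password(password), stored)

def authenticate(user, password: str) -> bool:
    if user is None:
        # Burn the same hashing work as a real check, then fail, so
        # response timing doesn't reveal whether the username exists.
        verify_password(password, DUMMY_HASH)
        return False
    return verify_password(password, user["password_hash"])
```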

LEARNED

StaticPool + check_same_thread=False

SQLite testing with async requires StaticPool and check_same_thread=False to share a single in-memory database across threads.
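A sketch of the test-engine configuration this entry describes, assuming SQLAlchemy's async API with the aiosqlite driver:

```python
from sqlalchemy.ext.asyncio import create_async_engine
from sqlalchemy.pool import StaticPool

# In-memory SQLite for async tests: StaticPool keeps one connection
# alive so every session sees the same database, and
# check_same_thread=False lets that connection cross thread boundaries.
engine = create_async_engine(
    "sqlite+aiosqlite://",
    connect_args={"check_same_thread": False},
    poolclass=StaticPool,
)
```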

LEARNED

String(36) vs Uuid(as_uuid=True)

SQLite doesn't support native UUID columns. Use String(36) for portability, not Uuid(as_uuid=True) which fails on SQLite.
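A model-column sketch of the portable choice, assuming SQLAlchemy 2.x declarative mapping (the User model here is illustrative, not the benchmark's):

```python
import uuid

from sqlalchemy import String
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class User(Base):
    __tablename__ = "users"
    # String(36) stores the canonical dashed-hex form and works on
    # SQLite; Uuid(as_uuid=True) assumes native UUID column support.
    id: Mapped[str] = mapped_column(
        String(36), primary_key=True, default=lambda: str(uuid.uuid4())
    )
```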

FACT

PyJWT encode returns str, not bytes

Unlike python-jose, jwt.encode() in PyJWT returns a string directly. No need to call .decode() on the result.
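A quick demonstration with PyJWT 2.x (the key and claims are placeholder values):

```python
import jwt  # PyJWT

# PyJWT 2.x returns str directly; no .decode() needed on the result,
# unlike python-jose or PyJWT 1.x, which returned bytes.
token = jwt.encode({"sub": "user-123"}, "placeholder-key", algorithm="HS256")

claims = jwt.decode(token, "placeholder-key", algorithms=["HS256"])
```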

PATTERN

Response(status_code=204) for no-body endpoints

FastAPI's status_code=204 on the decorator still sends an empty body. Return Response(status_code=204) explicitly for a proper no-content response.

DECISION

Refresh token rotation + deferred scope

Chose full rotation on refresh (revoke old token, issue new pair). Deferred scope-based token claims to a future session to keep the first implementation clean.

LEARNED

Separate JWT signing keys

Access and refresh tokens should use different signing keys so a compromised access token cannot forge a refresh token, and vice versa.
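A sketch of key separation with PyJWT; the key values and helper are hypothetical, and real keys would come from configuration or a secrets manager:

```python
from datetime import datetime, timedelta, timezone

import jwt  # PyJWT

# Hypothetical values for illustration only.
ACCESS_SECRET_KEY = "access-secret"
REFRESH_SECRET_KEY = "refresh-secret"

def create_token(sub: str, token_type: str, key: str, minutes: int) -> str:
    now = datetime.now(timezone.utc)
    claims = {
        "sub": sub,
        "type": token_type,  # type claim on BOTH token kinds
        "iat": now,
        "exp": now + timedelta(minutes=minutes),
    }
    return jwt.encode(claims, key, algorithm="HS256")

access = create_token("user-123", "access", ACCESS_SECRET_KEY, 15)
refresh = create_token("user-123", "refresh", REFRESH_SECRET_KEY, 60 * 24)
```

Because the keys differ, a token signed with one key fails signature verification against the other, so neither token kind can stand in for the other.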

FACT

Lifespan context manager for DB init

FastAPI's lifespan parameter replaces deprecated on_startup/on_shutdown. Use an async context manager for clean resource management.

PATTERN

Active user enforcement in dependency

Check is_active in the auth dependency, not in individual routes. Ensures deactivated users are immediately locked out across all endpoints.

Methodology

How the benchmark was structured and executed

Test Design

  • Same spec and test suite for both conditions
  • 20 tests per session, 2 sessions = 40 total
  • Session 2 builds on Session 1 (RBAC extends auth)

Control Condition

  • Raw claude -p with the spec pasted directly
  • No planning framework or design phase
  • No knowledge accumulation between sessions

Treatment Condition

  • Pre-designed beads via /lavra-design
  • Executed with /lavra-work
  • Session 2 designed after Session 1 work (knowledge available)

Environment

  • Model: Claude Sonnet 4.6 for both conditions
  • Platform: macOS, identical machine
  • Isolated working directories per condition

Prompts Used

The exact prompts passed to claude -p in each condition

Session 1: Authentication

Control Session 1 prompt
You are implementing a FastAPI REST API.
Read the spec below and implement it completely.
Do NOT read or modify files in the tests/ directory
-- those are the automated spec checker and must
not be changed.

After implementation, run:
  uv pip install -e '.[dev]' &&
  pytest tests/test_session1_auth.py -v

If any tests fail, fix the implementation until
all 20 tests pass. Do NOT modify the tests.

Here is the spec:

$SPEC1
Lavra Session 1 prompt
/lavra-work

Implement all beads. After implementation, run:
  uv pip install -e '.[dev]' &&
  pytest tests/test_session1_auth.py -v

If any tests fail, fix the implementation until
all 20 tests pass. Do NOT read or modify files
in the tests/ directory.

CRITICAL: Before closing each bead, you MUST log
at least one knowledge comment using:
  bd comments add <BEAD_ID>
    "LEARNED: <what you discovered>"

Use prefixes:
  LEARNED:, DECISION:, FACT:, PATTERN:, DEVIATION:

Do NOT close any bead until you have logged at
least one comment on it.

Session 2: RBAC

Control Session 2 prompt
You are extending an existing FastAPI REST API with
role-based access control. The auth system from
Session 1 is already implemented in this project.
Read the existing code first, then implement the
spec below.

Do NOT read or modify files in the tests/ directory.

After implementation, run:
  uv pip install -e '.[dev]' &&
  pytest tests/test_session2_rbac.py -v

If any tests fail, fix the implementation until
all 20 tests pass. Do NOT modify the tests.

Here is the spec:

$SPEC2
Lavra Session 2 prompt
/lavra-work

Implement all beads. After implementation, run:
  uv pip install -e '.[dev]' &&
  pytest tests/test_session2_rbac.py -v

If any tests fail, fix the implementation until
all 20 tests pass. Do NOT read or modify files
in the tests/ directory.

CRITICAL: Before closing each bead, you MUST log
at least one knowledge comment using:
  bd comments add <BEAD_ID>
    "LEARNED: <what you discovered>"

Use prefixes:
  LEARNED:, DECISION:, FACT:, PATTERN:, DEVIATION:

Do NOT close any bead until you have logged at
least one comment on it.

Generated from lavra-test benchmark suite
Run ID: control-20260317-220850 vs lavra-20260318-080234