Skip to content

Letting AI Review AI: Cross-Model Code Review with Claude, Codex, and Gemini

A model is the worst possible reviewer of its own work. The blind spot that produced a bug is the same one that reads the bug back as fine — a model has a hard time catching the class of mistake it is itself prone to making.

Ask the model that wrote a spec to review that spec, and it mostly agrees with itself. What you want is a different model, trained differently, looking at the same work with a fresh prior.

It turns out the three major terminal AI agents — Anthropic's claude, OpenAI's codex, and Google's gemini — all have a headless mode that takes a prompt and prints an answer. That means one of them can shell out to the others. So Claude Code writes something, then runs Codex and Gemini over it as independent reviewers, reads their findings, checks every claim against the real code, and reconciles. One model writes; the others try to break it.

This is not a framework. It is three CLIs and a pipe.

The headless trick

Each agent exposes the same shape — a non-interactive mode that reads a prompt and writes an answer — which is exactly why they compose:

CLIHeadless invocationRead-only / safe modeJSON output
claudeclaude -p "<prompt>"default (no edits without tool calls)--output-format json
codexcodex exec "<prompt>" — note: no -p flag-s read-only--json
geminigemini -p "<prompt>"--approval-mode plan-o json

The first trap is right there in the table: Codex has no -p flag. It is the exec subcommand. People paste codex -p out of habit and it fails.

Two things that make the reviews good instead of decorative

1. Pass the prompt as a file piped to stdin. Inline prompts get unwieldy and shell-escaping eats them. Write the reviewer instructions to a file, pipe it in, capture the result to another file:

bash
cat .cross-review/codex-prompt.txt \
  | codex exec -s read-only --skip-git-repo-check \
      -o .cross-review/codex-review.md -      # trailing - = read prompt from stdin
bash
cat .cross-review/gemini-prompt.txt \
  | gemini -p - --approval-mode plan \
      2>&1 | tee .cross-review/gemini-review.md

2. Make the reviewer read the real code — not the artifact's claims about itself. This is the single biggest quality lever. The reviewer prompt does not paste the code; it lists actual file paths and runs the reviewer in a read-only sandbox (-s read-only for Codex, --approval-mode plan for Gemini) so it can explore the repository and ground its findings in real function signatures, return shapes, and tests. The difference is the difference between "this looks plausible" and "I grepped for parse_dotd and it returns empty peaks by design, which contradicts §4 of the spec."

A reviewer prompt that works has a consistent skeleton:

You are a senior staff engineer doing a critical, opinionated review of <the artifact>.
Be direct. Do not just praise it.

## What to read
1. <the artifact under review>
2. <each grounding path, with a one-line "why this matters"> — read the ACTUAL
   code to ground your review in reality, not the artifact's claims about itself.

## What to look for
- Internal contradictions / ambiguity
- Misalignment with the real codebase (signatures, return shapes, invariants, error types)
- Architecture that will hurt later; missing risks (concurrency, caching, restart
  semantics, partial failure, security, offline behaviour); scope realism

## Output format
- Strengths (2-3 bullets)
- Concerns ranked P0 (ship-blocking) / P1 (fix before this lands) / P2 (nice-to-fix)
- Suggested edits (quote the passage, propose the replacement)
- Open questions a senior reviewer wants answered before sign-off

Two shapes: relay and jury

Sequential relay. One model writes, the next reviews, the orchestrator verifies and revises, then a third reviews the corrected version. The trick that makes the second reviewer worth its tokens is priming it to beat the first:

Important: this was already reviewed once by a different LLM and revised in response. You are a second pair of eyes — your job is to find what the first reviewer missed. Do not repeat findings already addressed.

Without that line, reviewer two mostly re-states reviewer one. With it, reviewer two goes hunting for blind spots — concurrency, caching invalidation, restart semantics, cross-section contradictions — the things the first pass did not think to look for.

Parallel jury. Fan the same prompt out to all reviewers at once, then reconcile by agreement: a concern both models raise is high-confidence; a concern only one raises is a coin-flip you adjudicate by reading the code. This shape is the right one for scoring or grading work, where you want independent votes rather than a relay. We have used exactly this — a Claude + Codex + Gemini panel, reconciled by median score and spread, with disagreement routed to a human — to grade a large batch of assignments in a way that was repeatable and auditable.

The distinction matters: the relay folds each correction in before the next look; the jury keeps the votes independent so that agreement actually means something.

A real session

We ran the relay on a design spec for an analysis toolkit we were building — a deterministic measurement-analysis tool shipped as a CLI, with a proposed web workbench on top of it.

Codex reviewed first, exploring the real codebase read-only, and came back with three P0 issues, six P1, and three P2 — most of them grounded in code the author model had genuinely glossed over: function signatures that did not match the spec's description, a parser that returned empty results by design where the spec assumed populated ones.

Then Gemini reviewed, primed as the second pair of eyes, and found five net-new issues Codex had missed — including a demo-day fragility (a dev server that would not survive an analyst's laptop) and a cold-start cost in a library load that nobody had costed.

These were not rubber stamps. They were defects the model that wrote the spec could not see in its own work.

The discipline that keeps it honest

Here is the part that is easy to skip and expensive to get wrong: the orchestrator never blind-applies a finding. After each external review, Claude Code checked the cited signatures, dataclasses, and tests itself, and stated explicitly which claims it verified, which it could not, and which were simply wrong because the reviewer had misread the code. Only verified findings graduated into the revision.

The external models are a source of hypotheses. The orchestrator is the arbiter that checks them against ground truth. Skip that step and you have not added a safety net — you have built a machine for laundering one model's hallucination through another and stamping it "peer reviewed." Cross-model review is only as good as the verification pass behind it.

Reviews are an opt-in gate, not a reflex

Each external review costs roughly 10–30 seconds and a few thousand tokens. This is worth it for a spec, a plan, or a meaningful diff — not for every keystroke. Wire it to a slash command or a pre-commit checkpoint, not to autosave.

Where it bites

The honest list, from real runs:

  • Codex refuses to run outside a Git repo. Add --skip-git-repo-check, or it stops with "Not inside a trusted directory."
  • Codex blocks on stdin in scripts. Feed it the file pipe shown above, or stdin=subprocess.DEVNULL if you drive it from Python.
  • Naming an unknown Gemini model degrades silently. -m gemini-3-pro did not error — it just produced weaker output until we dropped the flag and let it use the default model. Do not specify a model id you have not confirmed exists on the machine.
  • Gemini is noisy on stderr. It prints [ERROR] [IDEClient] Failed to connect to IDE companion extension in headless mode. Harmless. It still answers. Read stdout.
  • timeout is not on macOS by default. A small thing that breaks copy-pasted scripts.

Freezing it into a tool

For a couple of weeks we drove this by hand — "let Codex review the spec," "try Gemini too." It worked, but it lived in muscle memory and a few throwaway prompt files. The natural next step was to make it a real tool: a /cross-review slash command that knows the invocation quirks, writes the grounded reviewer prompts, runs the relay, fact-checks the findings against the actual code, and hands back a single reconciled verdict.

/cross-review                          # review the current working-tree diff
/cross-review path/to/spec.md          # review a specific artifact
/cross-review --parallel               # jury mode (independent votes) instead of relay
/cross-review --three                  # add a fresh claude -p as a third reviewer
/cross-review path/to/plan.md --apply  # apply the P0/P1 fixes the reviewers and you agree on

The command is the technique, frozen: grounded prompts, read-only sandboxes, the "second reviewer, find what the first missed" priming, and — most importantly — the verification step that makes a multi-model review trustworthy instead of merely impressive.

Download the command

Grab the /cross-review command and drop it into Claude Code: download cross-review-command.zip. Unzip, move cross-review.md into ~/.claude/commands/, start a new session, and run /cross-review. It expects the codex and gemini CLIs on your PATH (and claude for the --three option) — the README in the zip has the details.


The newest, most-hyped model is not the unlock here. The unlock is structural: different models fail differently, and a finding two of them reach independently is worth more than a finding one of them reaches confidently. Put a fact-checker in the middle who answers to the actual code, and "AI reviewing AI" stops being a gimmick and starts catching real bugs.

For the foundation this builds on, see Why VS Code Is Probably the Only AI Workbench Your Company Needs. If you want help wiring multi-model review into your team's workflow, get in touch.

Want to discuss a project like this?

SFLOW helps companies implement practical AI solutions — from custom platforms to industrial integrations with your existing systems.

SFLOW BV - Polderstraat 37, 2491 Balen-Olmen