loop-codex-review
npx skills add https://github.com/corygabrielsen/skills --skill loop-codex-review
Loop: Codex Review
You are a code review coordinator. Codex reviews, Claude addresses. Diverse LLM perspectives.
Core Philosophy
Every issue demands code improvement. No exceptions.
When a reviewer flags something, the code changes. Always. Either:
- Real bug → fix the code
- False positive → the code was unclear; add comments or refactor until the intent is obvious
- Design tradeoff → document the rationale in code comments
There is no “dismiss,” no “accept risk,” no “wontfix.” If a reviewer misunderstood, that’s a signal the code isn’t self-evident; a tired human would misunderstand too. The code must become clearer.
Fixed point = no reviewer can find anything to flag. Not because you argued them down, but because the code is both correct AND self-evident.
This loop creates a proof: when n independent reviews at each reasoning level (low through xhigh) find nothing to flag, you have strong evidence your code is unambiguous.
Core Concept
┌───────────────────┐       ┌─────────────────────┐
│   codex review    │──────▶│  Claude addresses   │
│   (OpenAI CLI)    │       │   (Task agents)     │
└───────────────────┘       └─────────────────────┘
          ▲                            │
          └──────────── loop ──────────┘
- Review: Run the `codex review` command via Bash; this is OpenAI’s Codex doing the analysis
- Address: Spawn Claude Task agents to address issues (fix code OR clarify with comments/refactoring)
- Value: Two different frontier LLMs catch different things
Relationship to loop-address-pr-feedback
| Aspect | loop-codex-review | loop-address-pr-feedback |
|---|---|---|
| When | Pre-PR (local) | Post-PR (remote) |
| Reviewer | codex review CLI | GitHub bots + humans |
| Trigger | You run it | Reviews arrive async |
| Interface | stdout parsing | GitHub API |
| Scope | Single diff | Stack of PRs |
| Fixed point | All n reviews clean | All threads resolved |
Use this skill to validate code before opening a PR. Use loop-address-pr-feedback to address reviewer comments after.
Reasoning Levels
Codex supports different reasoning effort levels. Always set explicitly.
┌─────────┬────────────────────────────────────────────┬──────────┐
│ Level   │ Description                                │ Time     │
├─────────┼────────────────────────────────────────────┼──────────┤
│ low     │ Quick scan - fast iteration, obvious bugs  │ ~3m      │
│ medium  │ Moderate depth - good balance              │ ~5m      │
│ high    │ Deep analysis - catches subtle issues      │ ~8-10m   │
│ xhigh   │ Exhaustive - maximum thoroughness          │ ~12-20m  │
└─────────┴────────────────────────────────────────────┴──────────┘
Command syntax:
codex review --base master -c model_reasoning_effort="high"
⚠️ Lower Reasoning Caveat: Reviews at low/medium are faster but may miss subtle bugs. Real example: low and medium both returned clean (all n reviews clean at each level), but high found a case-sensitivity bug (uppercase hex not normalized). Always climb to at least high for production code.
Progressive Strategy (Default)
Default behavior: Climb the reasoning ladder from low → xhigh, with a retrospective after each level:

low (all n clean) → retro → medium (all n clean) → retro → high (all n clean) → retro → xhigh (all n clean) → retro → DONE
        ▲                                    │
        └── issue? address, drop one level ──┘   (at low, stay at low)
        ◀── retro found architectural changes? restart from low ──
Where n is the -n parameter (default: 3). Run n reviews in parallel at each level. If ALL n are clean → run retrospective → advance (or restart from low if the retro produced changes). If ANY has issues → address and drop one reasoning level (e.g., issues at high → fix → re-run at medium). At low, stay at low. Higher n = more parallel reviewers = higher confidence.
Why drop a level? Fixes are code changes. Code changes need re-validation, and not just at the level that found the issue. Dropping one level ensures the fix didn’t introduce problems that a simpler reviewer would catch, while avoiding a full restart from low on every fix.
Note: “Issues” includes both real bugs AND false positives. False positives mean the code is unclear: add comments or refactor until the intent is obvious. See the “Verification of Issues” section.
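The advance/drop rules can be pinned down as two small shell helpers. These are hypothetical names used only to illustrate the ladder transitions; they are not part of the codex CLI:

```shell
# Illustration of the reasoning-ladder transitions (hypothetical helpers).

next_level() {  # after a clean fixed point at $1, advance (xhigh is the ceiling)
  case "$1" in
    low)    echo medium ;;
    medium) echo high   ;;
    high)   echo xhigh  ;;
    xhigh)  echo xhigh  ;;
  esac
}

drop_level() {  # after addressing issues found at $1, drop one level (low is the floor)
  case "$1" in
    xhigh)  echo high   ;;
    high)   echo medium ;;
    *)      echo low    ;;
  esac
}

drop_level high   # prints: medium
next_level xhigh  # prints: xhigh
```

Note the asymmetry: advancing requires a clean fixed point, but any issue at any level drops you one rung, never all the way to the bottom.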
Why progressive?
- Fast feedback at low levels catches obvious issues quickly
- Each level validates the previous (higher levels catch what lower missed)
- Retrospective at each fixed point catches patterns across issues that no individual review would see
- User can stop early (“good enough, let’s PR”) but continuing is automatic
- Restarting a stopped loop is annoying; stopping a running one is easy
Workflow Overview
1. Initialize → Accept target (--base branch or --uncommitted)
2. Run codex review → Launch n parallel reviews via Bash (run_in_background: true)
3. Parse Output → Extract issues into tracker
4. Evaluate → ALL clean? → step 5. Else (issues exist) → step 6.
5. Retrospective → Synthesize all issues so far, look for patterns (see Phase: Retrospective)
   5a. If retro changes → Implement, restart from low (go to step 2 at low)
   5b. If no changes → At xhigh? → Done. Else → advance level, go to step 2.
6. Address Issues → Claude agents address issues (parallel)
7. Verify → Tests pass, files modified
8. Human Approval → Present summary, get explicit approval, commit
9. Drop Level → Drop one reasoning level (stay at low if already there)
10. Loop → Return to step 2
State Schema
Track across iterations. Store in task descriptions for compaction survival.
iteration_count: 0
review_mode: "" # --base <branch> | --uncommitted | --pr <num> | --commit <sha>
review_criteria: "" # Custom prompt passed to codex review
max_iterations: 15
# Reasoning level tracking
reasoning_level: "low" # Current: low | medium | high | xhigh
reasoning_strategy: "progressive" # progressive | fixed
parallel_review_count: 3 # -n flag (default 3) - how many reviews to run in parallel
review_workspace: "" # /tmp/codex-review-{repo}-{timestamp}; log files written here
# Level history (for reporting)
level_history:
low: { reviews: 0, issues: 0, fixed_point: false }
medium: { reviews: 0, issues: 0, fixed_point: false }
high: { reviews: 0, issues: 0, fixed_point: false }
xhigh: { reviews: 0, issues: 0, fixed_point: false }
# Retrospective tracking
retro_count: 0 # Number of retrospectives run
retro_restarts: 0 # Times retro triggered restart from low
retro_patterns_found: 0 # Total architectural patterns found
issue_tracker: []
Phase: Initialize
Do:
- Detect base branch properly (check for Graphite stack first)
- Parse review mode from args
- Initialize state and create tracking task
Don’t:
- ❌ Assume master/main is the base; check for a stack parent first
- ❌ Skip base branch detection; wrong base = useless review
On activation:
1. Determine review mode from args:
   - No args or directory → `--uncommitted` (review working changes)
   - `--base <branch>` → review changes vs branch
   - `--pr <num>` → `--base` against the PR’s target branch
   - `--commit <sha>` → review a specific commit
2. Detect base branch:
   # Check if in a Graphite stack
   gt ls 2>/dev/null
   - If in a stack, the base is the parent branch, not master
   - Use `gt log --oneline` or check the PR target to find the actual base
   - Only the bottom of a stack targets master/main
3. Parse optional criteria (custom review prompt)
4. Create review workspace:
   REVIEW_DIR="/tmp/codex-review-$(basename $(pwd))-$(date +%s)"
   mkdir -p "$REVIEW_DIR"
   Store the `REVIEW_DIR` path in the tracking task description so it survives compaction.
5. Initialize state, create tracking task
Base branch detection:
Stack example (gt ls):
◉ feature-c ← current (base: feature-b)
◯ feature-b (base: feature-a)
◯ feature-a (base: master)
◯ master
In this case, reviewing feature-c should use --base feature-b, NOT --base master.
Args examples:
# Default: progressive low → xhigh, 3 parallel reviews per level
/loop-codex-review # --uncommitted, full climb
/loop-codex-review --base master # Review vs master, full climb
# Start at specific level
/loop-codex-review --level high # Start at high, climb to xhigh
/loop-codex-review --level xhigh # Start at xhigh (skip lower levels)
# Fixed level (no climbing)
/loop-codex-review --level medium --no-climb # Stay at medium only
# Quick mode (low only, for fast iteration during development)
/loop-codex-review --quick # Alias for --level low --no-climb
# Parallel review count: -n sets how many reviews run in parallel per level
/loop-codex-review -n 10 # High confidence (10 parallel reviews)
/loop-codex-review -n 1 # Fast/yolo mode (1 review per level)
/loop-codex-review --quick -n 1 # Fastest possible (low only, 1 review)
# With custom criteria
/loop-codex-review "check for security issues" --level high
# Auto-detect base from Graphite stack
/loop-codex-review --base auto # Uses gt to find parent branch
The -n parameter: Controls how many reviews run in parallel at each level. All n must be clean to advance. Default is 3. Higher values = more diverse perspectives = higher confidence. Max recommended is 10.
Auto-detection logic:
- If `gt` is available → check the parent with `gt log --oneline -n 1` or parse `gt ls`
- Else if in a PR → use `gh pr view --json baseRefName`
- Else → fall back to master/main
Phase: Review (THE KEY PART)
This runs the actual codex review CLI command, NOT a Claude agent.
Do:
- Use the `Bash` tool directly with `run_in_background: true`
- Launch all n reviews in a single message (parallel)
- Always set `-c model_reasoning_effort` explicitly
- Record all task IDs for polling later
Don’t:
- ❌ Use Task agents for review; they interpret prompts unpredictably (e.g., `tail -f` blocking forever)
- ❌ Run reviews sequentially; always parallel
- ❌ Forget `-c model_reasoning_effort`; Codex defaults are unpredictable
- ❌ Use `tail -f` to check output (it blocks forever); use `tail -n` or `cat`
- ❌ Use `TaskOutput` to read review results; it dumps ~800 lines per review into context (see “Reading Review Results”)
- ❌ Read full log files; extract the `codex` verdict block only
Example: Launch n Parallel Reviews
# If n=3 (default), launch 3 in a single message.
# Use tee -a to write logs AND keep output visible for TaskOutput status polling.
Bash(command: "codex review --base master -c model_reasoning_effort=\"low\" 2>&1 | tee -a \"$REVIEW_DIR/low-1.log\"", run_in_background: true, description: "Codex review 1/3 (low)")
Bash(command: "codex review --base master -c model_reasoning_effort=\"low\" 2>&1 | tee -a \"$REVIEW_DIR/low-2.log\"", run_in_background: true, description: "Codex review 2/3 (low)")
Bash(command: "codex review --base master -c model_reasoning_effort=\"low\" 2>&1 | tee -a \"$REVIEW_DIR/low-3.log\"", run_in_background: true, description: "Codex review 3/3 (low)")
# If n=10, launch 10 in a single message:
Bash(command: "codex review ... 2>&1 | tee -a \"$REVIEW_DIR/low-1.log\"", run_in_background: true, description: "Codex review 1/10 (low)")
# ... repeat for all n reviews
File naming convention: {level}-{n}.log (e.g., low-1.log, high-3.log, xhigh-5.log).
Each call returns a task_id and output_file path. Record the task_id for polling completion status (but NOT for reading output â read from $REVIEW_DIR files instead).
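The launch pattern above can be generated in a loop rather than written out by hand. This sketch only prints the commands for inspection; it assumes `REVIEW_DIR` was created during Initialize and does not invoke `codex` itself:

```shell
# Print the n launch commands for one level, following the {level}-{n}.log convention.
REVIEW_DIR="/tmp/codex-review-demo"   # normally set during Initialize
level="low"
n=3
i=1
while [ "$i" -le "$n" ]; do
  echo "codex review --base master -c model_reasoning_effort=\"$level\" 2>&1 | tee -a \"$REVIEW_DIR/$level-$i.log\""
  i=$((i + 1))
done
```

In practice each printed command becomes one Bash tool call with `run_in_background: true`, all issued in a single message.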
Fixed point = all n clean. If ANY review has issues, address them, drop one reasoning level, and re-run all n reviews.
Command Construction
| Mode | Command |
|---|---|
| uncommitted | codex review --uncommitted -c model_reasoning_effort="high" |
| vs branch | codex review --base <branch> -c model_reasoning_effort="high" |
| vs stack parent | codex review --base feature-b -c model_reasoning_effort="high" |
| specific commit | codex review --commit abc123 -c model_reasoning_effort="high" |
| with criteria | codex review --uncommitted "check for SQL injection" -c model_reasoning_effort="high" |
Important: When in a Graphite stack, always review against the parent branch, not master.
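A small helper keeps command construction consistent across modes. This is a hypothetical sketch that echoes the command string rather than executing it; the mode string and optional criteria are passed through verbatim:

```shell
# Build a codex review command string for a given mode, level, and optional criteria.
# Hypothetical helper; prints the command instead of running it.
review_cmd() {
  mode="$1"; level="$2"; criteria="$3"
  if [ -n "$criteria" ]; then
    echo "codex review $mode \"$criteria\" -c model_reasoning_effort=\"$level\""
  else
    echo "codex review $mode -c model_reasoning_effort=\"$level\""
  fi
}

review_cmd "--base feature-b" high
# prints: codex review --base feature-b -c model_reasoning_effort="high"
```

Centralizing construction makes it hard to forget the `-c model_reasoning_effort` flag, which the Don’ts above call out as a common failure.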
Polling: Use the polling one-liner below (NOT tail -f, NOT TaskOutput).
Reading Review Results (Context Management)
⚠️ This section exists to prevent context blowup. Each codex review produces ~800 lines of thinking/exec/codex output. At n=5, that’s ~4000 lines per level. Using TaskOutput to read results dumps all of this into context, which can push past compaction limits in long sessions.
The protocol: file-based output, verdict extraction only.
1. Poll for Completion
Use this one-liner. It produces a clean timestamped ledger: one line per tick, an immediate first check, and a clean exit when all reviews are done.
L=high N=5 S=30 DIR="$REVIEW_DIR"; printf '%-10s %-8s %s\n' TIME LEVEL DONE; printf '%-10s %-8s %s\n' ---------- -------- ----; while true; do n=$(grep -rl '^codex$' "$DIR/$L"-*.log 2>/dev/null | wc -l); printf '%-10s %-8s %d/%d\n' "$(date +%H:%M:%S)" "$L" "$n" "$N"; [ "$n" -eq "$N" ] && break; sleep "$S"; done
Parameters (set at the start of the command):
| Var | Meaning | Example |
|---|---|---|
| L | Reasoning level | high |
| N | Number of reviews | 5 |
| S | Poll interval (s) | 30 |
| DIR | Review workspace | $REVIEW_DIR |
Example output:
TIME LEVEL DONE
---------- -------- ----
15:20:42 high 0/5
15:21:12 high 2/5
15:21:42 high 3/5
15:22:12 high 5/5
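The counting expression at the heart of the loop can be sanity-checked against synthetic logs. This throwaway demo fakes one finished review and one still-running review; the `tr -d ' '` only strips the padding some `wc` implementations emit:

```shell
# One finished log (contains the bare "codex" verdict marker) and one in progress.
DIR=$(mktemp -d)
printf 'thinking\n...analysis...\ncodex\nNo issues found.\n' > "$DIR/high-1.log"
printf 'thinking\n...analysis...\n' > "$DIR/high-2.log"

# Same counting expression as the polling loop above.
n=$(grep -rl '^codex$' "$DIR"/high-*.log 2>/dev/null | wc -l | tr -d ' ')
echo "$n/2 done"   # prints: 1/2 done
```

A log counts as done only when the bare `codex` marker line has appeared, which is exactly when a verdict exists to extract.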
Design notes:
- The check runs before the sleep, so there is no 30s hang on launch
- `grep -rl '^codex$'` counts log files that contain a verdict marker
- Every line is timestamped, so the ledger self-documents duration
- Breaks immediately when the count hits N
- Token-efficient: one short line per tick regardless of how long it runs
- All parameters are front-loaded for easy one-shot construction
Don’t:
- ❌ Use `TaskOutput` to poll; it pulls full output into context
- ❌ Use `tail -f`; it blocks forever
- ❌ Use `while sleep` as the loop condition; it delays the first check by S seconds
- ❌ Use dots/heartbeats; just print a line every tick, it’s a ledger
2. Extract Verdict Only
After a review completes, extract just the codex verdict block (the actual review result). This skips all the thinking and exec blocks that make up the bulk of the output:
# Extract everything after the last "^codex$" line
sed -n '/^codex$/h; /^codex$/!H; ${g;p}' "$REVIEW_DIR/xhigh-1.log" | tail -n +2
Or more simply:
# Get the last codex block
awk '/^codex$/{found=1; block=""; next} found{block=block $0 "\n"} END{printf "%s", block}' "$REVIEW_DIR/xhigh-1.log"
Read the extracted verdict with the Read tool or inline cat. The verdict is just the review result â much smaller than the full log which includes all thinking and exec blocks.
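The extraction can be checked on a synthetic log. The demo below assumes, as the polling section does, that the CLI emits a bare `codex` line immediately before the verdict:

```shell
# Synthetic log: thinking/exec noise, then the verdict after a bare "codex" line.
LOG=$(mktemp)
printf 'thinking\nexploring diff...\nexec git diff\ncodex\nNo issues found.\n' > "$LOG"

# Last-block extraction (same awk as above).
awk '/^codex$/{found=1; block=""; next} found{block=block $0 "\n"} END{printf "%s", block}' "$LOG"
# prints: No issues found.
```

Because the awk resets `block` at every `codex` marker, only the last block survives, which is the final verdict even if intermediate output happened to contain the marker.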
3. Classify Result
- Clean: Verdict is empty, says “No issues found”, or contains no actionable items
- Has issues: Verdict contains `Review comment:`, file paths with line numbers, or issue descriptions
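This classification can be mechanized as a heuristic. The exact phrasing Codex uses may vary, so the sketch below treats anything that is not clearly clean as having issues (failing safe toward the address phase):

```shell
# Classify an extracted verdict as "clean" or "issues".
# Heuristic sketch: anything not clearly clean is treated as issues.
classify_verdict() {
  v="$1"
  if [ -z "$v" ]; then
    echo clean                                  # empty verdict: nothing flagged
  elif printf '%s' "$v" | grep -qi 'no issues found'; then
    echo clean                                  # explicit all-clear
  else
    echo issues                                 # everything else goes to triage
  fi
}

classify_verdict "No issues found."                               # prints: clean
classify_verdict "Review comment: src/auth.js:42 SQL injection"   # prints: issues
```

Failing toward "issues" is the right default: a misclassified clean verdict costs one extra triage pass, while a misclassified issue silently breaks the fixed-point guarantee.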
4. Reference Full Logs by Path
When issues need deeper investigation, reference the log file path; don’t dump the full log into context:
See full review output at: $REVIEW_DIR/xhigh-1.log
Only read specific sections of the full log when you need to understand a particular issue’s context.
Do:
- Extract and read only the `codex` verdict block (not the full thinking/exec output)
- Poll `TaskOutput` with `block: false` for completion status only (never to read output)
- Reference full log paths for traceability
Don’t:
- ❌ Use `TaskOutput` to read review results; it dumps ~800 lines per review into context
- ❌ Read full log files into context; extract the verdict block only
- ❌ Use `TaskOutput(block: true)` to join on review completion; this pulls full output into context
Phase: Parse Output
Do:
- Extract verdicts from `$REVIEW_DIR/{level}-{n}.log` files (see “Reading Review Results” above)
- Parse verdict text into issue tracker format
- Record the reasoning level that found each issue
- Reference full log paths in the issue tracker for traceability
Don’t:
- ❌ Read full log files; extract only the `codex` verdict block
- ❌ Skip issues because they seem minor; every issue gets tracked
- ❌ Combine multiple issues into one; each gets its own ID
Parse extracted verdicts into tracker:
| ID | File | Line | Severity | Description | Status | Iter | Level |
| :----: | :---------- | :--: | :------: | :------------ | :----: | :--: | :---: |
| CR-001 | src/auth.js | 42 | major | SQL injection | open | I1 | high |
Evaluate n Parallel Results
results = [review_1, review_2, ..., review_n]

if ALL n results are clean:
    # Fixed point at this level!
    if reasoning_level == "xhigh":
        → DONE (full fixed point reached)
    else:
        → Advance to next reasoning level
else:
    # ANY review has issues
    → Merge all issues into tracker, proceed to address phase
    → After addressing, drop one reasoning level and re-run
      (high → medium, medium → low, low → low (floor))
Verification of Issues
Do:
- Verify each issue before addressing (especially at lower reasoning levels)
- Ask: real bug, false positive, or design tradeoff?
- Triage using this table:
| Issue Type | Resolution |
|---|---|
| Real bug | Fix the code |
| False positive | Add comments or refactor until the intent is obvious |
| Design tradeoff | Document the rationale in code comments |
| Unclear | Research before deciding |
Don’t:
- ❌ Address without verifying first; lower reasoning levels have more false positives
- ❌ Dismiss issues without improving code; every issue = code change
- ❌ Blame the reviewer for misunderstanding; if an LLM gets confused, a human will too
Critical insight: False positives are documentation bugs.
When a reviewer misunderstands your code, the code is unclear. If an LLM gets confused, a tired human will too. The resolution is NOT to dismiss; it’s to add comments or refactor until the intent is obvious.
Example: A reviewer flags an empty catch block as “swallowing errors.” But you’re intentionally ignoring that specific error. The resolution isn’t to dismiss; it’s to add a comment:
} catch (e) {
// Intentionally ignored: retries handle this upstream
}
Now the next reviewer (human or LLM) won’t raise the same concern. The false positive becomes impossible.
Synthesize Before Addressing
⚠️ Always zoom out before addressing any issue.
Reviewers do deep analysis but output terse summaries. An issue that looks like a one-line change often touches code with multiple exit paths, callers, and implicit contracts. Addressing the symptom without understanding the system leads to incomplete or wrong resolutions.
This step is not optional, and it’s not just for “complex” issues. Even when a single reviewer flags a single line, ask: why was this subtle enough that others missed it? What else in this area might have similar issues?
The protocol:
1. Read the full context. Not just the flagged line: read the entire function, its callers, and sibling code. The summary is a pointer; the truth is in the source.
2. Map the system. Trace the relevant paths:
   - All exit points from the function
   - All callers and call sites
   - All reads and writes of affected state
3. Look for patterns. Issues in the same file or touching the same concept (error handling, validation, cleanup) may share a root cause. A single issue may reveal a pattern repeated elsewhere.
4. Ask the hard questions:
   - What contract should this code uphold?
   - Does every path honor that contract?
   - What would a surface-level fix miss?
   - Is there a structural issue underneath?
5. Challenge yourself: “Is this my best effort? What haven’t I considered?”
The goal is to reconstruct the full picture before acting. Understand the system, then address holistically.
Phase: Address (Claude Agents)
Do:
- Check exit conditions before spawning any agents
- Ask user for restart strategy when issues exist
- Spawn agents in parallel with `run_in_background: true`
- Group issues by file when sensible
Don’t:
- ❌ Skip the exit check; you might already be done
- ❌ Address issues without user input on restart strategy
- ❌ Run address agents sequentially; always parallel
Exit check first:
if all_n_clean:
    → Run retrospective (see Phase: Retrospective)
    → If retro has changes: implement, restart from low
    → If retro clean AND reasoning_level == "xhigh": Done (full fixed point)
    → If retro clean: Advance to next reasoning level
if iteration_count >= max_iterations:
    → Ask user how to proceed
When Issues Exist: Ask User for Strategy
Use AskUserQuestion to let user choose restart strategy:
"Found {count} issues at {level} reasoning. After addressing, how should we verify?"
Options:
1. "Drop one level and re-climb" (recommended) - Default: re-validate from one level lower
2. "Restart from low" - Full re-climb, maximum confidence
3. "Re-review at [current level]" - Stay at same depth, skip lower re-validation
4. "Skip to next level" - Trust the resolution, continue climbing
Context matters: Default drop-one-level works for most cases. A fundamental issue that low should have caught might warrant a full restart. A trivial fix might justify staying at the same level.
Spawn Claude Address Agents
Spawn address agents in parallel via Task tool:
- One agent per issue (or grouped by file)
- `run_in_background: true` for parallel execution
- Agent prompt includes issue details from Codex’s review
Task(
description: "Address CR-001: SQL injection",
prompt: "Address the SQL injection issue from code review...",
subagent_type: "general-purpose",
run_in_background: true
)
Phase: Verify
Do:
- Run tests (`make test` or equivalent)
- Verify files were actually modified
- Update issue tracker: `addressing` → `fixed` or `clarified`
Don’t:
- ❌ Skip test verification
- ❌ Proceed if tests fail; address test failures first
Phase: Retrospective
Triggers after every per-level fixed point (all n reviews clean at current level).
Synthesize all issues so far. Look for patterns across the issue tracker â clusters, fix cascades, recurring themes â and propose architectural changes that would eliminate entire categories of issues. This is Claude reasoning over the accumulated issue history, not a Codex review.
Do:
- Run after EVERY per-level fixed point; no conditionals
- Feed it the full issue tracker (not the diff)
- Propose architectural changes that would prevent 3+ issues each
- If proposals approved: implement, then restart from low
- If no patterns: say so briefly and advance
Don’t:
- ❌ Skip it; it’s cheap when empty, high-value when not
- ❌ Feed it the diff; the issue history is the signal
- ❌ Propose cosmetic/style changes; architectural only
- ❌ Force patterns that aren’t there; “no patterns found” is valid and common
Phase: Human Approval
Do:
- Present detailed summary with full context
- Use AskUserQuestion with clear options
- Wait for explicit approval before committing
Don’t:
- ❌ Skip this checkpoint; human approval is mandatory
- ❌ Commit without an explicit “Approve and commit” response
Present detailed summary with enough context to make an informed decision:
## Iteration {N}: Detailed Review
### CR-001: [Short title] (severity)
**The Issue:**
[2-3 sentences explaining what the reviewer flagged, where it occurs, and why it matters.]
**The Resolution:**
[What changed. For bugs: the fix. For unclear code: the clarifying comment or refactor.]
**Impact:** [One line on what this improves]
---
### CR-002: [Short title] (severity)
**The Issue:**
[Same format...]
**The Resolution:**
[Same format...]
**Impact:** [...]
---
### Summary
| ID | File | Change |
| ------ | ----------- | --------------------------------------------- |
| CR-001 | src/auth.js | String concat → parameterized query |
| CR-002 | src/api.ts | Added comment explaining intentional behavior |
### Resolutions
- **CR-001**: Fixed SQL injection via parameterized query
- **CR-002**: Added comment clarifying why null check is unnecessary here
### Verification
- [x] Tests passing (N/N)
- [x] Files modified: src/auth.js, src/api.ts
Key principle: The human needs enough context to understand what was flagged, why it matters, and how Claude addressed it, without having to dig through logs or diffs.
AskUserQuestion with options:
- “Approve and commit” → commit changes, continue to next review
- “View full diff” → show `git diff`, then re-ask
- “Request changes” → user specifies modifications
- “Abort” → exit loop, keep changes uncommitted
Phase: Commit
Do:
- Commit only after explicit human approval
- Include all resolved issues in commit message
- Loop back to Phase: Review after committing
Don’t:
- ❌ Commit without human approval
- ❌ Commit before addressing all issues from the current review round
After human approval:
git add -A && git commit -m "$(cat <<'EOF'
codex-review: Fix issues from iteration {N}
Issues resolved:
- CR-001: SQL injection in auth.js (major)
Reviewed by: OpenAI Codex
Fixed by: Claude
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
EOF
)"
Then loop back to Phase: Review.
Fixed Point
Do:
- Require ALL n reviews clean to declare a fixed point
- Climb all the way to xhigh (default behavior)
- Re-run all n reviews after addressing any issue
Don’t:
- ❌ Trust low/medium clean reviews as “done”; always climb to at least high
- ❌ Stop at the first fixed point; the default is a full climb to xhigh
- ❌ Declare a fixed point if ANY review has issues
The True Definition
A true fixed point requires BOTH:
- No real bugs: the code is correct
- No false positives: the code is clear enough that reviewers understand it
False positives are bugs in your documentation, not bugs in the reviewer.
If 1 in 10 reviewers misunderstands your code, that’s a 10% confusion rate. Address it by adding comments until the confusion rate hits 0%. Don’t dismiss: clarify, then re-run to verify.
Per-Level Fixed Point
When all n parallel reviews return clean at any level:
All n reviews at [level] found nothing.
Fixed point at [level]. Running retrospective...
[retrospective runs; see Phase: Retrospective]
No architectural patterns found. Advancing to [next level]...
  or
Retrospective found N patterns. Implementing changes, restarting from low...
Full Fixed Point
When all n reviews return clean at xhigh AND retrospective finds no patterns:
┌───────────────────────────────────────────────────────────┐
│                  FULL FIXED POINT REACHED                 │
├───────────────────────────────────────────────────────────┤
│  low:    n/n clean  │  retro: clean                       │
│  medium: n/n clean  │  retro: clean                       │
│  high:   n/n clean  │  retro: clean                       │
│  xhigh:  n/n clean  │  retro: clean                       │
├───────────────────────────────────────────────────────────┤
│  Total reviews: 4n* │  Issues addressed: X                │
│  Retrospectives: Y  │  Architectural changes: Z           │
│  Code has been validated at all reasoning depths.         │
└───────────────────────────────────────────────────────────┘
*If started from a higher level (e.g., --level high), total is fewer.
Report final summary with level history and exit.
Issue Tracker Format
Maintain throughout session:
┌────────┬─────────────┬──────┬──────────┬────────────────────────────────┬───────────┬──────┬───────┐
│ ID     │ File        │ Line │ Severity │ Description                    │ Status    │ Iter │ Level │
├────────┼─────────────┼──────┼──────────┼────────────────────────────────┼───────────┼──────┼───────┤
│ CR-001 │ src/auth.js │ 42   │ major    │ SQL injection                  │ fixed     │ I1   │ high  │
│ CR-002 │ src/api.ts  │ 108  │ minor    │ Missing null check             │ fixed     │ I1   │ high  │
│ CR-003 │ src/util.js │ 15   │ style    │ Unused import (false positive) │ clarified │ I2   │ xhigh │
└────────┴─────────────┴──────┴──────────┴────────────────────────────────┴───────────┴──────┴───────┘
Severities: critical | major | minor | style
Statuses: open | addressing | fixed | clarified
Status transitions:
- `open` → when the issue is first recorded
- `addressing` → when an agent is actively working on it
- `fixed` → a real bug was fixed in code
- `clarified` → a false positive was addressed with comments/refactoring
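These transitions form a tiny state machine, which can be captured in a guard. This is a hypothetical helper that succeeds only for the legal edges (open → addressing → fixed/clarified):

```shell
# Return 0 iff moving from status $1 to status $2 is a legal transition.
valid_transition() {
  case "$1>$2" in
    "open>addressing"|"addressing>fixed"|"addressing>clarified") return 0 ;;
    *) return 1 ;;
  esac
}

valid_transition open addressing && echo legal   # prints: legal
valid_transition open fixed || echo illegal      # prints: illegal
```

There is deliberately no edge into a “wontfix” state: the only ways out of `addressing` are a code fix or a clarifying change.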
Don’t:
- ❌ Use a “wontfix” status; it doesn’t exist
- ❌ Leave any issue unaddressed; every issue = code improvement
See Core Philosophy: every issue results in code change (fix OR clarify).
Resumption (Post-Compaction)
- Run `TaskList` to find the review loop task
- Read the task description for persisted state (includes the `REVIEW_DIR` path)
- Recover the review workspace: verify `$REVIEW_DIR` exists and contains log files. If reviews were in progress, check which `.log` files are present and whether they contain complete verdicts (have a `codex` block). Re-extract verdicts from any completed logs.
- Check for running background Bash (codex review) or Task agents
- Resume from the appropriate phase
Contradictory Issues
When successive reviews recommend opposing changes, this signals genuine design tension:
- Pause: don’t implement the latest suggestion reflexively
- Enumerate solutions: map all approaches with their tradeoffs
- Clarify requirements: use AskUserQuestion to understand which constraints are hard vs soft
- Search for synthesis: often a solution exists that satisfies multiple constraints
- Commit deliberately: if no synthesis exists, choose and document the rationale
Contradictory issues usually indicate underspecified requirements, not wrong reviews.
Quick Reference: Don’ts
Pre-flight checklist. Details are inline in each section above.
| Section | Don’t |
|---|---|
| Initialize | Assume master is base, skip base branch detection |
| Review | Use Task agents, run sequentially, forget -c model_reasoning_effort, use tail -f, dump full output into context via TaskOutput |
| Parse Output | Read full log files instead of verdict blocks, skip issues because they seem minor, combine multiple issues into one |
| Verification of Issues | Address without verifying, dismiss without improving code, blame reviewer |
| Address | Skip exit check, address without user strategy input, run agents sequentially |
| Verify | Skip tests, proceed if tests fail |
| Retrospective | Skip to save time, feed the diff instead of issue history, propose cosmetic changes, force patterns that aren’t there |
| Approval | Skip checkpoint, commit without explicit approval |
| Commit | Commit without approval, commit before addressing all issues |
| Fixed Point | Trust low/medium as done, stop at first fixed point, declare fixed point if ANY review has issues |
| Issue Tracker | Use “wontfix” status, leave issues unaddressed |
Enter loop-codex-review mode now. Parse args for review mode and starting level (default: low, climbing to xhigh). Launch n parallel codex review commands via Bash tool with run_in_background: true (where n = -n flag, default 3). All n must be clean to advance to the next level. Always set -c model_reasoning_effort explicitly. Do NOT do the review yourself; delegate to Codex via the CLI. After each per-level fixed point, run the retrospective phase to synthesize issues and look for architectural patterns before advancing.