AI Code Review on Every Commit: What We Built and Why

A two-pass LLM reviewer that reads git diffs, checks JIRA ticket alignment, and posts structured findings to Bitbucket — built for startup teams where code review is always the bottleneck.


Every startup engineering team hits the same wall. The developers are productive — commits are landing, features are shipping. But code review is piling up. Nobody has time to review every commit carefully. Things slip through: a hardcoded secret, a missing null check, a SQL query that will fall over at 10x data, a frontend link without rel="noopener noreferrer".

This is the AI productivity paradox in practice. AI coding assistants make developers faster at writing code. They do not make reviewers faster at reviewing it. The bottleneck moves downstream.

As a fractional CTO serving multiple startup clients simultaneously, this is not a hypothetical — it's a daily operational constraint. I can't be a first-pass reviewer on every commit for every client. Neither can a small team's tech lead who is also writing code, handling escalations, and sitting in product discussions.

So we built a tool to be that first-pass reviewer. This is what it does, how it works, and what we learned building it.

What the Tool Does

The tool is a Python script — git-commit-reviews.py — that runs on a schedule (or on-demand for a specific commit). It does five things:

  1. Fetches recent commits from one or more git repositories, within a configurable time window.
  2. Extracts the unified diff for each commit using git show, with configurable context lines around each change.
  3. Pulls JIRA context if the commit subject contains a ticket reference — fetching the issue summary, type, status, and description to give the LLM intent context.
  4. Runs a two-pass LLM review: a draft pass that generates findings, followed by a reflection pass that removes anything not directly evidenced by changed code.
  5. Posts the structured review as a commit comment on Bitbucket, and saves it locally as a JSON file.
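Step 2's diff extraction boils down to one git show call per commit. A minimal sketch using subprocess (the function name and exact flag choices are illustrative, not the tool's actual code):

```python
import subprocess

def commit_diff(repo_path: str, commit: str, context_lines: int = 5) -> str:
    """Unified diff for one commit, with configurable context around changes."""
    result = subprocess.run(
        ["git", "-C", repo_path, "show",
         f"--unified={context_lines}",  # context lines around each change
         "--format=",                   # suppress the commit header, keep the diff
         commit],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

More context lines give the LLM more surrounding code to reason about, at the cost of a larger prompt.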

The output is a structured JSON review — commit metadata, list of changed files, and a set of observations each with a title, classification, severity, description, code snippet, and actionable suggestions.
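Concretely, a saved review might look something like this. The shape follows the description above; every value here is illustrative, not output from a real run:

```json
{
  "commit": "a1b2c3d",
  "repository": "payments-api",
  "files_changed": ["src/billing/invoice.py"],
  "observations": [
    {
      "title": "Hardcoded API key in configuration",
      "classification": "security",
      "severity": "high",
      "description": "An API key is committed as a string literal in an added line.",
      "snippet": "STRIPE_KEY = \"sk_live_...\"",
      "suggestions": ["Move the key to an environment variable or secret store."]
    }
  ],
  "jira_alignment": [
    {"ticket": "PROJ-142", "verdict": "aligned"}
  ]
}
```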

Why Two Passes?

The single biggest quality problem with LLM code reviews is speculative findings. A model reads a diff and notices, say, a function that calls an external API. It flags "this could throw an unhandled exception if the network is down." Technically possible. But the diff didn't change exception handling — that's pre-existing code. The observation is noise, not signal.

Noisy reviews are worse than no reviews. Engineers stop reading them.

The reflection pass fixes this. After the draft review is generated, a second LLM call is made with a focused instruction:

For each observation, check whether the problematic code is directly visible in the "+" lines of the diff. Remove any finding that references unchanged context lines, uses speculative language ("might", "could potentially", "may cause") without pointing to a concrete added line, or flags something that existed before this commit.

The reflection model is allowed to be a different (and potentially cheaper) model than the draft reviewer. In practice, the reflection pass removes 20–40% of observations from a typical review — almost all of them speculative. What remains is a set of grounded findings about code that was actually changed.

One implementation detail worth noting: the reflection pass rewrites the full observations array and updates the summary counts. But it doesn't know about JIRA alignment — that's a separate field added by the first pass. So before returning the refined review, we stash the jira_alignment array from the draft and re-inject it into the reflected output. The model never touches it.
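A minimal sketch of that stash-and-reinject step (function and parameter names are illustrative, not the tool's actual code):

```python
def refine(draft_review: dict, run_reflection) -> dict:
    """Run the reflection pass, preserving fields the model must not touch."""
    # Stash JIRA alignment: the reflection model never sees or rewrites it.
    jira_alignment = draft_review.get("jira_alignment", [])

    # The reflection call rewrites observations and updates summary counts.
    refined = run_reflection(draft_review)

    # Re-inject the draft's alignment verdicts into the reflected output.
    refined["jira_alignment"] = jira_alignment
    return refined
```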

JIRA Alignment: Did the Code Actually Do What the Ticket Asked?

For teams using JIRA, every commit should relate to a ticket. In practice, this relationship is often assumed rather than verified. A developer closes a ticket and the PR is approved, but nobody checked whether the code change actually addresses the acceptance criteria.

When JIRA integration is enabled, the tool:

  1. Extracts ticket keys from the commit subject using a configurable regex pattern (e.g. PROJ-\d+).
  2. Fetches the issue details from JIRA — summary, type, status, and up to 800 characters of description.
  3. Prepends this context to the review prompt before the diff.
  4. Instructs the LLM to return a jira_alignment array alongside observations, with one entry per referenced ticket.
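Step 1 can be sketched with Python's re module, using the example pattern from above (the helper name and deduplication behaviour are illustrative):

```python
import re

def ticket_keys(subject: str, pattern: str = r"PROJ-\d+") -> list[str]:
    """Extract unique ticket keys from a commit subject, preserving order."""
    seen = []
    for key in re.findall(pattern, subject):
        if key not in seen:
            seen.append(key)
    return seen
```

Making the pattern a parameter lets each customer's config supply their own JIRA project key format.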

Each alignment verdict is one of four values, from fully aligned to clearly misaligned.

A commit that's misaligned with its ticket is a genuine signal worth acting on. Either the commit is going to the wrong branch, the ticket reference is wrong, or the developer is solving a different problem than the ticket describes. Any of these is worth a conversation before the code gets merged.

JIRA is optional — the tool works without it, and individual clients can disable it. The per-run in-memory cache ensures the same ticket is fetched at most once per execution, regardless of how many commits reference it.
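The per-run cache can be as simple as a dictionary keyed by ticket. A sketch under that assumption (the real client and its field names may differ):

```python
class JiraCache:
    """Fetch each ticket at most once per execution."""

    def __init__(self, fetch):
        self._fetch = fetch  # e.g. a function wrapping the JIRA REST API
        self._seen = {}      # in-memory only: discarded when the run ends

    def issue(self, key: str) -> dict:
        if key not in self._seen:
            self._seen[key] = self._fetch(key)
        return self._seen[key]
```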

What It Reviews

The review prompt is configurable per customer, but the default covers seven dimensions.

For frontend code specifically (HTML, JS, TS, CSS, JSX, TSX), the prompt adds a set of privacy and security checks.

The instruction "If a category or check does not apply, skip it — do not speculate" is explicit in the prompt. Combined with the reflection pass, this keeps the output tight.

Architecture and Configuration

The tool is designed around a per-customer YAML configuration file. Each client gets their own config that specifies which repositories to monitor, which LLM to use, what time window to review, and their Bitbucket workspace details.

┌─────────────────────────────────────┐
│           gcr-client.yaml           │
│   repositories, time window, LLM    │
│   service, JIRA config, Bitbucket   │
└──────────────────┬──────────────────┘
                   │
         ┌─────────▼─────────┐
         │  git-commit-      │
         │  reviews.py       │
         └───┬───────────┬───┘
             │           │
      ┌──────▼─────┐ ┌───▼────────┐
      │  git show  │ │  JIRA API  │
      │  (diff)    │ │  (context) │
      └──────┬─────┘ └───┬────────┘
             │           │
         ┌───▼───────────▼───┐
         │  LiteLLM Router   │
         │  (Pass 1: Draft)  │
         └─────────┬─────────┘
                   │
         ┌─────────▼─────────┐
         │  LiteLLM Router   │
         │ (Pass 2: Reflect) │
         └─────────┬─────────┘
                   │
         ┌─────────▼─────────┐
         │   Bitbucket API   │
         │  (commit comment) │
         └───────────────────┘

The LLM layer is a LiteLLM router, so the actual model is swappable without touching the review code. Different passes can use different models — for example, a fast cheap model for reflection and a more capable model for the initial draft. The review_service and reflection_service keys in the config control this independently.
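The review_service and reflection_service keys are the ones described above; their placement in the YAML and the model names here are illustrative:

```yaml
# Placement within the config is illustrative; key names are the tool's own.
review_service: "strong-draft-model"      # Pass 1: draft review
reflection_service: "fast-cheap-model"    # Pass 2: reflection
```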

Token safety is built in: the tool estimates diff size (characters ÷ 4 ≈ tokens) before sending to the LLM. If the diff exceeds the model's effective token limit, the commit is skipped with a warning rather than truncating and producing a partial review. The limit is either set explicitly in the config or inferred from the model's max_tokens setting in the LLM router config.
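The guard described above is roughly the following (function names are illustrative):

```python
def estimate_tokens(diff: str) -> int:
    # Rough heuristic from the design above: ~4 characters per token.
    return len(diff) // 4

def within_limit(diff: str, token_limit: int) -> bool:
    """True if the diff fits; oversized diffs are skipped, never truncated."""
    return estimate_tokens(diff) <= token_limit
```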

How You Can Run It

The tool is flexible about when and how it gets triggered. There are three deployment modes, and they're not mutually exclusive.

1. On-Demand — Review a Specific Commit

Pass a commit hash directly to review a single commit immediately. This is useful when a PR is raised and you want AI feedback before the peer review starts, or when you want to re-review a commit after refining the prompt:

python git-commit-reviews.py -c gcr-client.yaml --commit-id <hash>

2. Scheduled — Cron Job on a Time Window

The most common production setup. A cron job runs the tool every few hours (or nightly), picking up all commits newer than the configured since timestamp across one or more repositories. No manual trigger needed — every commit gets reviewed automatically as a background process:

python git-commit-reviews.py -c gcr-client.yaml

The since value in the config controls the lookback window — it can be a relative duration like 24 hours ago or an absolute UTC timestamp for a fixed start point.
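One way to support both forms is a small parser like this (a hypothetical sketch, not the tool's exact code; it assumes absolute timestamps are given in UTC without an offset):

```python
import re
from datetime import datetime, timedelta, timezone

def parse_since(value: str) -> datetime:
    """Accept '24 hours ago' style durations or an absolute UTC timestamp."""
    match = re.fullmatch(r"(\d+)\s+hours?\s+ago", value.strip())
    if match:
        return datetime.now(timezone.utc) - timedelta(hours=int(match.group(1)))
    # Absolute timestamps, e.g. "2025-01-01T00:00:00", are taken as UTC.
    return datetime.fromisoformat(value).replace(tzinfo=timezone.utc)
```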

3. Event-Driven — Triggered by a Commit or Pull Request

For tighter integration, the tool can be wired into your CI/CD pipeline or source control webhook so it fires automatically on specific events:

In all three modes, the output is the same: a structured review posted as a Bitbucket commit comment and saved locally. The only difference is what pulls the trigger.

Reviews are saved locally to a reviews/ directory as {repo}-{hash}-{service}.txt files — useful for audit trails, trend analysis, or re-running with a different prompt.

Automatic Bitbucket Comments — Review Where the Code Lives

The most visible part of the workflow is not the local file — it's the automatic comment posted against the commit in Bitbucket. After the two-pass review completes, the tool calls the Bitbucket API and attaches the full structured review as a commit comment, right where the developer (and their team lead) will naturally look.

This matters for adoption. If a review lives in a log file on a server, nobody reads it. If it appears as a comment on the commit in Bitbucket — the same place where peer review comments go — it becomes part of the natural development workflow. Developers see it when they check their commit. Tech leads see it when they browse recent activity. No separate dashboard to check, no extra tool to learn.

The Bitbucket integration is configured per customer in the YAML:

bitbucket:
  enabled: true
  workspace: "your-workspace"

The repo slug is derived automatically from the repository path, so a single config covers multiple repositories without manual mapping for each one. The integration appends rather than replaces — if a commit is reviewed twice (e.g. after refining the prompt), both reviews appear as separate comments with their own timestamp. Nothing is silently overwritten.
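Posting the comment goes through Bitbucket Cloud's commit comments endpoint. A sketch using only the standard library (repo slug derivation and authentication details are simplified; the bearer-token header is an assumption about how credentials are supplied):

```python
import json
import urllib.request

API = "https://api.bitbucket.org/2.0"

def comment_url(workspace: str, repo_slug: str, commit: str) -> str:
    # Commit comments endpoint: POSTing here appends a new comment,
    # so repeated reviews of the same commit are never overwritten.
    return f"{API}/repositories/{workspace}/{repo_slug}/commit/{commit}/comments"

def post_review(workspace: str, repo_slug: str, commit: str,
                review_text: str, token: str) -> dict:
    req = urllib.request.Request(
        comment_url(workspace, repo_slug, commit),
        data=json.dumps({"content": {"raw": review_text}}).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {token}"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```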

Security Findings Mapped to Industry Standards

For security observations, the prompt instructs the LLM to map each finding to three frameworks — but only when the mapping is clearly defensible:

  1. OWASP Top 10: the broad vulnerability category.
  2. CWE: the precise weakness definition.
  3. CIS Controls: the control family a remediation programme should address.
The instruction is explicit: "Apply mappings only when clearly defensible. If a mapping is weak or speculative, omit it rather than guessing." This matters in practice — an LLM will happily invent a plausible-sounding OWASP category for almost any finding if you let it. Forcing it to map only when confident keeps the output useful for actual security conversations, audit discussions, or compliance reviews.

For mobile code (iOS/Android SDKs), the tool maps to the OWASP Mobile Top 10 (2024) instead of the web Top 10. The framework used is configurable per customer in the YAML prompt.

The practical benefit of these mappings: when a security finding lands on a developer's desk, they can immediately look up the CWE for a precise definition, the OWASP category for broader context, and the CIS control for what a remediation programme should look like. It turns a review comment into a traceable, standard-referenced finding — the kind a security auditor or CTO can act on.

What It Catches in Practice

Across several months running this across client repositories, the findings that appear most consistently:

Security: Hardcoded API keys and tokens in configuration files, missing input validation on user-supplied parameters, JWT tokens stored in localStorage (a known XSS risk), and missing CSRF protection on state-changing endpoints.

Frontend privacy: Login forms without autocomplete="off" on password fields — surprisingly common and a GDPR concern on EU-facing products. External links without rel="noopener noreferrer".

Functional: Off-by-one errors in pagination logic, null reference paths in newly added conditional branches, missing error handling after await calls.

JIRA misalignment: Commits referencing a bug ticket but changing unrelated infrastructure code — usually a copy-paste error in the commit message, but occasionally a developer working in the wrong branch.

The reflection pass is what makes these findings actionable. Without it, the list would also include warnings about the rest of the file — exception handling in code that wasn't touched, performance concerns in methods that weren't changed. With it, the findings are tightly scoped to what was actually committed.

What It Is Not

This is a first-pass reviewer, not a replacement for human review on significant changes. For large architectural commits, complex business logic, or cross-cutting refactors, a human reviewer who knows the domain is irreplaceable.

The tool also doesn't understand runtime behaviour — it can't know whether a code path is actually reachable, what a variable will contain at runtime, or whether a performance concern materialises at real traffic levels. It reasons only from the diff.

What it reliably does is catch the mechanical issues — the security hygiene, the missing attributes, the unchecked nulls — that reviewers tend to miss when they're reading quickly. It raises the floor of review quality across every commit, without adding load to the people doing reviews.

For a fractional CTO managing multiple client codebases, or a tech lead who is also building features, that floor-raising is the practical value. Not replacing judgement — removing the noise that makes judgement harder.



Hungry for more hands‑on guides on coding, security, and open‑source? Join our newsletter community—new insights delivered every week. Sign up below 👇