AI Code Review on Every Commit: What We Built and Why

A two-pass LLM reviewer that reads git diffs, checks JIRA ticket alignment, and posts structured findings to Bitbucket — built for startup teams where code review is always the bottleneck.


Every startup engineering team hits the same wall. The developers are productive — commits are landing, features are shipping. But code review is piling up. Nobody has time to review every commit carefully. Things slip through: a hardcoded secret, a missing null check, a SQL query that will fall over at 10x data, a frontend link without rel="noopener noreferrer".

This is the AI productivity paradox in practice. AI coding assistants make developers faster at writing code. They do not make reviewers faster at reviewing it. The bottleneck moves downstream.

As a fractional CTO serving multiple startup clients simultaneously, this is not a hypothetical — it's a daily operational constraint. I can't be a first-pass reviewer on every commit for every client. Neither can a small team's tech lead who is also writing code, handling escalations, and sitting in product discussions.

So we built a tool to be that first-pass reviewer. This is what it does, how it works, and what we learned building it.

What the Tool Does

The tool is a Python script — git-commit-reviews.py — that runs on a schedule (or on-demand for a specific commit). It does five things:

  1. Fetches recent commits from one or more git repositories, within a configurable time window.
  2. Extracts the unified diff for each commit using git show, with configurable context lines around each change.
  3. Pulls JIRA context if the commit subject contains a ticket reference — fetching the issue summary, type, status, and description to give the LLM intent context.
  4. Runs a two-pass LLM review: a draft pass that generates findings, followed by a reflection pass that removes anything not directly evidenced by changed code.
  5. Posts the structured review as a commit comment on Bitbucket, and saves it locally as a JSON file.
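Step 2's diff extraction boils down to one git show call per commit. A minimal sketch using subprocess (the function name and exact flag choices are illustrative, not the tool's actual code):

```python
import subprocess

def commit_diff(repo_path: str, commit: str, context_lines: int = 5) -> str:
    """Unified diff for one commit, with configurable context around changes."""
    result = subprocess.run(
        ["git", "-C", repo_path, "show",
         f"--unified={context_lines}",  # context lines around each change
         "--format=",                   # suppress the commit header, keep the diff
         commit],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

More context lines give the LLM more surrounding code to reason about, at the cost of a larger prompt.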

The output is a structured JSON review — commit metadata, list of changed files, and a set of observations each with a title, classification, severity, description, code snippet, and actionable suggestions.
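Concretely, a saved review might look something like this. The shape follows the description above; every value here is illustrative, not output from a real run:

```json
{
  "commit": "a1b2c3d",
  "repository": "payments-api",
  "files_changed": ["src/billing/invoice.py"],
  "observations": [
    {
      "title": "Hardcoded API key in configuration",
      "classification": "security",
      "severity": "high",
      "description": "An API key is committed as a string literal in an added line.",
      "snippet": "STRIPE_KEY = \"sk_live_...\"",
      "suggestions": ["Move the key to an environment variable or secret store."]
    }
  ],
  "jira_alignment": [
    {"ticket": "PROJ-142", "verdict": "aligned"}
  ]
}
```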

Why Two Passes?

The single biggest quality problem with LLM code reviews is speculative findings. A model reads a diff and notices, say, a function that calls an external API. It flags "this could throw an unhandled exception if the network is down." Technically possible. But the diff didn't change exception handling — that's pre-existing code. The observation is noise, not signal.

Noisy reviews are worse than no reviews. Engineers stop reading them.

The reflection pass fixes this. After the draft review is generated, a second LLM call is made with a focused instruction:

For each observation, check whether the problematic code is directly visible in the "+" lines of the diff. Remove any finding that references unchanged context lines, uses speculative language ("might", "could potentially", "may cause") without pointing to a concrete added line, or flags something that existed before this commit.

The reflection model is allowed to be a different (and potentially cheaper) model than the draft reviewer. In practice, the reflection pass removes 20–40% of observations from a typical review — almost all of them speculative. What remains is a set of grounded findings about code that was actually changed.

One implementation detail worth noting: the reflection pass rewrites the full observations array and updates the summary counts. But it doesn't know about JIRA alignment — that's a separate field added by the first pass. So before returning the refined review, we stash the jira_alignment array from the draft and re-inject it into the reflected output. The model never touches it.
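A minimal sketch of that stash-and-reinject step (function and parameter names are illustrative, not the tool's actual code):

```python
def refine(draft_review: dict, run_reflection) -> dict:
    """Run the reflection pass, preserving fields the model must not touch."""
    # Stash JIRA alignment: the reflection model never sees or rewrites it.
    jira_alignment = draft_review.get("jira_alignment", [])

    # The reflection call rewrites observations and updates summary counts.
    refined = run_reflection(draft_review)

    # Re-inject the draft's alignment verdicts into the reflected output.
    refined["jira_alignment"] = jira_alignment
    return refined
```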

JIRA Alignment: Did the Code Actually Do What the Ticket Asked?

For teams using JIRA, every commit should relate to a ticket. In practice, this relationship is often assumed rather than verified. A developer closes a ticket and the PR is approved, but nobody checked whether the code change actually addresses the acceptance criteria.

When JIRA integration is enabled, the tool:

  1. Extracts ticket keys from the commit subject using a configurable regex pattern (e.g. PROJ-\d+).
  2. Fetches the issue details from JIRA — summary, type, status, and up to 800 characters of description.
  3. Prepends this context to the review prompt before the diff.
  4. Instructs the LLM to return a jira_alignment array alongside observations, with one entry per referenced ticket.
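Step 1 can be sketched with Python's re module, using the example pattern from above (the helper name and deduplication behaviour are illustrative):

```python
import re

def ticket_keys(subject: str, pattern: str = r"PROJ-\d+") -> list[str]:
    """Extract unique ticket keys from a commit subject, preserving order."""
    seen = []
    for key in re.findall(pattern, subject):
        if key not in seen:
            seen.append(key)
    return seen
```

Making the pattern a parameter lets each customer's config supply their own JIRA project key format.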

Each alignment verdict is one of four values, from fully aligned to clearly misaligned.

A commit that's misaligned with its ticket is a genuine signal worth acting on. Either the commit is going to the wrong branch, the ticket reference is wrong, or the developer is solving a different problem than the ticket describes. Any of these is worth a conversation before the code gets merged.

JIRA is optional — the tool works without it, and individual clients can disable it. The per-run in-memory cache ensures the same ticket is fetched at most once per execution, regardless of how many commits reference it.
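The per-run cache can be as simple as a dictionary keyed by ticket. A sketch under that assumption (the real client and its field names may differ):

```python
class JiraCache:
    """Fetch each ticket at most once per execution."""

    def __init__(self, fetch):
        self._fetch = fetch  # e.g. a function wrapping the JIRA REST API
        self._seen = {}      # in-memory only: discarded when the run ends

    def issue(self, key: str) -> dict:
        if key not in self._seen:
            self._seen[key] = self._fetch(key)
        return self._seen[key]
```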

What It Reviews

The review prompt is configurable per customer, but the default covers seven dimensions.

For frontend code specifically (HTML, JS, TS, CSS, JSX, TSX), the prompt adds a set of privacy and security checks.

The instruction "If a category or check does not apply, skip it — do not speculate" is explicit in the prompt. Combined with the reflection pass, this keeps the output tight.

Architecture and Configuration

The tool is designed around a per-customer YAML configuration file. Each client gets their own config that specifies which repositories to monitor, which LLM to use, what time window to review, and their Bitbucket workspace details.

┌─────────────────────────────────────┐
│           gcr-client.yaml           │
│   repositories, time window, LLM    │
│   service, JIRA config, Bitbucket   │
└──────────────────┬──────────────────┘
                   │
         ┌─────────▼─────────┐
         │  git-commit-      │
         │  reviews.py       │
         └───┬───────────┬───┘
             │           │
      ┌──────▼─────┐ ┌───▼────────┐
      │  git show  │ │  JIRA API  │
      │  (diff)    │ │  (context) │
      └──────┬─────┘ └───┬────────┘
             │           │
         ┌───▼───────────▼───┐
         │  LiteLLM Router   │
         │  (Pass 1: Draft)  │
         └─────────┬─────────┘
                   │
         ┌─────────▼─────────┐
         │  LiteLLM Router   │
         │ (Pass 2: Reflect) │
         └─────────┬─────────┘
                   │
         ┌─────────▼─────────┐
         │   Bitbucket API   │
         │  (commit comment) │
         └───────────────────┘

The LLM layer is a LiteLLM router, so the actual model is swappable without touching the review code. Different passes can use different models — for example, a fast cheap model for reflection and a more capable model for the initial draft. The review_service and reflection_service keys in the config control this independently.
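The review_service and reflection_service keys are the ones described above; their placement in the YAML and the model names here are illustrative:

```yaml
# Placement within the config is illustrative; key names are the tool's own.
review_service: "strong-draft-model"      # Pass 1: draft review
reflection_service: "fast-cheap-model"    # Pass 2: reflection
```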

Token safety is built in: the tool estimates diff size (characters ÷ 4 ≈ tokens) before sending to the LLM. If the diff exceeds the model's effective token limit, the commit is skipped with a warning rather than truncating and producing a partial review. The limit is either set explicitly in the config or inferred from the model's max_tokens setting in the LLM router config.
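The guard described above is roughly the following (function names are illustrative):

```python
def estimate_tokens(diff: str) -> int:
    # Rough heuristic from the design above: ~4 characters per token.
    return len(diff) // 4

def within_limit(diff: str, token_limit: int) -> bool:
    """True if the diff fits; oversized diffs are skipped, never truncated."""
    return estimate_tokens(diff) <= token_limit
```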

How You Can Run It

The tool is flexible about when and how it gets triggered. There are three deployment modes, and they're not mutually exclusive.

1. On-Demand — Review a Specific Commit

Pass a commit hash directly to review a single commit immediately. This is useful when a PR is raised and you want AI feedback before the peer review starts, or when you want to re-review a commit after refining the prompt:

python git-commit-reviews.py -c gcr-client.yaml --commit-id <hash>

2. Scheduled — Cron Job on a Time Window

The most common production setup. A cron job runs the tool every few hours (or nightly), picking up all commits newer than the configured since timestamp across one or more repositories. No manual trigger needed — every commit gets reviewed automatically as a background process:

python git-commit-reviews.py -c gcr-client.yaml

The since value in the config controls the lookback window — it can be a relative duration like 24 hours ago or an absolute UTC timestamp for a fixed start point.
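One way to support both forms is a small parser like this (a hypothetical sketch, not the tool's exact code; it assumes absolute timestamps are given in UTC without an offset):

```python
import re
from datetime import datetime, timedelta, timezone

def parse_since(value: str) -> datetime:
    """Accept '24 hours ago' style durations or an absolute UTC timestamp."""
    match = re.fullmatch(r"(\d+)\s+hours?\s+ago", value.strip())
    if match:
        return datetime.now(timezone.utc) - timedelta(hours=int(match.group(1)))
    # Absolute timestamps, e.g. "2025-01-01T00:00:00", are taken as UTC.
    return datetime.fromisoformat(value).replace(tzinfo=timezone.utc)
```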

3. Event-Driven — Triggered by a Commit or Pull Request

For tighter integration, the tool can be wired into your CI/CD pipeline or source control webhook so it fires automatically on specific events:

In all three modes, the output is the same: a structured review posted as a Bitbucket commit comment and saved locally. The only difference is what pulls the trigger.

Reviews are saved locally to a reviews/ directory as {repo}-{hash}-{service}.txt files — useful for audit trails, trend analysis, or re-running with a different prompt.

Automatic Bitbucket Comments — Review Where the Code Lives

The most visible part of the workflow is not the local file — it's the automatic comment posted against the commit in Bitbucket. After the two-pass review completes, the tool calls the Bitbucket API and attaches the full structured review as a commit comment, right where the developer (and their team lead) will naturally look.

This matters for adoption. If a review lives in a log file on a server, nobody reads it. If it appears as a comment on the commit in Bitbucket — the same place where peer review comments go — it becomes part of the natural development workflow. Developers see it when they check their commit. Tech leads see it when they browse recent activity. No separate dashboard to check, no extra tool to learn.

The Bitbucket integration is configured per customer in the YAML:

bitbucket:
  enabled: true
  workspace: "your-workspace"

The repo slug is derived automatically from the repository path, so a single config covers multiple repositories without manual mapping for each one. The integration appends rather than replaces — if a commit is reviewed twice (e.g. after refining the prompt), both reviews appear as separate comments with their own timestamp. Nothing is silently overwritten.
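Posting the comment goes through Bitbucket Cloud's commit comments endpoint. A sketch using only the standard library (repo slug derivation and authentication details are simplified; the bearer-token header is an assumption about how credentials are supplied):

```python
import json
import urllib.request

API = "https://api.bitbucket.org/2.0"

def comment_url(workspace: str, repo_slug: str, commit: str) -> str:
    # Commit comments endpoint: POSTing here appends a new comment,
    # so repeated reviews of the same commit are never overwritten.
    return f"{API}/repositories/{workspace}/{repo_slug}/commit/{commit}/comments"

def post_review(workspace: str, repo_slug: str, commit: str,
                review_text: str, token: str) -> dict:
    req = urllib.request.Request(
        comment_url(workspace, repo_slug, commit),
        data=json.dumps({"content": {"raw": review_text}}).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {token}"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```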

Security Findings Mapped to Industry Standards

For security observations, the prompt instructs the LLM to map each finding to three frameworks — but only when the mapping is clearly defensible:

  1. OWASP Top 10: the broad vulnerability category.
  2. CWE: the precise weakness definition.
  3. CIS Controls: the control family a remediation programme should address.
The instruction is explicit: "Apply mappings only when clearly defensible. If a mapping is weak or speculative, omit it rather than guessing." This matters in practice — an LLM will happily invent a plausible-sounding OWASP category for almost any finding if you let it. Forcing it to map only when confident keeps the output useful for actual security conversations, audit discussions, or compliance reviews.

For mobile code (iOS/Android SDKs), the tool maps to the OWASP Mobile Top 10 (2024) instead of the web Top 10. The framework used is configurable per customer in the YAML prompt.

The practical benefit of these mappings: when a security finding lands on a developer's desk, they can immediately look up the CWE for a precise definition, the OWASP category for broader context, and the CIS control for what a remediation programme should look like. It turns a review comment into a traceable, standard-referenced finding — the kind a security auditor or CTO can act on.

What It Catches in Practice

Across several months running this across client repositories, the findings that appear most consistently:

Security: Hardcoded API keys and tokens in configuration files, missing input validation on user-supplied parameters, JWT tokens stored in localStorage (a known XSS risk), and missing CSRF protection on state-changing endpoints.

Frontend privacy: Login forms without autocomplete="off" on password fields — surprisingly common and a GDPR concern on EU-facing products. External links without rel="noopener noreferrer".

Functional: Off-by-one errors in pagination logic, null reference paths in newly added conditional branches, missing error handling after await calls.

JIRA misalignment: Commits referencing a bug ticket but changing unrelated infrastructure code — usually a copy-paste error in the commit message, but occasionally a developer working in the wrong branch.

The reflection pass is what makes these findings actionable. Without it, the list would also include warnings about the rest of the file — exception handling in code that wasn't touched, performance concerns in methods that weren't changed. With it, the findings are tightly scoped to what was actually committed.

What It Is Not

This is a first-pass reviewer, not a replacement for human review on significant changes. For large architectural commits, complex business logic, or cross-cutting refactors, a human reviewer who knows the domain is irreplaceable.

The tool also doesn't understand runtime behaviour — it can't know whether a code path is actually reachable, what a variable will contain at runtime, or whether a performance concern materialises at real traffic levels. It reasons only from the diff.

What it reliably does is catch the mechanical issues — the security hygiene, the missing attributes, the unchecked nulls — that reviewers tend to miss when they're reading quickly. It raises the floor of review quality across every commit, without adding load to the people doing reviews.

For a fractional CTO managing multiple client codebases, or a tech lead who is also building features, that floor-raising is the practical value. Not replacing judgement — removing the noise that makes judgement harder.



Hungry for more hands‑on guides on coding, security, and open‑source? Join our newsletter community—new insights delivered every week. Sign up below 👇