A year ago, AI-generated code in pull requests was a curiosity. Today, in many of the codebases we work in, it’s the default.
What hasn’t kept pace is how teams review it.
We see two failure modes.
The first is review theatre: a reviewer scrolls through a 400-line diff, sees that it compiles and tests pass, and clicks “LGTM” without fully understanding what changed. The assumption is that if the model wrote it and nothing failed, it must be fine.
The second is the opposite extreme: over-review. Teams react to the risks of AI-generated code by scrutinizing every line. Naming, formatting, structure - everything gets debated. Reviews take longer, throughput drops, and the team slows to a crawl.
Neither approach scales. Neither reflects what code review is actually for.
The model that works - the one we’ve settled on across our embedded engagements - treats review as a layered system.
Machines handle form. Humans guard intent.
Most teams already have the basics in place: linting, type-checking, and a test suite. These are table stakes.
But with AI-generated code, they’re not enough.
We extend the automated layer in three directions.
First, we enforce standard quality gates: linting, formatting, type safety, and test execution. This ensures the code meets baseline expectations before a reviewer ever sees it.
Second, we run security and compliance checks: static application security testing (SAST) and dependency/license scanning. These catch known vulnerabilities and prevent problematic dependencies from entering the codebase.
Third - and this is where most teams fall short - we add AI-specific checks.
These are designed to catch failure modes that traditional tooling misses:
None of these are reliably caught by standard CI pipelines. All of them are common in AI-generated code.
The good news: these checks are not complex. A small set of project-aware rules - often written in a day - can eliminate entire categories of risk. And once in place, they scale automatically.
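As an illustration of how small such a rule can be, here is a minimal sketch of one possible check: flagging `except` blocks that swallow errors without doing anything. It assumes a Python codebase and that silent error handling is one of the failure modes you care about. The function names and the sample snippet are hypothetical, not part of any particular toolchain.

```python
import ast


def _is_noop(stmt: ast.stmt) -> bool:
    """True for statements that do nothing: `pass`, `...`, or a bare constant."""
    if isinstance(stmt, ast.Pass):
        return True
    return isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Constant)


def find_silent_handlers(source: str) -> list[int]:
    """Return line numbers of except blocks whose body does nothing."""
    silent_lines = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        # A handler counts as silent when every statement in its body is a no-op.
        if all(_is_noop(stmt) for stmt in node.body):
            silent_lines.append(node.lineno)
    return sorted(silent_lines)


# Hypothetical sample input: the first handler swallows the error,
# the second re-raises with context and should not be flagged.
SAMPLE = '''
def load(path):
    try:
        return open(path).read()
    except OSError:
        pass

def parse(text):
    try:
        return int(text)
    except ValueError as exc:
        raise RuntimeError("bad input") from exc
'''

if __name__ == "__main__":
    for line in find_silent_handlers(SAMPLE):
        print(f"silent except handler at line {line}")
```

A rule like this is deliberately narrow. It will not catch a handler that logs and carries on when it should fail, which is exactly the kind of judgment that stays with the human reviewer.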
By the time a human reviewer sees the pull request, the “form” of the code - syntax, structure, safety - has already been enforced.
Once machines handle form, human reviewers can focus on what actually matters: intent.
This is where experienced engineers add real value.
The key areas we expect humans to review are:
AI is very good at producing plausible code. It’s much less reliable at producing correct code in context.
What’s equally important is what doesn’t need human attention:
These are form problems. And form problems should be solved by tools, not debated in reviews.
When humans spend time on form, they’re not just inefficient - they’re distracted from the higher-order risks that actually matter.
AI doesn’t just generate code. It’s increasingly used to review it as well.
Used correctly, AI reviewers can accelerate the process.
They are particularly effective at:
In these areas, AI scales what humans already do well.
But there are also clear failure modes.
AI reviewers tend to:
The result is often a review process that feels thorough but isn’t.
The rule of thumb we use is simple:
AI reviewers are good at scaling things humans already do well. They are not a substitute for the things humans uniquely do.
To make this practical, we rely on a simple checklist that keeps reviews focused and consistent.
Here’s a version you can adopt directly:
- Does any function call reference an API we don’t depend on?
- Is any error path silent or swallowing exceptions?
- Are inputs at trust boundaries properly validated?
- Does this change fit the architecture of the surrounding code?
- Are security-sensitive flows (auth, PII) handled correctly?
- Are tests asserting behavior, not just structure?
- Is any code included that we can’t trace from a licensing perspective?
- Is the diff scoped to a single concern, or is it mixing responsibilities?
- Does the PR description explain intent, not just describe changes?
- Is there anything that looks correct but might not meet the original requirement?
This checklist is deliberately short.
It’s not meant to cover everything - it’s meant to focus attention where it matters.
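Some of these questions can be pre-screened by a script so the reviewer starts from a shorter list. The sketch below approximates the first item for a Python codebase by comparing imported top-level modules against an allow-list. It checks imports rather than individual function calls, so it is only a proxy for the question. The allow-list, the sample source, and the `fastjsonx` package name are all hypothetical.

```python
import ast


def undeclared_imports(source: str, allowed: set[str]) -> set[str]:
    """Return top-level module names imported in `source` but not in `allowed`."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # level > 0 means a relative import (from . import x), which
            # refers to the project itself, so it is skipped.
            if node.level == 0 and node.module:
                found.add(node.module.split(".")[0])
    return found - allowed


# Hypothetical allow-list: in a real project this would be derived from the
# dependency manifest plus the standard library and first-party packages.
ALLOWED = {"json", "pathlib", "requests"}

# Hypothetical diff content: `fastjsonx` stands in for a package the
# project never declared.
SAMPLE = '''
import json
from pathlib import Path
import requests.adapters
from fastjsonx import loads
from . import helpers
'''

if __name__ == "__main__":
    print(sorted(undeclared_imports(SAMPLE, ALLOWED)))
```

Anything the script reports still needs a human decision: the import may be a legitimate new dependency that belongs in the manifest, or a package the model invented.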
Teams often ask which tools they should adopt to improve code review in AI-heavy environments.
The honest answer is: tools are secondary.
The real decision is how you structure the review process.
Once you commit to a layered model - machines handling form, humans guarding intent - the tooling becomes obvious and relatively inexpensive.
Without that decision, even the most advanced toolchain won’t help. You’ll still end up in one of the two failure modes: review theatre or over-review.
Review velocity isn’t a tooling problem. It’s a layering problem.
Conclusion
In AI-heavy codebases, the volume of code is increasing. The risk profile is changing.
But the goal of code review hasn’t changed: protect the integrity of the system without blocking progress.
The teams that succeed aren’t the ones reviewing more or reviewing less.
They’re the ones reviewing differently.
They let machines enforce consistency. They rely on humans to protect intent. And they design their process so both can operate at their strengths.