Thanks for this article. What's your suggestion on orgs encouraging non-engineers to "vibe-code" changes to codebases? With that user base we cannot expect "Don't merge if you can't explain it".
My thoughts for now are that all code needs someone **accountable** for it.
There has to be psychological safety among reviewers too, so they can push back on changes which are too large or complex. Strong feedback loops can reduce the risk, but the expectation of explaining the change still falls on the reviewer/owner of the codebase.
This is the question I get asked most right now. My take: accountability has to be structural, not just cultural. If non-engineers are shipping to production, someone technical must own that code path - full stop. That owner reviews, approves, and carries the pager.
The psychological safety point is huge though. Reviewers need explicit air cover to reject PRs that exceed their ability to verify, regardless of who submitted them or how "simple" AI made it look. Maybe the rule becomes: if the reviewer can't explain it, it doesn't merge and that's the reviewer's right, not their failure.
Will be interesting to see how this plays out this year.
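To make "structural, not just cultural" concrete, here is a minimal sketch of a merge gate that blocks any changed path lacking approval from a technical owner. Everything in it is an assumption for illustration: the `OWNERS` map, the `PullRequest` shape, and the rule that self-approval does not count are hypothetical, not taken from any particular tool.

```python
from dataclasses import dataclass
from fnmatch import fnmatch

# Hypothetical ownership map: path pattern -> the technical owner accountable for it.
OWNERS = {
    "checkout/*": "alice",
    "docs/*": "bob",
}


@dataclass
class PullRequest:
    author: str
    changed_paths: list[str]
    approvals: set[str]


def missing_owner_approvals(pr: PullRequest, owners: dict[str, str] = OWNERS) -> list[str]:
    """Return the changed paths that no accountable owner has approved."""
    blocked = []
    for path in pr.changed_paths:
        # Everyone who owns a pattern matching this path.
        path_owners = {owner for pattern, owner in owners.items() if fnmatch(path, pattern)}
        # An owner approving their own change does not count (an assumed policy).
        valid = (path_owners & pr.approvals) - {pr.author}
        if not valid:
            # Also covers paths with no owner at all: unowned code cannot merge.
            blocked.append(path)
    return blocked


def can_merge(pr: PullRequest) -> bool:
    return not missing_owner_approvals(pr)
```

Note that an unowned path blocks the merge by default, which matches the idea that all code needs someone accountable for it.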
I think this hits on the real bottleneck.
We’ve been working on building Mault to enforce architectural and testing constraints at the point of change so reviewers aren’t stuck catching violations that never should’ve made it to a PR. Humans still own intent and risk but the system can make being wrong non-destructive much earlier.
Thanks Kimberley! You've nailed the key bit there - shifting left on constraint enforcement means reviewers can focus on the harder judgment calls rather than playing gotcha with violations that tooling could have caught. "Make being wrong non-destructive earlier" is a great framing. Would love to hear how Mault handles the balance between being strict enough to catch real issues vs. flexible enough that developers don't just route around it.
We don’t believe there’s a single “right” AI workflow. We segment by risk. For newer or vibe coders, Mault acts as guardrails and a learning surface. For fast movers, it stays mostly invisible and just traps high impact drift like hallucinated libraries or architectural breaks. At the org level, strictness shifts into test quality, not just coverage, so green builds actually mean something. If someone feels the need to route around it, we treat that as our failure.
I expanded on this a bit more in a DM I just sent. Would love your feedback.
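As a rough illustration of what "trapping hallucinated libraries" at the point of change could look like, here is a small sketch using only the Python standard library. It is not how Mault works (nothing in this thread describes its internals); the function name `unresolved_imports` is made up, and it assumes the check runs inside the project's own environment.

```python
import ast
import importlib.util


def _is_resolvable(name: str) -> bool:
    """True if a top-level module with this name can be found in the current environment."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec can raise for unusual module states; treat those as unresolved.
        return False


def unresolved_imports(source: str) -> list[str]:
    """Return the top-level modules imported by `source` that cannot be found locally."""
    roots = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            roots.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 means an absolute import; relative imports are skipped.
            roots.add(node.module.split(".")[0])
    return sorted(name for name in roots if not _is_resolvable(name))
```

This only catches names that do not exist in the environment. A hallucinated name that happens to match a real but unintended package would pass, so checking against a pinned dependency list is probably the stronger version.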
This reminds me of CodeRabbit's recent report, which analyzed 470 open source GitHub pull requests, where 320 PRs were AI-co-authored and 150 were likely generated by humans alone.
The core findings included (https://www.infoworld.com/article/4109129/ai-assisted-coding-creates-more-problems-report.html):
- Severity escalates with AI, with more critical and major issues happening.
- AI introduced nearly two times more naming inconsistencies; unclear naming, mismatched terminology, and generic identifiers appeared frequently.
- AI code “looks right” at a glance but often violates local idioms or structure.
- AI-generated code often created issues correlated to real-world outages.
- Performance regressions are rare but are disproportionately AI-driven.
- Incorrect ordering, faulty dependency flow, or misuse of concurrency primitives appeared far more frequently in AI pull-requests.
- Formatting problems were 2.66 times more common in the AI pull requests.
Really appreciate you bringing in that CodeRabbit data - it's one of the more rigorous analyses I've seen on this in the last few weeks (and my original draft leaned more heavily into their findings, so I agree it's a helpful read). The "looks right at a glance but violates local idioms" finding is particularly insidious because it's exactly what slips past fatigued reviewers.
The volume problem is real. But in enterprise retail teams there's a second layer - inherited AI-generated code that passes review because the reviewer can validate structure but not domain behaviour.
Nobody flags that the checkout logic doesn't account for promotional pricing rules because that knowledge isn't in the team anymore.
Review catches bugs but it doesn't catch wrong assumptions built on missing domain knowledge.
That 45% security flaw rate for AI code is grim but the flip side caught me off guard. Anthropic recently pointed Claude at decades-old human-written code that's been fuzzed for years and it pulled out 500+ high-severity vulns everything else missed. Same context-aware reasoning you describe vs just pattern matching. Wrote up the full story: https://reading.sh/anthropic-pointed-ai-at-well-reviewed-code-it-found-500-bugs-971a01f75c96?sk=20c0af35eed2d0cd7d6b62ddc066bc84
"If you haven't seen the code do the right thing yourself, it doesn't work" - this is the rule I learned the hard way. Let my agent deploy something I didn't verify. It broke production.
Now I treat AI output like junior dev work: trust but verify. Test coverage became non-negotiable. The speed gains are huge, but only if you have guardrails.
How I approach verification: https://thoughts.jock.pl/p/claude-code-vs-codex-real-comparison-2026
How do you balance verification time against the speed gains from AI coding?
Great piece. I got sucked into reading this right away. Looking forward to reading more like this.
Strong take. AI has shifted the bottleneck from writing code to proving it works. Reviews now feel less about syntax and more about intent, risk, and long-term maintainability. The “PR contract” idea is especially practical.
Thank you for sharing!
Very interesting - my bet is that we will see a lot more tools in this space during 2026, especially ones that shift validation from only checking the code to actually testing whether the generated application does what was intended in the first place.
To achieve this, we need to start versioning or saving the intents, maybe in the form of prompts, or specs, or something else. And then use AI agents to verify that behavior.
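One possible shape for a saved intent, sketched under heavy assumptions: the spec lives in the repo as JSON with a plain-language intent plus concrete examples, and a checker replays the examples against the implementation. This swaps the AI agent for a plain example runner, and the spec format, field names, and `checkout_total` function are all hypothetical.

```python
import json

# Hypothetical intent spec, versioned alongside the code (for example as a JSON file).
SPEC_JSON = """
{
  "intent": "Promotional discounts apply before tax",
  "examples": [
    {"subtotal_cents": 10000, "promo_percent": 0,   "tax_percent": 10, "expected_cents": 11000},
    {"subtotal_cents": 10000, "promo_percent": 20,  "tax_percent": 10, "expected_cents": 8800},
    {"subtotal_cents": 10000, "promo_percent": 100, "tax_percent": 10, "expected_cents": 0}
  ]
}
"""


def checkout_total(subtotal_cents: int, promo_percent: int, tax_percent: int) -> int:
    """Illustrative implementation: discount first, then tax, in integer cents."""
    discounted = subtotal_cents * (100 - promo_percent) // 100
    return discounted + discounted * tax_percent // 100


def verify(spec: dict, fn) -> list[str]:
    """Replay every example in the spec against fn and describe each mismatch."""
    failures = []
    for example in spec["examples"]:
        args = {k: v for k, v in example.items() if k != "expected_cents"}
        actual = fn(**args)
        if actual != example["expected_cents"]:
            failures.append(f"{args}: expected {example['expected_cents']}, got {actual}")
    return failures


if __name__ == "__main__":
    spec = json.loads(SPEC_JSON)
    problems = verify(spec, checkout_total)
    print(spec["intent"], "->", "OK" if not problems else problems)
```

The point of keeping the spec separate from the code is that a later change, whoever or whatever writes it, gets checked against the recorded intent rather than against whatever the reviewer happens to remember.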
Like Deming would say, the QA approach to quality is: "you burn the toast, I scrape it". Of course a PR should contain code with automated tests in 2026, and "solo developers" are not suddenly exempt from understanding the code and providing automated test coverage. Feels like this is an attempt not to offend them while they move the bottleneck to their customers being burdened with the problems they create.