How Far Can AI Self-Validate Rails Code?

A few weeks ago, I wrote "Stop Reading AI Code, Start Measuring It: A Rails Playbook." The idea there was that if you stop hand-reviewing every line of AI-generated code, you’ll need an automated gate that an AI can run after every change and that catches the things humans used to catch. In the first iteration, we landed on four gates: SimpleCov line and branch coverage, Flog and RuboCop Metrics for complexity, and Mutant for mutation kill ratio.

In the closing of that post I was already explicit about what those four gates don’t catch. I called out security (SQL injection and other classic vulnerabilities) and performance (N+1 queries, an includes that loads half the database into memory, an unbounded loop, a 4-second request) as known gaps. "A method can be short, well-tested, mutation-proof, and still bring the app to its knees." This last week, I kept coming back to that admission: can we push further on those exact gaps? I spent a session on trend-radar trying to close them with four more gates targeting security and runtime performance.

I ended up keeping two of those four gates. Along the way, I split the gate into a fast local version and a slower CI version, because the local runtime was starting to slow down the AI loop. The crazy part is that this split surfaced a bug that had been silently inflating my mutation kill ratio by 7 percentage points.

Brakeman

If you’re not familiar with Brakeman, it’s a static analysis tool built specifically for Rails. It walks your source code without actually executing it, looking for the classic vulnerability patterns: SQL injection, cross-site scripting, mass assignment, command injection, unsafe deserialization, open redirects, hardcoded secrets, unsafe file access, the whole OWASP top 10 plus a bunch of Rails-specific common vulnerabilities. It runs in a few seconds and produces a JSON report, with each finding tagged by confidence level (high, medium, weak) so you can decide where to draw the line.

It’s almost certainly already in your Rails Gemfile, since most Rails apps pick it up through the default omakase setup, but I’d never actually run it on this codebase, let alone added it to our quality task. So the whole job came down to adding a quality:brakeman rake task with a threshold:

desc "Run Brakeman; emit JSON to tmp/quality/brakeman.json; set warnings_max threshold on first run"
task brakeman: :environment do
  FileUtils.mkdir_p(QUALITY_DIR)
  out_path = QUALITY_DIR.join("brakeman.json")
  sh "bundle exec brakeman --format json --confidence-level 2 --no-pager " \
     "-o #{out_path.to_s.shellescape} || true"

  parsed = Quality::BrakemanParser.new(out_path).parse
  set_brakeman_threshold_if_unset!(parsed[:warnings])
end

Two design choices worth flagging:

  1. --confidence-level 2 filters out the weak-confidence warnings and keeps the medium and high ones. Brakeman’s weak tier is mostly informational noise (patterns that could be unsafe in some contexts but usually aren’t), so gating on it produces a steady stream of "yes I see it, accepting" bumps that don’t really protect anything. Medium and high are where the more concrete value lives, with warnings that flag a real pattern Brakeman is fairly sure about, and gating hard on them gives the AI a clear binary to act on without flooding it in false positives.
  2. Hard threshold, no tolerance. Same idea as above. A "+1 warning is fine" tolerance band means you might silently merge a real SQL injection that snuck in alongside an unrelated fix. Better to fail and force a one-line warnings_max: 1 bump as the explicit "yes, I see it, accepting" signal: visible in PR diff, deliberate decision. And the same caveat applies: a zero-tolerance rule like this would feel horrible without AI in the loop, but thank god we now have our intern Claude.

The first run captures the observed value (0 in my case) and writes it to config/quality_thresholds.yml. Next runs that introduce a new warning fail. To verify this actually catches anything, I temporarily added User.where("name = '#{params[:name]}'") to a controller, ran the gate, and watched it go red.
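The "seed on first run, enforce afterwards" behaviour can be sketched in plain Ruby. This is a minimal sketch under stated assumptions, not the code from the repo: count_warnings, check_brakeman! and the brakeman / warnings_max key layout are hypothetical names. The only thing taken from Brakeman itself is that its JSON report carries a top-level "warnings" array.

```ruby
require "json"
require "yaml"

# Hypothetical helper: count the findings in a Brakeman JSON report.
# Brakeman's JSON output has a top-level "warnings" array.
def count_warnings(report_path)
  JSON.parse(File.read(report_path)).fetch("warnings", []).size
end

# Hypothetical helper: seed the threshold on the first run, enforce it afterwards.
# The { "brakeman" => { "warnings_max" => n } } layout is an assumption.
def check_brakeman!(report_path, thresholds_path)
  thresholds = File.exist?(thresholds_path) ? (YAML.safe_load(File.read(thresholds_path)) || {}) : {}
  thresholds["brakeman"] ||= {}
  observed = count_warnings(report_path)

  if thresholds["brakeman"]["warnings_max"].nil?
    # First run: record whatever we observed as the ceiling.
    thresholds["brakeman"]["warnings_max"] = observed
    File.write(thresholds_path, YAML.dump(thresholds))
    return :threshold_set
  end

  max = thresholds["brakeman"]["warnings_max"]
  # Hard threshold, no tolerance band: one extra warning fails the gate.
  raise "Brakeman: #{observed} warnings > warnings_max #{max}" if observed > max

  :ok
end
```

Accepting a finding then means editing warnings_max by hand, which is what makes the decision show up in the PR diff.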

Bullet (N+1)

Bullet was the second easy win. It’s a really well-known gem in the Rails community for catching N+1 queries, and for a good reason: it hooks into ActiveRecord at runtime and watches what your code does with associations during a request. It flags three specific shapes of Rails performance bug: N+1 queries (loading 100 records and then calling .posts on each one without preloading), unused eager loading (you wrote .includes(:posts) but the request never actually touches posts), and missing counter caches (.size on a has_many triggering a COUNT query when a denormalized counter column would do). The detection is heuristic, not perfect, but it catches the most common Rails database performance problems, and it does it just by running your Rails spec suite.

To make Bullet’s detections actually fail the gate, I wired it into the spec suite via spec/support/bullet.rb:

require "bullet"

Bullet.enable      = true
Bullet.raise       = true
Bullet.bullet_logger = false
Bullet.console     = false

RSpec.configure do |config|
  config.before(:each) { Bullet.start_request }
  config.after(:each) do
    Bullet.perform_out_of_channel_notifications if Bullet.notification?
    Bullet.end_request
  end
end

Bullet.raise = true is what turns a detection into a spec failure. And a spec failure fails bin/rspec, which fails quality:coverage, which fails the umbrella gate. No new threshold, no JSON artifact, no parser. Binary.

The first time I ran this on trend-radar, six request specs failed, and all six came from two Bullet false positives, zero real N+1s in the codebase. One was validate_per_user_limit in TopicSubscription, calling user.topic_subscriptions.count, a single COUNT query rather than iteration. The other was TopicIndexProps#call doing @user.topic_subscriptions.index_by(&:topic_id), which loads the user’s subscriptions exactly once into memory and lets the very next line consume them.

Bullet’s heuristic flags any association calls on a loaded record without preloading. It can’t tell the difference between a real N+1 and a code path that legitimately uses an association exactly once.

The two cleanest fixes here are to refactor around Bullet’s heuristic (use TopicSubscription.where(user_id: user_id) instead of user.topic_subscriptions) or to add Bullet.add_safelist entries. I decided to refactor: it should generate the same SQL, with no heuristic trigger and no safelist entries to maintain.

The refactors are not free, though. They cost something real: the code is slightly less semantic afterwards. user.topic_subscriptions.count reads as "how many subscriptions does this user have?" and expresses the domain relationship directly. TopicSubscription.where(user_id: user_id).count reads as "count rows in this table where user_id matches": table-centric, one indirection further from the conceptual model. The SQL is probably identical, but we add a cognitive cost that is going to be paid by every future reader of those two methods.

I’m only considering this tradeoff because I’m deliberately prioritizing AI observability over human readability. The same Bullet rule that fires on these false positives is what will fire on a real N+1 the next time someone writes user.topics.each { |t| t.posts.first }. Either I tolerate Bullet’s noise here and refactor the two affected methods, or I add Bullet.add_safelist entries. The second route keeps the code more semantic, but it opens a different cost: any list of exceptions you have to maintain by hand (safelists, allowlists, ignore files, threshold tables, you name it) eventually becomes a leak. Entries get added in moments of urgency, nobody audits them, and one day a real regression slips through hidden behind a stale exemption.

I’m willing to pay that cost for now, because I’m trying to see how far these tools can push AI toward generating reliable code without me monitoring it. On a larger, more established codebase, I would probably start with Bullet.add_safelist instead, to avoid a big refactor while introducing the gate, and then slowly work the safelist down.
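For comparison, the safelist route would look roughly like this. It is a sketch only: the detection type (:n_plus_one_query) and the class and association names are assumptions about which Bullet check fired here, and the BULLET_SAFELIST constant is a hypothetical name. Bullet.add_safelist itself takes a hash with :type, :class_name and :association.

```ruby
# Hypothetical safelist covering the association behind both false positives.
# The :type value is an assumption; Bullet also accepts
# :unused_eager_loading and :counter_cache.
BULLET_SAFELIST = [
  { type: :n_plus_one_query, class_name: "User", association: :topic_subscriptions }
].freeze

# Only register the entries when the Bullet gem is actually loaded.
if defined?(Bullet)
  BULLET_SAFELIST.each { |entry| Bullet.add_safelist(entry) }
end
```

Keeping the entries in one named list at least makes the exemptions easy to audit, which is the part that tends to rot.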

What I pulled back: allocations and per-action SQL counts

This is where it got interesting. I implemented two more gates, got them working, even ran a synthetic experiment to verify they fired on regressions, and then decided they weren’t worth keeping.

Suite-wide allocations via GC.stat

Quick reminder in case GC is not a concept you’re familiar with. Ruby’s garbage collector is the runtime subsystem that frees memory you no longer need. You don’t allocate or free objects manually in Ruby; every String.new, every Hash.new, every User.find returns a fresh object, and the GC tracks which of those are still reachable and which can be reclaimed. The GC module exposes a small API: GC.start forces a collection, GC.disable temporarily pauses it, GC.count tells you how many collections have happened, and GC.stat returns a hash of internal counters covering the whole process lifetime (heap pages, slots in use, allocations, marks, promotions, you name it).

The counter we care about is :total_allocated_objects: a monotonically increasing integer that counts every object Ruby has allocated since the process started. Wrap any span of code with two reads of that counter and subtract them, and you get an exact count of how many objects that span created. No sampling, no extra gem, no instrumentation overhead beyond the two reads themselves.
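The "two reads and a subtraction" idea fits in a few lines. A minimal sketch, where allocations_during is a hypothetical helper name; note that the counter is process-wide, so allocations made by other threads during the span are included as well.

```ruby
# Count objects allocated while the block runs, using the
# process-wide :total_allocated_objects counter.
def allocations_during
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

# Allocating 1,000 objects (plus the array holding them) shows up in the diff.
count = allocations_during { Array.new(1_000) { Object.new } }
puts "allocated at least 1000 objects: #{count >= 1_000}"
```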

That’s the entire premise of this gate. Wrap the spec suite in before(:suite) / after(:suite) hooks, capture GC.stat(:total_allocated_objects) at start and end, write the diff to tmp/quality/allocations.json, compare against a baseline with a +15% tolerance band:

RSpec.configure do |config|
  # QualityAllocationsState is assumed to be a small holder object defined elsewhere
  config.before(:suite) do
    QualityAllocationsState.start = GC.stat(:total_allocated_objects)
  end

  config.after(:suite) do
    next if defined?(Mutant)  # Mutant's RSpec session would overwrite the artifact
    diff = GC.stat(:total_allocated_objects) - QualityAllocationsState.start
    out_dir = Rails.root.join("tmp/quality")
    FileUtils.mkdir_p(out_dir)
    File.write(out_dir.join("allocations.json"),
               JSON.pretty_generate(total_allocated_objects: diff))
  end
end

On trend-radar, the first run came in around 3.07M allocated objects across the whole suite, which became the baseline. With the +15% tolerance band, the gate’s effective ceiling sat just under 3.53M. To sanity-check that it actually catches a real regression, I dropped 1_000_000.times { String.new("waste") } into a random model spec and ran the gate again. The total climbed to 5.08M, the row went red, and the gate caught it on the first try. Worked exactly as designed.
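The comparison against the baseline is simple arithmetic. A sketch, assuming a hypothetical allocations_within_budget? helper and the rounded figures quoted above:

```ruby
# Hypothetical check: pass while the measured total stays within
# baseline * (1 + tolerance). Name and signature are assumptions.
def allocations_within_budget?(current, baseline, tolerance: 0.15)
  current <= (baseline * (1 + tolerance)).floor
end

baseline = 3_070_000 # ~3.07M from the first run

puts allocations_within_budget?(3_100_000, baseline) # => true
puts allocations_within_budget?(5_080_000, baseline) # => false
```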

Per-action SQL count via ActiveSupport::Notifications

For per-action tracking, I needed something more granular than GC.stat’s suite-wide total. After some brainstorming with AI, I decided to go with ActiveSupport::Notifications. Most Rails developers see its output every day without realizing it. It’s the pub/sub pipeline that powers the Started GET "/" and Completed 200 OK in 42ms log lines in development, and it’s the same mechanism gems like Skylight and New Relic build on.

ActiveSupport::Notifications fires a named event at every interesting lifecycle moment (a SQL query running, a controller action starting or finishing, a template rendering, a cache lookup, a mailer firing) with a bunch of details. Anyone in the process can subscribe to those events by name and run code in response. So all I had to do was subscribe to these three events:

  • start_processing.action_controller fires when Rails begins handling a request, with the controller and action name in its payload.
  • sql.active_record fires for every SQL query the request runs.
  • process_action.action_controller fires when the action completes.

Wiring it up is short and mechanical. When a request starts, I zero out a thread-local counter. On every sql.active_record event, I bump it. When the action finishes, I grab whatever the counter is at and update the running max for that Controller#action. After the whole suite runs, I dump those maxes to JSON. All of this lived in spec/support/action_metrics.rb:

require "fileutils"
require "json"

module ActionMetricsTracker
  SKIPPED_QUERY_NAMES = %w[ SCHEMA TRANSACTION CACHE ].freeze
  @data = {}

  class << self
    attr_reader :data

    def reset
      Thread.current[:action_metrics_sql_count] = 0
    end

    def increment
      Thread.current[:action_metrics_sql_count] =
        (Thread.current[:action_metrics_sql_count] || 0) + 1
    end

    def current
      Thread.current[:action_metrics_sql_count] || 0
    end

    def record(controller:, action:, count:)
      key = "#{controller}##{action}"
      @data[key] ||= { sql_count_max: 0 }
      @data[key][:sql_count_max] = [ @data[key][:sql_count_max], count ].max
    end
  end
end

ActiveSupport::Notifications.subscribe("start_processing.action_controller") do |*|
  ActionMetricsTracker.reset
end

ActiveSupport::Notifications.subscribe("sql.active_record") do |*args|
  payload = args.last
  next if ActionMetricsTracker::SKIPPED_QUERY_NAMES.include?(payload[:name])
  ActionMetricsTracker.increment
end

ActiveSupport::Notifications.subscribe("process_action.action_controller") do |*args|
  payload = args.last
  ActionMetricsTracker.record(
    controller: payload[:controller],
    action: payload[:action],
    count: ActionMetricsTracker.current
  )
end

RSpec.configure do |config|
  config.after(:suite) do
    next if defined?(Mutant)  # Mutant's RSpec session would overwrite the artifact
    out_dir = Rails.root.join("tmp/quality")
    FileUtils.mkdir_p(out_dir)
    File.write(
      out_dir.join("action_metrics.json"),
      JSON.pretty_generate(ActionMetricsTracker.data)
    )
  end
end

Which produces output like:

{
  "DashboardController#index":   { "sql_count_max": 3 },
  "TopicSubscriptionsController#create": { "sql_count_max": 5 },
  "MatchesController#dismiss":   { "sql_count_max": 2 }
}

The threshold rule for each action that I decided here is "+25% OR +3 queries, whichever is more permissive." The whole point of pairing a percentage band with an absolute floor is to give the app room to grow naturally without me bumping thresholds on every PR: the percentage absorbs proportional growth on bigger actions, and the absolute floor absorbs small absolute growth on tiny ones (a 3-query action growing to 4 isn’t a 33% regression worth firing on, but if it grew to 7 something real probably changed). I sanity-checked it the same way I checked the allocations gate: dropped 10.times { User.find_by(id: -1) } into DashboardController#index and watched that specific row fail (13 ≤ 6 ✗) while every other action stayed green.
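The "+25% OR +3 queries, whichever is more permissive" rule can be written as a ceiling function. A minimal sketch: sql_count_ceiling and sql_count_ok? are hypothetical names, and rounding the percentage band down is an assumption.

```ruby
# Hypothetical helper: the allowed maximum is the larger of
# baseline + 25% and baseline + 3 queries. Integer division rounds
# the percentage band down, which is an assumption.
def sql_count_ceiling(baseline, percent: 25, absolute: 3)
  [baseline + (baseline * percent / 100), baseline + absolute].max
end

def sql_count_ok?(observed, baseline)
  observed <= sql_count_ceiling(baseline)
end

puts sql_count_ceiling(3)   # => 6  (absolute floor wins on small actions)
puts sql_count_ceiling(20)  # => 25 (percentage band wins on bigger ones)
puts sql_count_ok?(4, 3)    # => true
puts sql_count_ok?(13, 3)   # => false
```

The last line is the DashboardController#index experiment: 3 baseline queries plus 10 injected ones gives 13, against a ceiling of 6.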

Why I removed both gates

To be fair here, both gates were technically doing their job. They caught the regressions I created, and the numbers looked fine. What I wanted to see next was how they’d hold up against the kind of growth that just happens as a project gets bigger, so I added 50 extra create(:user) calls to a before(:all) and re-ran the suite. Allocations climbed by about 1.6%, nowhere near the 15% limit I had set, and the per-action SQL count didn’t move at all, since it only counts queries during the controller invocation itself, not whatever fixtures the test sets up beforehand. So on paper, both gates had plenty of room left: dozens of normal-sized feature PRs could land before either threshold would need a bump.

One thing worth being upfront about: these two gates weren’t ideas I came up with on my own. They were AI suggestions that I picked up while brainstorming ways to close the performance gap, and I implemented them mostly to see what would happen. So this whole section is more about an experiment that didn’t pan out than about something I had a strong conviction in from the start.

But I ended up deciding to remove them anyway, and the reason is more about the kind of output they produce than the numbers I set.

A green allocation gate or a green per-action-SQL gate doesn’t tell you "the code is good." It tells you "the code didn’t grow too much yet." Every new feature we add slowly increases the number, and eventually you’ll hit the limit. You increase it because the growth was legitimate. You increase it one more time. The gate becomes overhead that produces mostly false positives, so when a real regression does appear, the most likely outcome is that the AI and even human reviewers wave it through and it ships.

The gates I kept all produce a different kind of signal: a binary one ("no new SQL injection," "no N+1 patterns," "specs cover this branch") rather than a number that keeps increasing over time. A binary signal creates a clear obligation (this needs to change), while a number that keeps drifting upward is a signal the AI needs to interpret, which leaves a lot of room for it to pile up slop over time.

That leaves the performance problem unsolved for now, if it is even completely solvable, to be honest. I’ll keep pushing on it, and once I have meaningful updates I’ll write a follow-up post.

Binary vs. Continuous Gates

The local/CI split

As our test suite grew, mutation testing got much slower, and that hurt the AI loop quite a lot. On trend-radar, quality:mutation normally takes about 2:50 locally and around 14 minutes on a free GitHub runner, because Mutant has to rerun the entire spec suite once per surviving mutation, and that adds up to hundreds of runs. All the other gates combined finish in around 15 seconds.

This will only grow, and waiting that long every time the AI generates new code would pile up idle time and slowly erode the agent’s productivity. So I split the gate into two tasks:

bin/rake quality:local   # ~15 seconds; everything except mutation
bin/rake quality         # ~3 min locally / ~15 min CI; full gate including mutation

The local task also cleans up after itself, so a stale mutation.json from a previous full run doesn’t sneak into the next report and confuse whoever’s reading it:

namespace :quality do
  task :clean_mutation_artifact do
    path = QUALITY_DIR.join("mutation.json")
    path.delete if path.exist?
  end

  desc "Fast local quality gate (everything except mutation testing). ~15s runtime."
  task local: %w[
    quality:clean_mutation_artifact
    quality:coverage
    quality:rubocop
    quality:critic
    quality:flog
    quality:brakeman
    quality:report
  ]
end

GitHub Actions still runs the full version on every PR, with concurrency turned on so that rapid-fire commits don’t pile up in the queue. CLAUDE.md tells the AI to lean on quality:local while iterating and only fall back to the full quality once it thinks it’s actually done.

There’s a tradeoff here, of course. A regression that only mutation testing would catch could, in theory, slip past quality:local and only fail in CI. In practice that almost never happens, CI catches it the rare time it does, and the iteration speedup turned out to be worth far more than the risk.

We will inevitably revisit this as the suite grows. Even the regular spec suite will eventually get slow, so we’ll need a way to run these steps only for the files related to a change. For now, this is fine.

The mutation bootstrap bug

The split didn’t just speed up iteration; it also uncovered a bug I’d been merging under for weeks. Once I had two independent mutation runs to compare, the numbers disagreed: local bin/rake quality reported a kill ratio of 74.77%, while CI reported 67.89%. The exact same code, exact same Mutant version, exact same Ruby version (4.0.2 on both). A gap of nearly 7 percentage points is way too big to just ignore.

My first instinct was that maybe one of the runs was just noisy, so I ran mutation twice locally to rule that out. Both came back at exactly 74.77%, completely deterministic, which meant the gap was real and something about CI was actually making Mutant behave differently.

This is also the part of the post I most wanted to highlight, because I think it shows something underappreciated about working with AI right now. Up to this point, I’d been using Claude mostly as a code generator, but here I had a real debugging problem with no obvious place to start: two numbers disagreeing across two environments, and no clear culprit. So instead of trying to figure it out myself first, I just handed the whole investigation over and watched what Claude Code did.

The first thing it suggested: "we can’t actually compare anything until CI uploads its mutation.txt as an artifact, that should be step zero." This sounds obvious, but given how little I know about setting up GitHub Actions, it was a possibility that wasn’t even on my radar at the beginning of this debug. I would probably have read through a bunch of code trying to figure out what was going on and hit a bunch of dead ends along the way. Here is the code it added to my GitHub workflow.

- name: Upload quality artifacts (mutation.txt etc.) for inspection
  if: always()
  uses: actions/upload-artifact@v4
  with:
    name: quality-artifacts
    path: tmp/quality/
    retention-days: 14

A really simple change in my CI yaml, and it gives me a downloadable zip after any run, including failures. I re-triggered CI, pulled the zip down, and pointed Claude at both mutation.txt files. Diffing 4361 mutations on each side by hand was never going to happen, but for a model with a file in front of it that’s a five-minute job. My only real instruction was "figure out which mutations are alive on CI but killed locally, group them by file, and tell me what the pattern is."
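The comparison itself is a set difference followed by a group-by. A sketch of that step, assuming the two reports have already been parsed into records: the Mutation struct and its :id, :file and :status fields are hypothetical, and mutation.txt itself is not in this shape.

```ruby
require "set"

# Hypothetical record for one mutation, parsed from either report.
Mutation = Struct.new(:id, :file, :status, keyword_init: true)

# Return { file => [mutation ids] } for mutations that are
# alive on CI but were reported as killed locally.
def alive_on_ci_only(local, ci)
  killed_locally = local.select { |m| m.status == :killed }.map(&:id).to_set

  ci.select { |m| m.status == :alive && killed_locally.include?(m.id) }
    .group_by(&:file)
    .transform_values { |mutations| mutations.map(&:id) }
end
```

Grouping by file is what makes a pattern visible: if most of the disagreeing mutations sit under controllers exercised by request specs, the environment becomes the obvious suspect.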

The answer came back almost immediately, and it was a little embarrassing: script/mutant_bootstrap.rb was not setting RAILS_ENV. CI sets RAILS_ENV: test at the workflow level, so Rails boots cleanly in test mode there. Locally bin/rake quality wasn’t exporting RAILS_ENV anywhere, which meant Rails was booting in development mode the whole time. CSRF protection on, dev middlewares loaded, show_exceptions behaving differently. This resulted in 33 request specs failing locally, every single one returning 403 instead of the expected redirect.

Here is where Mutant’s kill ratio gets a bit unintuitive. Each subject (a class+method pair) gets one "neutral" insertion (the original code, untouched) and N "evil" mutations. A neutral counts as killed only if the spec passes and an evil counts as killed only if the spec fails. Which means that when a spec is broken at baseline, the math gets reversed:

  • Neutral: spec fails → neutral becomes "alive" (1 per broken subject; reported but doesn’t move the kill ratio much)
  • Every evil mutation under that subject: spec also fails → all evils get killed

So 33 broken subjects locally translated into roughly 300 mutations getting credited as kills that those specs didn’t actually earn, which is exactly where the 7 percentage point inflation was coming from. CI’s 67.89% was the truthful number; the local 74.77% I’d been looking at for two weeks was just wrong.
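A toy model makes the inversion concrete. This is not Mutant’s code, just the scoring rule as described above applied to a hypothetical subject with 9 evil mutations, followed by a back-of-the-envelope check of the gap using the percentages and mutation count from this post.

```ruby
# Toy scoring rule (not Mutant's implementation): an evil mutation
# counts as killed whenever the spec run fails.
def evil_killed?(spec_fails)
  spec_fails
end

# Hypothetical subject: 9 evil mutations, of which a healthy spec detects 5.
detected_by_healthy_spec = [true, true, true, true, true, false, false, false, false]

healthy_kills = detected_by_healthy_spec.count { |spec_fails| evil_killed?(spec_fails) }

# A spec that already fails at baseline fails on every run,
# so every evil mutation is credited as killed.
broken_kills = detected_by_healthy_spec.count { |_| evil_killed?(true) }

puts healthy_kills # => 5
puts broken_kills  # => 9

# Gap between the two reported ratios, applied to 4361 mutations.
inflated = ((74.77 - 67.89) / 100 * 4361).round
puts inflated # => 300
```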

The fix turned out to be two lines:

# script/mutant_bootstrap.rb
ENV["RAILS_ENV"] ||= "test"
require_relative "../config/environment"
Rails.application.eager_load!

After this change the local kill ratio dropped to 67.89%, matching CI’s percentage.

What I take from all this is pretty simple: a quality gate that lies to you is worse than no gate at all. For weeks I’d been merging code under the assumption that 74% of mutations were being killed, when the actual number was closer to 68%. And the only reason any of this surfaced is that the local/CI split gave me two independent runs to compare. Without that, I’d have kept merging under the inflated number forever and probably never noticed.

What’s still missing

Two real gaps I didn’t close.

Diff-aware gate execution

Right now, bin/rake quality runs every check against the entire codebase regardless of what the PR actually did. A one-line change to a controller still triggers full Mutant runs across 4,361 mutations, of which probably ~3,000 are unrelated. Mutant supports --since main to limit mutations to lines changed against a base. RuboCop has similar diff-mode patterns. Brakeman re-scans everything, but the JSON output could be filtered.

A diff-aware version of the gate could be way faster on most PRs, which compounds the AI iteration improvement.
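One possible shape for that, as a sketch only: changed_ruby_files and diff_aware_commands are hypothetical helpers, the list of changed paths would come from something like git diff --name-only main...HEAD, and the exact Mutant invocation depends on your version and configuration.

```ruby
# Hypothetical helper: keep only Ruby source files from a list of changed paths.
def changed_ruby_files(paths)
  paths.select { |path| path.end_with?(".rb", ".rake") }
end

# Build the commands a diff-aware gate might run. Passing file paths to
# RuboCop limits it to those files; --since is the Mutant flag mentioned above.
def diff_aware_commands(paths, base: "main")
  files = changed_ruby_files(paths)
  return [] if files.empty? # nothing Ruby-related changed, skip both checks

  [
    ["bundle", "exec", "rubocop", *files],
    ["bundle", "exec", "mutant", "run", "--since", base]
  ]
end

commands = diff_aware_commands(["app/models/user.rb", "README.md"])
commands.each { |cmd| puts cmd.join(" ") }
```

Each command is an array, so it can be handed to system(*cmd) without shell quoting concerns.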

Real performance regression detection

I removed the allocations gate and the per-action SQL gate. They worked, technically, but as we discussed, they enforce a continuous-drift property that produces threshold-bumping ritual rather than real signal.

But the problem is still there: nothing in the current gate catches "this controller now takes 200ms instead of 80ms." Bullet catches certain N+1 patterns; coverage catches "you didn’t write a test"; mutation catches "your test is too shallow." We don’t have a single gate that effectively catches "this code is slower."

To be honest, I have a few hypotheses about how to close that gap, but they are only hypotheses, so I’ll probably write a follow-up once I’ve made some progress on this front.

Closing thoughts

The headline lesson of the previous post was: stop reading AI-generated code, measure it. The headline lesson of this one is a bit more nuanced.

Not every measurement is a signal. The continuous-drift gates I removed produced numbers that were technically accurate and operationally useless. The categorical gates I kept produced binary signals the AI could actually act on.

A gate that lies is worse than no gate. The mutation bootstrap bug was a 7-point lie I’d been merging under for two weeks. Finding it required not trusting the gate that was telling me good news.

Speed of feedback and thoroughness of feedback are a balance, not a choice. You need fast feedback to stay productive while iterating, and thorough feedback to actually trust what’s coming out the other end. Splitting the gate into a 15-second local pass and a full ~15-minute CI run was how I got both at once: the AI keeps a tight loop while iterating, and the slow checks still run before anything ships.

The full implementation is in the trend-radar repo: lib/tasks/quality.rake, spec/support/bullet.rb, script/mutant_bootstrap.rb, and .github/workflows/quality.yml are the files where most of this work lives.
