Stop Reading AI Code. Start Measuring It. (A Rails Playbook.)

So I was scrolling through Twitter the other day — half-reading, not really looking for anything — and this tweet from Uncle Bob caught my attention:

Screenshot of a tweet by Uncle Bob Martin (@unclebobmartin), posted April 14 2026, arguing that humans should stop reviewing AI-written code line by line and instead measure it via test coverage, dependency structure, cyclomatic complexity, module sizes, and mutation testing. The tweet has 143.4K views, 1K likes, 135 retweets, and 59 replies.

My first reaction was: bullshit, you still have to read the code.

Then I actually sat with it for a minute, and I realized I was lying to myself.

I don’t read every line of AI-generated code. Nobody does. We skim the diff, we make sure it looks "roughly right," we run the tests, and we ship. The real question isn’t whether we’re reading the code line by line — we’re not — it’s whether there’s anything else holding the line when we don’t.

Uncle Bob’s list is that something else. So I grabbed a small Rails app I’d been building with Claude and wired every one of those metrics into a single gate. The proposition: if AI touches the codebase, the gate decides whether the code ships. This post walks through what each metric actually measures, the Ruby tool that computes it, and the bits of code I glued together to make bin/rake quality work.

By the end, you’ll have a blueprint for any Rails app, and since you’ll be handing the actual plumbing to your coding agent, this should take a lot less than an afternoon.


The setup

The app is a small Rails 8 service called trend-radar — 5 controllers, 4 models, 1 service, RSpec + rubocop-rails-omakase for baseline. The goal: one command (bin/rake quality) that runs every metric, compares each against a threshold in config/quality_thresholds.yml, and exits non-zero if any gate fails.

Here’s what the final output looks like:

Quality gates
=============

Line coverage             96.6%   >= 95.0%    ✓
Branch coverage           91.1%   >= 90.0%    ✓
Flog max (method)         19.8    <= 20       ✓
Flog max (class)          66.8    <= 70       ✓
Mutation kill ratio       69.6%   >= 69.5%    ✓

5/5 gates passed.

That’s the fence. Every AI-produced change — whether it’s a new controller action, a refactor, or a bug fix — has to clear this bar before I declare the task done.

Let me walk you through each row.


Metric 1 — Test Coverage (the bare minimum)

The concept: test coverage measures what percentage of your source code actually executes during the test suite. There are two flavors:

  • Line coverage — did this line run?
  • Branch coverage — for every if/unless/&&/ternary, did both arms run?

Branch coverage is the stricter one. A method can have 100% line coverage while only exercising the happy path of every conditional; branch coverage catches that.
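A tiny illustration of the gap — the method and values here are hypothetical:

```ruby
# One conditional, two arms.
def discount_rate(user)
  user[:vip] ? 0.2 : 0.0
end

# A suite that only ever passes a VIP user runs every line of the method
# (100% line coverage), but the false arm of the ternary never executes --
# branch coverage reports that conditional as half-covered.
puts discount_rate(vip: true)   # 0.2 -- the only arm a happy-path suite exercises
puts discount_rate(vip: false)  # 0.0 -- the arm branch coverage flags as missed
```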

Why it matters for AI code: coverage is necessary but not sufficient. A low number definitely means "this AI wrote code that no test touches" — unsafe to ship. A high number means "at least the tests run the code" — but it says nothing about whether the tests check anything (we’ll get to that with mutation testing).

The tool: SimpleCov. Ruby’s de-facto coverage gem, wired into RSpec via one require at the top of spec/spec_helper.rb.

The implementation:

# spec/spec_helper.rb — before anything else
require "simplecov"

SimpleCov.start "rails" do
  enable_coverage :branch
  add_filter "/spec/"
  add_filter "/config/"
  add_filter "/db/"
  add_filter "/lib/tasks/"
end

After every spec run, SimpleCov writes coverage/.last_run.json:

{ "result": { "line": 96.65, "branch": 91.07 } }

A tiny parser reads that and hands it to the aggregator:

# lib/quality/coverage_parser.rb
require "json"

module Quality
  class CoverageParser
    def initialize(path)
      @path = path
    end

    def parse
      result = JSON.parse(File.read(@path)).fetch("result")
      { line: result["line"], branch: result["branch"] }
    end
  end
end
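To sanity-check the parser in isolation, feed it a synthesized .last_run.json (inline copy of the class for a standalone run; a Tempfile stands in for the real path):

```ruby
require "json"
require "tempfile"

# Inline copy of Quality::CoverageParser, standalone.
class CoverageParser
  def initialize(path)
    @path = path
  end

  def parse
    result = JSON.parse(File.read(@path)).fetch("result")
    { line: result["line"], branch: result["branch"] }
  end
end

Tempfile.create("last_run.json") do |f|
  f.write('{ "result": { "line": 96.65, "branch": 91.07 } }')
  f.flush
  parsed = CoverageParser.new(f.path).parse
  puts parsed[:line]    # 96.65
  puts parsed[:branch]  # 91.07
end
```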

Threshold: 95% line, 90% branch. Aspirational — picked to force the AI (and me) to actually test new code, not just write it.


Metric 2 — Cyclomatic Complexity & Method Size (the smell detector)

The concept: how much logic is packed into one method? There are two overlapping measures:

  • Cyclomatic complexity — count the independent paths through a method. Every if, unless, case/when branch, && or || inside a condition, and rescue adds one. High numbers mean the method has many branches to reason about.
  • Flog (ABC score) — a weighted count of Assignments, Branches (method calls, including operators like == and +), and Conditions. The score is sqrt(A² + B² + C²) — your Pythagorean distance from "this method does nothing."

The two metrics overlap but aren’t identical. Cyclomatic asks how many paths, Flog asks how much stuff. A method with one linear flow of 30 method calls has low cyclomatic complexity but a high Flog score.
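A hypothetical method showing that divergence — no branches at all, so cyclomatic complexity sits at the floor, while the pile of assignments and calls pushes the Flog score up:

```ruby
# One linear flow: cyclomatic complexity 1, but every assignment and
# call below feeds the Flog (ABC) score.
def report_line(nums)
  total  = nums.sum
  mean   = total.to_f / nums.size
  spread = nums.max - nums.min
  format("n=%d mean=%.1f spread=%d", nums.size, mean, spread)
end

puts report_line([2, 4, 6])  # n=3 mean=4.0 spread=4
```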

Why it matters for AI code: AI loves to produce plausible-looking monolithic methods. A controller action that does three database queries, builds a hash with ten keys, conditionally renders one of four ways — that’s exactly what an LLM generates when you ask for a feature. Complexity metrics put a ceiling on how far into that hole it can go before the gate fails.

The tools:

  • Rubocop’s Metrics cops (Metrics/CyclomaticComplexity, Metrics/MethodLength, Metrics/AbcSize, Metrics/ClassLength) — part of standard Rubocop. rubocop-rails-omakase disables them by default; we re-enable them.
  • Flog (the flog gem) — for ABC-based complexity.

The implementation — enabling the Rubocop cops:

# .rubocop.yml
inherit_gem: { rubocop-rails-omakase: rubocop.yml }

Metrics/ClassLength:
  Enabled: true
  Max: 100
  Exclude:
    - 'spec/**/*'
    - 'db/**/*'

Metrics/MethodLength:
  Enabled: true
  Max: 15

Metrics/AbcSize:
  Enabled: true
  Max: 15

Metrics/CyclomaticComplexity:
  Enabled: true
  Max: 6

Metrics/PerceivedComplexity:
  Enabled: true
  Max: 7

The implementation — Flog via its Ruby API:

# lib/quality/flog_parser.rb
require "flog"

module Quality
  class FlogParser
    def initialize(paths)
      @paths = Array(paths)
    end

    def parse
      flog = Flog.new
      flog.flog(*@paths)  # pass ALL paths; Flog#flog resets state per call

      totals = flog.totals
      return { method_max: 0.0, class_max: 0.0 } if totals.empty?

      {
        method_max: totals.values.max,
        class_max: max_class_score(totals)
      }
    end

    private

    def max_class_score(totals)
      totals
        .group_by { |method_name, _| method_name.split(/[#.]/, 2).first }
        .values
        .map { |entries| entries.sum { |_, score| score } }
        .max
    end
  end
end

One gotcha worth calling out: I initially wrote the parser to loop @paths.each { |p| flog.flog(p) } — which silently destroys data, because Flog#flog resets internal state on every call. Always pass paths as a splat.

Thresholds: Method Flog ≤ 20, class Flog ≤ 70. I started with Uncle Bob’s aggressive numbers (15 and 40) and found they punish normal Rails CRUD controllers — a 5-action admin/topics_controller naturally sums to ~60 Flog across all its methods. Calibrating to reality was a deliberate choice, and worth being honest about: the metrics are guidelines, not gospel. If the threshold forces you to refactor working Rails idioms just to pass, the threshold is wrong.

A quick note worth adding here: in the pre-AI era, complexity cops were genuinely annoying to live with. Every time Metrics/MethodLength yelled at you for a 16-line method, you had to stop what you were doing, figure out the cleanest split, run the tests, fix whatever broke during the split, re-run rubocop — lather, rinse, repeat. The metrics were useful; the maintenance was a tax. Most teams disabled these cops for a reason.

That equation flips once a coding agent is doing the refactoring. You re-enable Metrics/MethodLength, the gate fails, you paste the message to Claude, it extracts a private method, updates the tests, re-runs the gate — thirty seconds, done. Metrics that were always good ideas, but that nobody had time for, are finally cheap to enforce. Uncle Bob’s argument reads differently through this lens: he’s not replacing human effort with numbers; he’s pointing out that the numbers we always knew were valuable have finally become viable, because the machine writing the code is also the machine cleaning it up.


Metric 3 — Module Sizes (the single-responsibility nudge)

The concept: how many lines in a class or module? A User model that’s grown to 400 lines is trying to tell you something: it’s accumulated responsibilities that don’t belong together.

This is a rough-but-cheap proxy for the Single Responsibility Principle — the S in SOLID, and one of Robert Martin’s own rules. SRP says a class should have "only one reason to change." A class that’s sprawled to 300+ lines almost always has several reasons to change, because it’s doing several unrelated jobs under one roof. Line count can’t prove an SRP violation, but it’s the cheapest indicator we have: long class → probably multiple responsibilities → time to split.

There’s a nice irony here, too: the same person tweeting about measuring AI-generated code gave SRP its name decades ago. His "measure instead of read" list is, implicitly, a list of mechanical checks for the principles he’s spent a career trying to teach humans manually.

Why it matters for AI code: AI is happy to keep adding methods to whatever file you asked it to modify. There’s no internal voice saying "this should be a new service object." The module-size gate becomes that voice — and because extracting a service is cheap for the coding agent to do (see the Metric 2 note), you can finally hold the SRP line without paying the usual refactoring tax.

The tool: Rubocop’s Metrics/ClassLength cop (already enabled in the snippet above). Its sibling, Metrics/ModuleLength, applies the same limit to bare modules; enable it too if you keep much logic there.

Threshold: 100 lines per class, excluding specs and migrations.

There’s no fancy parser here — Rubocop’s JSON output already includes offenses for these cops. The same RubocopParser that reads complexity violations reads length violations too, pulling the measured value out of the offense message via regex:

# lib/quality/rubocop_parser.rb
require "json"

module Quality
  class RubocopParser
    COPS = {
      class_length_max: "Metrics/ClassLength",
      method_length_max: "Metrics/MethodLength",
      abc_size_max: "Metrics/AbcSize",
      cyclomatic_complexity_max: "Metrics/CyclomaticComplexity",
      perceived_complexity_max: "Metrics/PerceivedComplexity"
    }.freeze

    # Captures the first "<measured>/<max>" pair in the offense message.
    # Handles "[143/100]" and "[<7, 18, 3> 19.55/15]" shapes.
    MEASURE_RE = %r{([\d.]+)/[\d.]+}

    def initialize(path)
      @path = path
    end

    def parse
      offenses = JSON.parse(File.read(@path)).fetch("files").flat_map { |f| f["offenses"] }
      COPS.transform_values { |cop_name| max_for(cop_name, offenses) }
    end

    private

    def max_for(cop_name, offenses)
      values = offenses
        .select { |o| o["cop_name"] == cop_name }
        .map { |o| o["message"][MEASURE_RE, 1]&.to_f }
        .compact

      return nil if values.empty?

      picked = values.max
      picked == picked.to_i ? picked.to_i : picked
    end
  end
end
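The JSON input itself comes from running Rubocop with its JSON formatter — bundle exec rubocop --format json --out tmp/quality/rubocop.json. Here’s the extraction logic run standalone against a synthesized payload (the offense messages mimic real cop output):

```ruby
# First "<measured>/<max>" pair in the message is the measured value.
MEASURE_RE = %r{([\d.]+)/[\d.]+}

payload = {
  "files" => [
    { "offenses" => [
      { "cop_name" => "Metrics/ClassLength",
        "message"  => "Class has too many lines. [143/100]" },
      { "cop_name" => "Metrics/AbcSize",
        "message"  => "Assignment Branch Condition size for create is too high. [<7, 18, 3> 19.55/15]" }
    ] }
  ]
}

offenses = payload.fetch("files").flat_map { |f| f["offenses"] }
offenses.each do |o|
  puts "#{o["cop_name"]}: #{o["message"][MEASURE_RE, 1]}"
end
# Metrics/ClassLength: 143
# Metrics/AbcSize: 19.55
```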

Metric 4 — Mutation Testing (the truth detector)

This is the one most Rails projects skip, and it’s the one that matters most for AI code.

The concept: test coverage tells you a line ran during tests. It says nothing about whether the tests would notice if that line was wrong. Mutation testing proves the difference.

The algorithm:

  1. Your test suite is green.
  2. A tool takes your source code and applies a tiny automated change — swap > for >=, replace true with false, delete a .save! call, negate a condition. This is called a mutation.
  3. Re-run the tests against the mutated code.
  4. If at least one test now fails, the mutation is killed (your tests caught it).
  5. If all tests still pass, the mutation survives — your tests don’t actually assert that behavior.
  6. Kill ratio = killed mutations / total mutations.

100% line coverage with a 30% mutation kill ratio means your tests execute the code but barely check anything. Mutation testing is the only metric here that measures whether tests assert behavior rather than merely exercise lines.
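The loop is easy to feel in miniature. This toy (not Mutant itself, just the idea) applies one mutation by hand — > becomes >= — and shows which style of test notices:

```ruby
def adult?(age) = age > 18
def mutated_adult?(age) = age >= 18  # the mutation: > became >=

# Weak test: only checks a value far from the boundary.
weak   = ->(f) { f.call(30) == true }
# Strong test: also pins down the boundary itself.
strong = ->(f) { f.call(30) == true && f.call(18) == false }

puts weak.call(method(:adult?))           # true
puts weak.call(method(:mutated_adult?))   # true  -- the mutation SURVIVES
puts strong.call(method(:adult?))         # true
puts strong.call(method(:mutated_adult?)) # false -- the mutation is KILLED
```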

Why it matters for AI code — crucially: AI can write tests that look thorough but don’t actually check the important things. An LLM might write:

it "creates a subscription" do
  post :create, params: { topic_id: topic.id }
  expect(response).to redirect_to(topics_path)
end

That test has 100% line coverage of create. Mutation testing will immediately notice: if you delete subscription.save, the test still passes (redirect happens either way). The test exercised the action; it didn’t assert what the action was supposed to do.

The tool: Mutant (mbj/mutant) with its RSpec integration. A heads-up on licensing: Mutant runs a dual-license model — free for open-source projects, paid commercial license ($30/month or $250/year per developer) for private codebases. For this research project I’m running under the open-source terms. The main community fork, mutest, exists but hasn’t seen a release since 2019, so it’s not a practical alternative on modern Ruby.

Worth noting: the maintainers of Mutant see the problem we’re discussing the same way we do. The literal tagline on their GitHub homepage reads:

"AI writes your code. AI writes your tests. But who tests the tests?"

— Mutant’s project description on GitHub

The implementation — Mutant needs Rails eager-loaded before it can discover subjects:

# script/mutant_bootstrap.rb
require_relative "../config/environment"
Rails.application.eager_load!

The rake task that runs Mutant, captures its output, parses the kill ratio, and applies the ratchet:

desc "Run Mutant against app/ and lib/quality; ratchet threshold on first run"
task mutation: :environment do
  # QUALITY_DIR (tmp/quality) and MUTANT_SUBJECTS are constants defined
  # at the top of this rakefile.
  FileUtils.mkdir_p(QUALITY_DIR)
  txt_path = QUALITY_DIR.join("mutation.txt")

  cmd = [
    "bundle", "exec", "mutant", "run",
    "--integration", "rspec",
    "--require", "./script/mutant_bootstrap.rb",
    "--usage", "opensource",
    "--", *MUTANT_SUBJECTS
  ]
  sh "#{cmd.shelljoin} > #{txt_path.to_s.shellescape} 2>&1 || true"

  parsed = Quality::MutantParser.new(txt_path).parse
  File.write(QUALITY_DIR.join("mutation.json"), JSON.pretty_generate(parsed))

  ratchet_if_unset!(parsed[:kill_ratio])
end

And a parser that extracts the kill ratio from Mutant’s text report:

# lib/quality/mutant_parser.rb
module Quality
  class MutantParser
    class ParseError < StandardError; end

    PATTERNS = {
      mutations: /^Mutations:\s+(\d+)$/,
      kills:     /^Kills:\s+(\d+)$/,
      coverage:  /^Coverage:\s+([\d.]+)%$/
    }.freeze

    def initialize(path)
      @path = path
    end

    def parse
      text = File.read(@path)
      extracted = PATTERNS.transform_values { |re| text.match(re)&.[](1) }

      raise ParseError, "Could not find Coverage line in #{@path}" if extracted[:coverage].nil?

      {
        mutations: extracted[:mutations].to_i,
        kills: extracted[:kills].to_i,
        kill_ratio: extracted[:coverage].to_f
      }
    end
  end
end
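Those regexes can be checked standalone against a synthesized report whose lines mirror the shape of Mutant’s summary output:

```ruby
text = <<~REPORT
  Mutations:       1775
  Kills:           1233
  Coverage:        69.46%
REPORT

patterns = {
  mutations: /^Mutations:\s+(\d+)$/,
  kills:     /^Kills:\s+(\d+)$/,
  coverage:  /^Coverage:\s+([\d.]+)%$/
}

extracted = patterns.transform_values { |re| text.match(re)&.[](1) }
puts extracted[:mutations]  # 1775
puts extracted[:coverage]   # 69.46
```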

The threshold philosophy — this one is special. Getting a codebase from a 50% mutation kill ratio to 90% is a genuine investment. Forcing that upfront would kill the project. So this gate uses a ratchet: on the very first run, whatever ratio comes out becomes the floor. Every future run has to match or beat that number. The gate’s rule is simple: don’t get worse.

Here’s the ratchet logic:

def ratchet_if_unset!(kill_ratio)
  path = Rails.root.join("config/quality_thresholds.yml")
  thresholds = YAML.load_file(path)

  return unless thresholds.dig("mutation", "kill_ratio_min").nil?

  (thresholds["mutation"] ||= {})["kill_ratio_min"] = kill_ratio # tolerate a missing "mutation" key
  File.write(path, thresholds.to_yaml)
  puts "[quality:mutation] Ratchet set: mutation.kill_ratio_min = #{kill_ratio}"
end

Our initial run produced 69.46% across 1,775 mutations. That’s the baseline.


Metric 5 — Dependency Structure (the one we didn’t implement)

The concept: Uncle Bob’s fifth metric asks: does your architecture have cycles? Do high-level modules depend on low-level ones? Are package boundaries being violated? In principled architecture (hexagonal, clean, etc.) the answer should be no.

The Ruby tool: Packwerk (by Shopify) — define your codebase as packages with public APIs, fail the build when one package reaches into another’s internals.

Why we skipped it: this is a 5-controller Rails monolith. Carving it into three packages with public boundaries would be architecture-theatre — more package.yml maintenance than actual feature work. And idiomatic Rails already handles most of what "dependency structure" means in practice: the MVC skeleton imposes a rough layering, Zeitwerk autoloading keeps constant references predictable, and a small app has nowhere to hide violations.

The honest blog-post answer: this metric doesn’t map cleanly onto a small Rails monolith. I’m not pretending I measured it when I didn’t. If the app grew to 40 controllers, I’d revisit.

This is worth saying out loud because the temptation with metric-driven development is to invent a number for every category just to check the box. Don’t. A skipped metric, documented honestly, is more valuable than a metric you had to torture your architecture to satisfy.
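For when that day comes, the moving parts are small: a Packwerk package is a directory with a package.yml declaring what it may depend on. A sketch, with hypothetical package names:

```yaml
# packs/subscriptions/package.yml (hypothetical)
enforce_dependencies: true
dependencies:
  - packs/topics
```

bin/packwerk check then fails any constant reference that crosses a package boundary without a declared dependency — the same "gate, not review" shape as everything else in this post.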


Putting it all together

Everything above collapses into one rake task. The 4 data-producing sub-tasks each write a JSON blob under tmp/quality/. The aggregator reads them, compares against config/quality_thresholds.yml, prints the table, and exits 0 or 1.

# config/quality_thresholds.yml
coverage:
  line_min: 95.0
  branch_min: 90.0
flog:
  method_max: 20
  class_max: 70
rubocop_metrics:
  class_length_max: 100
  method_length_max: 15
  abc_size_max: 15
  cyclomatic_complexity_max: 6
  perceived_complexity_max: 7
mutation:
  kill_ratio_min: 69.46   # ratcheted from first run

The aggregator is ~50 lines and testable in isolation with synthesized fixture data:

# lib/quality/report.rb (excerpted)
module Quality
  class Report
    GATES = [
      { name: "Line coverage",      measure: [:coverage, :line],      threshold: ["coverage", "line_min"],      cmp: :>=, unit: "%" },
      { name: "Branch coverage",    measure: [:coverage, :branch],    threshold: ["coverage", "branch_min"],    cmp: :>=, unit: "%" },
      { name: "Flog max (method)",  measure: [:flog, :method_max],    threshold: ["flog", "method_max"],        cmp: :<=, unit: "" },
      { name: "Flog max (class)",   measure: [:flog, :class_max],     threshold: ["flog", "class_max"],         cmp: :<=, unit: "" },
      { name: "Mutation kill ratio", measure: [:mutation, :kill_ratio], threshold: ["mutation", "kill_ratio_min"], cmp: :>=, unit: "%" }
    ].freeze

    def initialize(measurements:, thresholds:)
      @measurements = measurements
      @thresholds = thresholds
      @gate_results = build_gate_results
    end

    def passed?
      @gate_results.all?(&:passed?)
    end
  end
end
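The excerpt elides build_gate_results; here is one plausible shape (GateResult and the dig-based lookups are my sketch, not necessarily the repo’s actual code):

```ruby
# Minimal sketch: a gate passes when <value> <cmp> <threshold> holds.
GateResult = Struct.new(:name, :value, :threshold, :cmp) do
  def passed? = value.public_send(cmp, threshold)
end

gates = [
  { name: "Line coverage", measure: [:coverage, :line],
    threshold: %w[coverage line_min], cmp: :>= }
]
measurements = { coverage: { line: 96.6 } }
thresholds   = { "coverage" => { "line_min" => 95.0 } }

results = gates.map do |g|
  GateResult.new(g[:name], measurements.dig(*g[:measure]),
                 thresholds.dig(*g[:threshold]), g[:cmp])
end
puts results.all?(&:passed?)  # true
```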

And the finale — one line added to the project’s CLAUDE.md so the AI itself runs the gate before declaring work done:

Run bin/rake quality before declaring a task complete. Do not commit if any gate fails. Report the gate numbers in your response so regressions are visible.


What this buys you

Once this is in place, a few things change about how I work with AI.

I can trust AI more easily and spend my attention where it actually matters. The gate handles the mechanical checks — "Did you test this?", "Did this method just double in complexity?", "Do the tests assert anything real?" — so I don’t have to. My review shifts from line-by-line nitpicking to the things only a human should be deciding: is this the right abstraction? does this match the product intent? are we solving the actual problem? The boring parts are automated; the interesting parts get my full focus.

Regressions become visible instantly. "Line coverage dropped 2 points" is a concrete thing to push back on. "I don’t love this code" is not.

The AI’s output gets better over time. Once Claude knows bin/rake quality is the last step, it starts pre-empting the gate — writing the tests alongside the code, keeping methods short, avoiding monolithic controllers. Not because it became smarter, but because the feedback loop got tighter.

I have real data to argue with. The Uncle Bob tweet is provocative. The numbers are persuasive. When a teammate pushes back on AI code, I can show them a gate table. When AI produces something objectively worse than the baseline, I can show it the same table.


The limits

A few things this doesn’t catch, and it’s worth being clear-eyed about them:

  • Security. Complexity gates don’t catch SQL injection or other classic vulnerabilities.
  • Performance. The gate has no opinion on N+1 queries, a rogue includes that loads half the database into memory, an unbounded loop, or a 4-second request. A method can be short, well-tested, mutation-proof, and still bring the app to its knees under load.
  • Race conditions & concurrency bugs. Mutation testing can tell you whether your tests assert the happy path; it can’t tell you whether two concurrent requests stepping on the same row will corrupt your data. Tests for concurrency are hard to write, and neither the complexity metrics nor the mutation metric will surface the absence of them.
  • Memory leaks. A class-level cache that grows without bounds, a global subscriber list that never unsubscribes, memoized state on a long-lived object — all of these pass every gate on day one, then take the process down after a week in production. Tests run in fresh processes and don’t see accumulation.
  • Idiomatic fit. A method can clear every gate and still be "not how we do it here." That’s what code review is for — but scoped to architecture and taste, not line-by-line correctness.
  • Intent. The gate can’t tell if you built the right feature. It can only tell you the code you built is fit for purpose.

Each of these deserves its own treatment — which tools help, how to bolt them onto the same kind of gate, when they’re worth the overhead — and this post is already long enough. I’ll cover them in a follow-up.

Uncle Bob’s argument isn’t "stop thinking about code." It’s: "stop spending human attention on things metrics can verify, so you can spend it on the things that actually need a human brain."

That, I think, is the move.


Resources


The full implementation in this post is real code from a work-in-progress Rails 8 app I’ve been building with Claude. If you want to see the rest — the rake task wiring, the parsers, the tests — it’s all in the repo.
