Why Gemma-4-31B is a gem for PLC logic — and why I couldn't ship it
Relay, the AI assistant inside Rungs Studio, runs on Google's Gemini-3-Flash. It's currently free, I want to keep it free, and every message sent costs me about a cent. That sounds trivial until you multiply it by a user base that keeps growing — so I went looking for a model that was just as good at PLC code but cheaper to run.
Relay is a tutor: it won't write a student's program for them, it nudges one step at a time. So why benchmark raw code generation at all? Because of a failure mode I kept watching. When a student is stuck on an exercise, Relay sometimes reasons its way to a wrong first answer and only corrects itself a few turns later, once the compiler and the unit tests push back. For a patient student that detour can build real understanding. More often it misleads and wears them out. The fix is a model good enough to one-shot the solution to every exercise on learn.rungs.dev, because one-shotting is the closest proxy I have for "actually understands the problem", and a model that understands the problem is far less likely to walk a student into a wall. So I benchmarked models on three things: tutoring conversations, generating Structured Text, and generating Ladder Logic.
What I found was a model that writes Ladder Logic as well as Gemini for roughly a tenth of the price. And then I couldn't use it. This is the story of both halves.
How I graded them: compile and run, no opinions
The whole benchmark rests on one decision: for generated code, a model's answer only counts if it survives execution. There's a better oracle than opinion here — the compiler — so the code numbers carry no LLM judgment at all.
Each model gets an exercise spec, its tags, and the exact test vectors the routine has to pass, and is asked for JSON — the local tags it needs and the LD or ST logic. I take that, assemble a real Add-On Instruction (an AOI — a reusable, self-contained block of logic, the Rockwell® equivalent of a function), compile it with the same compiler that grades student work in Studio, and run the same canonical tests. Pass means it compiled and every test passed. Nothing else counts.
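The pass rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual harness: the JSON keys `tags` and `logic` are hypothetical, and `compile_aoi` and `run_vector` are stand-ins for the real Studio compiler and test runner, which are not shown in this post.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Verdict:
    compiled: bool
    tests_passed: int
    tests_total: int

    @property
    def passed(self) -> bool:
        # Pass means it compiled AND every test passed. Nothing else counts.
        return self.compiled and self.tests_passed == self.tests_total


def grade(raw_answer: str,
          test_vectors: Sequence[dict],
          compile_aoi: Callable[[dict], Optional[object]],
          run_vector: Callable[[object, dict], bool]) -> Verdict:
    """Grade one model answer by execution only (no LLM judgment)."""
    total = len(test_vectors)
    try:
        answer = json.loads(raw_answer)
    except json.JSONDecodeError:
        return Verdict(False, 0, total)      # unparseable output is a failure
    if not isinstance(answer, dict) or "logic" not in answer:
        return Verdict(False, 0, total)      # hypothetical key name, see lead-in
    aoi = compile_aoi(answer)                # stand-in: returns None on compile error
    if aoi is None:
        return Verdict(False, 0, total)
    passed = sum(1 for vector in test_vectors if run_vector(aoi, vector))
    return Verdict(True, passed, total)
```

Injecting the compiler and test runner as callables keeps the sketch runnable without the real toolchain, and it keeps the grading rule in one place: a partial test pass is still a fail.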
The one place I can't compile is the tutoring set: a conversation has no unit tests. Those 72 conversations are graded by rubric with an LLM judge running at temperature 0. That's a softer signal than execution, so I treat the tutoring numbers as the soft ones throughout, and lean on the machine-graded code tasks for anything load-bearing.
The test sets: 40 Ladder Logic exercises and 40 Structured Text exercises — combinational logic, edge detection, timers, counters, comparisons and math, and timed state sequences like a traffic light — plus the 72 tutoring conversations. Everything runs on promptfoo with a custom provider, repeated where it counts. Only the rubric judge is pinned to temperature 0; the models under test answer at their own defaults, thinking budgets and all — I'm grading them the way a student would actually hit them. Ladder Logic got the deepest treatment, because it's ~75% of what our users actually write.
Where I started: the cheap end of the field
I didn't begin with Gemma and Gemini — I began with a budget. Relay has to stay cheap, so I set a soft ceiling of about $5 per million output tokens and screened roughly a dozen models that came in under it: Gemma, the cheaper Geminis, GLM, Mistral's Devstral, MiniMax, Xiaomi's MiMo, Moonshot's Kimi, a couple of DeepSeeks, Claude-Haiku, GPT-Mini. (Gemini-3.5-Flash went in too, above the ceiling, as the quality benchmark to measure everyone against.) Grading this many models honestly isn't free, by the way — I burned close to $500 in tokens running these evals. The irony of spending $500 to shave fractions of a cent off a per-message bill is not lost on me. I've learned a lot in the process.
Three tasks, the same dozen models, June 2026 — and I kept every column the harness records, because cost, latency, and throughput end up mattering as much as raw accuracy.
Ladder Logic
| Model | Pass | Total cost | Tokens | Avg latency | Tok/s |
|---|---|---|---|---|---|
| Gemini-3-Flash | 90% | $0.49 | 561k | 12.9s | 166 |
| MiMo-v2.5-Pro | 90% | $0.29 | 562k | 40.6s | 65 |
| Gemini-3.5-Flash | 90% | $1.35 | 547k | 18.6s | 96 |
| GLM-5.2 | 83% | $1.37 | 606k | 48.4s | 79 |
| Gemma-4-31B | 83% | $0.10 | 561k | 81.8s | 25 |
| DeepSeek-V4-Pro | 78% | $0.29 | 570k | 29.7s | 86 |
| DeepSeek-V4-Flash | 68% | $0.09 | 558k | 33.3s | 68 |
| Kimi-K2.7 | 38% | $0.44 | 453k | 2.7s | 40 |
| GPT-5.4-Mini | 33% | $0.36 | 454k | 1.4s | 77 |
| MiniMax-M3 | 30% | $0.14 | 463k | 4.2s | 27 |
| Claude-Haiku-4.5 | 25% | $0.56 | 484k | 2.6s | 65 |
| Devstral-2 | 23% | $0.20 | 472k | 3.5s | 38 |
Structured Text
| Model | Pass | Total cost | Tokens | Avg latency | Tok/s |
|---|---|---|---|---|---|
| Gemini-3.5-Flash | 95% | $1.21 | 531k | 14.6s | 95 |
| GLM-5.2 | 93% | $0.95 | 513k | 20.1s | 73 |
| Gemini-3-Flash | 93% | $0.40 | 528k | 8.3s | 158 |
| DeepSeek-V4-Pro | 93% | $0.24 | 511k | 12.9s | 86 |
| Gemma-4-31B | 90% | $0.08 | 510k | 49.0s | 22 |
| MiMo-v2.5-Pro | 85% | $0.23 | 496k | 16.4s | 59 |
| DeepSeek-V4-Flash | 83% | $0.08 | 523k | 22.0s | 64 |
| Kimi-K2.7 | 80% | $0.44 | 452k | 1.7s | 57 |
| GPT-5.4-Mini | 80% | $0.36 | 454k | 1.2s | 85 |
| MiniMax-M3 | 73% | $0.14 | 464k | 4.1s | 31 |
| Claude-Haiku-4.5 | 70% | $0.55 | 523k | 2.0s | 87 |
| Devstral-2 | 65% | $0.20 | 472k | 4.9s | 28 |
Tutoring
| Model | Pass | Total cost | Tokens | Avg latency | Tok/s |
|---|---|---|---|---|---|
| Gemini-3.5-Flash | 99% | $2.20 | 1016k | 9.2s | 135 |
| Gemma-4-31B | 99% | $0.15 | 989k | 40.4s | 20 |
| Gemini-3-Flash | 97% | $0.65 | 987k | 6.6s | 128 |
| MiMo-v2.5-Pro | 94% | $0.44 | 953k | 15.7s | 50 |
| DeepSeek-V4-Flash | 94% | $0.14 | 967k | 16.0s | 46 |
| DeepSeek-V4-Pro | 94% | $0.45 | 970k | 11.8s | 66 |
| GLM-5.2 | 93% | $1.70 | 972k | 17.8s | 63 |
| Kimi-K2.7 | 92% | $1.21 | 973k | 16.5s | 79 |
| GPT-5.4-Mini | 92% | $0.69 | 888k | 1.3s | 81 |
| Claude-Haiku-4.5 | 90% | $1.08 | 1021k | 3.8s | 53 |
| MiniMax-M3 | 89% | $0.28 | 909k | 5.6s | 32 |
| Devstral-2 | 82% | $0.39 | 925k | 5.9s | 33 |
Read down the Pass columns and the core finding is right there: Ladder Logic rips the field open (23% → 90%), Structured Text compresses it (65–95%), tutoring is a wash (82–99%). Ladder is the only task that tells these models apart. The ones that bomb LD recover on ST — Claude-Haiku 25%→70%, Devstral 23%→65%, GPT-Mini 33%→80% — because ST is plain text while LD is a strict, vendor-specific DSL, and the one I test is Rockwell's. Most models have barely seen it.
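The spread per task is plain arithmetic on the Pass columns. A small sketch that recomputes it from the numbers in the three tables above (the dictionary layout and function names are only for illustration):

```python
# Pass rates (%) copied from the tables above: (ladder, structured text, tutoring).
PASS_RATES = {
    "Gemini-3-Flash":    (90, 93, 97),
    "MiMo-v2.5-Pro":     (90, 85, 94),
    "Gemini-3.5-Flash":  (90, 95, 99),
    "GLM-5.2":           (83, 93, 93),
    "Gemma-4-31B":       (83, 90, 99),
    "DeepSeek-V4-Pro":   (78, 93, 94),
    "DeepSeek-V4-Flash": (68, 83, 94),
    "Kimi-K2.7":         (38, 80, 92),
    "GPT-5.4-Mini":      (33, 80, 92),
    "MiniMax-M3":        (30, 73, 89),
    "Claude-Haiku-4.5":  (25, 70, 90),
    "Devstral-2":        (23, 65, 82),
}
TASKS = ("Ladder Logic", "Structured Text", "Tutoring")


def spread(task_index: int) -> tuple[int, int, int]:
    """Return (lowest, highest, gap) pass rate for one task."""
    rates = [row[task_index] for row in PASS_RATES.values()]
    return min(rates), max(rates), max(rates) - min(rates)


def ladder_to_st_gain(model: str) -> int:
    """Points a model recovers when moving from ladder to Structured Text."""
    ladder, st, _ = PASS_RATES[model]
    return st - ladder


for index, task in enumerate(TASKS):
    low, high, gap = spread(index)
    print(f"{task}: {low}% to {high}% (spread {gap} points)")
```

This prints a 67-point spread for Ladder Logic, 30 for Structured Text, and 17 for tutoring, which is the whole argument for ranking on ladder.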
The Latency column hides a second story. The models that win on ladder are the ones that think — Gemma 82s, GLM 48s, MiMo 41s a generation — while the brand-names answer in one to four seconds and get it wrong. They're not faster; they're failing fast.
And Google anchors the top of all three. The only non-Google model that keeps pace on ladder is the surprise of the screen: MiMo-v2.5-Pro (Xiaomi) — 90% on ladder, $0.29 a run. If you want a cheap non-Google PLC generator, that's the one I'd point you at.
One row in that table looks more tempting than it is. DeepSeek-V4-Flash is the cheapest thing in the whole field ($0.09 a run), and V4-Pro is genuinely decent at 78% on ladder. But cheap isn't the bar — correct on ladder is, and on the one task that actually separates these models both DeepSeeks land below the top tier. V4-Pro's 78% and V4-Flash's 68% are a clear step down from the 90% frontier, so they drop out on performance, not price. DeepSeek stays a benchmark reference, not a candidate.
But the screen pointed somewhere specific: the two Google families — Gemini and Gemma — were the ones worth taking apart in detail. So I did.
The gem: Gemini-class ladder for a tenth of the price
Here's the headline. Gemini-3-Flash against Gemma-4-31B, head to head on the full 40-exercise Ladder Logic set, single pass:
| Model | Pass rate (clean prompt) | Pass rate (with examples) |
|---|---|---|
| Gemini-3-Flash | 39/40 (98%) | 37/40 (93%) |
| Gemma-4-31B | 35/40 (88%) | 38/40 (95%) |
| Gemini-3.1-Flash-Lite | 28/40 (70%) | 32/40 (80%) |
| Gemma-4-26B-A4B | 23/40 (58%) | 27/40 (68%) |
These are head-to-head on a lean generation prompt; the broad tables above are on the heavier production prompt, where the same models sit a few points lower (Gemini-3-Flash 90% there vs 98% here). The ranking holds — the numbers just get noisier.
Read it slowly. On a clean prompt Gemini-3-Flash is ahead (39/40 vs 35/40), but Gemma-4-31B, Google's open-weight model, lands in the same tier at a fraction of the cost, and once the prompt carries a couple of worked examples Gemma actually takes the lead on ladder (38/40 vs 37/40). Structured Text is a quieter story: there Gemma runs around 90%, three to five points (one or two exercises out of 40) behind the Geminis rather than ahead. ST is the task where every decent model clusters, so it can't separate them the way ladder does. Gemma's real value shows up on the graphical language, not the textual one.
For a tool that has to be both cheap and correct, "Gemini-class ladder accuracy at a tenth of the price, and it tops the chart with examples" is the whole ballgame. I thought I was done.
Why the cheap models fail: it's syntax, not reasoning
Before I get to why I'm not using Gemma, the single most useful thing I learned — and the thing I'd tell anyone generating ladder. I bucketed every failure from the cheapest Gemini tier (Flash-Lite) against Gemma on the hardest exercises:
| Failure mode | Flash-Lite | Gemma-4-31B |
|---|---|---|
| Doesn't even compile | 59% | 11% |
| Compiles, wrong logic | 6% | 19% |
| Pass | 35% | 71% |
Six out of ten Flash-Lite answers don't compile at all. It isn't worse at the PLC problem — it's worse at writing valid ladder. It wraps comparisons in contacts that don't exist, invents operators like == and < that Ladder Logic simply doesn't have, breaks rung structure. Gemma gets the grammar right 89% of the time; its misses are valid ladder with the wrong idea — meaning it's actually competing on the problem instead of tripping on the syntax.
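The bucketing itself is mechanical once every attempt has a compile result and a test result. A minimal sketch, assuming each attempt is recorded as a `(compiled, all_tests_passed)` pair; that record shape and the bucket names are assumptions for illustration, not the harness's real format:

```python
from collections import Counter
from typing import Iterable, Tuple

BUCKETS = ("no_compile", "wrong_logic", "pass")


def bucket(compiled: bool, all_tests_passed: bool) -> str:
    """Classify one attempt: syntax failure, logic failure, or pass."""
    if not compiled:
        return "no_compile"        # never produced valid ladder
    return "pass" if all_tests_passed else "wrong_logic"


def failure_shares(results: Iterable[Tuple[bool, bool]]) -> dict:
    """Percentage of attempts in each bucket, rounded to whole points."""
    counts = Counter(bucket(compiled, ok) for compiled, ok in results)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no attempts to bucket")
    return {name: round(100 * counts[name] / total) for name in BUCKETS}


# Made-up attempts purely to show the shape of the output (not benchmark data).
demo = [(False, False)] * 6 + [(True, False)] + [(True, True)] * 3
print(failure_shares(demo))
```

Separating "doesn't compile" from "compiles, wrong logic" is what lets the table distinguish a syntax problem from a reasoning problem.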
Why? Ladder is a niche domain-specific language (DSL) with a stack of hard constraints: function-call form, comparisons-as-contacts, every rung ending in an output, no infix operators, specific timer and counter member names. Holding all of them at once is the hard part, and the cheap, fast tier is exactly where it slips.
Thinking is non-negotiable on ladder — and I was wrong about the level
Gemini-3-Flash lets you dial a thinking budget — the reasoning tokens it burns before answering. My first read of it was wrong, and it's worth saying so. Early single runs made the default (heavy) budget look like it was over-thinking into worse answers, with medium strictly better. Running each level five times across all 40 ladder exercises — 200 attempts a level — told a calmer story:
| Thinking level | Ladder pass | Avg output tokens |
|---|---|---|
| Off (low) | 104/200 (52%) | ~100 |
| Medium | 174/200 (87%) | ~3,700 |
| High | 175/200 (88%) | ~7,100 |
Read across it. With thinking effectively off the model emits ~100 tokens, barely more than the answer itself, and collapses to 52%; it genuinely needs to deliberate to write valid LD. Medium and high are a dead heat at 87–88%, but high burns roughly double the tokens to buy one extra pass out of 200.
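The cost side of that table is easier to see as tokens per passing answer. A rough sketch using the counts from the table; the token averages are the approximate "~" figures, so the results are estimates, and the function names are only illustrative:

```python
import math

# (passes, attempts, approx. average output tokens per attempt) from the table above.
LEVELS = {
    "off":    (104, 200, 100),
    "medium": (174, 200, 3_700),
    "high":   (175, 200, 7_100),
}


def tokens_per_pass(passes: int, attempts: int, avg_tokens: float) -> float:
    """Output tokens spent across all attempts, divided by passing answers."""
    return avg_tokens * attempts / passes


def marginal_tokens_per_extra_pass(cheap, costly) -> float:
    """Extra tokens the costlier level spends for each additional pass it buys."""
    p1, n1, t1 = cheap
    p2, n2, t2 = costly
    extra_passes = p2 - p1
    if extra_passes <= 0:
        return math.inf            # the costlier level buys nothing
    return (t2 * n2 - t1 * n1) / extra_passes


for name, row in LEVELS.items():
    print(f"{name}: {row[0] / row[1]:.1%} pass, ~{tokens_per_pass(*row):,.0f} tokens per pass")

extra = marginal_tokens_per_extra_pass(LEVELS["medium"], LEVELS["high"])
print(f"medium -> high: ~{extra:,.0f} extra tokens per extra pass")
```

On these figures, medium costs roughly 4,250 output tokens per passing answer and high roughly 8,100, while the step from medium to high spends about 680,000 extra tokens across the run to gain a single pass.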
So the conclusion survived — ship medium — but for the cost reason, not because high hurts. "More thinking made it worse" was a small-sample mirage. Thinking level does, by the way, nothing for tutoring: conversational quality is a wash across every level. It's purely a code-generation lever.
That last point had a direct payoff. When I checked the live Relay handler, it had no thinking level pinned at all — it was running the heavy default budget on a task (tutoring) where thinking buys nothing. Pure overspend. I pinned it to medium and trimmed the bill with zero quality loss. The benchmark paid for itself before I'd even chosen a model.
Worked examples: a crutch for weak models, a trap for strong ones
I tried dropping real, passing solutions into the prompt and re-running the hardest exercises five times each. The result split the field cleanly:
| Model | No examples | With examples | Δ (passes) |
|---|---|---|---|
| Gemini-3-Flash | 73/85 (86%) | 68/85 (80%) | −5 |
| Gemma-4-31B | 60/85 (71%) | 75/85 (88%) | +15 |
| Gemini-3.1-Flash-Lite | 30/85 (35%) | 50/85 (59%) | +20 |
Three things, all stable across the runs. First, examples don't teach the exercise they're examples of — the traffic-light solution sat in the prompt verbatim and every model still scored zero on traffic-light. Complex multi-rung patterns don't get copied. Second, the gains are pure transfer to other exercises — the weak models learned general idioms (preset-via-move, accumulator comparisons, one-shot edge detection) and applied them broadly. Third, and the part that surprised me: examples reliably hurt the strong model. Gemini already knows the idioms, so transfer gains it nothing, and a worked example anchors it onto a complex pattern it then reproduces with subtle errors — worse than its own instinct.
The lesson for prompt design: examples are a lever for the cheap models, not the strong one. They patch exactly the syntax-shaped weakness that holds small models back, and get in the way of a model that doesn't have it.
The trap: the same model isn't the same on every host
Gemma's numbers, remember, depend on thinking being on. Here's what I learned the expensive way: a model name doesn't tell you what a host actually serves you.
On Google, Gemma emits ~3,000 reasoning tokens an answer and scores 88% on ladder. On every gateway and the third-party providers behind them, the same model runs with thinking off by default (zero reasoning tokens) and drops to 73%. The fifteen-point gap isn't reliability. It's that only Google's serving turns Gemma's thinking on by default. What a host serves (thinking on or off, how heavily, the quantization) matters as much as the model you asked for.
I chased the thinking-on configuration across four hosts:
| Host | Thinking | The catch |
|---|---|---|
| Google AI Studio | on (~3,000 tok, 88%) | hard quota cap on Gemma — survives even on a paid key (the billing lifts Gemini, not Gemma) |
| OpenRouter, reasoning on | on (~900 tok, 83%) | ran out of credits mid-run; depends on a babysat balance |
| Vercel AI Gateway | off (73%) | reliable and cheap, but no switch turns Gemma's thinking back on |
| SambaNova | on | tier rate-limited hard enough to be unusable |
Every box that has the thinking has a fatal operational flaw. Google has it but caps it. OpenRouter has it but needs hand-fed credits. Vercel is rock-solid but serves it lobotomized. SambaNova throttles. None of it cleared the bar.
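One way to check what a host is actually serving is to look at the reasoning-token count it reports per response. A hedged sketch, assuming the host returns an OpenAI-style `usage.completion_tokens_details.reasoning_tokens` field; not every host does, some report it under another name, and a missing field proves nothing either way:

```python
from statistics import mean
from typing import Iterable, Mapping


def reasoning_tokens(response: Mapping) -> int:
    """Read the reasoning-token count from one parsed response body.

    Assumes an OpenAI-style usage block; returns 0 when the field is absent.
    """
    usage = response.get("usage") or {}
    details = usage.get("completion_tokens_details") or {}
    return int(details.get("reasoning_tokens") or 0)


def thinking_report(responses: Iterable[Mapping]) -> dict:
    """Summarise how often, and how heavily, a host reports reasoning."""
    counts = [reasoning_tokens(r) for r in responses]
    if not counts:
        raise ValueError("no responses to inspect")
    return {
        "responses": len(counts),
        "with_reasoning": sum(1 for c in counts if c > 0),
        "avg_reasoning_tokens": mean(counts),
    }
```

Running a handful of benchmark prompts through each candidate host and comparing these reports before trusting any pass rate would surface a thinking-off configuration within a few requests.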
And the tables already showed the other half of it: Gemma is the slowest model in the field — 82 seconds a ladder answer with thinking on, against Gemini-3-Flash's 13. Cheap tokens don't help a student who's already closed the tab.
So I dropped Gemma. Not on quality — on operability. A model you can't access reliably in the configuration that makes it good is not a model you can ship. That's the blunt lesson, and it's the one I most want other people benchmarking these models to internalize: the leaderboard number is the easy part; the question is whether you can get that number, every request, at 2 a.m., without babysitting a credit balance. The production finalists became the two Gemini models — a little pricier, gloriously boring, reliable on a paid key. In production, boring is the entire point.
Then I tried to make the prompt better — and Structured Text gave the secret away
With Gemma gone, I spent a while trying to push the two Gemini models higher on ladder by improving the prompt. This is where I nearly fooled myself, and where the most interesting finding fell out.
My first prompt tweaks looked like they helped some exercises and hurt others — until I noticed two completely different edits producing the identical per-exercise flips. That's not the prompt. That's variance: run the same model five times and individual exercises swing on their own. At small sample sizes you're reading tea leaves. So I rebuilt the harness to run only the handful of genuinely hard exercises many times over and compare the aggregate against a real confidence interval — a change only counts if it clears the noise.
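A sketch of what "clears the noise" can mean in practice. The post doesn't say which interval the harness uses; this example assumes a 95% Wilson score interval and treats a change as real only when the two intervals don't overlap, which is a deliberately conservative rule. Repeated runs of the same exercise are not fully independent, so read the result as a rough guide.

```python
import math


def wilson_interval(passes: int, attempts: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a pass rate (z = 1.96)."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    p = passes / attempts
    z2 = z * z
    denom = 1 + z2 / attempts
    centre = (p + z2 / (2 * attempts)) / denom
    half = z * math.sqrt(p * (1 - p) / attempts + z2 / (4 * attempts ** 2)) / denom
    return centre - half, centre + half


def clears_noise(baseline: tuple[int, int], variant: tuple[int, int]) -> bool:
    """True only when the two (passes, attempts) intervals do not overlap."""
    low_b, high_b = wilson_interval(*baseline)
    low_v, high_v = wilson_interval(*variant)
    return low_v > high_b or high_v < low_b


# Thinking-level counts from the earlier table, reused as a sanity check.
print(clears_noise((104, 200), (174, 200)))  # off vs medium
print(clears_noise((174, 200), (175, 200)))  # medium vs high
```

On the thinking-level counts, off versus medium separates cleanly while medium versus high does not, which matches the "dead heat" reading.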
With honest measurement, a clear pattern emerged. Teaching the model timer semantics — what the preset and accumulator and done-bit actually do — was a real, durable win. Keeping a few idiom examples in the prompt was load-bearing; removing them cost a brutal 17 points. But every attempt to tell the model how to order its rungs backfired the same way: it would fix the simple timing cases and break the trickier seal-in patterns by exactly as much. The model over-applies any ordering rule you hand it. I reverted all of them.
And then the tell. Take traffic-light, the hardest timed sequence in the set. Gemini writes it correctly in Structured Text every single time, across every run — a clean 100%. In ladder it scores 0% — not once passing, in any run I did. Same model, same behavior, opposite outcome, and it's the most stable result in the whole benchmark.
It isn't only traffic-light. The timer exercises lean the same way, just noisier: pulse and on-delay are a clean 100% in Structured Text but swing hard in ladder — I've watched pulse land anywhere from 13% to 100% depending on the run. Structured Text pins them; ladder can't.
The model knows the behavior; it just can't reliably serialize it into ladder's rung-order form. That's not a knowledge gap — it's a representational one. Which is exactly why no amount of prompt facts closes it: you can't teach a model something it already knows. The prompt converged around 91% on ladder, and the rest of the gap isn't a missing instruction — it's a limit in how well today's models map timing logic onto a graphical language. The lever with real headroom isn't a better prompt at all; it's letting the model generate, run the tests, and fix its own mistakes. That's the next experiment.
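The generate, test, and fix loop could look something like the sketch below. It is a guess at the shape of that experiment, not a description of it: `generate` and `check` are hypothetical callables standing in for the model call and the compile-and-test step, and the retry prompt wording is made up.

```python
from typing import Callable, Optional


def repair_loop(spec: str,
                generate: Callable[[str], str],
                check: Callable[[str], Optional[str]],
                max_rounds: int = 3) -> tuple[Optional[str], int]:
    """Generate, run the checks, and feed failures back until pass or budget runs out.

    `check` returns None on success, or the compiler/test feedback as text.
    Returns (passing answer or None, rounds used).
    """
    prompt = spec
    for round_no in range(1, max_rounds + 1):
        answer = generate(prompt)
        feedback = check(answer)
        if feedback is None:
            return answer, round_no
        # Retry with the failed attempt and the concrete error attached.
        prompt = (
            f"{spec}\n\nPrevious attempt:\n{answer}\n\n"
            f"Compiler/test feedback:\n{feedback}\n\nFix the attempt."
        )
    return None, max_rounds
```

Capping the rounds matters for a tutoring product: every extra round adds latency and tokens, so the budget has to be weighed against the pass rate it buys.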
What I'm shipping, and what's next
So: Relay stays on Gemini-3-Flash at medium thinking — cheaper than it was last week, reliable, and, on the only metric that's hard to fake, correct. That means living with ~90% on ladder for now, and I'm treating that as a floor, not a finish line. Gemma was the better deal on paper and I'd switch in a heartbeat the day someone serves it thinking-on without a quota wall or a credit meter.
And this isn't a one-time verdict. Harder exercises are landing on learn.rungs.dev, so I'll re-run the whole benchmark on a bigger, meaner exercise set against whatever models exist by then. The point of all this is to keep Relay on the leanest model that's still right for the people learning on it — and that target keeps moving.
If you're running these models yourself, the short version is: grade them by compiling the output, not by vibe; rank them on ladder, because nothing else separates them; prefer the models that take time to think over the fast ones, because valid ladder syntax is exactly where the fast ones fail; and check what your host is actually serving before you trust a leaderboard.
I'm keeping Relay's core free (for now), and these benchmarks are part of how I make that sustainable — finding the leanest model that's still right. If you're generating PLC code with LLMs, teaching with them, or just experimenting, I'd genuinely like to hear what you're seeing; come argue with my numbers in the community discussion. One rung at a time.
