The On-Premise LLM Lottery

I lost a full long weekend to OpenClaw once. Friday to Sunday, trying to get it running locally with an on-premise LLM on my personal machine. The machine was reasonably capable, though nothing close to a datacenter. I had watched something like forty hours of tutorials first, so I went in confident. Every video made it look like four steps: download, install, connect a model, start automating.

By Sunday evening nothing worked. The wall I hit had nothing to do with my configuration. It was a real limitation in the tool itself, on the hardware I had, with no path around it from the user side. The clean take in the tutorial had been produced in a controlled setup, and the hours of debugging that preceded it never made it into the video.

That weekend taught me something I keep coming back to when someone tells me they want to run their models on-premise. I hadn't picked the wrong app. I'd treated "run an LLM locally" as a task with a fixed answer, when it is actually a search.

There is a whole genre of content that presents local inference as a tidy checklist. Download the app, check your RAM, pick a model size that fits, done. And for the audience those guides are written for — one person, one laptop, one chat window — that genuinely is the whole job. They are answering a smaller question than the one an enterprise is asking, and they answer it well. The problem starts when someone in a regulated company reads that checklist and assumes their version is the same problem with more zeros in the hardware budget.

It isn't. In the enterprise version, your hardware is already fixed by procurement decisions made before you arrived. Your runtime is constrained by what your ops team will agree to support. Regulatory and compliance requirements narrow which models you can even consider before you benchmark a single one. The checklist collapses under those constraints, and what's left underneath is a matchmaking problem with several moving axes.

The axes nobody puts in the checklist

After that weekend I stopped looking for the right app and started running an actual search. I tested roughly twenty models before one fit. That number wasn't diligence for its own sake. There are many possible combinations of model, quantization, and runtime, and most of them simply don't work on any given hardware, so finding one that does takes that many attempts.

Those combinations vary along a few axes, and the axes interact with each other. Model size is the obvious one, but the parameter count on the label tells you almost nothing until you pair it with quantization. A model that won't load at full precision can run comfortably once it is quantized to fewer bits per weight, with a quality cost you have to measure rather than assume. Then the runtime layer sits on top of that: I settled on vLLM or Ollama, depending on the case, because the runtime is what turns "the weights exist on disk" into "the thing answers a request at an acceptable latency." And all of that has to land on hardware you don't get to choose.
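To make the interaction between parameter count and quantization concrete, here is a rough back-of-the-envelope sketch. It counts the weights only (parameters times bits per weight, divided by eight) and ignores the KV cache, activations, and runtime overhead, so treat the result as a lower bound. The function names and the 20% headroom figure are my own illustrative assumptions, not something any particular runtime reports.

```python
def weight_memory_gib(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the weights alone, in GiB.

    Lower bound only: KV cache, activations and runtime overhead are ignored.
    """
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits(params_billions: float, bits_per_weight: float,
         device_memory_gib: float, headroom: float = 0.2) -> bool:
    """True if the weights fit while keeping `headroom` of device memory free.

    The 20% default is an illustrative assumption, not a measured value.
    """
    budget = device_memory_gib * (1 - headroom)
    return weight_memory_gib(params_billions, bits_per_weight) <= budget


if __name__ == "__main__":
    # Hypothetical example: a 7B-parameter model on a device with 8 GiB.
    for bits in (16, 8, 4):
        size = weight_memory_gib(7, bits)
        print(f"7B at {bits}-bit: {size:.1f} GiB of weights, "
              f"fits in 8 GiB: {fits(7, bits, 8)}")
```

An estimate like this only tells you which combinations are worth loading at all. Whether the quantized model is still good enough for the task is a separate measurement.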

The published guides almost always cover exactly one point in this space and present it as universal. A tutorial that shows a specific model at a specific quantization on a specific GPU is accurate. It is also useless the moment your GPU is different, which it always is. The tutorial isn't lying; it just isn't your situation, and the gap between a working demo and a working installation in your environment is the entire job.

Gemma turned out to be the fit for my hardware, and the reason had little to do with rankings: at the quantization and runtime I could actually deploy, it gave me acceptable output at a latency I could live with. Someone else, on different hardware, with a different tolerance for latency, lands somewhere else entirely. There is no best model in general, only the best model for a specific combination of circumstances, and you find it by running the combinations.

Why teams quit before they finish

The reason this matters commercially is that most teams give up partway through the search and don't realize that's what happened. They test three or four models, hit the same kind of wall I hit that weekend, and conclude either that local inference doesn't work or that they need to spend more on hardware. Sometimes more hardware is the answer. Often it isn't, and they've just stopped short of the combination that would have worked on what they already own.

The search costs real engineering time, and that's the part leadership doesn't budget for. Benchmarking a model isn't downloading it and asking it a question. It's loading it at several quantization levels, measuring throughput and latency under something resembling real load, checking output quality against your actual tasks, and doing that across enough candidates to be confident you've found a fit rather than the first thing that didn't crash. On a fixed hardware target, that is days to weeks of an engineer's time before you write a line of application code.
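As a minimal sketch of what "measuring" means here, the harness below times sequential calls to a `generate` callable and reports median latency, 95th-percentile latency, and requests per second. The `generate` callable is a placeholder for however you call your runtime, and `fake_generate` is a hypothetical stand-in that just sleeps. Neither is a real vLLM or Ollama API, and this covers latency only, not output quality.

```python
import math
import statistics
import time
from typing import Callable, Dict, List


def benchmark(generate: Callable[[str], str], prompts: List[str],
              warmup: int = 1) -> Dict[str, float]:
    """Time sequential calls to `generate` and summarize the latencies."""
    if not prompts:
        raise ValueError("need at least one prompt")

    for prompt in prompts[:warmup]:
        generate(prompt)  # warm-up calls are not timed

    latencies = []
    start = time.perf_counter()
    for prompt in prompts:
        t0 = time.perf_counter()
        generate(prompt)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start

    latencies.sort()
    # Nearest-rank 95th percentile over the sorted latencies.
    p95_index = max(0, math.ceil(0.95 * len(latencies)) - 1)
    return {
        "requests": len(latencies),
        "median_s": statistics.median(latencies),
        "p95_s": latencies[p95_index],
        "requests_per_s": len(latencies) / elapsed,
    }


def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; it only sleeps."""
    time.sleep(0.01)
    return prompt.upper()


if __name__ == "__main__":
    print(benchmark(fake_generate, ["first task", "second task", "third task"]))
```

In a real run you would swap `fake_generate` for a function that sends the prompt to the runtime under test, use prompts drawn from your actual tasks, and repeat the whole thing per model and per quantization level.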

There is also a quieter failure mode. A model that loads and answers in a demo can fall over under concurrent requests, or produce output that's fine for a chat toy and unacceptable for the compliance-sensitive task you actually need it for. The runtime tuning I did, adjusting for responsiveness so the thing felt usable rather than merely technically functional, was its own cycle of test, measure, adjust. That work never shows up in the four-step version because the four-step version was never asked to serve more than one person.
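One way to see the concurrency problem before users do is to send the same prompts one at a time and then several at once, and compare the worst-case latency. The sketch below does that with a thread pool. The single-slot backend is a hypothetical simulation of a server that handles one request at a time, not a model of how any specific runtime schedules work.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def timed_call(generate: Callable[[str], str], prompt: str) -> float:
    """Return the wall-clock seconds one call to `generate` takes."""
    t0 = time.perf_counter()
    generate(prompt)
    return time.perf_counter() - t0


def latency_under_load(generate: Callable[[str], str], prompts: List[str],
                       concurrency: int) -> List[float]:
    """Issue the prompts with up to `concurrency` calls in flight at once."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(lambda p: timed_call(generate, p), prompts))


def make_single_slot_backend(service_time_s: float) -> Callable[[str], str]:
    """Hypothetical backend that serves one request at a time."""
    lock = threading.Lock()

    def generate(prompt: str) -> str:
        with lock:  # concurrent callers queue up here
            time.sleep(service_time_s)
        return prompt

    return generate


if __name__ == "__main__":
    backend = make_single_slot_backend(0.05)
    prompts = ["same task"] * 4
    solo = latency_under_load(backend, prompts, concurrency=1)
    loaded = latency_under_load(backend, prompts, concurrency=4)
    print(f"worst latency, one at a time: {max(solo):.2f}s")
    print(f"worst latency, four at once:  {max(loaded):.2f}s")
```

Threads are enough for this kind of load generation because the client spends its time waiting on the server. Against a real endpoint, `generate` would be a blocking request to your runtime, and the interesting number is how fast the worst-case latency grows as `concurrency` goes up.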

The case for doing the search anyway

None of this is an argument against on-premise inference. For a healthcare or finance client who can't send patient records or transaction data to a third-party API, on-prem is the constraint, not a preference. When that's your situation, the matchmaking phase stops being optional overhead and becomes the foundation the rest of the project stands on.

The teams that finish the search end up with something durable. They know, concretely, which model runs on their hardware, at which quantization, under which runtime, at what latency, for which tasks. That knowledge is specific to their environment and expensive to reproduce, which is exactly what makes it a moat. A competitor who wants the same capability has to run the same search on their own hardware. There's no shortcut around it, which is the whole point.

A team that hasn't run the search and claims they can run AI on-prem is describing a future promise, not a current capability.

The distinction sounds pedantic until the deadline arrives and someone has to explain to leadership why "just run it locally" turned into a month of benchmarking. I've watched that conversation happen. It goes better when the month was budgeted up front as part of the strategy rather than discovered halfway through as a surprise.

So before you commit to on-prem as a strategy, run a scoped version of the search first. Pick your three or four most plausible model-and-quantization combinations for the hardware you already have, benchmark them against a real task, and record which ones load, what latency and output quality each delivers, and how far that leaves you from what the task requires. If your team can't tell you which combination fits their hardware today, that's the gap to close before anyone promises a timeline. In our workflow audits, that scoping exercise is usually the first thing we do, because it's the difference between a plan and a wish.