April 8, 2026 · 9 min read · LLM · Architecture · Cost

Choosing the right LLM for enterprise workflows

GPT-class isn't always the answer. We walk through the decision matrix we actually use — closed vs. open-weight, hosted vs. private, and how to think about cost, latency, and risk together.

“Should we use GPT?” is the wrong first question. The right first question is “what does this workflow actually need?” — and the answer almost never points at one single model.

Here's the decision matrix we walk every enterprise team through before recommending a model.

Five questions before model choice

  • Data sensitivity. Can this data leave your perimeter at all? If the answer is no, hosted closed-weight is off the table — full stop.
  • Latency budget. Are you in a sub-300ms transactional path, or a 30-second async pipeline? The answer changes the model class entirely.
  • Volume. 10k calls/day or 10M? Token cost compounds fast — small models often win at scale.
  • Domain specificity. Generic reasoning or deeply domain-coded text? Fine-tuned smaller models often beat frontier models on narrow tasks at a fraction of the cost.
  • Failure tolerance. Does a wrong answer here cost $5 in a refund, or $5M in a regulatory inquiry?
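The five questions can be sketched as a routing function. This is a minimal illustration, not our actual tooling: the thresholds (300ms, 100k and 1M calls/day) and the `Workflow` fields are hypothetical stand-ins for the judgment calls described above.

```python
from dataclasses import dataclass

@dataclass
class Workflow:
    data_can_leave_perimeter: bool   # data sensitivity
    latency_budget_ms: int           # transactional vs. async
    calls_per_day: int               # volume
    narrow_domain: bool              # domain specificity
    high_failure_cost: bool          # failure tolerance

def recommend_bucket(w: Workflow) -> str:
    """Route a workflow to one of the four buckets below."""
    if not w.data_can_leave_perimeter:
        return "open-weight self-hosted"   # hosted options are off the table
    if w.narrow_domain and w.calls_per_day >= 100_000:
        return "fine-tuned domain model"   # repetitive enough to pilot a fine-tune
    if w.latency_budget_ms < 300 or w.calls_per_day >= 1_000_000:
        return "hosted small model"        # cost and latency dominate
    return "frontier hosted"               # capability matters more than cost

# Example: high-volume extraction, no residency constraint
print(recommend_bucket(Workflow(True, 5_000, 2_000_000, False, False)))
# prints: hosted small model
```

In practice the questions interact (a high failure cost can push you back to a frontier model even at volume), which is why we treat this as a conversation, not a lookup table.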

The four buckets we typically land in

1. Frontier hosted (GPT/Claude/Gemini)

Best for: open-ended reasoning, agentic flows, anything where capability matters more than cost. Watch for: data-residency, vendor lock-in, surprise pricing model changes.

2. Hosted small models (Haiku, Mini-class)

Best for: high-volume classification, extraction, lightweight summarization. Often 10-30× cheaper with negligible quality drop on narrow tasks. Underused.
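To see how fast token cost compounds, here is the back-of-envelope math with hypothetical list prices ($10/Mtok frontier vs. $0.50/Mtok small-model class — check your vendor's current pricing):

```python
def monthly_cost(calls_per_day: int, tokens_per_call: int, price_per_mtok: float) -> float:
    """Rough monthly token bill in dollars (input + output tokens combined)."""
    tokens = calls_per_day * 30 * tokens_per_call
    return tokens / 1_000_000 * price_per_mtok

# 1M calls/day at ~2,000 tokens per call
frontier = monthly_cost(1_000_000, 2_000, 10.00)   # $600,000/mo
small    = monthly_cost(1_000_000, 2_000, 0.50)    # $30,000/mo
print(f"frontier ${frontier:,.0f}/mo vs small ${small:,.0f}/mo ({frontier / small:.0f}x)")
```

At these illustrative prices the gap is 20×, squarely in the 10-30× range above, and it scales linearly with volume.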

3. Open-weight self-hosted (Llama, Qwen, Mistral)

Best for: data that can't leave the perimeter, or workloads big enough that the GPU bill beats per-token pricing. Watch for: hidden ops cost — you're now running an inference platform.
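The "big enough that the GPU bill beats per-token pricing" threshold has a simple first-order break-even. The figures here ($8,000/mo for a dedicated GPU node, $0.50/Mtok hosted) are hypothetical, and the calculation deliberately ignores the ops headcount the warning above is about:

```python
def breakeven_mtok_per_month(gpu_monthly_cost: float, hosted_price_per_mtok: float) -> float:
    """Monthly token volume (in millions of tokens) above which a fixed
    GPU bill beats hosted per-token pricing. Excludes ops/platform cost."""
    return gpu_monthly_cost / hosted_price_per_mtok

print(breakeven_mtok_per_month(8_000, 0.50))  # 16,000 Mtok/mo, i.e. ~16B tokens
```

If your realistic volume is well below the break-even once you add the hidden ops cost back in, self-hosting is a compliance decision, not a cost decision.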

4. Fine-tuned domain models

Best for: high-volume, narrow, repetitive tasks where a small fine-tune lifts a small model above a frontier baseline. Almost always worth piloting if the task is repetitive enough.

The trap to avoid

Picking a model first and the workflow second. The workflow constraints decide the model — not the other way around.

Most enterprise AI bills are larger than they need to be because someone picked the most capable model for tasks that didn't need it. Match the model to the workflow, not the marketing.
