All writingArchitecture6 May 20269 min read

Self-hosted LLMs vs the OpenAI API: when each makes sense for B2B

A practical decision framework for B2B leaders weighing self-hosted open-weight models against managed LLM APIs — covering compliance, latency, cost at scale, and the operational tax that's rarely on the slide.

If you've sat in a room where a CTO and a CISO disagree about whether to call OpenAI or stand up Llama on your own GPUs, you already know that this isn't really a model question. It's a commitment question — about data, about cost shape, about the kind of organisation you're building. The answer depends on five honest trade-offs, and most of the noise in this debate exists because people argue from one of them while quietly ignoring the other four.

Here's the framework we use with clients deciding between a managed LLM API and a self-hosted open-weight model.

1. Data sensitivity and regulatory exposure

If your inputs to the model contain personally identifiable information that crosses jurisdictional boundaries, protected health information, material non-public financial data, or anything covered by an enforceable customer contract that prohibits third-party processing — self-host. The default OpenAI / Anthropic / Google enterprise contracts are good. They're not better than your contractual obligations to your own customers and regulators.

If your inputs are public, internal-only-but-not-sensitive, or already flowing through dozens of SaaS providers anyway, the data argument for self-hosting collapses. Your data is already in a hundred clouds. Adding one more changes nothing material.

2. Cost shape, not cost level

The cost comparison most people show is wrong because it compares the wrong things. The right comparison is the cost shape — fixed versus variable — at your actual usage volume.

Managed APIs are pure variable cost. You pay per token. At low volume this is dramatically cheaper than self-hosting because you're not paying for idle GPUs.
Self-hosting is mostly fixed cost — GPU rental or capex, then a small marginal cost per inference. At high sustained volume the per-token math eventually crosses over and self-hosting wins.
The crossover point depends on your model, hardware, utilisation, and whether you can keep GPUs busy enough to amortise them. For mid-size open-weight models on rented H100s, the crossover is typically somewhere between 50 and 200 million tokens per month, depending on how good you are at batching.

3. Latency and reliability requirements

Network round-trip to a managed API typically adds 50-200ms over a self-hosted model in the same data centre. For most B2B workflows — drafting emails, summarising tickets, answering knowledge-base questions — this is invisible. For sub-second interactive products or trading-floor use cases, it isn't.

Reliability cuts both ways. Managed APIs have outages you can't fix; you're at the mercy of OpenAI's incident response. Self-hosted systems have outages you must fix — and if your team has never owned production GPU infrastructure before, you'll discover that the operational tax is higher than the cost difference suggests. Most companies underweight this.

4. Capability ceiling

The frontier closed models are still genuinely better than the best open-weight models on the hardest tasks. If your application requires GPT-class reasoning on complex multi-step problems, self-hosting an open-weight model means accepting a meaningful capability gap. For routine knowledge work — classification, summarisation, structured extraction, retrieval-augmented Q&A — the gap is much smaller, and well-tuned open-weight models are competitive.

Be honest with yourself about which task class you're actually building for. Most production B2B AI is closer to the second category than the first.

5. Strategic optionality

Self-hosting buys you optionality you can't easily buy back later: model independence, the ability to fine-tune, freedom from upstream pricing changes, and a moat against the day a competitor builds something on top of the same managed API and prices you out of the market.

Calling a managed API buys you optionality of a different kind: time-to-market, easy upgrades when better models ship, and the freedom to focus your engineering team on the integration problem rather than the infrastructure problem.

A simple decision rule

We tell clients to start with the managed API for almost every greenfield case. Ship the workflow, prove the value, get real usage data. Then revisit the self-hosting question with hard numbers — your actual token volume, your actual sensitivity profile, your actual latency requirements. The decision becomes obvious once you have data.

The exception is when point one above is non-negotiable: regulated data that legally cannot leave your perimeter. In that case, self-host from day one and accept the operational tax as a cost of being in the business you're in.

If you're trying to make this decision for a real workload right now, we run a 1-week architectural assessment that produces a written recommendation tailored to your data, volume, and constraints — no fluffy slides. Get in touch if that would help.

All writingArchitecture6 May 20269 min read

Self-hosted LLMs vs the OpenAI API: when each makes sense for B2B

Here's the framework we use with clients deciding between a managed LLM API and a self-hosted open-weight model.

1. Data sensitivity and regulatory exposure

2. Cost shape, not cost level

The cost comparison most people show is wrong because it compares the wrong things. The right comparison is the cost shape — fixed versus variable — at your actual usage volume.

Managed APIs are pure variable cost. You pay per token. At low volume this is dramatically cheaper than self-hosting because you're not paying for idle GPUs.
Self-hosting is mostly fixed cost — GPU rental or capex, then a small marginal cost per inference. At high sustained volume the per-token math eventually crosses over and self-hosting wins.
The crossover point depends on your model, hardware, utilisation, and whether you can keep GPUs busy enough to amortise them. For mid-size open-weight models on rented H100s, the crossover is typically somewhere between 50 and 200 million tokens per month, depending on how good you are at batching.

3. Latency and reliability requirements

4. Capability ceiling

Be honest with yourself about which task class you're actually building for. Most production B2B AI is closer to the second category than the first.

Self-hosted LLMs vs the OpenAI API: when each makes sense for B2B

1. Data sensitivity and regulatory exposure

2. Cost shape, not cost level

3. Latency and reliability requirements

4. Capability ceiling

5. Strategic optionality

A simple decision rule

Other field notes

What does a custom AI agent actually cost? An honest breakdown

How to prevent prompt injection: defense patterns that actually work

If this is useful, the consultancy is more so.

Self-hosted LLMs vs the OpenAI API: when each makes sense for B2B

1. Data sensitivity and regulatory exposure

2. Cost shape, not cost level

3. Latency and reliability requirements

4. Capability ceiling

5. Strategic optionality

A simple decision rule

Other field notes

What does a custom AI agent actually cost? An honest breakdown

How to prevent prompt injection: defense patterns that actually work

If this is useful, the consultancy is more so.