Self-Hosted LLM for Business — Real Use Cases and What It Actually Costs

Self-hosted LLMs make sense for businesses in 2026 in three specific scenarios — data sensitivity that prohibits external API use, usage volume where per-token API pricing becomes painful, and customization needs that hosted APIs don't support. For everything else, the OpenAI and Anthropic APIs are still the easier default. But the situations where self-hosted is genuinely the right answer have grown substantially as open-weight models (Llama 3.3 70B, Qwen 2.5 72B, Mistral Large 2) have closed most of the quality gap with frontier hosted models, and as runtime tooling (Ollama, vLLM, llama.cpp) has made deployment significantly easier.

This article walks through the real business use cases where self-hosted wins, what deployment actually looks like, and what it costs.

When self-hosted is genuinely the right call

Data sensitivity that prohibits external APIs

The most common driver. In regulated or confidentiality-bound work (healthcare, legal, anything handling third-party confidential material or internal HR and legal data), shipping business data through OpenAI's or Anthropic's APIs is either prohibited by regulation or strongly discouraged by client contracts.

For these workloads, self-hosted isn't a cost optimization — it's the only structurally honest answer.

High-volume workloads where API costs compound

OpenAI and Anthropic charge per token. At production volume, high-frequency or long-context workloads (bulk document processing, support automation, always-on internal assistants) commonly run $2,000-$30,000/month in API spend.

The same workloads run on self-hosted infrastructure for $400-$2,000/month total, independent of usage volume. The break-even point varies but is typically 6-18 months at production volume.

Customization beyond hosted API limits

Hosted APIs are configurable but not infinitely customizable. Self-hosted opens up capabilities the APIs don't support: fine-tuning on proprietary data, deploying domain-specialized models, and removing the dependency on a vendor's roadmap and pricing.

Production-ready self-hosted models in 2026

The current landscape of open-weight models suitable for business workloads:

Llama 3.3 70B (Meta) — strong general-purpose model, excellent at most business tasks. Most widely deployed open-weight model in production today. Solid instruction following, reliable structured output, broad capability coverage.

Qwen 2.5 72B (Alibaba) — particularly strong at structured extraction and analysis. Excellent for document processing, classification, and information extraction tasks. Multilingual capability is genuinely strong.

Mistral Large 2 — balanced quality and efficiency. Good fit for production workloads where inference speed matters alongside output quality.

DeepSeek V3 (DeepSeek) — strong reasoning for the parameter count. Particularly good for analytical tasks, code understanding, and multi-step problem solving.

Llama 3.1 8B / Qwen 2.5 7B — smaller models for high-throughput simple tasks. Surprisingly capable for classification, summarization, extraction. Run on commodity hardware including CPUs.

Specialized models — Code Llama for code tasks, DeepSeek Coder for code generation, medical-tuned models for healthcare applications, legal-tuned models for legal review. Specialized models often outperform general models on their specific domain.

For most business workloads in 2026, the sweet spot is a 70B-class general model for high-stakes work and a 7B-13B model for high-throughput simple tasks, running in parallel on the same infrastructure.
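
A rough sketch of what that split looks like in practice. The model names and task categories below are illustrative, not a fixed recommendation; the point is that a thin routing layer picks the model per request:

```python
# Sketch: route each request to a large or small model by task type.
# Model names and task categories are examples, not a recommendation.

LARGE_MODEL = "llama3.3:70b"   # high-stakes work: drafting, analysis, review
SMALL_MODEL = "qwen2.5:7b"     # high-throughput work: classification, extraction

HIGH_STAKES_TASKS = {"contract_analysis", "proposal_draft", "code_review"}

def pick_model(task_type: str) -> str:
    """Return the model to use for a given task type."""
    return LARGE_MODEL if task_type in HIGH_STAKES_TASKS else SMALL_MODEL

print(pick_model("ticket_classification"))  # -> qwen2.5:7b
print(pick_model("contract_analysis"))      # -> llama3.3:70b
```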

The runtime landscape

Ollama — the easiest to deploy. Single-binary installation, model management built in, clean API. Good for production under moderate scale (under 100 concurrent users). Used by ShockSign and Anubis Memphis as the default inference layer.
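
A minimal sketch of what calling a local Ollama instance looks like, assuming Ollama is installed, the daemon is on its default port, and the model has already been pulled (ollama pull llama3.3):

```python
# Minimal sketch: call a model served by a local Ollama instance.
# Assumes the Ollama daemon is listening on its default port (11434)
# and `ollama pull llama3.3` has already been run.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.3",
        "prompt": "Summarize our refund policy in two sentences.",
        "stream": False,  # return one JSON object instead of a token stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```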

vLLM — the highest-throughput option. Significantly faster than Ollama at scale through PagedAttention and continuous batching. More operational complexity. Right for production deployments serving many concurrent users.
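
vLLM serves an OpenAI-compatible HTTP API, so existing OpenAI-client code can usually be pointed at it with only a base-URL change. A sketch, with the model name and port as examples:

```python
# Sketch: call a vLLM server through its OpenAI-compatible API.
# The server is started separately, e.g.:
#   python -m vllm.entrypoints.openai.api_server \
#       --model Qwen/Qwen2.5-7B-Instruct --port 8000
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # vLLM does not require a real key by default
)

reply = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Classify this ticket: 'My invoice total is wrong.'"}],
)
print(reply.choices[0].message.content)
```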

llama.cpp — runs on commodity hardware including CPU-only setups. Useful for edge deployments, air-gapped environments, or cost-sensitive small-scale deployments. Quantized models (GGUF format) make even 70B models runnable on consumer hardware.

Text Generation Inference (TGI) — Hugging Face's standard. Good integration with the Hugging Face ecosystem. Used in some enterprise deployments.

LiteLLM as proxy — not a runtime itself, but a unified API layer that lets applications call self-hosted or hosted models through one interface. Useful for hybrid deployments.
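
A sketch of what that looks like with LiteLLM's Python SDK: the same call shape works whether the target is a local Ollama model or a hosted API, which is what makes hybrid deployments practical (model names are examples):

```python
# Sketch: one call shape for both self-hosted and hosted models via LiteLLM.
from litellm import completion

messages = [{"role": "user", "content": "Draft a two-line status update."}]

# Self-hosted: an Ollama model running inside the network.
local = completion(
    model="ollama/llama3.3",
    api_base="http://localhost:11434",
    messages=messages,
)
print(local.choices[0].message.content)

# Hosted: the same call shape against a frontier API (needs OPENAI_API_KEY set).
# hosted = completion(model="gpt-4o", messages=messages)
```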

Real business use cases that work well

Internal document Q&A

Employees ask questions in natural language; the system retrieves relevant policy documents, contracts, and knowledge base articles, and returns an answer with citations. The classic RAG (retrieval-augmented generation) pattern.

For internal use, self-hosted is structurally simpler because internal documents stay inside the network. Build cost: $30,000-$60,000 for a focused deployment.
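
A stripped-down sketch of the pattern. The search_index helper is hypothetical and stands in for whatever retrieval layer (embeddings plus a vector store) the deployment uses; the model call goes to a local Ollama endpoint as in the runtime section above:

```python
# Sketch: internal document Q&A (RAG) against a self-hosted model.
# `search_index` is a stand-in for your retrieval layer (vector store, BM25, etc.).
import requests

def search_index(question: str, top_k: int = 4) -> list[dict]:
    """Hypothetical retrieval helper returning [{'source': ..., 'text': ...}, ...]."""
    raise NotImplementedError("wire this to your document index")

def answer(question: str) -> str:
    chunks = search_index(question)
    context = "\n\n".join(f"[{c['source']}]\n{c['text']}" for c in chunks)
    prompt = (
        "Answer the question using only the context below. "
        "Cite the bracketed source names you relied on.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.3", "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]
```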

Customer support automation

Inbound tickets are classified, routed, and (where appropriate) drafted with a response that a human reviews and sends. AI handles the volume; humans handle the judgment.

Self-hosted wins here when ticket content includes customer PII or business-sensitive information that the support team's existing tools are approved to handle but an external AI API isn't.
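
A minimal sketch of the classification step, using a small model for throughput; the categories and model name are illustrative:

```python
# Sketch: classify an inbound ticket with a small self-hosted model.
# Categories and model name are illustrative.
import requests

CATEGORIES = ["billing", "technical", "account", "other"]

def classify_ticket(ticket_text: str) -> str:
    prompt = (
        f"Classify this support ticket into exactly one of: {', '.join(CATEGORIES)}. "
        f"Reply with the category name only.\n\nTicket:\n{ticket_text}"
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "qwen2.5:7b", "prompt": prompt, "stream": False},
        timeout=60,
    )
    resp.raise_for_status()
    label = resp.json()["response"].strip().lower()
    return label if label in CATEGORIES else "other"  # fall back on unexpected output
```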

Sales and marketing content drafting

First drafts of proposals, customer-facing emails, marketing copy, and sales playbook updates. Self-hosted wins when the content references specific deal information, customer context, or internal positioning that shouldn't leave the company.

Code review and code generation for proprietary codebases

Code review assistants, refactoring suggestions, test generation for code that contains proprietary business logic. Most companies prohibit sending proprietary source code to external APIs; self-hosted enables AI-assisted development without that constraint.

Document extraction at production volume

Invoice processing, form parsing, contract data extraction. Self-hosted scales economically as volume grows.
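
A sketch of the extraction step. The field names are examples; asking Ollama to return JSON (the "format": "json" field in the request) keeps the downstream parsing simple:

```python
# Sketch: extract structured fields from an invoice with a self-hosted model.
# Field names are examples; "format": "json" constrains the output to valid JSON.
import json
import requests

def extract_invoice(invoice_text: str) -> dict:
    prompt = (
        "Extract vendor_name, invoice_number, invoice_date (YYYY-MM-DD) and "
        "total_amount from the invoice below. Return only a JSON object with "
        "those four keys.\n\nInvoice:\n" + invoice_text
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "qwen2.5:72b", "prompt": prompt, "format": "json", "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return json.loads(resp.json()["response"])
```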

Conversational interfaces over internal data

"Show me last quarter's revenue by region" — a natural language interface to internal data warehouses. Self-hosted wins because the queries reveal information about your business that shouldn't be sent to external APIs (even if the actual data isn't).

What deployment actually looks like

A typical self-hosted LLM deployment for a business runs in overlapping phases:

Week 1-2: Discovery and planning
Week 2-4: Infrastructure setup
Week 3-8: Application development
Week 6-10: Testing and refinement
Week 8-12: Production rollout

A focused single-use-case deployment typically lands in 6-10 weeks. Multi-use deployments take longer but the marginal cost of each additional use case drops substantially after the first.

Cost reality

Infrastructure (cloud GPU): typically $400-$2,500/month depending on model size and workload, from a single 24-48GB card for 7B-32B models up to A100/H100-class capacity for 70B models.

Build (focused use case): typically $30,000-$80,000 for the application layer plus model deployment infrastructure; multi-use-case platforms run $80,000-$200,000+.

Hosted APIs at the same production volume: commonly $2,000-$30,000/month in per-token spend for high-volume workloads.

Break-even on the build is typically 6-18 months from API cost savings alone. The structural advantages (data sovereignty, customization, no vendor dependency) are bonus.
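
The break-even arithmetic itself is simple. A sketch with placeholder numbers drawn from the ranges above:

```python
# Sketch: break-even on a self-hosted build, using placeholder numbers
# from the cost ranges above.
build_cost = 60_000          # one-time build, focused use case
hosted_api_monthly = 8_000   # what the same workload would cost on hosted APIs
self_hosted_monthly = 1_500  # GPU infrastructure for the self-hosted deployment

monthly_savings = hosted_api_monthly - self_hosted_monthly
print(f"Break-even after {build_cost / monthly_savings:.1f} months")  # ~9.2 months
```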

When hosted is still the right answer

Not every workload should be self-hosted. Hosted APIs win when volume is moderate, the data isn't sensitive, the task genuinely needs frontier-model reasoning quality, or you need to validate an idea before investing in infrastructure.

For these, build with OpenAI or Anthropic APIs and migrate to self-hosted later when one of the self-hosted-favorable conditions emerges.

When upfront cost is the constraint

A serious self-hosted AI deployment is real money up front. Aftershock Network's Operator Model structures the engagement with a small down payment and monthly installments over an agreed term — important because the operational savings from migrating off hosted APIs often start covering the monthly payment within the first 6 months.

More about the Operator Model →

How to start

If you're seriously evaluating self-hosted AI for your business, start with the specifics: what data is involved and where it's allowed to go, what infrastructure you already run, and what the first use case is worth.

Every Aftershock Network self-hosted AI engagement starts with a real conversation about your data, your existing infrastructure, and the specific business problem you're trying to solve — not a generic AI pitch.

Frequently asked questions

When should a business use a self-hosted LLM instead of OpenAI or Anthropic?

Self-hosted LLMs make sense when data sensitivity prohibits external API egress (regulated industries, third-party confidential information, internal HR/legal data), when usage volume makes API costs unmanageable (high-frequency or long-context workloads), or when the business needs deep customization beyond what hosted APIs offer (fine-tuning on proprietary data, specialized model deployments). For most businesses with moderate volume and non-sensitive workloads, hosted APIs are still the right default.

Which self-hosted LLMs are actually production-ready in 2026?

Open-weight models in the 32B-72B parameter range — Llama 3.3 70B, Qwen 2.5 72B, Mistral Large 2, DeepSeek 67B — match or approach GPT-4-class quality for most business workloads. Smaller models (7B-13B) handle high-throughput simpler tasks (classification, extraction, summarization) at significantly lower cost. Frontier hosted models (GPT-5, Claude Opus 4.7) still lead on the most complex reasoning tasks, but the gap has closed substantially for production workloads.

What hardware do I need to run a self-hosted LLM?

For 70B-class models: a single GPU with 80GB VRAM (A100, H100) or 2-4 GPUs with 24-48GB each for distributed inference. For 32B models: a single GPU with 24-48GB (RTX 4090, A6000, A100 40GB). For 7B-13B models: any GPU with 12-24GB VRAM, or even CPU-only inference with llama.cpp on commodity hardware. Cloud deployment (AWS, GCP, Azure, Lambda Labs) is the most common starting point — typical operational cost is $200-$2,500/month.
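
As a rough back-of-envelope (actual requirements also depend on context length and KV-cache size), weight memory scales with parameter count times bytes per parameter at a given quantization:

```python
# Rough back-of-envelope for model weight memory at different quantizations.
# Real VRAM needs are higher once the KV cache and runtime overhead are included.
def weight_gb(params_billion: float, bits_per_param: float) -> float:
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

for bits in (16, 8, 4):
    print(f"70B @ {bits}-bit ~= {weight_gb(70, bits):.0f} GB")
# 16-bit ~= 140 GB, 8-bit ~= 70 GB, 4-bit ~= 35 GB
```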

What runtime should I use for self-hosted LLMs?

Ollama is the easiest to deploy and operate — great for getting started and for production workloads under modest scale. vLLM is the highest-throughput option for production — significantly faster than Ollama at scale, but more operational complexity. llama.cpp runs on CPU-only hardware, useful for edge cases. Text Generation Inference (TGI) is the Hugging Face standard. Pick based on scale: Ollama for under 100 concurrent users, vLLM for higher throughput, llama.cpp for resource-constrained deployments.

What business workloads are self-hosted LLMs good at?

Strong fits: internal document analysis and Q&A (knowledge bases, contracts, policies, HR documents), customer support automation (drafting responses, classifying tickets, summarizing case histories), code generation and review on proprietary codebases, content drafting for internal use (reports, summaries, communications), data extraction from semi-structured documents (invoices, forms, applications), and conversational interfaces over internal data. Weaker fits: cutting-edge reasoning tasks where frontier hosted models still lead.

How much does it cost to deploy self-hosted AI for a business?

A focused application on a self-hosted LLM typically costs $30,000-$80,000 to build (the application layer plus model deployment infrastructure). A larger multi-use AI platform deployment runs $80,000-$200,000+. Ongoing operational cost is typically $400-$2,500/month for the inference infrastructure depending on workload. Compared to OpenAI/Anthropic API costs at production scale ($2,000-$30,000/month for high-volume workloads), self-hosted typically breaks even in 6-18 months.

Can I use my existing AI infrastructure for new business applications?

Yes — once a self-hosted LLM is deployed for one use case, adding additional applications targeting the same inference infrastructure is much cheaper than the initial deployment. We routinely build the second, third, fourth application as a small project ($15,000-$30,000) once the underlying LLM infrastructure exists. This is one of the strongest arguments for self-hosted: the marginal cost of new AI features drops dramatically after the initial deployment.

Want AI inside your business without sending data to OpenAI?

Aftershock Network builds AI-powered software on self-hosted LLMs — Ollama, vLLM, llama.cpp — for businesses that can't (or won't) send their data to external APIs. ShockSign and Anubis Memphis both ship with self-hosted AI built in. Tell us what you want AI to do and we'll show you what's possible.

Start a conversation →