Self-Hosted LLM for Business — Real Use Cases and What It Actually Costs
Self-hosted LLMs make sense for businesses in 2026 in three specific scenarios — data sensitivity that prohibits external API use, usage volume where per-token API pricing becomes painful, and customization needs that hosted APIs don't support. For everything else, the OpenAI and Anthropic APIs are still the easier default. But the situations where self-hosted is genuinely the right answer have grown substantially as open-weight models (Llama 3.3 70B, Qwen 2.5 72B, Mistral Large 2) have closed most of the quality gap with frontier hosted models, and as runtime tooling (Ollama, vLLM, llama.cpp) has made deployment significantly easier.
This article walks through the real business use cases where self-hosted wins, what deployment actually looks like, and what it costs.
When self-hosted is genuinely the right call
Data sensitivity that prohibits external APIs
The most common driver. Industries where shipping business data through OpenAI or Anthropic's APIs is either prohibited by regulation or strongly discouraged by client contracts:
- Healthcare — patient data subject to HIPAA, where the API vendor's BAA covers some scenarios but creates ongoing vendor management complexity
- Legal services — client documents subject to privilege; many clients now prohibit outside counsel from using external AI on their documents
- Financial services — client information subject to GLBA, SEC, and FINRA rules; sensitive trading or M&A data
- Defense and aerospace — ITAR-controlled technical data, classified material, export-restricted information
- Internal HR and legal — employee data, separation agreements, internal investigations, board materials
For these workloads, self-hosted isn't a cost optimization — it's the only structurally honest answer.
High-volume workloads where API costs compound
OpenAI and Anthropic charge per token. At scale, the math gets expensive fast:
- A customer support automation system processing 10,000 tickets/month with multi-turn reasoning: $3,000-$10,000/month in API costs
- A document analysis pipeline processing 500 contracts/month: $500-$2,000/month
- A code review assistant active across a 50-engineer team: $1,500-$5,000/month
- A knowledge base Q&A system serving 1,000 employees: $2,000-$8,000/month
Each of these workloads runs on self-hosted infrastructure for $400-$2,000/month total — independent of usage volume. The break-even point varies but is typically 6-18 months at production volume.
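The break-even claim is simple arithmetic. A minimal sketch using mid-range figures from the list above (illustrative assumptions, not quotes; plug in your own numbers):

```python
def breakeven_months(build_cost, hosted_monthly, selfhosted_monthly):
    """Months until the one-time build cost is recovered from the
    monthly savings of moving off per-token API pricing."""
    monthly_savings = hosted_monthly - selfhosted_monthly
    if monthly_savings <= 0:
        return None  # self-hosted never pays back at this volume
    return build_cost / monthly_savings

# Example: a support-automation workload at production volume.
months = breakeven_months(
    build_cost=50_000,         # one-time application build
    hosted_monthly=6_000,      # API spend at production volume
    selfhosted_monthly=1_000,  # GPU infrastructure + maintenance
)
print(round(months, 1))  # 10.0 — inside the typical 6-18 month range
```

If the hosted and self-hosted monthly costs are close, the payback period balloons, which is exactly the low-volume case where hosted APIs remain the right default.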
Customization beyond hosted API limits
Hosted APIs are configurable but not infinitely customizable. Scenarios where self-hosted enables capabilities the APIs don't:
- Fine-tuning on proprietary data — Llama, Qwen, and Mistral can be fine-tuned on your specific data, terminology, and tasks. Hosted vendors support this to a limited degree (e.g. OpenAI's fine-tuning API) but with constraints on model choice and data handling.
- Custom model deployments — specialized models for specific tasks (code generation, legal review, medical coding) deployed to optimize for the specific workload
- Air-gapped deployment — environments with no external network access at all (classified systems, certain regulated infrastructure)
- Modified inference parameters — control over temperature, sampling, output constraints in ways hosted APIs don't fully expose
- Custom safety filtering — your own content policies rather than the vendor's, important when the vendor's safety filters reject legitimate business content
Production-ready self-hosted models in 2026
The current landscape of open-weight models suitable for business workloads:
Llama 3.3 70B (Meta) — strong general-purpose model, excellent at most business tasks. Most widely deployed open-weight model in production today. Solid instruction following, reliable structured output, broad capability coverage.
Qwen 2.5 72B (Alibaba) — particularly strong at structured extraction and analysis. Excellent for document processing, classification, and information extraction tasks. Multilingual capability is genuinely strong.
Mistral Large 2 — balanced quality and efficiency. Good fit for production workloads where inference speed matters alongside output quality.
DeepSeek V3 (DeepSeek) — strong reasoning for the parameter count. Particularly good for analytical tasks, code understanding, and multi-step problem solving.
Llama 3.1 8B / Qwen 2.5 7B — smaller models for high-throughput simple tasks. Surprisingly capable for classification, summarization, extraction. Run on commodity hardware including CPUs.
Specialized models — Code Llama and DeepSeek Coder for code tasks, medical-tuned models for healthcare applications, legal-tuned models for legal review. Specialized models often outperform general models on their specific domain.
For most business workloads in 2026, the sweet spot is a 70B-class general model for high-stakes work and a 7B-13B model for high-throughput simple tasks, running in parallel on the same infrastructure.
The runtime landscape
Ollama — the easiest to deploy. Single-binary installation, model management built in, clean API. Good for production under moderate scale (under 100 concurrent users). Used by ShockSign and Anubis Memphis as the default inference layer.
vLLM — the highest-throughput option. Significantly faster than Ollama at scale through PagedAttention and continuous batching. More operational complexity. Right for production deployments serving many concurrent users.
llama.cpp — runs on commodity hardware including CPU-only setups. Useful for edge deployments, air-gapped environments, or cost-sensitive small-scale deployments. Quantized models (GGUF format) make even 70B models runnable on consumer hardware.
Text Generation Inference (TGI) — Hugging Face's standard. Good integration with the Hugging Face ecosystem. Used in some enterprise deployments.
LiteLLM as proxy — not a runtime itself, but a unified API layer that lets applications call self-hosted or hosted models through one interface. Useful for hybrid deployments.
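Whichever runtime you pick, the application talks to it over a local HTTP API. A minimal sketch against Ollama's `/api/chat` endpoint (non-streaming), using only the standard library; the model name `llama3.3` is an example and assumes the model has already been pulled:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint

def build_chat_request(model: str, user_prompt: str, temperature: float = 0.2) -> dict:
    """JSON body for Ollama's /api/chat endpoint (non-streaming)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_prompt}],
        "stream": False,
        "options": {"temperature": temperature},
    }

def chat(model: str, user_prompt: str) -> str:
    """Send one chat turn to a locally running Ollama instance."""
    body = json.dumps(build_chat_request(model, user_prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

# Example (requires a running Ollama instance):
# chat("llama3.3", "Summarize this contract clause in one sentence: ...")
```

Because every runtime in the list above exposes a similar (often OpenAI-compatible) HTTP interface, swapping Ollama for vLLM later is mostly a matter of changing the base URL.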
Real business use cases that work well
Internal document Q&A
Employees ask questions in natural language; the system retrieves relevant policy documents, contracts, and knowledge base articles, then returns an answer with citations. The classic RAG (retrieval-augmented generation) pattern.
For internal use, self-hosted is structurally simpler because internal documents stay inside the network. Build cost: $30,000-$60,000 for a focused deployment.
Customer support automation
Inbound tickets are classified, routed, and (where appropriate) drafted with a response that a human reviews and sends. AI handles the volume; humans handle the judgment.
Self-hosted wins here when ticket content includes customer PII or business-sensitive information that the customer support team's existing tools handle but external AI doesn't have approval to process.
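Ticket classification is where the smaller 7B-8B models earn their keep. A sketch of the prompt-and-parse pattern, with a hypothetical label taxonomy and a defensive fallback for when the model drifts off-format:

```python
LABELS = ["billing", "technical", "account", "other"]  # example taxonomy

def classification_prompt(ticket_text: str) -> str:
    """Constrain a small local model to exactly one label per ticket."""
    return (
        "Classify the support ticket into exactly one of: "
        + ", ".join(LABELS)
        + ".\nReply with the label only.\n\nTicket:\n"
        + ticket_text
    )

def parse_label(model_output: str) -> str:
    """Defensive parse: fall back to 'other' on any off-format reply."""
    cleaned = model_output.strip().lower().rstrip(".")
    return cleaned if cleaned in LABELS else "other"
```

The defensive parse matters in production: at 10,000 tickets/month even a small rate of off-format replies would otherwise misroute real customers.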
Sales and marketing content drafting
First drafts of proposals, customer-facing emails, marketing copy, and sales playbook updates. Self-hosted wins when the content references specific deal information, customer context, or internal positioning that shouldn't leave the company.
Code review and code generation for proprietary codebases
Code review assistants, refactoring suggestions, test generation for code that contains proprietary business logic. Most companies prohibit sending proprietary source code to external APIs; self-hosted enables AI-assisted development without that constraint.
Document extraction at production volume
Invoice processing, form parsing, contract data extraction. Self-hosted scales economically as volume grows.
Conversational interfaces over internal data
"Show me last quarter's revenue by region" — a natural language interface to internal data warehouses. Self-hosted wins because the queries reveal information about your business that shouldn't be sent to external APIs (even if the actual data isn't).
What deployment actually looks like
A typical self-hosted LLM deployment for a business:
Week 1-2: Discovery and planning
- Identify the specific use case(s) to start with
- Map the data sources and integration points
- Choose model(s) based on quality and throughput requirements
- Architect the infrastructure (cloud, on-prem, hybrid)
Week 2-4: Infrastructure setup
- Provision GPU infrastructure (cloud or on-prem)
- Install runtime (Ollama, vLLM, etc.)
- Download and configure models
- Set up monitoring, logging, and observability
Week 3-8: Application development
- Build the application layer (the actual feature the user interacts with)
- Implement RAG (retrieval-augmented generation) if needed for knowledge base scenarios
- Build prompt engineering and output validation
- Integration with existing systems
Week 6-10: Testing and refinement
- Real-world testing with actual users
- Prompt iteration based on observed outputs
- Edge case handling
- Performance tuning
Week 8-12: Production rollout
- Phased rollout to user base
- Monitoring and alerting on production traffic
- Iterative improvement based on usage patterns
A focused single-use-case deployment typically lands in 6-10 weeks. Multi-use-case deployments take longer, but the marginal cost of each additional use case drops substantially after the first.
Cost reality
Infrastructure costs (cloud GPU):
- 70B model on A100 80GB: $1,200-$2,500/month dedicated, $200-$600/month on-demand
- 32B model on A10G or smaller A100: $300-$800/month
- 7B-13B model on commodity GPU: $80-$300/month
- CPU-only inference with llama.cpp: $40-$150/month for small workloads
Build costs (focused use case):
- Application development: $30,000-$80,000
- Infrastructure setup and deployment: included in the figure above for cloud deployments; additional for on-prem
- Ongoing maintenance: $1,000-$4,000/month
Comparison to hosted APIs at production volume:
- High-volume customer support automation: $3,000-$10,000/month hosted vs $500-$1,500/month self-hosted infrastructure
- Document analysis at scale: $1,000-$5,000/month hosted vs $400-$1,200/month self-hosted
- Internal Q&A serving thousands of users: $2,000-$8,000/month hosted vs $600-$2,000/month self-hosted
Break-even on the build is typically 6-18 months from API cost savings alone. The structural advantages (data sovereignty, customization, no vendor dependency) are bonus.
When hosted is still the right answer
Not every workload should be self-hosted. Hosted APIs win when:
- Volume is low (under a few thousand requests/month)
- Data isn't sensitive (public information, no client confidentiality concerns)
- You need the most cutting-edge model capabilities (frontier reasoning, very long context)
- You don't have infrastructure capacity and aren't ready to build it
- Time-to-value matters more than ownership (need it shipping next month, not in a quarter)
For these, build with OpenAI or Anthropic APIs and migrate to self-hosted later when one of the self-hosted-favorable conditions emerges.
When upfront cost is the constraint
A serious self-hosted AI deployment is real money up front. Aftershock Network's Operator Model structures the engagement with a small down payment and monthly installments over an agreed term — important because the operational savings from migrating off hosted APIs often start covering the monthly payment within the first 6 months.
More about the Operator Model →
How to start
If you're seriously evaluating self-hosted AI for your business:
- Currently using OpenAI or Anthropic and hitting cost / data-sovereignty walls: migration project, 8-12 weeks for a focused use case. Start with a scoping conversation.
- Building AI features for the first time, want to start self-hosted: focused proof-of-value on one well-defined use case, then expand. 6-10 weeks for first production deployment.
- Have existing self-hosted infrastructure, want to add new AI applications: cheapest path — the second application targeting existing infrastructure is dramatically faster than the first. Typically $15,000-$30,000 for additional applications.
Every Aftershock Network self-hosted AI engagement starts with a real conversation about your data, your existing infrastructure, and the specific business problem you're trying to solve — not a generic AI pitch.
Frequently asked questions
When should a business use a self-hosted LLM instead of OpenAI or Anthropic?
Self-hosted LLMs make sense when data sensitivity prohibits external API egress (regulated industries, third-party confidential information, internal HR/legal data), when usage volume makes API costs unmanageable (high-frequency or long-context workloads), or when the business needs deep customization beyond what hosted APIs offer (fine-tuning on proprietary data, specialized model deployments). For most businesses with moderate volume and non-sensitive workloads, hosted APIs are still the right default.
Which self-hosted LLMs are actually production-ready in 2026?
Open-weight models in the 70B class — Llama 3.3 70B, Qwen 2.5 72B, Mistral Large 2, DeepSeek V3 — match or approach GPT-4-class quality for most business workloads. Smaller models (7B-13B) handle high-throughput simpler tasks (classification, extraction, summarization) at significantly lower cost. Frontier hosted models (GPT-5, Claude Opus 4.7) still lead on the most complex reasoning tasks, but the gap has closed substantially for production workloads.
What hardware do I need to run a self-hosted LLM?
For 70B-class models: a single GPU with 80GB VRAM (A100, H100) or 2-4 GPUs with 24-48GB each for distributed inference. For 32B models: a single GPU with 24-48GB (RTX 4090, A6000, A100 40GB). For 7B-13B models: any GPU with 12-24GB VRAM, or even CPU-only inference with llama.cpp on commodity hardware. Cloud deployment (AWS, GCP, Azure, Lambda Labs) is the most common starting point — typical operational cost is $200-$2,500/month.
What runtime should I use for self-hosted LLMs?
Ollama is the easiest to deploy and operate — great for getting started and for production workloads under modest scale. vLLM is the highest-throughput option for production — significantly faster than Ollama at scale, but more operational complexity. llama.cpp runs on CPU-only hardware, useful for edge cases. Text Generation Inference (TGI) is the Hugging Face standard. Pick based on scale: Ollama for under 100 concurrent users, vLLM for higher throughput, llama.cpp for resource-constrained deployments.
What business workloads are self-hosted LLMs good at?
Strong fits include — internal document analysis and Q&A (knowledge bases, contracts, policies, HR documents), customer support automation (drafting responses, classifying tickets, summarizing case histories), code generation and review on proprietary codebases, content drafting for internal use (reports, summaries, communications), data extraction from semi-structured documents (invoices, forms, applications), and conversational interfaces over internal data. Weaker fits include cutting-edge reasoning tasks where frontier hosted models still lead.
How much does it cost to deploy self-hosted AI for a business?
A focused application using self-hosted LLM typically costs $30,000-$80,000 to build (the application layer plus model deployment infrastructure). A larger multi-use AI platform deployment runs $80,000-$200,000+. Ongoing operational cost is typically $400-$2,500/month for the inference infrastructure depending on workload. Compared to OpenAI/Anthropic API costs at production scale ($2,000-$30,000/month for high-volume workloads), self-hosted typically breaks even in 6-18 months.
Can I use my existing AI infrastructure for new business applications?
Yes — once a self-hosted LLM is deployed for one use case, adding additional applications targeting the same inference infrastructure is much cheaper than the initial deployment. We routinely build the second, third, fourth application as a small project ($15,000-$30,000) once the underlying LLM infrastructure exists. This is one of the strongest arguments for self-hosted: the marginal cost of new AI features drops dramatically after the initial deployment.
Want AI inside your business without sending data to OpenAI?
Aftershock Network builds AI-powered software on self-hosted LLMs — Ollama, vLLM, llama.cpp — for businesses that can't (or won't) send their data to external APIs. ShockSign and Anubis Memphis both ship with self-hosted AI built in. Tell us what you want AI to do and we'll show you what's possible.
Start a conversation →