Self-Hosted LLM for Business — Real Use Cases and What It Actually Costs
Self-hosted LLMs make sense for businesses in 2026 in three specific scenarios — data sensitivity that prohibits external API use, usage volume where per-token API pricing becomes painful, and customization needs that hosted APIs don't support. For everything else, the OpenAI and Anthropic APIs are still the easier default. But the situations where self-hosted is genuinely the right answer have grown substantially as open-weight models (Llama 3.3 70B, Qwen 2.5 72B, Mistral Large 2) have closed most of the quality gap with frontier hosted models, and as runtime tooling (Ollama, vLLM, llama.cpp) has made deployment significantly easier.
This article walks through the real business use cases where self-hosted wins, what deployment actually looks like, and what it costs.
When self-hosted is genuinely the right call
Data sensitivity that prohibits external APIs
The most common driver. Industries where shipping business data through OpenAI or Anthropic's APIs is either prohibited by regulation or strongly discouraged by client contracts:
- Healthcare — patient data subject to HIPAA, where the API vendor's BAA covers some scenarios but creates ongoing vendor management complexity
- Legal services — client documents subject to privilege; many clients now prohibit outside counsel from using external AI on their documents
- Financial services — client information subject to GLBA, SEC, and FINRA rules; sensitive trading or M&A data
- Defense and aerospace — ITAR-controlled technical data, classified material, export-restricted information
- Internal HR and legal — employee data, separation agreements, internal investigations, board materials
For these workloads, self-hosted isn't a cost optimization — it's the only structurally honest answer.
High-volume workloads where API costs compound
OpenAI and Anthropic charge per token. At scale, the math gets expensive fast:
- A customer support automation system processing 10,000 tickets/month with multi-turn reasoning: $3,000-$10,000/month in API costs
- A document analysis pipeline processing 500 contracts/month: $500-$2,000/month
- A code review assistant active across a 50-engineer team: $1,500-$5,000/month
- A knowledge base Q&A system serving 1,000 employees: $2,000-$8,000/month
Each of these workloads runs on self-hosted infrastructure for $400-$2,000/month total — independent of usage volume. The break-even point varies but is typically 6-18 months at production volume.
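The break-even claim is simple arithmetic. A minimal sketch using mid-range figures from the list above (illustrative assumptions, not quotes; plug in your own numbers):

```python
def breakeven_months(build_cost, hosted_monthly, selfhosted_monthly):
    """Months until the one-time build cost is recovered from the
    monthly savings of moving off per-token API pricing."""
    monthly_savings = hosted_monthly - selfhosted_monthly
    if monthly_savings <= 0:
        return None  # self-hosted never pays back at this volume
    return build_cost / monthly_savings

# Example: a support-automation workload at production volume.
months = breakeven_months(
    build_cost=50_000,         # one-time application build
    hosted_monthly=6_000,      # API spend at production volume
    selfhosted_monthly=1_000,  # GPU infrastructure + maintenance
)
print(round(months, 1))  # 10.0 — inside the typical 6-18 month range
```

If the hosted and self-hosted monthly costs are close, the payback period balloons, which is exactly the low-volume case where hosted APIs remain the right default.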
Customization beyond hosted API limits
Hosted APIs are configurable but not infinitely customizable. Scenarios where self-hosted enables capabilities the APIs don't:
- Fine-tuning on proprietary data — Llama, Qwen, and Mistral can be fine-tuned on your specific data, terminology, and tasks. Hosted vendors support this to a limited degree (e.g. OpenAI's fine-tuning API) but with constraints on model choice and data handling.
- Custom model deployments — specialized models for specific tasks (code generation, legal review, medical coding) deployed to optimize for the specific workload
- Air-gapped deployment — environments with no external network access at all (classified systems, certain regulated infrastructure)
- Modified inference parameters — control over temperature, sampling, output constraints in ways hosted APIs don't fully expose
- Custom safety filtering — your own content policies rather than the vendor's, important when the vendor's safety filters reject legitimate business content
Production-ready self-hosted models in 2026
The current landscape of open-weight models suitable for business workloads:
Llama 3.3 70B (Meta) — strong general-purpose model, excellent at most business tasks. Most widely deployed open-weight model in production today. Solid instruction following, reliable structured output, broad capability coverage.
Qwen 2.5 72B (Alibaba) — particularly strong at structured extraction and analysis. Excellent for document processing, classification, and information extraction tasks. Multilingual capability is genuinely strong.
Mistral Large 2 — balanced quality and efficiency. Good fit for production workloads where inference speed matters alongside output quality.
DeepSeek V3 (DeepSeek) — strong reasoning for the parameter count. Particularly good for analytical tasks, code understanding, and multi-step problem solving.
Llama 3.1 8B / Qwen 2.5 7B — smaller models for high-throughput simple tasks. Surprisingly capable for classification, summarization, extraction. Run on commodity hardware including CPUs.
Specialized models — Code Llama and DeepSeek Coder for code tasks, medical-tuned models for healthcare applications, legal-tuned models for legal review. Specialized models often outperform general models on their specific domain.
For most business workloads in 2026, the sweet spot is a 70B-class general model for high-stakes work and a 7B-13B model for high-throughput simple tasks, running in parallel on the same infrastructure.
The runtime landscape
Ollama — the easiest to deploy. Single-binary installation, model management built in, clean API. Good for production under moderate scale (under 100 concurrent users). Used by ShockSign and Anubis Memphis as the default inference layer.
vLLM — the highest-throughput option. Significantly faster than Ollama at scale through PagedAttention and continuous batching. More operational complexity. Right for production deployments serving many concurrent users.
llama.cpp — runs on commodity hardware including CPU-only setups. Useful for edge deployments, air-gapped environments, or cost-sensitive small-scale deployments. Quantized models (GGUF format) make even 70B models runnable on consumer hardware.
Text Generation Inference (TGI) — Hugging Face's standard. Good integration with the Hugging Face ecosystem. Used in some enterprise deployments.
LiteLLM as proxy — not a runtime itself, but a unified API layer that lets applications call self-hosted or hosted models through one interface. Useful for hybrid deployments.
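Whichever runtime you pick, the application talks to it over a local HTTP API. A minimal sketch against Ollama's `/api/chat` endpoint (non-streaming), using only the standard library; the model name `llama3.3` is an example and assumes the model has already been pulled:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint

def build_chat_request(model: str, user_prompt: str, temperature: float = 0.2) -> dict:
    """JSON body for Ollama's /api/chat endpoint (non-streaming)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_prompt}],
        "stream": False,
        "options": {"temperature": temperature},
    }

def chat(model: str, user_prompt: str) -> str:
    """Send one chat turn to a locally running Ollama instance."""
    body = json.dumps(build_chat_request(model, user_prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

# Example (requires a running Ollama instance):
# chat("llama3.3", "Summarize this contract clause in one sentence: ...")
```

Because every runtime in the list above exposes a similar (often OpenAI-compatible) HTTP interface, swapping Ollama for vLLM later is mostly a matter of changing the base URL.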
Real business use cases that work well
Internal document Q&A
Employees ask questions in natural language; the system retrieves relevant policy documents, contracts, and knowledge base articles, then returns an answer with citations. The classic RAG (retrieval-augmented generation) pattern.
For internal use, self-hosted is structurally simpler because internal documents stay inside the network. Build cost: $30,000-$60,000 for a focused deployment.
Customer support automation
Inbound tickets are classified, routed, and (where appropriate) drafted with a response that a human reviews and sends. AI handles the volume; humans handle the judgment.
Self-hosted wins here when ticket content includes customer PII or business-sensitive information that the customer support team's existing tools handle but external AI doesn't have approval to process.
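Ticket classification is where the smaller 7B-8B models earn their keep. A sketch of the prompt-and-parse pattern, with a hypothetical label taxonomy and a defensive fallback for when the model drifts off-format:

```python
LABELS = ["billing", "technical", "account", "other"]  # example taxonomy

def classification_prompt(ticket_text: str) -> str:
    """Constrain a small local model to exactly one label per ticket."""
    return (
        "Classify the support ticket into exactly one of: "
        + ", ".join(LABELS)
        + ".\nReply with the label only.\n\nTicket:\n"
        + ticket_text
    )

def parse_label(model_output: str) -> str:
    """Defensive parse: fall back to 'other' on any off-format reply."""
    cleaned = model_output.strip().lower().rstrip(".")
    return cleaned if cleaned in LABELS else "other"
```

The defensive parse matters in production: at 10,000 tickets/month even a small rate of off-format replies would otherwise misroute real customers.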
Sales and marketing content drafting
First drafts of proposals, customer-facing emails, marketing copy, and sales playbook updates. Self-hosted wins when the content references specific deal information, customer context, or internal positioning that shouldn't leave the company.
Code review and code generation for proprietary codebases
Code review assistants, refactoring suggestions, test generation for code that contains proprietary business logic. Most companies prohibit sending proprietary source code to external APIs; self-hosted enables AI-assisted development without that constraint.
Document extraction at production volume
Invoice processing, form parsing, contract data extraction. Self-hosted scales economically as volume grows.
Conversational interfaces over internal data
"Show me last quarter's revenue by region" — a natural language interface to internal data warehouses. Self-hosted wins because the queries reveal information about your business that shouldn't be sent to external APIs (even if the actual data isn't).
What deployment actually looks like
A typical self-hosted LLM deployment for a business:
Week 1-2: Discovery and planning
- Identify the specific use case(s) to start with
- Map the data sources and integration points
- Choose model(s) based on quality and throughput requirements
- Architect the infrastructure (cloud, on-prem, hybrid)
Week 2-4: Infrastructure setup
- Provision GPU infrastructure (cloud or on-prem)
- Install runtime (Ollama, vLLM, etc.)
- Download and configure models
- Set up monitoring, logging, and observability
Week 3-8: Application development
- Build the application layer (the actual feature the user interacts with)
- Implement RAG (retrieval-augmented generation) if needed for knowledge base scenarios
- Build prompt engineering and output validation
- Integration with existing systems
Week 6-10: Testing and refinement
- Real-world testing with actual users
- Prompt iteration based on observed outputs
- Edge case handling
- Performance tuning
Week 8-12: Production rollout
- Phased rollout to user base
- Monitoring and alerting on production traffic
- Iterative improvement based on usage patterns
A focused single-use-case deployment typically lands in 6-10 weeks. Multi-use-case deployments take longer, but the marginal cost of each additional use case drops substantially after the first.
Cost reality
Infrastructure costs (cloud GPU):
- 70B model on A100 80GB: $1,200-$2,500/month dedicated, $200-$600/month on-demand
- 32B model on A10G or smaller A100: $300-$800/month
- 7B-13B model on commodity GPU: $80-$300/month
- CPU-only inference with llama.cpp: $40-$150/month for small workloads
Build costs (focused use case):
- Application development: $30,000-$80,000
- Infrastructure setup and deployment: included in the figure above for cloud deployments; additional for on-prem
- Ongoing maintenance: $1,000-$4,000/month
Comparison to hosted APIs at production volume:
- High-volume customer support automation: $3,000-$10,000/month hosted vs $500-$1,500/month self-hosted infrastructure
- Document analysis at scale: $1,000-$5,000/month hosted vs $400-$1,200/month self-hosted
- Internal Q&A serving thousands of users: $2,000-$8,000/month hosted vs $600-$2,000/month self-hosted
Break-even on the build is typically 6-18 months from API cost savings alone. The structural advantages (data sovereignty, customization, no vendor dependency) are bonus.
When hosted is still the right answer
Not every workload should be self-hosted. Hosted APIs win when:
- Volume is low (under a few thousand requests/month)
- Data isn't sensitive (public information, no client confidentiality concerns)
- You need the most cutting-edge model capabilities (frontier reasoning, very long context)
- You don't have infrastructure capacity and aren't ready to build it
- Time-to-value matters more than ownership (need it shipping next month, not in a quarter)
For these, build with OpenAI or Anthropic APIs and migrate to self-hosted later when one of the self-hosted-favorable conditions emerges.
When upfront cost is the constraint
A serious self-hosted AI deployment is real money up front. Aftershock Network's Operator Model structures the engagement with a small down payment and monthly installments over an agreed term — important because the operational savings from migrating off hosted APIs often start covering the monthly payment within the first 6 months.
More about the Operator Model →
How to start
If you're seriously evaluating self-hosted AI for your business:
- Currently using OpenAI or Anthropic and hitting cost / data-sovereignty walls: migration project, 8-12 weeks for a focused use case. Start with a scoping conversation.
- Building AI features for the first time, want to start self-hosted: focused proof-of-value on one well-defined use case, then expand. 6-10 weeks for first production deployment.
- Have existing self-hosted infrastructure, want to add new AI applications: cheapest path — the second application targeting existing infrastructure is dramatically faster than the first. Typically $15,000-$30,000 for additional applications.
Every Aftershock Network self-hosted AI engagement starts with a real conversation about your data, your existing infrastructure, and the specific business problem you're trying to solve — not a generic AI pitch.
Frequently asked questions
When should a business use a self-hosted LLM instead of OpenAI or Anthropic?
Self-hosted LLMs make sense when data sensitivity prohibits external API egress (regulated industries, third-party confidential information, internal HR/legal data), when usage volume makes API costs unmanageable (high-frequency or long-context workloads), or when the business needs deep customization beyond what hosted APIs offer (fine-tuning on proprietary data, specialized model deployments). For most businesses with moderate volume and non-sensitive workloads, hosted APIs are still the right default.
Which self-hosted LLMs are actually production-ready in 2026?
Open-weight models in the 70B class — Llama 3.3 70B, Qwen 2.5 72B, Mistral Large 2, DeepSeek V3 — match or approach GPT-4-class quality for most business workloads. Smaller models (7B-13B) handle high-throughput simpler tasks (classification, extraction, summarization) at significantly lower cost. Frontier hosted models (GPT-5, Claude Opus 4.7) still lead on the most complex reasoning tasks, but the gap has closed substantially for production workloads.
What hardware do I need to run a self-hosted LLM?
For 70B-class models: a single GPU with 80GB VRAM (A100, H100) or 2-4 GPUs with 24-48GB each for distributed inference. For 32B models: a single GPU with 24-48GB (RTX 4090, A6000, A100 40GB). For 7B-13B models: any GPU with 12-24GB VRAM, or even CPU-only inference with llama.cpp on commodity hardware. Cloud deployment (AWS, GCP, Azure, Lambda Labs) is the most common starting point — typical operational cost is $200-$2,500/month.
What runtime should I use for self-hosted LLMs?
Ollama is the easiest to deploy and operate — great for getting started and for production workloads under modest scale. vLLM is the highest-throughput option for production — significantly faster than Ollama at scale, but more operational complexity. llama.cpp runs on CPU-only hardware, useful for edge cases. Text Generation Inference (TGI) is the Hugging Face standard. Pick based on scale: Ollama for under 100 concurrent users, vLLM for higher throughput, llama.cpp for resource-constrained deployments.
What business workloads are self-hosted LLMs good at?
Strong fits include — internal document analysis and Q&A (knowledge bases, contracts, policies, HR documents), customer support automation (drafting responses, classifying tickets, summarizing case histories), code generation and review on proprietary codebases, content drafting for internal use (reports, summaries, communications), data extraction from semi-structured documents (invoices, forms, applications), and conversational interfaces over internal data. Weaker fits include cutting-edge reasoning tasks where frontier hosted models still lead.
How much does it cost to deploy self-hosted AI for a business?
A focused application using self-hosted LLM typically costs $30,000-$80,000 to build (the application layer plus model deployment infrastructure). A larger multi-use AI platform deployment runs $80,000-$200,000+. Ongoing operational cost is typically $400-$2,500/month for the inference infrastructure depending on workload. Compared to OpenAI/Anthropic API costs at production scale ($2,000-$30,000/month for high-volume workloads), self-hosted typically breaks even in 6-18 months.
Can I use my existing AI infrastructure for new business applications?
Yes — once a self-hosted LLM is deployed for one use case, adding additional applications targeting the same inference infrastructure is much cheaper than the initial deployment. We routinely build the second, third, fourth application as a small project ($15,000-$30,000) once the underlying LLM infrastructure exists. This is one of the strongest arguments for self-hosted: the marginal cost of new AI features drops dramatically after the initial deployment.
Want AI inside your business without sending data to OpenAI?
Aftershock Network builds AI-powered software on self-hosted LLMs — Ollama, vLLM, llama.cpp — for businesses that can't (or won't) send their data to external APIs. ShockSign and Anubis Memphis both ship with self-hosted AI built in. Tell us what you want AI to do and we'll show you what's possible.
Start a conversation →