AI Technology Infrastructure

You Need More Than a Big LLM

Why enterprise AI needs more than just GPUs and LLMs - exploring the hardware architecture for real AI systems behind the firewall.

Most “enterprise AI strategies” I hear boil down to:

Buy a big GPU, pick a big LLM, put it near our data, done.

That’s not a strategy. That’s a shopping list.

This article builds on the foundation established in “AI systems that stay behind your firewall”, which outlines why enterprises need internal AI solutions. Here, we explore how to architect those solutions.

Chat Interfaces vs. Process Automation

Here’s what often gets missed: LLMs with chat interfaces are excellent for ad-hoc knowledge work, and I’m not arguing against their value. When an employee needs to draft a proposal, research a topic, or brainstorm ideas, a conversational AI interface is genuinely useful.

But businesses don’t run on ad-hoc queries. They run on repeatable processes: thousands of invoices flowing through approval workflows, support tickets getting triaged and routed, compliance checks happening on every transaction, inventory systems responding to demand signals.

The difference matters:

  • Chat interface: Employee asks “Can you summarize this contract?” and gets an answer
  • Process automation: System processes 10,000 invoices per day, extracting fields, validating against business rules, routing exceptions, and updating ERP systems—without anyone opening a chat window

Most organizations need both, but conflate them. They deploy a chat interface and wonder why it hasn’t transformed their operations. Chat is for exploration and knowledge work; workflow automation is for the repetitive, high-volume tasks that actually drive business efficiency.

The architecture challenge isn’t just “put an LLM behind our firewall.” It’s designing a system where AI components integrate into existing workflows to solve specific process bottlenecks—not just provide another place for employees to type questions.
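
To make the distinction concrete, here is a minimal sketch of what process automation looks like in code, assuming a hypothetical internal LLM extraction service and ERP integration (call_llm_extract and post_to_erp are placeholders, not a specific product API):

```python
# Minimal sketch of workflow-driven invoice processing - no chat window involved.
# call_llm_extract() and post_to_erp() are hypothetical placeholders for your own
# internal services.

from dataclasses import dataclass

APPROVAL_LIMIT = 10_000  # business rule, not an AI decision


@dataclass
class Invoice:
    vendor: str
    amount: float
    purchase_order: str | None


def call_llm_extract(raw_document: bytes) -> Invoice:
    # Placeholder: in a real deployment this calls an internal LLM extraction endpoint.
    return Invoice(vendor="ACME GmbH", amount=432.10, purchase_order="PO-1001")


def post_to_erp(invoice: Invoice) -> None:
    # Placeholder: in a real deployment this calls the ERP system's API.
    print(f"ERP updated for {invoice.vendor}: {invoice.amount:.2f}")


def process_invoice(raw_document: bytes) -> str:
    invoice = call_llm_extract(raw_document)   # AI: transformation & understanding
    if invoice.purchase_order is None:         # deterministic business rules
        return "route_to_exceptions_queue"
    if invoice.amount > APPROVAL_LIMIT:
        return "route_to_human_approval"
    post_to_erp(invoice)                       # decision & automation
    return "auto_approved"


if __name__ == "__main__":
    print(process_invoice(b"%PDF-..."))
```

The LLM does one well-scoped job (field extraction); everything around it is ordinary, auditable workflow code.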

Inside a real business, AI workloads fall into three very different categories:

1. Perception & Triage

What’s in this frame? Is this event anomalous? Which queue should this ticket go to?

These are continuous monitoring tasks that need to run 24/7 with minimal latency. They’re about filtering and routing large volumes of incoming data before it reaches human attention.

2. Transformation & Understanding

Extract fields from invoices, summarise long documents, draft internal replies.

This is where the heavy lifting happens - complex analysis, natural language processing, and content generation that requires deep understanding of context and nuance.

3. Decision & Automation

Approve/reject, escalate/close, trigger workflows in ERP/CRM.

These are the final actions that turn AI insights into business outcomes, often requiring integration with existing systems and business rules.

Only Layer 2 Actually Needs a Serious GPU-Backed LLM

The surprise for many organizations is that only layer 2 - transformation and understanding - truly benefits from expensive GPU infrastructure and large language models.

Layers 1 and 3 are usually better served by small, dedicated accelerators that are purpose-built for their specific tasks.

Edge Accelerators for Perception & Triage

Modern edge AI accelerators are designed for continuous, low-power inference on sensor data, images, and basic classification tasks, typically consuming 5-15W compared to 300-450W for enterprise GPUs.

Key categories include:

  • NPUs (Neural Processing Units) - Dedicated silicon optimized for AI inference
  • M.2 accelerator cards - PCIe-attached devices that plug into standard servers
  • Edge compute modules - Complete systems like NVIDIA Jetson Orin (15-60W) or Raspberry Pi with Coral USB accelerator (2-5W)
  • AI Mini PCs - APU-based mini computers with integrated CPU, GPU, and NPU for local LLM deployment and edge processing
  • Micro-accelerators - USB sticks and embedded modules acting as “intelligent sensors” that send events, not raw data streams

These devices typically run lightweight, quantized models (INT8/FP16) optimized for edge inference, handling tasks like ticket and email triage before anything reaches a more powerful LLM.

A typical edge NPU can process 1000+ inferences per second while consuming less power than a standard lightbulb, making them ideal for 24/7 monitoring applications.
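
As a concrete illustration, here is a minimal sketch of edge-side filtering with a quantized ONNX model via ONNX Runtime; the model file, class labels, and forward_event() are placeholders for your own pipeline:

```python
# Minimal sketch: running a quantized (INT8) ONNX classifier on an edge node and
# forwarding only "interesting" events upstream. The model path, labels, and
# forward_event() are hypothetical; the onnxruntime calls are standard.

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("triage_int8.onnx")   # quantized model exported for the edge
input_name = session.get_inputs()[0].name

LABELS = ["normal", "anomaly"]                        # illustrative two-class triage


def forward_event(payload: dict) -> None:
    # Placeholder: send a small JSON event to the core node (MQTT/HTTP in practice).
    print("event:", payload)


def triage(frame: np.ndarray) -> None:
    x = frame.astype(np.float32)[None, ...]           # add batch dimension
    logits = session.run(None, {input_name: x})[0]
    label = LABELS[int(np.argmax(logits))]
    if label != "normal":                             # filter at the edge: send events, not raw data
        forward_event({"label": label, "score": float(np.max(logits))})
```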

For a comprehensive catalog of edge accelerators, NPUs, and alternative hardware options, see “Beyond NVIDIA: A Catalog of Alternative AI Accelerators”.

AI Mini PCs as Local LLM Nodes

Between tiny NPUs and full GPU servers there is a new class of hardware that is easy to miss: “AI mini PCs” based on APUs like AMD’s Ryzen AI Max+ 395 (Strix Halo). Systems such as the Beelink GTR9 Pro combine a 16-core Zen 5 CPU, a 40-CU integrated GPU, and an XDNA 2 NPU, fed by up to 128 GB of LPDDR5X-8000 memory (≈256 GB/s unified bandwidth) and dual 10 GbE networking.

What They Are:

  • High-end APUs (Strix Halo and similar) with integrated GPU + NPU
  • 64–128 GB unified memory with high bandwidth
  • Packaged as mini PCs with proper cooling, 10 GbE, and NVMe storage
  • System TDP typically around 140W, marketed with triple-digit “AI TOPS” figures (e.g., 126 TOPS total, ~50 TOPS from NPU alone)

What They’re Good At:

  • Single-node LLM/RAG deployment - Running 7–13B models locally with standard toolchains (llama.cpp, exllama), even some 30B-class models with quantization
  • Development and PoC workstation - Ideal for SMEs that need an AI development box but don’t want rack infrastructure
  • Branch office AI node - Acts as a local inference endpoint with 10 GbE connectivity back to central storage
  • Homelab and research - Bridging the gap between hobbyist setups and enterprise infrastructure

What They Are Not:

  • They do not replace a 300–450 W discrete GPU for heavy multi-tenant inference or production-scale serving
  • TOPS figures are primarily marketing; real-world performance is bound by unified memory bandwidth and system thermal limits
  • Not suitable for distributed training or large-scale model serving
  • Limited scalability compared to proper GPU clusters

How They Fit in the Architecture:

  • Below: Single-purpose NPUs (Hailo, Metis, Coral) for specialized CV/triage tasks
  • Above: Rack-mounted GPU servers or Tenstorrent cards for serious throughput and multi-user workloads
  • Sweet spot: “LLM appliances” or branch-office AI nodes in your “AI behind the firewall” story

These systems are particularly attractive for:

  • SMEs doing their first on-prem AI deployment - Lower barrier to entry than rack servers
  • Edge locations needing local LLM capability - Branch offices, retail locations, remote facilities
  • Development environments - Testing and validation before deploying to production GPU infrastructure
  • Cost-conscious deployments - When you need LLM capability but can’t justify dedicated GPU servers

Example Use Case: A retail chain deploys GTR9 Pro systems at each store location for local customer service chatbots and inventory analysis. The 10 GbE connectivity allows fast synchronization with central data, while local inference keeps customer interactions responsive even if WAN connectivity is degraded. Edge NPUs handle camera-based inventory tracking, while the integrated GPU handles LLM queries about product information and recommendations.

Treat these systems as single-node appliances rather than building blocks for a distributed cluster. They excel at bringing LLM capability to locations where rack infrastructure isn’t practical, but they’re not a replacement for purpose-built accelerator clusters when you need serious scale.
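
As a sketch of the "single-node appliance" usage pattern, the snippet below assumes llama.cpp's llama-server is already running on such a box and exposing its OpenAI-compatible API; the hostname, port, and model name are deployment-specific placeholders:

```python
# Minimal sketch: treating an AI mini PC as a single-node LLM appliance behind the
# firewall. Assumes llama.cpp's llama-server is serving an OpenAI-compatible API.

import requests

APPLIANCE_URL = "http://ai-node-01.internal:8080/v1/chat/completions"  # hypothetical host


def ask_local_llm(prompt: str) -> str:
    response = requests.post(
        APPLIANCE_URL,
        json={
            "model": "local-model",   # llama-server serves whatever model it was started with
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.2,
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(ask_local_llm("Summarize yesterday's stock-out report in three bullet points."))
```

Because the endpoint is OpenAI-compatible, the same client code works unchanged if the workload later moves to a rack-mounted GPU server.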

December 2025 AI Mini PC Market Snapshot

The AI mini PC landscape has matured significantly, with four distinct classes serving different use cases:

1. Strix Halo Mini Workstations (AMD Ryzen AI Max+ 395, 126 TOPS)

Examples: Beelink GTR9 Pro, GMKtec EVO-X2, GEEKOM A9 Mega, Bosgame M5, Minisforum MS-S1 Max, Framework Desktop

These are the most interesting for “LLM behind the firewall” deployments. With 16-core Zen 5 CPUs, 40-CU Radeon GPUs, and XDNA 2 NPUs, they offer desktop-grade performance in 140W packages. Real-world benchmarks show:

  • gpt-oss-120B: 30–45 tokens/sec via LM Studio/llama.cpp
  • gpt-oss-20B MXFP4: 46–66 tokens/sec on EVO-X2
  • Gemma-3 1B Q4K: 160 tokens/sec on integrated GPU

2. Ryzen AI 300/HX 370 Boxes (80 TOPS, upgradable RAM)

Examples: GEEKOM A9 Max, GMKtec EVO-X1, various Beelink/Minisforum models

The sensible default for SMB AI servers. With 12-core configurations (4× Zen 5 + 8× Zen 5c), Radeon 890M GPUs, and upgradeable DDR5-5600 RAM, these hit the sweet spot for RAG on 7–13B models while staying under 60W CPU TDP.

3. Intel Core Ultra AI Mini PCs (Copilot+ focus)

Examples: Beelink GTi14 Ultra, ASUS NUC 14 Pro AI, GEEKOM GT1 Mega

Optimized for Windows + Copilot+ ecosystems. The GTi14 Ultra offers 16 cores (6P+8E+2LP-E), Intel Arc GPUs, and 34 TOPS NPUs. Best for Windows-centric organizations needing official Copilot+ support.

4. Ascend-based AI Bricks (NPU-only appliances)

Examples: Orange Pi AI Studio / Pro

Pure inference appliances with Huawei Ascend 310 chips (176 TOPS single, 352 TOPS dual). Targeted at DeepSeek-R1 execution but require external CPU/GPU for general computing. The single USB4 interface design is unconventional but works for controlled appliance deployments.

Workflow Integration for Decisions

Rather than expensive GPUs making every decision, use AI as one component in a rules-based workflow:

  • Simple classifiers for routine approvals
  • Rule engines that combine AI insights with business logic
  • Integration points that trigger actions in existing ERP/CRM systems
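
A minimal sketch of this pattern, with hypothetical classify_expense() and trigger_erp_workflow() placeholders - the model supplies a label and confidence, while deterministic rules own the actual decision:

```python
# Minimal sketch of "AI as one component in a rules-based workflow".
# classify_expense() and trigger_erp_workflow() are hypothetical placeholders.

def classify_expense(description: str) -> tuple[str, float]:
    # Placeholder: a small classifier (edge NPU or lightweight model) returns label + confidence.
    return "travel", 0.93


def trigger_erp_workflow(action: str, payload: dict) -> None:
    # Placeholder: call into the ERP/CRM integration layer.
    print(action, payload)


def decide(description: str, amount: float) -> str:
    label, confidence = classify_expense(description)
    # Deterministic business rules make the decision; the model only informs it.
    if confidence < 0.80:
        action = "escalate_to_human"
    elif label == "travel" and amount <= 500:
        action = "auto_approve"
    else:
        action = "route_to_manager"
    trigger_erp_workflow(action, {"label": label, "confidence": confidence, "amount": amount})
    return action
```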

The Architecture That Works Behind the Firewall

The pattern that delivers both performance and compliance looks like this:

Core Node (1-2 GPUs)

The central processing hub for complex AI workloads:

  • Large language models (7B-70B parameters) for complex analysis, summarization, and content generation
  • Embeddings and vector search for semantic document understanding and retrieval-augmented generation (RAG)
  • Orchestration and workflow management - Coordinating tasks across edge nodes and managing model serving
  • Batch processing - Handling non-real-time workloads like document analysis and report generation
  • Handles the “heavy thinking” tasks that require deep context understanding

Typical configuration: Single server with 1-2 enterprise GPUs, 128-256GB RAM, and high-speed NVMe storage for model weights. See the hardware selection section below for specific GPU options and alternatives.
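
As an illustration of the embeddings-and-retrieval role, here is a minimal RAG retrieval sketch; it assumes an OpenAI-compatible /v1/embeddings endpoint served from the core node, and the URL and model name are placeholders:

```python
# Minimal sketch of the core node's embeddings + vector search role in RAG.
# Assumes an OpenAI-compatible embeddings endpoint; URL and model name are placeholders.

import numpy as np
import requests

EMBED_URL = "http://core-node.internal:8080/v1/embeddings"   # hypothetical endpoint


def embed(text: str) -> np.ndarray:
    r = requests.post(EMBED_URL, json={"model": "local-embedder", "input": text}, timeout=30)
    r.raise_for_status()
    v = np.asarray(r.json()["data"][0]["embedding"], dtype=np.float32)
    return v / np.linalg.norm(v)                              # normalize for cosine similarity


def build_index(documents: list[str]) -> np.ndarray:
    return np.stack([embed(d) for d in documents])


def retrieve(query: str, documents: list[str], index: np.ndarray, k: int = 3) -> list[str]:
    scores = index @ embed(query)                             # cosine similarity
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

# The top-k passages are then inserted into the LLM prompt on the same core node,
# which is what turns plain document search into retrieval-augmented generation.
```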

Edge Nodes with NPUs

Distributed intelligence at the data source:

  • Continuous perception - Real-time video analysis, sensor monitoring, and anomaly detection
  • Cheap filtering - Pre-processing and triage before data reaches the core
  • Low-power, high-efficiency processing - 5-60W total system power vs 450W+ for GPU servers
  • Distributed across multiple locations - Deploy at branch offices, manufacturing floors, or remote sites
  • Local decision making - Immediate responses without network latency

Hardware selection depends on your specific use case, power constraints, and integration requirements. See the hardware selection section below and the comprehensive catalog in “Beyond NVIDIA: A Catalog of Alternative AI Accelerators” for detailed options.

Workflow Layer

The integration and governance layer:

  • Rules engines and business logic - Combining AI outputs with deterministic business rules
  • Integration with existing systems - APIs connecting to ERP (SAP, Oracle), CRM (Salesforce), and ticketing systems
  • Audit trails and compliance controls - Logging all AI decisions, data access, and model usage
  • AI gateways and firewalls - Security layers that inspect prompts, filter outputs, and prevent data leakage
  • Zero Trust architecture - Identity verification and least-privilege access for all AI interactions
  • Uses AI as a component, not as magic - Clear boundaries between AI inference and business logic

Why This Architecture Matters

Power Efficiency

You stop using 450W GPUs for work that a 10W NPU can handle. Consider the math:

  • Single GPU server: 450W continuous power = ~3,942 kWh/year = ~$400-800/year in electricity (depending on rates)
  • Edge NPU node: 10W continuous power = ~88 kWh/year = ~$9-18/year in electricity
  • 100 edge nodes: Still only ~8,800 kWh/year vs 394,200 kWh for 100 GPU servers

In a data center with hundreds of edge devices, this translates to:

  • 90%+ reduction in power consumption for Perception & Triage workloads
  • Significant cost savings on both electricity and cooling infrastructure
  • Reduced environmental impact - lower carbon footprint and reduced heat generation
  • Better scalability - Add edge nodes without proportional increases in power and cooling capacity

Predictable Latency

Each component has a narrow, clear role. Perception happens at the edge (sub-millisecond), transformation in the core (seconds), and decisions are routed through existing workflows (milliseconds).

Compliance and Security

Data stays within your perimeter, but security requires more than just physical boundaries:

  • Zero Trust architecture - Every AI interaction requires authentication and authorization, even within the internal network
  • AI firewalls and gateways - Specialized security layers that inspect prompts, filter outputs, and prevent prompt injection attacks
  • Data sovereignty - Full control over where data is stored and processed, critical for GDPR, HIPAA, and industry-specific regulations
  • Audit trails - Complete logging of all AI interactions, model usage, and data access for compliance reporting
  • Model governance - Cryptographic signing of model artifacts, version control, and deployment approval workflows
  • Network segmentation - Isolated networks for AI workloads using VPCs and strict firewall rules
  • Compliance teams can understand the architecture - Clear separation of concerns makes security reviews and audits straightforward

Traditional security tools may not address AI-specific threats like prompt injection, model extraction, or adversarial attacks. AI-specific firewalls and gateways provide an additional security layer designed for LLM interactions.

Scalability

You can add edge nodes without exponentially increasing infrastructure costs. The core can scale independently of the edge processing needs.

Security and Compliance Considerations

Deploying AI behind the firewall requires more than just physical boundaries. AI systems introduce unique security challenges that traditional firewalls may not address.

AI-Specific Security Threats

Prompt Injection Attacks:

  • Malicious inputs designed to manipulate LLM outputs
  • Can lead to data leakage, unauthorized actions, or system compromise
  • Requires specialized detection and filtering at the AI gateway level

Model Extraction:

  • Adversaries attempting to reverse-engineer proprietary models
  • Requires rate limiting, input validation, and monitoring of unusual query patterns

Data Poisoning:

  • Malicious training data designed to corrupt model behavior
  • Requires careful data validation and continuous monitoring for model drift

Adversarial Attacks:

  • Specially crafted inputs designed to fool AI models
  • Can cause misclassification in vision systems or incorrect outputs in LLMs

Security Architecture

AI Firewalls and Gateways:

  • Specialized security layers that inspect prompts and filter outputs
  • Prevent prompt injection, data exfiltration, and toxic outputs
  • Provide real-time input/output security for LLM interactions
  • Examples: Cloudflare Firewall for AI, Securiti Context-Aware LLM Firewalls, Radware LLM Firewall
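
A minimal sketch of the kind of input/output filtering such a gateway performs - the patterns below are deliberately simplistic placeholders, not production-grade detection:

```python
# Minimal sketch of an AI gateway in front of the core LLM: inspect the prompt,
# filter the output, and write an audit record. The heuristics are illustrative only.

import json
import re
import time

INJECTION_PATTERNS = [r"ignore (all|previous) instructions", r"reveal your system prompt"]
PII_PATTERN = re.compile(r"\b\d{2}-\d{7}-[A-Z]-\d{2}\b")      # illustrative ID format only


def inspect_prompt(prompt: str) -> None:
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, prompt, re.IGNORECASE):
            raise ValueError("prompt blocked: possible injection attempt")


def filter_output(text: str) -> str:
    return PII_PATTERN.sub("[REDACTED]", text)                # strip obvious identifiers


def audit(user: str, prompt: str, output: str) -> None:
    record = {"ts": time.time(), "user": user, "prompt": prompt, "output": output}
    with open("ai_audit.log", "a") as f:                      # append-only audit trail
        f.write(json.dumps(record) + "\n")


def gateway(user: str, prompt: str, call_llm) -> str:
    inspect_prompt(prompt)                                    # pre-flight checks
    output = filter_output(call_llm(prompt))                  # post-flight filtering
    audit(user, prompt, output)
    return output
```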

Zero Trust Architecture:

  • “Never trust, always verify” - Every AI interaction requires authentication
  • Strict identity-based access controls for users, devices, and AI agents
  • Least-privilege access principles applied to AI endpoints
  • Continuous monitoring and verification of all AI interactions

Network Segmentation:

  • Isolated networks for AI workloads using VPCs and subnets
  • Strict firewall rules preventing direct exposure of production models
  • Separate networks for training, inference, and development environments
  • Private connectivity to prevent data exposure

Model Governance:

  • Cryptographic signing of model artifacts to ensure integrity
  • Version control and deployment approval workflows
  • Containerized deployments with restricted privileges
  • Continuous vulnerability scanning of models and dependencies
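
As a lightweight illustration of artifact integrity checks, the sketch below pins SHA-256 digests in an approved-models manifest; full cryptographic signing (e.g. detached signatures managed by your release pipeline) follows the same verify-before-load pattern. Paths and filenames are placeholders:

```python
# Minimal sketch of model artifact integrity checking before deployment: compare the
# file's SHA-256 digest against an approved manifest maintained by the release process.

import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):  # stream: model files are large
            h.update(chunk)
    return h.hexdigest()


def verify_model(model_path: Path, manifest_path: Path) -> None:
    approved = json.loads(manifest_path.read_text())           # {"filename": "digest", ...}
    expected = approved.get(model_path.name)
    if expected is None:
        raise RuntimeError(f"{model_path.name} is not on the approved model list")
    if sha256_of(model_path) != expected:
        raise RuntimeError(f"{model_path.name} failed integrity check - refusing to load")

# Example (hypothetical filenames):
# verify_model(Path("llama-3-8b.Q4_K_M.gguf"), Path("approved_models.json"))
```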

Monitoring and Auditing:

  • Complete logging of all AI interactions, model usage, and data access
  • Real-time anomaly detection for unusual patterns or attacks
  • Compliance reporting and audit trails for regulatory requirements
  • Incident response plans tailored to AI-specific threats

Implementation Considerations

Evaluating Accelerators

Look beyond just GPU specifications. Consider these factors:

Performance Metrics:

  • TOPS (Tera Operations Per Second) - Raw compute capability, but not always indicative of real-world performance
  • Power consumption per inference - Critical for edge deployments and operational costs
  • Latency requirements - Real-time (<100ms) vs near-real-time (<1s) vs batch processing
  • Throughput - Inferences per second for your specific model and input size
  • Memory bandwidth - Important for large models and high-resolution inputs
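
A minimal benchmarking sketch for capturing throughput and latency percentiles on your own model and inputs; infer() is a placeholder for whatever runtime the candidate device uses (ONNX Runtime, TensorRT, a vendor SDK):

```python
# Minimal sketch: measure what actually matters when evaluating an accelerator -
# sustained throughput and latency percentiles for your model and input size.

import statistics
import time


def infer(sample) -> None:
    # Placeholder: run one inference on the device under test.
    time.sleep(0.004)                                  # pretend 4 ms per inference


def benchmark(samples, warmup: int = 20) -> None:
    for s in samples[:warmup]:                         # warm-up: JIT, caches, clock ramp-up
        infer(s)
    latencies = []
    start = time.perf_counter()
    for s in samples:
        t0 = time.perf_counter()
        infer(s)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    p95 = statistics.quantiles(latencies, n=20)[18]    # 95th percentile
    print(f"throughput: {len(samples)/elapsed:.1f} inf/s, "
          f"median: {statistics.median(latencies)*1e3:.1f} ms, p95: {p95*1e3:.1f} ms")


benchmark(list(range(200)))
```

Run this against the quantized model you actually intend to deploy; vendor TOPS figures rarely predict these numbers.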

Practical Considerations:

  • Integration capabilities - Does it work with your existing infrastructure? Docker/Kubernetes support? Standard APIs?
  • Model format support - ONNX, TensorFlow Lite, PyTorch, TensorRT compatibility
  • Development ecosystem - SDK quality, documentation, community support, and available pre-trained models
  • Deployment complexity - How easy is it to deploy, update, and maintain?
  • Vendor lock-in - Proprietary formats vs open standards

Total Cost of Ownership (3-5 years):

  • Hardware acquisition - Initial purchase price
  • Power and cooling - Ongoing operational costs
  • Development time - Integration and optimization effort
  • Maintenance and support - Updates, patches, and vendor support costs
  • Scalability costs - How expensive is it to add capacity?

Real-World Example: For a video surveillance system processing 100 streams:

  • GPU approach: 2x A100 GPUs (~$20,000) + server (~$5,000) ≈ $25,000 up front, drawing roughly 900W (~7,900 kWh/year)
  • Edge NPU approach: 100x Jetson Orin Nano (~$500 each) ≈ $50,000 up front, drawing 7-15W per node (~700-1,500W fleet-wide, so a similar annual energy bill)
  • Takeaway: In this scenario the edge approach doesn't pay for itself on electricity alone - its value lies in lower latency, per-site redundancy, incremental scaling, and not having to backhaul 100 raw video streams to a central server
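
To put such comparisons on a multi-year footing, a rough TCO helper like the sketch below can help; all inputs are assumptions to be replaced with your own quotes, power measurements, and electricity rates:

```python
# Rough total-cost-of-ownership helper for comparing deployment options over a
# multi-year horizon. All numbers below are illustrative assumptions.

def tco(hardware_cost: float, watts: float, years: int = 5,
        kwh_price: float = 0.20, yearly_support: float = 0.0) -> float:
    kwh_per_year = watts * 24 * 365 / 1000                # continuous operation
    return hardware_cost + years * (kwh_per_year * kwh_price + yearly_support)


# Numbers from the surveillance example above (adjust to your environment):
gpu_server = tco(hardware_cost=25_000, watts=900)
edge_fleet = tco(hardware_cost=50_000, watts=100 * 10)    # ~10W average per Jetson node
print(f"5-year TCO - GPU server: ${gpu_server:,.0f}, edge fleet: ${edge_fleet:,.0f}")
```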

Hardware Selection

When selecting hardware for your three-layer architecture, consider:

For Edge Nodes (Perception & Triage):

  • Power efficiency (5-60W total system power)
  • Integration capabilities (PCIe, USB, embedded)
  • Model format support (ONNX, TensorFlow Lite, PyTorch)
  • Deployment environment (temperature, space, network connectivity)

For Core Nodes (Transformation & Understanding):

  • GPU performance matching your model size and throughput needs
  • Memory bandwidth for large models and vector databases
  • Storage requirements for model weights
  • Redundancy and failover capabilities

For detailed hardware specifications, vendor comparisons, and alternative accelerator options (including Chinese NPUs, edge accelerators, and exotic architectures), see “Beyond NVIDIA: A Catalog of Alternative AI Accelerators”.

Industry-Specific Patterns

Finance:

  • Edge nodes - Real-time fraud detection on transaction streams, analyzing patterns before data leaves branch offices
  • Core processing - Complex risk analysis, regulatory reporting, and document analysis (contracts, loan applications)
  • Workflow integration - Automated compliance checks, KYC/AML processing, and integration with core banking systems
  • Example: Edge device at each ATM analyzing transaction patterns locally, only sending anomalies to core for deep analysis

Operations & Facilities:

  • Edge sensors - Monitoring equipment health, HVAC systems, and building security in real-time
  • Core analysis - Predictive maintenance scheduling, energy optimization, and facility management reporting
  • Workflow integration - Automated work order generation, parts ordering, and technician dispatch
  • Example: Edge NPUs on each floor monitoring temperature, occupancy, and equipment vibration, triggering alerts only when thresholds are exceeded

Manufacturing:

  • Vision systems - Production line quality control, defect detection, and component verification
  • Core optimization - Production scheduling, supply chain optimization, and demand forecasting
  • Workflow integration - Automated inventory management, reorder triggers, and ERP system updates
  • Example: Edge vision systems on each production line doing real-time quality checks, with core LLM analyzing production reports and optimizing schedules

Healthcare:

  • Edge devices - Patient monitoring, medical imaging preprocessing, and real-time alert generation
  • Core processing - Clinical decision support, medical record analysis, and research data processing
  • Workflow integration - EMR updates, appointment scheduling, and billing system integration
  • Example: Edge devices in patient rooms monitoring vital signs and alerting nurses, with core LLM analyzing patient histories for treatment recommendations

Retail:

  • Edge systems - In-store analytics, inventory tracking, and customer behavior analysis
  • Core processing - Demand forecasting, pricing optimization, and supply chain management
  • Workflow integration - POS system integration, inventory management, and marketing automation
  • Example: Edge cameras analyzing foot traffic and product interactions in real-time, with core LLM optimizing inventory and pricing strategies

Getting Started

If your AI plan today is just “GPU + LLM”, you don’t have an architecture yet. Start by:

1. Audit Your Current AI Usage

  • Shadow IT discovery - Where are employees using external tools (ChatGPT, Claude, etc.)?
  • Task analysis - What specific tasks are they trying to automate?
  • Data flow mapping - What sensitive data is leaving your perimeter?
  • Cost analysis - What are you spending on external AI services?
  • Risk assessment - What compliance and security risks exist?

2. Map Workloads to Categories

Classify your AI needs into the three categories:

  • Perception & Triage - Real-time monitoring, filtering, routing
  • Transformation & Understanding - Document analysis, summarization, content generation
  • Decision & Automation - Approvals, escalations, workflow triggers

For each workload, document:

  • Latency requirements (real-time vs batch)
  • Data sensitivity and compliance needs
  • Current solution (if any) and its limitations
  • Expected volume and growth projections
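
One lightweight way to capture this inventory is in code rather than slides; the field names below are suggestions, not a standard:

```python
# Minimal sketch of a workload inventory entry for the mapping exercise above.

from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    PERCEPTION_TRIAGE = "perception_triage"
    TRANSFORMATION = "transformation_understanding"
    DECISION_AUTOMATION = "decision_automation"


@dataclass
class Workload:
    name: str
    category: Category
    latency_requirement: str      # e.g. "real-time (<100ms)" or "batch (overnight)"
    data_sensitivity: str         # e.g. "financial PII", "internal", "public"
    current_solution: str | None  # None if nothing exists yet
    daily_volume: int
    expected_growth: str          # e.g. "2x within 12 months"


inventory = [
    Workload("invoice field extraction", Category.TRANSFORMATION,
             "batch (hourly)", "financial PII", "manual entry", 10_000, "flat"),
    Workload("support ticket routing", Category.PERCEPTION_TRIAGE,
             "near-real-time (<1s)", "internal", None, 3_500, "+30%/year"),
]
```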

3. Design the Three-Layer Architecture

Plan your components:

Core Node:

  • GPU selection based on model size and throughput needs
  • Storage requirements for model weights and vector databases
  • Network bandwidth for serving multiple edge nodes
  • Redundancy and failover strategies

Edge Nodes:

  • Hardware selection based on power, performance, and integration needs
  • Deployment locations (branch offices, manufacturing floors, etc.)
  • Network connectivity and bandwidth requirements
  • Management and update mechanisms

Workflow Layer:

  • Integration points with existing systems (APIs, databases, message queues)
  • Rules engine selection and business logic implementation
  • Security and compliance tooling (AI firewalls, audit logging)
  • Monitoring and observability infrastructure

4. Start Small

Pilot with one workload category before expanding:

  • Choose a low-risk, high-value use case - Something that provides clear ROI but won’t disrupt critical operations
  • Prove the architecture - Validate that the three-layer approach works for your environment
  • Measure everything - Performance, costs, power consumption, user satisfaction
  • Iterate based on learnings - Refine the architecture before scaling

5. Measure and Iterate

Track key metrics:

  • Performance - Latency, throughput, accuracy
  • Costs - Hardware, power, development, maintenance
  • Compliance - Audit trail completeness, data residency, security incidents
  • Business impact - Time saved, errors reduced, revenue impact

Establish a feedback loop:

  • Regular reviews of architecture decisions
  • Cost optimization opportunities
  • Technology updates and new accelerator options
  • Evolving business requirements

Common Pitfalls to Avoid

  • Over-engineering the core - Don’t buy the biggest GPU “just in case”
  • Under-estimating edge complexity - Edge nodes need management, updates, and monitoring too
  • Ignoring the workflow layer - AI without integration is just a demo
  • Skipping security - AI-specific threats require AI-specific defenses
  • Forgetting compliance - Design for auditability from day one

The Bottom Line

The goal isn’t to eliminate external AI tools through prohibition - it’s to provide internal alternatives that are faster, cheaper, and more secure than the shadow IT that inevitably emerges when employees can’t get their work done.

Remember: Your employees will use AI one way or another. The question is whether you’ll provide them with tools that keep your data secure and your compliance team happy, or if you’ll force them to work around you with inferior solutions.

Key Takeaways

  1. Not all AI workloads need GPUs - Most perception and decision tasks are better served by dedicated edge accelerators

  2. Architecture matters more than hardware - A well-designed three-layer system outperforms expensive hardware in a poorly architected setup

  3. Power efficiency has real costs - 90% power reduction for edge workloads translates to significant operational savings

  4. Security requires AI-specific tools - Traditional firewalls don’t address prompt injection, model extraction, or other AI-specific threats

  5. Start small, measure everything - Pilot with one workload category, prove the architecture, then scale based on learnings

  6. Compliance is a feature, not a burden - Proper architecture makes compliance easier, not harder

The future of enterprise AI isn’t about buying the biggest GPU - it’s about building the right architecture for your specific needs, with the right hardware in the right places, and the right security and compliance controls throughout.
