AI Technology Infrastructure

You Need More Than a Big LLM

Why enterprise AI needs more than just GPUs and LLMs - exploring the hardware architecture for real AI systems behind the firewall.

Most “enterprise AI strategies” I hear boil down to:

Buy a big GPU, pick a big LLM, put it near our data, done.

That’s not a strategy. That’s a shopping list.

This article builds on the foundation established in “AI systems that stay behind your firewall”, which outlines why enterprises need internal AI solutions. Here, we explore how to architect those solutions.

Chat Interfaces vs. Process Automation

Here’s what often gets missed: LLMs with chat interfaces are excellent for ad-hoc knowledge work, and I’m not arguing against their value. When an employee needs to draft a proposal, research a topic, or brainstorm ideas, a conversational AI interface is genuinely useful.

But businesses don’t run on ad-hoc queries. They run on repeatable processes: thousands of invoices flowing through approval workflows, support tickets getting triaged and routed, compliance checks happening on every transaction, inventory systems responding to demand signals.

The difference matters:

  • Chat interface: Employee asks “Can you summarize this contract?” and gets an answer
  • Process automation: System processes 10,000 invoices per day, extracting fields, validating against business rules, routing exceptions, and updating ERP systems—without anyone opening a chat window

Most organizations need both, but conflate them. They deploy a chat interface and wonder why it hasn’t transformed their operations. Chat is for exploration and knowledge work; workflow automation is for the repetitive, high-volume tasks that actually drive business efficiency.

The architecture challenge isn’t just “put an LLM behind our firewall.” It’s designing a system where AI components integrate into existing workflows to solve specific process bottlenecks—not just provide another place for employees to type questions.
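
To make the distinction concrete, here is a minimal sketch of what process automation looks like in code, assuming a hypothetical internal LLM extraction service and ERP integration (call_llm_extract and post_to_erp are placeholders, not a specific product API):

```python
# Minimal sketch of workflow-driven invoice processing - no chat window involved.
# call_llm_extract() and post_to_erp() are hypothetical placeholders for your own
# internal services.

from dataclasses import dataclass

APPROVAL_LIMIT = 10_000  # business rule, not an AI decision


@dataclass
class Invoice:
    vendor: str
    amount: float
    purchase_order: str | None


def call_llm_extract(raw_document: bytes) -> Invoice:
    # Placeholder: in a real deployment this calls an internal LLM extraction endpoint.
    return Invoice(vendor="ACME GmbH", amount=432.10, purchase_order="PO-1001")


def post_to_erp(invoice: Invoice) -> None:
    # Placeholder: in a real deployment this calls the ERP system's API.
    print(f"ERP updated for {invoice.vendor}: {invoice.amount:.2f}")


def process_invoice(raw_document: bytes) -> str:
    invoice = call_llm_extract(raw_document)   # AI: transformation & understanding
    if invoice.purchase_order is None:         # deterministic business rules
        return "route_to_exceptions_queue"
    if invoice.amount > APPROVAL_LIMIT:
        return "route_to_human_approval"
    post_to_erp(invoice)                       # decision & automation
    return "auto_approved"


if __name__ == "__main__":
    print(process_invoice(b"%PDF-..."))
```

The LLM does one well-scoped job (field extraction); everything around it is ordinary, auditable workflow code.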

Inside a real business, AI workloads fall into three very different categories:

1. Perception & Triage

What’s in this frame? Is this event anomalous? Which queue should this ticket go to?

These are continuous monitoring tasks that need to run 24/7 with minimal latency. They’re about filtering and routing large volumes of incoming data before it reaches human attention.

2. Transformation & Understanding

Extract fields from invoices, summarise long documents, draft internal replies.

This is where the heavy lifting happens - complex analysis, natural language processing, and content generation that requires deep understanding of context and nuance.

3. Decision & Automation

Approve/reject, escalate/close, trigger workflows in ERP/CRM.

These are the final actions that turn AI insights into business outcomes, often requiring integration with existing systems and business rules.

Only Layer 2 Actually Needs a Serious GPU-Backed LLM

The surprise for many organizations is that only layer 2 - transformation and understanding - truly benefits from expensive GPU infrastructure and large language models.

Layers 1 and 3 are usually better served by small, dedicated accelerators that are purpose-built for their specific tasks.

Edge Accelerators for Perception & Triage

Modern edge AI accelerators are designed for continuous, low-power inference on sensor data, images, and basic classification tasks, typically consuming 5-15W compared to 300-450W for enterprise GPUs.

Key categories include:

  • NPUs (Neural Processing Units) - Dedicated silicon optimized for AI inference
  • M.2 accelerator cards - PCIe-attached devices that plug into standard servers
  • Edge compute modules - Complete systems like NVIDIA Jetson Orin (15-60W) or Raspberry Pi with Coral USB accelerator (2-5W)
  • AI Mini PCs - APU-based mini computers with integrated CPU, GPU, and NPU for local LLM deployment and edge processing
  • Micro-accelerators - USB sticks and embedded modules acting as “intelligent sensors” that send events, not raw data streams

These devices typically run lightweight, quantized models (INT8/FP16) optimized for edge inference, handling tasks like ticket and email triage before anything reaches a more powerful LLM.

A typical edge NPU can process 1000+ inferences per second while consuming less power than a standard lightbulb, making them ideal for 24/7 monitoring applications.
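
As a concrete illustration, here is a minimal sketch of edge-side filtering with a quantized ONNX model via ONNX Runtime; the model file, class labels, and forward_event() are placeholders for your own pipeline:

```python
# Minimal sketch: running a quantized (INT8) ONNX classifier on an edge node and
# forwarding only "interesting" events upstream. The model path, labels, and
# forward_event() are hypothetical; the onnxruntime calls are standard.

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("triage_int8.onnx")   # quantized model exported for the edge
input_name = session.get_inputs()[0].name

LABELS = ["normal", "anomaly"]                        # illustrative two-class triage


def forward_event(payload: dict) -> None:
    # Placeholder: send a small JSON event to the core node (MQTT/HTTP in practice).
    print("event:", payload)


def triage(frame: np.ndarray) -> None:
    x = frame.astype(np.float32)[None, ...]           # add batch dimension
    logits = session.run(None, {input_name: x})[0]
    label = LABELS[int(np.argmax(logits))]
    if label != "normal":                             # filter at the edge: send events, not raw data
        forward_event({"label": label, "score": float(np.max(logits))})
```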

For a comprehensive catalog of edge accelerators, NPUs, and alternative hardware options, see “Beyond NVIDIA: A Catalog of Alternative AI Accelerators”.

AI Mini PCs as Local LLM Nodes

Between tiny NPUs and full GPU servers there is a new class of hardware that is easy to miss: “AI mini PCs” based on APUs like AMD’s Ryzen AI Max+ 395 (Strix Halo). Systems such as the Beelink GTR9 Pro combine a 16-core Zen 5 CPU, a 40-CU integrated GPU, and an XDNA 2 NPU, fed by up to 128 GB of LPDDR5X-8000 memory (≈256 GB/s unified bandwidth) and dual 10 GbE networking.

What They Are:

  • High-end APUs (Strix Halo and similar) with integrated GPU + NPU
  • 64–128 GB unified memory with high bandwidth
  • Packaged as mini PCs with proper cooling, 10 GbE, and NVMe storage
  • System TDP typically around 140W, marketed with triple-digit “AI TOPS” figures (e.g., 126 TOPS total, ~50 TOPS from NPU alone)

What They’re Good At:

  • Single-node LLM/RAG deployment - Running 7–13B models locally with standard toolchains (llama.cpp, exllama), even some 30B-class models with quantization
  • Development and PoC workstation - Ideal for SMEs that need an AI development box but don’t want rack infrastructure
  • Branch office AI node - Acts as a local inference endpoint with 10 GbE connectivity back to central storage
  • Homelab and research - Bridging the gap between hobbyist setups and enterprise infrastructure

What They Are Not:

  • They do not replace a 300–450 W discrete GPU for heavy multi-tenant inference or production-scale serving
  • TOPS figures are primarily marketing; real-world performance is bound by unified memory bandwidth and system thermal limits
  • Not suitable for distributed training or large-scale model serving
  • Limited scalability compared to proper GPU clusters

How They Fit in the Architecture:

  • Below: Single-purpose NPUs (Hailo, Metis, Coral) for specialized CV/triage tasks
  • Above: Rack-mounted GPU servers or Tenstorrent cards for serious throughput and multi-user workloads
  • Sweet spot: “LLM appliances” or branch-office AI nodes in your “AI behind the firewall” story

These systems are particularly attractive for:

  • SMEs doing their first on-prem AI deployment - Lower barrier to entry than rack servers
  • Edge locations needing local LLM capability - Branch offices, retail locations, remote facilities
  • Development environments - Testing and validation before deploying to production GPU infrastructure
  • Cost-conscious deployments - When you need LLM capability but can’t justify dedicated GPU servers

Example Use Case: A retail chain deploys GTR9 Pro systems at each store location for local customer service chatbots and inventory analysis. The 10 GbE connectivity allows fast synchronization with central data, while local inference keeps customer interactions responsive even if WAN connectivity is degraded. Edge NPUs handle camera-based inventory tracking, while the integrated GPU handles LLM queries about product information and recommendations.

Treat these systems as single-node appliances rather than building blocks for a distributed cluster. They excel at bringing LLM capability to locations where rack infrastructure isn’t practical, but they’re not a replacement for purpose-built accelerator clusters when you need serious scale.
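
As a sketch of the "single-node appliance" usage pattern, the snippet below assumes llama.cpp's llama-server is already running on such a box and exposing its OpenAI-compatible API; the hostname, port, and model name are deployment-specific placeholders:

```python
# Minimal sketch: treating an AI mini PC as a single-node LLM appliance behind the
# firewall. Assumes llama.cpp's llama-server is serving an OpenAI-compatible API.

import requests

APPLIANCE_URL = "http://ai-node-01.internal:8080/v1/chat/completions"  # hypothetical host


def ask_local_llm(prompt: str) -> str:
    response = requests.post(
        APPLIANCE_URL,
        json={
            "model": "local-model",   # llama-server serves whatever model it was started with
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.2,
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(ask_local_llm("Summarize yesterday's stock-out report in three bullet points."))
```

Because the endpoint is OpenAI-compatible, the same client code works unchanged if the workload later moves to a rack-mounted GPU server.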

December 2025 AI Mini PC Market Snapshot

The AI mini PC landscape has matured significantly, with four distinct classes serving different use cases:

1. Strix Halo Mini Workstations (AMD Ryzen AI Max+ 395, 126 TOPS)

Examples: Beelink GTR9 Pro, GMKtec EVO-X2, GEEKOM A9 Mega, Bosgame M5, Minisforum MS-S1 Max, Framework Desktop

These are the most interesting for “LLM behind the firewall” deployments. With 16-core Zen 5 CPUs, 40-CU Radeon GPUs, and XDNA 2 NPUs, they offer desktop-grade performance in 140W packages. Real-world benchmarks show:

  • gpt-oss-120B: 30–45 tokens/sec via LM Studio/llama.cpp
  • gpt-oss-20B MXFP4: 46–66 tokens/sec on EVO-X2
  • Gemma-3 1B Q4K: 160 tokens/sec on integrated GPU

2. Ryzen AI 300/HX 370 Boxes (80 TOPS, upgradable RAM)

Examples: GEEKOM A9 Max, GMKtec EVO-X1, various Beelink/Minisforum models

The sensible default for SMB AI servers. With 12-core configurations (4× Zen 5 + 8× Zen 5c), Radeon 890M GPUs, and upgradeable DDR5-5600 RAM, these hit the sweet spot for RAG on 7–13B models while staying under 60W CPU TDP.

3. Intel Core Ultra AI Mini PCs (Copilot+ focus)

Examples: Beelink GTi14 Ultra, ASUS NUC 14 Pro AI, GEEKOM GT1 Mega

Optimized for Windows + Copilot+ ecosystems. The GTi14 Ultra offers 16 cores (6P+8E+2LP-E), Intel Arc GPUs, and 34 TOPS NPUs. Best for Windows-centric organizations needing official Copilot+ support.

4. Ascend-based AI Bricks (NPU-only appliances)

Examples: Orange Pi AI Studio / Pro

Pure inference appliances with Huawei Ascend 310 chips (176 TOPS single, 352 TOPS dual). Targeted at DeepSeek-R1 execution but require external CPU/GPU for general computing. The single USB4 interface design is unconventional but works for controlled appliance deployments.

Workflow Integration for Decisions

Rather than expensive GPUs making every decision, use AI as one component in a rules-based workflow:

  • Simple classifiers for routine approvals
  • Rule engines that combine AI insights with business logic
  • Integration points that trigger actions in existing ERP/CRM systems
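
A minimal sketch of this pattern, with hypothetical classify_expense() and trigger_erp_workflow() placeholders - the model supplies a label and confidence, while deterministic rules own the actual decision:

```python
# Minimal sketch of "AI as one component in a rules-based workflow".
# classify_expense() and trigger_erp_workflow() are hypothetical placeholders.

def classify_expense(description: str) -> tuple[str, float]:
    # Placeholder: a small classifier (edge NPU or lightweight model) returns label + confidence.
    return "travel", 0.93


def trigger_erp_workflow(action: str, payload: dict) -> None:
    # Placeholder: call into the ERP/CRM integration layer.
    print(action, payload)


def decide(description: str, amount: float) -> str:
    label, confidence = classify_expense(description)
    # Deterministic business rules make the decision; the model only informs it.
    if confidence < 0.80:
        action = "escalate_to_human"
    elif label == "travel" and amount <= 500:
        action = "auto_approve"
    else:
        action = "route_to_manager"
    trigger_erp_workflow(action, {"label": label, "confidence": confidence, "amount": amount})
    return action
```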

The Architecture That Works Behind the Firewall

The pattern that delivers both performance and compliance looks like this:

Core Node (1-2 GPUs)

The central processing hub for complex AI workloads:

  • Large language models (7B-70B parameters) for complex analysis, summarization, and content generation
  • Embeddings and vector search for semantic document understanding and retrieval-augmented generation (RAG)
  • Orchestration and workflow management - Coordinating tasks across edge nodes and managing model serving
  • Batch processing - Handling non-real-time workloads like document analysis and report generation
  • Handles the “heavy thinking” tasks that require deep context understanding

Typical configuration: Single server with 1-2 enterprise GPUs, 128-256GB RAM, and high-speed NVMe storage for model weights. See the hardware selection section below for specific GPU options and alternatives.
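
As an illustration of the embeddings-and-retrieval role, here is a minimal RAG retrieval sketch; it assumes an OpenAI-compatible /v1/embeddings endpoint served from the core node, and the URL and model name are placeholders:

```python
# Minimal sketch of the core node's embeddings + vector search role in RAG.
# Assumes an OpenAI-compatible embeddings endpoint; URL and model name are placeholders.

import numpy as np
import requests

EMBED_URL = "http://core-node.internal:8080/v1/embeddings"   # hypothetical endpoint


def embed(text: str) -> np.ndarray:
    r = requests.post(EMBED_URL, json={"model": "local-embedder", "input": text}, timeout=30)
    r.raise_for_status()
    v = np.asarray(r.json()["data"][0]["embedding"], dtype=np.float32)
    return v / np.linalg.norm(v)                              # normalize for cosine similarity


def build_index(documents: list[str]) -> np.ndarray:
    return np.stack([embed(d) for d in documents])


def retrieve(query: str, documents: list[str], index: np.ndarray, k: int = 3) -> list[str]:
    scores = index @ embed(query)                             # cosine similarity
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

# The top-k passages are then inserted into the LLM prompt on the same core node,
# which is what turns plain document search into retrieval-augmented generation.
```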

Edge Nodes with NPUs

Distributed intelligence at the data source:

  • Continuous perception - Real-time video analysis, sensor monitoring, and anomaly detection
  • Cheap filtering - Pre-processing and triage before data reaches the core
  • Low-power, high-efficiency processing - 5-60W total system power vs 450W+ for GPU servers
  • Distributed across multiple locations - Deploy at branch offices, manufacturing floors, or remote sites
  • Local decision making - Immediate responses without network latency

Hardware selection depends on your specific use case, power constraints, and integration requirements. See the hardware selection section below and the comprehensive catalog in “Beyond NVIDIA: A Catalog of Alternative AI Accelerators” for detailed options.

Workflow Layer

The integration and governance layer:

  • Rules engines and business logic - Combining AI outputs with deterministic business rules
  • Integration with existing systems - APIs connecting to ERP (SAP, Oracle), CRM (Salesforce), and ticketing systems
  • Audit trails and compliance controls - Logging all AI decisions, data access, and model usage
  • AI gateways and firewalls - Security layers that inspect prompts, filter outputs, and prevent data leakage
  • Zero Trust architecture - Identity verification and least-privilege access for all AI interactions
  • Uses AI as a component, not as magic - Clear boundaries between AI inference and business logic

Why This Architecture Matters

Power Efficiency

You stop using 450W GPUs for work that a 10W NPU can handle. Consider the math:

  • Single GPU server: 450W continuous power = ~3,942 kWh/year = ~$400-800/year in electricity (depending on rates)
  • Edge NPU node: 10W continuous power = ~88 kWh/year = ~$9-18/year in electricity
  • 100 edge nodes: Still only ~8,800 kWh/year vs 394,200 kWh for 100 GPU servers

In a data center with hundreds of edge devices, this translates to:

  • 90%+ reduction in power consumption for Perception & Triage workloads
  • Significant cost savings on both electricity and cooling infrastructure
  • Reduced environmental impact - lower carbon footprint and reduced heat generation
  • Better scalability - Add edge nodes without proportional increases in power and cooling capacity

Predictable Latency

Each component has a narrow, clear role. Perception happens at the edge (sub-millisecond), transformation in the core (seconds), and decisions are routed through existing workflows (milliseconds).

Compliance and Security

Data stays within your perimeter, but security requires more than just physical boundaries:

  • Zero Trust architecture - Every AI interaction requires authentication and authorization, even within the internal network
  • AI firewalls and gateways - Specialized security layers that inspect prompts, filter outputs, and prevent prompt injection attacks
  • Data sovereignty - Full control over where data is stored and processed, critical for GDPR, HIPAA, and industry-specific regulations
  • Audit trails - Complete logging of all AI interactions, model usage, and data access for compliance reporting
  • Model governance - Cryptographic signing of model artifacts, version control, and deployment approval workflows
  • Network segmentation - Isolated networks for AI workloads using VPCs and strict firewall rules
  • Compliance teams can understand the architecture - Clear separation of concerns makes security reviews and audits straightforward

Traditional security tools may not address AI-specific threats like prompt injection, model extraction, or adversarial attacks. AI-specific firewalls and gateways provide an additional security layer designed for LLM interactions.

Scalability

You can add edge nodes without exponentially increasing infrastructure costs. The core can scale independently of the edge processing needs.

Security and Compliance Considerations

Deploying AI behind the firewall requires more than just physical boundaries. AI systems introduce unique security challenges that traditional firewalls may not address.

AI-Specific Security Threats

Prompt Injection Attacks:

  • Malicious inputs designed to manipulate LLM outputs
  • Can lead to data leakage, unauthorized actions, or system compromise
  • Requires specialized detection and filtering at the AI gateway level

Model Extraction:

  • Adversaries attempting to reverse-engineer proprietary models
  • Requires rate limiting, input validation, and monitoring of unusual query patterns

Data Poisoning:

  • Malicious training data designed to corrupt model behavior
  • Requires careful data validation and continuous monitoring for model drift

Adversarial Attacks:

  • Specially crafted inputs designed to fool AI models
  • Can cause misclassification in vision systems or incorrect outputs in LLMs

Security Architecture

AI Firewalls and Gateways:

  • Specialized security layers that inspect prompts and filter outputs
  • Prevent prompt injection, data exfiltration, and toxic outputs
  • Provide real-time input/output security for LLM interactions
  • Examples: Cloudflare Firewall for AI, Securiti Context-Aware LLM Firewalls, Radware LLM Firewall
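
A minimal sketch of the kind of input/output filtering such a gateway performs - the patterns below are deliberately simplistic placeholders, not production-grade detection:

```python
# Minimal sketch of an AI gateway in front of the core LLM: inspect the prompt,
# filter the output, and write an audit record. The heuristics are illustrative only.

import json
import re
import time

INJECTION_PATTERNS = [r"ignore (all|previous) instructions", r"reveal your system prompt"]
PII_PATTERN = re.compile(r"\b\d{2}-\d{7}-[A-Z]-\d{2}\b")      # illustrative ID format only


def inspect_prompt(prompt: str) -> None:
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, prompt, re.IGNORECASE):
            raise ValueError("prompt blocked: possible injection attempt")


def filter_output(text: str) -> str:
    return PII_PATTERN.sub("[REDACTED]", text)                # strip obvious identifiers


def audit(user: str, prompt: str, output: str) -> None:
    record = {"ts": time.time(), "user": user, "prompt": prompt, "output": output}
    with open("ai_audit.log", "a") as f:                      # append-only audit trail
        f.write(json.dumps(record) + "\n")


def gateway(user: str, prompt: str, call_llm) -> str:
    inspect_prompt(prompt)                                    # pre-flight checks
    output = filter_output(call_llm(prompt))                  # post-flight filtering
    audit(user, prompt, output)
    return output
```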

Zero Trust Architecture:

  • “Never trust, always verify” - Every AI interaction requires authentication
  • Strict identity-based access controls for users, devices, and AI agents
  • Least-privilege access principles applied to AI endpoints
  • Continuous monitoring and verification of all AI interactions

Network Segmentation:

  • Isolated networks for AI workloads using VPCs and subnets
  • Strict firewall rules preventing direct exposure of production models
  • Separate networks for training, inference, and development environments
  • Private connectivity to prevent data exposure

Model Governance:

  • Cryptographic signing of model artifacts to ensure integrity
  • Version control and deployment approval workflows
  • Containerized deployments with restricted privileges
  • Continuous vulnerability scanning of models and dependencies
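
As a lightweight illustration of artifact integrity checks, the sketch below pins SHA-256 digests in an approved-models manifest; full cryptographic signing (e.g. detached signatures managed by your release pipeline) follows the same verify-before-load pattern. Paths and filenames are placeholders:

```python
# Minimal sketch of model artifact integrity checking before deployment: compare the
# file's SHA-256 digest against an approved manifest maintained by the release process.

import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):  # stream: model files are large
            h.update(chunk)
    return h.hexdigest()


def verify_model(model_path: Path, manifest_path: Path) -> None:
    approved = json.loads(manifest_path.read_text())           # {"filename": "digest", ...}
    expected = approved.get(model_path.name)
    if expected is None:
        raise RuntimeError(f"{model_path.name} is not on the approved model list")
    if sha256_of(model_path) != expected:
        raise RuntimeError(f"{model_path.name} failed integrity check - refusing to load")

# Example (hypothetical filenames):
# verify_model(Path("llama-3-8b.Q4_K_M.gguf"), Path("approved_models.json"))
```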

Monitoring and Auditing:

  • Complete logging of all AI interactions, model usage, and data access
  • Real-time anomaly detection for unusual patterns or attacks
  • Compliance reporting and audit trails for regulatory requirements
  • Incident response plans tailored to AI-specific threats

Implementation Considerations

Evaluating Accelerators

Look beyond just GPU specifications. Consider these factors:

Performance Metrics:

  • TOPS (Tera Operations Per Second) - Raw compute capability, but not always indicative of real-world performance
  • Power consumption per inference - Critical for edge deployments and operational costs
  • Latency requirements - Real-time (<100ms) vs near-real-time (<1s) vs batch processing
  • Throughput - Inferences per second for your specific model and input size
  • Memory bandwidth - Important for large models and high-resolution inputs
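
A minimal benchmarking sketch for capturing throughput and latency percentiles on your own model and inputs; infer() is a placeholder for whatever runtime the candidate device uses (ONNX Runtime, TensorRT, a vendor SDK):

```python
# Minimal sketch: measure what actually matters when evaluating an accelerator -
# sustained throughput and latency percentiles for your model and input size.

import statistics
import time


def infer(sample) -> None:
    # Placeholder: run one inference on the device under test.
    time.sleep(0.004)                                  # pretend 4 ms per inference


def benchmark(samples, warmup: int = 20) -> None:
    for s in samples[:warmup]:                         # warm-up: JIT, caches, clock ramp-up
        infer(s)
    latencies = []
    start = time.perf_counter()
    for s in samples:
        t0 = time.perf_counter()
        infer(s)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    p95 = statistics.quantiles(latencies, n=20)[18]    # 95th percentile
    print(f"throughput: {len(samples)/elapsed:.1f} inf/s, "
          f"median: {statistics.median(latencies)*1e3:.1f} ms, p95: {p95*1e3:.1f} ms")


benchmark(list(range(200)))
```

Run this against the quantized model you actually intend to deploy; vendor TOPS figures rarely predict these numbers.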

Practical Considerations:

  • Integration capabilities - Does it work with your existing infrastructure? Docker/Kubernetes support? Standard APIs?
  • Model format support - ONNX, TensorFlow Lite, PyTorch, TensorRT compatibility
  • Development ecosystem - SDK quality, documentation, community support, and available pre-trained models
  • Deployment complexity - How easy is it to deploy, update, and maintain?
  • Vendor lock-in - Proprietary formats vs open standards

Total Cost of Ownership (3-5 years):

  • Hardware acquisition - Initial purchase price
  • Power and cooling - Ongoing operational costs
  • Development time - Integration and optimization effort
  • Maintenance and support - Updates, patches, and vendor support costs
  • Scalability costs - How expensive is it to add capacity?

Real-World Example: For a video surveillance system processing 100 streams:

  • GPU approach: 2x A100 GPUs (~$20,000) + server (~$5,000) ≈ $25,000 up front, drawing roughly 900W (~7,900 kWh/year)
  • Edge NPU approach: 100x Jetson Orin Nano (~$500 each) ≈ $50,000 up front, drawing 7-15W per node (~700-1,500W fleet-wide, so a similar annual energy bill)
  • Takeaway: In this scenario the edge approach doesn't pay for itself on electricity alone - its value lies in lower latency, per-site redundancy, incremental scaling, and not having to backhaul 100 raw video streams to a central server
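
To put such comparisons on a multi-year footing, a rough TCO helper like the sketch below can help; all inputs are assumptions to be replaced with your own quotes, power measurements, and electricity rates:

```python
# Rough total-cost-of-ownership helper for comparing deployment options over a
# multi-year horizon. All numbers below are illustrative assumptions.

def tco(hardware_cost: float, watts: float, years: int = 5,
        kwh_price: float = 0.20, yearly_support: float = 0.0) -> float:
    kwh_per_year = watts * 24 * 365 / 1000                # continuous operation
    return hardware_cost + years * (kwh_per_year * kwh_price + yearly_support)


# Numbers from the surveillance example above (adjust to your environment):
gpu_server = tco(hardware_cost=25_000, watts=900)
edge_fleet = tco(hardware_cost=50_000, watts=100 * 10)    # ~10W average per Jetson node
print(f"5-year TCO - GPU server: ${gpu_server:,.0f}, edge fleet: ${edge_fleet:,.0f}")
```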

Hardware Selection

When selecting hardware for your three-layer architecture, consider:

For Edge Nodes (Perception & Triage):

  • Power efficiency (5-60W total system power)
  • Integration capabilities (PCIe, USB, embedded)
  • Model format support (ONNX, TensorFlow Lite, PyTorch)
  • Deployment environment (temperature, space, network connectivity)

For Core Nodes (Transformation & Understanding):

  • GPU performance matching your model size and throughput needs
  • Memory bandwidth for large models and vector databases
  • Storage requirements for model weights
  • Redundancy and failover capabilities

For detailed hardware specifications, vendor comparisons, and alternative accelerator options (including Chinese NPUs, edge accelerators, and exotic architectures), see “Beyond NVIDIA: A Catalog of Alternative AI Accelerators”.

Industry-Specific Patterns

Finance:

  • Edge nodes - Real-time fraud detection on transaction streams, analyzing patterns before data leaves branch offices
  • Core processing - Complex risk analysis, regulatory reporting, and document analysis (contracts, loan applications)
  • Workflow integration - Automated compliance checks, KYC/AML processing, and integration with core banking systems
  • Example: Edge device at each ATM analyzing transaction patterns locally, only sending anomalies to core for deep analysis

Operations & Facilities:

  • Edge sensors - Monitoring equipment health, HVAC systems, and building security in real-time
  • Core analysis - Predictive maintenance scheduling, energy optimization, and facility management reporting
  • Workflow integration - Automated work order generation, parts ordering, and technician dispatch
  • Example: Edge NPUs on each floor monitoring temperature, occupancy, and equipment vibration, triggering alerts only when thresholds are exceeded

Manufacturing:

  • Vision systems - Production line quality control, defect detection, and component verification
  • Core optimization - Production scheduling, supply chain optimization, and demand forecasting
  • Workflow integration - Automated inventory management, reorder triggers, and ERP system updates
  • Example: Edge vision systems on each production line doing real-time quality checks, with core LLM analyzing production reports and optimizing schedules

Healthcare:

  • Edge devices - Patient monitoring, medical imaging preprocessing, and real-time alert generation
  • Core processing - Clinical decision support, medical record analysis, and research data processing
  • Workflow integration - EMR updates, appointment scheduling, and billing system integration
  • Example: Edge devices in patient rooms monitoring vital signs and alerting nurses, with core LLM analyzing patient histories for treatment recommendations

Retail:

  • Edge systems - In-store analytics, inventory tracking, and customer behavior analysis
  • Core processing - Demand forecasting, pricing optimization, and supply chain management
  • Workflow integration - POS system integration, inventory management, and marketing automation
  • Example: Edge cameras analyzing foot traffic and product interactions in real-time, with core LLM optimizing inventory and pricing strategies

Getting Started

If your AI plan today is just “GPU + LLM”, you don’t have an architecture yet. Start by:

1. Audit Your Current AI Usage

  • Shadow IT discovery - Where are employees using external tools (ChatGPT, Claude, etc.)?
  • Task analysis - What specific tasks are they trying to automate?
  • Data flow mapping - What sensitive data is leaving your perimeter?
  • Cost analysis - What are you spending on external AI services?
  • Risk assessment - What compliance and security risks exist?

2. Map Workloads to Categories

Classify your AI needs into the three categories:

  • Perception & Triage - Real-time monitoring, filtering, routing
  • Transformation & Understanding - Document analysis, summarization, content generation
  • Decision & Automation - Approvals, escalations, workflow triggers

For each workload, document:

  • Latency requirements (real-time vs batch)
  • Data sensitivity and compliance needs
  • Current solution (if any) and its limitations
  • Expected volume and growth projections
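
One lightweight way to capture this inventory is in code rather than slides; the field names below are suggestions, not a standard:

```python
# Minimal sketch of a workload inventory entry for the mapping exercise above.

from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    PERCEPTION_TRIAGE = "perception_triage"
    TRANSFORMATION = "transformation_understanding"
    DECISION_AUTOMATION = "decision_automation"


@dataclass
class Workload:
    name: str
    category: Category
    latency_requirement: str      # e.g. "real-time (<100ms)" or "batch (overnight)"
    data_sensitivity: str         # e.g. "financial PII", "internal", "public"
    current_solution: str | None  # None if nothing exists yet
    daily_volume: int
    expected_growth: str          # e.g. "2x within 12 months"


inventory = [
    Workload("invoice field extraction", Category.TRANSFORMATION,
             "batch (hourly)", "financial PII", "manual entry", 10_000, "flat"),
    Workload("support ticket routing", Category.PERCEPTION_TRIAGE,
             "near-real-time (<1s)", "internal", None, 3_500, "+30%/year"),
]
```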

3. Design the Three-Layer Architecture

Plan your components:

Core Node:

  • GPU selection based on model size and throughput needs
  • Storage requirements for model weights and vector databases
  • Network bandwidth for serving multiple edge nodes
  • Redundancy and failover strategies

Edge Nodes:

  • Hardware selection based on power, performance, and integration needs
  • Deployment locations (branch offices, manufacturing floors, etc.)
  • Network connectivity and bandwidth requirements
  • Management and update mechanisms

Workflow Layer:

  • Integration points with existing systems (APIs, databases, message queues)
  • Rules engine selection and business logic implementation
  • Security and compliance tooling (AI firewalls, audit logging)
  • Monitoring and observability infrastructure

4. Start Small

Pilot with one workload category before expanding:

  • Choose a low-risk, high-value use case - Something that provides clear ROI but won’t disrupt critical operations
  • Prove the architecture - Validate that the three-layer approach works for your environment
  • Measure everything - Performance, costs, power consumption, user satisfaction
  • Iterate based on learnings - Refine the architecture before scaling

5. Measure and Iterate

Track key metrics:

  • Performance - Latency, throughput, accuracy
  • Costs - Hardware, power, development, maintenance
  • Compliance - Audit trail completeness, data residency, security incidents
  • Business impact - Time saved, errors reduced, revenue impact

Establish a feedback loop:

  • Regular reviews of architecture decisions
  • Cost optimization opportunities
  • Technology updates and new accelerator options
  • Evolving business requirements

Common Pitfalls to Avoid

  • Over-engineering the core - Don’t buy the biggest GPU “just in case”
  • Under-estimating edge complexity - Edge nodes need management, updates, and monitoring too
  • Ignoring the workflow layer - AI without integration is just a demo
  • Skipping security - AI-specific threats require AI-specific defenses
  • Forgetting compliance - Design for auditability from day one

The Bottom Line

The goal isn’t to eliminate external AI tools through prohibition - it’s to provide internal alternatives that are faster, cheaper, and more secure than the shadow IT that inevitably emerges when employees can’t get their work done.

Remember: Your employees will use AI one way or another. The question is whether you’ll provide them with tools that keep your data secure and your compliance team happy, or if you’ll force them to work around you with inferior solutions.

Key Takeaways

  1. Not all AI workloads need GPUs - Most perception and decision tasks are better served by dedicated edge accelerators

  2. Architecture matters more than hardware - A well-designed three-layer system outperforms expensive hardware in a poorly architected setup

  3. Power efficiency has real costs - 90% power reduction for edge workloads translates to significant operational savings

  4. Security requires AI-specific tools - Traditional firewalls don’t address prompt injection, model extraction, or other AI-specific threats

  5. Start small, measure everything - Pilot with one workload category, prove the architecture, then scale based on learnings

  6. Compliance is a feature, not a burden - Proper architecture makes compliance easier, not harder

The future of enterprise AI isn’t about buying the biggest GPU - it’s about building the right architecture for your specific needs, with the right hardware in the right places, and the right security and compliance controls throughout.
