AI Hardware Capacity Framework
Overview
This framework translates technical AI hardware specifications into business-relevant capacity metrics. Instead of focusing on TOPS, tokens/second, or other technical measures, we express capabilities in terms of people served and business processes supported.
Core Philosophy: Coarse buckets over false precision. Size for sustainable workloads, not peak theoretical performance.
Business Workload Categories
We define 4 generic workload categories that cover most enterprise AI use cases:
1. Internal Q&A + Drafting (Knowledge Workers)
- Who uses it: Support teams, operations, finance, HR, legal
- Unit: Active concurrent users
- Sustainability note: Based on interactive usage patterns (questions, drafting, research)
2. Document Processing
- What it does: OCR, classification, field extraction, summarization
- Unit: Documents per day (sustained processing)
- Examples: Invoices, contracts, reports, forms
3. Tickets/Emails Triage
- What it does: Classification, routing, priority assignment, basic responses
- Unit: Tickets/emails per day
- Examples: Support tickets, customer emails, internal requests
4. Vision/Monitoring (Optional)
- What it does: Real-time analysis, anomaly detection, alerts
- Unit: Camera streams or events per second
- Examples: Security feeds, quality control, equipment monitoring
Capacity Buckets
We use coarse, business-relevant buckets rather than precise numbers (a small lookup sketch follows these buckets):
User Scale
- Small team: ≤ 20 active users
- Medium team: 21-50 active users
- Large team: more than 50 active users
Document Processing Scale
- Basic AI analysis: 500 docs/day (extraction + summarization on workstation-class hardware)
- Full AI analysis: 2,000 docs/day (extraction + summarization at department scale)
- High-volume triage: 10,000+ docs/day (classification only)
Ticket/Email Scale
- Small operation: 1,000 tickets/day
- Medium operation: 5,000 tickets/day
- Large operation: 20,000+ tickets/day
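These thresholds are coarse enough to encode as a simple lookup. A minimal Python sketch (the function names and boundary handling are illustrative; the numbers simply restate the buckets above):

```python
def user_bucket(active_users: int) -> str:
    """Map an active-user count to the coarse user-scale buckets above."""
    if active_users <= 20:
        return "Small team"
    if active_users <= 50:
        return "Medium team"
    return "Large team"

def ticket_bucket(tickets_per_day: int) -> str:
    """Map daily ticket/email volume to the operation-scale buckets above."""
    if tickets_per_day <= 1_000:
        return "Small operation"
    if tickets_per_day <= 5_000:
        return "Medium operation"
    return "Large operation"

print(user_bucket(35))        # Medium team
print(ticket_bucket(12_000))  # Large operation
```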
Reporting Spike Benchmark
Secondary metric for batch processing scenarios (quarter-end, audits, reporting):
- Test case: 4,000 mixed documents end-to-end
- SLA targets:
- Fast: ≤ 4 hours (same-day processing)
- Overnight: ≤ 12 hours (weekend processing)
- Batch: > 12 hours (asynchronous processing)
This represents “what happens when someone dumps a backlog on Friday at 5 PM.”
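As a rough worked example of how sustained throughput maps to these SLA tiers (a sketch only; the rates below are illustrative, and real pipelines add I/O, preprocessing, and queueing overhead):

```python
# How long a 4,000-document spike takes at a given sustained processing rate,
# and which SLA tier that lands in. Rates shown are illustrative examples.
SPIKE_DOCS = 4_000

def spike_hours(docs_per_hour: float) -> float:
    return SPIKE_DOCS / docs_per_hour

for rate in (250, 500, 1_000):
    hours = spike_hours(rate)
    sla = "Fast" if hours <= 4 else "Overnight" if hours <= 12 else "Batch"
    print(f"{rate:>5} docs/hr -> {hours:4.1f} h ({sla})")
```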
Hardware Capacity Table
| Hardware Tier | Users | Docs/Day | Tickets/Day | 4k-Doc Spike Time | Notes |
|---|---|---|---|---|---|
| Strix Halo (AMD 395) | Medium (20-50) | 2,000 AI | 5,000 | 4-6 hrs | Dept server |
| HX370 (AMD 370) | Small (≤20) | 500 AI | 1k-2k | 8-12 hrs | Workstation |
| Hailo-8 (M.2) | N/A | 10k+ triage | 20k+ | 1-2 hrs | Classifier |
| Intel NUC (14 Pro) | Small (≤15) | 300 AI | 800-1.5k | 12-18 hrs | Copilot+ |
| Orange Pi (Ascend) | N/A | 15k+ triage | 30k+ | 45 min | NPU pod |
Model Recommendations
| Hardware Tier | Models | Use Cases | Performance |
|---|---|---|---|
| Strix Halo | GPT-OSS 20B/120B, Qwen 14B/32B, Llama 13B/30B | Full LLM, RAG | 30-66 t/s |
| HX370 | GPT-OSS 7B/13B, Qwen 7B/14B, Llama 7B/13B | RAG, summary | 40-80 t/s |
| Hailo-8 | YOLOv8/9, EfficientNet, MobileNet | Vision, detection | 1000+ inf/s |
| Intel NUC | Phi-3/3.5, Gemma 2B/7B, Copilot+ | Windows, Copilot | 50-100 t/s |
| Orange Pi | DeepSeek-R1, Ascend models | Inference tasks | Ascend opt. |
Workflow-Specific Model Recommendations
1. Q&A + Knowledge Worker Support
Model Types: Conversational LLMs with strong reasoning
- Large models (20B+ parameters): GPT-OSS 120B, Qwen 32B, Llama 30B
- Medium models (7-14B parameters): GPT-OSS 13B, Qwen 14B, Llama 13B
- Small models (1-3B parameters): Phi-3, Gemma 2B, Qwen 1.5B
Key Requirements: Context understanding, multi-turn conversation, tool use
Performance Target: < 5 seconds response time for interactive use
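A back-of-the-envelope check of the < 5 second target against model throughput (a sketch; the output-token counts are illustrative assumptions, and prompt processing / time-to-first-token are ignored):

```python
# Rough interactive latency: output tokens divided by sustained decode speed.
# Illustrative assumption: a typical answer runs ~150-300 output tokens.
def response_seconds(output_tokens: int, tokens_per_second: float) -> float:
    return output_tokens / tokens_per_second

print(response_seconds(200, 50))  # ~4.0 s: within the 5 s interactive target
print(response_seconds(200, 30))  # ~6.7 s: too slow for snappy Q&A
```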
2. Document Processing & Analysis
Model Types: Specialized for text extraction and summarization
- OCR + Layout models: TrOCR, LayoutParser, Donut
- Field extraction: Fine-tuned BERT/RoBERTa variants
- Summarization: Pegasus, BART, T5-based models
- Classification: EfficientNet for document type detection
Key Requirements: Accuracy over speed, handles complex layouts
Performance Target: < 30 seconds per document for interactive analysis
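At the 30-second-per-document target, sustained daily capacity can be bounded quickly (a sketch; the worker count and utilization factor are illustrative assumptions):

```python
# Daily document capacity at a given per-document latency.
# Illustrative assumptions: parallel workers and a utilization factor that
# covers I/O, retries, and queue gaps.
def docs_per_day(seconds_per_doc: float, workers: int = 1, utilization: float = 0.7) -> int:
    return int(workers * utilization * 24 * 3600 / seconds_per_doc)

print(docs_per_day(30))             # ~2,016 docs/day on one worker
print(docs_per_day(30, workers=4))  # ~8,064 docs/day with four parallel workers
```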
3. Ticket/Email Triage & Routing
Model Types: Fast classification and lightweight LLMs
- Intent classification: DistilBERT, MobileBERT, TinyLLaMA
- Priority scoring: Rule-based + ML classifiers
- Auto-response: Template-based with small LLMs
- Routing: Multi-label classification models
Key Requirements: High throughput, low latency, high accuracy on routing
Performance Target: < 1 second per ticket, 1,000+ tickets/hour
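A minimal routing sketch using an off-the-shelf zero-shot classifier (illustrative only; the Hugging Face model name and routing labels are assumptions, and a production deployment would typically use a small fine-tuned DistilBERT-class model for speed):

```python
# Minimal ticket-routing sketch with a zero-shot classifier.
# Assumptions: model name and routing labels below are illustrative only.
from transformers import pipeline

router = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

ROUTES = ["billing", "technical support", "account access", "general inquiry"]

ticket = "I was charged twice for my subscription this month."
result = router(ticket, candidate_labels=ROUTES)
print(result["labels"][0], round(result["scores"][0], 2))  # top route and its confidence
```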
4. Vision Monitoring & Surveillance
Model Types: Real-time computer vision models
- Object detection: YOLOv8/9/10, RT-DETR
- Anomaly detection: Autoencoders, variational models
- Classification: EfficientNet, MobileNet, RegNet
- Tracking: DeepSORT, ByteTrack with ReID
Key Requirements: Real-time processing, low power consumption
Performance Target: 30+ FPS per stream, < 100ms latency
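A minimal per-frame detection sketch with a YOLOv8 model via the ultralytics package (illustrative; a Hailo-8 deployment would compile the same model family through Hailo's own toolchain and runtime rather than running it this way):

```python
# Minimal per-frame detection sketch using a YOLOv8 model (ultralytics package).
# Assumption: runs on a generic host for illustration; the frame path is a placeholder.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # nano variant keeps per-frame latency low

results = model("frame.jpg")  # single image or extracted video frame
for box in results[0].boxes:
    print(int(box.cls), float(box.conf))  # class id and confidence per detection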
Model Specifications by Hardware Tier
| Hardware | Model | Params | Memory | Perf | Power | Use Case |
|---|---|---|---|---|---|---|
| Strix Halo | GPT-OSS 120B | 120B | 256GB | 30-45 t/s | 140W | Complex RAG |
| Strix Halo | Qwen 32B | 32B | 64GB | 50-70 t/s | 140W | Conversations |
| Strix Halo | Llama 30B | 30B | 64GB | 40-60 t/s | 140W | Doc analysis |
| Strix Halo | GPT-OSS 20B | 20B | 48GB | 60-80 t/s | 140W | Balanced |
| HX370 | GPT-OSS 13B | 13B | 32GB | 40-60 t/s | 60W | Office RAG |
| HX370 | Qwen 14B | 14B | 32GB | 45-65 t/s | 60W | Knowledge work |
| HX370 | Llama 13B | 13B | 32GB | 50-70 t/s | 60W | Summarization |
| HX370 | GPT-OSS 7B | 7B | 16GB | 70-100 t/s | 60W | Light assist |
| Hailo-8 | YOLOv8-Large | 43.7M | 2GB | 1000+ inf/s | 15W | Detection |
| Hailo-8 | EfficientNet-B7 | 66M | 1GB | 500+ inf/s | 15W | Classification |
| Hailo-8 | MobileNetV3-Large | 5.4M | 512MB | 2000+ inf/s | 15W | Fast triage |
| Intel NUC | Phi-3.5 Mini | 3.8B | 8GB | 50-80 t/s | 45W | Windows int |
| Intel NUC | Gemma 7B | 7B | 16GB | 40-60 t/s | 45W | Copilot |
| Intel NUC | Phi-3 Medium | 14B | 32GB | 30-50 t/s | 45W | Advanced |
| Orange Pi | DeepSeek-R1 | 67B | 128GB | 20-40 t/s | 25W | Inference opt |
| Orange Pi | Ascend-LLM 32B | 32B | 64GB | 30-50 t/s | 25W | Specialized |
| Orange Pi | Ascend-Vision | 1.2B | 4GB | 500+ inf/s | 25W | Vision proc |
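The memory column above roughly follows parameter count times bytes per parameter at the stated precision; a quick estimator (a sketch; quantized deployments and KV-cache headroom will shift the numbers):

```python
# Rough weights-only memory estimate: parameters x bytes per parameter.
# KV cache, activations, and runtime overhead need additional headroom.
def model_weights_gb(params_billion: float, bits_per_param: int = 16) -> float:
    return params_billion * bits_per_param / 8

print(model_weights_gb(32))      # 64.0 GB at FP16, matching the Qwen 32B row above
print(model_weights_gb(32, 4))   # 16.0 GB with 4-bit quantization
```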
Usage Guidelines
For Sales/Sizing Conversations
- Start with business needs: “How many people need AI assistance? How many documents do you process daily?”
- Map to buckets: Use the coarse categories above (a tier-lookup sketch follows this list)
- Show hardware options: Present 2-3 viable tiers
- Discuss spikes separately: “For reporting periods, we can handle 4k docs in X hours”
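For the "map to buckets" step, a toy lookup that mirrors the Hardware Capacity Table can keep sizing conversations consistent (a sketch; the function name and thresholds simply restate the table and are not a substitute for a real sizing exercise):

```python
# Toy sizing helper mirroring the Hardware Capacity Table.
# Thresholds restate the table's sustained-load figures; real sizing should
# also weigh the spike benchmark, deployment constraints, and growth.
def suggest_tier(active_users: int, ai_docs_per_day: int) -> str:
    if active_users <= 15 and ai_docs_per_day <= 300:
        return "Intel NUC (14 Pro)"
    if active_users <= 20 and ai_docs_per_day <= 500:
        return "HX370 (AMD 370)"
    if active_users <= 50 and ai_docs_per_day <= 2_000:
        return "Strix Halo (AMD 395)"
    return "Multiple units or a hybrid design (e.g. Strix Halo + Hailo-8 triage)"

print(suggest_tier(30, 1_500))  # Strix Halo (AMD 395)
```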
For Architecture Decisions
- Design for sustained load: Use the daily capacity numbers
- Plan for spikes: Ensure reporting benchmarks meet business SLAs
- Avoid over-engineering: Don’t size for theoretical peaks
- Consider hybrid approaches: Combine different hardware types for different workloads
For Client Presentations
- Use business language: “This box can handle your customer support team’s questions plus their ticket backlog”
- Show real scenarios: “If finance dumps 4k invoices on Friday, it’s processed by Monday morning”
- Avoid technical jargon: No TOPS, tokens/sec, or model names unless asked
- Focus on outcomes: “Faster response times, reduced manual work, better compliance”
Validation Principles
Sanity Checks
- Token math: 1 doc ≈ 1k tokens total (text + prompts + instructions); a worked check follows this list
- Interactive latency: < 5 seconds for Q&A, < 30 seconds for complex tasks
- Batch efficiency: 4k docs should complete within business-relevant timeframes
- Real-world factors: Account for I/O, preprocessing, postprocessing, and network latency
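A worked version of the token math (a sketch; the utilization factor is an illustrative assumption covering I/O, preprocessing, and idle gaps):

```python
# Token-math sanity check: convert a sustained generation rate into daily
# document capacity using the ~1k tokens/doc rule of thumb above.
TOKENS_PER_DOC = 1_000

def daily_docs(tokens_per_second: float, utilization: float = 0.6) -> int:
    return int(tokens_per_second * utilization * 86_400 / TOKENS_PER_DOC)

print(daily_docs(40))  # ~2,073 docs/day - in line with the departmental bucket
print(daily_docs(60))  # ~3,110 docs/day
```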
When to Revisit
- New hardware releases: Update benchmarks quarterly
- Client feedback: Adjust buckets based on real deployment experiences
- Technology changes: Re-evaluate when new models or optimization techniques emerge
- Market shifts: Update when business needs change significantly
Implementation Notes
Data Sources
- Benchmarks: Based on real LM Studio testing with GPT-OSS, Qwen, and Llama models
- Workload assumptions: Interactive usage patterns, not continuous batch processing
- Hardware specs: Vendor claims validated with independent testing where possible
Limitations
- Model-dependent: Performance varies by model size and quantization
- Software stack: Results depend on optimization (ROCm, Vulkan, DirectML)
- Real-world variance: Network, storage, and concurrent workloads affect performance
- Future-proofing: Framework designed to accommodate new hardware categories
Last updated: December 2025
Framework version: 1.2