AI Hardware Capacity Framework
Overview
This framework translates technical AI hardware specifications into business-relevant capacity metrics. Instead of focusing on TOPS, tokens/second, or other technical measures, we express capabilities in terms of people served and business processes supported.
Core Philosophy: Coarse buckets over false precision. Size for sustainable workloads, not peak theoretical performance.
Business Workload Categories
We define 4 generic workload categories that cover most enterprise AI use cases:
1. Internal Q&A + Drafting (Knowledge Workers)
- Who uses it: Support teams, operations, finance, HR, legal
- Unit: Active concurrent users
- Sustainability note: Based on interactive usage patterns (questions, drafting, research)
2. Document Processing
- What it does: OCR, classification, field extraction, summarization
- Unit: Documents per day (sustained processing)
- Examples: Invoices, contracts, reports, forms
3. Tickets/Emails Triage
- What it does: Classification, routing, priority assignment, basic responses
- Unit: Tickets/emails per day
- Examples: Support tickets, customer emails, internal requests
4. Vision/Monitoring (Optional)
- What it does: Real-time analysis, anomaly detection, alerts
- Unit: Camera streams or events per second
- Examples: Security feeds, quality control, equipment monitoring
Capacity Buckets
We use coarse, business-relevant buckets rather than precise numbers (a small lookup sketch follows these buckets):
User Scale
- Small team: ≤ 20 active users
- Medium team: 21-50 active users
- Large team: more than 50 active users
Document Processing Scale
- Basic AI analysis: 500 docs/day (extraction + summarization on workstation-class hardware)
- Full AI analysis: 2,000 docs/day (extraction + summarization at department scale)
- High-volume triage: 10,000+ docs/day (classification only)
Ticket/Email Scale
- Small operation: 1,000 tickets/day
- Medium operation: 5,000 tickets/day
- Large operation: 20,000+ tickets/day
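These thresholds are coarse enough to encode as a simple lookup. A minimal Python sketch (the function names and boundary handling are illustrative; the numbers simply restate the buckets above):

```python
def user_bucket(active_users: int) -> str:
    """Map an active-user count to the coarse user-scale buckets above."""
    if active_users <= 20:
        return "Small team"
    if active_users <= 50:
        return "Medium team"
    return "Large team"

def ticket_bucket(tickets_per_day: int) -> str:
    """Map daily ticket/email volume to the operation-scale buckets above."""
    if tickets_per_day <= 1_000:
        return "Small operation"
    if tickets_per_day <= 5_000:
        return "Medium operation"
    return "Large operation"

print(user_bucket(35))        # Medium team
print(ticket_bucket(12_000))  # Large operation
```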
Reporting Spike Benchmark
Secondary metric for batch processing scenarios (quarter-end, audits, reporting):
- Test case: 4,000 mixed documents end-to-end
- SLA targets:
- Fast: ≤ 4 hours (same-day processing)
- Overnight: ≤ 12 hours (weekend processing)
- Batch: > 12 hours (asynchronous processing)
This represents “what happens when someone dumps a backlog on Friday at 5 PM.”
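As a rough worked example of how sustained throughput maps to these SLA tiers (a sketch only; the rates below are illustrative, and real pipelines add I/O, preprocessing, and queueing overhead):

```python
# How long a 4,000-document spike takes at a given sustained processing rate,
# and which SLA tier that lands in. Rates shown are illustrative examples.
SPIKE_DOCS = 4_000

def spike_hours(docs_per_hour: float) -> float:
    return SPIKE_DOCS / docs_per_hour

for rate in (250, 500, 1_000):
    hours = spike_hours(rate)
    sla = "Fast" if hours <= 4 else "Overnight" if hours <= 12 else "Batch"
    print(f"{rate:>5} docs/hr -> {hours:4.1f} h ({sla})")
```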
Hardware Capacity Table
| Hardware Tier | Users | Docs/Day | Tickets/Day | 4k-Doc Spike Time | Notes |
|---|---|---|---|---|---|
| Strix Halo (AMD 395) | Medium (20-50) | 2,000 AI | 5,000 | 4-6 hrs | Dept server |
| HX370 (AMD 370) | Small (≤20) | 500 AI | 1k-2k | 8-12 hrs | Workstation |
| Hailo-8 (M.2) | N/A | 10k+ triage | 20k+ | 1-2 hrs | Classifier |
| Intel NUC (14 Pro) | Small (≤15) | 300 AI | 800-1.5k | 12-18 hrs | Copilot+ |
| Orange Pi (Ascend) | N/A | 15k+ triage | 30k+ | 45 min | NPU pod |
Model Recommendations
| Hardware Tier | Models | Use Cases | Performance |
|---|---|---|---|
| Strix Halo | GPT-OSS 20B/120B, Qwen 14B/32B, Llama 13B/30B | Full LLM, RAG | 30-66 t/s |
| HX370 | GPT-OSS 7B/13B, Qwen 7B/14B, Llama 7B/13B | RAG, summary | 40-80 t/s |
| Hailo-8 | YOLOv8/9, EfficientNet, MobileNet | Vision, detection | 1000+ inf/s |
| Intel NUC | Phi-3/3.5, Gemma 2B/7B, Copilot+ | Windows, Copilot | 50-100 t/s |
| Orange Pi | DeepSeek-R1, Ascend models | Inference tasks | Ascend opt. |
Workflow-Specific Model Recommendations
1. Q&A + Knowledge Worker Support
Model Types: Conversational LLMs with strong reasoning
- Large models (20B+ parameters): GPT-OSS 120B, Qwen 32B, Llama 30B
- Medium models (7-14B parameters): GPT-OSS 13B, Qwen 14B, Llama 13B
- Small models (1-3B parameters): Phi-3, Gemma 2B, Qwen 1.5B
Key Requirements: Context understanding, multi-turn conversation, tool use
Performance Target: < 5 seconds response time for interactive use
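A back-of-the-envelope check of the < 5 second target against model throughput (a sketch; the output-token counts are illustrative assumptions, and prompt processing / time-to-first-token are ignored):

```python
# Rough interactive latency: output tokens divided by sustained decode speed.
# Illustrative assumption: a typical answer runs ~150-300 output tokens.
def response_seconds(output_tokens: int, tokens_per_second: float) -> float:
    return output_tokens / tokens_per_second

print(response_seconds(200, 50))  # ~4.0 s: within the 5 s interactive target
print(response_seconds(200, 30))  # ~6.7 s: too slow for snappy Q&A
```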
2. Document Processing & Analysis
Model Types: Specialized for text extraction and summarization
- OCR + Layout models: TrOCR, LayoutParser, Donut
- Field extraction: Fine-tuned BERT/RoBERTa variants
- Summarization: Pegasus, BART, T5-based models
- Classification: EfficientNet for document type detection
Key Requirements: Accuracy over speed, handles complex layouts
Performance Target: < 30 seconds per document for interactive analysis
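At the 30-second-per-document target, sustained daily capacity can be bounded quickly (a sketch; the worker count and utilization factor are illustrative assumptions):

```python
# Daily document capacity at a given per-document latency.
# Illustrative assumptions: parallel workers and a utilization factor that
# covers I/O, retries, and queue gaps.
def docs_per_day(seconds_per_doc: float, workers: int = 1, utilization: float = 0.7) -> int:
    return int(workers * utilization * 24 * 3600 / seconds_per_doc)

print(docs_per_day(30))             # ~2,016 docs/day on one worker
print(docs_per_day(30, workers=4))  # ~8,064 docs/day with four parallel workers
```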
3. Ticket/Email Triage & Routing
Model Types: Fast classification and lightweight LLMs
- Intent classification: DistilBERT, MobileBERT, TinyLLaMA
- Priority scoring: Rule-based + ML classifiers
- Auto-response: Template-based with small LLMs
- Routing: Multi-label classification models
Key Requirements: High throughput, low latency, high accuracy on routing
Performance Target: < 1 second per ticket, 1,000+ tickets/hour
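A minimal routing sketch using an off-the-shelf zero-shot classifier (illustrative only; the Hugging Face model name and routing labels are assumptions, and a production deployment would typically use a small fine-tuned DistilBERT-class model for speed):

```python
# Minimal ticket-routing sketch with a zero-shot classifier.
# Assumptions: model name and routing labels below are illustrative only.
from transformers import pipeline

router = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

ROUTES = ["billing", "technical support", "account access", "general inquiry"]

ticket = "I was charged twice for my subscription this month."
result = router(ticket, candidate_labels=ROUTES)
print(result["labels"][0], round(result["scores"][0], 2))  # top route and its confidence
```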
4. Vision Monitoring & Surveillance
Model Types: Real-time computer vision models
- Object detection: YOLOv8/9/10, RT-DETR
- Anomaly detection: Autoencoders, variational models
- Classification: EfficientNet, MobileNet, RegNet
- Tracking: DeepSORT, ByteTrack with ReID
Key Requirements: Real-time processing, low power consumption
Performance Target: 30+ FPS per stream, < 100ms latency
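A minimal per-frame detection sketch with a YOLOv8 model via the ultralytics package (illustrative; a Hailo-8 deployment would compile the same model family through Hailo's own toolchain and runtime rather than running it this way):

```python
# Minimal per-frame detection sketch using a YOLOv8 model (ultralytics package).
# Assumption: runs on a generic host for illustration; the frame path is a placeholder.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # nano variant keeps per-frame latency low

results = model("frame.jpg")  # single image or extracted video frame
for box in results[0].boxes:
    print(int(box.cls), float(box.conf))  # class id and confidence per detection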
Model Specifications by Hardware Tier
| Hardware | Model | Params | Memory | Perf | Power | Use Case |
|---|---|---|---|---|---|---|
| Strix Halo | GPT-OSS 120B | 120B | 256GB | 30-45 t/s | 140W | Complex RAG |
| Strix Halo | Qwen 32B | 32B | 64GB | 50-70 t/s | 140W | Conversations |
| Strix Halo | Llama 30B | 30B | 64GB | 40-60 t/s | 140W | Doc analysis |
| Strix Halo | GPT-OSS 20B | 20B | 48GB | 60-80 t/s | 140W | Balanced |
| HX370 | GPT-OSS 13B | 13B | 32GB | 40-60 t/s | 60W | Office RAG |
| HX370 | Qwen 14B | 14B | 32GB | 45-65 t/s | 60W | Knowledge work |
| HX370 | Llama 13B | 13B | 32GB | 50-70 t/s | 60W | Summarization |
| HX370 | GPT-OSS 7B | 7B | 16GB | 70-100 t/s | 60W | Light assist |
| Hailo-8 | YOLOv8-Large | 43.7M | 2GB | 1000+ inf/s | 15W | Detection |
| Hailo-8 | EfficientNet-B7 | 66M | 1GB | 500+ inf/s | 15W | Classification |
| Hailo-8 | MobileNetV3-Large | 5.4M | 512MB | 2000+ inf/s | 15W | Fast triage |
| Intel NUC | Phi-3.5 Mini | 3.8B | 8GB | 50-80 t/s | 45W | Windows int |
| Intel NUC | Gemma 7B | 7B | 16GB | 40-60 t/s | 45W | Copilot |
| Intel NUC | Phi-3 Medium | 14B | 32GB | 30-50 t/s | 45W | Advanced |
| Orange Pi | DeepSeek-R1 | 67B | 128GB | 20-40 t/s | 25W | Inference opt |
| Orange Pi | Ascend-LLM 32B | 32B | 64GB | 30-50 t/s | 25W | Specialized |
| Orange Pi | Ascend-Vision | 1.2B | 4GB | 500+ inf/s | 25W | Vision proc |
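The memory column above roughly follows parameter count times bytes per parameter at the stated precision; a quick estimator (a sketch; quantized deployments and KV-cache headroom will shift the numbers):

```python
# Rough weights-only memory estimate: parameters x bytes per parameter.
# KV cache, activations, and runtime overhead need additional headroom.
def model_weights_gb(params_billion: float, bits_per_param: int = 16) -> float:
    return params_billion * bits_per_param / 8

print(model_weights_gb(32))      # 64.0 GB at FP16, matching the Qwen 32B row above
print(model_weights_gb(32, 4))   # 16.0 GB with 4-bit quantization
```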
Usage Guidelines
For Sales/Sizing Conversations
- Start with business needs: “How many people need AI assistance? How many documents do you process daily?”
- Map to buckets: Use the coarse categories above (a tier-lookup sketch follows this list)
- Show hardware options: Present 2-3 viable tiers
- Discuss spikes separately: “For reporting periods, we can handle 4k docs in X hours”
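For the "map to buckets" step, a toy lookup that mirrors the Hardware Capacity Table can keep sizing conversations consistent (a sketch; the function name and thresholds simply restate the table and are not a substitute for a real sizing exercise):

```python
# Toy sizing helper mirroring the Hardware Capacity Table.
# Thresholds restate the table's sustained-load figures; real sizing should
# also weigh the spike benchmark, deployment constraints, and growth.
def suggest_tier(active_users: int, ai_docs_per_day: int) -> str:
    if active_users <= 15 and ai_docs_per_day <= 300:
        return "Intel NUC (14 Pro)"
    if active_users <= 20 and ai_docs_per_day <= 500:
        return "HX370 (AMD 370)"
    if active_users <= 50 and ai_docs_per_day <= 2_000:
        return "Strix Halo (AMD 395)"
    return "Multiple units or a hybrid design (e.g. Strix Halo + Hailo-8 triage)"

print(suggest_tier(30, 1_500))  # Strix Halo (AMD 395)
```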
For Architecture Decisions
- Design for sustained load: Use the daily capacity numbers
- Plan for spikes: Ensure reporting benchmarks meet business SLAs
- Avoid over-engineering: Don’t size for theoretical peaks
- Consider hybrid approaches: Combine different hardware types for different workloads
For Client Presentations
- Use business language: “This box can handle your customer support team’s questions plus their ticket backlog”
- Show real scenarios: “If finance dumps 4k invoices on Friday, it’s processed by Monday morning”
- Avoid technical jargon: No TOPS, tokens/sec, or model names unless asked
- Focus on outcomes: “Faster response times, reduced manual work, better compliance”
Validation Principles
Sanity Checks
- Token math: 1 doc ≈ 1k tokens total (text + prompts + instructions); a worked check follows this list
- Interactive latency: < 5 seconds for Q&A, < 30 seconds for complex tasks
- Batch efficiency: 4k docs should complete within business-relevant timeframes
- Real-world factors: Account for I/O, preprocessing, postprocessing, and network latency
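A worked version of the token math (a sketch; the utilization factor is an illustrative assumption covering I/O, preprocessing, and idle gaps):

```python
# Token-math sanity check: convert a sustained generation rate into daily
# document capacity using the ~1k tokens/doc rule of thumb above.
TOKENS_PER_DOC = 1_000

def daily_docs(tokens_per_second: float, utilization: float = 0.6) -> int:
    return int(tokens_per_second * utilization * 86_400 / TOKENS_PER_DOC)

print(daily_docs(40))  # ~2,073 docs/day - in line with the departmental bucket
print(daily_docs(60))  # ~3,110 docs/day
```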
When to Revisit
- New hardware releases: Update benchmarks quarterly
- Client feedback: Adjust buckets based on real deployment experiences
- Technology changes: Re-evaluate when new models or optimization techniques emerge
- Market shifts: Update when business needs change significantly
Implementation Notes
Data Sources
- Benchmarks: Based on real LM Studio testing with GPT-OSS, Qwen, and Llama models
- Workload assumptions: Interactive usage patterns, not continuous batch processing
- Hardware specs: Vendor claims validated with independent testing where possible
Limitations
- Model-dependent: Performance varies by model size and quantization
- Software stack: Results depend on optimization (ROCm, Vulkan, DirectML)
- Real-world variance: Network, storage, and concurrent workloads affect performance
- Future-proofing: Framework designed to accommodate new hardware categories
Last updated: December 2025
Framework version: 1.2