
AI Hardware Capacity Framework

Overview

This framework translates technical AI hardware specifications into business-relevant capacity metrics. Instead of focusing on TOPS, tokens/second, or other technical measures, we express capabilities in terms of people served and business processes supported.

Core Philosophy: Coarse buckets over fake precision. Focus on sustainable workloads, not peak theoretical performance.

Business Workload Categories

We define four generic workload categories that cover most enterprise AI use cases (a minimal data sketch follows the four descriptions below):

1. Internal Q&A + Drafting (Knowledge Workers)

  • Who uses it: Support teams, operations, finance, HR, legal
  • Unit: Active concurrent users
  • Sustainability note: Based on interactive usage patterns (questions, drafting, research)

2. Document Processing

  • What it does: OCR, classification, field extraction, summarization
  • Unit: Documents per day (sustained processing)
  • Examples: Invoices, contracts, reports, forms

3. Tickets/Emails Triage

  • What it does: Classification, routing, priority assignment, basic responses
  • Unit: Tickets/emails per day
  • Examples: Support tickets, customer emails, internal requests

4. Vision/Monitoring (Optional)

  • What it does: Real-time analysis, anomaly detection, alerts
  • Unit: Camera streams or events per second
  • Examples: Security feeds, quality control, equipment monitoring
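
For teams that script their sizing work, these categories can be captured as plain data. A minimal Python sketch; the `WorkloadCategory` structure and all names are illustrative, not part of any product API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WorkloadCategory:
    name: str  # business-facing label from the list above
    unit: str  # the capacity unit used throughout this framework

# The four generic categories defined above, as machine-readable data.
CATEGORIES = (
    WorkloadCategory("Internal Q&A + Drafting", "active concurrent users"),
    WorkloadCategory("Document Processing", "documents per day"),
    WorkloadCategory("Tickets/Emails Triage", "tickets/emails per day"),
    WorkloadCategory("Vision/Monitoring", "camera streams or events per second"),
)
```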

Capacity Buckets

We use coarse, business-relevant buckets rather than precise numbers; a small mapping sketch follows the three scales below:

User Scale

  • Small team: ≤ 20 active users
  • Medium team: 21-50 active users
  • Large team: > 50 active users (typically 100+)

Document Processing Scale

  • Basic triage: 500 docs/day (classification only)
  • With AI analysis: 2,000 docs/day (extraction + summarization)
  • High volume: 10,000+ docs/day (triage-focused)

Ticket/Email Scale

  • Small operation: 1,000 tickets/day
  • Medium operation: 5,000 tickets/day
  • Large operation: 20,000+ tickets/day
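
A minimal mapping sketch, assuming the listed figures are inclusive upper bounds (function names are illustrative):

```python
def user_bucket(active_users: int) -> str:
    """Map an active-user count to the coarse user-scale buckets above."""
    if active_users <= 20:
        return "Small team"
    if active_users <= 50:
        return "Medium team"
    return "Large team"

def ticket_bucket(tickets_per_day: int) -> str:
    """Map daily ticket volume to the operation-size buckets above."""
    if tickets_per_day <= 1_000:
        return "Small operation"
    if tickets_per_day <= 5_000:
        return "Medium operation"
    return "Large operation"

print(user_bucket(35))        # Medium team
print(ticket_bucket(12_000))  # Large operation
```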

Reporting Spike Benchmark

Secondary metric for batch processing scenarios (quarter-end, audits, reporting):

  • Test case: 4,000 mixed documents end-to-end
  • SLA targets:
    • Fast: ≤ 4 hours (same-day processing)
    • Overnight: ≤ 12 hours (weekend processing)
    • Batch: > 12 hours (asynchronous processing)

This represents “what happens when someone dumps a backlog on Friday at 5 PM.”
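
The SLA tiers translate directly into required throughput. A quick sketch of that arithmetic:

```python
SPIKE_DOCS = 4_000  # the benchmark batch defined above

def required_docs_per_hour(sla_hours: float) -> float:
    """Sustained throughput needed to clear the spike within an SLA window."""
    return SPIKE_DOCS / sla_hours

print(required_docs_per_hour(4))   # Fast: 1000.0 docs/hour
print(required_docs_per_hour(12))  # Overnight: ~333.3 docs/hour
```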

Hardware Capacity Table

| Hardware Tier | Users | Docs/Day | Tickets/Day | 4k Spike | Notes |
|---|---|---|---|---|---|
| Strix Halo (AMD 395) | Medium (20-50) | 2,000 AI | 5,000 | 4-6 hrs | Dept server |
| HX370 (AMD 370) | Small (≤ 20) | 500 AI | 1k-2k | 8-12 hrs | Workstation |
| Hailo-8 (M.2) | N/A | 10k+ triage | 20k+ | 1-2 hrs | Classifier |
| Intel NUC (14 Pro) | Small (≤ 15) | 300 AI | 800-1.5k | 12-18 hrs | Copilot+ |
| Orange Pi (Ascend) | N/A | 15k+ triage | 30k+ | 45 min | NPU pod |

Model Recommendations

| Hardware Tier | Models | Use Cases | Performance |
|---|---|---|---|
| Strix Halo | GPT-OSS 20B/120B, Qwen 14B/32B, Llama 13B/30B | Full LLM, RAG | 30-66 t/s |
| HX370 | GPT-OSS 7B/13B, Qwen 7B/14B, Llama 7B/13B | RAG, summary | 40-80 t/s |
| Hailo-8 | YOLOv8/9, EfficientNet, MobileNet | Vision, detection | 1000+ inf/s |
| Intel NUC | Phi-3/3.5, Gemma 2B/7B, Copilot+ | Windows, Copilot | 50-100 t/s |
| Orange Pi | DeepSeek-R1, Ascend models | Inference tasks | Ascend opt. |

Workflow-Specific Model Recommendations

1. Q&A + Knowledge Worker Support

Model Types: Conversational LLMs with strong reasoning

  • Large models (20B+ parameters): GPT-OSS 120B, Qwen 32B, Llama 30B
  • Medium models (7-14B parameters): GPT-OSS 13B, Qwen 14B, Llama 13B
  • Small models (1-3B parameters): Phi-3, Gemma 2B, Qwen 1.5B

Key Requirements: Context understanding, multi-turn conversation, tool use
Performance Target: < 5 seconds response time for interactive use
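
One way to sanity-check the 5-second target: at a given decode speed, the budget caps the reply length. A rough single-stream sketch that ignores prompt prefill and network time:

```python
def max_reply_tokens(tokens_per_s: float, budget_s: float = 5.0) -> int:
    """Longest reply that still meets the interactive latency target."""
    return int(tokens_per_s * budget_s)

print(max_reply_tokens(40))  # 200 tokens at the low end of the HX370 range
```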

2. Document Processing & Analysis

Model Types: Specialized for text extraction and summarization

  • OCR + Layout models: TrOCR, LayoutParser, Donut
  • Field extraction: Fine-tuned BERT/RoBERTa variants
  • Summarization: Pegasus, BART, T5-based models
  • Classification: EfficientNet for document type detection

Key Requirements: Accuracy over speed, handles complex layouts
Performance Target: < 30 seconds per document for interactive analysis
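
The 30-second target also bounds sustained sequential volume. A back-of-the-envelope sketch, assuming an illustrative 8-hour processing window:

```python
def docs_per_day(seconds_per_doc: float, busy_hours: float = 8.0) -> int:
    """Daily volume for one sequential worker at a given per-document latency."""
    return int(busy_hours * 3600 / seconds_per_doc)

print(docs_per_day(30))  # 960 -- so the 2,000 docs/day bucket implies
                         # batching or at least two parallel workers
```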

3. Ticket/Email Triage & Routing

Model Types: Fast classification and lightweight LLMs

  • Intent classification: DistilBERT, MobileBERT, TinyLLaMA
  • Priority scoring: Rule-based + ML classifiers
  • Auto-response: Template-based with small LLMs
  • Routing: Multi-label classification models

Key Requirements: High throughput, low latency, high accuracy on routing
Performance Target: < 1 second per ticket, 1000+ tickets/hour
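
The two numbers in the target are consistent: one stream at 1 second per ticket yields 3,600 tickets/hour, so 1000+/hour is a floor. A sketch:

```python
def tickets_per_hour(seconds_per_ticket: float, parallel_streams: int = 1) -> int:
    """Hourly triage throughput at a given per-ticket latency."""
    return int(parallel_streams * 3600 / seconds_per_ticket)

print(tickets_per_hour(1.0))  # 3600 -- comfortably above the 1000+/hour target
```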

4. Vision Monitoring & Surveillance

Model Types: Real-time computer vision models

  • Object detection: YOLOv8/9/10, RT-DETR
  • Anomaly detection: Autoencoders, variational models
  • Classification: EfficientNet, MobileNet, RegNet
  • Tracking: DeepSORT, ByteTrack with ReID

Key Requirements: Real-time processing, low power consumption
Performance Target: 30+ FPS per stream, < 100ms latency
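
The FPS target converts an accelerator's inference rate into a stream budget. A rough sketch, ignoring decode, tracking, and multi-model overhead:

```python
def max_streams(inferences_per_s: float, fps_per_stream: float = 30.0) -> int:
    """Streams served if every frame gets exactly one inference pass."""
    return int(inferences_per_s // fps_per_stream)

print(max_streams(1000))  # 33 -- e.g. the Hailo-8 figure from the tables above
```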

Model Specifications by Hardware Tier

| Hardware | Model | Params | Memory | Perf | Power | Use Case |
|---|---|---|---|---|---|---|
| Strix Halo | GPT-OSS 120B | 120B | 256GB | 30-45 t/s | 140W | Complex RAG |
| | Qwen 32B | 32B | 64GB | 50-70 t/s | 140W | Conversations |
| | Llama 30B | 30B | 64GB | 40-60 t/s | 140W | Doc analysis |
| | GPT-OSS 20B | 20B | 48GB | 60-80 t/s | 140W | Balanced |
| HX370 | GPT-OSS 13B | 13B | 32GB | 40-60 t/s | 60W | Office RAG |
| | Qwen 14B | 14B | 32GB | 45-65 t/s | 60W | Knowledge work |
| | Llama 13B | 13B | 32GB | 50-70 t/s | 60W | Summarization |
| | GPT-OSS 7B | 7B | 16GB | 70-100 t/s | 60W | Light assist |
| Hailo-8 | YOLOv8-Large | 43.7M | 2GB | 1000+ inf/s | 15W | Detection |
| | EfficientNet-B7 | 66M | 1GB | 500+ inf/s | 15W | Classification |
| | MobileNetV3-Large | 5.4M | 512MB | 2000+ inf/s | 15W | Fast triage |
| Intel NUC | Phi-3.5 Mini | 3.8B | 8GB | 50-80 t/s | 45W | Windows int. |
| | Gemma 7B | 7B | 16GB | 40-60 t/s | 45W | Copilot |
| | Phi-3 Medium | 14B | 32GB | 30-50 t/s | 45W | Advanced |
| Orange Pi | DeepSeek-R1 | 67B | 128GB | 20-40 t/s | 25W | Inference opt. |
| | Ascend-LLM 32B | 32B | 64GB | 30-50 t/s | 25W | Specialized |
| | Ascend-Vision | 1.2B | 4GB | 500+ inf/s | 25W | Vision proc. |
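
The memory column is consistent with a rule of thumb of roughly 2 bytes per parameter (16-bit weights); quantized builds need far less. A sketch of that footprint estimate (weights only, excluding KV cache and runtime overhead):

```python
def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Rough weight footprint; excludes KV cache and runtime overhead."""
    return params_billion * bytes_per_param

print(weights_gb(32, 2.0))  # 64.0 GB -- matches the Qwen 32B row above
print(weights_gb(32, 0.5))  # 16.0 GB -- the same model at ~4-bit quantization
```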

Usage Guidelines

For Sales/Sizing Conversations

  1. Start with business needs: “How many people need AI assistance? How many documents do you process daily?”
  2. Map to buckets: Use the coarse categories above
  3. Show hardware options: Present 2-3 viable tiers
  4. Discuss spikes separately: “For reporting periods, we can handle 4k docs in X hours”

For Architecture Decisions

  1. Design for sustained load: Use the daily capacity numbers
  2. Plan for spikes: Ensure reporting benchmarks meet business SLAs
  3. Avoid over-engineering: Don’t size for theoretical peaks
  4. Consider hybrid approaches: Combine different hardware types for different workloads

For Client Presentations

  1. Use business language: “This box can handle your customer support team’s questions plus their ticket backlog”
  2. Show real scenarios: “If finance dumps 4k invoices on Friday, it’s processed by Monday morning”
  3. Avoid technical jargon: No TOPS, tokens/sec, or model names unless asked
  4. Focus on outcomes: “Faster response times, reduced manual work, better compliance”

Validation Principles

Sanity Checks

  • Token math: 1 doc ≈ 1k tokens total (text + prompts + instructions); see the sketch after this list
  • Interactive latency: < 5 seconds for Q&A, < 30 seconds for complex tasks
  • Batch efficiency: 4k docs should complete within business-relevant timeframes
  • Real-world factors: Account for I/O, preprocessing, postprocessing, and network latency
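
Putting the token math and the spike benchmark together gives a single-stream estimate. It is pessimistic, since prompt prefill runs much faster than decode and parallel requests raise effective throughput, but it shows the shape of the check:

```python
def spike_hours(docs: int = 4_000, tokens_per_doc: int = 1_000,
                tokens_per_s: float = 60.0) -> float:
    """Hours to push a batch through one decode stream at 1k tokens/doc."""
    return docs * tokens_per_doc / tokens_per_s / 3600

print(round(spike_hours(), 1))  # 18.5 -- so the 4-6 hr table figure assumes
                                # parallelism and lighter per-doc generation
```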

When to Revisit

  • New hardware releases: Update benchmarks quarterly
  • Client feedback: Adjust buckets based on real deployment experiences
  • Technology changes: Re-evaluate when new models or optimization techniques emerge
  • Market shifts: Update when business needs change significantly

Implementation Notes

Data Sources

  • Benchmarks: Based on real LM Studio testing with GPT-OSS, Qwen, and Llama models
  • Workload assumptions: Interactive usage patterns, not continuous batch processing
  • Hardware specs: Vendor claims validated with independent testing where possible

Limitations

  • Model-dependent: Performance varies by model size and quantization
  • Software stack: Results depend on optimization (ROCm, Vulkan, DirectML)
  • Real-world variance: Network, storage, and concurrent workloads affect performance
  • Future-proofing: Framework designed to accommodate new hardware categories

Last updated: December 2025
Framework version: 1.2
