Appendix B: Hardware Requirements & Performance Benchmarks
This appendix provides detailed performance benchmarks and hardware requirements for THETA's three Qwen3-Embedding models (0.6B/4B/8B) across different hardware configurations.
Executive Summary
Key Findings
- GPU Memory Usage
  - 0.6B model: 1.1-5.4 GB (all configurations runnable)
  - 4B model: 7.5-14.5 GB (all configurations runnable)
  - 8B model: 14.1-18.0 GB (OOM at batch_size=64 + seq_len=512)
- CPU vs GPU Performance
  - CPU inference speed (measured with the 0.6B model): 0.2-0.3 docs/s (extremely slow, not recommended for production)
  - GPU inference speed (speedups relative to the 0.3 docs/s CPU rate):
    - 0.6B: 80-87 docs/s (267-290× faster than CPU)
    - 4B: 55-56 docs/s (183× faster than CPU)
    - 8B: 35 docs/s (117× faster than CPU)
- Maximum Configurations on 24GB GPU
  - 0.6B: batch_size=64, seq_len=512 (supported)
  - 4B: batch_size=64, seq_len=512 (supported)
  - 8B: batch_size=64, seq_len=256 (supported; OOM at seq_len=512)
- Recommended Production Configurations (at least 20% memory headroom on a 24GB GPU)
  - 0.6B: batch_size=32, seq_len=256 (peak memory 2.2 GB)
  - 4B: batch_size=32, seq_len=256 (peak memory 9.3 GB)
  - 8B: batch_size=16, seq_len=256 (peak memory 15.1 GB)
B.1 Model Memory Usage Reference
Peak GPU memory usage (GB) for different batch_size values at fixed seq_len=256.
| Model Size | batch=1 | batch=4 | batch=8 | batch=16 | batch=32 | batch=64 |
|---|---|---|---|---|---|---|
| 0.6B | 1.15 GB | 1.25 GB | 1.39 GB | 1.66 GB | 2.20 GB | 3.27 GB |
| 4B | 7.61 GB | 7.79 GB | 7.98 GB | 8.42 GB | 9.28 GB | 11.01 GB |
| 8B | 14.16 GB | 14.35 GB | 14.59 GB | 15.07 GB | 16.04 GB | 17.98 GB |
Memory Usage Patterns
- Base Model Memory:
  - 0.6B: 1.12 GB
  - 4B: 7.55 GB (~6.7× the 0.6B footprint)
  - 8B: 14.10 GB (~1.9× the 4B footprint)
- Memory Growth Rules:
  - Doubling batch_size → +0.5-2 GB memory at batch_size ≥ 16 and seq_len=256; the increase is smaller at lower batch sizes
  - Doubling seq_len → +0.3-1.5 GB memory
  - Memory usage ≈ Base model + batch_size × seq_len × constant, where the constant depends on the model
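The linear rule above can be checked against the B.1 table. The following is a minimal sketch, not THETA's own code: it fits the per-token constant for each model from a single measurement (the batch_size=64, seq_len=256 column) and the base-model figures, then estimates peak memory for other shapes. The names `BASE_GB`, `PEAK_B64_S256_GB`, `per_token_gb`, and `estimate_peak_gb` are hypothetical.

```python
# Base model memory and measured peak at batch_size=64, seq_len=256 (GB),
# copied from this appendix.
BASE_GB = {"0.6B": 1.12, "4B": 7.55, "8B": 14.10}
PEAK_B64_S256_GB = {"0.6B": 3.27, "4B": 11.01, "8B": 17.98}


def per_token_gb(model: str) -> float:
    """Memory per (batch element x token), fitted from one measurement."""
    return (PEAK_B64_S256_GB[model] - BASE_GB[model]) / (64 * 256)


def estimate_peak_gb(model: str, batch_size: int, seq_len: int) -> float:
    """Estimate peak memory as base model + batch_size * seq_len * constant."""
    return BASE_GB[model] + batch_size * seq_len * per_token_gb(model)


if __name__ == "__main__":
    for model in BASE_GB:
        print(model, round(estimate_peak_gb(model, 32, 256), 2))
```

With these inputs the estimate stays within a few hundredths of a GB of the measured seq_len=256 entries. It is only a planning aid: for the 8B model at batch_size=64, seq_len=512 the estimate comes out below 24 GB even though that configuration ran out of memory in testing, so headroom is still needed.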
B.2 OOM Boundaries & Safe Configurations
0.6B Model
- Base model memory: 1.12 GB
- Max safe batch_size (seq_len=256): 64 (peak 3.27 GB)
- Max safe seq_len (batch_size=32): 512 (peak 3.27 GB)
- Recommended production config: batch_size=32, seq_len=256 (peak 2.20 GB, within the 20% headroom target)
- OOM cases: None (all 18 configurations passed)
4B Model
- Base model memory: 7.55 GB
- Max safe batch_size (seq_len=256): 64 (peak 11.01 GB)
- Max safe seq_len (batch_size=32): 512 (peak 11.01 GB)
- Recommended production config: batch_size=32, seq_len=256 (peak 9.28 GB, within the 20% headroom target)
- OOM cases: None (all 18 configurations passed)
8B Model
- Base model memory: 14.10 GB
- Max safe batch_size (seq_len=256): 64 (peak 17.98 GB)
- Max safe seq_len (batch_size=32): 512 (peak 17.98 GB)
- Recommended production config: batch_size=16, seq_len=256 (peak 15.07 GB, within the 20% headroom target)
- OOM cases: 1 OOM (batch_size=64, seq_len=512 exceeds 24GB limit)
OOM Boundary Summary
On 24GB GPU:
- Safe boundary: Peak memory < 19.2 GB (80% utilization)
- 8B model OOM config: batch_size=64 + seq_len=512 (requires >24 GB)
- Avoiding OOM:
  - Use batch_size ≤ 32 or seq_len ≤ 256 for 8B model
  - Or process the data in smaller batches rather than one large batch (when training, gradient accumulation achieves the same effect)
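The 80% rule can be applied mechanically to the measured peaks. The sketch below is illustrative only: it takes the 8B row of table B.1 (seq_len=256) and returns the largest measured batch_size whose peak stays under the budget. The helper names and the `utilization` parameter are assumptions introduced for this example.

```python
from typing import Dict, Optional

# Peak memory (GB) at seq_len=256 for the 8B model, copied from table B.1.
PEAK_8B_S256_GB = {1: 14.16, 4: 14.35, 8: 14.59, 16: 15.07, 32: 16.04, 64: 17.98}


def safe_limit_gb(total_gb: float, utilization: float = 0.8) -> float:
    """Memory budget that leaves (1 - utilization) of the GPU free."""
    return total_gb * utilization


def largest_safe_batch(peaks_gb: Dict[int, float], total_gb: float,
                       utilization: float = 0.8) -> Optional[int]:
    """Largest measured batch_size whose peak is under the budget, else None."""
    limit = safe_limit_gb(total_gb, utilization)
    safe = [batch for batch, peak in peaks_gb.items() if peak < limit]
    return max(safe) if safe else None


if __name__ == "__main__":
    print(safe_limit_gb(24))                        # budget on a 24GB GPU
    print(largest_safe_batch(PEAK_8B_S256_GB, 24))  # largest safe 8B batch
```

On a 24GB GPU every measured 8B configuration at seq_len=256 falls under the 19.2 GB budget, which matches the maximum safe batch_size of 64 listed for the 8B model.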
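B.3 CPU vs GPU Decision Guide
Choose the appropriate device based on data scale and model size. Times in parentheses are processing times: CPU time for the 0.6B model, GPU time for the 4B and 8B models.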
| Data Scale | 0.6B Model | 4B Model | 8B Model |
|---|---|---|---|
| 1K docs | GPU recommended (CPU: 67.5min) | GPU (0.3min) | GPU (0.5min) |
| 5K docs | GPU recommended (CPU: 314.6min) | GPU (1.5min) | GPU (2.4min) |
| 10K docs | GPU recommended (CPU: 637.5min) | GPU (3.0min) | GPU (4.8min) |
| 50K docs | GPU recommended (CPU: not tested) | GPU (14.9min) | GPU (23.8min) |
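The GPU times in this table follow directly from throughput: time = n_docs / docs per second. A minimal sketch of that arithmetic, using the 4B and 8B throughputs reported in the B.4 table (the function name `estimated_minutes` is hypothetical):

```python
def estimated_minutes(n_docs: int, docs_per_second: float) -> float:
    """Estimated minutes to embed n_docs at a steady throughput."""
    if docs_per_second <= 0:
        raise ValueError("docs_per_second must be positive")
    return n_docs / docs_per_second / 60.0


if __name__ == "__main__":
    # 4B at 56.0 docs/s and 8B at 35.0 docs/s, 50K documents each.
    print(round(estimated_minutes(50_000, 56.0), 1))  # 14.9
    print(round(estimated_minutes(50_000, 35.0), 1))  # 23.8
```

The same function can be used to project run times for data sizes that were not benchmarked, assuming throughput stays flat as it does between 10K and 50K docs in the measurements.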
Device Selection Guidelines
B.4 Performance Summary
Comprehensive performance summary across all tested configurations.
| Model | Device | n_docs | batch_size | docs/s | Peak Memory (GB) |
|---|---|---|---|---|---|
| 0.6B | CPU | 100 | 32 | 0.2 | N/A |
| 0.6B | CPU | 500 | 32 | 0.2 | N/A |
| 0.6B | CPU | 1000 | 32 | 0.2 | N/A |
| 0.6B | CPU | 2000 | 32 | 0.3 | N/A |
| 0.6B | CPU | 5000 | 32 | 0.3 | N/A |
| 0.6B | CPU | 10000 | 32 | 0.3 | N/A |
| 0.6B | GPU | 1000 | 64 | 80.0 | 3.27 |
| 0.6B | GPU | 5000 | 64 | 85.8 | 3.27 |
| 0.6B | GPU | 10000 | 64 | 86.8 | 3.27 |
| 0.6B | GPU | 50000 | 64 | 87.0 | 3.27 |
| 4B | GPU | 1000 | 32 | 54.9 | 9.28 |
| 4B | GPU | 5000 | 32 | 55.9 | 9.28 |
| 4B | GPU | 10000 | 32 | 56.0 | 9.28 |
| 4B | GPU | 50000 | 32 | 56.0 | 9.28 |
| 8B | GPU | 1000 | 16 | 35.0 | 15.07 |
| 8B | GPU | 5000 | 16 | 35.0 | 15.07 |
| 8B | GPU | 10000 | 16 | 35.0 | 15.07 |
| 8B | GPU | 50000 | 16 | 35.0 | 15.07 |
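The speedup figures quoted in this appendix can be reproduced from the table. The sketch below assumes the CPU baseline is the 0.3 docs/s measured with the 0.6B model, since the table lists no CPU rows for the 4B and 8B models; the names used are hypothetical.

```python
CPU_BASELINE_DOCS_PER_S = 0.3  # 0.6B model on CPU, the only CPU rows in the table

# (lowest, highest) GPU throughput per model, taken from the table above.
GPU_DOCS_PER_S = {"0.6B": (80.0, 87.0), "4B": (54.9, 56.0), "8B": (35.0, 35.0)}


def speedup(gpu_docs_per_s: float, cpu_docs_per_s: float = CPU_BASELINE_DOCS_PER_S) -> float:
    """Ratio of GPU throughput to CPU throughput."""
    return gpu_docs_per_s / cpu_docs_per_s


if __name__ == "__main__":
    for model, (low, high) in GPU_DOCS_PER_S.items():
        print(model, round(speedup(low)), round(speedup(high)))
```

Because the baseline comes from the smallest model, the 4B and 8B ratios compare against a CPU run of a different model and are best read as rough indicators.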
B.5 Usage Recommendations
Choosing Model Size
Data < 10K docs → 0.6B (fast, low memory)
Data 10K-50K docs → 4B (balanced performance)
Data > 50K docs → 8B (highest quality)
High quality required → 8B (regardless of data size)
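These rules translate into a small helper. This is a sketch under one assumption the rules leave open: documents counts of exactly 10K and 50K are treated as part of the 10K-50K band. The function name `choose_model` and its `high_quality` flag are hypothetical.

```python
def choose_model(n_docs: int, high_quality: bool = False) -> str:
    """Pick a Qwen3-Embedding size from the data scale rules above."""
    if high_quality or n_docs > 50_000:
        return "8B"      # highest quality, or more than 50K docs
    if n_docs >= 10_000:
        return "4B"      # 10K-50K docs (boundaries assumed inclusive)
    return "0.6B"        # fewer than 10K docs


if __name__ == "__main__":
    for n in (5_000, 10_000, 50_000, 80_000):
        print(n, choose_model(n))
```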
Configuring batch_size
Abundant memory (>16GB) → batch_size=32 or 64
Limited memory (8-16GB) → batch_size=16
Tight memory (<8GB) → batch_size=8 or lower (when training, use gradient accumulation)
Recommended Configurations by GPU Memory
8GB GPU
- 0.6B: batch_size=64 (supported)
- 4B: batch_size=4-8 (limited)
- 8B: Not recommended
16GB GPU
- 0.6B: batch_size=64 (supported)
- 4B: batch_size=32 (supported)
- 8B: batch_size=8 (limited)
24GB GPU
- 0.6B: batch_size=64 (supported)
- 4B: batch_size=64 (supported)
- 8B: batch_size=32 (supported)
Test Environment
- GPU: NVIDIA GeForce RTX 3090, 24GB VRAM
- CPU: Multi-core CPU (for comparison)
- Precision: FP16 (GPU), FP32 (CPU)
- Framework: PyTorch 2.7.0 + Transformers 5.3.0
- Model Source: ModelScope (Qwen/Qwen3-Embedding-{0.6B,4B,8B})
- Test Date: April 17, 2026
Test Methodology
GPU Memory Testing
- Test matrix: 3 models × 6 batch_sizes (1,4,8,16,32,64) × 3 seq_lens (128,256,512) = 54 configurations
- Each configuration runs one forward pass, recording peak memory
- Memory monitored using `torch.cuda.max_memory_allocated()`
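The procedure above might look roughly like the following. This is a sketch, not the benchmark script itself: `run_forward` is a hypothetical zero-argument callable that performs one forward pass, and reporting in units of 1024³ bytes is an assumption, since the source does not say how bytes were converted to GB.

```python
from itertools import product

MODELS = ["0.6B", "4B", "8B"]
BATCH_SIZES = [1, 4, 8, 16, 32, 64]
SEQ_LENS = [128, 256, 512]

# Every (model, batch_size, seq_len) combination in the test matrix.
CONFIGS = list(product(MODELS, BATCH_SIZES, SEQ_LENS))


def measure_peak_gb(run_forward) -> float:
    """Run one forward pass and return the peak allocated GPU memory.

    Requires PyTorch with a CUDA device. `run_forward` is a hypothetical
    callable supplied by the caller.
    """
    import torch  # imported lazily so the matrix can be built without a GPU

    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()   # start the peak counter from zero
    with torch.no_grad():
        run_forward()
    torch.cuda.synchronize()               # wait for queued kernels to finish
    return torch.cuda.max_memory_allocated() / 1024**3


if __name__ == "__main__":
    print(len(CONFIGS))  # 54
```

Resetting the peak statistics before each configuration keeps an earlier, larger run from inflating the reading for a later, smaller one.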
CPU/GPU Time Testing
- Each configuration runs 3 times; the first run serves as warmup and the last 2 runs are averaged
- Uses randomly generated text data
- CPU tests use all available cores
- GPU tests use recommended batch_size
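A timing loop in the spirit of this methodology could be sketched as follows. It is not the actual harness: `encode` is a hypothetical callable that embeds a list of documents, and the `clock` parameter exists only so the timer can be swapped out.

```python
import time
from typing import Callable, Sequence


def docs_per_second(encode: Callable[[Sequence[str]], object],
                    docs: Sequence[str],
                    runs: int = 3,
                    warmup: int = 1,
                    clock: Callable[[], float] = time.perf_counter) -> float:
    """Time encode(docs) `runs` times and average all but the warmup runs."""
    if runs <= warmup:
        raise ValueError("runs must exceed warmup")
    durations = []
    for _ in range(runs):
        start = clock()
        encode(docs)
        durations.append(clock() - start)
    measured = durations[warmup:]              # drop the warmup run(s)
    mean_seconds = sum(measured) / len(measured)
    return len(docs) / mean_seconds


if __name__ == "__main__":
    fake_docs = ["lorem ipsum"] * 1000
    print(docs_per_second(lambda batch: [len(d) for d in batch], fake_docs))
```

On a GPU, `encode` would need to call `torch.cuda.synchronize()` before returning, because CUDA kernels run asynchronously and the timer would otherwise stop before the work completes.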
Key Conclusions
- Strongly recommend GPU: GPU inference is 117-290× faster than CPU
- 24GB GPU is sufficient for all models, except the 8B model at batch_size=64 + seq_len=512
- Production recommendations:
  - Small tasks (<10K docs): 0.6B model, batch_size=32
  - Medium tasks (10K-50K docs): 4B model, batch_size=32
  - Large/high-quality tasks: 8B model, batch_size=16
- CPU only suitable for: Very small tests (<100 docs) or offline processing without GPU