What to run on your Spark.
Pick a tier that fits your unified memory budget, then pair it with an agent stack that matches your workload. All numbers measured on Grace-Blackwell silicon at our published quantization defaults.
Light · 7B–13B
16–32GB Spark| Model | Params | Quant | Min RAM | 64GB tok/s | 128GB tok/s | Notes |
|---|---|---|---|---|---|---|
| Qwen3-7B | 7B | Q5_K_M | 8GB | 96 | 110 | Default planner for multi-agent meshes. Sub-10ms first token on Spark. |
| Llama-3.1-13B | 13B | Q5_K_M | 14GB | 62 | 74 | Cheap, reliable assistant. Pair with a 70B for reasoning hand-off. |
Workhorse · 32B–70B
64GB Spark| Model | Params | Quant | Min RAM | 64GB tok/s | 128GB tok/s | Notes |
|---|---|---|---|---|---|---|
| Qwen3-Coder-32B | 32B | Q4_K_M | 24GB | 34 | 41 | Top open coder for Spark. Excellent at long refactors with 32k context. |
| Llama-3.1-70B | 70B | Q4_K_M | 48GB | 18 | 24 | The Spark workhorse. Anchor model for any 64GB+ rig. |
Frontier · 120B+
128GB Spark or cluster| Model | Params | Quant | Min RAM | 64GB tok/s | 128GB tok/s | Notes |
|---|---|---|---|---|---|---|
| Mixtral 8x22B | 141B (MoE) | Q3_K_M | 96GB | — | 19 | Strong reasoner that fits 128GB Spark with headroom for KV cache. |
| Qwen3-235B | 235B (MoE) | Q4_K_M | 140GB | — | 11 | Requires a 2-node cluster or 128GB + aggressive offload. |
Agent stacks
Reference stacks we've validated on Spark. Pick the one that matches your minimum hardware config.
64GB Spark laptop
OpenShell + Nemo-Claw
NVIDIA-aligned local agent reference stack. Driver-level offload for the Grace NPU.
OpenShell 0.4Nemo-ClawOllama 0.9
32GB Spark laptop
llama.cpp + Aider
Minimal coding-agent loop. Best latency on 13B–32B coders. No daemon required.
llama.cppAiderripgrep
128GB Spark or 2-node cluster
vLLM + Letta
Persistent multi-agent memory with high-throughput batched inference. Production-leaning.
vLLM 0.6LettaQdrant