Models & Agents

What to run on your Spark.

Pick a tier that fits your unified memory budget, then pair it with an agent stack that matches your workload. All numbers measured on Grace-Blackwell silicon at our published quantization defaults.

Light · 7B–13B

16–32GB Spark

Model	Params	Quant	Min RAM	64GB tok/s	128GB tok/s	Notes
Qwen3-7B	7B	Q5_K_M	8GB	96	110	Default planner for multi-agent meshes. Sub-10ms first token on Spark.
Llama-3.1-13B	13B	Q5_K_M	14GB	62	74	Cheap, reliable assistant. Pair with a 70B for reasoning hand-off.

Workhorse · 32B–70B

64GB Spark

Model	Params	Quant	Min RAM	64GB tok/s	128GB tok/s	Notes
Qwen3-Coder-32B	32B	Q4_K_M	24GB	34	41	Top open coder for Spark. Excellent at long refactors with 32k context.
Llama-3.1-70B	70B	Q4_K_M	48GB	18	24	The Spark workhorse. Anchor model for any 64GB+ rig.

Frontier · 120B+

128GB Spark or cluster

Model	Params	Quant	Min RAM	64GB tok/s	128GB tok/s	Notes
Mixtral 8x22B	141B (MoE)	Q3_K_M	96GB	—	19	Strong reasoner that fits 128GB Spark with headroom for KV cache.
Qwen3-235B	235B (MoE)	Q4_K_M	140GB	—	11	Requires a 2-node cluster or 128GB + aggressive offload.

Agent stacks

Reference stacks we've validated on Spark. Pick the one that matches your minimum hardware config.

64GB Spark laptop

OpenShell + Nemo-Claw

NVIDIA-aligned local agent reference stack. Driver-level offload for the Grace NPU.

OpenShell 0.4Nemo-ClawOllama 0.9

32GB Spark laptop

llama.cpp + Aider

Minimal coding-agent loop. Best latency on 13B–32B coders. No daemon required.

llama.cppAiderripgrep

128GB Spark or 2-node cluster

vLLM + Letta

Persistent multi-agent memory with high-throughput batched inference. Production-leaning.

vLLM 0.6LettaQdrant