Models & Agents

What to run on your Spark.

Pick a tier that fits your unified memory budget, then pair it with an agent stack that matches your workload. All numbers measured on Grace-Blackwell silicon at our published quantization defaults.

Light · 7B–13B

16–32GB Spark
ModelParamsQuantMin RAM64GB tok/s128GB tok/s
Qwen3-7B7BQ5_K_M8GB96110
Llama-3.1-13B13BQ5_K_M14GB6274

Workhorse · 32B–70B

64GB Spark
ModelParamsQuantMin RAM64GB tok/s128GB tok/s
Qwen3-Coder-32B32BQ4_K_M24GB3441
Llama-3.1-70B70BQ4_K_M48GB1824

Frontier · 120B+

128GB Spark or cluster
ModelParamsQuantMin RAM64GB tok/s128GB tok/s
Mixtral 8x22B141B (MoE)Q3_K_M96GB19
Qwen3-235B235B (MoE)Q4_K_M140GB11

Agent stacks

Reference stacks we've validated on Spark. Pick the one that matches your minimum hardware config.

64GB Spark laptop

OpenShell + Nemo-Claw

NVIDIA-aligned local agent reference stack. Driver-level offload for the Grace NPU.

OpenShell 0.4Nemo-ClawOllama 0.9
32GB Spark laptop

llama.cpp + Aider

Minimal coding-agent loop. Best latency on 13B–32B coders. No daemon required.

llama.cppAiderripgrep
128GB Spark or 2-node cluster

vLLM + Letta

Persistent multi-agent memory with high-throughput batched inference. Production-leaning.

vLLM 0.6LettaQdrant