Planning a GPU/CPU Division for Local LLM, and the Reality of Daily Trial and Error
On a Blackwell 96GB + EPYC 9175F workstation, I planned to split CPU into a Dagster idempotent pipeline and GPU into a user-facing interactive lane with daily LoRA updates. Reality is still experimenting with ik_llama.cpp partial offloading to find model combinations that don’t fight over memory bandwidth. Includes a timeline of 60+ models tested.
Summary
I planned a GPU/CPU division of labor for local LLM on a Blackwell 96GB + EPYC 9175F workstation with 12-channel memory bandwidth.
The plan: Use the CPU lane for a Dagster pipeline — idempotent, asynchronous, running large MoE models for quality output. Use the GPU lane as the user-facing interactive window with smaller, fine-tuned models specialized through daily LoRA adapters generated from Dagster assets.
The reality: I’m using ik_llama.cpp with partial GPU offloading, still experimenting daily with model combinations to avoid memory bandwidth contention between GPU and CPU. The Dagster + LoRA pipeline isn’t built yet.
The Plan: CPU/GPU Division Architecture
CPU Lane — Dagster Idempotent Pipeline
Leverage the 12-channel memory bandwidth of EPYC for large MoE models (100B–400B class) in async mode.
- Dagster manages the pipeline as idempotent assets. Jobs can be re-run safely and produce the same results
- Use cases: repository-wide analysis, documentation generation, test code generation, large-scale refactoring reasoning — work where quality matters more than speed
- No responsiveness requirement. Fire and forget, then collect results
MoE models often don’t fit in a single GPU’s VRAM even when their active parameters are small, because total parameters are enormous. With 12-channel CPU memory bandwidth, they run at practical speeds for async workloads.
GPU Lane — Interactive Window + Daily LoRA Updates
The GPU lane serves as the user-facing interactive endpoint.
- Run a ~30B-class model optimized for responsiveness
- Fine-tune for specialization: my codebase, my design patterns, my languages
- Generate LoRA adapters daily from Dagster assets, applying them to the base model on a daily cycle. The heavy CPU processing feeds back into GPU-side conversational quality
This feedback loop is the core of the plan. Knowledge extracted through slow, thorough CPU-side processing gets distilled into LoRA adapters that make the GPU-side interactive model progressively better. From the user’s perspective, it becomes a local LLM that gets smarter the more you use it.
Planned Pipeline Diagram
CPU (EPYC 12ch):
Dagster Pipeline → MoE 100B-400B (ik_llama.cpp / vLLM)
→ asset: code analysis results, test generation, refactoring proposals
→ asset: LoRA training data
→ daily LoRA adapter generation
GPU (Blackwell 96GB):
Inference server → Dense ~30B (vLLM) + LoRA adapter
→ Zed MCP / Aider / CLI
→ immediate responses
→ daily LoRA update (from CPU-side assets)
Reality: Model Combination Trial and Error with ik_llama.cpp
The planned Dagster + LoRA pipeline isn’t built yet. The current state is ik_llama.cpp with partial GPU offloading, experimenting with model combinations that don’t create memory bandwidth contention.
Focus of Experimentation
- Adjusting how much weight goes to GPU to balance interactive responsiveness against CPU-side bandwidth availability
- Avoiding patterns where running multiple models simultaneously causes both to slow down due to bandwidth competition
- Measuring how quantization choices (GGUF IQ, NVFP4, FP8) affect quality and speed per model
Current Tool Stack
| Tool | Role |
|---|---|
| ik_llama.cpp | Primary inference server with partial GPU offloading |
| Zed + MCP | Interactive front-end via OpenAI-compatible API |
| Aider | CLI code editing with DIFF-based repository modification |
| ctree + pathfinder | MCP tools for context efficiency |
vLLM is in the plan but ik_llama.cpp’s partial offloading currently offers more flexibility for trying different combinations.
Model Trial Timeline
Models tested from January through March 2026. Retrieved from cold storage archive.
January 2026
| Date | Model | Notes |
|---|---|---|
| 01/04 | Qwen3-VL-32B-Instruct | GUI Agent, Tool Use evaluation |
| 01/04 | IQuest-Coder-V1-40B-Instruct (GGUF) | High SWE-Bench score, Dense coding model |
| 01/04 | Hermes-4.3-36B | Instruction fidelity, Function Call stability |
| 01/05 | gpt-oss-120b (GGUF) | CPU lane candidate, 120B Dense |
| 01/05 | Hermes-4-70B-FP8 | 70B FP8, testing single-GPU fit |
| 01/05 | Llama-4-Scout-17B-16E-Instruct (GGUF) | MoE, 17B active |
| 01/09 | Llama-3.3-70B-Instruct-NVFP4 | NVFP4 quantization quality check |
| 01/09 | Llama-4-Scout-17B-16E-Instruct-NVFP4 | NVFP4 comparison |
| 01/09 | Command-A-Reasoning-NVFP4 | Reasoning-focused model |
| 01/09 | IQuest-Coder-V1-40B-Loop-Instruct-NVFP4 | Loop variant NVFP4 |
| 01/09 | MiroThinker-v1.5-30B | 30B thinking model |
| 01/09 | IQuest-Coder-V1-40B-Loop-Instruct | Loop variant original |
| 01/09 | IQuest-Coder-V1-40B-Instruct | Dense original weights |
| 01/11 | Llama-4-Maverick-17B-128E-Instruct (GGUF) | 128 expert MoE |
| 01/11 | plamo-2-translate | Translation-specific model |
| 01/11 | functiongemma-270m-it | Lightweight Function Call test |
| 01/11 | gemma-3-270m-it-NVFP4 | Ultra-light NVFP4 |
| 01/11 | gemma-3-27b-it-NVFP4A16 | 27B NVFP4 |
| 01/11 | gpt-oss-20b | 20B Dense |
| 01/11 | gemma-3-27b-it | 27B original |
| 01/11 | Qwen3-Coder-30B-A3B-Instruct-NVFP4 | MoE coder, 3B active |
| 01/11 | gpt-oss-120b | 120B original weights |
| 01/11 | Monstral-123B-v2-NVFP4 | 123B MoE NVFP4 |
| 01/11 | LTX-2 | Video generation model |
| 01/12 | Mixtral-8x22B (imatrix GGUF) | 8x22B MoE |
| 01/20 | Nemotron-3-Nano-30B-A3B-NVFP4 | NVIDIA MoE 30B |
| 01/20 | Qwen3-Coder-30B-A3B-Instruct-FP8 | FP8 comparison |
| 01/22 | GLM-4.7-Flash | 7B Flash model |
February 2026
| Date | Model | Notes |
|---|---|---|
| 02/05 | Nemotron-3-Nano-30B-A3B-NVFP4 (nvidia official) | Official NVFP4 re-evaluation |
| 02/13 | Qwen3-Next-80B-A3B-Thinking (GGUF) | 80B MoE Thinking |
| 02/14 | Step-3.5-Flash (GGUF) | Flash-class model |
| 02/15 | Kimi-K2.5 (GGUF) | 1T MoE quantized |
| 02/15 | GLM-5 (GGUF) | Next-gen GLM |
| 02/15 | GLM-4.7-Flash (GGUF) | GGUF Flash variant |
| 02/15 | GLM-4.7-Flash-Uncensored (imatrix GGUF) | Uncensored variant |
| 02/15 | Qwen3-Coder-Next (GGUF) | Next-gen coder |
| 02/16 | DeepSeek-V3.2-Speciale (GGUF) | V3.2 specialized |
| 02/16 | gpt-oss-120b-NEO (imatrix GGUF) | NEO imatrix variant |
| 02/17 | MiniMax-M2.5 (GGUF) | 230B MoE |
| 02/17 | Qwen3-Coder-Next-NVFP4 | NVFP4 variant |
| 02/18 | Step-3.5-Flash (GGUF, ubergarm) | Alternative quantization |
| 02/18 | Ace-Step1.5 | Music generation model |
| 02/18 | Voxtral-Mini-4B-Realtime | Real-time voice |
| 02/18 | FLUX.2-klein-9B | Image generation |
| 02/18 | GLM-5 (GGUF, ubergarm) | Alternative quantization |
| 02/18 | LFM2-8B-A1B | Liquid Foundation Model |
| 02/19 | LFM2-8B-A1B (GGUF) | GGUF variant |
| 02/19 | LFM2.5-1.2B-Thinking | Ultra-light Thinking |
| 02/19 | LFM2.5-VL-1.6B | Ultra-light Vision-Language |
| 02/19 | LFM2.5-1.2B-Instruct | Ultra-light Instruct |
| 02/19 | Devstral-2-123B-Instruct (GGUF) | 123B coding model |
| 02/19 | Qwen3.5-397B-A17B (GGUF) | 397B MoE |
| 02/20 | MiroThinker-v1.5-235B (GGUF) | 235B thinking model |
| 02/21 | Qwen3.5-397B-A17B (GGUF, ubergarm) | Alternative quantization |
| 02/23 | MiniMax-M2.5 (GGUF, AesSedai) | Alternative quantization |
| 02/24 | Kimi-K2.5 (GGUF, ubergarm) | Alternative quantization |
| 02/24 | Qwen3-Next-80B-A3B-Instruct-NVFP4 | NVFP4 variant |
| 02/24 | Qwen3-Next-80B-A3B-Thinking-NVFP4 | Thinking NVFP4 |
| 02/25 | Qwen3.5-35B-A3B (GGUF) | 35B MoE |
| 02/25 | Qwen3.5-122B-A10B (GGUF) | 122B MoE |
| 02/25 | Qwen3.5-122B-A10B-NVFP4 | NVFP4 variant |
| 02/25 | Qwen3.5-27B | 27B Dense |
| 02/25 | Qwen3.5-122B-A10B (GGUF, ubergarm) | Alternative quantization |
March 2026
| Date | Model | Notes |
|---|---|---|
| 03/02 | Nemotron-Nano-9B-v2-Japanese (GGUF) | Japanese-specialized |
| 03/02 | Nemotron-Nano-9B-v2-Japanese | Original weights |
| 03/03 | Qwen3.5-27B (GGUF, bartowski) | GGUF variant |
| 03/03 | Qwen3.5-27B (GGUF, ubergarm) | Alternative quantization |
| 03/06 | pplx-embed-context-v1-0.6b | Perplexity embedding |
| 03/06 | pplx-embed-v1-0.6b | Perplexity embedding |
| 03/06 | pplx-embed-context-v1-4b | 4B embedding |
| 03/06 | pplx-embed-v1-4b | 4B embedding |
Trends
Looking back at three months of experimentation:
- January: Baseline model selection. Comparing Dense coding models (IQuest-Coder 40B) against MoE (Scout, Maverick), and evaluating NVFP4 quantization quality
- Early February: Increased evaluation of Chinese-origin models (GLM, Kimi, DeepSeek). Also testing Flash-class lightweight models
- Late February: Qwen3.5-generation MoE models (397B, 122B, 35B) become the focus. More per-model quantization comparisons (downloading GGUF/NVFP4/FP8 variants of the same model)
- March: Shift to Japanese-specialized models (Nemotron-Nano-9B-v2-Japanese) and embedding models (Perplexity)
Throughout, the same model often appears multiple times with different quantizations. This is because quantization level directly affects GPU/CPU division efficiency during ik_llama.cpp partial offloading.
Next Steps
The planned Dagster pipeline + daily LoRA updates are not yet implemented. Based on what I’ve learned from the current experimentation:
- Stabilize a model combination for ik_llama.cpp partial offloading
- Build the Dagster idempotent pipeline and dedicate the CPU lane to async processing
- Begin daily LoRA adapter generation from Dagster assets for the GPU-side base model
- Integrate ctree + pathfinder MCP tools as Dagster asset inputs
