Summary

I planned a GPU/CPU division of labor for local LLM inference on a workstation with a 96GB Blackwell GPU and an EPYC 9175F with 12-channel memory.

The plan: Use the CPU lane for an idempotent, asynchronous Dagster pipeline that runs large MoE models for quality output. Use the GPU lane as the user-facing interactive window, running smaller fine-tuned models specialized through daily LoRA adapters generated from Dagster assets.

The reality: I’m using ik_llama.cpp with partial GPU offloading, still experimenting daily with model combinations to avoid memory bandwidth contention between GPU and CPU. The Dagster + LoRA pipeline isn’t built yet.


The Plan: CPU/GPU Division Architecture

CPU Lane — Dagster Idempotent Pipeline

Use the EPYC’s 12-channel memory bandwidth to run large MoE models (100B–400B class) asynchronously.

  • Dagster manages the pipeline as idempotent assets. Jobs can be re-run safely and produce the same results
  • Use cases: repository-wide analysis, documentation generation, test code generation, large-scale refactoring reasoning — work where quality matters more than speed
  • No responsiveness requirement. Fire and forget, then collect results
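
What "idempotent" means for an LLM job can be sketched without Dagster. This is a minimal illustration, not the planned pipeline: the function names and the on-disk layout are hypothetical. The idea is to derive a key from everything that determines the output (model, quantization, prompt, sampling parameters) and skip the expensive generation when a result for that key already exists. Sampling is only reproducible with fixed parameters, so caching by key is what makes a re-run safe either way.

```python
import hashlib
import json
from pathlib import Path


def job_key(model: str, quant: str, prompt: str, params: dict) -> str:
    """Derive a stable key from everything that determines the output."""
    payload = json.dumps(
        {"model": model, "quant": quant, "prompt": prompt, "params": params},
        sort_keys=True,  # key order must not change the hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_once(out_dir: Path, key: str, generate) -> str:
    """Return the stored result for `key`, calling `generate` only on a miss."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{key}.txt"
    if path.exists():
        return path.read_text(encoding="utf-8")
    result = generate()  # hypothetical callable that runs the slow CPU-lane model
    tmp = path.with_suffix(".tmp")
    tmp.write_text(result, encoding="utf-8")
    tmp.replace(path)  # rename last so a crash never leaves a partial result
    return result
```

In a Dagster setup the same key could plausibly back an asset's materialization check, but that wiring is not shown here.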

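MoE models often don’t fit in a single GPU’s VRAM even when their active parameters are small, because the total parameter count is enormous. Token generation only reads the active experts’ weights for each token, though, so the speed is bounded by active parameters rather than total size. With 12-channel CPU memory bandwidth, they run at speeds that are practical for async workloads.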
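
A rough back-of-the-envelope version of that argument, as a sketch. The DDR5-6000 speed and the 4.5 bits per weight are assumptions for illustration, not measurements from this machine, and the result is a theoretical ceiling: it ignores KV cache reads, attention, NUMA effects, and any contention with the GPU lane.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical DRAM bandwidth in GB/s: channels x MT/s x bytes per transfer."""
    return channels * mega_transfers_per_s * bus_bytes / 1000


def decode_ceiling_tok_s(bandwidth_gbs: float, active_params_b: float, bits_per_weight: float) -> float:
    """Upper bound on tokens/s if every active weight is read once per token."""
    gb_per_token = active_params_b * bits_per_weight / 8
    return bandwidth_gbs / gb_per_token


# Assumed: 12 channels of DDR5-6000, 64-bit (8-byte) channels.
bw = peak_bandwidth_gbs(channels=12, mega_transfers_per_s=6000)
print(f"theoretical bandwidth: {bw:.0f} GB/s")

# Assumed: an MoE with 17B active parameters at roughly 4.5 bits per weight.
ceiling = decode_ceiling_tok_s(bw, active_params_b=17, bits_per_weight=4.5)
print(f"decode ceiling: {ceiling:.1f} tok/s")
```

Real throughput lands well below this ceiling, but the same formula with a dense 100B+ model shows why active parameter count, not total size, is what makes the CPU lane viable.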

GPU Lane — Interactive Window + Daily LoRA Updates

The GPU lane serves as the user-facing interactive endpoint.

  • Run a ~30B-class model optimized for responsiveness
  • Fine-tune for specialization: my codebase, my design patterns, my languages
  • Generate LoRA adapters from Dagster assets and apply them to the base model on a daily cycle. The heavy CPU processing feeds back into GPU-side conversational quality

This feedback loop is the core of the plan. Knowledge extracted through slow, thorough CPU-side processing gets distilled into LoRA adapters that make the GPU-side interactive model progressively better. From the user’s perspective, it becomes a local LLM that gets smarter the more you use it.
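
The handoff between the two lanes could look roughly like the sketch below. None of this is built yet: the record fields (`prompt`, `output`, `accepted`) and the adapter naming scheme are hypothetical, and the LoRA training step itself is not shown. It only illustrates turning CPU-side results into a training file and giving each day's adapter a dated name so a bad day can be rolled back.

```python
import json
from datetime import date


def to_training_jsonl(records: list[dict]) -> str:
    """Serialize pipeline outputs as one JSON object per line."""
    lines = []
    for rec in records:
        if not rec.get("accepted"):  # hypothetical flag: skip outputs that failed review
            continue
        pair = {"prompt": rec["prompt"], "completion": rec["output"]}
        lines.append(json.dumps(pair, ensure_ascii=False))
    return "\n".join(lines)


def adapter_name(base_model: str, day: date) -> str:
    """Name each adapter by base model and date."""
    return f"{base_model}-lora-{day.isoformat()}"
```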

Planned Pipeline Diagram

CPU (EPYC 12ch):
  Dagster Pipeline → MoE 100B-400B (ik_llama.cpp / vLLM)
    → asset: code analysis results, test generation, refactoring proposals
    → asset: LoRA training data
    → daily LoRA adapter generation

GPU (Blackwell 96GB):
  Inference server → Dense ~30B (vLLM) + LoRA adapter
    → Zed MCP / Aider / CLI
    → immediate responses
    → daily LoRA update (from CPU-side assets)
  

Reality: Model Combination Trial and Error with ik_llama.cpp

The planned Dagster + LoRA pipeline isn’t built yet. The current state is ik_llama.cpp with partial GPU offloading, experimenting with model combinations that don’t create memory bandwidth contention.

Focus of Experimentation

  • Adjusting how many of the model weights are offloaded to the GPU, to balance interactive responsiveness against CPU-side bandwidth availability
  • Avoiding patterns where running multiple models simultaneously causes both to slow down due to bandwidth competition
  • Measuring how quantization choices (GGUF IQ, NVFP4, FP8) affect quality and speed per model
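
The first knob can be estimated before loading anything. A minimal sketch, assuming all layers are the same size (they are not exactly, and MoE expert tensors dominate) and that the VRAM budget already leaves room for KV cache and any second model. The function name and the example numbers are hypothetical.

```python
def split_layers(model_gb: float, n_layers: int, vram_budget_gb: float) -> tuple[int, float]:
    """Return (layers on GPU, GB left in system RAM), assuming equal-sized layers."""
    per_layer_gb = model_gb / n_layers
    gpu_layers = min(n_layers, int(vram_budget_gb // per_layer_gb))
    cpu_gb = model_gb - gpu_layers * per_layer_gb
    return gpu_layers, cpu_gb


# Hypothetical example: a 120 GB quantized model with 60 layers and 80 GB of usable VRAM.
gpu_layers, cpu_gb = split_layers(model_gb=120, n_layers=60, vram_budget_gb=80)
print(f"{gpu_layers} layers on GPU, {cpu_gb:.0f} GB streamed from system RAM")
```

The layer count is the kind of value passed to a layer-offload option such as llama.cpp's `--n-gpu-layers`; the exact flags available depend on the ik_llama.cpp build, so check its help output. The CPU-side remainder is what competes for memory bandwidth with anything else running in the CPU lane.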

Current Tool Stack

Tool | Role
ik_llama.cpp | Primary inference server with partial GPU offloading
Zed + MCP | Interactive front-end via OpenAI-compatible API
Aider | CLI code editing with diff-based repository modification
ctree + pathfinder | MCP tools for context efficiency

vLLM is in the plan, but ik_llama.cpp’s partial offloading currently offers more flexibility for trying different combinations.


Model Trial Timeline

Models tested from January through March 2026. The list was retrieved from my cold storage archive.

January 2026

Date | Model | Notes
01/04 | Qwen3-VL-32B-Instruct | GUI Agent, Tool Use evaluation
01/04 | IQuest-Coder-V1-40B-Instruct (GGUF) | High SWE-Bench score, Dense coding model
01/04 | Hermes-4.3-36B | Instruction fidelity, Function Call stability
01/05 | gpt-oss-120b (GGUF) | CPU lane candidate, 120B MoE
01/05 | Hermes-4-70B-FP8 | 70B FP8, testing single-GPU fit
01/05 | Llama-4-Scout-17B-16E-Instruct (GGUF) | MoE, 17B active
01/09 | Llama-3.3-70B-Instruct-NVFP4 | NVFP4 quantization quality check
01/09 | Llama-4-Scout-17B-16E-Instruct-NVFP4 | NVFP4 comparison
01/09 | Command-A-Reasoning-NVFP4 | Reasoning-focused model
01/09 | IQuest-Coder-V1-40B-Loop-Instruct-NVFP4 | Loop variant NVFP4
01/09 | MiroThinker-v1.5-30B | 30B thinking model
01/09 | IQuest-Coder-V1-40B-Loop-Instruct | Loop variant original
01/09 | IQuest-Coder-V1-40B-Instruct | Dense original weights
01/11 | Llama-4-Maverick-17B-128E-Instruct (GGUF) | 128 expert MoE
01/11 | plamo-2-translate | Translation-specific model
01/11 | functiongemma-270m-it | Lightweight Function Call test
01/11 | gemma-3-270m-it-NVFP4 | Ultra-light NVFP4
01/11 | gemma-3-27b-it-NVFP4A16 | 27B NVFP4
01/11 | gpt-oss-20b | 20B MoE
01/11 | gemma-3-27b-it | 27B original
01/11 | Qwen3-Coder-30B-A3B-Instruct-NVFP4 | MoE coder, 3B active
01/11 | gpt-oss-120b | 120B original weights
01/11 | Monstral-123B-v2-NVFP4 | 123B Dense NVFP4
01/11 | LTX-2 | Video generation model
01/12 | Mixtral-8x22B (imatrix GGUF) | 8x22B MoE
01/20 | Nemotron-3-Nano-30B-A3B-NVFP4 | NVIDIA MoE 30B
01/20 | Qwen3-Coder-30B-A3B-Instruct-FP8 | FP8 comparison
01/22 | GLM-4.7-Flash | 30B-A3B MoE Flash model

February 2026

Date | Model | Notes
02/05 | Nemotron-3-Nano-30B-A3B-NVFP4 (nvidia official) | Official NVFP4 re-evaluation
02/13 | Qwen3-Next-80B-A3B-Thinking (GGUF) | 80B MoE Thinking
02/14 | Step-3.5-Flash (GGUF) | Flash-class model
02/15 | Kimi-K2.5 (GGUF) | 1T MoE quantized
02/15 | GLM-5 (GGUF) | Next-gen GLM
02/15 | GLM-4.7-Flash (GGUF) | GGUF Flash variant
02/15 | GLM-4.7-Flash-Uncensored (imatrix GGUF) | Uncensored variant
02/15 | Qwen3-Coder-Next (GGUF) | Next-gen coder
02/16 | DeepSeek-V3.2-Speciale (GGUF) | V3.2 specialized
02/16 | gpt-oss-120b-NEO (imatrix GGUF) | NEO imatrix variant
02/17 | MiniMax-M2.5 (GGUF) | 230B MoE
02/17 | Qwen3-Coder-Next-NVFP4 | NVFP4 variant
02/18 | Step-3.5-Flash (GGUF, ubergarm) | Alternative quantization
02/18 | Ace-Step1.5 | Music generation model
02/18 | Voxtral-Mini-4B-Realtime | Real-time voice
02/18 | FLUX.2-klein-9B | Image generation
02/18 | GLM-5 (GGUF, ubergarm) | Alternative quantization
02/18 | LFM2-8B-A1B | Liquid Foundation Model
02/19 | LFM2-8B-A1B (GGUF) | GGUF variant
02/19 | LFM2.5-1.2B-Thinking | Ultra-light Thinking
02/19 | LFM2.5-VL-1.6B | Ultra-light Vision-Language
02/19 | LFM2.5-1.2B-Instruct | Ultra-light Instruct
02/19 | Devstral-2-123B-Instruct (GGUF) | 123B coding model
02/19 | Qwen3.5-397B-A17B (GGUF) | 397B MoE
02/20 | MiroThinker-v1.5-235B (GGUF) | 235B thinking model
02/21 | Qwen3.5-397B-A17B (GGUF, ubergarm) | Alternative quantization
02/23 | MiniMax-M2.5 (GGUF, AesSedai) | Alternative quantization
02/24 | Kimi-K2.5 (GGUF, ubergarm) | Alternative quantization
02/24 | Qwen3-Next-80B-A3B-Instruct-NVFP4 | NVFP4 variant
02/24 | Qwen3-Next-80B-A3B-Thinking-NVFP4 | Thinking NVFP4
02/25 | Qwen3.5-35B-A3B (GGUF) | 35B MoE
02/25 | Qwen3.5-122B-A10B (GGUF) | 122B MoE
02/25 | Qwen3.5-122B-A10B-NVFP4 | NVFP4 variant
02/25 | Qwen3.5-27B | 27B Dense
02/25 | Qwen3.5-122B-A10B (GGUF, ubergarm) | Alternative quantization

March 2026

Date | Model | Notes
03/02 | Nemotron-Nano-9B-v2-Japanese (GGUF) | Japanese-specialized
03/02 | Nemotron-Nano-9B-v2-Japanese | Original weights
03/03 | Qwen3.5-27B (GGUF, bartowski) | GGUF variant
03/03 | Qwen3.5-27B (GGUF, ubergarm) | Alternative quantization
03/06 | pplx-embed-context-v1-0.6b | Perplexity embedding
03/06 | pplx-embed-v1-0.6b | Perplexity embedding
03/06 | pplx-embed-context-v1-4b | 4B embedding
03/06 | pplx-embed-v1-4b | 4B embedding

Looking back at three months of experimentation:

  • January: Baseline model selection. Comparing Dense coding models (IQuest-Coder 40B) against MoE (Scout, Maverick), and evaluating NVFP4 quantization quality
  • Early February: Increased evaluation of Chinese-origin models (GLM, Kimi, DeepSeek). Also testing Flash-class lightweight models
  • Late February: Qwen3.5-generation MoE models (397B, 122B, 35B) became the focus. More per-model quantization comparisons (downloading GGUF/NVFP4/FP8 variants of the same model)
  • March: Shift to Japanese-specialized models (Nemotron-Nano-9B-v2-Japanese) and embedding models (Perplexity)

Throughout, the same model often appears multiple times with different quantizations. Quantization level sets the size of the weights, which determines how much of a model fits in the 96GB of VRAM and how much the CPU side has to stream from system memory during ik_llama.cpp partial offloading.
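
The arithmetic behind that is simple enough to sketch. The bits-per-weight figures below are illustrative assumptions (real GGUF and NVFP4 files mix precisions per tensor), and the estimate covers weights only, not KV cache or activations.

```python
def weights_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB for a parameter count given in billions."""
    return total_params_b * bits_per_weight / 8


VRAM_GB = 96  # Blackwell card in this workstation

# Assumed example: a 122B-parameter model at three illustrative precisions.
for label, bpw in [("FP8", 8.0), ("~4.5 bpw", 4.5), ("~3.0 bpw", 3.0)]:
    size = weights_gb(122, bpw)
    spill = max(0.0, size - VRAM_GB)
    print(f"{label:>9}: {size:6.1f} GB weights, {spill:5.1f} GB spilling to CPU")
```

Under these assumptions the same 122B model either spills to system memory or fits entirely in VRAM depending on the quantization, which is why each variant has to be tried separately.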


Next Steps

The planned Dagster pipeline and daily LoRA updates are not yet implemented. Based on what I’ve learned from the current experimentation, the next steps are:

  • Stabilize a model combination for ik_llama.cpp partial offloading
  • Build the Dagster idempotent pipeline and dedicate the CPU lane to async processing
  • Begin daily LoRA adapter generation from Dagster assets for the GPU-side base model
  • Integrate ctree + pathfinder MCP tools as Dagster asset inputs