On this page

Planning a GPU/CPU Division for Local LLM, and the Reality of Daily Trial and Error

On a Blackwell 96GB + EPYC 9175F workstation, I planned to split CPU into a Dagster idempotent pipeline and GPU into a user-facing interactive lane with daily LoRA updates. Reality is still experimenting with ik_llama.cpp partial offloading to find model combinations that don’t fight over memory bandwidth. Includes a timeline of 60+ models tested.

These articles use AI-generated summaries of Obsidian notes originally kept as technical memos.

English translations are produced with AI assistance.

Summary

I planned a GPU/CPU division of labor for local LLM on a Blackwell 96GB + EPYC 9175F workstation with 12-channel memory bandwidth.

The plan: Use the CPU lane for a Dagster pipeline — idempotent, asynchronous, running large MoE models for quality output. Use the GPU lane as the user-facing interactive window with smaller, fine-tuned models specialized through daily LoRA adapters generated from Dagster assets.

The reality: I’m using ik_llama.cpp with partial GPU offloading, still experimenting daily with model combinations to avoid memory bandwidth contention between GPU and CPU. The Dagster + LoRA pipeline isn’t built yet.

The Plan: CPU/GPU Division Architecture

CPU Lane — Dagster Idempotent Pipeline

Leverage the 12-channel memory bandwidth of EPYC for large MoE models (100B–400B class) in async mode.

Dagster manages the pipeline as idempotent assets. Jobs can be re-run safely and produce the same results
Use cases: repository-wide analysis, documentation generation, test code generation, large-scale refactoring reasoning — work where quality matters more than speed
No responsiveness requirement. Fire and forget, then collect results

MoE models often don’t fit in a single GPU’s VRAM even when their active parameters are small, because total parameters are enormous. With 12-channel CPU memory bandwidth, they run at practical speeds for async workloads.

GPU Lane — Interactive Window + Daily LoRA Updates

The GPU lane serves as the user-facing interactive endpoint.

Run a ~30B-class model optimized for responsiveness
Fine-tune for specialization: my codebase, my design patterns, my languages
Generate LoRA adapters daily from Dagster assets, applying them to the base model on a daily cycle. The heavy CPU processing feeds back into GPU-side conversational quality

This feedback loop is the core of the plan. Knowledge extracted through slow, thorough CPU-side processing gets distilled into LoRA adapters that make the GPU-side interactive model progressively better. From the user’s perspective, it becomes a local LLM that gets smarter the more you use it.

Planned Pipeline Diagram

  CPU (EPYC 12ch):
  Dagster Pipeline → MoE 100B-400B (ik_llama.cpp / vLLM)
    → asset: code analysis results, test generation, refactoring proposals
    → asset: LoRA training data
    → daily LoRA adapter generation

GPU (Blackwell 96GB):
  Inference server → Dense ~30B (vLLM) + LoRA adapter
    → Zed MCP / Aider / CLI
    → immediate responses
    → daily LoRA update (from CPU-side assets)

Reality: Model Combination Trial and Error with ik_llama.cpp

The planned Dagster + LoRA pipeline isn’t built yet. The current state is ik_llama.cpp with partial GPU offloading, experimenting with model combinations that don’t create memory bandwidth contention.

Focus of Experimentation

Adjusting how much weight goes to GPU to balance interactive responsiveness against CPU-side bandwidth availability
Avoiding patterns where running multiple models simultaneously causes both to slow down due to bandwidth competition
Measuring how quantization choices (GGUF IQ, NVFP4, FP8) affect quality and speed per model

Current Tool Stack

Tool	Role
ik_llama.cpp	Primary inference server with partial GPU offloading
Zed + MCP	Interactive front-end via OpenAI-compatible API
Aider	CLI code editing with DIFF-based repository modification
ctree + pathfinder	MCP tools for context efficiency

vLLM is in the plan but ik_llama.cpp’s partial offloading currently offers more flexibility for trying different combinations.

Model Trial Timeline

Models tested from January through March 2026. Retrieved from cold storage archive.

January 2026

Date	Model	Notes
01/04	Qwen3-VL-32B-Instruct	GUI Agent, Tool Use evaluation
01/04	IQuest-Coder-V1-40B-Instruct (GGUF)	High SWE-Bench score, Dense coding model
01/04	Hermes-4.3-36B	Instruction fidelity, Function Call stability
01/05	gpt-oss-120b (GGUF)	CPU lane candidate, 120B Dense
01/05	Hermes-4-70B-FP8	70B FP8, testing single-GPU fit
01/05	Llama-4-Scout-17B-16E-Instruct (GGUF)	MoE, 17B active
01/09	Llama-3.3-70B-Instruct-NVFP4	NVFP4 quantization quality check
01/09	Llama-4-Scout-17B-16E-Instruct-NVFP4	NVFP4 comparison
01/09	Command-A-Reasoning-NVFP4	Reasoning-focused model
01/09	IQuest-Coder-V1-40B-Loop-Instruct-NVFP4	Loop variant NVFP4
01/09	MiroThinker-v1.5-30B	30B thinking model
01/09	IQuest-Coder-V1-40B-Loop-Instruct	Loop variant original
01/09	IQuest-Coder-V1-40B-Instruct	Dense original weights
01/11	Llama-4-Maverick-17B-128E-Instruct (GGUF)	128 expert MoE
01/11	plamo-2-translate	Translation-specific model
01/11	functiongemma-270m-it	Lightweight Function Call test
01/11	gemma-3-270m-it-NVFP4	Ultra-light NVFP4
01/11	gemma-3-27b-it-NVFP4A16	27B NVFP4
01/11	gpt-oss-20b	20B Dense
01/11	gemma-3-27b-it	27B original
01/11	Qwen3-Coder-30B-A3B-Instruct-NVFP4	MoE coder, 3B active
01/11	gpt-oss-120b	120B original weights
01/11	Monstral-123B-v2-NVFP4	123B MoE NVFP4
01/11	LTX-2	Video generation model
01/12	Mixtral-8x22B (imatrix GGUF)	8x22B MoE
01/20	Nemotron-3-Nano-30B-A3B-NVFP4	NVIDIA MoE 30B
01/20	Qwen3-Coder-30B-A3B-Instruct-FP8	FP8 comparison
01/22	GLM-4.7-Flash	7B Flash model

February 2026

Date	Model	Notes
02/05	Nemotron-3-Nano-30B-A3B-NVFP4 (nvidia official)	Official NVFP4 re-evaluation
02/13	Qwen3-Next-80B-A3B-Thinking (GGUF)	80B MoE Thinking
02/14	Step-3.5-Flash (GGUF)	Flash-class model
02/15	Kimi-K2.5 (GGUF)	1T MoE quantized
02/15	GLM-5 (GGUF)	Next-gen GLM
02/15	GLM-4.7-Flash (GGUF)	GGUF Flash variant
02/15	GLM-4.7-Flash-Uncensored (imatrix GGUF)	Uncensored variant
02/15	Qwen3-Coder-Next (GGUF)	Next-gen coder
02/16	DeepSeek-V3.2-Speciale (GGUF)	V3.2 specialized
02/16	gpt-oss-120b-NEO (imatrix GGUF)	NEO imatrix variant
02/17	MiniMax-M2.5 (GGUF)	230B MoE
02/17	Qwen3-Coder-Next-NVFP4	NVFP4 variant
02/18	Step-3.5-Flash (GGUF, ubergarm)	Alternative quantization
02/18	Ace-Step1.5	Music generation model
02/18	Voxtral-Mini-4B-Realtime	Real-time voice
02/18	FLUX.2-klein-9B	Image generation
02/18	GLM-5 (GGUF, ubergarm)	Alternative quantization
02/18	LFM2-8B-A1B	Liquid Foundation Model
02/19	LFM2-8B-A1B (GGUF)	GGUF variant
02/19	LFM2.5-1.2B-Thinking	Ultra-light Thinking
02/19	LFM2.5-VL-1.6B	Ultra-light Vision-Language
02/19	LFM2.5-1.2B-Instruct	Ultra-light Instruct
02/19	Devstral-2-123B-Instruct (GGUF)	123B coding model
02/19	Qwen3.5-397B-A17B (GGUF)	397B MoE
02/20	MiroThinker-v1.5-235B (GGUF)	235B thinking model
02/21	Qwen3.5-397B-A17B (GGUF, ubergarm)	Alternative quantization
02/23	MiniMax-M2.5 (GGUF, AesSedai)	Alternative quantization
02/24	Kimi-K2.5 (GGUF, ubergarm)	Alternative quantization
02/24	Qwen3-Next-80B-A3B-Instruct-NVFP4	NVFP4 variant
02/24	Qwen3-Next-80B-A3B-Thinking-NVFP4	Thinking NVFP4
02/25	Qwen3.5-35B-A3B (GGUF)	35B MoE
02/25	Qwen3.5-122B-A10B (GGUF)	122B MoE
02/25	Qwen3.5-122B-A10B-NVFP4	NVFP4 variant
02/25	Qwen3.5-27B	27B Dense
02/25	Qwen3.5-122B-A10B (GGUF, ubergarm)	Alternative quantization

March 2026

Date	Model	Notes
03/02	Nemotron-Nano-9B-v2-Japanese (GGUF)	Japanese-specialized
03/02	Nemotron-Nano-9B-v2-Japanese	Original weights
03/03	Qwen3.5-27B (GGUF, bartowski)	GGUF variant
03/03	Qwen3.5-27B (GGUF, ubergarm)	Alternative quantization
03/06	pplx-embed-context-v1-0.6b	Perplexity embedding
03/06	pplx-embed-v1-0.6b	Perplexity embedding
03/06	pplx-embed-context-v1-4b	4B embedding
03/06	pplx-embed-v1-4b	4B embedding

Trends

Looking back at three months of experimentation:

January: Baseline model selection. Comparing Dense coding models (IQuest-Coder 40B) against MoE (Scout, Maverick), and evaluating NVFP4 quantization quality
Early February: Increased evaluation of Chinese-origin models (GLM, Kimi, DeepSeek). Also testing Flash-class lightweight models
Late February: Qwen3.5-generation MoE models (397B, 122B, 35B) become the focus. More per-model quantization comparisons (downloading GGUF/NVFP4/FP8 variants of the same model)
March: Shift to Japanese-specialized models (Nemotron-Nano-9B-v2-Japanese) and embedding models (Perplexity)

Throughout, the same model often appears multiple times with different quantizations. This is because quantization level directly affects GPU/CPU division efficiency during ik_llama.cpp partial offloading.

Next Steps

The planned Dagster pipeline + daily LoRA updates are not yet implemented. Based on what I’ve learned from the current experimentation:

Stabilize a model combination for ik_llama.cpp partial offloading
Build the Dagster idempotent pipeline and dedicate the CPU lane to async processing
Begin daily LoRA adapter generation from Dagster assets for the GPU-side base model
Integrate ctree + pathfinder MCP tools as Dagster asset inputs

Evaluating Qwen3.5 Coding Ability on a Static Dental Clinic Site

A coding test where Qwen3.5 …

Designing Bilingual System Prompts for the PLAMO-translate AI MODEL

Full text and design rationale …