In real familiar operation, the orchestrator is responsible for worker planning, review, and convergence decisions. If the grandpa (orchestrator) lane is slow, overall turn latency balloons no matter how fast the workers are. I wanted to pin down two things here: the token-generation (TG) behavior of GLM-5.1 IQ3_KS when used as the orchestrator, and how far I could push the full layout toward something practical once Qwen3-Coder-Next workers were added beside it.

The short version is this: even in full --cpu-moe mode, GLM-5.1 still uses the GPU for part of the path, but CPU expert evaluation remains dominant, so going from 1 GPU to 2 GPUs barely changes anything. The only changes that moved the needle were changes to expert placement. In particular, explicitly placing the head 12 layers plus the tail 10 layers across 2 GPUs via -ot improved TG from 53.09ms/token to 45.98ms/token, or from 18.84 t/s to 21.75 t/s. Once I then budgeted VRAM on the assumption of running two Qwen3-Coder-Next Q4_0 --parallel 2 workers, the resident layout became much clearer.

Conclusion

The measurements and aggregation in this run were enough to draw a few operational conclusions directly.

| Topic | What seems settled |
| --- | --- |
| CPU bottleneck | CPU evaluation of expert FFNs accounts for roughly 45-50ms/token, or 85-90% of total time |
| Meaning of 2 GPUs | With full cpu-moe, 2 GPUs behave like 1 GPU, with 53.09 vs 53.03 ms/token |
| Real source of improvement | n-cpu-moe 64 gives +5.8%, while -ot head12+tail10 gives +15.4% |
| Hot layers | Head+tail over 22 layers works materially better than tail-only over 14 layers; head is clearly hot |
| Context resilience | TG stays near 53ms at 61t and around 54-55ms even at 4096t |
| -ger | Improves CPU-side expert execution by 1-2%, plausibly via better L3 locality |
| -sm graph | Unsupported on GLM-DSA and falls back to layer |
| thinking | Not worth it for orchestrator use. 61 content tokens plus 294 reasoning tokens turns 4.6s into 19.1s |

If I only cared about the fastest standalone orchestrator, the 22-layer head+tail layout was the winner. In real operation, though, the orchestrator has to share VRAM with workers. That makes it more realistic to settle GLM-5.1 around a 20-layer GPU expert budget while prioritizing Qwen3-Coder-Next quality and worker throughput.

What Looks Confirmed

CPU Is Dominant

  • CPU evaluation of expert FFNs dominates total generation time at around 45-50ms/token
  • Startup logs show llm_load_tensors: offloaded 80/80 layers to GPU, so GPU layer compute itself is active
  • GPU utilization under full cpu-moe sits around 17-19%
  • PCIe transfer is not a controlling term. Activation size is about 12KB/layer, and even across 79 layers the transfer estimate is only around 19μs
  • 2GPU layer-split is effectively meaningless in full cpu-moe mode. 1GPU and 2GPU are functionally equivalent in TG

At that point the right reading was not “the GPU is slow,” but “the GPU is idle while waiting for CPU expert evaluation.”

Expert Placement Improves TG

| Configuration | GPU expert layers | ms/token | TG (t/s) | Improvement |
| --- | --- | --- | --- | --- |
| full cpu-moe | 0 | 53.09 | 18.84 | baseline |
| n-cpu-moe 64 | 14 | 50.16 | 19.94 | +5.8% |
| -ot head12+tail10 | 22 | 45.98 | 21.75 | +15.4% |

Head Layers Are Hot

  • Tail-only over 14 layers improved TG by +5.8%
  • Head 12 plus tail 10 improved TG by +15.4%
  • The GPU-resident layer count only increased by 1.57x, but the gain grew by roughly 2.7x, which strongly suggests denser activation on the head side
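
To make that comparison concrete, the sketch below converts the measured short-output TG values into milliseconds saved per GPU-resident expert layer. The split of the head+tail gain into a head share and a tail share assumes each tail layer saves as much as it did in the tail-only run; that is an assumption for illustration, not something measured, and it ignores the 1GPU vs 2GPU difference between the two runs.

```python
# Measured short-output TG from the runs in this article (ms per token).
BASELINE_MS = 53.09        # full cpu-moe, 0 GPU expert layers
TAIL14_MS = 50.16          # --n-cpu-moe 64, tail 14 layers on GPU
HEAD12_TAIL10_MS = 45.98   # -ot head 12 + tail 10 layers on GPU


def saved_per_layer(baseline_ms: float, ms: float, gpu_layers: int) -> float:
    """Average ms/token saved per expert layer moved to GPU."""
    return (baseline_ms - ms) / gpu_layers


tail_only = saved_per_layer(BASELINE_MS, TAIL14_MS, 14)
head_tail = saved_per_layer(BASELINE_MS, HEAD12_TAIL10_MS, 22)

# ASSUMPTION: each of the 10 tail layers saves the tail-only per-layer amount.
tail_share_ms = tail_only * 10
head_est = (BASELINE_MS - HEAD12_TAIL10_MS - tail_share_ms) / 12

print(f"tail-only average:   {tail_only:.3f} ms/layer")
print(f"head+tail average:   {head_tail:.3f} ms/layer")
print(f"head-only estimate:  {head_est:.3f} ms/layer")
```

Under that assumption a head layer saves about twice what a tail layer does, which is the same story the bullets tell in percentage terms.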

High Context Tolerance

  • TG stays nearly flat from 61t to 4096t
  • MLA compresses KV enough to keep 32k ctx around 1.5GB
  • Because CPU expert evaluation dominates, context growth has relatively little leverage over total decode time

-ger Has a Real Effect

  • The gain is small but consistent: 50.55 -> 50.16 ms/token
  • Grouped expert routing appears to play slightly better with the large L3 cache on EPYC 9175F

-sm graph Is Unsupported on GLM-DSA

  • Even when explicitly requested, it falls back to layer
  • On this model, split mode is not where the useful optimization work is

Thinking Is Unnecessary for Orchestrator Work

  • Adding 294 reasoning tokens to an output that only needs 61 answer tokens turns 4.6s into 19.1s
  • --reasoning-budget 0 or enable_thinking: false should be the default

Background

The role split in this setup was straightforward.

| Role | Purpose | Model |
| --- | --- | --- |
| grandpa | Orchestrator review, judgment, and contract consistency checks | GLM-5.1 IQ3_KS |
| naughty-worker | Code generation, edits, and file output | Qwen3-Coder-Next |

On the worker side, I want the best coding quality I can afford. On the grandpa side, I want enough quality while pushing TG latency as low as possible. That tradeoff is why I started by measuring expert placement on GLM-5.1 first.

Hardware

GPU:  NVIDIA RTX PRO 6000 Blackwell Max-Q 96GB × 2
CPU:  AMD EPYC 9175F (16 cores)
RAM:  768GB DDR5 6400MT/s
  

In this discussion, the value of 2 GPUs is mostly capacity rather than bandwidth. Under full cpu-moe, the GPUs can still sit mostly idle. As soon as experts start moving back onto GPU, though, a number of plausible layouts stop fitting on a single card.

GLM-5.1 Model Overview

The key model metadata extracted from startup logs was:

  llm_load_print_meta: arch                 = glm-dsa
llm_load_print_meta: model type           = 744B.A40B
llm_load_print_meta: model ftype          = IQ3_KS - 3.1875 bpw
llm_load_print_meta: model params         = 753.864 B
llm_load_print_meta: model size           = 320.216 GiB (3.649 BPW)
llm_load_print_meta: n_layer              = 79
llm_load_print_meta: n_expert             = 256
llm_load_print_meta: n_expert_used        = 8
llm_load_print_meta: n_layer_dense_lead   = 3
  

The broad structure is:

  • layer 0-2: dense
  • layer 3-77: MoE
  • layer 78: nextn prediction
  • layer 79: output

There are 75 MoE layers. Per layer, expert weights are roughly:

  gate_exps = 1225 MiB
down_exps = 1638 MiB
up_exps   = 1225 MiB
  

That puts each expert layer at about 4.1GB (4088 MiB). Total expert weight is roughly 307GB (the 307110.47 MiB CUDA_Host buffer in the logs, about 300 GiB), which is why pushing all experts into CUDA_Host inflates RAM usage so dramatically. By contrast, non-expert GPU-side weight is only around 14.5GB, KV is around 1.5GB, and the compute buffer is around 5.4GB, so a no-GPU-expert layout leaves the GPUs mostly underused.
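
The arithmetic behind those figures is small enough to check directly. This sketch sums the three per-layer expert tensor sizes and scales by the 75 MoE layers; the only inputs are the values quoted above, and the comparison target is the CUDA_Host buffer size reported in the startup logs.

```python
MIB_PER_GIB = 1024

# Per-layer expert tensor sizes quoted above (MiB).
EXPERT_TENSORS_MIB = {"gate_exps": 1225, "down_exps": 1638, "up_exps": 1225}
N_MOE_LAYERS = 75                   # blk.3 through blk.77
CUDA_HOST_BUFFER_MIB = 307110.47    # from llm_load_tensors in the logs

per_layer_mib = sum(EXPERT_TENSORS_MIB.values())
total_mib = per_layer_mib * N_MOE_LAYERS

print(f"per layer: {per_layer_mib} MiB ({per_layer_mib / MIB_PER_GIB:.2f} GiB)")
print(f"75 layers: {total_mib} MiB ({total_mib / MIB_PER_GIB:.1f} GiB)")
print(f"logged CUDA_Host buffer: {CUDA_HOST_BUFFER_MIB / MIB_PER_GIB:.1f} GiB")
```

The estimate lands within about 0.2% of the logged buffer, so the rounded per-tensor sizes are good enough for budgeting.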

Base Command

The baseline ik_llama.cpp launch shape for GLM-5.1 was:

  podman run --rm \
  --device nvidia.com/gpu=1 \
  -p 8000:8000 \
  --cap-add=SYS_NICE \
  -v /mnt/data/models/models--ubergarm--GLM-5.1-GGUF:/models:ro,Z \
  registry.home.arpa/ik_llama.cpp:latest \
  -m /models/snapshots/.../IQ3_KS/GLM-5.1-IQ3_KS-00001-of-00008.gguf \
  --merge-qkv --ctx-size 32768 -ctk q8_0 -ctv q8_0 \
  --parallel 1 --threads 15 --threads-batch 24 \
  -b 8192 -ub 8192 -ngl 999 \
  --cpu-moe -muge -mla 3 -amb 512 \
  --jinja --host 0.0.0.0 --port 8000 \
  --warmup-batch --alias GLM-5.1
  

Main option meanings:

  • --cpu-moe: place all MoE expert weights in pinned CPU-side memory
  • -muge: merge the ffn_up and ffn_gate expert tensors
  • -mla 3: MLA optimization level 3
  • -amb 512: attention max batch size
  • -ctk q8_0 -ctv q8_0: quantized KV cache

Method

I used two prompts for TG comparison:

  1. a short JSON generation request
  2. a long OpenAPI spec generation request

The short JSON case is close to review replies and contract-level judgments. The long OpenAPI case serves as a useful long-output load. I compared both with and without thinking, reading both timings.predicted_ms / predicted_n and the llama.cpp eval time logs.
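
As a minimal sketch of the first reading, the helper below turns timings.predicted_ms and timings.predicted_n into ms/token and t/s. The function name and the sample dict are illustrative; the sample values are the short-output baseline numbers from this run (3238.59 ms over 61 tokens).

```python
def decode_stats(timings: dict) -> tuple[float, float]:
    """Return (ms_per_token, tokens_per_second) for the generation phase."""
    predicted_ms = timings["predicted_ms"]
    predicted_n = timings["predicted_n"]
    if predicted_n <= 0:
        raise ValueError("predicted_n must be positive")
    ms_per_token = predicted_ms / predicted_n
    tokens_per_second = predicted_n / (predicted_ms / 1000.0)
    return ms_per_token, tokens_per_second


# Illustrative input: only the two fields named in the text are used.
sample = {"predicted_ms": 3238.59, "predicted_n": 61}
ms_per_token, tps = decode_stats(sample)
print(f"{ms_per_token:.2f} ms/token, {tps:.2f} t/s")
```

Comparing this against the eval time log line for the same request is a cheap way to confirm both sources agree.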

  # short-output test
curl -s -w "\n\nTotal time: %{time_total}s\n" \
 http://compute.home.arpa:8000/v1/chat/completions \
 -H "Content-Type: application/json" \
 -d '{
"model": "GLM-5.1",
"messages": [
{"role": "user", "content": "Write a JSON object with 5 fields describing a software project. Include name, language, version, description, and license."}
],
"max_tokens": 256,
"temperature": 0.7,
"top_p": 0.95,
"top_k": 45,
"min_p": 0.01,
"chat_template_kwargs": {"enable_thinking": false}
}'
  
  # long-output test
curl -s -w "\n\nTotal time: %{time_total}s\n" \
 http://compute.home.arpa:8000/v1/chat/completions \
 -H "Content-Type: application/json" \
 -d '{
"model": "GLM-5.1",
"messages": [
{"role": "user", "content": "Write a detailed OpenAPI 3.0 specification in JSON for a task management API. Include endpoints for CRUD operations on projects and tasks, with request/response schemas, error responses, and authentication via Bearer token."}
],
"max_tokens": 2048,
"temperature": 0.7,
"top_p": 0.95,
"top_k": 45,
"min_p": 0.01,
"chat_template_kwargs": {"enable_thinking": false}
}'
  

How I Concluded It Was CPU-Bound

The most important result was being able to say fairly early that CPU-side expert evaluation was dominating total runtime.

Expert Computation Is Almost Everything

  • Baseline TG is around 53ms/token
  • Roughly 45-50ms/token of that appears to be CPU-side expert evaluation
  • Non-expert GPU-side work, KV, and compute account for the remainder

GPU utilization stays around 17-19% while CPU usage rises to 1500%, so the machine-level observations are consistent with that reading.

PCIe Is Not the Culprit

It is tempting to suspect activation transfer between CPU and GPU first, but the numbers do not support it.

  • Activation size is about 12KB/layer
  • Even across all 79 layers, transfer volume stays small
  • The transfer-time estimate is only around 19μs

That means PCIe is negligible against a 53ms TG. The slow part is not movement. It is CPU-side expert evaluation.
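
A back-of-the-envelope version of that estimate is sketched below. The effective PCIe bandwidth of 50 GB/s is an assumption chosen for illustration (it was not measured in this run), and per-transfer latency is ignored, so treat the result as an order-of-magnitude check only.

```python
ACTIVATION_BYTES_PER_LAYER = 12 * 1024   # ~12KB per layer, from the estimate above
N_LAYERS = 79
ASSUMED_PCIE_BYTES_PER_S = 50e9          # ASSUMPTION: ~50 GB/s effective, not measured
TG_MS = 53.09                            # baseline ms/token

total_bytes = ACTIVATION_BYTES_PER_LAYER * N_LAYERS
transfer_us = total_bytes / ASSUMED_PCIE_BYTES_PER_S * 1e6
share_of_tg = (transfer_us / 1000.0) / TG_MS

print(f"bytes per token: {total_bytes}")
print(f"transfer time:   {transfer_us:.1f} us")
print(f"share of TG:     {share_of_tg * 100:.3f}%")
```

Even if the real bandwidth were ten times lower than assumed, the transfer share would still be well under 1% of a 53ms token.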

The Reason for High Context Tolerance Is the Same

TG stays nearly flat across 61, 2048, and 4096 tokens because attention growth never becomes dominant. MLA keeps KV at about 1.5GB even at 32k ctx, and CPU expert work becomes the bottleneck long before attention does. That behavior is quite different from dense or fully GPU-resident models.

Configuration 1: Hybrid -ot exps=CPU, 2GPU layer-split

The first measurement used -ot exps=CPU, placing all experts on CPU.

  podman run --rm \
  --device nvidia.com/gpu=all \
  -p 8000:8000 \
  --cap-add=SYS_NICE \
  -v /mnt/data/models/models--ubergarm--GLM-5.1-GGUF:/models:ro,Z \
  registry.home.arpa/ik_llama.cpp:latest \
  -m /models/snapshots/a9962c23e50d9c352e09fe0d9cb131026f4e6441/IQ3_KS/GLM-5.1-IQ3_KS-00001-of-00008.gguf \
  --merge-qkv --ctx-size 32768 -ctk q8_0 -ctv q8_0 --parallel 1 --threads 15 --threads-batch 24 -b 8192 -ub 8192 -ngl 99 -ot exps=CPU -muge -mla 3 -amb 512 -sm graph --jinja --host 0.0.0.0 --port 8000 --warmup-batch --alias GLM-5.1
  

The startup values looked like this:

  Split mode 'graph' is not supported for this model
  => changing split mode to 'layer'

llm_load_tensors:  CUDA_Host buffer size = 307110.47 MiB
llm_load_tensors:      CUDA0 buffer size =  7378.05 MiB
llm_load_tensors:      CUDA1 buffer size =  7120.90 MiB

llama_init_from_model: grouped er    = 0
llama_init_from_model: graph splits = 190
  

TG came out as:

  # short output, thinking off
prompt eval time =    1323.95 ms /    30 tokens (   44.13 ms per token,    22.66 tokens per second)
       eval time =    3238.59 ms /    61 tokens (   53.09 ms per token,    18.84 tokens per second)

# long output, thinking off
prompt eval time =    1831.66 ms /    43 tokens (   42.60 ms per token,    23.48 tokens per second)
       eval time =  111609.04 ms /  2048 tokens (   54.50 ms per token,    18.35 tokens per second)

# long output, thinking on
prompt eval time =    2459.53 ms /    43 tokens (   57.20 ms per token,    17.48 tokens per second)
       eval time =  226282.06 ms /  4096 tokens (   55.24 ms per token,    18.10 tokens per second)
  

The nvtop view was very clear:

  PID USER DEV     TYPE  GPU        GPU MEM    CPU  HOST MEM
3069 ksh3   0  Compute  19%  15236MiB  16%  1500% 309454MiB
3069 ksh3   1  Compute  17%  14460MiB  15%  1075% 309454MiB
  
Figure: nvtop showing the 2GPU exps=CPU baseline with both GPUs staying below 20% utilization. Once all experts move to CPU, the GPUs mostly wait while pinned host memory grows to the 300GB class.

Configuration 2: Hybrid --cpu-moe, 2GPU

Next I repeated the same experiment using --cpu-moe rather than -ot exps=CPU. The goal was to verify whether the two behaved identically in practice.

  podman run --rm \
  --device nvidia.com/gpu=all \
  -p 8000:8000 \
  --cap-add=SYS_NICE \
  -v /mnt/data/models/models--ubergarm--GLM-5.1-GGUF:/models:ro,Z \
  registry.home.arpa/ik_llama.cpp:latest \
  -m /models/snapshots/a9962c23e50d9c352e09fe0d9cb131026f4e6441/IQ3_KS/GLM-5.1-IQ3_KS-00001-of-00008.gguf \
  --merge-qkv --ctx-size 32768 -ctk q8_0 -ctv q8_0 --parallel 1 --threads 15 --threads-batch 24 -b 8192 -ub 8192 -ngl 99 --cpu-moe -muge -mla 3 -amb 512 -sm graph --jinja --host 0.0.0.0 --port 8000 --warmup-batch --alias GLM-5.1 --temp 0.7 --top-k 45 --top-p 0.95 --min-p 0.01
  

The measured values were effectively identical:

  Split mode 'graph' is not supported for this model
  => changing split mode to 'layer'

llm_load_tensors:  CUDA_Host buffer size = 307110.47 MiB
llm_load_tensors:      CUDA0 buffer size =  7378.05 MiB
llm_load_tensors:      CUDA1 buffer size =  7120.90 MiB

llama_init_from_model: graph splits = 190
  
  # short output, thinking off
prompt eval time =    1323.95 ms /    30 tokens (   44.13 ms per token,    22.66 tokens per second)
       eval time =    3238.59 ms /    61 tokens (   53.09 ms per token,    18.84 tokens per second)

# long output, thinking off
prompt eval time =    1831.66 ms /    43 tokens (   42.60 ms per token,    23.48 tokens per second)
       eval time =  111609.04 ms /  2048 tokens (   54.50 ms per token,    18.35 tokens per second)
  

So in this model and build, -ot exps=CPU and --cpu-moe produced the same buffer layout and the same performance. From here onward, I treat both together as “full cpu-moe.”
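
When comparing runs like these, it helps to pull the numbers out of the logs mechanically rather than by eye. The sketch below parses the prompt eval time and eval time lines in the format shown above; the regex and function name are my own, and the sample text is the short-output baseline extract.

```python
import re

# Matches "prompt eval time = 1323.95 ms / 30 tokens" and "eval time = ...".
EVAL_RE = re.compile(
    r"^\s*(prompt eval|eval) time\s*=\s*([\d.]+) ms\s*/\s*(\d+) tokens"
)


def parse_timings(log_text: str) -> list[tuple[str, float, int]]:
    """Return (phase, total_ms, n_tokens) for each timing line found."""
    rows = []
    for line in log_text.splitlines():
        match = EVAL_RE.match(line)
        if match:
            rows.append((match.group(1), float(match.group(2)), int(match.group(3))))
    return rows


LOG = """\
prompt eval time =    1323.95 ms /    30 tokens (   44.13 ms per token,    22.66 tokens per second)
       eval time =    3238.59 ms /    61 tokens (   53.09 ms per token,    18.84 tokens per second)
      total time =    4562.54 ms /    91 tokens
"""

rows = parse_timings(LOG)
for phase, total_ms, n_tokens in rows:
    print(f"{phase}: {total_ms / n_tokens:.2f} ms/token")
```

Running the same parser over two log files and comparing the resulting lists is enough to confirm that two configurations behave the same.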

Configuration 3: Hybrid --cpu-moe, 1GPU

I then switched to a single-GPU layout on dev1 to check whether 2GPU layer-split overhead might be hurting more than helping.

  podman run --rm \
  --device nvidia.com/gpu=1 \
  -p 8000:8000 \
  --cap-add=SYS_NICE \
  -v /mnt/data/models/models--ubergarm--GLM-5.1-GGUF:/models:ro,Z \
  registry.home.arpa/ik_llama.cpp:latest \
  -m /models/snapshots/a9962c23e50d9c352e09fe0d9cb131026f4e6441/IQ3_KS/GLM-5.1-IQ3_KS-00001-of-00008.gguf \
  --merge-qkv --ctx-size 32768 -ctk q8_0 -ctv q8_0 --parallel 1 --threads 15 --threads-batch 24 -b 8192 -ub 8192 -ngl 999 --cpu-moe -muge -mla 3 -amb 512 --jinja --host 0.0.0.0 --port 8000 --warmup-batch --alias GLM-5.1 --temp 0.7 --top-k 45 --top-p 0.95 --min-p 0.01
  

Startup values on 1GPU:

  llm_load_tensors:  CUDA_Host buffer size = 307110.47 MiB
llm_load_tensors:      CUDA0 buffer size = 14498.95 MiB

llama_kv_cache_init:      CUDA0 KV buffer size = 1491.79 MiB
llama_init_from_model:    CUDA0 compute buffer size = 5418.03 MiB
llama_init_from_model: graph splits = 152

Allocating 299.91 GiB of pinned host memory
done allocating 299.91 GiB in 46485.7 ms
  

The performance difference was tiny:

  # short output, thinking off
prompt eval time =    1297.67 ms /    30 tokens (   43.26 ms per token,    23.12 tokens per second)
       eval time =    3340.94 ms /    63 tokens (   53.03 ms per token,    18.86 tokens per second)

# long output, thinking off
prompt eval time =    1823.10 ms /    43 tokens (   42.40 ms per token,    23.59 tokens per second)
       eval time =  110967.85 ms /  2048 tokens (   54.18 ms per token,    18.46 tokens per second)

# long output, thinking on
prompt eval time =    1696.37 ms /    43 tokens (   39.45 ms per token,    25.35 tokens per second)
       eval time =  222410.35 ms /  4096 tokens (   54.30 ms per token,    18.42 tokens per second)
  
  PID USER DEV     TYPE  GPU        GPU MEM    CPU  HOST MEM
3381 ksh3   1  Compute   0%  23592MiB  24%     0% 309036MiB
  

There are two useful readings here:

  1. graph splits drops from 190 to 152, so the 1GPU graph is structurally simpler
  2. TG remains almost unchanged, meaning CPU expert evaluation is still the real bottleneck

So 2GPU is not making it faster, but 1GPU is not making it meaningfully worse either. That matters because it means freeing dev0 for worker use is entirely reasonable.

Figure: nvtop showing a 1GPU cpu-moe layout with about 23GB VRAM use and little TG change. Moving to 1GPU reduces graph splits, but the bottleneck stays the same and TG barely changes.

Configuration 4: Hybrid --n-cpu-moe 64, 1GPU, -ger

This is where I started moving experts back to GPU. --n-cpu-moe 64 keeps the experts of the first 64 layers on CPU and returns the experts of the later layers to GPU.

  podman run --rm \
  --device nvidia.com/gpu=1 \
  -p 8000:8000 \
  --cap-add=SYS_NICE \
  -v /mnt/data/models/models--ubergarm--GLM-5.1-GGUF:/models:ro,Z \
  registry.home.arpa/ik_llama.cpp:latest \
  -m /models/snapshots/a9962c23e50d9c352e09fe0d9cb131026f4e6441/IQ3_KS/GLM-5.1-IQ3_KS-00001-of-00008.gguf \
  --merge-qkv --ctx-size 32768 -ctk q8_0 -ctv q8_0 --parallel 1 --threads 15 --threads-batch 24 -b 8192 -ub 8192 -ngl 999 --n-cpu-moe 64 -muge -mla 3 -amb 512 --jinja --host 0.0.0.0 --port 8000 --warmup-batch -ger --alias GLM-5.1 --temp 0.7 --top-k 45 --top-p 0.95 --min-p 0.01
  

The effective interpretation is:

  • blk.3-63: CPU-side
  • blk.64-77: GPU-side
  • in practice, the tail 14 MoE layers returned to GPU

Pinned host memory shrank accordingly:

  Allocating 244.02 GiB of pinned host memory
  

TG improved clearly for the first time:

  # short output, thinking off
prompt eval time =    1173.39 ms /    30 tokens (   39.11 ms per token,    25.57 tokens per second)
       eval time =    3109.75 ms /    62 tokens (   50.16 ms per token,    19.94 tokens per second)

# long output, thinking off
prompt eval time =    1690.07 ms /    43 tokens (   39.30 ms per token,    25.44 tokens per second)
       eval time =  103526.50 ms /  2048 tokens (   50.55 ms per token,    19.78 tokens per second)

# long output, thinking on
prompt eval time =    1664.97 ms /    43 tokens (   38.72 ms per token,    25.83 tokens per second)
       eval time =  207284.39 ms /  4096 tokens (   50.61 ms per token,    19.76 tokens per second)
  

nvtop made the GPU-side change obvious:

  PID USER DEV     TYPE  GPU        GPU MEM    CPU  HOST MEM
4153 ksh3   1  Compute  41%  80804MiB  83%  1432% 251746MiB
  
Figure: nvtop showing n-cpu-moe 64 with ger on 1GPU and utilization rising to 41%. Putting the tail 14 layers back on GPU pushes VRAM toward 80GB and raises TG into the 19.9 t/s range.

The gain was +5.8% (18.84 to 19.94 t/s). Not dramatic, but enough to prove that GPU-resident experts do matter.
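
For reference, the improvement percentages in this article can be recomputed from the short-output t/s values against the full cpu-moe baseline. The helper name is illustrative; the inputs are the measured figures.

```python
def tg_gain_pct(baseline_tps: float, tps: float) -> float:
    """Relative TG improvement over the baseline, in percent."""
    return (tps / baseline_tps - 1.0) * 100.0


BASELINE_TPS = 18.84  # full cpu-moe, short output

RUNS = [
    ("cpu-moe 1GPU", 18.86),
    ("n-cpu-moe 64 -ger", 19.94),
    ("-ot head12+tail10 -ger", 21.75),
]

gains = {name: tg_gain_pct(BASELINE_TPS, tps) for name, tps in RUNS}
for name, gain in gains.items():
    print(f"{name}: {gain:+.1f}%")
```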

Failure Case: Hybrid --n-cpu-moe 58, 1GPU -> OOM

I also tried pushing toward a 20-layer GPU expert budget on 1GPU, but --n-cpu-moe 58 failed cleanly with OOM.

  podman run --rm \
  --device nvidia.com/gpu=1 \
  -p 8000:8000 \
  --cap-add=SYS_NICE \
  -v /mnt/data/models/models--ubergarm--GLM-5.1-GGUF:/models:ro,Z \
  registry.home.arpa/ik_llama.cpp:latest \
  -m /models/snapshots/a9962c23e50d9c352e09fe0d9cb131026f4e6441/IQ3_KS/GLM-5.1-IQ3_KS-00001-of-00008.gguf \
  --merge-qkv --ctx-size 32768 -ctk q8_0 -ctv q8_0 --parallel 1 --threads 15 --threads-batch 24 -b 8192 -ub 8192 -ngl 999 --n-cpu-moe 58 -muge -mla 3 -amb 512 --jinja --host 0.0.0.0 --port 8000 --warmup-batch -ger --alias GLM-5.1 --temp 0.7 --top-k 45 --top-p 0.95 --min-p 0.01
  
  ggml_backend_cuda_buffer_type_alloc_buffer: allocating 5418.03 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 5681217536
llama_init_from_model: failed to allocate compute buffers
  

Trying to hold something close to 20 expert layers on a single 96GB card simply does not fit once non-expert weight, KV, and compute buffers are included.
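
The failure is predictable from the numbers already in the logs. This sketch adds the 1GPU non-expert buffer, KV buffer, and compute buffer to a given number of GPU-resident expert layers and compares the sum with the VRAM reported by ggml_cuda_init. It ignores CUDA context overhead and fragmentation, so the real limit is somewhat lower than the estimate suggests.

```python
PER_LAYER_EXPERT_MIB = 1225 + 1638 + 1225  # gate + down + up expert tensors
NON_EXPERT_MIB = 14498.95                  # CUDA0 buffer size in the 1GPU run
KV_MIB = 1491.79                           # CUDA0 KV buffer size
COMPUTE_MIB = 5418.03                      # CUDA0 compute buffer size
VRAM_MIB = 97247                           # per-device VRAM from ggml_cuda_init


def estimated_vram_mib(gpu_expert_layers: int) -> float:
    """Rough VRAM need for a 1GPU layout with N expert layers resident."""
    fixed = NON_EXPERT_MIB + KV_MIB + COMPUTE_MIB
    return fixed + gpu_expert_layers * PER_LAYER_EXPERT_MIB


for layers in (14, 20):
    need = estimated_vram_mib(layers)
    verdict = "fits" if need <= VRAM_MIB else "does not fit"
    print(f"{layers} layers: {need:.0f} MiB -> {verdict}")
```

The 14-layer estimate comes out close to the roughly 80GB that nvtop showed for --n-cpu-moe 64, which suggests the model is a reasonable first check before launching a new layout.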

Configuration 5: Hybrid -ot head12+tail10, 2GPU, -ger

The best result came from explicit placement. I pinned head and tail experts to separate GPUs.

# Head 12 MoE layers (blk.3-14): pin all three expert tensors to CUDA0
OT_ARGS=""
for i in $(seq 3 14); do
  OT_ARGS="$OT_ARGS -ot blk.$i.ffn_gate_exps=CUDA0"
  OT_ARGS="$OT_ARGS -ot blk.$i.ffn_down_exps=CUDA0"
  OT_ARGS="$OT_ARGS -ot blk.$i.ffn_up_exps=CUDA0"
done
# Tail 10 MoE layers (blk.68-77): pin all three expert tensors to CUDA1
for i in $(seq 68 77); do
  OT_ARGS="$OT_ARGS -ot blk.$i.ffn_gate_exps=CUDA1"
  OT_ARGS="$OT_ARGS -ot blk.$i.ffn_down_exps=CUDA1"
  OT_ARGS="$OT_ARGS -ot blk.$i.ffn_up_exps=CUDA1"
done

podman run --rm \
  --device nvidia.com/gpu=all \
  -p 8000:8000 \
  --cap-add=SYS_NICE \
  -v /mnt/data/models/models--ubergarm--GLM-5.1-GGUF:/models:ro,Z \
  registry.home.arpa/ik_llama.cpp:latest \
  -m /models/snapshots/a9962c23e50d9c352e09fe0d9cb131026f4e6441/IQ3_KS/GLM-5.1-IQ3_KS-00001-of-00008.gguf \
  --merge-qkv --ctx-size 32768 -ctk q8_0 -ctv q8_0 \
  --parallel 1 --threads 15 --threads-batch 24 \
  -b 8192 -ub 8192 -ngl 999 \
  --cpu-moe $OT_ARGS -ger \
  -muge -mla 3 -amb 512 \
  --jinja --host 0.0.0.0 --port 8000 \
  --warmup-batch --alias GLM-5.1
  

Observed startup values:

  Allocating 218045 MiB of pinned host memory

GPU0: 57190MiB (58%)
GPU1: 48756MiB (50%)
  

That means:

  • GPU0: head 12 expert layers + earlier attention
  • GPU1: tail 10 expert layers + later attention
  • CPU: the middle 53 expert layers
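
The split above can be cross-checked against the pinned-memory figure. This sketch derives the CPU-side layer set from the two GPU ranges and estimates pinned host memory from the per-layer expert size; the variable names are illustrative.

```python
MOE_LAYERS = range(3, 78)            # blk.3 .. blk.77
GPU0_LAYERS = set(range(3, 15))      # head 12: blk.3 .. blk.14
GPU1_LAYERS = set(range(68, 78))     # tail 10: blk.68 .. blk.77
PER_LAYER_EXPERT_MIB = 1225 + 1638 + 1225
OBSERVED_PINNED_MIB = 218045         # from the startup output above

gpu_layers = GPU0_LAYERS | GPU1_LAYERS
cpu_layers = [i for i in MOE_LAYERS if i not in gpu_layers]
pinned_estimate_mib = len(cpu_layers) * PER_LAYER_EXPERT_MIB

print(f"CPU layers: {len(cpu_layers)} (blk.{cpu_layers[0]} .. blk.{cpu_layers[-1]})")
print(f"estimated pinned: {pinned_estimate_mib} MiB")
print(f"observed pinned:  {OBSERVED_PINNED_MIB} MiB")
```

The estimate is within about 1% of the observed allocation, which confirms that blk.15 through blk.67 are the layers left on CPU.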

This produced the best TG of the whole run, and the best short-prompt PP as well:

  # short output, thinking off
prompt eval time =    1111.85 ms /    30 tokens (   37.06 ms per token,    26.98 tokens per second)
       eval time =    2942.51 ms /    64 tokens (   45.98 ms per token,    21.75 tokens per second)

# long output, thinking off
prompt eval time =    2016.80 ms /    43 tokens (   46.90 ms per token,    21.32 tokens per second)
       eval time =   94650.69 ms /  2048 tokens (   46.22 ms per token,    21.64 tokens per second)

# long output, thinking on
prompt eval time =    1431.92 ms /    43 tokens (   33.30 ms per token,    30.03 tokens per second)
       eval time =  190089.65 ms /  4096 tokens (   46.41 ms per token,    21.55 tokens per second)
  

The nvtop view also shows a slight reduction in CPU pressure:

  PID USER DEV     TYPE  GPU        GPU MEM    CPU  HOST MEM
4483 ksh3   0  Compute  24%  64268MiB  66%  1355% 219523MiB
4483 ksh3   1  Compute  22%  55322MiB  57%  1194% 219523MiB
  
Figure: nvtop showing head 12 and tail 10 layers explicitly placed across 2 GPUs. This was the best-performing layout in the run. Both GPUs have real work, and CPU pressure drops slightly relative to baseline.

Grafana Monitoring

GPU utilization, memory copy, temperature, and power were monitored through DCGM exporter.

Figure: Grafana DCGM GPU Monitoring dashboard showing GPU utilization and memory copy utilization. The utilization pattern changes clearly as expert placement changes. In head+tail mode, the division of labor across 2 GPUs is the clearest.

I also tracked CPU Busy User and host memory behavior via Node Exporter.

Figure: Grafana Node Exporter dashboard showing CPU Busy User and host memory usage. Pinned host memory can climb into the 200-300GB range, so RAM observability matters in real operation too.

Benchmark Summary

TG (eval)

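| # | Configuration | GPUs | GPU expert layers | Short TG (t/s) | 2048t TG (t/s) | 4096t TG (t/s) | Improvement |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | hybrid -ot exps=CPU | 2 | 0 | 18.84 | 18.35 | 18.10 | baseline |
| 2 | hybrid --cpu-moe | 2 | 0 | 18.84 | 18.35 | 18.10 | ±0% |
| 3 | hybrid --cpu-moe | 1 | 0 | 18.86 | 18.46 | 18.42 | +0.1% |
| 4 | hybrid --n-cpu-moe 64 -ger | 1 | 14 (tail) | 19.94 | 19.78 | 19.76 | +5.8% |
| 5 | hybrid -ot head+tail -ger | 2 | 22 (h12+t10) | 21.75 | 21.64 | 21.55 | +15.4% |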

PP (prompt eval)

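| # | Configuration | Short PP (t/s) | Long PP (t/s) | Best PP (t/s) |
| --- | --- | --- | --- | --- |
| 1 | hybrid -ot exps=CPU 2GPU | 22.66 | 23.48 | 23.48 |
| 3 | hybrid --cpu-moe 1GPU | 23.12 | 23.59 | 25.35 |
| 4 | hybrid --n-cpu-moe 64 -ger | 25.57 | 25.44 | 25.83 |
| 5 | hybrid -ot head+tail -ger | 26.98 | 21.32 | 30.03 |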

GPU Resource Comparison

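| # | Configuration | VRAM dev0 | VRAM dev1 | GPU util | CPU% | Pinned RAM |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | hybrid exps=CPU 2GPU | 15236MiB | 14460MiB | 17-19% | 1500% | 299.91 GiB |
| 3 | hybrid cpu-moe 1GPU | - | 23592MiB | 17-19% | 1500% | 299.91 GiB |
| 4 | hybrid n-cpu-moe 64 | - | 80804MiB | 41% | 1432% | 244.02 GiB |
| 5 | hybrid -ot head+tail | 57190MiB | 48756MiB | 22-24% x 2 | 1355% | 218045 MiB (~213 GiB) |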

Improvement Summary

  hybrid full cpu-moe (baseline):    18.84 tok/s
hybrid cpu-moe 1GPU:               18.86 tok/s  (+0.1%)
hybrid n-cpu-moe 64 -ger:          19.94 tok/s  (+5.8%)
hybrid -ot head12+tail10 -ger:     21.75 tok/s  (+15.4%)
  

Analysis

The Wall Is the CPU

As long as experts stay on CPU, the bottleneck is the CPU-side expert FFN evaluation before anything else. EPYC 9175F has 16 cores and a large L3, but that is still not enough to cheaply run 256 experts with 8 active per token. If 85-90% of TG lives there, the number of GPUs hardly matters.

Head Is Hot

Head+tail outperformed tail-only by too much for this to be explained by layer count alone. The gain curve strongly suggests that the head side carries denser or more useful expert activation.

The Value of 2 GPUs Is Capacity

Under full cpu-moe, 2 GPUs do not buy much. Once experts move back to GPU, that changes immediately. This becomes even more important when Qwen3-Coder-Next is also resident, because the orchestrator and workers begin competing directly for the same VRAM budget.

Thinking Is a Bad Trade Here

Thinking does not really change TG; it changes how many tokens get generated. Since the orchestrator mostly needs short judgments and short next-step instructions, spending heavily on reasoning tokens is not worth it.
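
A simple model makes the cost explicit: decode time is just token count times ms/token. The sketch below uses the measured short-output figures (61 answer tokens, 294 reasoning tokens, and the two measured ms/token values). It covers decode only; the 4.6s and 19.1s wall times quoted earlier also include prompt evaluation.

```python
def decode_seconds(n_tokens: int, ms_per_token: float) -> float:
    """Decode-only time for n_tokens at a fixed per-token latency."""
    return n_tokens * ms_per_token / 1000.0


ANSWER_TOKENS = 61
REASONING_TOKENS = 294

answer_only = decode_seconds(ANSWER_TOKENS, 53.09)
with_thinking = decode_seconds(ANSWER_TOKENS + REASONING_TOKENS, 53.59)

print(f"thinking off: {answer_only:.1f}s")
print(f"thinking on:  {with_thinking:.1f}s")
print(f"ratio:        {with_thinking / answer_only:.1f}x")
```

Per-token latency moves by about 1%, while total decode time grows almost sixfold, which is why the token count is the thing to control.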

-ger Is Small but Real

A 1-2% improvement sounds small, but it compounds when the review lane runs hundreds of times. Since the majority of experts still remain CPU-side, any L3-locality win is hard to dismiss.

Final Direction with Qwen3-Coder-Next Included

At this point the problem becomes a full resident layout problem rather than a GLM-only benchmark problem.

Qwen3-Coder-Next KV Efficiency

One reason Qwen3-Coder-Next is practical here is its KV efficiency.

  48 layers: 36 DeltaNet (no KV cache) + 12 Gated Attention (KV heads=2)
256k ctx: KV ~3GB
1M ctx (YaRN): KV ~11GB
  

Compared to a plain attention stack, that makes it much easier to hold long-context workers resident.

--parallel Strategy for Naughty

| --parallel | Per-slot ctx | Practicality for coding |
| --- | --- | --- |
| 1 | 256k | Comfortable, but lower throughput |
| 2 | 128k | Practical baseline |
| 3 | 85k | Heavy tasks start to overflow |
| 4 | 64k | Too tight for agentic coding |

--parallel 2 looked like the practical floor. Most coding tasks fit inside 128k. YaRN leaves room to stretch much further, but the first priority is practical throughput at 128k x 2.
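
The per-slot figures in the table follow from dividing a 256k total context evenly across slots. The sketch below reproduces them; the even split is the assumption the table itself implies, and the function name is illustrative.

```python
TOTAL_CTX_K = 256  # total context in units of 1024 tokens


def per_slot_ctx_k(parallel: int, total_ctx_k: int = TOTAL_CTX_K) -> int:
    """Per-slot context (in k tokens) when the total is split evenly."""
    if parallel < 1:
        raise ValueError("parallel must be >= 1")
    return total_ctx_k // parallel


slots = {n: per_slot_ctx_k(n) for n in (1, 2, 3, 4)}
for n, ctx in slots.items():
    print(f"--parallel {n}: {ctx}k per slot")
```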

Quant Choice: Q4_0 for Qwen3-Coder-Next

| Quant | PPL | Weights | grandpa expert budget |
| --- | --- | --- | --- |
| IQ4_KSS | 8.31 | 39 GiB | 48GB/GPU |
| Q4_0 | 8.25 | 45 GiB | 42GB/GPU |

Q4_0 wins on PPL by only 0.06, but on the worker side even a small code-quality edge is worth paying for. The cost is roughly 2-3 GLM expert layers' worth of budget, or around a second per 1000 tokens. That trade still favors worker quality.
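
A rough way to arrive at those two numbers is sketched below. It takes the 6GB/GPU budget difference, divides by the roughly 4.1GB per expert layer, and prices each lost layer at the average per-layer saving measured in the head12+tail10 run. That last step is an assumption: the layers actually given up may be worth more or less than the average.

```python
BUDGET_GB_PER_GPU = {"IQ4_KSS": 48, "Q4_0": 42}
N_GPUS = 2
EXPERT_LAYER_GB = 4.1                         # per-layer expert weight
MS_SAVED_PER_LAYER = (53.09 - 45.98) / 22     # average from the 22-layer run

lost_gb = (BUDGET_GB_PER_GPU["IQ4_KSS"] - BUDGET_GB_PER_GPU["Q4_0"]) * N_GPUS
lost_layers = lost_gb / EXPERT_LAYER_GB

# ASSUMPTION: each lost layer costs the average measured per-layer saving.
extra_ms_per_token = lost_layers * MS_SAVED_PER_LAYER
extra_s_per_1000_tokens = extra_ms_per_token  # ms/token * 1000 tokens / 1000

print(f"lost budget:  {lost_gb} GB (~{lost_layers:.1f} expert layers)")
print(f"extra cost:   ~{extra_s_per_1000_tokens:.2f} s per 1000 tokens")
```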

Final Layout

  dev0 (96GB):
  naughty-worker0: Q4_0, --parallel 2, 128k x 2
    weights 45GB + KV ~5.5GB + compute ~3GB = ~54GB
  grandpa expert (head side): ~42GB -> 10 layers

dev1 (96GB):
  naughty-worker1: Q4_0, --parallel 2, 128k x 2
    weights 45GB + KV ~5.5GB + compute ~3GB = ~54GB
  grandpa expert (tail+mid side): ~42GB -> 10 layers

grandpa non-expert/KV/compute: ~12GB/GPU (layer-split)
grandpa expert total: ~84GB -> 20 layers
CPU (768GB RAM): remaining 55 expert layers (~225GB pinned)
  

The target here is around 47ms/token, or 21+ t/s. That does not quite reach the full 22-layer head+tail best of 46ms, but if I can keep two workers resident, overall throughput is likely better.

Expert Placement Plan (20 Layers)

  GPU0 (head-heavy):
  blk.3,4,5,6,7,8,9,10 + blk.20,30 = 10 layers

GPU1 (tail-heavy):
  blk.70,71,72,73,74,75,76,77 + blk.50,60 = 10 layers
  

That gives 8 head layers, 8 tail layers, and 4 middle layers. If it can land close to the full 22-layer result, it should be a good balance between worker residency and orchestrator speed.
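
For a non-contiguous plan like this, generating the -ot arguments from a table is less error-prone than hand-editing loops. The sketch below emits overrides in the same blk.N.tensor=DEVICE form used in the head12+tail10 run; the plan dict and function name are illustrative, and this 20-layer layout is a plan, not a measured configuration.

```python
EXPERT_TENSORS = ("ffn_gate_exps", "ffn_down_exps", "ffn_up_exps")

# Planned placement from the layout above (not yet benchmarked).
PLAN = {
    "CUDA0": [3, 4, 5, 6, 7, 8, 9, 10, 20, 30],
    "CUDA1": [70, 71, 72, 73, 74, 75, 76, 77, 50, 60],
}


def build_ot_args(plan: dict[str, list[int]]) -> list[str]:
    """Flatten a device -> layers plan into a list of -ot arguments."""
    args: list[str] = []
    for device, layers in plan.items():
        for layer in layers:
            for tensor in EXPERT_TENSORS:
                args += ["-ot", f"blk.{layer}.{tensor}={device}"]
    return args


ot_args = build_ot_args(PLAN)
print(" ".join(ot_args[:6]), "...")
print(f"{len(ot_args) // 2} overrides for {sum(map(len, PLAN.values()))} layers")
```

The resulting list can be joined into a string and substituted for OT_ARGS in the launch command.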

Operational Impact on the Orchestrator

If I treat a review reply as roughly 1000 tokens, the configuration difference maps directly to turn length.

| Configuration | Time required (1000 tokens) |
| --- | --- |
| cpu-moe baseline (18.8 t/s) | 53s |
| n-cpu-moe 64 (19.9 t/s) | 50s |
| -ot head+tail (21.6 t/s) | 46s |

A 7-second difference does not look huge in isolation, but it compounds quickly across 2 or 3 rounds. On the other hand, a full 22-layer head+tail layout reduces room for workers. That is why Phase 0 still needs real data on whether grandpa speed or naughty throughput contributes more to convergence.
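
The table values and the multi-round effect are easy to recompute. This sketch assumes a fixed 1000-token reply per round and ignores prompt evaluation, which is small for these prompts; the round count of 3 is just the example used above.

```python
REPLY_TOKENS = 1000

CONFIG_TPS = {
    "cpu-moe baseline": 18.8,
    "n-cpu-moe 64": 19.9,
    "-ot head+tail": 21.6,
}


def reply_seconds(tps: float, tokens: int = REPLY_TOKENS) -> float:
    """Decode time for one review reply at a given t/s."""
    return tokens / tps


per_reply = {name: reply_seconds(tps) for name, tps in CONFIG_TPS.items()}
for name, seconds in per_reply.items():
    print(f"{name}: {seconds:.0f}s per reply")

rounds = 3
saved = rounds * (per_reply["cpu-moe baseline"] - per_reply["-ot head+tail"])
print(f"saved over {rounds} rounds: ~{saved:.0f}s")
```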

Future Optimization Options

Custom GGUF Build

One option is to raise quant precision only for hot expert layers:

  # GPU layers (head 8 + tail 8)
blk\.(3|4|5|6|7|8|9|10)\.ffn_down_exps\.weight=iq6_k
blk\.(3|4|5|6|7|8|9|10)\.ffn_(gate|up)_exps\.weight=iq5_ks
blk\.(70|71|72|73|74|75|76|77)\.ffn_down_exps\.weight=iq6_k
blk\.(70|71|72|73|74|75|76|77)\.ffn_(gate|up)_exps\.weight=iq5_ks

# CPU layers (middle)
blk\..*\.ffn_down_exps\.weight=iq4_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq3_ks
  

That could improve strict parse-tier rate by raising quality where the GPU-resident hot layers matter most. The downside is VRAM growth on the order of ~19GB across 16 layers, so it directly trades against worker context.

Expert Activation Profiling

The next serious step is to use --metrics and verbose logging to measure per-layer activation frequency.

IQ4_K on the GLM Side

Moving from IQ3_KS to smol-IQ4_K could improve instruction-following quality. TG might drop by 10-15%, but if parse tier or convergence improves enough, the trade could still come out ahead.

Raw Benchmark Appendix

From here onward, I am preserving the raw benchmark material that fed the article directly. The intent is to keep the actual launch commands, startup logs, and PP/TG extracts searchable later with grep.

A. -ot exps=CPU 2GPU

  ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes, VRAM: 97247 MiB
  Device 1: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes, VRAM: 97247 MiB

Split mode 'graph' is not supported for this model
  => changing split mode to 'layer'

llm_load_print_meta: model type       = 744B.A40B
llm_load_print_meta: model ftype      = IQ3_KS - 3.1875 bpw
llm_load_print_meta: model params     = 753.864 B
llm_load_print_meta: model size       = 320.216 GiB (3.649 BPW)

llm_load_tensors:  CUDA_Host buffer size = 307110.47 MiB
llm_load_tensors:      CUDA0 buffer size =  7378.05 MiB
llm_load_tensors:      CUDA1 buffer size =  7120.90 MiB

llama_init_from_model: n_ctx         = 32768
llama_init_from_model: n_batch       = 8192
llama_init_from_model: n_ubatch      = 8192
llama_init_from_model: flash_attn    = 1
llama_init_from_model: mla_attn      = 3
llama_init_from_model: attn_max_b    = 512
llama_init_from_model: fused_moe     = 1
llama_init_from_model: grouped er    = 0
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad    = 1
llama_init_from_model: graph_reuse   = 1

llama_kv_cache_init:      CUDA0 KV buffer size =   784.15 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   707.64 MiB
llama_init_from_model: KV self size  = 1491.75 MiB

llama_init_from_model:      CUDA0 compute buffer size =  5418.03 MiB
llama_init_from_model:      CUDA1 compute buffer size =  5032.00 MiB
llama_init_from_model:  CUDA_Host compute buffer size =   704.09 MiB
llama_init_from_model: graph nodes  = 10250
llama_init_from_model: graph splits = 190

Allocating 299.91 GiB of pinned host memory
done allocating 299.91 GiB in 51786.7 ms
  
  # short output, thinking off (61 tokens)
prompt eval time =    1323.95 ms /    30 tokens (   44.13 ms per token,    22.66 tokens per second)
       eval time =    3238.59 ms /    61 tokens (   53.09 ms per token,    18.84 tokens per second)
      total time =    4562.54 ms /    91 tokens

# short output, thinking on (355 tokens = 294 thinking + 61 content)
prompt eval time =      55.51 ms /     1 tokens (   55.51 ms per token,    18.02 tokens per second)
       eval time =   19024.07 ms /   355 tokens (   53.59 ms per token,    18.66 tokens per second)
      total time =   19079.58 ms /   356 tokens

# long output, thinking off (2048 tokens)
prompt eval time =    1831.66 ms /    43 tokens (   42.60 ms per token,    23.48 tokens per second)
       eval time =  111609.04 ms /  2048 tokens (   54.50 ms per token,    18.35 tokens per second)
      total time =  113440.71 ms /  2091 tokens

# long output, thinking on (4096 tokens)
prompt eval time =    2459.53 ms /    43 tokens (   57.20 ms per token,    17.48 tokens per second)
       eval time =  226282.06 ms /  4096 tokens (   55.24 ms per token,    18.10 tokens per second)
      total time =  228741.58 ms /  4139 tokens
  

B. --cpu-moe 1GPU

  ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes, VRAM: 97247 MiB

llm_load_tensors:  CUDA_Host buffer size = 307110.47 MiB
llm_load_tensors:      CUDA0 buffer size = 14498.95 MiB

llama_kv_cache_init:      CUDA0 KV buffer size =  1491.79 MiB
llama_init_from_model:      CUDA0 compute buffer size =  5418.03 MiB
llama_init_from_model:  CUDA_Host compute buffer size =   704.09 MiB
llama_init_from_model: graph nodes  = 10250
llama_init_from_model: graph splits = 152

Allocating 299.91 GiB of pinned host memory
done allocating 299.91 GiB in 46485.7 ms
  
  # short output, thinking off (63 tokens)
prompt eval time =    1297.67 ms /    30 tokens (   43.26 ms per token,    23.12 tokens per second)
       eval time =    3340.94 ms /    63 tokens (   53.03 ms per token,    18.86 tokens per second)
      total time =    4638.61 ms /    93 tokens

# long output, thinking off (2048 tokens)
prompt eval time =    1823.10 ms /    43 tokens (   42.40 ms per token,    23.59 tokens per second)
       eval time =  110967.85 ms /  2048 tokens (   54.18 ms per token,    18.46 tokens per second)
      total time =  112790.94 ms /  2091 tokens

# long output, thinking on (4096 tokens)
prompt eval time =    1696.37 ms /    43 tokens (   39.45 ms per token,    25.35 tokens per second)
       eval time =  222410.35 ms /  4096 tokens (   54.30 ms per token,    18.42 tokens per second)
      total time =  224106.72 ms /  4139 tokens
  

C. --n-cpu-moe 64 -ger 1GPU

  ggml_cuda_init: found 1 CUDA devices:

# blk.3 to blk.63: CUDA_Host (61 MoE layers on CPU)
# blk.64 to blk.77: GPU (14 MoE layers)

Allocating 244.02 GiB of pinned host memory
  
  # short output, thinking off (62 tokens)
prompt eval time =    1173.39 ms /    30 tokens (   39.11 ms per token,    25.57 tokens per second)
       eval time =    3109.75 ms /    62 tokens (   50.16 ms per token,    19.94 tokens per second)
      total time =    4283.14 ms /    92 tokens

# long output, thinking off (2048 tokens)
prompt eval time =    1690.07 ms /    43 tokens (   39.30 ms per token,    25.44 tokens per second)
       eval time =  103526.50 ms /  2048 tokens (   50.55 ms per token,    19.78 tokens per second)
      total time =  105216.57 ms /  2091 tokens

# long output, thinking on (4096 tokens)
prompt eval time =    1664.97 ms /    43 tokens (   38.72 ms per token,    25.83 tokens per second)
       eval time =  207284.39 ms /  4096 tokens (   50.61 ms per token,    19.76 tokens per second)
      total time =  208949.37 ms /  4139 tokens
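
The pinned host allocation sizes give a quick sanity check on how much memory one MoE layer's experts occupy. The sketch below is only a back-of-the-envelope estimate: it assumes the pinned allocation is dominated by expert tensors and that every MoE layer is about the same size, neither of which the logs confirm directly. The variable names are mine.

```python
# Pinned host memory reported by the logs (GiB).
FULL_CPU_MOE_GIB = 299.91   # runs A and B: all MoE experts on CPU
N_CPU_MOE_64_GIB = 244.02   # run C: experts of blk.64 to blk.77 moved to GPU
LAYERS_MOVED_TO_GPU = 14    # blk.64 .. blk.77 inclusive

# Assumption: the difference in pinned memory is exactly the moved experts.
gib_per_layer = (FULL_CPU_MOE_GIB - N_CPU_MOE_64_GIB) / LAYERS_MOVED_TO_GPU

# Implied layer counts if every MoE layer has the same expert footprint.
total_moe_layers = round(FULL_CPU_MOE_GIB / gib_per_layer)
cpu_layers_in_run_c = round(N_CPU_MOE_64_GIB / gib_per_layer)

print(f"{gib_per_layer:.2f} GiB per MoE layer")
print(f"{total_moe_layers} MoE layers total, {cpu_layers_in_run_c} on CPU in run C")
```

Under those assumptions each MoE layer costs close to 4 GiB, which is a handy unit when deciding how many layers a given VRAM budget can take back from the CPU.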
  

D. -ot head12+tail10 -ger 2GPU

  ggml_cuda_init: found 2 CUDA devices:

Allocating 218045 MiB of pinned host memory (HOST MEM)

# GPU VRAM:
GPU0: 57190MiB (58%)
GPU1: 48756MiB (50%)
  
  # short output, thinking off (64 tokens)
prompt eval time =    1111.85 ms /    30 tokens (   37.06 ms per token,    26.98 tokens per second)
       eval time =    2942.51 ms /    64 tokens (   45.98 ms per token,    21.75 tokens per second)
      total time =    4054.36 ms /    94 tokens

# long output, thinking off (2048 tokens)
prompt eval time =    2016.80 ms /    43 tokens (   46.90 ms per token,    21.32 tokens per second)
       eval time =   94650.69 ms /  2048 tokens (   46.22 ms per token,    21.64 tokens per second)
      total time =   96667.49 ms /  2091 tokens

# long output, thinking on (4096 tokens)
prompt eval time =    1431.92 ms /    43 tokens (   33.30 ms per token,    30.03 tokens per second)
       eval time =  190089.65 ms /  4096 tokens (   46.41 ms per token,    21.55 tokens per second)
      total time =  191521.57 ms /  4139 tokens
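
To compare runs without copying numbers by hand, the timing lines can be parsed mechanically. This is a minimal sketch that assumes the three-line `prompt eval time` / `eval time` / `total time` format shown in the logs above; `parse_timings` and the regex are my own helpers, not part of any tool.

```python
import re

# Matches lines such as:
#   "       eval time =    2942.51 ms /    64 tokens ( ... )"
TIMING = re.compile(
    r"^\s*(prompt eval|eval|total) time\s*=\s*([\d.]+) ms\s*/\s*(\d+) tokens"
)

def parse_timings(log_text):
    """Return {kind: {ms, tokens, ms_per_token, tokens_per_s}} for one request."""
    timings = {}
    for line in log_text.splitlines():
        match = TIMING.match(line)
        if not match:
            continue
        kind = match.group(1)
        ms = float(match.group(2))
        tokens = int(match.group(3))
        timings[kind] = {
            "ms": ms,
            "tokens": tokens,
            "ms_per_token": ms / tokens,
            "tokens_per_s": 1000.0 * tokens / ms,
        }
    return timings

# Run D, short output, thinking off (copied from the log above).
SAMPLE = """\
prompt eval time =    1111.85 ms /    30 tokens (   37.06 ms per token,    26.98 tokens per second)
       eval time =    2942.51 ms /    64 tokens (   45.98 ms per token,    21.75 tokens per second)
      total time =    4054.36 ms /    94 tokens
"""

parsed = parse_timings(SAMPLE)
print(f"TG: {parsed['eval']['ms_per_token']:.2f} ms/token, "
      f"{parsed['eval']['tokens_per_s']:.2f} t/s")
```

Recomputing ms/token from the raw totals rather than trusting the parenthesized values also catches copy-paste mistakes between runs.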
  

Summary

What this run made clear is that GLM-5.1 speed in the orchestrator lane is decided less by “1GPU vs 2GPU” than by “which experts are moved back onto GPU.” Full cpu-moe is stable, and the GPUs still run part of the path, but the dominant bottleneck is CPU-side expert evaluation, which is why 2 GPUs (53.09 ms/token) and 1 GPU (53.03 ms/token) are indistinguishable. That baseline of roughly 18.8 t/s leaves too little headroom once the orchestrator has to coexist with workers. Moving the hot head+tail layers back onto GPU materially improves TG, to 45.98 ms/token (21.75 t/s) in run D.
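
The relative gains are easy to recompute from the short-output TG numbers in the logs. The sketch below takes run B (full cpu-moe, 1 GPU) as the baseline, which is my reading of how the percentages line up; the dictionary and function names are illustrative only.

```python
# Short-output, thinking-off TG latency per run (ms/token), from the logs.
SHORT_TG_MS = {
    "A: cpu-moe 2GPU": 53.09,
    "B: cpu-moe 1GPU": 53.03,
    "C: n-cpu-moe 64 -ger 1GPU": 50.16,
    "D: -ot head12+tail10 -ger 2GPU": 45.98,
}

def speedup_pct(baseline_ms, candidate_ms):
    """Throughput gain in percent; positive means the candidate is faster."""
    return (baseline_ms / candidate_ms - 1.0) * 100.0

baseline = SHORT_TG_MS["B: cpu-moe 1GPU"]
for name, ms in SHORT_TG_MS.items():
    print(f"{name:32s} {1000.0 / ms:6.2f} t/s  {speedup_pct(baseline, ms):+5.1f}%")
```

Working in throughput terms (baseline divided by candidate) keeps the percentages consistent with the t/s figures rather than with the ms/token deltas.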

The important part in practice is not to stop at “21+ t/s in a standalone benchmark,” but to carry that into a full resident layout with two Qwen3-Coder-Next Q4_0 --parallel 2 workers. Phase 0 still needs real parse-tier, convergence, and force-accept data to tell me whether I should bias further toward grandpa speed or worker quality.

Also, if I were running GLM-5.1 by itself, the current results suggest a strong candidate layout: reserve roughly 30% of the total GPU expert budget for the head and tail layers first, then spread the remaining 70% across the middle layers instead of packing everything at the edges. The current measurements already show that head+tail is clearly hot, but pushing everything to the edges may still leave useful middle-layer coverage on the table.
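
As a rough illustration of that split, the sketch below picks layer indices for a given GPU expert budget. Several things here are assumptions rather than measured choices: the MoE range blk.3 to blk.77, the 20-layer budget, the even head/tail division of the edge share, and the evenly spaced middle picks. `edge_plus_middle_layout` is a hypothetical helper.

```python
def edge_plus_middle_layout(first_moe=3, last_moe=77, budget=20, edge_share=0.30):
    """Choose `budget` MoE layers: edges first, then evenly spaced middle layers."""
    edge_count = round(budget * edge_share)
    head_count = (edge_count + 1) // 2      # assumption: head gets the odd layer
    tail_count = edge_count - head_count

    head = list(range(first_moe, first_moe + head_count))
    tail = list(range(last_moe - tail_count + 1, last_moe + 1))

    # Everything between the reserved edges is the middle pool.
    middle_pool = list(range(first_moe + head_count, last_moe - tail_count + 1))
    middle_count = budget - edge_count
    if not 0 < middle_count <= len(middle_pool):
        raise ValueError("middle budget does not fit the middle pool")

    # Evenly spaced picks, centred within each slice of the pool.
    step = len(middle_pool) / middle_count
    middle = [middle_pool[int(i * step + step / 2)] for i in range(middle_count)]
    return head, middle, tail

head, middle, tail = edge_plus_middle_layout()
print("head  :", head)
print("middle:", middle)
print("tail  :", tail)
```

Uniform spacing is only a placeholder for the weighting; with activation data, the middle picks would follow the measured hot layers instead.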

Even a rough version of that idea seems worth testing. For the middle layers, it would be practical to try just two variants: one biased toward odd middle layers and one biased toward even middle layers, then switch between them depending on workload type. Even without full activation profiling, keeping the hot edges fixed while testing two middle-layer bias patterns looks like a cheap and informative tuning method. It is still a hypothesis, but it seems like a worthwhile next placement experiment.
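
The two middle-layer variants could be generated rather than written by hand. This is a sketch under several assumptions: the fixed edge layers and the 14-layer middle budget are placeholders, and the `blk\.N\.ffn_.*_exps\.` tensor-name pattern is the common convention for expert tensors that I have not verified against this model's GGUF. Splitting layers across CUDA0 and CUDA1 is left out for brevity.

```python
def biased_middle(first, last, count, parity):
    """Pick `count` layers in [first, last] with the given parity (0 even, 1 odd)."""
    candidates = [i for i in range(first, last + 1) if i % 2 == parity]
    if not 0 < count <= len(candidates):
        raise ValueError("count must fit within the candidate layers")
    step = len(candidates) / count
    return [candidates[int(i * step)] for i in range(count)]

def override_pattern(layers, device):
    """Render a 'regex=buffer' string in the style passed to -ot."""
    alternation = "|".join(str(i) for i in sorted(layers))
    return rf"blk\.({alternation})\.ffn_.*_exps\.={device}"

HOT_EDGES = [3, 4, 5, 75, 76, 77]   # hypothetical fixed head+tail layers

odd_middle = biased_middle(6, 74, count=14, parity=1)
even_middle = biased_middle(6, 74, count=14, parity=0)

print("odd :", override_pattern(HOT_EDGES + odd_middle, "CUDA0"))
print("even:", override_pattern(HOT_EDGES + even_middle, "CUDA0"))
```

Keeping the edge list constant between the two strings means any TG difference between the variants can be attributed to the middle-layer choice alone.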