I had been preparing SFT/DPO datasets and LoRA under the assumption that llm-jp/llm-jp-4-32b-a3b-base would run as a dedicated translation role. But after measuring llm-jp-4-32b-a3b-base-NVFP4 on a single GPU, PP/TG was faster than expected. At that point, keeping a resident translator role looked less optimal than running translation batches only when needed.

Model provider: llm-jp on Hugging Face

Command Used

This is the exact launch command used during validation.

  podman run --rm -it \
  --device nvidia.com/gpu=0 \
  --ipc=host \
  -e HF_HOME=/hf \
  -e HF_HUB_OFFLINE=1 \
  -v /mnt/data/models:/hf/hub:ro \
  -p 9000:9000 \
  registry.home.arpa/vllm-openai:v0.18.0-cu130 \
  /hf/hub/llm-jp-4-32b-a3b-base-NVFP4 \
  --host 0.0.0.0 \
  --port 9000 \
  --trust-remote-code \
  --quantization compressed-tensors \
  --served-model-name translator \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.80 \
  --max-num-seqs 8 \
  --max-model-len 32678 \
  --max-num-batched-tokens 1024 \
  --no-enable-prefix-caching
  

I intentionally disabled prefix caching. For this translation workload, fixed-prefix reuse is limited, and I wanted to measure the raw behavior first.

Startup Logs (Key Raw Lines)

First, the startup baseline from runtime arguments:

  (APIServer pid=1) INFO 04-14 08:08:26 [utils.py:233] non-default args: {'model_tag': '/hf/hub/llm-jp-4-32b-a3b-base-NVFP4', 'host': '0.0.0.0', 'port': 9000, 'model': '/hf/hub/llm-jp-4-32b-a3b-base-NVFP4', 'trust_remote_code': True, 'dtype': 'bfloat16', 'max_model_len': 32678, 'quantization': 'compressed-tensors', 'served_model_name': ['translator'], 'gpu_memory_utilization': 0.8, 'enable_prefix_caching': False, 'max_num_batched_tokens': 1024, 'max_num_seqs': 8}
  

Then the model load and init completion block:

  (EngineCore pid=118) INFO 04-14 08:08:35 [gpu_model_runner.py:4481] Starting to load model /hf/hub/llm-jp-4-32b-a3b-base-NVFP4...
(EngineCore pid=118) INFO 04-14 08:08:39 [default_loader.py:384] Loading weights took 3.32 seconds
(EngineCore pid=118) INFO 04-14 08:08:39 [gpu_model_runner.py:4566] Model loading took 18.23 GiB memory and 3.843664 seconds
(EngineCore pid=118) INFO 04-14 08:08:52 [monitor.py:48] torch.compile took 12.35 s in total
(EngineCore pid=118) INFO 04-14 08:08:52 [monitor.py:76] Initial profiling/warmup run took 0.51 s
(EngineCore pid=118) INFO 04-14 08:09:29 [gpu_worker.py:456] Available KV cache memory: 56.62 GiB
(EngineCore pid=118) INFO 04-14 08:09:29 [kv_cache_utils.py:1316] GPU KV cache size: 927,600 tokens
(EngineCore pid=118) INFO 04-14 08:09:29 [kv_cache_utils.py:1321] Maximum concurrency for 32,678 tokens per request: 28.38x
  

Single and Batch Measurements

Single request:

  (APIServer pid=1) INFO 04-14 08:11:32 [loggers.py:259] Engine 000: Avg prompt throughput: 12.5 tokens/s, Avg generation throughput: 25.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
  

Representative batch logs (seq=8, ctx=32768):

  (APIServer pid=1) INFO 04-14 07:34:20 [loggers.py:259] Engine 000: Avg prompt throughput: 353.8 tokens/s, Avg generation throughput: 157.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.4%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO:     10.0.2.100:41472 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     10.0.2.100:41484 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     10.0.2.100:36782 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     10.0.2.100:36788 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     10.0.2.100:36790 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 04-14 07:34:30 [loggers.py:259] Engine 000: Avg prompt throughput: 2768.4 tokens/s, Avg generation throughput: 159.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 04-14 07:34:40 [loggers.py:259] Engine 000: Avg prompt throughput: 276.4 tokens/s, Avg generation throughput: 182.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 04-14 07:34:50 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 182.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.7%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 04-14 07:35:00 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 179.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.9%, Prefix cache hit rate: 0.0%
  

Ranges observed in this run:

  • Prompt throughput: around 350-1360 tok/s (with an observed instant peak of 2768.4 tok/s)
  • Generation throughput: around 152-183 tok/s
  • Prefix cache hit rate: always 0.0% (consistent with --no-enable-prefix-caching)

For translation workloads, this is already practical. max_num_seqs=8 is still conservative, with room to push further for short/medium requests.

Is Decode Throughput Decay Abnormal?

In long generations, decode throughput gradually drops:

  183.2 -> 180.5 -> 178.2 -> 176.4 -> 173.3 -> 170.0 -> 169.3 -> 167.9
169.7 -> 168.9 -> 167.2 -> 165.5 -> 163.7 -> 160.9 -> 159.4 -> 158.0 -> 156.3 -> 154.7 -> 152.2
  

Why I Pivoted the Architecture

Based on measured speed and stability, I switched to:

  1. Remove the translator role from my custom harness
  2. Trigger translation batches from Dagster pipeline events
  3. Reserve VRAM for resident inference roles

Originally I was preparing translation-specific LoRA with SFT/DPO. But measured PP/TG was already fast enough that the cost/benefit of keeping a resident translation role dropped. Unless there is automation that immediately reuses translated output for retraining, translation is not always urgent enough to justify that permanent allocation. With this speed, on-demand batches are usually enough and improve node-wide resource efficiency.

Next Steps

Turning llm-jp-4-32b-a3b-base into a translation-specialized setup may take more effort than expected. So the next checkpoint is to revalidate with thinking:low first.

Assuming reasoning: low|middle|high is selectable by use case, I plan to try this two-stage flow:

  1. First pass to strict Japanese with gemmatranslate4b-it
  2. Fluency pass with llm-jp-32b-a3b-thinking:low

If thinking:low and LoRA adaptation are still difficult to settle for this role, I want to at least establish practical limits for a two-stage hybrid with gemmatranslate-4b-it.

Summary

llm-jp-4-32b-a3b on single-GPU NVFP4 was validated in this run. I also tried --tensor-parallel-size 2, but it failed with Intermediate size padding for w1 and w3 ... not currently supported. I tested non-auto options such as --moe-backend cutlass, but those did not resolve it. Running two instances is a practical workaround, and I did not want to spend more time on that branch in this pass.

The useful outcome was clear: moving from a resident translator role to on-demand translation batches is the better operating direction for now.