Evaluating llm-jp-4-32b-a3b-base-NVFP4 for Translation and Pivoting Away from a Resident Translator Role
A single-GPU vLLM 0.18.0 record for llm-jp-4-32b-a3b-base-NVFP4, and why I switched from SFT/DPO+LoRA-first assumptions to on-demand translation batches in Dagster.
I had been preparing SFT/DPO datasets and LoRA under the assumption that llm-jp/llm-jp-4-32b-a3b-base would run as a dedicated translation role. But after measuring llm-jp-4-32b-a3b-base-NVFP4 on a single GPU, PP/TG was faster than expected. At that point, keeping a resident translator role looked less optimal than running translation batches only when needed.
Model provider: llm-jp on Hugging Face
Command Used
This is the exact launch command used during validation.
podman run --rm -it \
--device nvidia.com/gpu=0 \
--ipc=host \
-e HF_HOME=/hf \
-e HF_HUB_OFFLINE=1 \
-v /mnt/data/models:/hf/hub:ro \
-p 9000:9000 \
registry.home.arpa/vllm-openai:v0.18.0-cu130 \
/hf/hub/llm-jp-4-32b-a3b-base-NVFP4 \
--host 0.0.0.0 \
--port 9000 \
--trust-remote-code \
--quantization compressed-tensors \
--served-model-name translator \
--dtype bfloat16 \
--gpu-memory-utilization 0.80 \
--max-num-seqs 8 \
--max-model-len 32678 \
--max-num-batched-tokens 1024 \
--no-enable-prefix-caching
I intentionally disabled prefix caching. For this translation workload, fixed-prefix reuse is limited, and I wanted to measure the raw behavior first.
Startup Logs (Key Raw Lines)
First, the startup baseline from runtime arguments:
(APIServer pid=1) INFO 04-14 08:08:26 [utils.py:233] non-default args: {'model_tag': '/hf/hub/llm-jp-4-32b-a3b-base-NVFP4', 'host': '0.0.0.0', 'port': 9000, 'model': '/hf/hub/llm-jp-4-32b-a3b-base-NVFP4', 'trust_remote_code': True, 'dtype': 'bfloat16', 'max_model_len': 32678, 'quantization': 'compressed-tensors', 'served_model_name': ['translator'], 'gpu_memory_utilization': 0.8, 'enable_prefix_caching': False, 'max_num_batched_tokens': 1024, 'max_num_seqs': 8}
Then the model load and init completion block:
(EngineCore pid=118) INFO 04-14 08:08:35 [gpu_model_runner.py:4481] Starting to load model /hf/hub/llm-jp-4-32b-a3b-base-NVFP4...
(EngineCore pid=118) INFO 04-14 08:08:39 [default_loader.py:384] Loading weights took 3.32 seconds
(EngineCore pid=118) INFO 04-14 08:08:39 [gpu_model_runner.py:4566] Model loading took 18.23 GiB memory and 3.843664 seconds
(EngineCore pid=118) INFO 04-14 08:08:52 [monitor.py:48] torch.compile took 12.35 s in total
(EngineCore pid=118) INFO 04-14 08:08:52 [monitor.py:76] Initial profiling/warmup run took 0.51 s
(EngineCore pid=118) INFO 04-14 08:09:29 [gpu_worker.py:456] Available KV cache memory: 56.62 GiB
(EngineCore pid=118) INFO 04-14 08:09:29 [kv_cache_utils.py:1316] GPU KV cache size: 927,600 tokens
(EngineCore pid=118) INFO 04-14 08:09:29 [kv_cache_utils.py:1321] Maximum concurrency for 32,678 tokens per request: 28.38x
Single and Batch Measurements
Single request:
(APIServer pid=1) INFO 04-14 08:11:32 [loggers.py:259] Engine 000: Avg prompt throughput: 12.5 tokens/s, Avg generation throughput: 25.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
Representative batch logs (seq=8, ctx=32768):
(APIServer pid=1) INFO 04-14 07:34:20 [loggers.py:259] Engine 000: Avg prompt throughput: 353.8 tokens/s, Avg generation throughput: 157.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.4%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO: 10.0.2.100:41472 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO: 10.0.2.100:41484 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO: 10.0.2.100:36782 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO: 10.0.2.100:36788 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO: 10.0.2.100:36790 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 04-14 07:34:30 [loggers.py:259] Engine 000: Avg prompt throughput: 2768.4 tokens/s, Avg generation throughput: 159.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 04-14 07:34:40 [loggers.py:259] Engine 000: Avg prompt throughput: 276.4 tokens/s, Avg generation throughput: 182.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 04-14 07:34:50 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 182.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.7%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 04-14 07:35:00 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 179.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.9%, Prefix cache hit rate: 0.0%
Ranges observed in this run:
- Prompt throughput: around
350-1360 tok/s(with an observed instant peak of2768.4 tok/s) - Generation throughput: around
152-183 tok/s - Prefix cache hit rate: always
0.0%(consistent with--no-enable-prefix-caching)
For translation workloads, this is already practical. max_num_seqs=8 is still conservative, with room to push further for short/medium requests.
Is Decode Throughput Decay Abnormal?
In long generations, decode throughput gradually drops:
183.2 -> 180.5 -> 178.2 -> 176.4 -> 173.3 -> 170.0 -> 169.3 -> 167.9
169.7 -> 168.9 -> 167.2 -> 165.5 -> 163.7 -> 160.9 -> 159.4 -> 158.0 -> 156.3 -> 154.7 -> 152.2
Why I Pivoted the Architecture
Based on measured speed and stability, I switched to:
- Remove the
translatorrole from my custom harness - Trigger translation batches from Dagster pipeline events
- Reserve VRAM for resident inference roles
Originally I was preparing translation-specific LoRA with SFT/DPO. But measured PP/TG was already fast enough that the cost/benefit of keeping a resident translation role dropped. Unless there is automation that immediately reuses translated output for retraining, translation is not always urgent enough to justify that permanent allocation. With this speed, on-demand batches are usually enough and improve node-wide resource efficiency.
Next Steps
Turning llm-jp-4-32b-a3b-base into a translation-specialized setup may take more effort than expected. So the next checkpoint is to revalidate with thinking:low first.
Assuming reasoning: low|middle|high is selectable by use case, I plan to try this two-stage flow:
- First pass to strict Japanese with
gemmatranslate4b-it - Fluency pass with
llm-jp-32b-a3b-thinking:low
If thinking:low and LoRA adaptation are still difficult to settle for this role, I want to at least establish practical limits for a two-stage hybrid with gemmatranslate-4b-it.
Summary
llm-jp-4-32b-a3b on single-GPU NVFP4 was validated in this run. I also tried --tensor-parallel-size 2, but it failed with Intermediate size padding for w1 and w3 ... not currently supported. I tested non-auto options such as --moe-backend cutlass, but those did not resolve it. Running two instances is a practical workaround, and I did not want to spend more time on that branch in this pass.
The useful outcome was clear: moving from a resident translator role to on-demand translation batches is the better operating direction for now.
