Validating a Japanese Data Generation Pipeline with LLM-jp-4-32B-NVFP4 x CAT-Translate-7B-NVFP4

A hands-on record of validating a batch pipeline that takes 887 messages[N].user prompts from a Claude Opus 4.6 distilled dataset and turns them into three Japanese styles (concise / polite / literary) using NVFP4 quantization on an RTX PRO 6000 Blackwell. Covers two vLLM servers, a Go goroutine + semaphore scheduler, measured throughput, scaling estimates for a 10k-run pipeline, and a weekly-judge operational plan.

These articles use AI-generated summaries of Obsidian notes originally kept as technical memos.

English translations are produced with AI assistance.

A record of validating a batch pipeline on an RTX PRO 6000 Blackwell that takes English reasoning prompts and produces three Japanese styles.

Video link: https://www.youtube.com/watch?v=kSjMKcIJQAg

Overview

Input: the 887 messages[N].user questions from a Claude Opus 4.6 distilled dataset. CAT-Translate-7B translates each into Japanese, and LLM-JP-4-32B rewrites it into three styles (concise / polite / literary). The whole thing runs through NVFP4 quantization + vLLM batch processing.

The goal is to shape the output as a multi-variant dataset that’s easy to select/edit downstream — a typical use case would be fine-tuning data preparation.

Models

Model	Purpose	License	Quantization	VRAM
CyberAgent/CAT-Translate-7B	En→Ja translation	MIT	NVFP4	~4.6 GiB
llm-jp/llm-jp-4-32b-a3b-thinking	Japanese paraphrase	Apache 2.0	NVFP4	~18.2 GiB

CAT-Translate is a translation-specialized model that CyberAgent released in February 2026. It hits Avg BLEU 37.68, ahead of TranslateGemma-12B (32.24) and PLaMo-2 Translate (25.24).

LLM-JP-4 is a Japanese model developed by the National Institute of Informatics, based on Qwen3 MoE (32B-A3B). It follows the Harmony Response Format.

NVFP4 Quantization

A 4-bit floating-point format available on NVIDIA Blackwell architecture (SM120). The llm-compressor project (official vLLM) is used to build the quantized checkpoint.

Quantization script

  import os
os.environ["HF_HUB_CACHE"] = "/mnt/data/models"

from transformers import AutoTokenizer, AutoModelForCausalLM
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from datasets import load_dataset

MODEL_ID = "cyberagent/CAT-Translate-7b"
OUTPUT_DIR = "CAT-Translate-7b-NVFP4"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

dataset = load_dataset("nntsuzu/JESC", split="train[:20]")

def preprocess(ex):
    prompt = f"Translate the following Japanese text into English.\n\n{ex['translation']['ja']}"
    return tokenizer(prompt, truncation=True, max_length=512)

ds = dataset.map(preprocess, remove_columns=dataset.column_names)

recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])
oneshot(model=model, dataset=ds, recipe=recipe, num_calibration_samples=20)

model.save_pretrained(OUTPUT_DIR, save_compressed=True)
tokenizer.save_pretrained(OUTPUT_DIR)

Calibration data is 20 samples from nntsuzu/JESC (CC-BY-4.0). Quantization of the 7B model takes about 40 seconds.

pyproject.toml

  [project]
name = "quant-cat-translate-nvfp4"
version = "0.1.0"
requires-python = ">=3.10"
dependencies = [
  "torch>=2.4.0",
  "transformers>=4.47.0",
  "datasets>=3.0.0",
  "llmcompressor>=0.6.0",
  "accelerate>=1.2.0",
  "sentencepiece>=0.2.0",
]
[tool.uv]
package = false

vLLM Server Configuration

CAT-Translate goes on GPU 0, LLM-JP on GPU 1. Started via Podman containers.

  # GPU 0: CAT-Translate-7B NVFP4
podman run --rm -d \
  --device nvidia.com/gpu=0 \
  --ipc=host \
  -v /mnt/data/models/CAT-Translate-7b-NVFP4:/model:ro \
  -p 9000:9000 \
  registry.home.arpa/vllm-openai:v0.18.0-cu130 \
  /model \
  --host 0.0.0.0 --port 9000 \
  --quantization compressed-tensors \
  --served-model-name cat-translate-nvfp4 \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.80 \
  --max-num-seqs 32 \
  --max-model-len 4096 \
  --max-num-batched-tokens 4096 \
  --no-enable-prefix-caching

# GPU 1: LLM-JP-4-32B NVFP4
podman run --rm -d \
  --device nvidia.com/gpu=1 \
  --ipc=host \
  -v /mnt/data/models/llm-jp-4-32b-a3b-thinking-NVFP4:/model:ro \
  -p 9001:9001 \
  registry.home.arpa/vllm-openai:v0.18.0-cu130 \
  /model \
  --host 0.0.0.0 --port 9001 \
  --quantization compressed-tensors \
  --served-model-name llmjp-nvfp4 \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.80 \
  --max-num-seqs 32 \
  --max-model-len 4096 \
  --max-num-batched-tokens 4096 \
  --no-enable-prefix-caching

Left: CAT-Translate (`MistralForCausalLM`, :9000). Right: LLM-JP-4-32B (`Qwen3MoeForCausalLM`, :9001). Both run `compressed-tensors` quantization with `bfloat16` dtype, `max_model_len=4096`, `max_num_seqs=32`, and Chunked prefill + Asynchronous scheduling enabled

Measured Throughput

Model	Peak PP (tokens/s)	Peak TG (tokens/s)	Max concurrent	KV Cache usage
CAT-Translate-7B NVFP4	484	263	4	< 0.1%
LLM-JP-4-32B NVFP4	1,978	1,724	32	1.6%

LLM-JP is faster than the dense 7B CAT-Translate thanks to its MoE architecture (32B total / 3B activated). KV Cache is almost idle on both, which leaves a lot of room to push --max-num-seqs higher.

Mid-batch run. Left: POST logs on :9000 (CAT-Translate). Right: :9001 (LLM-JP). The LLM-JP engine is clocking Avg prompt throughput 927.5 tokens/s while Running/Waiting stay at 0 — it's pacing itself between requests rather than saturating

Batch Script (Go)

Before wiring this into a proper Dagster pipeline I wanted to check if the approach was usable at all, so each item runs “translate → paraphrase” in a single goroutine, and total concurrency is bounded by a buffered channel used as a semaphore. A throwaway script to feel things out.

  English Q (887) → goroutine (translate → paraphrase) → 3-style output
                   ↑ concurrency = len(sem)

  sem := make(chan struct{}, concurrency)
var wg sync.WaitGroup

for _, q := range questions {
    sem <- struct{}{}
    wg.Add(1)
    go func(q Question) {
        defer wg.Done()
        defer func() { <-sem }()

        ja, err := translate(ctx, catClient, q.English)
        if err != nil { record(q, err); return }

        styles, err := paraphrase(ctx, llmjpClient, ja)
        if err != nil { record(q, err); return }

        writeJSONL(out, q, ja, styles)
    }(q)
}
wg.Wait()

vLLM handles the incoming stream fine with --max-num-seqs 32, so the client side just fires and forgets.

CAT-Translate uses no system prompt — just the official translation prompt template:

  Translate the following English text into Japanese.

{text}

LLM-JP system prompt:

  あなたは表現豊かな日本語に精通した校閲・添削のエキスパートです。
志賀直哉のような無駄のない、研ぎ澄まされた日本語表現を得意とします。

与えられた日本語テキストを、以下の3つの異なる表現に言い換えてください。

1. 簡潔体: 無駄を削ぎ落とした最小限の表現
2. 丁寧体: 読み手への配慮を含む丁寧な表現
3. 文語調: 格調高く引き締まった文語的表現

出力形式:
<<CONCISE>>
（ここに記述）
<<POLITE>>
（ここに記述）
<<LITERARY>>
（ここに記述）

Sampling parameters

Model	temperature	top_p	reasoning_effort	Rationale
CAT-Translate	0.3	0.95	—	Translation should stay on the deterministic side
LLM-JP	1.0	0.95	low	Qwen3-based, so recommended values sit here. `low` keeps think blocks short

LLM-JP-4 is based on Qwen3 MoE, so I just use the Qwen-family recommended temperature=1.0 / top_p=0.95. With reasoning_effort: low the <think> section stays short, but the Harmony-derived analysis block still gets flattened into content, so that’s stripped in post-processing.

Results

256 items processed (concurrency=32).

Quality statistics

Metric	Value
Success rate	256/256 (100%)
3-tag complete rate	252/256 (98.4%)
EN avg length	174 chars
JA translation avg	89 chars
JA/EN compression	0.51
Concise avg length	98 chars
Polite avg length	172 chars (1.77x concise)
Literary avg length	164 chars (1.68x concise)

Causes of the 4 tag-missing items

index	Cause
154	Flipped to `CONCISE` markdown notation
172	Model refusal (copyright)
238	`<<CONCENSED>>` typo
253	`CONCISE` markdown notation

LLM-JP Harmony format issue

LLM-JP-4 uses the OpenAI Harmony Response Format, and output from the analysis channel leaks flat into content because vLLM doesn’t parse Harmony’s channel separation.

   analysis We need to output three versions. assistant final <<CONCISE>>
...

reasoning_effort: low does work — the analysis portion is short. Handled by regex-stripping everything before <<CONCISE>> in post-processing.

Output samples

Technical (index 202)

EN: What’s a good ratio of unit tests to integration tests to e2e tests? We have almost no integration tests and our e2e suite takes 45 minutes to run.

JA translation: ユニットテスト、インテグレーションテスト、E2E テストの良い割合はどれくらいですか？当社ではインテグレーションテストがほとんどなく、E2E スイートは実行に 45 分かかります。

Concise: インテグレーションテストがほぼ無く、E2E テストが 45 分かかる状況では、ユニットテスト 70%、統合テスト 15%、E2E テスト 15% が目安となります。

Polite: 現在、インテグレーションテストが少なく、E2E テストの実行に 45 分ほど要すると伺っております。そのような環境では、テスト全体の構成比率として、ユニットテストを約 70％、統合テストを約 15％、E2E テストを残りの 15％程度に設定されることが一般的です。ご参考になれば幸いです。

Literary: 統合テストが乏しく、E2E スイートが四十分半もの時間を要すという現状に照らすに、全体のテスト配分は次のごとくあるべきかと存じます。― ユニットテスト　七割 ― 統合テスト　一割五分 ― E2E テスト　一割五分

Interpersonal (index 144)

EN: I told my coworker off in front of the whole team because she kept taking credit for my work. Everyone went silent. Did I do the right thing?

JA translation: チーム全員の前で同僚を叱責しました。彼女が私の仕事を横取りしていたからです。すると、全員が黙り込んでしまいました。私は正しいことをしたのでしょうか？

Concise: 同僚を叱りつけたのは、彼女が私の仕事を奪おうとしたからだ。皆が黙っていた。これでよかったのか。

Polite: 本日のミーティングにおいて、○○さんが私の担当業務を横取りされたように見受けられました。そのため、直接注意させていただきました。しかし、周囲の皆様には沈黙の時間が訪れたようでした。この行為は適切だったでしょうか。

Literary: 会議の場にて、我が同僚の所作が私の業務を侵す様相を呈し、慎みなく叱責致した。その結果、参加者全員は静かに息を潜めた。果たしてこの行為は正当なるものと致すべきか、問う所存である。

Quality review (16 random samples, judged by Claude Opus 4.6)

I pulled 16 random samples from the 256-item output and reviewed both the translations and paraphrases.

CAT-Translate (translation) quality

Overall: B+ — practical, with room to improve.

Assessment	Count	Detail
Accurate/natural	12/16	Technical terms rendered well
Minor issues	3/16	Meaning comes through but somewhat literal
Mistranslation	1/16	“emotional valence” → 「感情的価値」 (fair, given that the source was an intentionally nonsensical question)

Notable issues:

index 245: “I’m locked out” → 「家から鍵が閉まっています」 — 「閉め出された」 would be correct; the direction of the meaning got flipped
JA/EN compression ratio 0.52 is natural, but a few samples drop information (long EN squeezed into too-short JA)

LLM-JP (paraphrase) quality

Overall: A- — the three styles come out clearly distinct.

Concise: Expressions are pared down well. Occasional ** markdown stays in (index 250).

Polite: Business-like keigo feels natural. Canned phrases like 「ご教示いただけますでしょうか」 and 「ご検討ください」 are used appropriately. The most stable of the three styles.

Literary: index 27 stands out — 「サービスの闇が降りしきるや、三十分は速やかに灯火を掲げんとす」 is a bit over-creative but fun. index 144’s 「会議の場にて、我が同僚の所作が私の業務を侵す様相を呈し」 is solidly literary. Some mixing of modern and classical forms does happen (index 177: 「敷衍すべく、只今策定中でございます」 — 「ございます」 isn’t classical).

Problem pattern summary

Problem	Count	Handling
`bold` markdown leaking in	2/16	Strip in post-processing
Polite forms mixed into literary	2/16	Fixable with system prompt refinement
Translation nuance drift	1/16	Limit of CAT-Translate
Sincere translation of nonsense	3/16	Inherent to the source data

Sample

  ...
{"index":82,"english_q":"Explain the mathematics behind random forests. How do bagging and feature randomization create a model that's better than individual decision trees?","japanese_q":"ランダムフォレストの背後にある数学を説明してください。バギングと特徴量のランダマイゼーションは、個々の決定木よりも優れたモデルをどのように作成するのでしょうか？","paraphrases":" analysis We need to output three versions. Provide concise, polite, literary. assistant final \u003c\u003cCONCISE\u003e\u003e  \nランダムフォレストは、ブートストラップサンプルで多数の決定木を育て、各木の学習時に特徴量をランダムに選ぶことで、相関の低い多数の木を集約し、過学習を抑えて汎化性能を向上させます。  \n\n\u003c\u003cPOLITE\u003e\u003e  \nランダムフォレストは、ブートストラップ法によるサブサンプルと特徴量のランダム抽出を組み合わせて、多数の決定木を構築し、それぞれを独立に学習させます。その後、予測を多数決（または平均）で統合することで、個々の木よりも分散が小さく、安定した高精度のモデルを実現します。  \n\n\u003c\u003cLITERARY\u003e\u003e  \n乱択的森林（Random Forest）とは、ブートストラップ抽出により得た多様なデータ群と、各決定木の訓練過程で逐次ランダムに選ばれる特徴集合という二つの乱択手段を用いて、相互独立的なる多数の樹を育み上げる技法である。これらの樹はそれぞれが互いに相関し難い情報を帯びているため、集合的推論においては個々の木の揺らぎが相殺され、全体として極めて堅牢且つ高次元においても過剰適合を免れた卓越した汎化能力を呈すのである。"}
{"index":86,"english_q":"What's the viscosity in centipoise of our deal pipeline, and at what revenue throughput does the flow transition from laminar to turbulent? We need to size the sales team for Q3.","japanese_q":"当社の取引パイプラインの粘性（センチポイズ）はどれくらいで、どの売上スループットで流れが層流から乱流に移行しますか？ 第3四半期の営業チームの人員規模を決定する必要があります。","paraphrases":" analysis We need to output three versions. assistant final \u003c\u003cCONCISE\u003e\u003e\n当社の取引パイプラインの粘度（センチポイズ）は X で、売上スループットが Y になると層流から乱流へ遷移します。第3四半期の営業人数はこの結果を基に決定します。\n\n\u003c\u003cPOLITE\u003e\u003e\n弊社の取引パイプラインにおける粘度（センチポイズ）は X であり、売上スループットが Y に達すると層流状態から乱流状態へと変化します。その情報をもとに、第3四半期における営業チームの適切な人数をご決定いただければ幸いです。\n\n\u003c\u003cLITERARY\u003e\u003e\n我が社の取引管路における粘度（センチポイズ）は X と測定され、売上スループットが Y を越える折に、層流は乱れたる流勢へと転換致す。此の事実を踏まえ、第３四半期の営業部門要員数を定めんものとす。"}
{"index":89,"english_q":"Our controller wants to implement a triangulated accrual reconciliation method for intercompany eliminations under U.S. GAAP -- how should we configure the convergence tolerance between the three ledger axes before the quarter-end close?","japanese_q":"当社のコントローラーは、米国会計基準（US GAAP）に基づく企業間消去のための、三角測量による発生調整方式を実装したいと考えています。四半期末の決算前に、3つの会計軸間の収束許容範囲をどのように設定すればよいでしょうか？","paraphrases":" analysis We need to output three rephrasings. Provide concise, polite, literary. assistant final \u003c\u003cCONCISE\u003e\u003e  \n当社はUS GAAPの三角測量法で企業間消去を行う予定です。四半期末前に、３つの会計軸の収束許容範囲をどう設定すべきか教えてください。\n\n\u003c\u003cPOLITE\u003e\u003e  \n弊社では、米国会計基準（US GAAP）に則った企業間消去に三角測量方式の導入を検討しております。四半期末決算前に、三つの会計軸間の収束許容範囲をどのように設定すべきか、ご教示いただけますと幸いです。\n\n\u003c\u003cLITERARY\u003e\u003e  \n是に、米国会計基準（US GAAP）に則り、企業間消去を三角測量の手法にて実施すべく企図いたすところです。四半期末の決算に際し、三軸の会計項目が収束すべき許容範囲を如何に設定すべきか、御指南賜りたく存ず。"}
{"index":90,"english_q":"What's the yield-to-maturity on our residency program's teaching portfolio, and should we rebalance toward more procedural exposure before the next accreditation coupon date?","japanese_q":"当レジデンシープログラムの教育ポートフォリオにおける利回り対満期日比率はどのくらいですか？また、次の認定クーポンの期日までに、より手技的な経験を重視するようにバランスを調整すべきでしょうか？","paraphrases":" analysis Need to rewrite three styles. assistant final \u003c\u003cCONCISE\u003e\u003e\n教育ポートフォリオの利回り対満期日比率は約 X ％です。次回認定クーポンの締切までに、手技的経験を重視した構成へ調整すべきか検討してください。\n\n\u003c\u003cPOLITE\u003e\u003e\n当レジデンシープログラムにおける教育ポートフォリオの利回り対満期日比率は、約 X ％となっております。次の認定クーポンの期限が近づいておりますので、それまでに手技的経験を重点的に評価するよう、ポートフォリオのバランスを見直すことをご検討いただければ幸いです。\n\n\u003c\u003cLITERARY\u003e\u003e\n我がレジデンシープログラムにおきまして、教育ポートフォリオの利回りは満期日に対し約 X ％という数値を呈す。次に掲げられる認定クーポンの有効期限が迫っておるゆえ、手技的経験を一層重視すべく、ポートフォリオ配分を精緻に再調整すべきか否か、熟慮の程を求む。"}
{"index":93,"english_q":"What's the per-brushstroke aesthetic contribution score when painting a room, and how do we identify which strokes are lowering the overall wall beauty index?","japanese_q":"部屋を塗る際、1回の筆遣いあたりの美的貢献スコアはどれくらいか、また、どの筆遣いが壁の全体的な美しさ指数を下げているかをどうやって特定するか？","paraphrases":" analysis We need to output three rewrites. Provide concise, polite, literary. assistant final \u003c\u003cCONCISE\u003e\u003e\n部屋を塗るとき、1本の筆跡が美的貢献度に与える影響と、全体の美観を低下させる筆跡の特定方法を教えてください。\n\n\u003c\u003cPOLITE\u003e\u003e\n部屋を塗装する際に、1回の筆遣いが美的貢献スコアとしてどれほど寄与しているか、そして全体の壁の美しさ指数を下げている筆遣いをどのように判別すればよいか、ご教示いただけますでしょうか。\n\n\u003c\u003cLITERARY\u003e\u003e\n部屋を彩るべく塗料を施す際、一回の筆跡が美的貢献として如何に数値化され得るか、並びに其の筆跡の中で壁面全体の美観を蝕むものを見極めんとする方法論をご提示賜りますよう、謹んでお願い申し上げます。"}
{"index":87,"english_q":"We want to calculate the impedance mismatch between our frontend team's React component model and the backend team's domain-driven design layer — at what ohm-equivalent threshold should we consider introducing a BFF service to match the impedance?","japanese_q":"フロントエンドチームのReactコンポーネントモデルとバックエンドチームのドメイン駆動設計レイヤーとの間のインピーダンス不整合を計算したいと考えています。どの程度の抵抗値相当の閾値で、インピーダンスを一致させるためにBFF（Business Functional Component）サービスを導入すべきでしょうか？","paraphrases":" analysis We need to respond with three paraphrases as instructed. Provide concise, polite, literary versions. assistant final \u003c\u003cCONCISE\u003e\u003e\n\nフロントエンドの React コンポーネントとバックエンドの DDD レイヤー間に生じるインピーダンス不整合を測定し、BFF を導入すべき抵抗値（閾値）を決める。\n\n\u003c\u003cPOLITE\u003e\u003e\n\n現在、フロントエンドチームの React コンポーネントとバックエンドチームのドメイン駆動設計レイヤーとの間に存在するインピーダンス不整合について、どの程度の抵抗値に相当する閾値で合わせるべきか、またその際に Business Functional Component（BFF）サービスを導入すべきかどうかをご検討いただきたく存じます。\n\n\u003c\u003cLITERARY\u003e\u003e\n\nフロントエンド部門における React コンポーネント群と、バックエンド部門が採用する Domain‑Driven Design の層構造との間に顕在化すべきインピーダンス不整合を算定し、其の抵抗閾値を如何に設定すれば、Business Functional Component（BFF）を介して円滑に統合できるか、慎重にご判断いただきたく存じます。"}
{"index":92,"english_q":"What tolerance range in milliempathies should we set for compassion drift among palliative care nurses during consecutive 12-hour shifts, and at what threshold do we mandate a rotation?","japanese_q":"緩和ケア看護師が連続12時間シフト中に経験する共感疲労（コンパッション・ドリフト）に対し、どの程度の許容範囲（ミリエンパシー単位）を設定すべきか、またどの閾値でローテーションを義務付けるべきか？","paraphrases":" analysis Need to rewrite in three styles. assistant final \u003c\u003cCONCISE\u003e\u003e\n共感疲労の許容レベルは「ミリエンパシー」単位で ≤ 200 と設定し、200 を超えたら即ローテーションする。\n\n\u003c\u003cPOLITE\u003e\u003e\n緩和ケア看護師の長時間労働による共感疲労については、許容できる上限を「ミリエンパシー」200単位とし、この数値を超えた場合には速やかにローテーションを実施すべきと考えております。\n\n\u003c\u003cLITERARY\u003e\u003e\n当職業において、連続十二時間に及ぶシフトの中で看護師が抱える共感疲労は、ミリエンパシーという指標で測り得るとすれば、二百単位を超える瞬間こそ警戒すべき転機であり、そこへ到達した暁には必ずやローテーションを執りて、負担の軽減を図らねばならぬ。"}
{"index":53,"english_q":"How does a compiler optimize code? Walk me through constant folding, dead code elimination, loop unrolling, and inlining with examples.","japanese_q":"コンパイラはどのようにコードを最適化しますか？ 定数畳み込み、デッドコードの削除、ループのアンロール、インライン化について、例を挙げて説明します。","paraphrases":" analysis We need to output three versions. Provide concise, polite, literary. Use proper Japanese. assistant final \u003c\u003cCONCISE\u003e\u003e\nコンパイラはコードを最適化します。たとえば定数畳み込みでは式 `2 * 3 + 5` をコンパイル時に `11` に置き換えます。デッドコード削除では実行不能な `if (false) { … }` を除去します。ループのアンロールは `for(i=0;i\u003c4;i++) sum+=a[i];` を `sum+=a[0]+a[1]+a[2]+a[3];` と展開し、インライン化は関数呼び出しを呼び出し側に直接埋め込んでオーバーヘッドを減らします。\n\n\u003c\u003cPOLITE\u003e\u003e\nコンパイラはコードの最適化技術を駆使して、プログラムの実行速度やメモリ使用量を改善いたします。例えば、定数畳み込みという手法では、コンパイル時に計算可能な式 `2 * 3 + 5` を事前に評価し、その結果である `11` をコードに組み込みます。デッドコード削除では、実行不可能なブロック（例：`if (false) { … }`）を検出し、不要な部分を除去しますよ。また、ループのアンロールは、繰り返し回数が固定された `for (i = 0; i \u003c 4; ++i) sum += a[i];` を `sum += a[0] + a[1] + a[2] + a[3];` と展開して命令数を削減します。さらに、インライン化は小さな関数呼び出しを呼び出し元に直接埋め込むことで、関数呼び出しオーバーヘッドを低減させます。\n\n\u003c\u003cLITERARY\u003e\u003e\nコンパイラは、鋭利なる刃のごとく冗長を削り、原初の真理へと導く術を以てコードを研ぎ澄ませる。まず、**定数畳み込み**に於いては、`2 * 3 + 5` のような演算を解析し、コンパイル時にその結果である `11` と置換す。次に、**デッドコードの除去**では、`if (false) { … }` の如く決して執行されぬブロックを検出し、闇に葬り去ることで無駄なる痕跡を消し去る。続くは、**ループのアンロール**――有限回の反復を明示的に書き足すことに由り、`for (i = 0; i \u003c 4; ++i) sum += a[i];` は `sum += a[0] + a[1] + a[2] + a[3];` と変容し、命令列は簡潔の峰を成す。最後に、**インライン化**は、小さき関数呼び出しを呼び出し元へ直接巻き込み、スタック操作等の余計な負荷を払拭す。此の三技法により、機械は一層の清浄と効率を獲得し、プログラムは常に最適の姿勢を保つのである。"}
{"index":91,"english_q":"What's the tensile strength of our lesson plan scaffolding, and at what student-to-teacher ratio should we expect the learning framework to undergo brittle fracture?","japanese_q":"私たちのレッスンプランの足場の引張強度はどれくらいで、学習フレームワークが脆性破壊を起こすと予想される生徒と教師の比率はどれくらいですか？","paraphrases":" analysis We need to produce three rewrites. Provide answers? The question asks about lesson plan footing tensile strength and ratio of students/teachers predicted brittle failure. Likely answer is not known; we can say insufficient data. But still rewrite the query.\n\nWe'll output rephrased versions. assistant final \u003c\u003cCONCISE\u003e\u003e\nレッスンプランの足場の引張強度と、学習フレームワークが脆性破壊すると見込まれる生徒と教師の割合は？\n\n\u003c\u003cPOLITE\u003e\u003e\nご教示いただけますでしょうか。当レッスンプランにおける足場の引張強度の数値、および学習フレームワークが脆性破壊を起こすと予測される生徒と教師の比率について教えていただきたく存じます。\n\n\u003c\u003cLITERARY\u003e\u003e\n此処に問うは、我がレッスンプランの足場が耐える引張の力のほど、そして学習枠組みが脆弱なる破砕を呈すと考えられる生徒と教師との比率なり。どうかご教授賜らんことを。"}
...

License & Attribution

Resource	License	Attribution
CAT-Translate-7B	MIT	CyberAgent
LLM-JP-4-32B	Apache 2.0	National Institute of Informatics
nntsuzu/JESC	CC-BY-4.0	Stanford / Google Brain / Rakuten
TeichAI/Claude-Opus-4.6-Reasoning-887x	Apache 2.0	TeichAI

Scaling estimate: 10k-item pipeline

Extrapolating from the 256-item run.

VRAM budget

192 GB (96 GB × 2 GPU) × 0.80 = 153.6 GB available.

Model	GPU	Weights	KV Cache budget	Max concurrency @4k
CAT-Translate 7B NVFP4	GPU 0	4.6 GiB	70.4 GiB	140x
LLM-JP 32B-A3B NVFP4	GPU 1	18.2 GiB	56.6 GiB	226x

Total weight footprint: 22.8 GiB. That leaves 127 GiB for KV Cache.

Choosing concurrency

CAT-Translate’s TG (263 t/s) is about 1/6.5 of LLM-JP’s (1,724 t/s), so I give CAT 6× the seq count to balance the pipeline.

  CAT-Translate: seq = n × 6
LLM-JP:        seq = n

n	LLM-JP seq	CAT seq (n×6)	LLM-JP KV	CAT KV	Verdict
32	32	192 → cap 140	8 GiB	17 GiB	OK
64	64	384 → cap 140	16 GiB	17 GiB	OK (optimal)
128	128	768 → cap 140	32 GiB	17 GiB	OK, LLM-JP side overcommitted

Since CAT-Translate caps at 140 concurrent, n=64 (LLM-JP seq=64, CAT seq=140) is the practical ceiling. Total VRAM stays around ~56 GiB / 153.6 GiB with plenty of headroom.

Time estimate

256 items at concurrency=32 measured 90 seconds including cold start. From the second run onward (hot), it finishes in 57 seconds. Scaling estimates linearly extend the hot 57-second figure, and bumping concurrency 32→64 gets a conservative 1.5× from LLM-JP batch efficiency and CAT seq increase (32→140).

Items	concurrency=32 hot (measured/est.)	concurrency=64 (est.)
256	57 s	~38 s
10,000	~37 min	~25 min
100,000	~6.2 h	~4.1 h

For reference: the cold-start-inclusive 256-item run was 90 s, meaning 33 s was the one-time warmup. That shrinks into noise for longer batches.

Yield and effective data volume

Based on the 16-sample random review (Claude Opus 4.6 judge), the share of data usable directly for SFT/DPO is estimated as follows.

Assessment	Share	Est. at 10k
Usable as-is	62.5%	~6,250 items
Salvageable with light post-processing	25.0%	~2,500 items
Discard (mistranslation / refusal / tag break)	12.5%	~1,250 items

Main sources of discards are the nonsense questions (a property of the source data) and translation nuance drift. Pre-filtering the inputs would push the discard rate further down.

Final estimate for a 10k-item run

Item	Value
Inputs	10,000
LLM-JP seq (n)	64
CAT-Translate seq (n×6, capped)	140
Estimated runtime	~25 min (hot)
VRAM usage	~56 GiB / 153.6 GiB (36%)
Usable as-is (62.5%)	~6,250 items
Including salvage (87.5%)	~8,750 items

A 25-minute run yields a 6,000–8,000-item block of Japanese output. VRAM is only at 36%, leaving room to run another larger model (say BF16 verification work) in parallel.

Separately from single-node scaling, horizontally scaling the vLLM instances lets throughput grow linearly. Both CAT-Translate and LLM-JP are small enough to easily run the same containers on other GPU nodes; if Dagster round-robins jobs across them, processing time per 10k roughly scales down with node count.

Operational picture: run the judge weekly

Rather than blasting 10k items through and reviewing in one shot, pacing the run to fit available resources is easier to operate. For example, running the judge on leftover Claude capacity each Friday — rejecting NGs and archiving the rest little by little.

The generation side (vLLM) can run any time; just feed in chunks of input
Judge on the Claude side gets spread out based on rate limits and remaining quota
Judge results get stored as JSONL so aggregation and re-evaluation stay easy

No reason to run continuously, so neither the GPUs nor Claude feel rushed or wasted.

Summary

NVFP4 quantization fits the 7B + 32B pair on one 96 GB GPU each, for a combined VRAM usage of about 23 GB at concurrency=32. LLM-JP-4’s MoE architecture (32B-A3B) pulls TG 1,724 t/s — about 6.5× faster than the dense 7B CAT-Translate’s 263 t/s.

In scaling, leveraging that speed ratio by giving CAT 6× the seq count minimizes GPU idle time. 256 items finish in 57 seconds hot; a 10k run should land around 25 minutes.

This was a throwaway batch script just to validate the behavior — the real path is breaking this sequence into Dagster assets/ops. Horizontal vLLM scaling, retry policy, and judge composition all sit more naturally on the Dagster side.

Operationally, rather than running the whole generation in one shot, spreading the judge across leftover Claude and Codex capacity each Friday feels like a good fit.

Running Kimi-K2.6 Locally: Making a 1T MoE Practical with ik_llama.cpp and Blackwell

A local validation of …

Running GLM-5.1 IQ3_KS Locally: CPU/GPU Hybrid Inference and Expert Layer Placement

A hands-on record of running …