LLM Research

Large language model benchmarks, CPU/GPU inference validation, and optimization research.

These articles use AI-generated summaries of Obsidian notes originally kept as technical memos.

English translations are produced with AI assistance.

Django 5 Travel Booking Site Generation Test with Qwen3.5-122B-A10B Local Inference

Testing whether a locally running Qwen3.5-122B-A10B (Q5_K_M) can generate a full-stack Django 5 web …

Running GLM-5.2 (744B-A40B) GGUFs Locally: Did MTP Help? Notes From a Few Quant and Expert-Placement Tests

Notes from running two GLM-5.2 (744B-A40B MoE) GGUF quants (1.630bpw / 2.244bpw) on a dual RTX PRO …

Running DeepSeek-V4-Flash on Two DwarfStar 4 Nodes for Orchestration

A record of loading DeepSeek-V4-Flash IQ2_XXS on two RTX PRO 6000 Blackwell Max-Q GPUs, one …

Step-3.7-Flash-NVFP4 as a Local Orchestrator: Multi-Agent System Development

A local-only multi-agent benchmark using Step-3.7-Flash-NVFP4 as the orchestrator and the familiar …

Gemma 4 31B on vLLM/SGLang: NVFP4/FP8 and MTP Benchmark

A record of running Gemma 4 31B IT on vLLM 0.21.0 and SGLang gemma4-mtp, comparing NVFP4/FP8 block …

Running MiMo V2.5 Pro IQ2_S Locally: RTX PRO 6000 Blackwell x1/x2 Benchmark

Running the Xiaomi MiMo V2.5 Pro 1.02T MoE IQ2_S GGUF on llama.cpp CUDA13 with RTX PRO 6000 …

DwarfStar 4 × RTX PRO 6000 Blackwell: DeepSeek V4 Flash Q2 Reaches 43 tok/s

A first look at antirez's DwarfStar 4 inference engine on an NVIDIA RTX PRO 6000 Blackwell Max-Q …

Measuring Qwen3.6-27B NVFP4+MTP on vLLM: ~190 tok/s TG on Dual RTX PRO 6000 Blackwell Max-Q

Running Qwen3.6-27B-Text-NVFP4-MTP on vLLM v0.19.2rc1 with MTP speculative decoding across dual RTX …

Running DeepSeek-V4-Flash with a llama.cpp WIP Branch: First Local Inference on Dual Blackwell Max-Q 96GB GPUs

A first-pass validation of DeepSeek-V4-Flash (284B MoE / 13B active) on dual RTX PRO 6000 Blackwell …

Qwen3.6-27B-FP8: Role-Specific Fine-Tuning Strategy and Integration into My Agent Stack

Running Qwen3.6-27B-FP8 on RTX PRO 6000 Blackwell, with measured SGLang performance and a …

Running Kimi-K2.6 Locally: Making a 1T MoE Practical with ik_llama.cpp and Blackwell

A local validation of Kimi-K2.6 (1T MoE, 384 experts × 8 active) on RTX PRO 6000 Blackwell Max-Q …

Validating a Japanese Data Generation Pipeline with LLM-jp-4-32B-NVFP4 x CAT-Translate-7B-NVFP4

A hands-on record of validating a batch pipeline that takes 887 messages[N].user prompts from a …

Running GLM-5.1 IQ3_KS Locally: CPU/GPU Hybrid Inference and Expert Layer Placement

A hands-on record of running GLM-5.1 IQ3_KS (744B MoE) on a homelab with dual RTX PRO 6000 Blackwell …

Qwen3.5-397B-A17B Validation: Making 55 t/s and 262k Tool-Use Loops Practical on 2x Blackwell 96GB

Validation log for Qwen3.5-397B-A17B (Q4_K_M, 227.5 GiB) on dual RTX PRO 6000 Blackwell 96GB GPUs. …

Running MiniMax-M2.7 (229B MoE) on 2x Blackwell 96GB: 71.9 t/s on Average, but No Commercial Use

Record of running MiniMax-M2.7 (229B MoE, smol-IQ3_KS) locally on dual RTX PRO 6000 Blackwell 96GB …

Optimizing a GLM-5.1 + Qwen3-Coder-Next Stack: Orchestrator TG Benchmarks and Final Layout Design

A benchmark record for running GLM-5.1 (744B MoE, IQ3_KS) as the familiar orchestrator. Compares …

Designing and Implementing a Dagster Conversation Lineage, Evaluation, and Dataset Generation System

A design record for building three asset groups on top of the existing agent-gateway Dagster …

Evaluating Qwen3.5 Coding Ability on a Static Dental Clinic Site

A coding test where Qwen3.5 built a six-page dental clinic site with HTML, Tailwind CSS, and …

Planning a GPU/CPU Division for Local LLMs, and the Reality of Daily Trial and Error

On a Blackwell 96GB + EPYC 9175F workstation, I planned to split CPU into a Dagster idempotent …

Designing Bilingual System Prompts for the PLAMO-translate AI MODEL

Full text and design rationale for a pair of system prompts targeting the PLAMO-translate AI MODEL. …

LTX-2 Video Generation Prompt Engineering: From 36-Scene Horror to Cinematic Continuity Pipelines

Structured prompt specifications for LTX-2 video generation. Covers the 36-scene horror scenario …

How I'd Choose a Daily Quantization Setup for Hermes-4.3-36B

Comparing Hermes-4.3-36B across BF16, FP8, and nvfp4 on a Blackwell GPU. Not just raw throughput — …

The Reality of 40B Dense Models: What Running IQuest-Coder-V1-40B on CPU/GPU/Aider Actually Showed

IQuest-Coder-V1-40B-Instruct (Dense 40B) tested across CPU Q5_K_M, GPU NVFP4, and Aider whole-edit. …

What I Learned from Running Command-A Reasoning 08-2025 Inside an Aider Coding Loop

A hands-on evaluation of command-a-reasoning-08-2025-nvfp4 inside an Aider coding loop, using Go …

Reworking a Local AI Coding Environment Around Serena MCP

Designing a local AI coding environment around Serena MCP that connects Obsidian notes to VS Code …

Where GLM-4.7-Flash Uncensored Helps and Where It Becomes Dangerous

An evaluation of uncensored GLM-4.7 Flash for defensive security work. The model is useful for …

Why IQuest-Coder Loop-Instruct Still Feels Slow in Aider

A breakdown of why IQuest-Coder-V1-40B-Loop-Instruct feels slow in aider despite fast prefill. …

Why MCP Worked in VSCode Remote SSH but Not in Zed

A record of why the same MCP configuration worked in VSCode Remote SSH but failed in Zed for a …

Why EPYC 9175F's 512MB L3 Cache Accelerates MoE Inference: Hypothesis Validation with a 1T Model

Running Kimi-K2.5 (1T MoE) CPU-only on AMD EPYC 9175F to validate the hypothesis that massive L3 …

MiniMax-2.5 (229B MoE) Expert Offload and Web Generation: IQ5_K to IQ3_S

Complete record of running the 229B MoE model MiniMax-2.5 on EPYC 9175F + RTX PRO 6000. Expert …

Qwen3.5-397B IQ4_NL Measured: 22.5 tok/s Average from 28 Runs, Hybrid Offload Config and 400B-Class MoE Daily Viability

Qwen3.5-397B-A17B (397B total / 17B active MoE) deployed with IQ4_NL quantization on EPYC 9175F + …

Llama-4-Scout-17B-16E Measured: CPU Q6_K 17 tok/s vs GPU NVFP4 60 tok/s, Cache Strategy and 100K Context Boundary

Llama-4-Scout (17B active / 16-expert MoE) benchmarked on EPYC 9175F CPU Q6_K inference and RTX PRO …

1T MoE Kimi-K2.5 CPU Inference: Thread Optimization Through Long Context Operations

Complete CPU inference benchmark of Kimi-K2.5 (1.03T MoE, Q4_K_S/Q4_K_M) on EPYC 9175F. Why th=13 is …

Llama-4-Maverick-17B-128E CPU Inference: Q4_K_M vs Q8_0 Speed-Quality Trade-off Measured

Llama-4-Maverick (17B active / 128-expert MoE) CPU inference on EPYC 9175F, comparing Q4_K_M and …

Qwen3-Coder-Next 80B in Three Modes: BF16 CPU / IQ4_NL Hybrid / NVFP4 GPU Measured

Qwen3-Coder-Next (~80B MoE) benchmarked across BF16 CPU inference (7.59 tok/s), IQ4_NL Hybrid GPU …

GLM-4.7-Flash IQ5_K Benchmark: CPU vs Hybrid vs Full GPU Performance Comparison

Benchmarking GLM-4.7-Flash (IQ5_K GGUF) across CPU-only, MoE Expert Offload (Hybrid), and Full GPU …

Why DeepSeek-V3.2 Appears Slower Than Kimi-K2.5: Prompt Cache Mismatches and TG Bottleneck Analysis

Analyzing why DeepSeek-V3.2 decode speed plateaus at 14-15 tok/s in llama.cpp, traced to prompt …

Qwen3.5-397B Autonomous Code Generation: From Dental Clinic Sites to Django CMS Foundations

Two code generation validations using the 400B-class MoE model Qwen3.5-397B. One-shot generation of …