LTX-2 Video Generation Prompt Engineering: From 36-Scene Horror to Cinematic Continuity Pipelines

Structured prompt specifications for LTX-2 video generation. Covers the 36-scene horror scenario template with mandatory dialogue, cinematic shot design principles, and multi-scene visual continuity control for production pipelines.

These articles use AI-generated summaries of Obsidian notes originally kept as technical memos.

English translations are produced with AI assistance.

Summary

When producing watchable video from LTX-2, prompt structuring is key. Generating a single pretty shot and chaining 36 scenes into a coherent story require fundamentally different engineering. This article consolidates three specifications refined through production use:

36-scene horror scenario generation template — system prompt for LLM-driven scenario writing
Cinematic prompt design principles — structure for maximizing individual shot quality
Multi-scene continuity control — pipeline design for preventing inter-clip visual breakage

Prerequisites

Video generation engine: LTX-2 (5-second clip generation)
Resolution: 3840x2160 (4K), 24fps
Scenario generation: Local LLM (Mixtral, etc.) generating STORY→SCENES→LTX2_PROMPTS in one pass
Pipeline: Per-scene generation → ffmpeg concatenation → audio composited downstream
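For a sense of scale, the defaults above imply the totals below. This is a back-of-envelope sketch that assumes every one of the 36 clips runs the full 5 seconds and is concatenated without trims or transitions; the variable names are illustrative.

```python
# Back-of-envelope totals for one episode under the stated defaults.
CLIP_SECONDS = 5
FPS = 24
SCENE_COUNT = 36  # the fixed scene count used by the horror template

frames_per_clip = CLIP_SECONDS * FPS            # frames LTX-2 must produce per shot
episode_seconds = CLIP_SECONDS * SCENE_COUNT    # total running time in seconds
episode_frames = frames_per_clip * SCENE_COUNT  # total frames across all shots

print(frames_per_clip, episode_seconds, episode_frames)
```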

1. 36-Scene Horror Scenario Generation Template

Design Intent

When experimenting with Monstral-123B (NVFP4 quantization) for horror scenario generation, we encountered critical instabilities:

Scene counts varied between 20–50; achieving a fixed 36-scene structure was impossible
Narrative structure collapsed; act structure (setup/escalation/resolution) became unclear
Dialogue disappeared midway; mute scenes proliferated
Endings became ambiguous or devolved into abrupt “it was all a dream” tropes

These weren’t LLM performance failures — they stemmed from the absence of explicit output format constraints. By designing a system prompt that rigorously constrains output schema and rules, we enabled mid-size models to reliably generate stable 36-scene horror narratives.

Output Schema (Fixed)

LLM output is restricted to exactly these 5 sections. No extra commentary.

STORY_PROMPT — narrative skeleton (one paragraph)
LABELS — genre, tone, motifs
OUTLINE — 5-stage plot summary
SCENES_36 — Scene 01 through Scene 36 (1-3 sentences each)
LTX2_PROMPTS_36 — Shot 01 through Shot 36 (generation prompt per shot)

Scenes must run from Scene 01 to Scene 36 without gaps or extras. That constraint is tight for the model, but it makes downstream parsing straightforward.
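A minimal sketch of the kind of downstream check this constraint enables. It assumes each section name appears at the start of its own line and that scenes are written as "Scene NN:"; the exact output layout and the function name `find_schema_errors` are assumptions, not part of the original spec.

```python
import re

SECTIONS = ["STORY_PROMPT", "LABELS", "OUTLINE", "SCENES_36", "LTX2_PROMPTS_36"]


def find_schema_errors(output: str, expected_scenes: int = 36) -> list[str]:
    """Return human-readable problems; an empty list means the output passed."""
    errors = []

    # Assumption: each section name starts its own line, in the fixed order.
    header = re.compile(r"^(%s)\b" % "|".join(SECTIONS), re.MULTILINE)
    found = header.findall(output)
    if found != SECTIONS:
        errors.append(f"sections {found} do not match {SECTIONS}")

    # Assumption: scenes are written as "Scene NN:" at the start of a line.
    numbers = [int(n) for n in re.findall(r"^\s*Scene (\d+):", output, re.MULTILINE)]
    if numbers != list(range(1, expected_scenes + 1)):
        errors.append(f"scene numbers are not 1..{expected_scenes} in order: {numbers}")

    return errors
```

An output that fails this check can be rejected and regenerated before any LTX-2 time is spent on it.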

Scene Allocation

Scenes 01–08 : Daily life and subtle unease (first whispers, minor anomalies)
Scenes 09–20 : Escalation and repetition (voices repeat, reflections speak back)
Scenes 21–28 : Investigation and confrontation (identifying the cause, revealing truth)
Scenes 29–36 : Resolution and closure (cause identified → concrete action → safety restored)

Dream endings, “it was all imagination” dismissals, and unresolved ambiguity are prohibited.
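The allocation can be kept as data so that validators and prompt builders agree on which act a scene belongs to. This is a sketch; the `ACTS` table and `act_for_scene` helper are hypothetical names, and only the scene ranges and labels come from the allocation above.

```python
# (first scene, last scene, act label), taken from the allocation table.
ACTS = [
    (1, 8, "daily life and subtle unease"),
    (9, 20, "escalation and repetition"),
    (21, 28, "investigation and confrontation"),
    (29, 36, "resolution and closure"),
]


def act_for_scene(scene_number: int) -> str:
    """Return the act label for a scene number in 1..36."""
    for first, last, label in ACTS:
        if first <= scene_number <= last:
            return label
    raise ValueError(f"scene {scene_number} is outside the 36-scene structure")
```

A resolution check, for example, would only need to inspect scenes for which `act_for_scene` returns the final label.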

Required Elements Per Scene

Scene 04:
Visual: A woman stands before a mirror. Camera slowly zooms into her reflection.
Whisper: "You see it too, don't you?"
Sound: Fluorescent light flickers with a low humming sound.

Visual: what is on screen
Camera: framing or camera movement
Dialogue: at least one spoken line (Dialogue / Whisper / Voice(V.O.) / Heard Voice / Inner Voice)

No scene may be silent. For scenes where dialogue is unnatural, use whispered words, distorted voices, internal monologue, reflected/mirrored speech, or remembered phrases. Thirty-six scenes with no vocal event would be monotonous to watch and difficult to edit. Requiring speech in every beat forces audio rhythm into the design alongside image progression.
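The no-silent-scene rule is easy to check mechanically. The sketch below assumes each scene body uses "Label: text" lines as in the Scene 04 example and that the five vocal labels are spelled exactly as listed; `silent_scenes` is a hypothetical helper name.

```python
import re

VOCAL_LABELS = ("Dialogue", "Whisper", "Voice(V.O.)", "Heard Voice", "Inner Voice")

# Matches a line such as 'Whisper: "..."' that has some text after the colon.
_VOCAL_LINE = re.compile(
    r"^\s*(?:%s)\s*:\s*\S" % "|".join(re.escape(label) for label in VOCAL_LABELS),
    re.MULTILINE,
)


def silent_scenes(scenes: dict[int, str]) -> list[int]:
    """Return the scene numbers whose body contains no vocal line."""
    return sorted(n for n, body in scenes.items() if not _VOCAL_LINE.search(body))
```

Any scene number returned here can be sent back to the LLM for a rewrite rather than passed on to video generation.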

Setting Rules

Japan (urban or suburban), deliberately vague location names
Cultural elements: quiet residential streets, small houses, mirrors, shrines, corridors, sliding doors, evening TV noise, cicadas, wind, fluorescent lights
No graphic gore or explicit violence. Fear is built through psychological means: whispers, shadows, reflections, repetition, isolation

Named locations constrain the generation and pull attention away from atmosphere. What the design wants is culturally legible unease, not location tourism.

Horror Constraints

Permitted sources of fear are bounded:

Whispers, shadows, reflections, repetition
Voices that cannot be confirmed as real
Emotional isolation

Not permitted: graphic gore, explicit violence, dream-only endings, “it was all imagination” endings, unresolved ambiguity.

The intended output is not shock horror. It is low-amplitude psychological horror that can survive scene-by-scene video generation without collapsing into spectacle or incoherence. For LTX-2, low-amplitude anomalies are less likely to break individual clips.

Prohibition Block

These are stated bluntly because ambiguity allows escape.

The ending must be resolved.
No dream ending.
No insanity-only explanation.
No ambiguity.

Story Flavors (Presets)

Pre-built flavor templates the LLM can auto-select when user input is vague:

Flavor	Premise
`trapped_room_mummy`	Locked in before summer break, only traces remain
`water_reflection`	Water surface reflections gradually rewrite reality
`station_beep_phrase`	Station chimes become words that guide people
`fox_shrine_wrath`	Removal of a small shrine triggers quiet anomalies

Each flavor includes a logline, key beats, and a resolution pattern. They serve as a bridge between vague user input and structured scenario generation. They also make it harder for the model to wander into unresolved endings because the resolution pattern is specified before generation begins.

story_core Template

Defines the narrative skeleton using placeholders.

{SETTING}
{PROTAGONIST}
{HELPER}
{FLAVOR}
{RESOLUTION_PATTERN}
{MOTIFS}

Permitted fear sources at this layer: psychology, sound, shadow, water, repetition. Restraint around gore is specified. This layer does not address cinematography. Its job is to keep the narrative shape stable.
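A minimal sketch of filling the placeholders with Python's built-in string formatting. Only the six placeholder names come from the spec; the template wording in `STORY_CORE` and the helper `render_story_core` are hypothetical.

```python
import string

# Hypothetical template text; only the six placeholder names come from the spec.
STORY_CORE = (
    "Setting: {SETTING}\n"
    "Protagonist: {PROTAGONIST}\n"
    "Helper: {HELPER}\n"
    "Flavor: {FLAVOR}\n"
    "Resolution pattern: {RESOLUTION_PATTERN}\n"
    "Motifs: {MOTIFS}"
)


def render_story_core(template: str, values: dict[str, str]) -> str:
    """Fill every placeholder, failing loudly if any value is missing."""
    needed = {name for _, name, _, _ in string.Formatter().parse(template) if name}
    missing = sorted(needed - set(values))
    if missing:
        raise KeyError(f"missing placeholders: {missing}")
    return template.format(**values)
```

Failing on a missing value, rather than leaving a literal `{HELPER}` in the system prompt, keeps a half-filled skeleton from reaching the LLM.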

LTX2_PROMPTS_36 Shot Definition

  Shot 01:
  DURATION=5s FPS=24 RES=3840x2160
  PROMPT: <concrete visual instruction, short>
  NEGATIVE: low quality, blurry, distorted hands, deformed face, gore, blood
  CAMERA: handheld POV / slow push-in / over-the-shoulder
  LIGHTING: low-key, streetlight, fluorescent hum
  AUDIO_CUE: <ambient, dialogue, silence>
  CONTINUITY: prev=none next=shared_object:mirror
  SEED_HINT: episode_seed+01

Avoid face close-ups; prefer hands, backs, silhouettes. Telop/subtitles are composited downstream via ffmpeg.

That fielded structure is more useful than a single prompt paragraph. It makes individual fields patchable, shareable across shots, and machine-validatable. Centralizing NEGATIVE, swapping SEED_HINT, or auto-generating CONTINUITY are all straightforward with this layout.
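To illustrate why the fielded layout is machine-friendly, here is a sketch of a parser for one shot block. It assumes the layout shown above: a "Shot NN:" header, one line of KEY=VALUE parameters, then "KEY: value" lines. The function name `parse_shot` is illustrative.

```python
import re


def parse_shot(block: str) -> dict[str, object]:
    """Parse one shot definition block into a flat dictionary."""
    lines = [ln.strip() for ln in block.strip().splitlines() if ln.strip()]
    header = re.fullmatch(r"Shot (\d+):", lines[0])
    if header is None:
        raise ValueError(f"expected 'Shot NN:' header, got {lines[0]!r}")

    shot: dict[str, object] = {"index": int(header.group(1))}
    for line in lines[1:]:
        if ":" not in line:
            # Parameter line such as "DURATION=5s FPS=24 RES=3840x2160".
            shot.update(pair.split("=", 1) for pair in line.split())
        else:
            # Field line such as "CAMERA: slow push-in"; split on the first colon only.
            key, _, value = line.partition(":")
            shot[key.strip()] = value.strip()
    return shot
```

Once shots are dictionaries, replacing NEGATIVE with a shared value or rewriting SEED_HINT is a one-line assignment per shot.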

LTX-2 Style Defaults

DURATION  : 5 seconds
FPS       : 24
RESOLUTION: 3840x2160 (4K)
STYLE     : cinematic, realistic, subtle horror, Japanese suburban atmosphere
CAMERA    : handheld POV / slow push-in / over-the-shoulder (avoid face close-ups)
LIGHTING  : low-key, streetlight, fluorescent hum, rainy reflections

With a shared negative prompt and a seed strategy added, this crosses from creative prompting into real pipeline design.

2. Cinematic Prompt Design Principles

Shot-First Thinking

The first thing in every prompt is “where the camera is.” Abstract phrasing like “the camera shows” is banned. Use explicit cinematography language:

static camera / slow pan / close framing / wide interior shot / shallow depth
Specify camera position, field of view, and spatial compression

A position and lensing cue are more useful than an abstract narration of visibility. This is not stylistic preference — it is a stability concern for the diffusion process.

Environment Anchoring (Visual Mood)

Embed lighting, color palette, and surface textures as shared parameters across prompts.

Element	Examples
Lighting	`warm`, `cold`, `flickering`, `fluorescent`, `natural`, `overcast`
Color palette	`muted pastels`, `warm golds`, `sickly greens`, `neutral grays`
Surface textures	`fogged glass`, `worn metal`, `dust in light beams`
Mood	`cozy`, `oppressive`, `sterile`, `playful`

This anchors the diffusion process and stabilizes results across scenes.

Action as Continuous Physical Sequence

Write actions as natural progression, not bullet points:

  Leaning into frame → Hesitating near the handle → Exhaling slowly to fog the glass

Use arrows (→) to show step-by-step motion. Without this, LTX-2 produces “teleportation” — the model cannot predict intermediate frames when motion is abbreviated. Obsessively describing how the body moves through space is an effective approach to smooth video.

Character Definition Through Behavior

Define characters through posture, micro-expressions, timing, and small physical habits rather than long descriptions. Include age, clothing, and emotional state expressed through motion. The spec references Pixar-style acting as a useful shorthand. Static attribute lists produce characters that stand still. Behavioral specs produce characters that perform.

Camera Movement as Narrative Tool

Slow pan: observation and tension building
Static shots: tension or comedic timing
Avoid sudden or unmotivated movement unless it supports a punchline
Maintain consistent pan direction and speed across adjacent clips
When continuing from the previous scene, explicitly state "Continue the same pan"

That phrase is a reusable control token. It lets adjacent clips inherit motion logic without repeating the full camera description.

Audio Is Mandatory

Audio is not decoration — it defines timing:

Ambient Sound: oven humming, steam, distant noise
Dialogue Beats: quoted dialogue, [Beat] and [Silence] markers for pauses
Speaking mode: whispering vs speaking vs muttering
Music must be explicitly included or excluded

Prompting with only visual instructions tends to produce clips where sound feels disconnected from dramatic timing because it was not part of the design.

Prompt Structure Template

The recommended seven-step order is fixed.

Shot Establishment
Environment & Lighting
Character Position & Emotion
Core Action Sequence
Camera Movement (with timing)
Audio & Dialogue
Ending Visual Beat (pose, pause, reaction)

The "Ending Visual Beat" step is worth noting as a separate item. In video generation, the last few frames determine much of the perceived quality of a clip. Making the ending an explicit planning step is the practical approach.
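One way to enforce the fixed order is to hold the seven steps in a small data structure and render them in declaration order. This is a sketch; the class and field names are illustrative, and joining steps with blank lines is an assumption about how the final prompt text is laid out.

```python
from dataclasses import dataclass, fields


@dataclass
class ShotPrompt:
    # Field order mirrors the seven-step order; names are illustrative.
    shot_establishment: str
    environment_and_lighting: str
    character_position_and_emotion: str
    core_action_sequence: str
    camera_movement: str
    audio_and_dialogue: str
    ending_visual_beat: str

    def render(self) -> str:
        """Join the seven steps in order, refusing to emit a prompt with a blank step."""
        parts = {f.name: getattr(self, f.name).strip() for f in fields(self)}
        blank = [name for name, text in parts.items() if not text]
        if blank:
            raise ValueError(f"empty prompt steps: {blank}")
        return "\n\n".join(parts.values())
```

Because a blank step raises an error, a prompt cannot silently skip the audio or ending beat.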

Example: Cinematic Prompt (approx. 600 words)

The spec’s main example starts with "INT. OVEN – DAY." and demonstrates every rule in a single prompt.

INT. OVEN – DAY.
Static camera positioned from inside the oven, looking outward through the slightly fogged glass door. The frame is tight and claustrophobic, bordered by dark metal edges. Warm golden light fills the interior, reflecting softly off the glass and illuminating a tray of freshly baked cookies resting just below frame center. Fine steam curls upward, leaving faint streaks across the glass.

The color palette is rich and inviting: honeyed browns, soft ambers, and glowing highlights. The air feels warm and dense, as if the oven itself is breathing.

A baker leans into frame from outside the oven door. His face slowly fills the glass, eyes narrowed in intense concentration. His breath fogs the glass further as he exhales, creating overlapping clouds that briefly obscure his features before clearing again. He doesn't blink. His posture is rigid, shoulders slightly hunched, as though this moment demands absolute precision.

The camera remains static as subtle reflections ripple across the glass from the rising heat.

Baker (whispering, reverent):
"Today… I achieve perfection."

He tilts his head slightly, adjusting his angle to catch the light just right. His eyes track the edges of the cookies, watching for the slightest change in color. One hand lifts slowly into frame, hovering near the oven door handle but not touching it yet.

The steam grows thicker for a moment, then thins.

Baker (still whispering, building intensity):
"Golden edges. Soft center."

He leans even closer, nose nearly pressing against the glass. The camera holds, letting the silence stretch just long enough to feel deliberate.

Baker (a hushed gasp):
"The gods themselves will smell these cookies… and weep."

A beat.
The baker's eyes widen slightly as a new thought intrudes. His hand freezes mid-air.

Baker:
"Wait—"

Silence. The hum of the oven is suddenly very noticeable.

Baker (uncertain, quieter):
"Did I… forget the chocolate chips?"

Another beat. Just long enough to be uncomfortable.

Cut to a side angle outside the oven. The camera now slowly pans from left to right, revealing a coworker casually stepping into frame. The coworker chews loudly, crumbs at the corner of his mouth, utterly unconcerned. The lighting here is flatter and cooler, contrasting the oven's warmth.

Coworker (mouth full, casual):
"Nope. You forgot the sugar."

The pan continues for a moment as the coworker shrugs and wanders off.

Cut back inside the oven.
A quick push-in as the baker's face slams back into frame, pressed against the glass. His expression is pure horror. His breath fogs the glass completely now, obscuring the cookies behind it.

Behind the fog, the cookies visibly sag and deflate in slow motion, their once-perfect shape collapsing inward. Steam rises dramatically, drifting upward like a defeated sigh.

No music. Only the oven's hum and the faint crackle of heat.

The baker stares, unmoving.

Baker (barely audible):
"...No."

The camera holds on the fogged glass as the steam slowly dissipates, revealing the ruined cookies one last time before the shot ends.

Every rule in the spec appears in this example:

Shot establishment: static camera inside the oven, looking outward — frame and spatial constraints defined before any character appears
Environment: warm golden light, honeyed browns, fogged glass — palette and texture locked early
Character: defined through posture and behavior; "He doesn't blink" does more work than a paragraph of description
Action: leaning, exhaling, hovering, freezing — continuous sequence with no teleports
Camera: the slow pan appears only when the scene has a reason to reveal new information; the static holds are load-bearing
Audio: whisper, silence, oven hum, and heat crackle are all embedded in the dramatic timing

The line "No music. Only the oven's hum and the faint crackle of heat." shows how a sound decision functions as narrative control. The absence of music is not a default; it is a choice that makes the comedic silence work.

3. Multi-Scene Continuity Control

The Problem

Moving from single beautiful shots to multi-scene narratives introduces:

Visual drift: lighting and texture shift between shots
Camera discontinuity: pan speed and direction mismatch across clips
Timing loss: inability to control acting “beats” and pauses

Last-Frame Continuity

Continuation directive: explicitly state "Continue the same pan" at the start of the next scene
Seed incrementing: use a fixed episode seed plus shot index (episode_seed + shot_idx). Allows micro-variation while maintaining macro-consistency
Texture inheritance: visual states from previous shots (e.g., fogged glass) become preconditions for the next shot
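The seed rule can be sketched as a resolver for the SEED_HINT field. It assumes hints always take the form `episode_seed+NN`, as in the shot definition; the function name `resolve_seed_hint` and the example seed value are hypothetical, and any range limit the generator places on seeds is not handled here.

```python
import re


def resolve_seed_hint(hint: str, episode_seed: int) -> int:
    """Turn a hint like 'episode_seed+01' into a concrete integer seed."""
    match = re.fullmatch(r"episode_seed\+(\d+)", hint.strip())
    if match is None:
        raise ValueError(f"unrecognized seed hint: {hint!r}")
    return episode_seed + int(match.group(1))


# One seed per shot: adjacent shots differ by exactly 1.
episode_seed = 1000  # arbitrary illustrative value
seeds = [resolve_seed_hint(f"episode_seed+{i:02d}", episode_seed) for i in range(1, 37)]
```

Regenerating a single shot with the same episode seed reproduces that shot's seed, while changing the episode seed shifts all 36 together.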

CONTINUITY Field in Practice

Each shot’s CONTINUITY field specifies prev= and next=, connecting adjacent scenes through shared objects, shared sounds, continuing actions, or location transitions:

  CONTINUITY: prev=shared_sound:fluorescent_hum next=location_transition:hallway_to_kitchen

Without explicit connection directives, LTX-2 generates each shot as an independent clip, producing jarring cuts when concatenated.
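A sketch of checking the hooks across a shot list. It assumes that the `next=` value of one shot should equal the `prev=` value of the following shot and that the first shot uses `prev=none`; the source does not state that matching rule explicitly, and both function names are illustrative.

```python
def parse_continuity(value: str) -> dict[str, str]:
    """Turn 'prev=a next=b' into {'prev': 'a', 'next': 'b'}."""
    return dict(token.split("=", 1) for token in value.split())


def continuity_mismatches(continuity_fields: list[str]) -> list[tuple[int, str, str]]:
    """Return (shot number, expected prev, actual prev) for every broken link."""
    hooks = [parse_continuity(value) for value in continuity_fields]
    problems = []
    if hooks and hooks[0]["prev"] != "none":
        problems.append((1, "none", hooks[0]["prev"]))
    for i in range(len(hooks) - 1):
        outgoing, incoming = hooks[i]["next"], hooks[i + 1]["prev"]
        if outgoing != incoming:
            # Shot numbers are 1-based; the shot at index i + 1 is shot i + 2.
            problems.append((i + 2, outgoing, incoming))
    return problems
```

An empty result means every adjacent pair of shots agrees on what connects them.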

Avoiding Teleportation

When physical motion is abbreviated, the AI cannot predict intermediate frames and produces “teleportation.” The most effective countermeasure is obsessively describing how the body moves through space.

Evaluating the Sample Outputs

OUTPUT 1 (Fox Curse)

Follows the schema. Includes STORY_PROMPT, LABELS, OUTLINE, and SCENES_36. The story arc is complete. Scene-to-scene continuity is weak, and the camera language is not rich enough to act as generation-ready shot instructions. This is why the later template layer was added.

OUTPUT 2 (Gon’s Mutation)

Stronger as a narrative. Gon loses his sister to a bear, exterminates bears, triggers a curse, mutates, and eventually reaches acceptance. The resolution requirement is more clearly satisfied than in OUTPUT 1. Camera language still has the same limitation: present but too sparse for direct LTX-2 use.

Variable Parameterization

The file contains shell-variable-style entries like MAIN_CHARCTER="Gon", SUB_CHARCTERS="Mosuke, Yayoi", and FLAVOR=.... These are experimentation artifacts, not a finished schema. The intent is sound: keep the scene skeleton fixed and swap in characters and flavor patterns as variable inputs. The implementation needs to move from ad hoc notation to structured YAML or JSON before it is durable.

To turn this into a durable workflow:

Consolidate the repeated system-prompt blocks into one canonical version
Move OUTPUT 1 and OUTPUT 2 into a dedicated examples section
Replace ad hoc variables with YAML or JSON inputs to stabilize parsing
Add automatic validation between SCENES_36 and LTX2_PROMPTS_36
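As a sketch of the move from ad hoc variables to structured input, the loader below reads the same information from JSON. The key names and the `ScenarioInput` type are hypothetical. Because the original FLAVOR value is elided in the source, the example omits it and treats a missing flavor as "let the LLM auto-select".

```python
import json
from dataclasses import dataclass
from typing import Optional

# Flavor names from the presets table.
KNOWN_FLAVORS = {
    "trapped_room_mummy",
    "water_reflection",
    "station_beep_phrase",
    "fox_shrine_wrath",
}


@dataclass(frozen=True)
class ScenarioInput:
    main_character: str
    sub_characters: tuple[str, ...]
    flavor: Optional[str] = None  # None means the LLM auto-selects a flavor


def load_scenario_input(raw: str) -> ScenarioInput:
    """Parse and validate the JSON input for one scenario run."""
    data = json.loads(raw)
    flavor = data.get("flavor")
    if flavor is not None and flavor not in KNOWN_FLAVORS:
        raise ValueError(f"unknown flavor: {flavor!r}")
    return ScenarioInput(
        main_character=data["main_character"],
        sub_characters=tuple(data["sub_characters"]),
        flavor=flavor,
    )


example = load_scenario_input(
    '{"main_character": "Gon", "sub_characters": ["Mosuke", "Yayoi"]}'
)
```

A misspelled key now fails at load time with a `KeyError` instead of silently producing an empty variable.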

Continuity validation is the most valuable automation target. Checking whether prev and next hooks match the scene plan, whether resolution scenes contain concrete action, and whether any silent scene slipped through is straightforward to mechanize.

Takeaways

LTX-2 prompt engineering is fundamentally “writing a film storyboard in natural language.” The horror template forces structured output from the LLM. The cinematic principles maximize individual shot quality. The continuity controls connect shots into a narrative. All three layers are required for a production-ready pipeline.

The future of video generation AI depends not on single-shot beauty, but on the intelligence embedded in the gaps between consecutive shots.

Reproduction Steps

Minimum Setup

LTX-2 video generation environment (ComfyUI or API)
Scenario generation LLM (local or API)
ffmpeg (clip concatenation and telop compositing)

Horror Scenario Generation Workflow

Select a Story Flavor (or provide free-form input)
Feed the system prompt template from this article to the LLM
Retrieve SCENES_36 and LTX2_PROMPTS_36 from output
Feed each shot’s PROMPT to LTX-2 sequentially
Concatenate generated clips with ffmpeg
Composite audio and telop downstream
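The concatenation step can be sketched with ffmpeg's concat demuxer, which reads a text file of `file '...'` lines. The clip file names and helper names below are hypothetical. Stream copy (`-c copy`) assumes all clips share the same codec and encoding parameters, which should hold when every shot comes from the same generation settings but is not confirmed by the source.

```python
def concat_list(clip_paths: list[str]) -> str:
    """Build the contents of a concat-demuxer list file, one clip per line."""
    lines = []
    for path in clip_paths:
        # Inside single quotes, a literal quote is written as '\'' in the list file.
        escaped = path.replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"


def concat_command(list_path: str, output_path: str) -> list[str]:
    """Return the ffmpeg argument list; pass it to subprocess.run to execute."""
    return [
        "ffmpeg", "-f", "concat", "-safe", "0",
        "-i", list_path, "-c", "copy", output_path,
    ]


# Hypothetical file naming: shot_01.mp4 through shot_36.mp4.
clips = [f"shot_{i:02d}.mp4" for i in range(1, 37)]
list_text = concat_list(clips)
command = concat_command("clips.txt", "episode.mp4")
```

Writing `list_text` to `clips.txt` and running `command` would produce the joined episode; audio and telop compositing remain separate downstream steps.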

Cinematic Quality Checklist

Camera position specified at the start of each prompt
Lighting and color palette consistent across scenes
Actions described step-by-step with arrows (→)
CONTINUITY defined between adjacent shots
Audio directives included
Face close-ups avoided; hands/backs/silhouettes preferred
