Introduction

When I build a local ELT stack around Dagster, Trino, DataFusion, and PostgreSQL, the hard part is usually not raw CPU throughput. It is deciding how much memory each step is allowed to consume, when that memory must be released, and when intermediate data should leave RAM altogether. This note was written around a machine with AMD EPYC 9175F 16c/32t, 768GB DDR5-6400, RTX 6000 Pro Max-Q, and NVMe 3.84TB, and the main goal was to turn 512GB of that memory into a reusable ELT execution pool.

I was already thinking about local, offline-first delivery patterns such as a 48h Prototype or a compact data-platform package built around flows like Airbyte -> Iceberg / Trino -> visualization. This ELT memo is the execution policy underneath that broader platform idea.

Background and Motivation

I wanted to avoid the common trap of treating a high-memory server as an excuse to keep everything in RAM all the time. Even with 768GB, the combination of Trino heavy queries, large DataFusion transforms, PostgreSQL load phases, and loosely controlled Dagster concurrency can create ugly peak collisions. When that happens, the issue is not only OOM risk. Recovery becomes messy because some tasks will already be half-spilled, half-retrying, or holding file handles they should have released.

That is why I reduced the whole policy to one operating principle:

Reuse a dedicated 512GB pool step by step, keep light paths in In-Memory, and fall back to Arrow IPC or Parquet when the workload gets large.

That one sentence is useful because it forces performance, safety, and restartability into the same design.

Base Resource Assumptions

These are the hardware assumptions:

  • Server: AMD EPYC 9175F 16c/32t (5GHz) + 768GB DDR5-6400 + RTX 6000 Pro Max-Q + NVMe 3.84TB
  • ELT pool: a dedicated 512GB reserved for workflow execution
  • Bandwidth: memory bandwidth around ~600GB/s, NVMe I/O in multiple GB/s, network at 10GbE

The important point is that I do not treat 512GB as a vague upper bound for the entire system. I treat it as a scheduled execution budget that is loaned to the currently active ELT step. Once the stack grows to include ingestion, observability, cataloging, and visualization, explicit budgets are easier to maintain than an always-on free-for-all.

Memory Operation Policy

Principle

The memory policy is simple on purpose:

  • Reuse the 512GB pool step by step
  • Prefer direct In-Memory handoff when everything stays in one process
  • If the work is split across processes or containers, keep the combined usage within the same 512GB ceiling

With Dagster, it is tempting to over-parallelize because op graphs make concurrency feel cheap. In practice, heavy analytics jobs behave better when I run them wide but serialized, release the memory cleanly, and let the next step reuse the same pool.

Switching Thresholds

I also wrote down explicit thresholds for intermediate data size:

Intermediate sizeRecommended modeNotes
<= 40-50% (200-250GB)In-Memory (in-process + InMemoryIOManager)Fastest path because it directly uses ~600GB/s memory bandwidth
50-80%Arrow IPC streamGood compromise with lower overhead and NVMe-backed safety
>80% or concurrent heavy opsParquet temporary outputBetter restart behavior and stronger safety margin

I prefer hard switching rules like this because they turn a fuzzy judgment call into a repeatable operating policy.

Optimization by Component

DataFusion (Rust)

On the DataFusion side, the biggest rule is to avoid accidentally materializing huge intermediates too early.

  • Set MemoryPool limits per op using S/M/L profiles
  • Point DiskManager at NVMe under /opt/tmp so forced spill is available
  • Prefer the streaming API, especially execute_stream()
  • Align partition count with the physical core count (16c)

Even on a large-memory server, streaming through work that can stay streamed is usually safer than collecting everything and hoping the next step still fits.

Trino

Trino needs both internal limits and external scheduling discipline.

  JVM: -Xmx192g
max-spill-per-node=512GB
query.max-memory-per-node=128GB
query.max-total-memory-per-node=160GB
  

That setup assumes a 256GB container, with spill directed to NVMe. Just as important, I only allow one heavy query at a time through Dagster concurrency controls. Engine tuning alone is not enough if the orchestrator keeps launching competing memory-heavy work.

PostgreSQL

For PostgreSQL, I care less about flashy tuning and more about predictable bulk-load behavior.

  shared_buffers=8-16GB
work_mem=128MB
maintenance_work_mem=2-4GB
wal_compression=lz4
checkpoint_timeout=20-30min
max_wal_size=12-16GB
  

The preferred load patterns are:

  • UNLOGGED TABLE -> COPY -> build indexes -> LOGGED
  • CTAS (CREATE TABLE AS SELECT)

I also keep synchronous_commit=off as a load-time option, not a blanket default. That is the kind of optimization that helps during ingest but should stay explicit and bounded.

Dagster

Dagster is where the runtime policy becomes enforceable.

  • Light jobs: in_process_executor + InMemoryIOManager
  • Heavy jobs: multiprocess_executor + per-op Podman containers (--memory/--cpuset-cpus)

Then I tag workloads to control concurrency:

  • heavy operations (Trino, large DataFusion ops): max_concurrent=1
  • light operations such as metadata handling, dbt run, or logging: parallel execution is allowed

This matters because it keeps execution behavior consistent even when different jobs are authored over time by different assumptions.

Process / Container Design

In-Memory (fastest path)

The fastest mode is to run the whole job with in-process executor and pass intermediates as pyarrow.Table or RecordBatch through shallow copies. I only want to do that when the intermediate size stays around <=200GB, or roughly half of the ELT pool.

If I break that rule, the cost often appears later, when a downstream op starts and pushes the real peak over the edge.

Container Isolation (stable path)

For safer execution, I run each op through podman run --memory=... --cpuset-cpus=.... The appeal is straightforward: once the op exits, the memory is released with the process. That maps cleanly onto the goal of reusing the 512GB pool across steps.

This also fits the wider platform direction, where local reproducibility and container-first execution are recurring themes.

Caveats When Spawning Subprocesses

If Python launches Rust child processes while the parent is still holding large buffers, combined peak memory becomes the real problem. A few specific safeguards:

  • release parent buffers with del
  • let scopes end before child startup
  • call gc.collect()
  • prefer Python spawn over fork

That last point matters more than it looks. Once fork loses its Copy-on-Write advantage, actual memory usage can jump much faster than expected.

I/O Optimization

Disk strategy is not only about picking a fast SSD. It is also about choosing the right intermediate representation.

  • Arrow IPC: the fastest escape hatch, close to zero-copy
  • Parquet: the safer fallback with zstd(1-3) and row groups at 128-256MB
  • NVMe: centralize spill directories under /opt/tmp
  • /dev/shm: increase to --shm-size 2-4GB for Arrow Plasma / IPC
  • cleanup: remove temporary files at job end, including failure paths

I like to think of Arrow IPC as the speed-oriented spill path and Parquet as the safety-oriented spill path. That distinction keeps the runtime choice simple.

Operating Rules

The final operating rules collapse into five points:

  1. Run heavy queries one at a time and make them wide.
  2. End each op, free its memory, and reuse the same 512GB pool.
  3. Switch between In-Memory, Arrow IPC, and Parquet based on intermediate size.
  4. Use Parquet when restartability matters most.
  5. Use In-Memory shallow-copy or pointer-style transfer when raw speed matters most.

These rules are not just local tuning notes. They are the policy layer that makes different ELT jobs behave consistently.

Benefits

This design gives me four concrete advantages:

  • speed: In-Memory paths can exploit the full ~600GB/s memory bandwidth
  • stability: cgroup limits, spill behavior, and step-wise reuse make the stack harder to crash
  • flexibility: the runtime can switch between Arrow IPC and Parquet depending on workload shape
  • simplicity: Dagster executor switching becomes the main knob for execution mode

On a single-server setup, explicitly limiting concurrency often makes the system both faster and more reliable than trying to parallelize everything.

Summary

The result is simple: reuse 512GB step by step, keep light workloads in In-Memory, spill heavier ones to Arrow or Parquet, and isolate ops in containers so release is guaranteed.

Next Steps

The natural next step is to turn this policy into concrete templates for podman-compose.yml memory and CPU profiles plus Dagster executor settings. A broader local stack with pieces like Airbyte OSS, Great Expectations, OpenMetadata, and MLflow is already outlined, so connecting those surrounding services to the same ELT execution rules is the obvious continuation.