How I Structured a Reusable Memory Pool for ELT on a Single EPYC Server
Designing a 512GB memory pool that rotates step by step across In-Memory, Arrow IPC, and Parquet paths for stable ELT execution.
Introduction
When I build a local ELT stack around Dagster, Trino, DataFusion, and PostgreSQL, the hard part is usually not raw CPU throughput. It is deciding how much memory each step is allowed to consume, when that memory must be released, and when intermediate data should leave RAM altogether. This note was written around a machine with AMD EPYC 9175F 16c/32t, 768GB DDR5-6400, RTX 6000 Pro Max-Q, and NVMe 3.84TB, and the main goal was to turn 512GB of that memory into a reusable ELT execution pool.
I was already thinking about local, offline-first delivery patterns such as a 48h Prototype or a compact data-platform package built around flows like Airbyte -> Iceberg / Trino -> visualization. This ELT memo is the execution policy underneath that broader platform idea.
Background and Motivation
I wanted to avoid the common trap of treating a high-memory server as an excuse to keep everything in RAM all the time. Even with 768GB, the combination of Trino heavy queries, large DataFusion transforms, PostgreSQL load phases, and loosely controlled Dagster concurrency can create ugly peak collisions. When that happens, the issue is not only OOM risk. Recovery becomes messy because some tasks will already be half-spilled, half-retrying, or holding file handles they should have released.
That is why I reduced the whole policy to one operating principle:
Reuse a dedicated
512GBpool step by step, keep light paths inIn-Memory, and fall back toArrow IPCorParquetwhen the workload gets large.
That one sentence is useful because it forces performance, safety, and restartability into the same design.
Base Resource Assumptions
These are the hardware assumptions:
- Server:
AMD EPYC 9175F 16c/32t (5GHz) + 768GB DDR5-6400 + RTX 6000 Pro Max-Q + NVMe 3.84TB - ELT pool: a dedicated
512GBreserved for workflow execution - Bandwidth: memory bandwidth around
~600GB/s, NVMe I/O in multipleGB/s, network at10GbE
The important point is that I do not treat 512GB as a vague upper bound for the entire system. I treat it as a scheduled execution budget that is loaned to the currently active ELT step. Once the stack grows to include ingestion, observability, cataloging, and visualization, explicit budgets are easier to maintain than an always-on free-for-all.
Memory Operation Policy
Principle
The memory policy is simple on purpose:
- Reuse the
512GBpool step by step - Prefer direct
In-Memoryhandoff when everything stays in one process - If the work is split across processes or containers, keep the combined usage within the same
512GBceiling
With Dagster, it is tempting to over-parallelize because op graphs make concurrency feel cheap. In practice, heavy analytics jobs behave better when I run them wide but serialized, release the memory cleanly, and let the next step reuse the same pool.
Switching Thresholds
I also wrote down explicit thresholds for intermediate data size:
| Intermediate size | Recommended mode | Notes |
|---|---|---|
<= 40-50% (200-250GB) | In-Memory (in-process + InMemoryIOManager) | Fastest path because it directly uses ~600GB/s memory bandwidth |
50-80% | Arrow IPC stream | Good compromise with lower overhead and NVMe-backed safety |
>80% or concurrent heavy ops | Parquet temporary output | Better restart behavior and stronger safety margin |
I prefer hard switching rules like this because they turn a fuzzy judgment call into a repeatable operating policy.
Optimization by Component
DataFusion (Rust)
On the DataFusion side, the biggest rule is to avoid accidentally materializing huge intermediates too early.
- Set
MemoryPoollimits per op usingS/M/Lprofiles - Point
DiskManagerat NVMe under/opt/tmpso forced spill is available - Prefer the streaming API, especially
execute_stream() - Align partition count with the physical core count (
16c)
Even on a large-memory server, streaming through work that can stay streamed is usually safer than collecting everything and hoping the next step still fits.
Trino
Trino needs both internal limits and external scheduling discipline.
JVM: -Xmx192g
max-spill-per-node=512GB
query.max-memory-per-node=128GB
query.max-total-memory-per-node=160GB
That setup assumes a 256GB container, with spill directed to NVMe. Just as important, I only allow one heavy query at a time through Dagster concurrency controls. Engine tuning alone is not enough if the orchestrator keeps launching competing memory-heavy work.
PostgreSQL
For PostgreSQL, I care less about flashy tuning and more about predictable bulk-load behavior.
shared_buffers=8-16GB
work_mem=128MB
maintenance_work_mem=2-4GB
wal_compression=lz4
checkpoint_timeout=20-30min
max_wal_size=12-16GB
The preferred load patterns are:
UNLOGGED TABLE -> COPY -> build indexes -> LOGGEDCTAS (CREATE TABLE AS SELECT)
I also keep synchronous_commit=off as a load-time option, not a blanket default. That is the kind of optimization that helps during ingest but should stay explicit and bounded.
Dagster
Dagster is where the runtime policy becomes enforceable.
- Light jobs:
in_process_executor + InMemoryIOManager - Heavy jobs:
multiprocess_executor + per-op Podman containers (--memory/--cpuset-cpus)
Then I tag workloads to control concurrency:
- heavy operations (
Trino, largeDataFusionops):max_concurrent=1 - light operations such as metadata handling,
dbt run, or logging: parallel execution is allowed
This matters because it keeps execution behavior consistent even when different jobs are authored over time by different assumptions.
Process / Container Design
In-Memory (fastest path)
The fastest mode is to run the whole job with in-process executor and pass intermediates as pyarrow.Table or RecordBatch through shallow copies. I only want to do that when the intermediate size stays around <=200GB, or roughly half of the ELT pool.
If I break that rule, the cost often appears later, when a downstream op starts and pushes the real peak over the edge.
Container Isolation (stable path)
For safer execution, I run each op through podman run --memory=... --cpuset-cpus=.... The appeal is straightforward: once the op exits, the memory is released with the process. That maps cleanly onto the goal of reusing the 512GB pool across steps.
This also fits the wider platform direction, where local reproducibility and container-first execution are recurring themes.
Caveats When Spawning Subprocesses
If Python launches Rust child processes while the parent is still holding large buffers, combined peak memory becomes the real problem. A few specific safeguards:
- release parent buffers with
del - let scopes end before child startup
- call
gc.collect() - prefer Python
spawnoverfork
That last point matters more than it looks. Once fork loses its Copy-on-Write advantage, actual memory usage can jump much faster than expected.
I/O Optimization
Disk strategy is not only about picking a fast SSD. It is also about choosing the right intermediate representation.
Arrow IPC: the fastest escape hatch, close to zero-copyParquet: the safer fallback withzstd(1-3)and row groups at128-256MBNVMe: centralize spill directories under/opt/tmp/dev/shm: increase to--shm-size 2-4GBfor Arrow Plasma / IPC- cleanup: remove temporary files at job end, including failure paths
I like to think of Arrow IPC as the speed-oriented spill path and Parquet as the safety-oriented spill path. That distinction keeps the runtime choice simple.
Operating Rules
The final operating rules collapse into five points:
- Run heavy queries one at a time and make them wide.
- End each op, free its memory, and reuse the same
512GBpool. - Switch between
In-Memory,Arrow IPC, andParquetbased on intermediate size. - Use
Parquetwhen restartability matters most. - Use
In-Memoryshallow-copy or pointer-style transfer when raw speed matters most.
These rules are not just local tuning notes. They are the policy layer that makes different ELT jobs behave consistently.
Benefits
This design gives me four concrete advantages:
- speed:
In-Memorypaths can exploit the full~600GB/smemory bandwidth - stability: cgroup limits, spill behavior, and step-wise reuse make the stack harder to crash
- flexibility: the runtime can switch between
Arrow IPCandParquetdepending on workload shape - simplicity:
Dagsterexecutor switching becomes the main knob for execution mode
On a single-server setup, explicitly limiting concurrency often makes the system both faster and more reliable than trying to parallelize everything.
Summary
The result is simple: reuse 512GB step by step, keep light workloads in In-Memory, spill heavier ones to Arrow or Parquet, and isolate ops in containers so release is guaranteed.
Next Steps
The natural next step is to turn this policy into concrete templates for podman-compose.yml memory and CPU profiles plus Dagster executor settings. A broader local stack with pieces like Airbyte OSS, Great Expectations, OpenMetadata, and MLflow is already outlined, so connecting those surrounding services to the same ELT execution rules is the obvious continuation.
