Integrating MLflow into devstack — Separating Dagster and Experiment Tracking Responsibilities
Implementation record of adding MLflow Tracking Server and MinIO to the agent-gateway devstack, connecting Dagster’s orchestration layer with the ML experiment tracking layer via correlation_id. Covers real-world issues including port conflicts, missing psycopg2, and adding a database to an existing volume.
Conclusion
MLflow Tracking Server and MinIO (S3-compatible object storage) were added to the devstack running the Dagster + NATS event pipeline. The design principle is a separation of responsibilities: “Dagster provides the orchestration bird’s-eye view, MLflow provides the detailed view of individual ML experiments,” with both linked by correlation_id.
The final changes span 8 files and 3 new services (MinIO, minio-init, MLflow). The system became operational after working through three real-world issues: port conflict with macOS AirPlay Receiver, missing psycopg2 in the official MLflow image, and adding a database to an existing PostgreSQL volume.
Why MLflow Was Needed
Dagster was already embedded in the devstack as the orchestrator: sensors pull pipeline events from NATS JetStream, jobs execute, and results are persisted to PostgreSQL. That is sufficient for a bird’s-eye view of “what happened,” but there was no mechanism to track what happened inside individual ML experiments: hyperparameters, metric trends, and artifacts.
Two observation layers are explicitly separated:
- Dagster: Bird’s-eye view (what happened)
- MLflow: Detailed view (what happened inside)
- correlation_id links both layers
  - Dagster asset metadata records experiment_id, run_id, tracker_url, summary
  - MLflow run tags include correlation_id
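The linkage above can be sketched as two plain dictionaries: the tags that would be set on the MLflow run and the metadata that would be attached to the Dagster asset. This is a minimal illustration rather than the project's code. The ExperimentLink class, its method names, and the sample values are hypothetical; only the key names come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentLink:
    """Identifiers shared by one Dagster run and one MLflow run (hypothetical helper)."""

    correlation_id: str
    experiment_id: str
    run_id: str
    tracker_url: str
    summary: str

    def mlflow_tags(self) -> dict[str, str]:
        # Tag set on the MLflow run so it can be found from the Dagster side.
        return {"correlation_id": self.correlation_id}

    def dagster_metadata(self) -> dict[str, str]:
        # Metadata recorded on the Dagster asset so it can point at the MLflow run.
        return {
            "experiment_id": self.experiment_id,
            "run_id": self.run_id,
            "tracker_url": self.tracker_url,
            "summary": self.summary,
        }


# Sample values are made up for illustration.
link = ExperimentLink(
    correlation_id="corr-123",
    experiment_id="0",
    run_id="abc",
    tracker_url="http://mlflow:5000/#/experiments/0/runs/abc",
    summary="loss=0.42",
)
```

Keeping both dictionaries on one object makes it harder for the two layers to drift apart, since each side is derived from the same identifiers.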
The experiment tracking foundation needed to be in place ahead of Phase 2 fine-tuning pipelines and investment prediction model generation.
Existing Infrastructure
The overall devstack configuration is as follows.
Core services:
| Service | Role |
|---|---|
| NATS (JetStream) | Messaging |
| PostgreSQL 18 (pgvector + JIT, 4GB SHM) | Data store |
| Dagster 3 containers (webserver + daemon + user-code gRPC) | Orchestration |
| Vector | Telemetry collection |
| FastAPI Reranker (ColBERT) | Reranking |
Lakehouse profile (optional): Nessie, Trino, dbt-fusion
3-host configuration:
- Storage server
- Desktop Mac — NATS co-located with gateway
- Compute server — Dagster and PostgreSQL
The README listed MinIO (S3-compatible) for binary artifact storage in the data retention strategy, but no MinIO service existed in podman-compose.yml yet. Since MLflow requires an S3-compatible artifact store, MinIO was added simultaneously.
Change Plan
- Add MLflow Tracking Server to podman-compose.yml. Backend store reuses the existing PostgreSQL with a new mlflow database; artifact store uses MinIO’s s3://mlflow/
- Add CREATE DATABASE mlflow; to init.sql
- Add dagster-mlflow resource to the Dagster side, making experiment_name and mlflow_tracking_uri configurable
- Add MLFLOW_TRACKING_URI and MLFLOW_S3_ENDPOINT_URL to environment variables
- Update README.md
MLflow is dedicated to experiment tracking; orchestration remains with Dagster.
dagster-mlflow API Investigation
The existing Dagster is pinned to 1.12.14. Compatibility with dagster-mlflow was verified.
On PyPI, dagster-mlflow follows Dagster’s version scheme — dagster 1.12.X corresponds to dagster-mlflow 0.28.X. For dagster 1.12.14, dagster-mlflow==0.28.14 is used. Dependencies: dagster==1.12.14, mlflow, pandas<3.0.0, protobuf!=5.29.0.
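The version correspondence can be written as a small helper. This sketch assumes that the offset seen in the single pair quoted above (minor 12 maps to minor 28, a difference of 16) also holds for other 1.x releases. The function name is hypothetical, and the result should still be checked against PyPI before pinning.

```python
def companion_library_version(dagster_version: str, minor_offset: int = 16) -> str:
    """Map a dagster 1.X.Y version to the matching 0.(X + offset).Y library version."""
    major, minor, patch = dagster_version.split(".")
    if major != "1":
        raise ValueError(f"only dagster 1.x is covered by this sketch: {dagster_version}")
    return f"0.{int(minor) + minor_offset}.{patch}"


print(companion_library_version("1.12.14"))  # 0.28.14
```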
The source code was read directly after installing in a temporary venv:
- mlflow_tracking is an old-style @resource decorator ResourceDefinition. Not the ConfigurableResource pattern used by existing PostgresResource and EmbeddingResource
- Ops access it via required_resource_keys={"mlflow"} and context.resources.mlflow
- The end_mlflow_on_run_finished hook must be applied to jobs or MLflow runs will hang. This is mandatory
- The MlflowMeta metaclass proxies all mlflow.* methods, so log_params(), log_metric(), log_artifact() can be called directly on the resource object
- S3 credentials can be passed via env config, but this is unnecessary if already set as container environment variables
The old-style resource pattern differs somewhat from the existing codebase, but this is unavoidable given dagster-mlflow’s package design.
Implementation
Following the plan, 8 files were changed or newly created.
podman-compose.yml: 3 Services Added
MinIO (docker.io/minio/minio:latest, ports 9000/9001, minio-data volume) was added as S3-compatible object storage. minio-init is a one-shot container that creates the s3://mlflow/ bucket using the mc command.
MLflow Tracking Server (ghcr.io/mlflow/mlflow:v2.21.3, port 5000):
--backend-store-uri=postgresql://postgres:postgres@postgres:5432/mlflow
--default-artifact-root=s3://mlflow/
These flags point the server at PostgreSQL as the backend store and MinIO as the artifact store. depends_on requires postgres to reach service_healthy and minio-init to reach service_completed_successfully.
The following environment variables were added to Dagster’s dagster-user-code and dagster-daemon containers:
MLFLOW_TRACKING_URI: "http://mlflow:5000"
MLFLOW_S3_ENDPOINT_URL: "http://minio:9000"
AWS_ACCESS_KEY_ID: minioadmin
AWS_SECRET_ACCESS_KEY: minioadmin
init.sql
CREATE DATABASE mlflow;
dagster/pyproject.toml
Added dagster-mlflow==0.28.14 and boto3. boto3 is required for MLflow’s S3 artifact store access.
dagster/project/resources/mlflow.py (new)
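import os

from dagster_mlflow import mlflow_tracking

# Old-style resource: .configured() binds the config values up front.
mlflow_resource = mlflow_tracking.configured({
    "experiment_name": os.getenv("MLFLOW_EXPERIMENT_NAME", "agent-gateway"),
    "mlflow_tracking_uri": os.getenv("MLFLOW_TRACKING_URI", "http://mlflow:5000"),
    "extra_tags": {"project": "agent-gateway"},
})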
dagster/project/defs.py
Added "mlflow": mlflow_resource to the Definitions resources.
.envrc
Added MLFLOW_TRACKING_URI and MLFLOW_S3_ENDPOINT_URL.
Troubleshooting
Port 5000 Conflict
Running podman compose up -d mlflow failed with a bind error:
Error response from daemon: "listen tcp :5000: bind: address already in use"
Checking with lsof -i :5000 revealed macOS ControlCenter (AirPlay Receiver) was occupying port 5000.
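The same check can be made from Python before starting the stack. This is a sketch using only the standard library; the function name is hypothetical, and it only reports whether a plain TCP bind succeeds at the moment it is called.

```python
import socket


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can bind to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use" when another listener holds the port.
            return False
    return True


if __name__ == "__main__":
    if not port_is_free(5000):
        print("port 5000 is taken; choose another host port for MLflow")
```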
The MLflow host port mapping was changed to "${MLFLOW_PORT:-5050}:5000". The container continues to run on port 5000 internally, so Dagster containers’ MLFLOW_TRACKING_URI: http://mlflow:5000 requires no change. Only host-side access uses port 5050.
The internal MLFLOW_TRACKING_URI was briefly changed to 5050 as well, but the mistake was quickly caught — inter-container communication should use container ports — and reverted to 5000. .envrc was updated with MLFLOW_PORT="5050", and MLFLOW_TRACKING_URI references "http://${COMPUTE_HOST}:${MLFLOW_PORT}" for external access.
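The rule described above, container port for traffic inside the compose network and mapped host port for everything else, can be sketched as a small helper. The function name and the example host name are hypothetical; the service name, internal port, and default host port are taken from the text.

```python
def tracking_uri(inside_compose_network: bool, compute_host: str, host_port: int = 5050) -> str:
    """Pick the MLflow URI for a client depending on where it runs."""
    if inside_compose_network:
        # Containers reach the service by name on its container port.
        return "http://mlflow:5000"
    # Clients on the host or other machines go through the published port.
    return f"http://{compute_host}:{host_port}"


print(tracking_uri(True, "compute.example"))   # http://mlflow:5000
print(tracking_uri(False, "compute.example"))  # http://compute.example:5050
```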
Official Image Missing psycopg2
After the port fix, MLflow crashed immediately on restart.
ModuleNotFoundError: No module named 'psycopg2'
The official ghcr.io/mlflow/mlflow:v2.21.3 image does not include psycopg2, so a PostgreSQL backend store cannot be used until the driver is installed in a derived image.
A new devstack/mlflow/Dockerfile was created:
FROM ghcr.io/mlflow/mlflow:v2.21.3
RUN pip install --no-cache-dir psycopg2-binary boto3
The mlflow service in podman-compose.yml was changed from a direct image: reference to build: context: ./devstack/mlflow, tagged locally as localhost/agent-gateway/mlflow:2.21.3.
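One way to confirm that a rebuilt image contains the needed drivers is to run a short import check inside the container. This is a sketch using only the standard library; the function name is hypothetical, and the module list simply mirrors the packages installed in the Dockerfile above.

```python
import importlib.util


def missing_modules(names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means both drivers are importable in the current environment.
print(missing_modules(["psycopg2", "boto3"]))
```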
Adding a Database to an Existing Volume
After rebuilding with the custom image, a PostgreSQL-side error appeared.
FATAL: database "mlflow" does not exist
CREATE DATABASE mlflow; had already been added to init.sql, but the PostgreSQL image runs the scripts in docker-entrypoint-initdb.d only when the data directory is empty, which in practice means first startup. The postgres-data volume had existed for three weeks, so nothing added to init.sql would take effect.
The database was created manually:
podman exec agent-gateway-postgres-1 psql -U postgres -c "CREATE DATABASE mlflow;"
The addition to init.sql is preserved for cases where the volume is destroyed and recreated, or for new environment setup.
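Because PostgreSQL has no CREATE DATABASE IF NOT EXISTS, a repeatable version of the manual step has to check pg_database first. The following is a sketch, not part of the devstack: the function names are hypothetical, the command runner is injectable so the logic can be exercised without podman, and the postgres user matches the command above.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], str]


def run_capture(argv: Sequence[str]) -> str:
    """Run a command and return its stdout, raising on a non-zero exit."""
    done = subprocess.run(list(argv), check=True, capture_output=True, text=True)
    return done.stdout


def ensure_database(container: str, database: str, run: Runner = run_capture) -> bool:
    """Create the database if it is missing. Returns True when it was created."""
    if not database.isidentifier():
        # Keep the name safe to interpolate into SQL.
        raise ValueError(f"unexpected database name: {database!r}")
    psql = ["podman", "exec", container, "psql", "-U", "postgres"]
    # -t and -A strip headers and alignment, so an existing database prints "1".
    exists = run([*psql, "-tAc", f"SELECT 1 FROM pg_database WHERE datname = '{database}'"])
    if exists.strip() == "1":
        return False
    run([*psql, "-c", f"CREATE DATABASE {database};"])
    return True
```

Calling ensure_database("agent-gateway-postgres-1", "mlflow") would then be safe to repeat on both fresh and long-lived volumes.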
Result
After MLflow started and Alembic migrations ran automatically, gunicorn began operating with 4 workers.
[2026-03-12 02:07:58 +0000] [24] [INFO] Starting gunicorn 23.0.0
[2026-03-12 02:07:58 +0000] [24] [INFO] Listening at: http://0.0.0.0:5000 (24)
Connectivity was verified via the experiments/search API, confirming the Default experiment was returned. The artifact_location was s3://mlflow/0, correctly referencing MinIO.
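The check above can be scripted with the standard library. This is a sketch under assumptions: it posts to the experiments/search endpoint of the MLflow REST API under /api/2.0/mlflow, the function names are hypothetical, and the base URL must be whichever address reaches the server from where the script runs.

```python
import json
import urllib.request


def build_search_request(base_url: str, max_results: int = 10) -> urllib.request.Request:
    """Build a POST request for the experiments/search endpoint."""
    body = json.dumps({"max_results": max_results}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/2.0/mlflow/experiments/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def artifact_locations(payload: dict) -> dict[str, str]:
    """Map experiment name to artifact location from a search response."""
    return {e["name"]: e["artifact_location"] for e in payload.get("experiments", [])}


def check_tracking_server(base_url: str) -> dict[str, str]:
    """Query the server and return the artifact location of each experiment."""
    with urllib.request.urlopen(build_search_request(base_url), timeout=5) as resp:
        return artifact_locations(json.load(resp))
```

From the host, check_tracking_server("http://localhost:5050") would be expected to include the Default experiment with an s3://mlflow/ location, matching the result described above.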
Changed Files
| File | Change |
|---|---|
| podman-compose.yml | Added 3 services: MinIO, minio-init, mlflow + Dagster env vars |
| devstack/postgres/init.sql | CREATE DATABASE mlflow |
| devstack/dagster/pyproject.toml | Added dagster-mlflow==0.28.14, boto3 |
| devstack/dagster/project/resources/mlflow.py | New: mlflow_tracking configured resource |
| devstack/dagster/project/defs.py | mlflow resource registration |
| devstack/mlflow/Dockerfile | New: psycopg2-binary + boto3 |
| .envrc | MLFLOW_PORT, MLFLOW_TRACKING_URI, MLFLOW_S3_ENDPOINT_URL |
| README.md | Service list, environment variables, topology diagram |
Design Notes
- dagster-mlflow is an old-style resource (@resource decorator), not ConfigurableResource. The @end_mlflow_on_run_finished hook must always be applied to jobs
- Inter-container communication uses internal port 5000; host access uses port 5050 (macOS AirPlay Receiver workaround)
- When a PostgreSQL volume already exists, new database creation in init.sql must be applied manually
