Containerizing a Local LLM Stack: Docker Compose for vLLM, llama.cpp, and a Rust Proxy
A secure Docker Compose configuration for running vLLM, llama.cpp, Qdrant, and PostgreSQL under rootless Podman with network isolation.
Introduction
When operating Large Language Models locally, spinning up a standalone inference server like vLLM or llama.cpp is relatively straightforward. However, the moment you attempt to integrate Retrieval-Augmented Generation (RAG) or workflow pipelines, the architecture quickly becomes convoluted. Through trial and error, I found that the best approach is to avoid overloading the LLM server itself with additional responsibilities, and instead control the request flow via a frontend proxy API.
In this article, I will share the minimum viable Docker Compose configuration I use to integrate a custom OpenAI-compatible Proxy API, Qdrant, PostgreSQL, vLLM, and llama.cpp. This setup is designed specifically for a GPU-enabled environment using rootless Podman.
Background and Architectural Goals
When building an LLM platform, baking RAG logic or job management directly into the inference server makes it extremely difficult to swap out models or scale the system later. To solve this, I decoupled the system based on the following principles:
- Backend (Inference): vLLM and llama.cpp function purely as “OpenAI-compatible inference endpoints.” They do nothing else.
- Frontend (Proxy): A custom Proxy API written in Rust (Axum) intercepts all requests. It handles RAG searches (using Qdrant and PostgreSQL) and triggers Dagster pipelines via NATS before optionally forwarding the prompt to the backend.
- Data Stack: Qdrant is dedicated solely to vector storage, while PostgreSQL manages metadata, job queues, and document chunk pairing.
The docker-compose.yml presented below is the manifestation of this “separation of inference and control.”
Container Configuration (docker-compose.yml)
I deploy this stack at /opt/containers/compose/llm-stack/docker-compose.yml. For security, I have hardened the configuration by applying cap_drop: ["ALL"] and no-new-privileges:true to every container, and restricted writable directories using tmpfs. A key architectural feature is the strict separation of internal networks into db_net and llm_net.
version: "3.9"
name: llm-stack
networks:
db_net:
driver: bridge
internal: true
llm_net:
driver: bridge
internal: true
volumes:
pg_data:
driver: local
driver_opts:
type: none
o: bind
device: /mnt/data/postgres/data
qdrant_data:
driver: local
driver_opts:
type: none
o: bind
device: /mnt/data/qdrant/data
api_data:
driver: local
driver_opts:
type: none
o: bind
device: /mnt/data/llm-api
vllm_models:
driver: local
driver_opts:
type: none
o: bind
device: /mnt/data/models/vllm
llama_models:
driver: local
driver_opts:
type: none
o: bind
device: /mnt/data/models/llama
services:
postgres:
image: postgres:17
container_name: pg
restart: always
command: ["postgres","-c","max_connections=300","-c","shared_buffers=4GB","-c","wal_compression=on"]
environment:
POSTGRES_USER: ${PG_USER:-loft}
POSTGRES_PASSWORD: ${PG_PASSWORD:-change_me}
POSTGRES_DB: ${PG_DB:-loftdb}
healthcheck:
test: ["CMD-SHELL","pg_isready -U $$POSTGRES_USER -d $$POSTGRES_DB -h 127.0.0.1"]
interval: 10s
timeout: 3s
retries: 10
networks: [db_net]
volumes:
- pg_data:/var/lib/postgresql/data:rw
read_only: true
tmpfs:
- /tmp:rw,nosuid,nodev,noexec,size=256m
- /var/run/postgresql:rw,mode=775
security_opt: ["no-new-privileges:true"]
cap_drop: ["ALL"]
ulimits:
nofile: 262144
qdrant:
image: qdrant/qdrant:latest
restart: always
environment:
QDRANT__SERVICE__GRPC_PORT: 6334
QDRANT__STORAGE__WAL_MEMORY_CAPACITY: "33554432"
QDRANT__STORAGE__OPTIMIZERS__DEFAULT_SEGMENT_NUMBER: "2"
healthcheck:
test: ["CMD","/qdrant/tools/healthcheck.sh"]
interval: 10s
timeout: 3s
retries: 10
networks: [db_net]
volumes:
- qdrant_data:/qdrant/storage:rw
ports:
- "6333:6333"
- "6334:6334"
read_only: false
security_opt: ["no-new-privileges:true"]
cap_drop: ["ALL"]
ulimits:
nofile: 262144
vllm:
image: vllm/vllm-openai:latest
restart: always
command:
[
"python","-m","vllm.entrypoints.openai.api_server",
"--model","${VLLM_MODEL:-/models}",
"--host","0.0.0.0",
"--port","8000",
"--tensor-parallel-size","${VLLM_TP:-1}",
"--max-num-seqs","${VLLM_MAX_SEQS:-32}"
]
environment:
NVIDIA_VISIBLE_DEVICES: "all"
devices:
- "nvidia.com/gpu=all"
healthcheck:
test: ["CMD","curl","-sf","http://127.0.0.1:8000/health"]
interval: 10s
timeout: 5s
retries: 20
networks: [llm_net]
volumes:
- vllm_models:/models:ro
ports:
- "8000:8000"
security_opt: ["no-new-privileges:true"]
cap_drop: ["ALL"]
ulimits:
nofile: 262144
llamacpp:
image: ghcr.io/ggerganov/llama.cpp:full
restart: always
command:
[
"server",
"-m","/models/${LLAMA_MODEL:-model.gguf}",
"--host","0.0.0.0",
"--port","8080",
"--mlock",
"--no-mmap",
"--ctx-size","${LLAMA_CTX:-8192}",
"--batch-size","${LLAMA_BATCH:-512}",
"--embedding"
]
environment:
NVIDIA_VISIBLE_DEVICES: "all"
devices:
- "nvidia.com/gpu=all"
healthcheck:
test: ["CMD","curl","-sf","http://127.0.0.1:8080/health"]
interval: 10s
timeout: 5s
retries: 20
networks: [llm_net]
volumes:
- llama_models:/models:ro
ports:
- "8080:8080"
security_opt: ["no-new-privileges:true"]
cap_drop: ["ALL"]
ulimits:
nofile: 262144
openai-proto-api:
image: ghcr.io/your-org/openai-proto-api:latest
restart: always
environment:
VLLM_BASE_URL: http://vllm:8000
LLAMA_BASE_URL: http://llamacpp:8080
QDRANT_URL: http://qdrant:6333
QDRANT_API_KEY: ${QDRANT_API_KEY:-}
PGHOST: postgres
PGPORT: 5432
PGUSER: ${PG_USER:-loft}
PGPASSWORD: ${PG_PASSWORD:-change_me}
PGDATABASE: ${PG_DB:-loftdb}
API_PORT: 9000
API_KEY: ${API_KEY:-change_me}
depends_on:
- qdrant
- postgres
- vllm
healthcheck:
test: ["CMD","curl","-sf","http://127.0.0.1:9000/healthz"]
interval: 10s
timeout: 5s
retries: 30
networks:
- llm_net
- db_net
volumes:
- api_data:/var/lib/llm-api
ports:
- "9000:9000"
read_only: true
tmpfs:
- /tmp:rw,nosuid,nodev,noexec,size=256m
security_opt: ["no-new-privileges:true"]
cap_drop: ["ALL"]
ulimits:
nofile: 262144
Environment Variables (.env)
Sensitive or dynamic configuration values are kept out of the compose file. Place this .env file at /opt/containers/compose/llm-stack/.env.
PG_USER=loft
PG_PASSWORD=change_me
PG_DB=loftdb
QDRANT_API_KEY=
API_KEY=change_me
VLLM_MODEL=/models
VLLM_TP=1
VLLM_MAX_SEQS=32
LLAMA_MODEL=model.gguf
LLAMA_CTX=8192
LLAMA_BATCH=512
Setup and Startup Procedure
Here is the initial setup procedure to get the stack running. Pay special attention to adjusting the directory permissions for PostgreSQL.
mkdir -p /mnt/data/{postgres/data,qdrant/data,models/vllm,models/llama,llm-api}
podman unshare chown -R 999:999 /mnt/data/postgres/data
cd /opt/containers/compose/llm-stack
podman-compose up -d
Because the Postgres container operates internally as uid=999, we must use podman unshare chown to align the ownership of the host’s mount directory. If you skip this step, the database initialization phase will fail with a permission error, and the container will not start.
GPU Allocation
In the service definitions for vLLM and llama.cpp, GPU access is universally defined using devices: ["nvidia.com/gpu=all"]. This assumes that the host environment has already been configured with the Container Device Interface (CDI), specifically by running nvidia-ctk runtime configure --runtime=crun beforehand.
Systemd Integration (Auto-start)
To ensure the services reliably restart after a server reboot, we can leverage Podman’s built-in functionality to register them as systemd user services.
podman generate systemd --files --name llm-stack_openai-proto-api
podman generate systemd --files --name llm-stack_vllm
podman generate systemd --files --name llm-stack_llamacpp
podman generate systemd --files --name llm-stack_qdrant
podman generate systemd --files --name llm-stack_postgres
mkdir -p ~/.config/systemd/user
mv *.service ~/.config/systemd/user/
systemctl --user daemon-reload
systemctl --user enable --now llm-stack_*.service
loginctl enable-linger ksh3
By setting loginctl enable-linger, we ensure that the user’s processes continue to run in the background even after they log out of the host.
Results and Discussion
The most significant advantage of this configuration is that only the API container (openai-proto-api) is attached to both the db_net and llm_net networks.
Because the inference servers (vLLM and llama.cpp) are strictly confined to llm_net, there is zero risk of them accidentally accessing the databases. Conversely, Qdrant and Postgres cannot reach external networks or the inference endpoints. The Proxy API sits at the forefront as an OpenAI-compatible interface, accepts client requests, and routes them to either vLLM or llama.cpp based on the requested model name. The initial requirement is implemented while maintaining secure, distinct network boundaries.
Future Work
Currently, container log monitoring is not included directly in this Compose setup; the intention is to collect stdout or journal logs using promtail. Moving forward, I plan to expand the metadata processing capabilities on the Proxy side to further improve RAG accuracy.
