On this page

Containerizing a Local LLM Stack: Docker Compose for vLLM, llama.cpp, and a Rust Proxy

A secure Docker Compose configuration for running vLLM, llama.cpp, Qdrant, and PostgreSQL under rootless Podman with network isolation.

These articles use AI-generated summaries of Obsidian notes originally kept as technical memos.

English translations are produced with AI assistance.

Introduction

When operating Large Language Models locally, spinning up a standalone inference server like vLLM or llama.cpp is relatively straightforward. However, the moment you attempt to integrate Retrieval-Augmented Generation (RAG) or workflow pipelines, the architecture quickly becomes convoluted. Through trial and error, I found that the best approach is to avoid overloading the LLM server itself with additional responsibilities, and instead control the request flow via a frontend proxy API.

In this article, I will share the minimum viable Docker Compose configuration I use to integrate a custom OpenAI-compatible Proxy API, Qdrant, PostgreSQL, vLLM, and llama.cpp. This setup is designed specifically for a GPU-enabled environment using rootless Podman.

Background and Architectural Goals

When building an LLM platform, baking RAG logic or job management directly into the inference server makes it extremely difficult to swap out models or scale the system later. To solve this, I decoupled the system based on the following principles:

Backend (Inference): vLLM and llama.cpp function purely as “OpenAI-compatible inference endpoints.” They do nothing else.
Frontend (Proxy): A custom Proxy API written in Rust (Axum) intercepts all requests. It handles RAG searches (using Qdrant and PostgreSQL) and triggers Dagster pipelines via NATS before optionally forwarding the prompt to the backend.
Data Stack: Qdrant is dedicated solely to vector storage, while PostgreSQL manages metadata, job queues, and document chunk pairing.

The docker-compose.yml presented below is the manifestation of this “separation of inference and control.”

Container Configuration (docker-compose.yml)

I deploy this stack at /opt/containers/compose/llm-stack/docker-compose.yml. For security, I have hardened the configuration by applying cap_drop: ["ALL"] and no-new-privileges:true to every container, and restricted writable directories using tmpfs. A key architectural feature is the strict separation of internal networks into db_net and llm_net.

  version: "3.9"

name: llm-stack

networks:
  db_net:
    driver: bridge
    internal: true
  llm_net:
    driver: bridge
    internal: true

volumes:
  pg_data:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /mnt/data/postgres/data
  qdrant_data:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /mnt/data/qdrant/data
  api_data:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /mnt/data/llm-api
  vllm_models:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /mnt/data/models/vllm
  llama_models:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /mnt/data/models/llama

services:
  postgres:
    image: postgres:17
    container_name: pg
    restart: always
    command: ["postgres","-c","max_connections=300","-c","shared_buffers=4GB","-c","wal_compression=on"]
    environment:
      POSTGRES_USER: ${PG_USER:-loft}
      POSTGRES_PASSWORD: ${PG_PASSWORD:-change_me}
      POSTGRES_DB: ${PG_DB:-loftdb}
    healthcheck:
      test: ["CMD-SHELL","pg_isready -U $$POSTGRES_USER -d $$POSTGRES_DB -h 127.0.0.1"]
      interval: 10s
      timeout: 3s
      retries: 10
    networks: [db_net]
    volumes:
      - pg_data:/var/lib/postgresql/data:rw
    read_only: true
    tmpfs:
      - /tmp:rw,nosuid,nodev,noexec,size=256m
      - /var/run/postgresql:rw,mode=775
    security_opt: ["no-new-privileges:true"]
    cap_drop: ["ALL"]
    ulimits:
      nofile: 262144

  qdrant:
    image: qdrant/qdrant:latest
    restart: always
    environment:
      QDRANT__SERVICE__GRPC_PORT: 6334
      QDRANT__STORAGE__WAL_MEMORY_CAPACITY: "33554432"
      QDRANT__STORAGE__OPTIMIZERS__DEFAULT_SEGMENT_NUMBER: "2"
    healthcheck:
      test: ["CMD","/qdrant/tools/healthcheck.sh"]
      interval: 10s
      timeout: 3s
      retries: 10
    networks: [db_net]
    volumes:
      - qdrant_data:/qdrant/storage:rw
    ports:
      - "6333:6333"
      - "6334:6334"
    read_only: false
    security_opt: ["no-new-privileges:true"]
    cap_drop: ["ALL"]
    ulimits:
      nofile: 262144

  vllm:
    image: vllm/vllm-openai:latest
    restart: always
    command:
      [
        "python","-m","vllm.entrypoints.openai.api_server",
        "--model","${VLLM_MODEL:-/models}",
        "--host","0.0.0.0",
        "--port","8000",
        "--tensor-parallel-size","${VLLM_TP:-1}",
        "--max-num-seqs","${VLLM_MAX_SEQS:-32}"
      ]
    environment:
      NVIDIA_VISIBLE_DEVICES: "all"
    devices:
      - "nvidia.com/gpu=all"
    healthcheck:
      test: ["CMD","curl","-sf","http://127.0.0.1:8000/health"]
      interval: 10s
      timeout: 5s
      retries: 20
    networks: [llm_net]
    volumes:
      - vllm_models:/models:ro
    ports:
      - "8000:8000"
    security_opt: ["no-new-privileges:true"]
    cap_drop: ["ALL"]
    ulimits:
      nofile: 262144

  llamacpp:
    image: ghcr.io/ggerganov/llama.cpp:full
    restart: always
    command:
      [
        "server",
        "-m","/models/${LLAMA_MODEL:-model.gguf}",
        "--host","0.0.0.0",
        "--port","8080",
        "--mlock",
        "--no-mmap",
        "--ctx-size","${LLAMA_CTX:-8192}",
        "--batch-size","${LLAMA_BATCH:-512}",
        "--embedding"
      ]
    environment:
      NVIDIA_VISIBLE_DEVICES: "all"
    devices:
      - "nvidia.com/gpu=all"
    healthcheck:
      test: ["CMD","curl","-sf","http://127.0.0.1:8080/health"]
      interval: 10s
      timeout: 5s
      retries: 20
    networks: [llm_net]
    volumes:
      - llama_models:/models:ro
    ports:
      - "8080:8080"
    security_opt: ["no-new-privileges:true"]
    cap_drop: ["ALL"]
    ulimits:
      nofile: 262144

  openai-proto-api:
    image: ghcr.io/your-org/openai-proto-api:latest
    restart: always
    environment:
      VLLM_BASE_URL: http://vllm:8000
      LLAMA_BASE_URL: http://llamacpp:8080
      QDRANT_URL: http://qdrant:6333
      QDRANT_API_KEY: ${QDRANT_API_KEY:-}
      PGHOST: postgres
      PGPORT: 5432
      PGUSER: ${PG_USER:-loft}
      PGPASSWORD: ${PG_PASSWORD:-change_me}
      PGDATABASE: ${PG_DB:-loftdb}
      API_PORT: 9000
      API_KEY: ${API_KEY:-change_me}
    depends_on:
      - qdrant
      - postgres
      - vllm
    healthcheck:
      test: ["CMD","curl","-sf","http://127.0.0.1:9000/healthz"]
      interval: 10s
      timeout: 5s
      retries: 30
    networks:
      - llm_net
      - db_net
    volumes:
      - api_data:/var/lib/llm-api
    ports:
      - "9000:9000"
    read_only: true
    tmpfs:
      - /tmp:rw,nosuid,nodev,noexec,size=256m
    security_opt: ["no-new-privileges:true"]
    cap_drop: ["ALL"]
    ulimits:
      nofile: 262144

Environment Variables (.env)

Sensitive or dynamic configuration values are kept out of the compose file. Place this .env file at /opt/containers/compose/llm-stack/.env.

  PG_USER=loft
PG_PASSWORD=change_me
PG_DB=loftdb
QDRANT_API_KEY=
API_KEY=change_me

VLLM_MODEL=/models
VLLM_TP=1
VLLM_MAX_SEQS=32

LLAMA_MODEL=model.gguf
LLAMA_CTX=8192
LLAMA_BATCH=512

Setup and Startup Procedure

Here is the initial setup procedure to get the stack running. Pay special attention to adjusting the directory permissions for PostgreSQL.

  mkdir -p /mnt/data/{postgres/data,qdrant/data,models/vllm,models/llama,llm-api}
podman unshare chown -R 999:999 /mnt/data/postgres/data
cd /opt/containers/compose/llm-stack
podman-compose up -d

Because the Postgres container operates internally as uid=999, we must use podman unshare chown to align the ownership of the host’s mount directory. If you skip this step, the database initialization phase will fail with a permission error, and the container will not start.

GPU Allocation

In the service definitions for vLLM and llama.cpp, GPU access is universally defined using devices: ["nvidia.com/gpu=all"]. This assumes that the host environment has already been configured with the Container Device Interface (CDI), specifically by running nvidia-ctk runtime configure --runtime=crun beforehand.

Systemd Integration (Auto-start)

To ensure the services reliably restart after a server reboot, we can leverage Podman’s built-in functionality to register them as systemd user services.

  podman generate systemd --files --name llm-stack_openai-proto-api
podman generate systemd --files --name llm-stack_vllm
podman generate systemd --files --name llm-stack_llamacpp
podman generate systemd --files --name llm-stack_qdrant
podman generate systemd --files --name llm-stack_postgres
mkdir -p ~/.config/systemd/user
mv *.service ~/.config/systemd/user/
systemctl --user daemon-reload
systemctl --user enable --now llm-stack_*.service
loginctl enable-linger ksh3

By setting loginctl enable-linger, we ensure that the user’s processes continue to run in the background even after they log out of the host.

Results and Discussion

The most significant advantage of this configuration is that only the API container (openai-proto-api) is attached to both the db_net and llm_net networks.

Because the inference servers (vLLM and llama.cpp) are strictly confined to llm_net, there is zero risk of them accidentally accessing the databases. Conversely, Qdrant and Postgres cannot reach external networks or the inference endpoints. The Proxy API sits at the forefront as an OpenAI-compatible interface, accepts client requests, and routes them to either vLLM or llama.cpp based on the requested model name. The initial requirement is implemented while maintaining secure, distinct network boundaries.

Future Work

Currently, container log monitoring is not included directly in this Compose setup; the intention is to collect stdout or journal logs using promtail. Moving forward, I plan to expand the metadata processing capabilities on the Proxy side to further improve RAG accuracy.

Rebuilding a Compute Server in 20-30 Minutes with tar.zst and rclone

A backup and recovery design …

How I Split rclone and rsync When Moving Hugging Face Models from Cold to Hot Storage

A transfer procedure that …