
How I Stabilized smartctl-exporter and Standardized Exporter Operations Across Rootful and Rootless Scopes

Establishing a consistent rootful/rootless placement rule for monitoring exporters while stabilizing smartctl-exporter v0.14.0 with NVMe workarounds and daemon-reload discipline.

These articles use AI-generated summaries of Obsidian notes originally kept as technical memos.

English translations are produced with AI assistance.

Introduction

Once I had more than a few exporters in the monitoring stack, the real problem was no longer just getting them to run. I needed a consistent rule for what belongs in a rootful scope, what should stay rootless, and how to keep those services predictable when something breaks. This note became the place where I normalized that operating model while stabilizing smartctl-exporter.

The immediate trigger was the earlier node_exporter rollout across multiple machines. That work already established the Prometheus side of the pattern: one central Prometheus instance scraping several hosts. The next step was handling exporters closer to hardware, especially smartctl-exporter, without turning systemd and compose management into a mess.

Background and Motivation

The previous node_exporter note was straightforward: add host metrics on each machine and scrape them centrally from the Storage server. That part worked. What was still inconsistent was the service placement model behind those exporters.

Two recurring problems kept showing up. The first was cross-scope ambiguity between rootful and rootless services. The second was forgetting daemon-reload after editing units or compose files, which made it hard to tell whether a change had really been applied. On top of that, smartctl-exporter v0.14.0 could not use --smartctl.device-opts, so NVMe behavior needed an explicit workaround.

Instead of solving those issues one by one every time, I decided to capture them as a reusable operating pattern: where exporters belong, how dependencies should be treated, and what commands should always follow an edit.

Directory Standard

I started by fixing the directory layout. Without that, service files and compose files drift over time and every new exporter gets its own exception.

# base layout for both scopes (created by the administrator)
sudo mkdir -p /opt/containers/{compose,systemd}/{rootful,rootless}
sudo mkdir -p /opt/prometheus/exporters/smartctl/bin

The point was to keep the hierarchy legible. Host-facing components such as smartctl-exporter go to the rootful side. Application services such as prometheus and loki stay on the rootless side. That decision removes a lot of repeated judgment later.

systemd (Rootful)

For smartctl-exporter, I aligned on the same one-service-per-stack oneshot pattern I was already using elsewhere.

  # /etc/systemd/system/smartctl-exporter.service
[Unit]
Description=smartctl-exporter (rootful)
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
WorkingDirectory=/opt/containers/compose/rootful
ExecStart=/usr/bin/podman compose -f smartctl-exporter.yml up -d
ExecStop=/usr/bin/podman compose -f smartctl-exporter.yml down
TimeoutStopSec=30s

[Install]
WantedBy=multi-user.target

The activation flow is standard:

  sudo systemctl daemon-reload
sudo systemctl enable --now smartctl-exporter.service

I kept this exporter in system scope for a reason. Services that touch devices or depend on elevated privileges are easier to reason about when they are explicitly rootful. Forcing them into a rootless model tends to save nothing and adds confusion later.
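One consequence of Type=oneshot with RemainAfterExit=yes is that systemd reports the unit as active once the compose command returns, even if the container dies later. A minimal sketch of a direct container check, assuming the container name from this note; `container_running` and `check_exporter` are hypothetical helper names:

```shell
#!/bin/sh
# True only when podman reports the container state as "running".
container_running() {
  [ "$1" = "running" ]
}

# Ask podman for the rootful container's state instead of trusting the
# unit status. check_exporter is a hypothetical helper name.
check_exporter() {
  state=$(sudo podman inspect --format '{{.State.Status}}' smartctl-exporter 2>/dev/null) || state=""
  if container_running "$state"; then
    echo "smartctl-exporter: container running"
  else
    echo "smartctl-exporter: unit may be active, but container state is '${state:-missing}'" >&2
    return 1
  fi
}
```

Running `check_exporter` after `systemctl status` shows whether the two views agree.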

Compose (Rootful / `smartctl-exporter` v0.14.0)

This was the core of the note. Because smartctl-exporter v0.14.0 does not support --smartctl.device-opts, I added a wrapper that always injects -T permissive.

  sudo tee /opt/prometheus/exporters/smartctl/bin/smartctl-wrapper >/dev/null <<'SH'
#!/bin/sh
case " $* " in
  *"/dev/nvme"*) exec /usr/sbin/smartctl -T permissive -d nvme "$@";;
  *)             exec /usr/sbin/smartctl -T permissive "$@";;
esac
SH
sudo chmod +x /opt/prometheus/exporters/smartctl/bin/smartctl-wrapper

The wrapper adds -T permissive to every smartctl call and additionally forces -d nvme whenever an argument contains /dev/nvme, so the NVMe handling lives in one visible script. It is a cleaner workaround than trying to hide the behavior entirely in exporter arguments.
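To see which arguments the wrapper would pass without touching a real device, the same case logic can be turned into a function that prints the command line instead of exec'ing it. This is only a sketch: `wrapper_argv` is a hypothetical name, and the sample arguments are illustrative rather than captured from the exporter.

```shell
#!/bin/sh
# Dry-run version of the wrapper's case logic: prints the command line
# instead of exec'ing smartctl. wrapper_argv is a hypothetical name.
wrapper_argv() {
  case " $* " in
    *"/dev/nvme"*) echo "/usr/sbin/smartctl -T permissive -d nvme $*" ;;
    *)             echo "/usr/sbin/smartctl -T permissive $*" ;;
  esac
}

# Illustrative calls: one NVMe path, one non-NVMe path.
wrapper_argv --json --info /dev/nvme0
wrapper_argv --json --info /dev/sda
```

Only the first call should gain the extra -d nvme, which is the whole point of the case match.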

The compose definition looks like this:

  # /opt/containers/compose/rootful/smartctl-exporter.yml
services:
  smartctl-exporter:
    image: docker.io/prometheuscommunity/smartctl-exporter:v0.14.0
    container_name: smartctl-exporter
    user: root
    privileged: true
    restart: unless-stopped
    ports:
      - "9633:9633"
    devices:
      - /dev/nvme0:/dev/nvme0
      - /dev/nvme1:/dev/nvme1
    volumes:
      - /opt/prometheus/exporters/smartctl/bin/smartctl-wrapper:/usr/local/bin/smartctl-wrapper:ro
    command:
      - --web.listen-address=:9633
      - --smartctl.path=/usr/local/bin/smartctl-wrapper
      - --smartctl.interval=60s
      - --smartctl.timeout=3s
      - --smartctl.retries=0
      - --smartctl.device=/dev/nvme0
      - --smartctl.device=/dev/nvme1

The operational detail that matters most here is --smartctl.timeout=3s with --smartctl.retries=0. If /metrics hangs, the problem stops being a Prometheus scrape issue and becomes an exporter availability issue. Failing fast is better than holding the whole path hostage.
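The same fail-fast idea can be applied to manual checks, so a hung /metrics endpoint does not also hang the terminal. A minimal sketch, assuming a 5-second budget that is not taken from the original setup; `classify_probe` and `probe_metrics` are hypothetical helper names, while exit codes 7 and 28 are curl's documented "failed to connect" and "operation timed out":

```shell
#!/bin/sh
# Map curl's exit status to a short diagnosis.
classify_probe() {
  case "$1" in
    0)  echo "ok" ;;
    7)  echo "not listening" ;;
    28) echo "hung (timed out)" ;;
    *)  echo "other failure (curl exit $1)" ;;
  esac
}

# Probe a URL with a hard time limit so the check itself cannot hang.
# The 5-second limit is an assumption made for this sketch.
probe_metrics() {
  curl -s -o /dev/null --max-time 5 "$1"
  classify_probe "$?"
}
```

A call such as `probe_metrics http://127.0.0.1:9633/metrics` then separates "exporter is down" from "exporter is up but blocked on a device".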

For a quick validation, I used:

sudo systemctl restart smartctl-exporter.service
curl -s --max-time 10 127.0.0.1:9633/metrics | grep 'smartctl_device{device="nvme'

If nvme0 still blocks, I first fall back to nvme1 only:

  # temporarily limit to nvme1
command:
  - --web.listen-address=:9633
  - --smartctl.path=/usr/local/bin/smartctl-wrapper
  - --smartctl.device=/dev/nvme1
  - --smartctl.interval=60s
  - --smartctl.timeout=3s
  - --smartctl.retries=0

That is not a compromise in quality. It is a proper recovery path. In monitoring, restoring observability first and bringing the problematic device back later is usually the right trade.
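When a device is excluded temporarily, generating the device arguments from a list makes it harder to forget a line when editing by hand. A small sketch under that assumption; `device_args` is a hypothetical helper, and its output is meant to be pasted into the compose command list:

```shell
#!/bin/sh
# Print one "--smartctl.device=..." compose argument per device, skipping
# any device named in the second argument. device_args is a hypothetical helper.
device_args() {
  all_devices=$1
  skip_devices=$2
  for dev in $all_devices; do
    case " $skip_devices " in
      *" $dev "*) continue ;;
    esac
    printf '  - --smartctl.device=/dev/%s\n' "$dev"
  done
}

# Recovery mode: keep nvme1, leave nvme0 out for now.
device_args "nvme0 nvme1" "nvme0"
```

Calling it with an empty skip list restores the full device set once nvme0 behaves again.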

systemd (Rootless) Template

I also standardized the rootless side. The mktxp.service example captures the intended pattern for user-level compose stacks.

  # /opt/containers/systemd/rootless/mktxp.service
[Unit]
Description=Podman Compose Stack for mktxp
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
WorkingDirectory=/opt/prometheus/exporters/mktxp/compose.d
ExecStart=/usr/bin/podman-compose up -d
ExecStop=/usr/bin/podman-compose down
TimeoutStopSec=30s

[Install]
WantedBy=default.target

Enable it in user scope:

  systemctl --user daemon-reload
systemctl --user enable --now /opt/containers/systemd/rootless/mktxp.service
loginctl enable-linger "$USER"

loginctl enable-linger is part of the real operating model here. If I want rootless services to behave as long-running infrastructure, I need to make that explicit.
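Whether lingering is actually on can be confirmed from loginctl itself, which prints `Linger=yes` or `Linger=no` for a user. A minimal sketch; `linger_enabled` and `check_linger` are hypothetical helper names:

```shell
#!/bin/sh
# Interpret the "Linger=yes|no" line that `loginctl show-user` prints.
linger_enabled() {
  [ "$1" = "Linger=yes" ]
}

# Check the current user; requires systemd-logind on the host.
check_linger() {
  if linger_enabled "$(loginctl show-user "$USER" --property=Linger)"; then
    echo "linger enabled for $USER"
  else
    echo "linger NOT enabled for $USER" >&2
    return 1
  fi
}
```

Running `check_linger` after a reboot is a quick way to catch a host where the user services would stop at logout.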

Prometheus Scrape Configuration

On the Prometheus side, things are intentionally simple. Once smartctl-exporter is available on :9633, scraping it is just another job.

scrape_configs:
  - job_name: "smartctl"
    scrape_interval: 60s
    scrape_timeout: 10s
    static_configs:
      - targets: ["storage-server:9633"]

This matches the earlier node_exporter rollout logic. Prometheus remains the central pull-based layer. The tricky part is not scraping, but stabilizing the exporter so the scrape path stays available.

Grafana Queries for nvme0 and nvme1

On the dashboard side, I wanted to avoid stale hard-coded device filters.

# device info
smartctl_device{device=~"nvme.*"}
# temperature example
smartctl_temperature_celsius{device=~"nvme.*"}
# health
smartctl_health_ok{device=~"nvme.*"}

If I leave a query pinned to something like device="nvme1", I can fix the exporter and still keep looking at only half the data. Cleaning up the Grafana query is part of the exporter change, not a separate cosmetic step.
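Before adjusting a dashboard, it helps to confirm which devices the exporter is really reporting. A minimal sketch that extracts the distinct NVMe device labels from /metrics output on stdin; `nvme_devices_seen` is a hypothetical helper name, and it assumes the `smartctl_device{device="nvme…"}` line shape used in the validation command earlier in this note:

```shell
#!/bin/sh
# List the distinct NVMe device labels found on smartctl_device lines
# read from stdin. nvme_devices_seen is a hypothetical helper name.
nvme_devices_seen() {
  sed -n 's/^smartctl_device{[^}]*device="\(nvme[0-9]*\)".*/\1/p' | sort -u
}
```

Typical use would be `curl -s --max-time 5 127.0.0.1:9633/metrics | nvme_devices_seen`. If only nvme1 comes back, the gap is in the exporter rather than in Grafana.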

Common Failure Points and Fast Recovery Commands

These are the commands I wanted immediately available during recovery:

# /metrics hangs -> likely blocked on nvme0. Exclude it first and restore access.

# container stuck in Stopping (rootful container, so use sudo)
sudo podman rm -f smartctl-exporter || true

# name collision
sudo podman run --replace -d --name smartctl-exporter ...

# check whether 9633 is listening
sudo ss -ltnp | grep :9633

# inspect device discovery and failures (the exporter logs to stderr)
sudo podman logs smartctl-exporter 2>&1 | grep 'Number of devices found'
sudo podman logs smartctl-exporter 2>&1 | grep -E 'readjson|Invalid Log Page|device not found|Listening on'

The point is not elegance. It is reducing time-to-recovery. Monitoring systems benefit more from quick restoration of visibility than from perfect postmortem discipline in the first five minutes.

How I Treat Dependencies Between Rootless and Rootful Services

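This became the most important part of the operating rule set. systemd behaves predictably inside a single scope: system-to-system or user-to-user. The system manager and each user manager are separate instances, so a user unit cannot order itself against a system unit with After= or Requires=. Once I try to model strict dependencies across rootful and rootless boundaries, the result becomes harder to understand than the problem itself.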

So I settled on a clear rule:

Keep dependencies loose and rely on network retries.
Use After=, Wants=, Requires=, and PartOf= only inside the same scope.
Prevent port conflicts up front with a fixed port table.

That logic works for promtail -> loki and for prometheus -> exporters. Both are naturally retry-driven systems. They do not need heavyweight cross-scope orchestration to recover.
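The fixed port table is easy to check mechanically. A minimal sketch, assuming the table is kept as a plain text file with one "service port" pair per line; that format and the name `duplicate_ports` are assumptions for illustration, not part of the original setup:

```shell
#!/bin/sh
# Report ports that appear more than once in a "service port" table on stdin.
# Lines starting with # and lines with fewer than two fields are ignored.
duplicate_ports() {
  awk 'NF >= 2 && $1 !~ /^#/ { count[$2]++; names[$2] = names[$2] " " $1 }
       END { for (p in count) if (count[p] > 1) print p ":" names[p] }' | sort
}
```

With a hypothetical file name, `duplicate_ports < ports.txt` prints nothing when the table is clean and one line per conflicting port otherwise.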

Making `daemon-reload` Hard to Forget

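The other recurring failure was forgetting daemon-reload. Strictly speaking, systemd only needs it after a unit file changes; a compose edit takes effect when the stack is restarted. I did not want to rely on memory for that distinction, so I run it after touching either kind of file.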

So I added a tiny helper:

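sudo tee /usr/local/bin/sdreload >/dev/null <<'SH'
#!/bin/sh
# Run as the regular user: sudo covers the system manager,
# the plain call covers this user's manager.
sudo systemctl daemon-reload || echo "[warn] system daemon-reload failed" >&2
systemctl --user daemon-reload || echo "[warn] user daemon-reload failed" >&2
echo "[done] daemon-reload (system & user)"
SH
sudo chmod +x /usr/local/bin/sdreload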

And I paired it with a fixed post-change routine:

# rootful (system scope)
sdreload
sudo systemctl restart smartctl-exporter.service
systemctl status smartctl-exporter.service -n 30

# rootless (user scope)
sdreload
systemctl --user restart loki.service promtail.service
systemctl --user status loki.service promtail.service -n 30

This is not sophisticated automation, but it removes a very common class of human error. In practice that is more valuable than a clever setup that nobody follows consistently.
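systemd can also report when a reload has been forgotten: `systemctl show` exposes a NeedDaemonReload property that turns to yes when a unit file changed on disk after it was loaded. A minimal sketch built on that property; `needs_reload` and `check_reload` are hypothetical helper names:

```shell
#!/bin/sh
# Interpret the "NeedDaemonReload=yes|no" line from `systemctl show`.
needs_reload() {
  [ "$1" = "NeedDaemonReload=yes" ]
}

# Warn if a unit file changed on disk since systemd last loaded it.
# Pass --user as the second argument for user-scope units.
check_reload() {
  unit=$1
  scope=${2:-}
  # $scope is intentionally unquoted so an empty value adds no argument.
  if needs_reload "$(systemctl $scope show "$unit" --property=NeedDaemonReload)"; then
    echo "$unit: unit file changed, run sdreload" >&2
    return 1
  fi
  echo "$unit: loaded definition is current"
}
```

For example, `check_reload smartctl-exporter.service` covers the system scope and `check_reload loki.service --user` covers the user scope.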

Best-Practice Placement for Rootful and Rootless Services

I reduced the placement rule to something very simple:

Rootful: anything that touches the kernel or devices
Rootless: application-layer services

In concrete terms:

Rootful: smartctl-exporter, node_exporter
Rootless: loki, prometheus, promtail, grafana

This is about more than permissions. It also makes troubleshooting paths obvious. If a host-facing exporter breaks, I know I am in system scope. If an application service breaks, I stay in the user scope.
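The placement list can be encoded once so that the right systemctl form is never a matter of recall. A minimal sketch using only the service names listed in this section; `scope_for` and `systemctl_for` are hypothetical helper names:

```shell
#!/bin/sh
# Map a service name to its scope, following the placement list in this note.
# Unknown names are reported rather than guessed.
scope_for() {
  case "$1" in
    smartctl-exporter|node_exporter|node-exporter) echo "rootful" ;;
    loki|prometheus|promtail|grafana)              echo "rootless" ;;
    *) echo "unknown"; return 1 ;;
  esac
}

# Print the systemctl invocation that matches the service's scope.
systemctl_for() {
  case "$(scope_for "$1")" in
    rootful)  echo "sudo systemctl" ;;
    rootless) echo "systemctl --user" ;;
    *)        return 1 ;;
  esac
}
```

A new exporter then needs exactly one new entry in `scope_for`, which mirrors the intent of deciding placement once.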

Final Operating Rules

The final rules are deliberately simple:

Do not create hard dependencies across rootful and rootless scopes
Always run sdreload after touching service or compose definitions
Keep a fixed port design and avoid duplicates
Treat host-facing exporters as rootful and application services as rootless by default

These rules fit naturally with the earlier node_exporter expansion work. Prometheus can keep gaining scrape targets without forcing me to renegotiate the ownership model for every new exporter.

Results

What I got out of this was not just a workaround for smartctl-exporter. I ended up with a clearer operating standard for the monitoring stack as a whole: where services live, how they should restart, and where dependencies should stop.

That consistency matters. It shortens recovery, reduces explanation overhead, and makes new exporters easier to add without rebuilding the logic every time.

Future Work

The next step is to keep rewriting the existing *.service and compose/*.yml files toward this standard. The goal is to reduce them to the smallest non-conflicting set of dependencies and ports.

For smartctl-exporter specifically, I will keep using the safer sequence: stabilize with nvme1, reintroduce nvme0 later, then align the Prometheus and Grafana side so the recovered device set is visible everywhere.
