
State-Driven Syslog Monitoring with MikroTik RouterOS Netwatch

Implementing Netwatch-based Syslog server liveness monitoring on MikroTik RouterOS with automatic switching between UDP delivery and local NAND buffering.

These articles use AI-generated summaries of Obsidian notes originally kept as technical memos.

English translations are produced with AI assistance.

Introduction

In this post, I will explain the step-by-step process of building a fundamental network infrastructure on a MikroTik router (including DNS DoH, VLAN isolation, DHCP, and firewall rules). Furthermore, I will detail how I utilized the Netwatch feature to create a mechanism that automatically switches between sending logs and stashing them locally, depending on the availability of the Syslog server. This update reflects the more operational side of the setup: using a CRS304 as both the local 10GbE switch and the router, while preserving firewall counters even when the Linux logging host is unavailable.

In small to medium-sized network environments, a single router is often tasked with integrated management – from routing and security to log monitoring. Therefore, alongside building a secure VLAN-isolated environment, I decided to implement a failsafe mechanism to ensure that no logs are lost even when the Syslog server goes down.

Background and Motivation

MikroTik RouterOS provides exceptionally flexible and powerful routing and firewall capabilities. When building the network, I needed to satisfy the following requirements:

Security and Privacy: Encrypt DNS queries (DoH) and completely prohibit plaintext communication on UDP/TCP port 53.
Network Isolation: Create VLANs tailored to different purposes (e.g., vlan10, vlan20, vlan30) and control routing to prevent mutual interference.
Reliable Monitoring and Log Collection: Capture statistics of unauthorized access and packet drops, and send them to a Linux Syslog server. However, if the Syslog server is down, these logs should be pooled internally (on the router’s NAND) so they can be retrieved after recovery.

I also wanted to avoid a design where the scheduler itself gets enabled and disabled on every state change. Instead, I kept one five-minute job always running and let Netwatch do only one thing: flip a state flag. That separation made the behavior easier to reason about during failures.

Below are the specific configuration steps I took to achieve these goals.

Where This Fits in the Overall Topology

The larger topology here is a small CRS304-based network: Port1 is WAN, Port2 is a Mac, Port3 is a Linux server, Port4 is a Wi-Fi AP, and Port5 is a console port. The wired Mac and Linux links are meant primarily for local 10GbE traffic, while internet-bound access and client isolation are handled across separate segments and VLAN policies.

In that kind of design, RouterOS is not just the router. It is also the boundary between local-only traffic, WAN-facing policy, and operational visibility. That is why it made sense to keep a lightweight fallback path for drop counters directly on the router when the Linux syslog destination is down.

Main Configuration: Building the Network Infrastructure

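I started with the basic WAN-side configuration: I renamed ether1 to WAN and enabled the DHCP client on it. Note that the firewall examples later in this post still refer to this port as ether1 (or by the placeholder WAN_IF). RouterOS matches interfaces by their current name, so substitute whichever name the port actually has on your router.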

/interface ethernet set ether1 name=WAN
/ip dhcp-client add interface=WAN disabled=no

1. DNS (Enabling DoH)

For DNS, I configured the router so that it never sends plaintext requests to ISP or public DNS servers; all of its queries go through DoH (DNS over HTTPS).

/ip dns set allow-remote-requests=yes \
    servers=1.1.1.1,1.0.0.1 \
    use-doh-server=https://1.1.1.1/dns-query \
    verify-doh-cert=yes

With the firewall settings I will describe later, outbound UDP/TCP port 53 is strictly prohibited, meaning the router itself only utilizes DoH. Client devices are configured to reference their respective VLAN gateway IPs as their DNS servers.

2. VLAN and Bridge Setup

To separate the network by purpose, I created vlan10, vlan20, and vlan30 and assigned them to their respective ports. ether4 is assigned to vlan10, ether2 to vlan20, and ether3 to vlan30.

/interface vlan
add name=vlan10 interface=bridge vlan-id=10
add name=vlan20 interface=bridge vlan-id=20
add name=vlan30 interface=bridge vlan-id=30

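/interface bridge vlan
# Access ports: the bridge itself is tagged, each edge port leaves untagged to match its pvid
add bridge=bridge tagged=bridge untagged=ether4 vlan-ids=10
add bridge=bridge tagged=bridge untagged=ether2 vlan-ids=20
add bridge=bridge tagged=bridge untagged=ether3 vlan-ids=30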

/interface bridge port
set [find interface=ether4] pvid=10
set [find interface=ether2] pvid=20
set [find interface=ether3] pvid=30

/ip address
add address=192.168.10.1/24 interface=vlan10
add address=192.168.20.1/24 interface=vlan20
add address=192.168.30.1/24 interface=vlan30

3. DHCP Server Configuration

For this example, I configured the router to distribute IP addresses via DHCP only to vlan10.

/ip pool add name=pool10 ranges=192.168.10.100-192.168.10.200
/ip dhcp-server
add name=dhcp10 interface=vlan10 address-pool=pool10 disabled=no
/ip dhcp-server network
add address=192.168.10.0/24 gateway=192.168.10.1 dns-server=192.168.10.1

4. Firewall (Filters)

The firewall rules focus on isolating the VLANs, blocking unsolicited external access, and permitting management access only from the management host on vlan88 (192.168.88.2). In the forward chain, traffic between vlan10 and vlan20 is explicitly dropped, and only new connections from the LAN to the WAN are permitted; everything else falls through to the final drop rule.

/ip firewall filter
# input
add chain=input connection-state=invalid action=drop comment="Drop invalid input"
add chain=input connection-state=established,related action=accept comment="Allow est/rel input"
add chain=input protocol=udp in-interface=ether1 src-port=67 dst-port=68 action=accept comment="Allow DHCP from ISP"
add chain=input protocol=tcp src-address=192.168.88.2 dst-port=22,8291,8728,8729 action=accept comment="Mgmt from vlan88"
add chain=input protocol=udp src-address=192.168.88.2 dst-port=161 action=accept comment="SNMP from vlan88"
add chain=input protocol=tcp src-address=192.168.88.2 dst-port=80,443 action=accept comment="Web mgmt from vlan88"
add chain=input protocol=icmp src-address=192.168.88.2 action=accept comment="ICMP from vlan88"
add chain=input action=drop log=yes log-prefix="IN_DROP:"

# forward
add chain=forward connection-state=invalid action=drop comment="Drop invalid forward"
add chain=forward action=fasttrack-connection connection-state=established,related hw-offload=yes comment="FastTrack"
add chain=forward connection-state=established,related action=accept comment="Allow est/rel forward"
add chain=forward in-interface=vlan10 out-interface=vlan20 action=drop comment="Isolate vlan10->vlan20"
add chain=forward in-interface=vlan20 out-interface=vlan10 action=drop comment="Isolate vlan20->vlan10"
add chain=forward connection-state=new out-interface=ether1 action=accept comment="LAN->WAN new"
add chain=forward action=drop comment="Drop remaining forward"

# output
add chain=output connection-state=established,related action=accept comment="Allow est/rel output"
add chain=output protocol=udp out-interface=ether1 src-port=68 dst-port=67 action=accept comment="Allow DHCP client"
add chain=output protocol=udp out-interface=ether1 dst-port=123 action=accept comment="Allow NTP"
add chain=output protocol=icmp out-interface=ether1 action=accept comment="Allow ping"
add chain=output protocol=tcp out-interface=ether1 dst-port=80,443 action=accept comment="Allow HTTP/HTTPS"
add chain=output protocol=udp dst-address=192.168.88.2 dst-port=514 action=accept comment="Syslog"
add chain=output protocol=tcp dst-address=192.168.88.2 dst-port=3100 action=accept comment="Loki ingest"
add chain=output action=drop comment="Drop remaining output"

Implementing State-Driven Syslog and Five-Minute Counter Snapshots

The part I put the most effort into this time is the logging and monitoring mechanism for statistical data (such as drop counters). Normally, logs are constantly forwarded via UDP to the Syslog server (Linux). However, I did not want to lose the records of drops or attacks that occurred while the Syslog server was down for maintenance or due to a fault.

Therefore, I built a system that uses Netwatch to monitor the liveness of the Syslog server and automatically switches the log transmission destination based on the state (UP/DOWN). The important implementation detail is that Netwatch does not own the processing logic. It only updates the linux_up flag. The actual work is centralized in the five-minute counter_tick script, which branches between UDP syslog delivery plus counter reset in the UP case, and NAND append-only buffering in the DOWN case.

Base Logging and Counter DROP Setup

First, I added counter rules to measure packet drops (applying a light rate limit to prevent log flooding).

/system logging action set memory memory-lines=8
/system logging disable [find action=disk]

/system logging action
add name=to-syslog target=remote remote=LINUX_IP remote-port=514 src-address=MGMT_IP bsd-syslog=yes

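# The COUNTER line is written from a script, and script log messages are tagged
# script,info, so that topic has to be forwarded to syslog as well
/system logging
add topics=system,info action=to-syslog
add topics=script,info action=to-syslog
add topics=firewall,info action=to-syslog
add topics=ssh,warning action=to-syslog
add topics=firewall,info action=memory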

/ip firewall filter
add chain=input in-interface=WAN_IF protocol=tcp dst-port=22 connection-state=new \
    action=drop comment="CNT_SSH_IN" log=yes log-prefix="SSH_FAIL " limit=20/1m,40
add chain=input in-interface=WAN_IF connection-state=new action=drop \
    comment="CNT_DROP_IN" log=yes log-prefix="DROP_IN " limit=30/1m,60
add chain=forward in-interface=WAN_IF connection-state=new action=drop \
    comment="CNT_DROP_FW" log=yes log-prefix="DROP_FW " limit=30/1m,60
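
One consequence of these rules is worth spelling out. In RouterOS, limit is a matcher, so packets above the configured rate do not match the counter rule at all: they fall through to the later drop rules and are not counted here. The sketch below estimates the rough ceiling that each counter can reach in one five-minute window, treating the limit as a simple token bucket (rate times window, plus burst). The function name max_counted is hypothetical, and the result is an approximation rather than an exact RouterOS guarantee.

```shell
#!/usr/bin/env bash
# max_counted RATE_PER_MIN BURST WINDOW_MIN
# Rough upper bound on the packets one rate-limited rule can count in a window.
max_counted() {
  local rate_per_min=$1 burst=$2 window_min=$3
  echo $(( rate_per_min * window_min + burst ))
}

max_counted 20 40 5   # CNT_SSH_IN:  limit=20/1m,40
max_counted 30 60 5   # CNT_DROP_IN and CNT_DROP_FW: limit=30/1m,60
```

If a snapshot sits at or near these values, the real drop rate was probably higher than what the counter shows.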

State Monitoring with Netwatch (90 seconds)

Netwatch pings the Syslog server (LINUX_IP) every 90 seconds. On each state change it runs a small script that sets the global flag linux_up to true or false.

# Initial value: Start as UP
:global linux_up true

/system script add name=netwatch_up source={
  :global linux_up true
  :log info "NW:UP Linux reachable"
}
/system script add name=netwatch_down source={
  :global linux_up false
  :log warning "NW:DOWN Linux unreachable"
}

# 90s interval monitoring
/tool netwatch add host=LINUX_IP interval=00:01:30 up-script=netwatch_up down-script=netwatch_down

The 5-Minute Counter Processing Script

I set up a scheduler that runs every 5 minutes and evaluates the state of the linux_up variable.

When UP: It formats the counter values into JSON, outputs them to Syslog, and then resets the counters.
When DOWN: Instead of sending them to Syslog, it appends the JSON to a file named counter.log on the local NAND (without resetting the counters).

/system script add name=counter_tick source={
  :global linux_up

  :local now [/system clock get time]
  :local date [/system clock get date]
  :local ssh [/ip firewall filter get [find comment="CNT_SSH_IN"] packets]
  :local in  [/ip firewall filter get [find comment="CNT_DROP_IN"] packets]
  :local fw  [/ip firewall filter get [find comment="CNT_DROP_FW"] packets]
  :local json ("{\"date\":\"$date\",\"time\":\"$now\",\"ssh_fail\":$ssh,\"drop_in\":$in,\"drop_fw\":$fw}")

  :if ($linux_up = true) do={
    # ---- UP: Send to syslog(UDP) -> reset immediately ----
    /log info ("COUNTER " . $json)
    # reset-counters lives under /ip firewall filter; reset only the CNT_* rules
    /ip firewall filter reset-counters [find comment~"^CNT_"]
  } else={
    # ---- DOWN: Append to one file on NAND (no reset) ----
    :if ([:len [/file find name=counter.log]] = 0) do={
      # "/file print file=counter.log" would create counter.log.txt, so create the
      # file directly; this assumes a RouterOS 7 release that provides /file add
      /file add name=counter.log contents=$json
    } else={
      /file set counter.log contents=([/file get counter.log contents] . "\n" . $json)
    }
  }
}

# Run constantly every 5m (behavior branches automatically on UP/DOWN)
/system scheduler add name=counter_tick_5m interval=5m on-event=counter_tick

This pattern keeps state transitions simple. Netwatch only flips true or false, and all side effects remain inside counter_tick. On RouterOS, I wanted to avoid a situation where multiple scripts touch the same counters and files in different places, so centralizing the behavior in one scheduler turned out to be a practical operational choice.

It is also worth noting what this design does not do. It does not try to replay every missed minute of buffered events over UDP after recovery. Instead, it sends one snapshot every five minutes. That matches the real goal here: preserving useful statistics without building a fragile replay pipeline.
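
On the Linux side, each five-minute snapshot arrives as a single syslog line that contains the marker COUNTER followed by the JSON object. The sketch below pulls just the JSON out of a stream of syslog lines so it can be handed to other tools. The function name counter_payload and the sample line layout (timestamp, host, topics) are assumptions for illustration; the actual prefix depends on how the syslog daemon formats incoming messages.

```shell
#!/usr/bin/env bash
# counter_payload: read syslog lines on stdin, print only the JSON that follows "COUNTER ".
# Lines without the marker (for example firewall drop logs) are silently skipped.
counter_payload() {
  sed -n 's/^.*COUNTER \([{].*[}]\)$/\1/p'
}

# Hypothetical sample line; the real prefix depends on the syslog daemon.
sample='Jan  1 10:05:00 192.168.30.1 script,info COUNTER {"date":"2025-01-01","time":"10:05:00","ssh_fail":2,"drop_in":7,"drop_fw":0}'
printf '%s\n' "$sample" | counter_payload
```

Anchoring on the marker rather than on the syslog prefix keeps the extraction independent of the timestamp and hostname format.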

Data Retrieval Upon Recovery (Linux Side)

The following script imports the counter snapshots that accumulated on NAND while the Syslog server was down. It runs on the Linux side: it checks over SSH whether counter.log exists, copies the file with scp, appends it to the recovered log, and then deletes the file on the MikroTik side and resets the counters.

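#!/usr/bin/env bash
ROUTER=192.168.30.1
DEST=/var/log/routeros/counter-recovered.log
mkdir -p "$(dirname "$DEST")"

# Ask RouterOS how many files are named counter.log. The ssh exit status alone
# does not say whether the file exists, so print the count and compare it.
count=$(ssh admin-full@"$ROUTER" ':put [:len [/file find name=counter.log]]' 2>/dev/null | tr -d '\r\n')

if [ "${count:-0}" != "0" ]; then
  scp admin-full@"$ROUTER":counter.log "$DEST.tmp" || exit 0
  # Append the buffered snapshots; the extra echo terminates the last line,
  # because the router joins entries with "\n" and writes no trailing newline
  cat "$DEST.tmp" >> "$DEST" && echo >> "$DEST"
  rm -f "$DEST.tmp"
  # Only after a successful copy: delete the buffer and reset the CNT_* counters
  ssh admin-full@"$ROUTER" '/file remove counter.log; /ip firewall filter reset-counters [find comment~"^CNT_"]'
fi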
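
Because the counters are not reset while the server is DOWN, every line buffered in counter.log is cumulative: each snapshot includes everything counted since the last reset. To see what happened in each five-minute interval, consecutive lines need to be subtracted. The following is a minimal sketch of that step using awk. The function name counter_deltas is hypothetical, and it assumes the router did not reboot or reset counters in the middle of the outage (which would show up as a negative delta).

```shell
#!/usr/bin/env bash
# counter_deltas: read cumulative JSON snapshots on stdin (one per line) and
# print the increase of each counter relative to the previous snapshot.
# The first line is printed as-is, since it counts from the last reset.
counter_deltas() {
  awk '
    # Return the integer that follows "key": in one JSON line (0 if absent).
    function field(line, key,    m) {
      if (match(line, "\"" key "\":[0-9]+")) {
        m = substr(line, RSTART, RLENGTH)
        sub(/^.*:/, "", m)
        return m + 0
      }
      return 0
    }
    NF {
      ssh = field($0, "ssh_fail"); din = field($0, "drop_in"); dfw = field($0, "drop_fw")
      printf "ssh_fail=%d drop_in=%d drop_fw=%d\n", ssh - pssh, din - pdin, dfw - pdfw
      pssh = ssh; pdin = din; pdfw = dfw
    }'
}
```

A typical use would be `counter_deltas < /var/log/routeros/counter-recovered.log`, run once per recovered outage so that deltas are not computed across two unrelated outages.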

Results and Operation

With this configuration, I was able to achieve complete VLAN isolation and enforce secure name resolution (DoH) on RouterOS. Furthermore, on the monitoring front, I was able to compensate for the shortcoming of Syslog – “UDP transmission is simple but delivery is not guaranteed” – by falling back to local storage via Netwatch. Thanks to this mechanism, I can now perform restarts and maintenance on the Syslog server without worrying about losing crucial data.

That tradeoff fits this topology well. The wired Mac and Linux segments are intentionally local-first, and the router’s job is to preserve enough operational evidence when something abnormal happens on the WAN side. The counter_tick_5m plus counter.log approach gave me exactly that level of resilience without overcomplicating the setup.

Future Considerations

In this implementation, counter information can be lost for anywhere from a few seconds up to tens of minutes around a transition between UP and DOWN. This comes from the lag between Netwatch's 90-second interval and the scheduler's 5-minute interval: for example, if the Syslog server fails and counter_tick fires before Netwatch has noticed, that snapshot is sent over UDP to a dead host and the counters are reset anyway. For tracking statistics in a small-scale environment, I consider this an acceptable margin. In the future, I would like to consider direct ingestion into tools like Loki, or more granular metrics visualization in Prometheus/Grafana (e.g., by introducing an exporter).
