Goal of Monitoring

The goal of monitoring is to ensure that the system, i.e. the web server and all software on it, is working properly and within established parameters. If at any time a website or a subsystem on the web server stops functioning, a signal should be sent out to the sysop, who maintains the system.

In addition, it should also be possible to examine trends over time, or historic data, to evaluate whether or not the system’s resources should be expanded (or scaled back) in the future.

You will notice that we are relying on two monitoring systems now: one provided by the data center, and a monitoring system based on Webmin, which is an administrative system for (web) servers. The reason for adding Webmin’s monitoring is that the data center does not allow you to monitor specific websites, but Webmin does.

1. Check Monitoring Settings of the Data Center

The data center may have its own monitoring that comes pre-installed and configured with a new web server (VPS). Just make sure that everything is set up correctly.

Don’t bother configuring Strato’s Monitoring Service: there is only one check available in the free plan, and it entails a ping every 30 minutes.

Use something like lms.example.com/ instead, which is free.

For instance, for hosting provider, do the following. Sign in to hosting provider’s KIS website: https://kis.hosting provider.de/ and click on the appropriate type of server: either Virtual Server 10+ or Virtual Server. In this guide we show the first type.

In the following screen, click on the login button, under the Contract column:

This will open a new browser window (or tab). Here you see the current usage:

The following metrics should not exceed 80%:

  • CPU cores

  • RAM

And Disk space should not exceed 95%.

If the system is not used to send out email, then the SMTP relays metric is typically 0.

Ideally, Uptime monitoring is 100%, but may decrease slightly to 99.91% over time.

Now click on the Monitoring tab, which should take you to the next screen:

Here, make sure all the settings for Manage Email Alerts are switched on.

This monitor will send out an email to the owner of the KIS account with an alert if either CPU, Disk or RAM usage exceeds 80%.

External Monitoring

It is also recommended to add an external monitor. An external monitor is a monitor that resides on another system. For instance, you can use lms.example.com for free to perform a GET request every five minutes to a website on the server you want to monitor. Don’t forget to add your email address so you will receive notifications when the monitor fails.

Using an external monitor ensures you get alerted if the server goes down even if the entire data center goes down with it.

Heartbeat Monitor

We have a custom plugin, tool_heartbeat, which can be used to send out an “I’m alive” signal to lms.example.com (or a comparable service). Use this tool to make sure Moodle’s (or Totara’s) cron is still working.

Here’s how it works:

  • The Moodle or Totara site stops telling Cronitor “I’m alive!” for whatever reason. (The Heartbeat plugin does this, hence the name.)

  • Cronitor notices Totara is no longer alive, waits 5 minutes just in case, and then sends out an alert “Type: Alert” (“Event not received on schedule”).

  • If (when) Totara is reanimated, Cronitor sends out an alert “Type: Recovery”.

So, in the email messages from Cronitor, “Alert” means there’s a problem, and “Recovery” means it’s fixed.
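The decision described above is a "dead man's switch": the monitor alerts when the expected signal stops arriving. The sketch below only illustrates that idea; it is not Cronitor's actual logic. The function name heartbeat_status, its arguments, and the 60-second ping interval (matching a per-minute cron) are assumptions; the 300-second grace period reflects the 5-minute wait mentioned above.

```shell
#!/usr/bin/env bash
# Illustrative dead man's switch check; not the actual Cronitor logic.
# heartbeat_status LAST_PING NOW [INTERVAL] [GRACE]
#   LAST_PING, NOW: Unix timestamps in seconds
#   INTERVAL: expected seconds between pings (assumed: 60)
#   GRACE: extra seconds to wait before alerting (300 = 5 minutes)
heartbeat_status() {
    local last_ping="$1" now="$2" interval="${3:-60}" grace="${4:-300}"
    if [ $((now - last_ping)) -gt $((interval + grace)) ]; then
        echo "Alert"   # event not received on schedule
    else
        echo "OK"      # last ping is recent enough
    fi
}

# Example: the last ping arrived 600 seconds ago, longer than 60 + 300.
heartbeat_status 1000 1600
```

When the pings resume, the same check returns OK again, which corresponds to the "Recovery" message.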

Installation and configuration

  1. Place the contents of this directory inside the /admin/tool/heartbeat folder relative to your Moodle or Totara install path.

  2. Configure the cron job to * * * * * php /path_to_your_moodle/admin/cli/cron.php | php /path_to_your_moodle/admin/tool/heartbeat/cli/cron.php > /dev/null

Plugin settings

  • Cron monitor: Enable the monitor and add the URL of the external cron monitor service.

  • Email settings: Enable the email notifications, add the email subject and body, and select the recipients who should get the email.

2. Make sure Webmin is Installed

Our standard procedure is to install Webmin, an administrative system for web servers. So Webmin should be installed and accessible, typically through the hostname and port 10000, e.g. lms.example.com:10000/.

3. Configure Webmin to Monitor Critical Systems and Websites

Go to Webmin and open the Tools > System and Server Status section:

We need to add five types of monitors:

  • Load average: what is the average usage of the system during the last 15 minutes?

  • Disk space: how much is left on the disk (typically an SSD drive)

  • Apache web server: is the web server up and running?

  • Free memory: how much free memory do we have left?

  • MySQL database server: is the database server up and running?

To add a new monitor in Webmin, use the select box next to the button Add monitor of type and then click the button.

Settings for All New Monitors

For all new monitors, do not forget to add a Description that includes the customer’s name (or main website), and fill out the field “Also send email for this service to” with the address of the person in the sysop role for this server. Set the field “Failures before reporting” to 1. (See the screenshots below for some examples of where to find these fields.)

Load Average Monitor

The average load is the usage of the system (mainly CPU usage) during the past 1, 5 and 15 minutes. To get a good perspective, we set this monitor to 15 minutes, under Load average to check.

The Maximum load average value is critical: it should not exceed 80%. The actual value to fill in is based on the number of CPU cores. This is the computation:

n cores x .8

For instance, 1 core is 0.8, and 4 cores gives you a value of 3.2.

The number of cores can be retrieved from Webmin as well. Simply go to Webmin’s homepage and look for Processor information. There you find the number of cores:

You can also use the command lscpu:

admin@example-host:~$ lscpu

Architecture: x86_64

CPU op-mode(s): 32-bit, 64-bit

Byte Order: Little Endian

Address sizes: 48 bits physical, 48 bits virtual

CPU(s): 4
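As a small sketch of the computation above (number of cores times 0.8), the following shell function prints the value to enter as Maximum load average. The function name max_load is an assumption made for this example; nproc is the standard coreutils command that reports the number of available cores.

```shell
#!/usr/bin/env bash
# max_load CORES: prints CORES x 0.8 with one decimal.
# (Function name is hypothetical; the 0.8 factor is the 80% rule above.)
max_load() {
    awk -v n="$1" 'BEGIN { printf "%.1f\n", n * 0.8 }'
}

max_load 4            # the 4-core example from the text
max_load "$(nproc)"   # the machine this runs on
```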

Disk Space Monitor

This is pretty straightforward: just fill in 5%. This should send out an alert if the disk is over 95% capacity. Filesystem to check is /.

Apache Web Server Monitor

The defaults for this monitor should be fine.

Free Memory Monitor

For this monitor, two values are critical:

  1. Minimum free real memory: we want 20% to be free (or max 80% used)

  2. Minimum free virtual memory: we want this set equal to the amount of physical RAM.

To compute the 20% minimum free RAM, we need to know the total available real memory. You can find this on the “homepage” of Webmin:

Webmin reports the total memory in gibibytes (GiB). But the Free Memory monitor uses megabytes (MB). To convert the free memory from GiB to MB, use the following formula:

MB = 1073.74 x n GiB

For instance, if we have 7.77 GiB that gives us 8342.9598 MB. Of this number, we take 20% to fill in for the minimum free real memory, and 25% of the virtual memory as the “Minimum free virtual memory”.
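The conversion and the 20% rule for real memory can be sketched in shell as follows. The function names gib_to_mb and min_free_real_mb are assumptions for this example, and rounding the result up to a whole megabyte is a choice made here, not something Webmin requires.

```shell
#!/usr/bin/env bash
# gib_to_mb GIB: converts GiB to MB using the factor from the text (1073.74).
gib_to_mb() {
    awk -v g="$1" 'BEGIN { printf "%.2f\n", g * 1073.74 }'
}

# min_free_real_mb GIB: 20% of the total real memory, rounded up to whole MB.
min_free_real_mb() {
    awk -v g="$1" 'BEGIN {
        mb = g * 1073.74 * 0.20
        r = int(mb)
        if (r < mb) r++      # round up so we never go below 20%
        print r
    }'
}

gib_to_mb 7.77          # total memory in MB for the example above
min_free_real_mb 7.77   # value for "Minimum free real memory"
```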

MySQL Database Server Monitor

The defaults for this monitor are fine. Make sure that the “Failures before reporting” field is set to 1 and that the “Also send email for this service to” field is filled in.

4. Add a “Remote HTTP Service” Monitor to Another Webmin

What happens if the entire web server is out or can no longer be reached? In that case, all the monitors we added in the section above will no longer run, or if they are still running, their email alerts may not reach you.

To counter this, we add a “Remote HTTP Service” monitor to a Webmin installation on another web server entirely:

As you can tell from the Status history, this check is performed every 5 minutes.

Set the field “Connection timeout” to 10 seconds. This should also notify you if the loading times for the Moodle website get unacceptable (i.e. more than 10 seconds).
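To see roughly what such a check does, or to try it by hand from another machine, a comparable request can be made with curl. This is only a sketch and not how Webmin implements its monitor; the function name check_http and the UP/DOWN output are assumptions for illustration.

```shell
#!/usr/bin/env bash
# check_http URL [TIMEOUT]: prints UP if the URL answers within TIMEOUT
# seconds (default 10) with an HTTP status below 400, otherwise DOWN.
check_http() {
    local url="$1" timeout="${2:-10}"
    # -f: treat HTTP errors (>= 400) as failure, -sS: quiet but show errors,
    # -o /dev/null: discard the page body, --max-time: overall time limit
    if curl -fsS -o /dev/null --max-time "$timeout" "$url" 2>/dev/null; then
        echo "UP"
    else
        echo "DOWN"
    fi
}

# Placeholder URL; replace it with the website you want to check.
# check_http "https://lms.example.com/"
```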

5. Test the Monitoring

Testing should only be done on a completely new system that is not in use yet. The monitors are typically working – they consist of proven, well tested software. So we will not be testing that the monitoring software works, but mainly that we have configured it correctly.

The most critical monitor is the one for the actual Moodle website. We test this by simply turning off the web server. This can be done in Webmin.

Go to Servers > Apache and click the stop button, but only on a new system that is not in use yet:

If you have configured the Remote HTTP Service monitor correctly, you should receive an email very soon.

Restart the Apache web server by clicking on the play button.

You can also stop and start Apache on the command line:

sudo /etc/init.d/apache2 stop
sudo /etc/init.d/apache2 start

If you do not receive any email, make sure that you have used the correct email address and the correct URL (including the port: nowadays almost always 443).

6. Install a New Munin Node on the Web Server

Munin is a logging tool which consists of a server and a node. The node is installed on the system that you want to monitor. The server is where you log in to view the historical data. We already have the server in place.

If you log in to monitoring.example.internal, you will see an overview of the systems that we are currently monitoring through Munin. Click on a specific system to view the details. Here is an example of the history of the load average:

To install the node on a new web server:

  1. Make sure that the library libparse-http-useragent-perl is installed, e.g.:

      sudo apt-get install libparse-http-useragent-perl
  2. Install munin:

    1. sudo apt-get install munin

    2. sudo apt-get install munin-node

  3. Make sure that Apache’s server-status module (mod_status) is enabled. (You can do this through Webmin.)

  4. Add the IP address of the Munin server (i.e. the “master”) to /etc/munin/munin-node.conf:

    1. allow ^xxx\.xxx\.xxx\.xxx$

  5. Configure the munin plugins.

Configuring The Munin Plugins

The default plugins for the node (so, on your Munin “client” web server) are in /usr/share/munin/plugins/. They appear in your munin website if they’re symlinked in /etc/munin/plugins. For instance:

In /etc/munin/plugins, add symlinks to the apache plugins:

ln -s /usr/share/munin/plugins/apache_accesses .
ln -s /usr/share/munin/plugins/apache_processes .
ln -s /usr/share/munin/plugins/apache_volume .
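If you enable several plugins at once, the same symlinks can be created in a loop. The helper below is a hypothetical sketch (the function name link_munin_plugins and its messages are not part of Munin); it skips plugins that are missing from the source directory or already linked.

```shell
#!/usr/bin/env bash
# link_munin_plugins SRC_DIR DEST_DIR PLUGIN...
# Symlinks each named plugin from SRC_DIR into DEST_DIR.
link_munin_plugins() {
    local src="$1" dest="$2" p
    shift 2
    for p in "$@"; do
        if [ ! -e "$src/$p" ]; then
            echo "missing: $src/$p" >&2
        elif [ -e "$dest/$p" ] || [ -L "$dest/$p" ]; then
            echo "already linked: $p"
        else
            ln -s "$src/$p" "$dest/$p" && echo "linked: $p"
        fi
    done
}

# Usage (as root), with the directories named in the text:
# link_munin_plugins /usr/share/munin/plugins /etc/munin/plugins \
#     apache_accesses apache_processes apache_volume
```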

You must also configure them in the file /etc/munin/plugin-conf.d/munin-node. In that file, if you want to configure multiple plugins at once, use an asterisk notation. E.g.:

[apache*]

This addresses all apache plugins, which are by default:

apache_accesses

apache_processes

apache_volume

Usually when you look at the source code of the plugins (they’re mostly perl scripts), you will find configuration instructions. For instance, the apache plugins need access to Apache’s server status, so you have to configure Apache (i.e. httpd.conf):

<Location /server-status>
SetHandler server-status
# Apache 2.2 syntax; on Apache 2.4 and later, replace the next three lines with: Require local
Order deny,allow
Deny from all
Allow from 127.0.0.1
</Location>
ExtendedStatus on

We should also mention here that some plugins seem to exclude each other. For instance, the apache_average_time_last_n_requests plugin (not installed by default) seems to exclude the other (default) apache plugins.

Finally, restart the node:

/etc/init.d/munin-node restart

And open the firewall for port 4949.

Please note: if any of the Munin plugins fail, you will not see any data from that Munin node on the server (monitoring.example.internal)!

Configure The Munin Server

Finally, you also have to tell the Munin server to start polling the newly added node. Add the IP address of the node server to the file /etc/munin/munin.conf:

[ArbitraryServerName] # Apparently, you can’t use spaces in this name

address xxx.xxx.xxx.xxx

use_node_name yes

The Munin server (the ‘master’) will read the new values within 5 minutes (the default polling interval).

Detailed Monitoring

If you run into any trouble with a VPS, you can add more detailed monitoring.

Performance Monitoring

The following is a monitoring script based on an email exchange with the hosting provider, May 19th 2022 about the website outages on their VS10 Linux VPS (search for 198.51.100.10 #HE-DE:2ad1f7b4109530473 in the email history).

date >> /var/log/custom-monitoring.log; top -n 1 -b >> /var/log/custom-monitoring.log; lsof -ni >> /var/log/custom-monitoring.log

This log will contain detailed performance information which you can use to identify which particular application is causing high load, for instance.

Explanation:

  • date: current date and time

  • top: displays Linux processes

    • -n 1: Specifies the maximum number of iterations, or frames, top should produce before ending.

    • -b: Starts top in Batch mode, which could be useful for sending output from top to other programs or to a file. In this mode, top will not accept input and runs until the iterations limit you’ve set with the -n command-line option or until killed.

  • lsof: lists on its standard output file information about files opened by processes

    • -i: selects the listing of files any of whose Internet address matches the address specified in i. If no address is specified, this option selects the listing of all Internet and x.25 (HP-UX) network files.

    • -n: inhibits the conversion of network numbers to host names for network files, so addresses are shown numerically and lsof runs faster.
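To run the one-liner on a schedule, it can be wrapped in a small function, sketched below. The function name snapshot, the marker line, and the fallback messages are assumptions for this example; the default log path is the one used above.

```shell
#!/usr/bin/env bash
# snapshot [LOGFILE]: appends one performance snapshot to LOGFILE.
# Hypothetical wrapper around the one-liner above.
snapshot() {
    local log="${1:-/var/log/custom-monitoring.log}"
    {
        echo "===== $(date) ====="
        top -n 1 -b 2>&1 || echo "top failed"
        if command -v lsof >/dev/null 2>&1; then
            lsof -ni 2>&1 || true   # lsof exits non-zero when nothing matches
        else
            echo "lsof not installed"
        fi
    } >> "$log"
}

# Example: call snapshot from a script that cron runs as root every few minutes.
# snapshot
```

The marker line makes it easier to find the start of each snapshot when you search the log after an incident.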

Log File Rotation

This type of monitoring generates a lot of data, so put it in log file rotation, see Webmin > System > Log File Rotation (the one for /var/log/letsencrypt/*.log is a good example).

Use the default settings, except for:

  • Rotation schedule: Daily

  • Number of old logs to keep: 31, so you will always have at least a month’s worth of data.

  • Compress old log files?: Yes.

Slow Query Monitoring for MySQL

MySQL has a slow query log which records all queries which took longer than 10 seconds (by default) to execute. For Moodle, 10 seconds is not realistic because many queries take longer than that, so 30 seconds is probably better.

To activate slow query logging:

  1. Log in using the mysql client: sudo mysql -uroot -p

  2. set global slow_query_log = 'ON';

  3. set global slow_query_log_file = '/var/log/mysql/slow-query.log';

  4. set global long_query_time = 30;

  5. Confirm the changes are active by re-entering the MySQL shell (this reloads the system variables) and running the following commands: show variables like '%slow%'; show variables like 'long_query_time';

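Make sure the slow-query.log is in log rotation (see subsection Log File Rotation). Note that values set with set global are lost when the MySQL server restarts; to keep them, also add them to the MySQL configuration file.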

Incident Response

If you receive an alert from either monitoring system, take the following steps:

  1. Verify the alert

  2. If normal usage was impeded, i.e. there was an actual outage, notify the customer, with an estimated time to fix if possible

  3. Fix the issue

  4. Take steps to prevent this from happening again (and document them in a relevant SOP)

  5. If there was an outage, notify the customer that the issue is now fixed and what you have done, or will do in the very short term, to prevent a recurrence of the incident.

Appendix – Health Monitoring on Servers Without Webmin

Purpose

This section describes how basic server health monitoring is implemented on systems where Webmin is not installed or not permitted.

Instead of relying on a web-based administration interface, monitoring is achieved using:

  • a lightweight Bash script

  • systemd timers

  • standard Unix tooling (mail, logrotate)

This approach minimizes attack surface, avoids additional services, and is fully auditable.

Rationale (Why No Webmin)

Webmin provides convenient monitoring and administration features but:

  • introduces an additional web-facing service

  • increases maintenance and patching requirements

  • is not always allowed under security policies

For these reasons, this server uses a script-based monitoring approach that:

  • requires no open ports

  • has no daemon processes

  • depends only on standard OS components

  • provides clear alerting and diagnostics

Monitoring Scope

The health check verifies the following:

  • Disk usage on the root filesystem (/)

  • System load (1-minute average, normalized per CPU core)

  • Available memory (MemAvailable)

  • Required services:

    • apache2

    • postgresql

  • Local HTTP availability via http://127.0.0.1/

On failure:

  • a diagnostics snapshot is appended to a log file

  • an alert email is sent

On success:

  • a single “OK” line is written to the log

  • no email is sent

Installation

Prerequisites

Ensure mail utilities are installed:

apt update

apt install mailutils

Postfix is already present on this system.

Script Installation

Create the monitoring script:

vim /usr/local/sbin/healthcheck.sh

Insert the full script source provided below.

Set permissions:

chmod 0755 /usr/local/sbin/healthcheck.sh

Create the state directory:

mkdir -p /var/lib/healthcheck

systemd Configuration

Create the service unit:

vim /etc/systemd/system/healthcheck.service

[Unit]

Description=Basic server health check

[Service]

Type=oneshot

ExecStart=/usr/local/sbin/healthcheck.sh

Create the timer unit:

vim /etc/systemd/system/healthcheck.timer

[Unit]

Description=Run healthcheck every 5 minutes

[Timer]

OnBootSec=2min

OnUnitActiveSec=5min

AccuracySec=30s

[Install]

WantedBy=timers.target

Enable and start the timer:

systemctl daemon-reload
systemctl enable --now healthcheck.timer

Verify:

systemctl list-timers | grep healthcheck

Validation

To verify alerting end-to-end, force a failure:

DISK_MAX_PCT=1 /usr/local/sbin/healthcheck.sh

Expected result:

  • exit code 1

  • alert email is sent

  • diagnostics appear in /var/log/healthcheck.log

Logging and Log Rotation

Log File

All output is written to:

/var/log/healthcheck.log

This file contains:

  • one-line OK entries for successful runs

  • full diagnostics snapshots for failures

Log Rotation Configuration

Create logrotate configuration:

vim /etc/logrotate.d/healthcheck
/var/log/healthcheck.log {

weekly

rotate 8

dateext

compress

delaycompress

missingok

notifempty

copytruncate

}

Force a test rotation:

logrotate -vf /etc/logrotate.d/healthcheck

Email Alert Handling

Recipients

Alert emails are sent to multiple recipients using standard Postfix delivery.

Recipients are configured in the script via:

ALERT_EMAIL="admin@lms.example.com admin@lms.example.com admin@lms.example.com"

To prevent alert emails from being classified as spam or overlooked:

Create a mail filter or rule in the mail client:

Match subject containing:
[ALERT][Totara][ubuntu]

  • Always deliver to inbox (or mark as important)

  • Optionally apply a label such as “Server Monitoring”

This ensures alerts remain visible while avoiding unnecessary inbox noise.

Script Source Code

/usr/local/sbin/healthcheck.sh

#!/usr/bin/env bash

set -euo pipefail

HOSTNAME_SHORT="$(hostname -s)"
HOSTNAME_FQDN="$(hostname -f 2>/dev/null || hostname)"
NOW="$(date -Is)"

# -----------------------------
# CONFIG (defaults, overridable via environment)
# -----------------------------

: "${ALERT_EMAIL:=admin@lms.example.com admin@lms.example.com admin@lms.example.com}"
: "${MAIL_FROM:=admin@lms.example.com}"
: "${DISK_MAX_PCT:=95}"
: "${LOAD_PER_CORE_MAX:=1.50}"
: "${MEM_AVAIL_MIN_MB:=512}"
: "${HTTP_URL:=http://127.0.0.1/}"
: "${ALERT_COOLDOWN_SECONDS:=1800}"
: "${STATE_DIR:=/var/lib/healthcheck}"

SERVICES=("apache2" "postgresql")

# -----------------------------

# Append a timestamped line to the log file.
log_line() {
  echo "[$NOW] $*" >> /var/log/healthcheck.log
}

# Send an alert email. ALERT_EMAIL is deliberately unquoted so that it
# splits into one argument per recipient.
send_alert() {
  local subject="$1"
  local body="$2"
  printf "%s\n" "$body" | mail -a "From: ${MAIL_FROM}" -s "$subject" ${ALERT_EMAIL} || true
}

# Return 0 (rate limited) if an alert for this key was sent within the
# cooldown period; otherwise record the current time and return 1.
rate_limited() {
  local key="$1"
  local stamp="${STATE_DIR}/${key}.stamp"
  local now
  now="$(date +%s)"

  mkdir -p "$STATE_DIR"
  if [[ -f "$stamp" ]]; then
    local last
    last="$(cat "$stamp" || echo 0)"
    (( now - last < ALERT_COOLDOWN_SECONDS )) && return 0
  fi
  echo "$now" > "$stamp"
  return 1
}

# Log the failure with a diagnostics snapshot, send an alert unless one
# was sent recently for the same key, and exit with code 1.
fail() {
  local key="$1"
  local msg="$2"

  log_line "FAIL ${HOSTNAME_SHORT}: ${msg}"

  {
    echo "----- failure snapshot ($NOW) -----"
    uptime
    echo
    df -h
    echo
    free -m
    echo
    # head may close the pipe early; do not let that abort the script
    top -b -n1 | head -n 60 || true
    echo
    ss -tulpn
    echo
    systemctl --failed
    echo "----------------------------------"
  } >> /var/log/healthcheck.log

  rate_limited "$key" && exit 1

  send_alert "[ALERT][Totara][${HOSTNAME_SHORT}] healthcheck failed: ${key}" \
"Time: $NOW
Host: ${HOSTNAME_FQDN}
Reason:
${msg}

See /var/log/healthcheck.log for diagnostics."

  exit 1
}

touch /var/log/healthcheck.log

# Disk usage of / as an integer percentage.
disk_pct="$(df -P / | awk 'NR==2{gsub("%","",$5); print $5}')"
[[ "$disk_pct" -lt "$DISK_MAX_PCT" ]] || fail disk "Disk usage ${disk_pct}%"

# 1-minute load average, normalized per CPU core.
cores="$(nproc)"
load_1m="$(awk '{print $1}' /proc/loadavg)"
awk -v l="$load_1m" -v c="$cores" -v t="$LOAD_PER_CORE_MAX" 'BEGIN{ exit !((l/c)<=t) }' \
  || fail load "Load ${load_1m} on ${cores} cores"

# Available memory in MB.
mem_avail_mb="$(awk '/MemAvailable/ {print int($2/1024)}' /proc/meminfo)"
[[ "$mem_avail_mb" -ge "$MEM_AVAIL_MIN_MB" ]] \
  || fail memory "MemAvailable ${mem_avail_mb}MB"

# Required services.
for svc in "${SERVICES[@]}"; do
  systemctl is-active --quiet "$svc" \
    || fail "service-${svc}" "Service not active: ${svc}"
done

# Local HTTP availability.
curl -fsS --max-time 10 "$HTTP_URL" >/dev/null \
  || fail http "Local HTTP check failed"

log_line "OK ${HOSTNAME_SHORT}"

exit 0

Solin specializes in Moodle hosting, monitoring, and incident response. Need help? Contact us.
