Server Monitoring Basics Every Sysadmin Should Know
Thinking about becoming a system administrator or purchasing a dedicated server for your application? Either way, you should be familiar with managing your server, including server monitoring, which we take a closer look at in this article.
To give you a small hint - no, a simple ping test really won’t do. :)
The key to successful server monitoring is understanding how a properly functioning server behaves. Every sysadmin must know the normal values of each server component when it runs in production without malfunctions. Monitoring can only be done right if you know your server, its function and its standard behavior well.
A malfunctioning system can cause severe downtime, and as a result it may cost you and your business more than you would expect. Downtime is bound to have a considerable impact on your business, with consequences such as:
- Losing numerous sales opportunities (especially painful for e-shops).
- Losing productivity (for example when an ERP system critical for daily operations is down).
- Damaging your company’s reputation (harmful and irreversible).
As the server usually runs the most important element of your business - be it a web application or a mail server - you need to make sure your server monitoring implementation is done right, including monitoring relevant metrics, setting a reliable alerting system, and analysing root problems and their causes.
The main goals of a system administrator taking care of any sort of system in production, regardless of its function, should be:
- Identifying all useful KPIs.
- Choosing a suitable monitoring tool or creating your own monitoring / early warning system.
- Escalating all incidents according to their priorities.
- Avoiding false alarms and false positives.
- Analysing incidents.
- Incorporating best practices.
Targeted Server Monitoring Metrics and KPIs
Generally speaking, there are four categories to take into consideration when monitoring your server’s health and performance. When tracking the condition of a server, keep the metrics that matter most in focus and, as already mentioned, know what normal server activity looks like. Regardless of whether your server runs a Windows or a Unix-based operating system, you should look after the CPU, memory, disk and network.
Although there are some common metrics to keep an eye on, the key performance indicators which are to be tracked fully depend on the server purpose and the application running on your server.
Keep in mind that you should define which server activity does not require immediate action and, conversely, which incidents need to be escalated. Put every suspected incident in context with your application.
For an Apache server, check whether there is still enough capacity available for processing new requests and whether 4xx and 5xx errors occur (see https://httpstatusdogs.com/). For a database server, you might want to avoid long-running processes, which could lead to a lot of locked queries - for example when a group of queries waits for another query to finish and release its locks on the tables in use.
All in all, there is no universal pool of KPIs with preselected optimal values and thresholds from which you could pick your set of metrics. Although we can’t provide a custom-made set of metrics for each of you, we have selected a few that we consider most essential.
Average Time/Operation (read, write) - In general, the shorter the time needed to read and write data, the better. Values under 10 ms are satisfactory; anything under 40 ms is acceptable.
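To make the 4xx/5xx check mentioned above more concrete, here is a minimal Python sketch that tallies response status classes from access log lines. It assumes the log uses the Common or Combined Log Format, where the status code directly follows the quoted request line; the function names are made up for illustration, and a real setup would more likely rely on a monitoring tool's own log parser.

```python
import re
from collections import Counter

# Matches the quoted request line followed by the three-digit status code,
# as found in the Common and Combined Log Formats (an assumption here).
STATUS_RE = re.compile(r'"[^"]*" (\d{3}) ')

def count_status_classes(lines):
    """Return a Counter such as {'2xx': 10, '5xx': 1} for the given log lines."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:  # skip lines that do not look like access log entries
            counts[match.group(1)[0] + "xx"] += 1
    return counts

def error_ratio(counts):
    """Share of requests that ended in a 4xx or 5xx status (0.0 if no requests)."""
    total = sum(counts.values())
    errors = counts["4xx"] + counts["5xx"]
    return errors / total if total else 0.0
```

A rising error ratio is only a hint: a burst of 404s caused by a crawler matters far less than a handful of 500s on a checkout page, so the result still has to be read in the context of your application.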
Disk Usage - Shows the percentage of disk used. Naturally, you should avoid using your disk up to 100%.
Disk Operations - As every disk can process only a limited number of I/O operations per second (IOPS), it’s highly recommended to track the current IOPS.
Load Average - Tells you how many processes are, on average, running or waiting for CPU resources over a given period of time. The ideal value equals the number of processor cores or is slightly below it. The fewer processes in the queue, the better. On a quad-core CPU, a load of 4 means 100% utilization.
CPU Usage - Measures the percentage of all CPU resources being used on requests for running scripts, sending emails, etc. A problem may occur when all CPU cores are overloaded over a long period.
Memory Utilization - Set a threshold and track the memory being actively used. Running out of memory is especially dangerous for database servers because you risk data loss. By contrast, when you only serve static files from disk, you probably don't have to worry about running out of memory.
Average Response Time (ART) - The mean duration of a round trip, i.e. the request/response cycle a packet makes. The average response time of your server is determined by many factors, including traffic and utilization or the routing policy of your server hosting provider, to mention some of the network-related factors. A well-programmed application can still have an unsatisfactory response time when the requested data travels around the world to reach its destination. If this is the case, and your target audience is international, you might want to consider using a CDN for your content delivery.
The distance between source and destination should be factored in when evaluating the response time.
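A few of the metrics listed above can be read with nothing more than Python's standard library. The sketch below assumes a Unix-like host (os.getloadavg is not available on Windows). The helper names are invented for illustration, and the 10 ms / 40 ms bands are the ones quoted in this article rather than universal limits.

```python
import os
import shutil

def load_per_core(load_average, cores):
    """Normalise a load average by core count; 1.0 means roughly full utilisation."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_average / cores

def disk_used_percent(total_bytes, used_bytes):
    """Percentage of the disk that is in use."""
    return 100.0 * used_bytes / total_bytes

def rate_io_latency(avg_ms):
    """Classify average read/write time using the bands quoted in this article."""
    if avg_ms < 10:
        return "satisfactory"
    if avg_ms < 40:
        return "acceptable"
    return "too slow"

def snapshot(path="/"):
    """Collect current values from the local machine (Unix-like hosts only)."""
    one_minute, _, _ = os.getloadavg()
    usage = shutil.disk_usage(path)
    return {
        "load_per_core": load_per_core(one_minute, os.cpu_count() or 1),
        "disk_used_percent": disk_used_percent(usage.total, usage.used),
    }
```

On a quad-core machine, load_per_core(4.0, 4) gives 1.0, which matches the rule of thumb that a load of 4 means 100% utilization. Dividing by the core count makes one threshold usable across servers of different sizes.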
Server Monitoring Alerts
To keep a system, and your business, up and running, every sysadmin should be alerted of system issues as soon as they occur. A successful server monitoring implementation leads to a system that is functionally stable and efficient.
Setting up a system with too few alerts can cause trouble, and so can a system with too many alerts.
Prioritising is the key. Separate issues that need to be dealt with immediately after they occur from those that can wait a bit. We recommend keeping it simple and classifying each alert as low, medium or high priority. Obviously, the urgency and impact of each incident depend on the nature of your business.
A detection method defines the event that triggers an alert. It might be a threshold - when a certain value has been exceeded (e.g. CPU usage > 90% for 15 mins) - or an anomalous rate of change (e.g. throughput drops from 4 Gbps to 2 Gbps).
Setting the right sensors and triggering events is crucial in situations when you feel overwhelmed with incidents; otherwise you could easily overlook or underestimate an important alarm. Ignoring alerts may cause server downtime, which can cost your company lost revenue, penalties for not meeting SLAs and hours of work time. Downtime is especially harmful for businesses generating their revenue online.
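Both detection methods can be sketched in a few lines of Python. The class and function names, the one-sample-per-minute interval and the 50% drop limit are hypothetical choices for illustration. Only the 90% for 15 minutes and the 4 Gbps to 2 Gbps figures come from the examples above.

```python
from collections import deque

class SustainedThreshold:
    """Fires once a metric has stayed above `limit` for `window` consecutive samples."""

    def __init__(self, limit, window):
        self.limit = limit
        self.recent = deque(maxlen=window)  # keeps only the newest `window` samples

    def observe(self, value):
        """Record one sample and return True if the alert condition holds."""
        self.recent.append(value)
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and all(v > self.limit for v in self.recent)

def dropped_sharply(previous, current, max_drop=0.5):
    """Return True if `current` fell by at least `max_drop` (a fraction) from `previous`."""
    if previous <= 0:
        return False  # no meaningful baseline to compare against
    return (previous - current) / previous >= max_drop
```

Assuming one CPU sample per minute, SustainedThreshold(limit=90, window=15) stays silent during short spikes and fires only once usage has been above 90% for 15 samples in a row. Requiring a sustained breach is one simple way to cut down on false alarms.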
Server Monitoring Software
Prior to choosing any server monitoring tool, make sure you know exactly what server metrics you need to monitor (don’t forget to check your server for unexpected processes though :) ) and how you are going to monitor them.
Monitoring tools can be free or commercial, and open-source, proprietary or SaaS-based. The choice of a server monitoring tool should also depend on how many servers you are going to manage.
SaaS tools are best suited for hosting or dedicated server providers and IaaS businesses offering monitoring tools to their clients. An open-source monitoring tool can be a highly flexible option for sysadmins who need a customised solution and don’t put a priority on support. Proprietary tools, unlike open-source tools, lack an active community.
Some of the most popular server monitoring tools are Acronis, LogicMonitor, New Relic, Collectd, Anturis, Datadog, Nagios, Icinga, Sensu, Zabbix, Cacti, Paessler, SolarWinds, ManageEngine.
Don’t get us wrong, this is not a comprehensive list of dos and don’ts for monitoring your server, but a set of recommendations backed by our experience.
What our system administrators emphasize the most is that the notification system must be independent of the server in production; otherwise there is no way to be notified about downtime if both systems run on the same infrastructure. Make sure your server monitoring runs on separate hardware to bypass the potentially affected system.
Every once in a while, check that the alerting system and its sensors work properly. Downtime is unavoidable. Your monitoring needs to be put under scrutiny, especially after a migration or maintenance.
Use alerts for monitoring, graphs for analysing issues and overall trends, and log files whenever you need to dig deeper into the raw data to find the root of a problem.
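The check itself can be very small, as long as it runs from a machine outside the monitored infrastructure. Below is a minimal Python sketch of such an external probe, using a plain TCP connection attempt. The function name and the default timeout are assumptions for illustration, and a real setup would add retries and a notification channel that is also independent of the production server.

```python
import socket

def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts and DNS failures
        return False
```

Scheduled on the separate monitoring host, a call such as port_is_reachable("www.example.com", 443) (a placeholder hostname) tells you whether the service accepts connections at all. It does not prove that the application behind the port answers correctly, so it complements the metrics above rather than replacing them.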