En:Nagios

This page is a work in progress.

This article aims to be a concise, to-the-point introduction to the guts of Nagios. It assumes that the reader knows what Nagios is and what it does, and has at least a general idea of how it works. Basic installation and setup are covered, not with the goal of attaining a specific working configuration, but to help you understand what can be tweaked where, and to what effect.

The text below applies to the version of nagios3 in Debian unstable ("sid") as of March 2010. The stable and testing distributions may behave slightly differently.

1 Important concepts

Before we continue, some concepts need to be introduced. All of them provide some kind of indirection, mostly aimed at saving typing while writing the configuration (which still takes a long time for a system of any complexity).

1.1 Macro

A macro is something most people would probably call a variable. Nagios macros have upper-case names, enclosed in dollar signs; whenever they are referenced, they are replaced with the value associated with that particular macro in that particular context.

1.2 Command

A command is some external binary Nagios can run. Its definition includes its name, the full path of the binary and, optionally, command line arguments to pass to the binary. These arguments can reference macros (such as $HOSTADDRESS$) that are derived from the context the command is used in, as well as macros of the form $ARG1$. The values for these $ARGx$ macros are passed in when referencing the command, separated by exclamation marks, like this:

check_command                   check_all_disks!20%!10%

The check_all_disks command is defined as follows:

define command{
	command_name	check_all_disks
	command_line	/usr/lib/nagios/plugins/check_disk -w '$ARG1$' -c '$ARG2$' -e
	}

When invoking this check as shown above, $ARG1$ will have a value of 20% while $ARG2$ will expand to 10%. In case you were wondering, these specify the "warning" and "critical" thresholds for the plugin (the -e switch causes it to only report filesystems that are too full). The idea here is that you could easily modify the check_all_disks command definition to call a different binary as the binary itself is only referenced in this one place. This is the advantage of the indirection. The disadvantage is that its use is mandatory: if you need a one-shot command for a specific service, you can't just define it along with the service. You must define the command and then reference it in the service definition.
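
For illustration, a service definition that references this command might look like the sketch below. It is modelled on the kind of definition found in the default localhost configuration; the service_description value is an assumption and may differ on your system.

define service{
	use			generic-service		; inherit defaults from the template
	host_name		localhost
	service_description	Disk Space
	check_command		check_all_disks!20%!10%	; $ARG1$ becomes 20%, $ARG2$ becomes 10%
	}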

1.3 Host

A host is an object that has one or more services associated with it. These services are what Nagios monitors (often by attempting to use them). A host is basically a group of services reachable via the IP address of the host. Hosts also appear in various parts of the web interface as clickable objects.
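
As a rough sketch, a host definition could look like this. The host name, alias and address are made up for illustration; generic-host is the template shipped in the default configuration.

define host{
	use		generic-host		; inherit defaults from the template
	host_name	www1			; hypothetical name, used to refer to the host elsewhere
	alias		First web server
	address		192.0.2.10		; example address; available to commands as $HOSTADDRESS$
	}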

1.4 Service

A service is something for which we can define a command that checks its status (which, for the sake of simplicity, can be "OK", "WARNING" or "CRITICAL"). For the purposes of Nagios, the availability of disk space is also a "service" (as shown above). To further drive home the point that Nagios "services" aren't necessarily network services, consider a command that checks the value of some stock market commodity. It could report "OK" if the commodity gained value compared to 24 hours ago, "WARNING" if its value stayed constant or decreased by at most $ARG1$ percent, and "CRITICAL" if its price dropped even further. You could then define a "service" for each commodity you want to track, each with its own thresholds.

I don't mean to suggest that this is a sensible use of Nagios, but it's definitely within the realm of possibility.
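
To make the thought experiment concrete, here is a sketch. The check_commodity plugin, its path and options, the host name "market" and the service descriptions are all hypothetical; no such plugin ships with Nagios, and the host would have to be defined separately.

# Hypothetical command: the plugin and its --max-drop/--symbol options are invented.
define command{
	command_name	check_commodity
	command_line	/usr/local/lib/nagios/check_commodity --max-drop '$ARG1$' --symbol '$ARG2$'
	}

# One "service" per commodity, each with its own threshold in $ARG1$.
define service{
	use			generic-service
	host_name		market
	service_description	Gold price
	check_command		check_commodity!2!gold
	}

define service{
	use			generic-service
	host_name		market
	service_description	Copper price
	check_command		check_commodity!5!copper
	}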

1.5 Templates

You can use host and service templates as the basis of many similar hosts or services. This makes sense because there are many individual settings to configure for each host/service, and some of those have pretty long names; a template-based configuration is thus a lot shorter and more readable (provided you can remember what the default setting in each template was).

You can think of templates in terms of objects as they are used in OOP: a template is a parent object the child object inherits properties from. You can even use multiple inheritance, and a "real" host/service (that isn't just a template) can also be specified as an inheritance source, so pretty complex/weird setups are possible. The details are explained in the official documentation.
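
A minimal sketch of template inheritance follows. The template name linux-server, the host db1 and the individual values are assumptions chosen for illustration. The name directive gives a definition a name others can inherit from, use names the parent, and register 0 marks a definition as a template only.

define host{
	name			linux-server	; template name referenced by "use" below
	use			generic-host	; templates can themselves inherit from templates
	max_check_attempts	5
	register		0		; template only, not a real host
	}

define host{
	use			linux-server	; inherits everything set above
	host_name		db1
	alias			Database server
	address			192.0.2.20
	max_check_attempts	3		; overrides the inherited value
	}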

1.6 Host groups

A host group is, as the name suggests, a group of hosts that share some aspect of their configuration. For example, the default configuration shipped with the Nagios3 Debian package includes the following hostgroups:

debian-servers (they will all be represented by a Debian Swirl logo in the Web UI);
http-servers (members of this group have a webserver running on them, which will be monitored by Nagios);
ssh-servers (self-explanatory);
ping-servers (members of this group have a "ping service" associated with them, which means Nagios will periodically check whether they are reachable. By default, the default gateway is a member of this hostgroup).
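
The sketch below shows how such a service-based hostgroup fits together: the hostgroup lists its members, and a service attached through hostgroup_name applies to every member. It is modelled on the default configuration; the alias, the service_description and the extra member www1 are assumptions.

define hostgroup{
	hostgroup_name	http-servers
	alias		HTTP servers
	members		localhost,www1		; www1 is a hypothetical additional host
	}

define service{
	use			generic-service
	hostgroup_name		http-servers	; applies to all members of the group
	service_description	HTTP
	check_command		check_http
	}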

1.7 Contacts and contact groups

A contactgroup is a collection of contacts that are notified when monitored services change state. Contacts have time periods (defined separately and referenced by name) associated with them; they only get notifications during their respective time periods. The idea is to have, e.g., a contact group consisting of, say, webserver operators who work in shifts. Whenever there is trouble with a webserver, Nagios would check which member(s) of the contact group to notify based on what day and time it is.
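
The shift scenario could be sketched as follows. The contacts alice and bob, their addresses and the group name are hypothetical; the workhours and nonworkhours time periods and the two notify-*-by-email commands are the ones from the default configuration.

define contact{
	contact_name			alice
	alias				Alice (day shift)
	host_notification_period	workhours	; only notified during this period
	service_notification_period	workhours
	host_notification_options	d,r		; down, recovery
	service_notification_options	w,u,c,r		; warning, unknown, critical, recovery
	host_notification_commands	notify-host-by-email
	service_notification_commands	notify-service-by-email
	email				alice@example.org
	}

define contact{
	contact_name			bob
	alias				Bob (night shift)
	host_notification_period	nonworkhours
	service_notification_period	nonworkhours
	host_notification_options	d,r
	service_notification_options	w,u,c,r
	host_notification_commands	notify-host-by-email
	service_notification_commands	notify-service-by-email
	email				bob@example.org
	}

define contactgroup{
	contactgroup_name	webserver-operators
	alias			Webserver operators
	members			alice,bob
	}

A host or service that names webserver-operators in its contact_groups directive would then notify whichever member's time period covers the current time.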

2 Installing Nagios

First, install the nagios3 package and some monitoring plug-ins:

apt-get install nagios3 nagios-plugins-basic nagios-plugins-standard

This will install the binaries and a very basic configuration that monitors some aspects of "localhost".

3 Configuration overview

This section will read like a telephone directory; feel free to just skim through it. The purpose is to give you a general idea of what can be configured where, as the configuration can seem a little chaotic at first.

Let's take a look at the configuration installed in /etc/nagios3.

commands.cfg: command definitions (unlikely to need modification)
- notify-host-by-email
- notify-service-by-email
- process-host-perfdata
- process-service-perfdata

Any files in the conf.d/ subdirectory whose name ends in ".cfg" will be read and processed by Nagios.

conf.d/contacts_nagios2.cfg: default contacts
- contact "root" (email root@localhost)
- contactgroup "admins" (only member: root)

conf.d/extinfo_nagios2.cfg:
- hostextinfo hostgroup debian-servers (adds fancy icons and such)

conf.d/generic-host_nagios2.cfg:
- generic-host template (enables flap detection, notification etc.)

conf.d/generic-service_nagios2.cfg:
- generic-service template (sets defaults, as above)

conf.d/host-gateway_nagios3.cfg:
- defines the 'gateway' host as a generic-host; its IP probably needs to be set manually.

conf.d/hostgroups_nagios2.cfg:
- hostgroup all (members *)
- hostgroup debian-servers (members localhost)
- hostgroup http-servers (members localhost)
- hostgroup ssh-servers (members localhost)
- hostgroup ping-servers (members gateway)
  - for hosts that don't even have snmp; nagios needs a "service" it can monitor, so for these hosts, we define "ping" as a service.

conf.d/localhost_nagios2.cfg:
- defines the 'localhost' host as a generic-host and some "services" on it:
  - diskspace (check_all_disks);
  - logged in users (check_users);
  - total processes (check_procs);
  - load average (check_load).

conf.d/services_nagios2.cfg: defines the services associated with service-based hostgroups
- check_http for http-servers;
- check_ssh for ssh-servers;
- check_ping for ping-servers.

conf.d/timeperiods_nagios2.cfg: defines various time periods (which can be used to decide which contact to notify):
- 24x7;
- workhours (Monday-Friday, 9:00-17:00);
- nonworkhours (complements workhours);
- never.
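
For reference, a time period such as workhours can be sketched like this (the alias text is an assumption). Days that are not listed contain no valid time at all.

define timeperiod{
	timeperiod_name	workhours
	alias		Standard Work Hours
	monday		09:00-17:00
	tuesday		09:00-17:00
	wednesday	09:00-17:00
	thursday	09:00-17:00
	friday		09:00-17:00
	}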

nagios.cfg: main config; lists other files and directories to include and contains some global directives.
- log_file=/var/log/nagios3/nagios.log
- cfg_file=/etc/nagios3/commands.cfg (command definitions, see above)
- cfg_dir=/etc/nagios-plugins/config (shipped by the nagios-plugins package)
- cfg_dir=/etc/nagios3/conf.d (this is where we're supposed to create our own config files; they must all have a .cfg extension)
- object_cache_file=/var/cache/nagios3/objects.cache (generated based on the startup config; used by the CGIs)
- precached_object_file=/var/lib/nagios3/objects.precache (useful for complex configurations; can speed up restarting nagios)
- resource_file=/etc/nagios3/resource.cfg (resource files can contain macro definitions and are not read by CGIs; so resource files are the place to record passwords and suchlike)
- status_file=/var/cache/nagios3/status.dat (stores status of monitored services and hosts; used by the CGIs)
- status_update_interval=10 (how often to update status.dat, in seconds)
- nagios_user=nagios (what user to run as)
- nagios_group=nagios (what group to run as)
- check_external_commands=0 (whether to enable "external commands" which can be issued from the web interface; disabled by default)
- command_check_interval=-1 (how often to check for "external commands"; -1 means "as often as possible")
- command_file=/var/lib/nagios3/rw/nagios.cmd (the "external command file"; permissions are crucial, so see the documentation)
- external_command_buffer_slots=4096 (a performance tuning setting; leave it alone)
- some other not too important settings, like the location of the pidfile, a temporary file, a temporary directory, the log rotation frequency, the directory where old logs are placed etc.
- event_broker_options= (see documentation)
- broker_module= (you can load event broker modules that process events; more on this later, hopefully)
- use_syslog={0|1} (whether to log messages to syslog in addition to the nagios logfile)
- log_notifications={0|1} (whether to log notifications at all)
- log_service_retries, log_host_retries, log_event_handlers, log_initial_states, log_external_commands, log_passive_checks (whether to log the respective events at all)
- global_host_event_handler, global_service_event_handler (you can have some nagios commands executed for every host or service state change)
- service_inter_check_delay_method={n,d,s,x.xx} (how to space out the initial service checks: n is no delay, d is a "dumb" fixed delay of one second, s is "smart", x.xx is a fixed delay in seconds; the default of "smart" is probably the best choice as it tries to spread out service checks to avoid load peaks)
- max_service_check_spread=30 (how many minutes may elapse from program start until all initial service checks should complete)
- max_host_check_spread=30 (as above, only for hosts instead of services)
- service_interleave_factor=s (configures how service checks are interleaved across hosts so that checks of the same host are not all run back to back; s lets Nagios calculate the factor itself; leave it alone)
- host_inter_check_delay_method=s (the host counterpart of service_inter_check_delay_method)
- max_concurrent_checks=0 (how many service checks may run in parallel. 0 means no limit and is probably a good choice in most situations)
- check_result_reaper_frequency=10 (how often, in seconds, to process the results of checks; leave it alone)
- max_check_result_reaper_time=30 (a performance tuning setting; leave it alone)
- check_result_path=/var/lib/nagios3/spool/checkresults (a spool directory of incoming unprocessed check results; leave it alone)
- max_check_result_file_age=3600 (how old an unprocessed check result file can be to still be considered valid and processed)
- cached_host_check_horizon=15 (a performance tuning setting; leave it alone)
- cached_service_check_horizon=15 (a performance tuning setting; leave it alone)
- enable_predictive_host_dependency_checks=1 (there should be no need to disable this)
- enable_predictive_service_dependency_checks=1 (there should be no need to disable this)
- soft_state_dependencies=0 (whether to consider "soft" states in dependency calculation; enabling may decrease accuracy but cut down on notification floods)
- auto_reschedule_checks=0 (a check scheduling option that may improve or degrade performance)
- auto_rescheduling_interval=30 (a fine tuning option related to auto_reschedule_checks)
- auto_rescheduling_window=180 (a fine tuning option related to auto_reschedule_checks)
- service_check_timeout, host_check_timeout, event_handler_timeout, notification_timeout, ocsp_timeout, perfdata_timeout (various command timeouts; if a subprocess doesn't finish in time, it's killed)
- retain_state_information=1 (whether to save host and service state information on shutdown; it probably makes little sense to disable it. state_retention_file configures where the data is saved.)
- retention_update_interval=60 (how often, in seconds, to write state retention information to disk. If 0, only update it on shutdown.)
- use_retained_program_state=1 (whether to load program status variables, including many configuration options, from the retention file; having it enabled may make nagios ignore some configuration changes, so beware)
- use_retained_scheduling_info=1 (the same for saved scheduling decisions)
- retained_host_attribute_mask, retained_service_attribute_mask, retained_process_host_attribute_mask, retained_process_service_attribute_mask, retained_contact_host_attribute_mask, retained_contact_service_attribute_mask (state retention fine-tuning)
- check_for_updates=1 (whether to periodically check for new versions; bare_update_check sets whether to also send what version you're currently running)
- use_aggressive_host_checking=0 (when set to 0, the default, Nagios tries to be smart about how it checks hosts, which is faster; enabling it makes host checks take longer but may improve reliability a little)
- execute_service_checks=1 (whether to perform active service checks; if disabled, Nagios still processes check results that are dropped in its spool from somewhere else)
- accept_passive_service_checks=1 (complements the above)
- execute_host_checks, accept_passive_host_checks (as above, only for hosts instead of services)
- enable_notifications=1 (whether to send notifications at all)
- enable_event_handlers=1 (self-explanatory)
- process_performance_data=0 (whether to run host_perfdata_command and service_perfdata_command. These allow munin-like monitoring of numeric metrics in addition to up-warning-down type states.)
- host_perfdata_file, service_perfdata_file, host_perfdata_file_template, service_perfdata_file_template (where to store performance data and how to name the files themselves)
- host_perfdata_file_mode={a|w|p}, service_perfdata_file_mode={a|w|p} (whether to open perfdata files in append or write mode; p is for named pipes)
- host_perfdata_file_processing_interval, service_perfdata_file_processing_interval, host_perfdata_file_processing_command, service_perfdata_file_processing_command (performance data can be periodically processed. These directives tell Nagios how often to process the data and what commands to run on it.)
- obsess_over_services, ocsp_command, obsess_over_hosts, ochp_command, translate_passive_host_checks, passive_host_checks_are_soft, check_service_freshness, service_freshness_check_interval, check_host_freshness, host_freshness_check_interval, additional_freshness_latency (used for distributed monitoring)
- check_for_orphaned_services=1, check_for_orphaned_hosts=1 (leave enabled)
- enable_flap_detection=1 (whether to detect rapid up/down state changes of a host or service and temporarily suppress notifications when they occur)
- low_service_flap_threshold=5.0, high_service_flap_threshold=20.0, low_host_flap_threshold=5.0, high_host_flap_threshold=20.0 (flap detection fine-tuning)
- date_format=iso8601 (the other formats are... not useful, so leave this alone)
- use_timezone (override system timezone)
- p1_file, enable_embedded_perl, use_embedded_perl_implicitly (options related to embedded Perl interpreter; normally, you can leave these alone)
- illegal_object_name_chars, illegal_macro_output_chars (leave them alone)
- use_regexp_matching=0 (if enabled, regular expression matching is used to match host, hostgroup, service, and service group names/descriptions in some fields of various object types)
- use_true_regexp_matching=0 (only relevant if use_regexp_matching is enabled: if disabled, regex matching is only used for strings that contain "*" or "?"; if enabled, regex matching is always used)
- admin_email=root@localhost, admin_pager=pageroot@localhost (these are made available to notification commands as $ADMINEMAIL$ and $ADMINPAGER$)
- daemon_dumps_core=0 (whether to produce coredumps on crashes; may be useful for debugging)
- use_large_installation_tweaks=0, enable_environment_macros=1, free_child_process_memory, child_processes_fork_twice (performance fine-tuning, mainly for large installations)
- debug_level=0, debug_verbosity=1, debug_file, max_debug_file_size (see configfile comments for details)
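
As a small illustration of the directive=value syntax used in nagios.cfg, the excerpt below sketches what enabling external commands could look like. Only the first value differs from the defaults listed above; getting the permissions on the command file right is a separate step that this sketch does not cover.

# /etc/nagios3/nagios.cfg (excerpt, sketch only)
# Allow commands to be submitted from the web interface:
check_external_commands=1
# Check the command file as often as possible:
command_check_interval=-1
# The external command file; see the documentation about its permissions:
command_file=/var/lib/nagios3/rw/nagios.cmd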

resource.cfg: used to define variables (which Nagios calls "macros").
- These can be referenced in command definitions.
- Only 32 are supported and they all must have names of the form $USERx$.
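
A sketch of how this might be used follows. $USER1$ conventionally holds the plugin directory; the $USER3$ password, the command name check_mysql_secret and the database user nagios are assumptions made up for this example.

# /etc/nagios3/resource.cfg (sketch)
# Path to the plugins:
$USER1$=/usr/lib/nagios/plugins
# Hypothetical database password, kept out of files the CGIs read:
$USER3$=s3cret

# In a file under conf.d/ (hypothetical command name):
define command{
	command_name	check_mysql_secret
	command_line	$USER1$/check_mysql -H '$HOSTADDRESS$' -u nagios -p '$USER3$'
	}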

Plugin configuration files reside in /etc/nagios-plugins/config. The following are shipped by default (by the nagios-plugins-basic package):

apt.cfg defines two commands:
- check_apt (checks how many packages could be upgraded; apparently warns if there are any, and reports critical status if any possible upgrades are "critical")
- check_apt_distupgrade (same as above, but for APT's dist-upgrade operation)

dhcp.cfg defines two commands (both of which need root privileges):
- check_dhcp
- check_dhcp_interface

disk.cfg defines the following commands:
- check_disk
- check_all_disks
- ssh_disk
- ssh_disk_4 (to test IPv4 connectivity on IPv6 enabled systems)

dummy.cfg contains some commands that are only useful for testing; they always return a fixed status.
- return-ok
- return-warning
- return-critical
- return-unknown
- return-numeric

ftp.cfg defines the following commands:
- check_ftp
- check_ftp_4 (to test IPv4 connectivity on IPv6 enabled systems)

http.cfg defines many commands:
- check_http (will try to fetch http://ip.of.host/)
- check_httpname (will try to fetch http://name.of.virtual.host/ from ip.of.host)
- check_http2 (permits manual tuning of critical and warning thresholds)
- check_squid
- check_https
- check_https_hostname
- check_https_auth
- check_https_auth_hostname
- check_cups (will try an HTTP request to port 631)
- All of the above also exist with a "_4" suffix which forces the plugin to use IPv4.

load.cfg defines:
- check_load

mail.cfg defines:
- check_pop
- check_smtp
- check_ssmtp
- check_imap
- check_spop (this should actually be called check_pop3s)
- check_simap (this should actually be called check_imaps)
- check_mailq_sendmail
- check_mailq_postfix
- check_mailq_exim
- check_mailq_qmail
- As usual, these also come with IPv4-only variants.

nntp.cfg defines:
- check_nntp
- check_nntp_4

ntp.cfg defines:
- check_ntp
- check_ntp_ntpq
- check_time

ping.cfg defines:
- check_ping
- The following are actually defined identically. The aliases help keep the distinction between hosts, printers, switches and routers; also, they allow you to modify the ping command used to test the reachability of one kind of device without affecting the others.
  - check-host-alive
  - check-printer-alive
  - check-switch-alive
  - check-router-alive
- Again, IPv4-only variants are provided.

procs.cfg defines:
- check_procs
- check_procs_zombie
- check_procs_httpd
  - This is more an example than something actually useful on Debian: it checks for the existence of processes named "httpd".
  - Also, the test isn't very meaningful: the existence of httpd processes doesn't mean that the website they are supposed to serve is available.

real.cfg defines commands to test the availability of RTSP servers:
- check_real_url
- check_real

ssh.cfg defines:
- check_ssh
- check_ssh_port (to check ssh on a nonstandard port)
- check_ssh_4
- check_ssh_port_4

tcp_udp.cfg defines commands to test the availability of arbitrary TCP/UDP ports (without any application layer test):
- check_tcp
- check_udp
- check_tcp_4
- check_udp_4

telnet.cfg defines:
- check_telnet
- check_telnet_4

users.cfg defines:
- check_users (checks whether the number of logged-in users exceeds a threshold)

Installing nagios-plugins-standard yields the following additional plugin configuration files:

breeze.cfg:
- check_breeze (checks the signal strength of a piece of Breezecom wireless equipment)

disk-smb.cfg:
- check_disk_smb (checks the amount of available free space on an SMB share)
- check_disk_smb_workgroup (same as above, but the name of the workgroup can also be specified)
- check_disk_smb_host (also specifies the IP of the server on the command line)
- check_disk_smb_workgroup_host
- check_disk_smb_user (also specifies a username to connect as)
- check_disk_smb_workgroup_user
- check_disk_smb_host_user
- check_disk_smb_workgroup_host_user

dns.cfg:
- check_dns (checks the availability of recursive DNS)
- check_dig (checks the availability of authoritative DNS)

flexlm.cfg:
- check_flexlm (checks the availability of a flexlm license manager)

fping.cfg:
- check-fast-alive (uses fping to check reachability, which may be faster than regular ping)

games.cfg:
- check_quake
- check_unreal

hppjd.cfg:
- check_hpjd (uses SNMP to check the status of an HP printer that has JetDirect)

ifstatus.cfg:
- check_ifstatus (SNMP based network interface status check)
- check_ifstatus_exclude (as above, but allows exclusion of specified interface types, such as PPP)
- check_ifoperstatus_ifindex
- check_ifoperstatus_ifdescr

ldap.cfg:
- check_ldap
- check_ldaps
- check_ldap_4
- check_ldaps_4

mrtg.cfg:
- check_mrtg
- traffic_average

mysql.cfg:
- check_mysql
- check_mysql_cmdlinecred
- check_mysql_database

netware.cfg:
- check_netware_logins
- check_nwstat_conns
- check_netware_1load
- check_netware_5load
- check_netware_15load
- check_nwstat_vol_p
- check_nwstat_vol_k
- check_nwstat_ltch
- check_nwstat_puprb
- check_nwstat_dsdb
- check_netware_abend
- check_nwstat_csprocs

nt.cfg (these commands depend on an "NSClient" service running on a Windows box and allow you to monitor the Windows box):
- check_nt
- check_nscp

pgsql.cfg:
- check_pgsql
- check_pgsql_4

radius.cfg:
- check_radius

rpc-nfs.cfg:
- check-rpc
- check-nfs

snmp.cfg:
- snmp_load
- snmp_cpustats
- snmp_procname
- snmp_disk
- snmp_mem
- snmp_swap
- snmp_procs
- snmp_users
- snmp_mem2
- snmp_swap2
- snmp_mem3
- snmp_swap3
- snmp_disk2
- snmp_tcpopen
- snmp_tcpstats
- check_snmp_bgpstate
- check_netapp_uptime
- check_netapp_cpuload
- check_netapp_numdisks
- check_compaq_thermalCondition

4 First steps

4.1 Contact address

The very first thing you should do is make sure Nagios has a way of sending you notifications. There are several ways to do that: the default configuration already includes a contact named "root" and a contactgroup called "admins", so you could modify those; or you could define a new contact and a new contact group (in a new configuration file); or some combination of these.

I suggest that you modify the existing contacts_nagios2.cfg file for the sake of simplicity. Enter your own working email address instead of root@localhost, or make sure root@localhost can receive locally generated mail and that it's forwarded to you.
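
After the change, the contact definition in contacts_nagios2.cfg might look roughly like this. The address admin@example.org is a placeholder; the other directives are shown only for context and may differ in your copy of the file.

define contact{
	contact_name			root
	alias				Root
	service_notification_period	24x7
	host_notification_period	24x7
	service_notification_options	w,u,c,r
	host_notification_options	d,r
	service_notification_commands	notify-service-by-email
	host_notification_commands	notify-host-by-email
	email				admin@example.org	; placeholder; was root@localhost
	}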