En:Nagios

From the Unix/Linux server administration wiki
You are viewing an earlier revision of this page, as it stood after the edit made by KornAndras on 23 March 2010 at 15:24.

This page is a work in progress.

This article attempts to be a concise, to-the-point introduction to the guts of Nagios. It is assumed that the reader knows what Nagios is and what it does, and has at least a general idea of how it works. Basic installation and setup will be covered, not with the goal of attaining a specific working configuration, but with a view to helping you understand what can be tweaked where in order to do what.

The text below applies to the version of nagios3 in Debian unstable ("sid") as of March 2010. The stable and testing distributions may behave slightly differently.

1 Important concepts

Before we continue, some concepts need to be introduced. All of them provide some kind of indirection, mostly aimed at saving typing while writing the configuration (which still takes a long time for a system of any complexity).

1.1 Macro

A macro is something most people would probably call a variable. Nagios macros have upper-case names, enclosed in dollar signs; whenever they are referenced, they are replaced with the value associated with that particular macro in that particular context.
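
As a minimal sketch of how macros appear in practice, the fragment below assumes that resource.cfg defines $USER1$ as the plugin directory (the path is the one used elsewhere in this article) and references it, together with the context-dependent $HOSTADDRESS$ macro, from a command definition. The command name my_check_ping is made up for illustration.

```
# resource.cfg (assumed content): a user-defined macro
$USER1$=/usr/lib/nagios/plugins

# A command definition referencing three kinds of macros
define command{
        command_name    my_check_ping        ; hypothetical name
        ; $USER1$       comes from resource.cfg
        ; $HOSTADDRESS$ comes from the host being checked
        ; $ARG1$/$ARG2$ come from whoever references the command
        command_line    $USER1$/check_ping -H '$HOSTADDRESS$' -w '$ARG1$' -c '$ARG2$'
        }
```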

1.2 Command

A command is some external binary Nagios can run. Its definition includes its name, the full path of the binary and, optionally, command line arguments to pass to the binary. These arguments can reference macros (such as $HOSTADDRESS$) that are derived from the context the command is used in, as well as macros of the form $ARG1$. The values for these $ARGx$ macros are passed in when referencing the command, like this:

check_command                   check_all_disks!20%!10%

The check_all_disks command is defined as follows:

define command{
	command_name	check_all_disks
	command_line	/usr/lib/nagios/plugins/check_disk -w '$ARG1$' -c '$ARG2$' -e
	}

When invoking this check as shown above, $ARG1$ will have a value of 20% while $ARG2$ will expand to 10%. In case you were wondering, these specify the "warning" and "critical" thresholds for the plugin (the -e switch causes it to only report filesystems that are too full). The idea here is that you could easily modify the check_all_disks command definition to call a different binary as the binary itself is only referenced in this one place. This is the advantage of the indirection. The disadvantage is that its use is mandatory: if you need a one-shot command for a specific service, you can't just define it along with the service. You must define the command and then reference it in the service definition.
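
To show the second half of that indirection, here is a sketch of a service definition that references the command. It assumes the generic-service template and the localhost host from the default Debian configuration; the service description is arbitrary.

```
define service{
        use                     generic-service          ; inherit defaults from the template
        host_name               localhost
        service_description     Disk Space
        check_command           check_all_disks!20%!10%  ; $ARG1$=20%, $ARG2$=10%
        }
```

The exclamation marks separate the command name from its arguments, so a second service could reuse the same command with different thresholds.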

1.3 Host

A host is an object that has one or more services associated with it. These services are what Nagios monitors (often by attempting to use them). A host is basically a group of services reachable via the IP address of the host. Hosts also appear in various parts of the web interface as clickable objects.
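
A host definition can be as short as the following sketch. The host name, alias and address are placeholders (192.0.2.10 is a documentation address), and the generic-host template is assumed to be the one shipped in the default configuration.

```
define host{
        use             generic-host        ; inherit defaults from the template
        host_name       web1                ; hypothetical name, used to reference the host
        alias           Example web server  ; longer description shown in the web UI
        address         192.0.2.10          ; becomes $HOSTADDRESS$ in commands
        }
```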

1.4 Service

A service is something for which we can define a command that checks its status (which, for the sake of simplicity, can be "OK", "WARNING" or "CRITICAL"). For the purposes of Nagios, the availability of disk space is also a "service" (as shown above). To further drive home the point that Nagios "services" aren't necessarily network services, consider that you could, for example, define a command that checks the value of some stock market commodity and reports "OK" if it gained value compared to the value it had 24 hours ago, WARNING if its value stayed constant or decreased by at most $ARG1$ percent, and CRITICAL if its price dropped even further. You could then define a "service" for each commodity you want to track, with different thresholds, if you wanted.

I don't mean to suggest that this is a sensible use of Nagios, but it's definitely within the realm of possibility.
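
Purely as an illustration of the idea, the commodity example could be sketched as follows. The check_commodity plugin, its path and its options are entirely hypothetical; no such plugin ships with Nagios. The services are attached to localhost only because every service needs a host.

```
define command{
        command_name    check_commodity
        ; hypothetical plugin: $ARG1$ = commodity symbol, $ARG2$ = tolerated drop in percent
        command_line    /usr/local/lib/nagios/plugins/check_commodity --symbol '$ARG1$' --max-drop '$ARG2$'
        }

define service{
        use                     generic-service
        host_name               localhost
        service_description     Gold price
        check_command           check_commodity!GOLD!2
        }

define service{
        use                     generic-service
        host_name               localhost
        service_description     Crude oil price
        check_command           check_commodity!OIL!5    ; a different threshold per "service"
        }
```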

1.5 Templates

You can use host and service templates as the basis of many similar hosts or services. This makes sense because there are many individual settings to configure for each host/service, and some of those have pretty long names; a template-based configuration is thus a lot shorter and more readable (provided you can remember what the default setting in each template was).

You can think of templates in terms of objects as they are used in OOP: a template is a parent object the child object inherits properties from. You can even use multiple inheritance, and a "real" host/service (that isn't just a template) can also be specified as an inheritance source, so pretty complex/weird setups are possible. The details are explained in the official documentation.
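
The following sketch shows one level of inheritance. The template name slow-service and the interval are made up; generic-service is assumed to be the template from the default configuration, and check_apt the command defined in apt.cfg.

```
# A template: it has a "name" and is never registered as a real service
define service{
        name                    slow-service     ; hypothetical template name
        use                     generic-service  ; inherit everything from the parent template
        check_interval          60               ; override a single setting
        register                0                ; template only, do not monitor this
        }

# A real service that inherits from the template
define service{
        use                     slow-service
        host_name               localhost
        service_description     APT updates
        check_command           check_apt
        }
```

For multiple inheritance, the use directive accepts a comma-separated list of templates.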

1.6 Host groups

A host group is, as the name suggests, a group of hosts that share some aspect of their configuration. For example, the default configuration shipped with the Nagios3 Debian package includes the following hostgroups:

  • debian-servers (they will all be represented by a Debian Swirl logo in the Web UI);
  • http-servers (members of this group have a webserver running on them, which will be monitored by Nagios);
  • ssh-servers (self-explanatory);
  • ping-servers (members of this group have a "ping service" associated with them, which means Nagios will periodically check whether they are reachable. By default, the default gateway is a member of this hostgroup).
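
A new service-based hostgroup can be modelled on these. The sketch below assumes a hypothetical smtp-servers group and uses the check_smtp command from mail.cfg, on the assumption that it takes no arguments.

```
define hostgroup{
        hostgroup_name  smtp-servers         ; hypothetical group name
        alias           SMTP servers
        members         localhost            ; comma-separated list of host names
        }

# One definition creates the service on every member of the group
define service{
        use                     generic-service
        hostgroup_name          smtp-servers
        service_description     SMTP
        check_command           check_smtp
        }
```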

1.7 Contacts and contact groups

A contactgroup is a collection of contacts that are notified when monitored services change state. Contacts have time periods (defined separately and referenced by name) associated with them; they only get notifications during their respective time periods. The idea is to have, e.g., a contact group consisting of, say, webserver operators who work in shifts. Whenever there is trouble with a webserver, Nagios would check which member(s) of the contact group to notify based on what day and time it is.
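
The shift scenario might be written roughly like this. The contact names and addresses are placeholders; the workhours and nonworkhours time periods and the notify-*-by-email commands are assumed to be the ones from the default configuration.

```
define contact{
        contact_name                    alice                ; hypothetical day shift operator
        alias                           Day shift
        service_notification_period     workhours            ; only notified during work hours
        host_notification_period        workhours
        service_notification_options    w,u,c,r
        host_notification_options       d,r
        service_notification_commands   notify-service-by-email
        host_notification_commands      notify-host-by-email
        email                           alice@example.com
        }

define contact{
        contact_name                    bob                  ; hypothetical night shift operator
        alias                           Night shift
        service_notification_period     nonworkhours
        host_notification_period        nonworkhours
        service_notification_options    w,u,c,r
        host_notification_options       d,r
        service_notification_commands   notify-service-by-email
        host_notification_commands      notify-host-by-email
        email                           bob@example.com
        }

define contactgroup{
        contactgroup_name       web-operators
        alias                   Webserver operators
        members                 alice,bob
        }
```

A host or service that lists web-operators in its contact_groups setting would then notify whichever member's time period covers the current time.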

2 Installing Nagios

First, install the nagios3 package and some monitoring plug-ins:

apt-get install nagios3 nagios-plugins-basic nagios-plugins-standard

This will install the binaries and a very basic configuration that monitors some aspects of "localhost".

3 Configuration overview

This section will read like a telephone directory; feel free to just skim through it. The purpose is to give you a general idea of what can be configured where, as the configuration can seem a little chaotic at first.

Let's take a look at the configuration installed in /etc/nagios3.

  • commands.cfg: command definitions (unlikely to need modification)
    • notify-host-by-email
    • notify-service-by-email
    • process-host-perfdata
    • process-service-perfdata

Any files in the conf.d/ subdirectory whose name ends in ".cfg" will be read and processed by Nagios.

  • conf.d/contacts_nagios2.cfg: default contacts
    • contact "root" (email root@localhost)
    • contactgroup "admins" (only member: root)
  • conf.d/extinfo_nagios2.cfg:
    • hostextinfo hostgroup debian-servers (adds fancy icons and such)
  • conf.d/generic-host_nagios2.cfg:
    • generic-host template (enables flap detection, notification etc.)
  • conf.d/generic-service_nagios2.cfg:
    • generic-service template (sets defaults, as above)
  • conf.d/host-gateway_nagios3.cfg:
    • defines the 'gateway' host as a generic-host; its IP probably needs to be set manually.
  • conf.d/hostgroups_nagios2.cfg:
    • hostgroup all (members *)
    • hostgroup debian-servers (members localhost)
    • hostgroup http-servers (members localhost)
    • hostgroup ssh-servers (members localhost)
    • hostgroup ping-servers (members gateway)
      • for hosts that don't even have snmp; nagios needs a "service" it can monitor, so for these hosts, we define "ping" as a service.
  • conf.d/localhost_nagios2.cfg:
    • defines the 'localhost' host as a generic-host and some "services" on it:
      • diskspace (check_all_disks);
      • logged in users (check_users);
      • total processes (check_procs);
      • load average (check_load).
  • conf.d/services_nagios2.cfg: defines the services associated with service-based hostgroups
    • check_http for http-servers;
    • check_ssh for ssh-servers;
    • check_ping for ping-servers.
  • conf.d/timeperiods_nagios2.cfg: defines various time periods (which can be used to decide which contact to notify):
    • 24x7;
    • workhours (Monday-Friday, 9:00-17:00);
    • nonworkhours (complements workhours);
    • never.
  • resource.cfg: used to define variables (which Nagios calls "macros").
    • These can be referenced in command definitions.
    • Only 32 are supported and they all must have names of the form $USERx$.
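
If the shipped time periods are not enough, an additional one can be defined in any .cfg file under conf.d/. The following is a sketch of a hypothetical "weekends" period; weekdays that are not listed are simply not part of the period.

```
define timeperiod{
        timeperiod_name weekends             ; hypothetical name, referenced from contacts
        alias           Saturday and Sunday
        saturday        00:00-24:00
        sunday          00:00-24:00
        }
```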

Plugin configuration files reside in /etc/nagios-plugins/config. The following are shipped by default (by the nagios-plugins-basic package):

  • apt.cfg defines two commands:
    • check_apt (checks how many packages could be upgraded; apparently warns if there are any, and reports critical status if any possible upgrades are "critical")
    • check_apt_distupgrade (same as above, but for APT's dist-upgrade operation)
  • dhcp.cfg defines two commands (both of which need root privileges):
    • check_dhcp
    • check_dhcp_interface
  • disk.cfg defines the following commands:
    • check_disk
    • check_all_disks
    • ssh_disk
    • ssh_disk_4 (to test IPv4 connectivity on IPv6 enabled systems)
  • dummy.cfg contains some commands that are only useful for testing; they always return a fixed status.
    • return-ok
    • return-warning
    • return-critical
    • return-unknown
    • return-numeric
  • ftp.cfg defines the following commands:
    • check_ftp
    • check_ftp_4 (to test IPv4 connectivity on IPv6 enabled systems)
  • http.cfg defines many commands:
    • check_http (will try to fetch http://ip.of.host/)
    • check_httpname (will try to fetch http://name.of.virtual.host/ from ip.of.host)
    • check_http2 (permits manual tuning of critical and warning thresholds)
    • check_squid
    • check_https
    • check_https_hostname
    • check_https_auth
    • check_https_auth_hostname
    • check_cups (will try an HTTP request to port 631)
    • All of the above also exist with a "_4" suffix which forces the plugin to use IPv4.
  • load.cfg defines:
    • check_load
  • mail.cfg defines:
    • check_pop
    • check_smtp
    • check_ssmtp
    • check_imap
    • check_spop (this should actually be called check_pop3s)
    • check_simap (this should actually be called check_imaps)
    • check_mailq_sendmail
    • check_mailq_postfix
    • check_mailq_exim
    • check_mailq_qmail
    • As usual, these also come with IPv4-only variants.
  • nntp.cfg defines:
    • check_nntp
    • check_nntp_4
  • ntp.cfg defines:
    • check_ntp
    • check_ntp_ntpq
    • check_time
  • ping.cfg defines:
    • check_ping
    • The following are actually defined identically. The aliases help keep the distinction between hosts, printers, switches and routers; also, they allow you to modify the ping command used to test the reachability of one kind of device without affecting the others.
      • check-host-alive
      • check-printer-alive
      • check-switch-alive
      • check-router-alive
    • Again, IPv4-only variants are provided.
  • procs.cfg defines:
    • check_procs
    • check_procs_zombie
    • check_procs_httpd
      • This is more an example than something actually useful on Debian: it checks for the existence of processes named "httpd".
      • Also, the test isn't very meaningful: the existence of httpd processes doesn't mean that the website they are supposed to serve is available.
  • real.cfg defines commands to test the availability of RTSP servers:
    • check_real_url
    • check_real
  • ssh.cfg defines:
    • check_ssh
    • check_ssh_port (to check ssh on a nonstandard port)
    • check_ssh_4
    • check_ssh_port_4
  • tcp_udp.cfg defines commands to test the availability of arbitrary TCP/UDP ports (without any application layer test):
    • check_tcp
    • check_udp
    • check_tcp_4
    • check_udp_4
  • telnet.cfg defines:
    • check_telnet
    • check_telnet_4
  • users.cfg defines:
    • check_users (checks whether the number of logged-in users exceeds a threshold)
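
Using one of these packaged commands in a service is then a matter of referencing it by name. The sketch below assumes that check_ssh_port expects the port number as $ARG1$; it is worth confirming this in ssh.cfg before relying on it. The port 2222 is arbitrary.

```
define service{
        use                     generic-service
        host_name               localhost
        service_description     SSH on port 2222
        check_command           check_ssh_port!2222   ; assumption: $ARG1$ is the port
        }
```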

Installing nagios-plugins-standard yields the following additional plugin configuration files:

  • breeze.cfg:
    • check_breeze (checks the signal strength of a piece of Breezecom wireless equipment)
  • disk-smb.cfg:
    • check_disk_smb (checks the amount of available free space on an SMB share)
    • check_disk_smb_workgroup (same as above, but the name of the workgroup can also be specified)
    • check_disk_smb_host (also specifies the IP of the server on the command line)
    • check_disk_smb_workgroup_host
    • check_disk_smb_user (also specifies a username to connect as)
    • check_disk_smb_workgroup_user
    • check_disk_smb_host_user
    • check_disk_smb_workgroup_host_user
  • dns.cfg:
    • check_dns (checks the availability of recursive DNS)
    • check_dig (checks the availability of authoritative DNS)
  • flexlm.cfg:
    • check_flexlm (checks the availability of a flexlm license manager)
  • fping.cfg:
    • check-fast-alive (uses fping to check reachability, which may be faster than regular ping)
  • games.cfg:
    • check_quake
    • check_unreal
  • hppjd.cfg:
    • check_hpjd (uses SNMP to check the status of an HP printer that has JetDirect)
  • ifstatus.cfg:
    • check_ifstatus (SNMP based network interface status check)
    • check_ifstatus_exclude (as above, but allows exclusion of specified interface types, such as PPP)
    • check_ifoperstatus_ifindex
    • check_ifoperstatus_ifdescr
  • ldap.cfg:
    • check_ldap
    • check_ldaps
    • check_ldap_4
    • check_ldaps_4
  • mrtg.cfg:
    • check_mrtg
    • traffic_average
  • mysql.cfg:
    • check_mysql
    • check_mysql_cmdlinecred
    • check_mysql_database
  • netware.cfg:
    • check_netware_logins
    • check_nwstat_conns
    • check_netware_1load
    • check_netware_5load
    • check_netware_15load
    • check_nwstat_vol_p
    • check_nwstat_vol_k
    • check_nwstat_ltch
    • check_nwstat_puprb
    • check_nwstat_dsdb
    • check_netware_abend
    • check_nwstat_csprocs
  • nt.cfg (these commands depend on an "NSClient" service running on a Windows box and allow you to monitor the Windows box):
    • check_nt
    • check_nscp
  • pgsql.cfg:
    • check_pgsql
    • check_pgsql_4
  • radius.cfg:
    • check_radius
  • rpc-nfs.cfg:
    • check-rpc
    • check-nfs
  • snmp.cfg:
    • snmp_load
    • snmp_cpustats
    • snmp_procname
    • snmp_disk
    • snmp_mem
    • snmp_swap
    • snmp_procs
    • snmp_users
    • snmp_mem2
    • snmp_swap2
    • snmp_mem3
    • snmp_swap3
    • snmp_disk2
    • snmp_tcpopen
    • snmp_tcpstats
    • check_snmp_bgpstate
    • check_netapp_uptime
    • check_netapp_cpuload
    • check_netapp_numdisks
    • check_compaq_thermalCondition

4 First steps

4.1 Contact address

The very first thing you should do is make sure Nagios has a way of sending you notifications. There are several ways to do that: the default configuration already includes a contact named "root" and a contactgroup called "admins", so you could modify those; or you could define a new contact and a new contact group (in a new configuration file); or some combination of these.

I suggest that you modify the existing contacts_nagios2.cfg file for the sake of simplicity. Enter your own working email address instead of root@localhost, or make sure root@localhost can receive locally generated mail and that it's forwarded to you.
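
After the change, the contact definition in contacts_nagios2.cfg might look roughly like the sketch below. Only the email line needs to differ from the packaged version; the address shown is a placeholder, and the other settings are assumed to match the defaults, so keep whatever your file already contains.

```
define contact{
        contact_name                    root
        alias                           Root
        service_notification_period     24x7
        host_notification_period        24x7
        service_notification_options    w,u,c,r
        host_notification_options       d,r
        service_notification_commands   notify-service-by-email
        host_notification_commands      notify-host-by-email
        email                           admin@example.com    ; placeholder: your own address
        }
```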
