En:Nagios

From the Unix/Linux server administration wiki
This is an old revision of the page, as it stood after the edit by KornAndras on 24 March 2010, 18:09.

This page is a work in progress.

This article attempts to be a concise, to-the-point introduction to the guts of Nagios. It is assumed that the reader is familiar with what Nagios is and what it does, and has at least a general idea of how it works. Basic installation and setup will be covered, not with the goal of attaining a specific working configuration, but with a view to helping you understand what can be tweaked where in order to do what.

The text below applies to the version of nagios3 in Debian unstable ("sid") as of March 2010. The stable and testing distributions may behave slightly differently.

1 Important concepts

Before we continue, there are some concepts to be introduced. All of these provide some kind of indirection, mostly aimed at saving typing while writing the configuration (which does take very long even so, at least for a system of any complexity).

1.1 Macro

A macro is something most people would probably call a variable. Nagios macros have upper-case names, enclosed in dollar signs; whenever they are referenced, they are replaced with the value associated with that particular macro in that particular context.
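
As a small illustration of macro expansion, here is a sketch of a command definition (commands are introduced in the next section) that uses the standard $HOSTADDRESS$ macro. The command name check_ping_example and the thresholds are made up for this example; the plugin path is the one used elsewhere on this page.

```cfg
# Hypothetical command, for illustration only.
# When Nagios runs it for a host whose address is 192.0.2.10,
# $HOSTADDRESS$ is replaced by 192.0.2.10 before the plugin is started.
define command{
	command_name	check_ping_example
	command_line	/usr/lib/nagios/plugins/check_ping -H '$HOSTADDRESS$' -w '100.0,20%' -c '500.0,60%'
	}
```

The same command, used for a different host, would see a different value of $HOSTADDRESS$; that is what "in that particular context" means above.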

1.2 Command

A command is some external binary Nagios can run. Its definition includes its name, the full path of the binary, and optionally, command line arguments to pass to the binary. These arguments can reference macros (such as $HOSTADDRESS$) that are derived from the context the command is used in, as well as macros of the form $ARGx$ ($ARG1$, $ARG2$ and so on). The values for these $ARGx$ macros are passed in, separated by exclamation marks, when referencing the command like this:

check_command                   check_all_disks!20%!10%

The check_all_disks command is defined as follows:

define command{
	command_name	check_all_disks
	command_line	/usr/lib/nagios/plugins/check_disk -w '$ARG1$' -c '$ARG2$' -e
	}

When invoking this check as shown above, $ARG1$ will have a value of 20% while $ARG2$ will expand to 10%. In case you were wondering, these specify the "warning" and "critical" thresholds for the plugin (the -e switch causes it to only report filesystems that are too full). The idea here is that you could easily modify the check_all_disks command definition to call a different binary as the binary itself is only referenced in this one place. This is the advantage of the indirection. The disadvantage is that its use is mandatory: if you need a one-shot command for a specific service, you can't just define it along with the service. You must define the command and then reference it in the service definition.
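
To show where the check_command line lives, here is a sketch of a complete service definition that references the command. It assumes the generic-service template and the localhost host from the default Debian configuration; the service description is arbitrary.

```cfg
define service{
	use			generic-service		; inherit defaults from the template
	host_name		localhost
	service_description	Disk Space
	; $ARG1$ becomes 20% (warning), $ARG2$ becomes 10% (critical)
	check_command		check_all_disks!20%!10%
	}
```

A second service could reference the same command with different thresholds, for example check_all_disks!30%!15%, without touching the command definition.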

1.3 Hosts

A host is an object that has one or more services associated with it. These services are what Nagios monitors (often by attempting to use them). A host is basically a group of services reachable via the IP address of the host. Hosts also appear in various parts of the web interface as clickable objects. Hosts can depend on each other: if a "closer" host is down, Nagios treats the hosts "behind" it as unreachable rather than down, and (with suitable dependency definitions) need not even test the services on them.

Also see the official documentation.
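
The "closer"/"behind" relationship is usually expressed with the parents directive. The following is a minimal sketch; the host names switch1 and web1 and the addresses are hypothetical, and generic-host is the template shipped with the Debian package.

```cfg
define host{
	use		generic-host
	host_name	switch1		; hypothetical: the "closer" device
	alias		switch1
	address		192.0.2.1
	}

define host{
	use		generic-host
	host_name	web1		; hypothetical: sits "behind" switch1
	alias		web1
	address		192.0.2.10
	parents		switch1		; if switch1 is down, web1 is UNREACHABLE, not DOWN
	}
```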

1.4 Services

A service is something for which we can define a command that checks its status (which, for the sake of simplicity, can be "OK", "WARNING" or "CRITICAL"). For the purposes of Nagios, the availability of disk space is also a "service" (as shown above). To further drive home the point that Nagios "services" aren't necessarily network services, consider that you could, for example, define a command that checks the value of some stock market commodity and reports "OK" if it gained value compared to the value it had 24 hours ago, WARNING if its value stayed constant or decreased by at most $ARG1$ percent, and CRITICAL if its price dropped even further. You could then define a "service" for each commodity you want to track, with different thresholds, if you wanted.

I don't mean to suggest that this is a sensible use of Nagios, but it's definitely within the realm of possibility.

Every service must be associated with a host in order to be checked. You can think of "services" as being abstract concepts; only their specific, existing instances (which are each bound to an IP address, represented by a host) can actually be checked for availability.

It is also possible for services to depend on each other; Nagios allows you to define service dependencies. We won't be going into that here.

1.5 Templates

You can use host and service templates as the basis of many similar hosts or services. This makes sense because there are many individual settings to configure for each host/service, and some of those have pretty long names; a template-based configuration is thus a lot shorter and more readable (provided you can remember what the default setting in each template was).

You can think of templates in terms of objects as they are used in OOP: a template is a parent object the child object inherits properties from. You can even use multiple inheritance, and a "real" host/service (that isn't just a template) can also be specified as an inheritance source, so pretty complex/weird setups are possible. The details are explained in the official documentation.
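
A minimal sketch of inheritance might look like this. The template name example-linux-server, the host web1 and the overridden value are assumptions made for the example; generic-host is the template from the Debian package.

```cfg
# A template: "register 0" tells Nagios this is not a real host.
define host{
	name			example-linux-server	; hypothetical template name
	use			generic-host		; templates can inherit from templates
	max_check_attempts	5			; overrides the inherited value
	register		0
	}

# A real host that inherits everything from the template above.
define host{
	use		example-linux-server
	host_name	web1
	alias		web1
	address		192.0.2.10
	}
```

Multiple inheritance is written as a comma-separated list in the use directive; settings from templates listed earlier take precedence.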

1.6 Host groups

A host group is, as the name suggests, a group of hosts that share some aspect of their configuration. For example, the default configuration shipped with the Nagios3 Debian package includes the following hostgroups:

  • debian-servers (they will all be represented by a Debian Swirl logo in the Web UI);
  • http-servers (members of this group have a webserver running on them, which will be monitored by Nagios);
  • ssh-servers (self-explanatory);
  • ping-servers (members of this group have a "ping service" associated with them, which means Nagios will periodically check whether they are reachable. By default, the default gateway is a member of this hostgroup).
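
For reference, a hostgroup definition along the lines of the shipped ping-servers group looks roughly like this (a sketch, not a verbatim copy of the packaged file):

```cfg
define hostgroup{
	hostgroup_name	ping-servers
	alias		Pingable servers
	members		gateway		; comma-separated list of host names
	}
```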

1.7 Service groups

Service groups are similar to host groups. Using service groups can simplify configuration and help unclutter the web interface by visually grouping related services together.
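
A service group could be sketched as follows. The group name and its members are hypothetical; the example assumes that hosts web1 and web2 exist and that each has a service whose service_description is HTTP.

```cfg
define servicegroup{
	servicegroup_name	web-services	; hypothetical name
	alias			Web services
	; members are given as host,service pairs
	members			web1,HTTP,web2,HTTP
	}
```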

1.8 Contacts and contact groups

A contactgroup is a collection of contacts that are notified when monitored services change state. Contacts have time periods (defined separately and referenced by name) associated with them; they only get notifications during their respective time periods. The idea is to have, e.g., a contact group consisting of, say, webserver operators who work in shifts. Whenever there is trouble with a webserver, Nagios would check which member(s) of the contact group to notify based on what day and time it is. (Note: time periods can also be used to restrict when the availability of a service can/should be checked.)
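
The shift scenario above might be sketched like this. The contacts alice and bob, their addresses and the group name are invented for the example; the workhours and nonworkhours time periods and the notify-*-by-email commands are the ones from the default Debian configuration.

```cfg
define contact{
	contact_name			alice		; hypothetical day-shift operator
	alias				Alice
	email				alice@example.com
	service_notification_period	workhours
	host_notification_period	workhours
	service_notification_options	w,u,c,r
	host_notification_options	d,r
	service_notification_commands	notify-service-by-email
	host_notification_commands	notify-host-by-email
	}

define contact{
	contact_name			bob		; hypothetical night-shift operator
	alias				Bob
	email				bob@example.com
	service_notification_period	nonworkhours
	host_notification_period	nonworkhours
	service_notification_options	w,u,c,r
	host_notification_options	d,r
	service_notification_commands	notify-service-by-email
	host_notification_commands	notify-host-by-email
	}

define contactgroup{
	contactgroup_name	webserver-operators
	alias			Webserver operators
	members			alice,bob
	}
```

A host or service that names webserver-operators in its contact_groups directive would then notify alice during work hours and bob at all other times.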

2 How it fits together

All of these concepts might be difficult to make sense of and understand intuitively, so here is some help. This link was provided for a reason; I strongly suggest you go and read the documentation section it points to.

Normally, you'd have many hosts that all run the same service; therefore, you would:

  • define each host separately; then either
    • define a hostgroup for each service;
    • add every host that runs the service to this hostgroup (in the hostgroup definition);
    • and assign the service to the hostgroup in the definition of the service;
  • or:
    • define the service, and
    • in its definition, assign it to a list of hosts.

This way you don't have to define the service separately for each host that runs it.
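
The two approaches can be sketched as follows. The hosts mail1 and mail2 and the hostgroup smtp-servers are hypothetical and assumed to be defined elsewhere where needed; check_smtp and check_ssh are commands shipped with the plugin packages, and generic-service is the Debian template.

```cfg
# Variant 1: go through a hostgroup.
define hostgroup{
	hostgroup_name	smtp-servers	; hypothetical group
	alias		SMTP servers
	members		mail1,mail2
	}

define service{
	use			generic-service
	hostgroup_name		smtp-servers	; applies to every member of the group
	service_description	SMTP
	check_command		check_smtp
	}

# Variant 2: list the hosts directly in the service definition.
define service{
	use			generic-service
	host_name		mail1,mail2
	service_description	SSH
	check_command		check_ssh
	}
```

With the first variant, adding a third mail server only requires adding it to the members line of the hostgroup.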

3 Installing Nagios

First, install the nagios3 package and some monitoring plug-ins:

apt-get install nagios3 nagios-plugins-basic nagios-plugins-standard

This will install the binaries and a very basic configuration that monitors some aspects of "localhost".

During installation, the nagios3-cgi package will ask you what webserver to set up for Nagios. If you select your webserver (which should of course be installed already), you'll get a chance to specify a password for the nagiosadmin user.

4 Configuration overview

This section will read like a telephone directory; feel free to just skim through it. The purpose is to give you a general idea of what can be configured where, as the configuration can seem a little chaotic at first.

Let's take a look at the configuration installed in /etc/nagios3.

  • commands.cfg: command definitions (unlikely to need modification)
    • notify-host-by-email
    • notify-service-by-email
    • process-host-perfdata
    • process-service-perfdata

Any files in the conf.d/ subdirectory whose name ends in ".cfg" will be read and processed by Nagios. Note that if you modify configuration files Nagios was shipped with, you may be asked on upgrades whether to overwrite your modified configuration file with the newer configuration from the newer package. This may or may not be what you want, so keep it in mind. For most purposes, just creating a new configuration file is probably the best approach.

  • conf.d/contacts_nagios2.cfg: default contacts
    • contact "root" (email root@localhost)
    • contactgroup "admins" (only member: root)
  • conf.d/extinfo_nagios2.cfg:
    • hostextinfo hostgroup debian-servers (adds fancy icons and such)
  • conf.d/generic-host_nagios2.cfg:
    • generic-host template (enables flap detection, notification etc.)
  • conf.d/generic-service_nagios2.cfg:
    • generic-service template (sets defaults, as above)
  • conf.d/host-gateway_nagios3.cfg:
    • defines the 'gateway' host as a generic-host; its IP probably needs to be set manually.
  • conf.d/hostgroups_nagios2.cfg:
    • hostgroup all (members *)
    • hostgroup debian-servers (members localhost)
    • hostgroup http-servers (members localhost)
    • hostgroup ssh-servers (members localhost)
    • hostgroup ping-servers (members gateway)
      • for hosts that don't even have SNMP; Nagios needs a "service" it can monitor, so for these hosts, we define "ping" as a service.
  • conf.d/localhost_nagios2.cfg:
    • defines the 'localhost' host as a generic-host and some "services" on it:
      • diskspace (check_all_disks);
      • logged in users (check_users);
      • total processes (check_procs);
      • load average (check_load).
  • conf.d/services_nagios2.cfg: defines the services associated with service-based hostgroups
    • check_http for http-servers;
    • check_ssh for ssh-servers;
    • check_ping for ping-servers.
  • conf.d/timeperiods_nagios2.cfg: defines various time periods (which can be used to decide which contact to notify):
    • 24x7;
    • workhours (Monday-Friday, 9:00-17:00);
    • nonworkhours (complements workhours);
    • never.
  • resource.cfg: used to define variables (which Nagios calls "macros").
    • These can be referenced in command definitions.
    • Only 32 are supported and they all must have names of the form $USERx$.
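
As a sketch of how resource.cfg is typically used: $USER1$ conventionally holds the plugin directory, and the remaining slots are free for site-specific values. The second macro and the command below are hypothetical.

```cfg
# In resource.cfg:
$USER1$=/usr/lib/nagios/plugins

# Hypothetical: a site-specific value kept out of the command definitions.
$USER3$=example-value

# In a command definition (a separate .cfg file); hypothetical command name:
define command{
	command_name	check_disk_example
	command_line	$USER1$/check_disk -w '$ARG1$' -c '$ARG2$' -e
	}
```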

Plugin configuration files reside in /etc/nagios-plugins/config. The following are shipped by default (by the nagios-plugins-basic package):

  • apt.cfg defines two commands:
    • check_apt (checks how many packages could be upgraded; apparently warns if there are any, and reports critical status if any possible upgrades are "critical")
    • check_apt_distupgrade (same as above, but for APT's dist-upgrade operation)
  • dhcp.cfg defines two commands (both of which need root privileges):
    • check_dhcp
    • check_dhcp_interface
  • disk.cfg defines the following commands:
    • check_disk
    • check_all_disks
    • ssh_disk
    • ssh_disk_4 (to test IPv4 connectivity on IPv6 enabled systems)
  • dummy.cfg contains some commands that are only useful for testing; they always return a fixed status.
    • return-ok
    • return-warning
    • return-critical
    • return-unknown
    • return-numeric
  • ftp.cfg defines the following commands:
    • check_ftp
    • check_ftp_4 (to test IPv4 connectivity on IPv6 enabled systems)
  • http.cfg defines many commands:
    • check_http (will try to fetch http://ip.of.host/)
    • check_httpname (will try to fetch http://name.of.virtual.host/ from ip.of.host)
    • check_http2 (permits manual tuning of critical and warning thresholds)
    • check_squid
    • check_https
    • check_https_hostname
    • check_https_auth
    • check_https_auth_hostname
    • check_cups (will try an HTTP request to port 631)
    • All of the above also exist with a "_4" suffix which forces the plugin to use IPv4.
  • load.cfg defines:
    • check_load
  • mail.cfg defines:
    • check_pop
    • check_smtp
    • check_ssmtp
    • check_imap
    • check_spop (this should actually be called check_pop3s)
    • check_simap (this should actually be called check_imaps)
    • check_mailq_sendmail
    • check_mailq_postfix
    • check_mailq_exim
    • check_mailq_qmail
    • As usual, these also come with IPv4-only variants.
  • nntp.cfg defines:
    • check_nntp
    • check_nntp_4
  • ntp.cfg defines:
    • check_ntp
    • check_ntp_ntpq
    • check_time
  • ping.cfg defines:
    • check_ping
    • The following are actually defined identically. The aliases help keep the distinction between hosts, printers, switches and routers; also, they allow you to modify the ping command used to test the reachability of one kind of device without affecting the others.
      • check-host-alive
      • check-printer-alive
      • check-switch-alive
      • check-router-alive
    • Again, IPv4-only variants are provided.
  • procs.cfg defines:
    • check_procs
    • check_procs_zombie
    • check_procs_httpd
      • This is more an example than something actually useful on Debian: it checks for the existence of processes named "httpd".
      • Also, the test isn't very meaningful: the existence of httpd processes doesn't mean that the website they are supposed to serve is available.
  • real.cfg defines commands to test the availability of RTSP servers:
    • check_real_url
    • check_real
  • ssh.cfg defines:
    • check_ssh
    • check_ssh_port (to check ssh on a nonstandard port)
    • check_ssh_4
    • check_ssh_port_4
  • tcp_udp.cfg defines commands to test the availability of arbitrary TCP/UDP ports (without any application layer test):
    • check_tcp
    • check_udp
    • check_tcp_4
    • check_udp_4
  • telnet.cfg defines:
    • check_telnet
    • check_telnet_4
  • users.cfg defines:
    • check_users (checks whether the number of logged-in users exceeds a threshold)

Installing nagios-plugins-standard yields the following additional plugin configuration files:

  • breeze.cfg:
    • check_breeze (checks the signal strength of a piece of Breezecom wireless equipment)
  • disk-smb.cfg:
    • check_disk_smb (checks the amount of available free space on an SMB share)
    • check_disk_smb_workgroup (same as above, but the name of the workgroup can also be specified)
    • check_disk_smb_host (also specifies the IP of the server on the command line)
    • check_disk_smb_workgroup_host
    • check_disk_smb_user (also specifies a username to connect as)
    • check_disk_smb_workgroup_user
    • check_disk_smb_host_user
    • check_disk_smb_workgroup_host_user
  • dns.cfg:
    • check_dns (checks the availability of recursive DNS)
    • check_dig (checks the availability of authoritative DNS)
  • flexlm.cfg:
    • check_flexlm (checks the availability of a flexlm license manager)
  • fping.cfg:
    • check-fast-alive (uses fping to check reachability, which may be faster than regular ping)
  • games.cfg:
    • check_quake
    • check_unreal
  • hppjd.cfg:
    • check_hpjd (uses SNMP to check the status of an HP printer that has JetDirect)
  • ifstatus.cfg:
    • check_ifstatus (SNMP based network interface status check)
    • check_ifstatus_exclude (as above, but allows exclusion of specified interface types, such as PPP)
    • check_ifoperstatus_ifindex
    • check_ifoperstatus_ifdescr
  • ldap.cfg:
    • check_ldap
    • check_ldaps
    • check_ldap_4
    • check_ldaps_4
  • mrtg.cfg:
    • check_mrtg
    • traffic_average
  • mysql.cfg:
    • check_mysql
    • check_mysql_cmdlinecred
    • check_mysql_database
  • netware.cfg:
    • check_netware_logins
    • check_nwstat_conns
    • check_netware_1load
    • check_netware_5load
    • check_netware_15load
    • check_nwstat_vol_p
    • check_nwstat_vol_k
    • check_nwstat_ltch
    • check_nwstat_puprb
    • check_nwstat_dsdb
    • check_netware_abend
    • check_nwstat_csprocs
  • nt.cfg (these commands depend on an "NSClient" service running on a Windows box and allow you to monitor the Windows box):
    • check_nt
    • check_nscp
  • pgsql.cfg:
    • check_pgsql
    • check_pgsql_4
  • radius.cfg:
    • check_radius
  • rpc-nfs.cfg:
    • check-rpc
    • check-nfs
  • snmp.cfg:
    • snmp_load
    • snmp_cpustats
    • snmp_procname
    • snmp_disk
    • snmp_mem
    • snmp_swap
    • snmp_procs
    • snmp_users
    • snmp_mem2
    • snmp_swap2
    • snmp_mem3
    • snmp_swap3
    • snmp_disk2
    • snmp_tcpopen
    • snmp_tcpstats
    • check_snmp_bgpstate
    • check_netapp_uptime
    • check_netapp_cpuload
    • check_netapp_numdisks
    • check_compaq_thermalCondition

5 First steps

5.1 Contact address

The very first thing you should do is make sure Nagios has a way of sending you notifications. There are several ways to do that: the default configuration already includes a contact named "root" and a contactgroup called "admins", so you could modify those; or you could define a new contact and a new contact group (in a new configuration file); or some combination of these.

I suggest that you modify the existing contacts_nagios2.cfg file for the sake of simplicity. Enter your own working email address instead of root@localhost, or make sure root@localhost can receive locally generated mail and that it's forwarded to you.
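
After the change, the contact definition might look roughly like the sketch below. The surrounding directives are what a stock definition typically contains, so check them against your own file; the email address is a placeholder.

```cfg
define contact{
	contact_name			root
	alias				Root
	service_notification_period	24x7
	host_notification_period	24x7
	service_notification_options	w,u,c,r
	host_notification_options	d,r
	service_notification_commands	notify-service-by-email
	host_notification_commands	notify-host-by-email
	email				you@example.com	; placeholder: was root@localhost
	}
```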

Restart nagios3 afterwards using

/etc/init.d/nagios3 restart

(A restart is probably not necessary, but it should certainly be sufficient.)

5.2 Setting a password for nagiosadmin

If you didn't specify a password for the nagiosadmin user during package installation, you can do so manually using something like the following command:

htpasswd -cs /etc/nagios3/htpasswd.users nagiosadmin

5.3 Trying the web interface

If you point your browser at http://your.nagios.host/nagios3, you should be able to log in to the web UI using the "nagiosadmin" username and whatever password you specified. Try it and take a few minutes to familiarise yourself with the interface.

5.4 Adding new hosts and services

If there are at most a few dozen hosts to check, and especially if these are heterogeneous, I would suggest creating a separate configuration file for every host. A basic configuration might read as follows:

define host{
	use		generic-host	; Name of host template to use; settings like contacts will be inherited from there
	host_name	name-of-host
	alias		name-of-host
	address		1.2.3.4		; IP of host; FQDN also works but should only be used if DNS is reliable
	}
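
A host on its own is not checked for anything beyond reachability, so the same file would normally also define at least one service. The following is a sketch that assumes the host runs an SSH server; check_ssh is the command from ssh.cfg and generic-service is the Debian template.

```cfg
define service{
	use			generic-service	; Name of service template to use
	host_name		name-of-host	; must match host_name in the host definition
	service_description	SSH
	check_command		check_ssh
	}
```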