Nagios
Contents of this page:
- Misc
- onboarding session on Alerting / Nagios
- Services / alerts
- Pynag / Find all alerts
- Disabling notifications in Nagios
- Hosts
- NRPE
- NSCA
- Commands
- Nagios UI
- SkipDeploy
- Pdagent
- Testing / deploying new alerts:
Misc
-
Used for alerting
-
Set up a series of alerts based on all of our systems
-
Typically notify when something goes wrong, and also when something resolves
-
Integrations with Slack and PagerDuty
-
Configure settings in PagerDuty so alerts result in texts and phone calls
-
High urgency and low urgency alerts
-
Alerts contain links to runbooks
-
Try to strike a balance between good coverage and excessive noise
-
Alerts when thresholds are crossed
onboarding session on Alerting / Nagios
-
Nagios asks the NRPE daemon, which is installed on the host, to run a check and send the results back to Nagios. The advantage is that you only need one port open
- So effectively Nagios starts the conversation by asking for the data
-
With NSCA, the check runs on the host on a schedule, often via cron, and sends its result back to Nagios
-
- The difference is that with NSCA the host runs the check on its own schedule, without being asked
-
Active vs passive checks (which is how they’re defined in Nagios):
-
Active = NRPE: Nagios initiates the check
-
Passive = NSCA: no action on Nagios’ side, it just receives the result
-
Postfix and PDAgent are both running on Nagios hosts
-
Postfix handles SMTP, sending the emails that alerts generate
- We can’t send email over SMTP from within GCP infrastructure, so we use MailJet
-
PDAgent sending pages to PagerDuty
-
Catchpoint
- Pings acmeweb; if the response is too slow or missing, then after a few minutes it triggers an LSE (Large Scale Event)
Services / alerts
-
Services folder (nagios/etc/services) used to configure alerts
-
Services are basically alerts
-
!!! Services are named in their service_description field
-
!!! They’re really names not descriptions, even though they contain spaces!
-
check_interval: how often the service is checked; the default unit is minutes
-
Not all fields are mandatory
-
See Nagios object definitions: https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/objectdefinitions.html#service
-
notification_options: the states we send a notification for. “w,c,r” = warning, critical, recovery (resolved), which means we will be alerted for all three
-
“use generic_service” pulls in a template from the template folder; a lot of the required fields are specified there
-
Retries can be specified (see retry_interval and max_check_attempts), ie whether Nagios rechecks the status before alerting; the alert can resolve itself if the data changes
-
check_command: the first part is the command name, then the arguments are separated by !
-
So basically we’re running a script on the Nagios box
-
But sometimes we want to execute scripts on the relevant host – see NRPE below
-
Some of the Nagios config is generated from Chef automatically – things like lists of all roles and hosts
-
So for instance you might see hostgroup_name of SecThing_role – this tells you there is a role called SecThing, and the automation has appended _role to the name
-
check_aggregate names another service. The idea is that you aggregate across several hosts and alert only once per group of hosts, rather than once for every single host
- This does mean that the services being aggregated might have notification_period set to never, but that doesn’t mean they’re not alerting
- For instance, you might have a service which calls check_aggregate with the argument “Disk Space on /root”, which refers to another service called “Disk Space on /root”
-
notification_period
- Can be set to never – see check_aggregate
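
To picture what aggregation means here, a minimal sketch follows. It is not the real check_aggregate, which is bespoke and may work quite differently; the function name, the thresholds and the host names are all hypothetical. It takes the latest state of one service on each host and collapses them into a single state for the group.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios state codes

def aggregate(states_by_host, warn_at=1, crit_at=2):
    """Collapse per-host states into one state for the whole group.

    states_by_host maps a host name to the state code of the aggregated
    service on that host. warn_at and crit_at are hypothetical thresholds:
    the number of failing hosts needed for WARNING and CRITICAL.
    """
    if not states_by_host:
        return UNKNOWN, "no hosts reported"
    failing = sorted(host for host, state in states_by_host.items() if state != OK)
    if len(failing) >= crit_at:
        state = CRITICAL
    elif len(failing) >= warn_at:
        state = WARNING
    else:
        state = OK
    summary = "%d of %d hosts not OK" % (len(failing), len(states_by_host))
    if failing:
        summary += ": " + ", ".join(failing)
    return state, summary

# Hypothetical hosts: two out of three are critical, so we alert once.
print(aggregate({"thingelk01": OK, "thingelk02": CRITICAL, "thingelk03": CRITICAL}))
```

This is why the underlying services can have notification_period set to never: only the aggregate result needs to notify anyone.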
Pynag / Find all alerts
-
Pynag manages the configuration of Nagios
- Very useful for navigating the spaghetti of Nagios config, for instance if you want to find all the services attached to a particular host
-
Type pynag list --examples to see some sample pynag queries
-
It’s possible to use pynag to query Nagios and get a list of all alerts
-
Spiros put a query together for me: see Acme / scripts / FindAllAlerts or ask Spiros
-
A sample pynag query: /usr/bin/pynag livestatus --get services --columns "state host_name description" | grep thingelk | grep 'Kibana Query Interface'
- This gets Nagios services (a service is an alert definition) with the stated 3 columns and then greps for thingelk hosts and the 'Kibana Query Interface' service
- Here’s another one: pynag livestatus --get services --columns "state host_name description check_command_expanded" | grep ^2 | grep 'Disk Space' | grep thingelk
- Because state is the first column, grep ^2 keeps only the services in state 2 (critical)
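
If the grep chain gets unwieldy, the same filtering can be done in a few lines of Python. This is only a sketch of the filtering step: the sample lines are made up, and the real column layout of pynag output may differ.

```python
def filter_lines(lines, starts_with="", contains=()):
    """Keep lines that start with a prefix and contain every given substring.

    Equivalent to chaining `grep ^PREFIX | grep A | grep B` on plain text.
    """
    return [
        line
        for line in lines
        if line.startswith(starts_with) and all(text in line for text in contains)
    ]

# Hypothetical output lines, invented for illustration only.
sample = [
    "2 thingelk09 Disk Space on /root",
    "0 thingelk09 Kibana Query Interface",
    "2 otherhost01 Disk Space on /root",
]

critical_disk = filter_lines(sample, starts_with="2", contains=("Disk Space", "thingelk"))
print(critical_disk)
```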
Disabling notifications in Nagios
- If you need to make an alert stop shouting at you, you can disable its notifications via the Nagios web UI: find the alert and click the disable button
Hosts
-
Each alert will specify a host or a host group
-
Hosts:
-
Hosts can be configured with an IP address as an individual host
-
But sometimes that IP address might be 127.0.0.1, eg the base host gcp-virtual-host, which is a virtual host (nothing to do with VMs)
-
A host might have register set to 0, which means it is a template: it is abstract and can’t be used directly
-
If an alert uses a host derived from one with register=0, this means you will see any alerts associated with that host grouped together in the Nagios UI
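
A rough sketch of how a definition picks up fields from a template through “use”. The host names and field values are hypothetical, and this only handles a single parent per definition, whereas Nagios itself supports more. The point it shows is that local values win over inherited ones, and that name and register belong to the template and are not passed down.

```python
NOT_INHERITED = {"name", "register", "use"}  # these describe the template itself

def resolve(key, definitions):
    """Return the effective fields of a definition, following its 'use' chain."""
    definition = definitions[key]
    merged = {}
    parent = definition.get("use")
    if parent:
        inherited = resolve(parent, definitions)
        merged.update({k: v for k, v in inherited.items() if k not in NOT_INHERITED})
    # Local values override anything that came from the template.
    merged.update({k: v for k, v in definition.items() if k != "use"})
    return merged

# Hypothetical definitions: an abstract base host and a real host derived from it.
definitions = {
    "base-host": {"name": "base-host", "register": "0",
                  "check_interval": "5", "max_check_attempts": "3"},
    "thingelk09": {"use": "base-host", "host_name": "thingelk09",
                   "address": "10.0.0.9", "check_interval": "10"},
}

print(resolve("thingelk09", definitions))
```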
-
Host groups
-
Host groups represent groups of hosts
-
This just means that you can group together host definitions into a host group
-
The alert will be run on every host in the group
-
One host might be in many groups
-
$HOSTADDRESS$ is a Nagios variable (macro) that might be referred to by a command
-
This means the command will access every host in the group, and each one will be an actual host with an IP address
-
We are writing the services, so it’s up to us how the service relates to the host, eg by referring to $HOSTADDRESS$
-
If a service gets run on a host group, that means the command is executed several times, once for every host in the group.
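
The “once for every host in the group” behaviour can be sketched like this. It is an illustration only: Nagios does the macro substitution itself, and the host names, addresses and command line below are hypothetical.

```python
def expand_for_group(command_line, hostgroup):
    """Return one command per host, with $HOSTADDRESS$ replaced by its address.

    hostgroup maps a host name to that host's address.
    """
    return {
        host: command_line.replace("$HOSTADDRESS$", address)
        for host, address in hostgroup.items()
    }

# Hypothetical host group and command line.
group = {"thingelk01": "10.0.0.11", "thingelk02": "10.0.0.12"}
commands = expand_for_group("/usr/nagios/libexec/check_http -I $HOSTADDRESS$ -u /", group)

for host, command in commands.items():
    print(host, "->", command)
```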
NRPE
-
A plugin that allows you to run scripts remotely
-
NRPE stands for Nagios Remote Plugin Executor
-
The NRPE script (check_nrpe) runs locally on the Nagios box; it is the thing that connects to a particular host and asks it to run a script on that host
-
You can usually tell an NRPE service / alert because it uses the nrpe-service template, but a service could use that template without being NRPE. The thing which really tells you is the command itself, which is something like check_nrpe
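
For orientation, this is roughly the shape of a check_nrpe call, built in Python so it can be printed and inspected before running. -H (the host to contact) and -c (the command name defined on that host) are standard check_nrpe options. The remote command name check_disk_root is hypothetical, and the plugin directory is assumed to be the same one used for check_http elsewhere on this page.

```python
import shlex

def nrpe_command(host, remote_command, plugin_dir="/usr/nagios/libexec"):
    """Build the argv for asking the NRPE daemon on `host` to run a check."""
    return [plugin_dir + "/check_nrpe", "-H", host, "-c", remote_command]

# check_disk_root is a made-up command name; use one defined on the target host.
argv = nrpe_command("thingelk09.ab5.acme.com", "check_disk_root")
print(shlex.join(argv))
```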
NSCA
-
In general NSCA checks are passive checks that are executed not on Nagios but on the monitored hosts
-
The NRPE checks are triggered by Nagios via the NRPE daemon on the hosts
-
The NSCA checks are scheduled on the remote hosts and report their outcome to the Nagios hosts via the NSCA daemon
-
The NSCA daemon runs on the Nagios hosts, with an open port which it uses to receive data
-
We have had problems receiving data, where a netcat command (nc) to that port failed
-
So with NSCA you can have a check command run by a crontab that informs the Nagios host
-
So Nagios can trigger an active check with NRPE, or can just sit and wait for the results coming from NSCA on the remote hosts
-
Those NSCA checks are triggered independently from Nagios (via cronjob)
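
A sketch of what a cron-driven check hands over to NSCA. The send_nsca client reads service results from standard input as tab-separated host, service description, return code and output, one result per line. The function name and example values are hypothetical; how our own scripts build this line is not shown here.

```python
def passive_result(host, service, return_code, output):
    """Format one service check result in the layout send_nsca reads on stdin."""
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return code must be 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN)")
    # The service must match the service_description in the Nagios config exactly.
    return "\t".join([host, service, str(return_code), output]) + "\n"

# Hypothetical values for illustration.
line = passive_result("thingelk09", "Disk Space on /root", 2, "CRITICAL - 97% used")
print(repr(line))
```

The line would then be piped into send_nsca, pointed at a Nagios host with -H and at its config file with -c.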
Commands
-
Commands folder
-
Shared subfolder – commands.cfg contains most of the commands
-
These are bespoke – written by us
-
You’ll see command_line is the actual command being run – often refers to a Ruby script in the libexec folder (use find files to find it)
-
Each command will tell you whether it can result in ok, warn, critical etc – then the service will define whether to alert for each possible status
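
To see how a service’s check_command lines up with a command’s command_line, here is a small sketch of the substitution: the part before the first ! picks the command, and the remaining parts fill $ARG1$, $ARG2$ and so on. Nagios does this itself; the command definition below is hypothetical and not taken from commands.cfg.

```python
def expand_check_command(check_command, command_lines):
    """Turn 'name!arg1!arg2' into the command line that would be run."""
    name, *args = check_command.split("!")
    line = command_lines[name]  # the command_line of the named command
    for position, value in enumerate(args, start=1):
        line = line.replace("$ARG%d$" % position, value)
    return line

# Hypothetical command definition, invented for illustration.
COMMAND_LINES = {
    "check_disk_space": "/usr/nagios/libexec/check_disk_space.rb -w $ARG1$ -c $ARG2$",
}

print(expand_check_command("check_disk_space!80!90", COMMAND_LINES))
```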
-
Libexec folder
-
Ruby scripts
-
Commands
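
Whatever language a check script is written in, it reports its status through its exit code: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN, plus a line of text on standard output. Our scripts are Ruby; the sketch below uses Python only to show the convention, and the thresholds and label are hypothetical.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}

def check_threshold(value, warn, crit):
    """Map a measured value onto a Nagios state using two thresholds."""
    if value is None:
        return UNKNOWN  # we could not measure anything
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK

def report(label, value, warn, crit):
    """Return (exit code, status line) the way a plugin would report them."""
    state = check_threshold(value, warn, crit)
    return state, f"{LABELS[state]} - {label} is {value}"

# A real plugin would print the line and then exit with the state code.
state, status_line = report("disk used %", 85, warn=80, crit=90)
print(state, status_line)
```

Calling check_threshold with different values is also a cheap way to confirm the thresholds behave as expected before pushing a change.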
-
Some commands
-
check_http
-
Comes with Nagios; details here: https://linux.101hacks.com/unix/check-http/
-
You have to run it from one of the Nagios hosts (nagios01.c.acme-nagios-prod.internal and nagios02.c.acme-nagios-prod.internal) and include the path to the command: /usr/nagios/libexec/check_http -I thingelk09.ab5.acme.com -H thingelk.acmecorp.com -u / -e "302" -S -N
-
…or if you run it from within /usr/nagios/libexec/ you have to prefix it with ./ like this: ./check_http
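
When trying a command by hand it helps to see the state it maps to, not just the text it prints. A minimal sketch, assuming Python is available on the Nagios host: run the plugin, read its exit code, and translate it. The arguments are copied from the check_http example; run_plugin is a hypothetical helper, not part of Nagios.

```python
import subprocess

STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}

def run_plugin(argv, timeout=30):
    """Run a Nagios plugin and return (state name, first line of its output)."""
    result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    lines = result.stdout.splitlines()
    first_line = lines[0] if lines else ""
    # Any exit code outside 0-3 is treated as UNKNOWN.
    return STATES.get(result.returncode, "UNKNOWN"), first_line

# Same arguments as the manual example; only works on a Nagios host.
CHECK_HTTP = [
    "/usr/nagios/libexec/check_http",
    "-I", "thingelk09.ab5.acme.com",
    "-H", "thingelk.acmecorp.com",
    "-u", "/", "-e", "302", "-S", "-N",
]

# Uncomment on a Nagios host:
# print(run_plugin(CHECK_HTTP))
```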
Nagios UI
-
Host name from service – you can search on this on the left
-
Be aware this might not be what you think
-
Eg alerts on ELK.ab5 might actually be checking both ab2 and ab5
-
If you click the name of an alert, it takes you through to an individual screen for that alert with service commands on the right
- For instance, to check the alert straight away, click reschedule to run the check again and see if it’s still in an alert status
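
For reference, rescheduling a check corresponds to a Nagios external command, SCHEDULE_FORCED_SVC_CHECK, which is written as a single line to the Nagios command file. The sketch below only formats that line; the helper name is hypothetical, the location of the command file depends on the install, and whether we ever want to do this outside the UI is a separate question.

```python
import time

def forced_service_check(host, service, when=None):
    """Format a SCHEDULE_FORCED_SVC_CHECK external command line.

    `when` is a Unix timestamp; it defaults to now, meaning check immediately.
    """
    timestamp = int(time.time()) if when is None else int(when)
    return "[%d] SCHEDULE_FORCED_SVC_CHECK;%s;%s;%d\n" % (
        timestamp, host, service, timestamp
    )

# Host and service names must match the Nagios config exactly.
print(forced_service_check("thingelk09", "Kibana Query Interface", when=1700000000), end="")
```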
SkipDeploy
-
If a particular host is causing problems, you can use nagios.skip_deploy as a node attribute to make deploys skip the host entirely
-
Same process as for nonagios (see below)
Pdagent
- There’s a piece of software called pdagent that interfaces between Nagios and PagerDuty – this is probably the thing that creates the acknowledgement comments at the bottom of the Nagios alert window
Testing / deploying new alerts:
-
The TryNagios tool allows you to check syntax
-
Just run try-nagios on the command line in your VM
-
Testing:
-
Start out with a low urgency alert to see if it’s working
-
Before you push, you can check the command to see it does what you expect when you change values
-
Basically though, you are testing in prod