Contents of this page:
- Misc
- onboarding session on Alerting / Nagios
- Services / alerts
- Pynag / Find all alerts
- Disabling notifications in Nagios
- Hosts
- Commands
- Nagios UI
- SkipDeploy
- Pdagent
- Testing / deploying new alerts:
Misc
Nagios is used for alerting
Set up a series of alerts based on all of our systems
Typically notify when something goes wrong, and also when something resolves
Integrations with Slack and PagerDuty
Configure settings in PagerDuty so alerts result in texts and phone calls
High urgency and low urgency alerts
Alerts contain links to runbooks
Try to strike a balance between good coverage and excessive noise
Alerts when thresholds are crossed
onboarding session on Alerting / Nagios
Nagios asks the NRPE daemon, which is installed on the host, to run a check and send the results back to Nagios - the advantage being that you only need one port open
- So effectively Nagios starts the conversation by asking for the data
With NSCA, the check runs on the host - often scheduled via cron - and sends its results back to Nagios (the NSCA daemon itself listens on the Nagios hosts)
- The difference being that with NSCA, the check runs on its own schedule without being asked
Active vs passive checks (which is how they’re defined in Nagios):
active = NRPE - active from Nagios
passive = NSCA - no action on Nagios’ side
Postfix and PDAgent are both running on Nagios hosts
Postfix handles SMTP, sending the emails that alerts generate
- In GCP we can’t do SMTP directly, so email can’t be sent from within GCP infrastructure; we use MailJet instead
PDAgent sending pages to PagerDuty
- pings acmeweb and if response is too slow or nonexistent then after a few minutes it triggers an LSE (Large Scale Event)
Services / alerts
Services folder (nagios/etc/services) used to configure alerts
Services are basically alerts
!!! Services are named in their service_description field
!!! They’re really names not descriptions, even though they contain spaces!
check_interval – how often it’s checked – default unit is minutes
Not all fields are mandatory
See Nagios object definitions:
notification_options: the circumstances under which we send an alert: “w,c,r” = warning, critical, recovery (resolved) – means we will be alerted for all three states
“use generic_service” pulls in a template from the template folder – a lot of the required fields are specified there.
Retries can be specified, controlling whether Nagios rechecks the status – an alert can resolve itself if the data changes
check_command: the first part is the command name, then the arguments follow, separated by !
So basically we’re running a script on the Nagios box
But sometimes we want to execute scripts on the relevant host – see NRPE below
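To make the fields above concrete, here is a rough sketch in Python (used only for illustration, it is not part of our Nagios setup). The service definition is invented: check_disk_example, its arguments and the interval are assumptions, and only the directive names come from the notes above. It shows how check_command splits on ! into a command name plus $ARG1$, $ARG2$ and so on, and how notification_options decodes.

```python
# Hypothetical service definition - every name and value below is invented for illustration.
SAMPLE_DEFINITION = """
define service {
    use                   generic_service
    service_description   Disk Space on /root
    hostgroup_name        SecThing_role
    check_command         check_disk_example!20%!10%!/root
    check_interval        5
    notification_options  w,c,r
}
"""

# Letters accepted by the notification_options directive of a service.
NOTIFICATION_STATES = {
    "w": "warning",
    "u": "unknown",
    "c": "critical",
    "r": "recovery",
    "f": "flapping",
    "s": "scheduled downtime",
    "n": "none",
}


def parse_definition(text):
    """Return the directives of a single 'define ... { }' block as a dict."""
    fields = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blanks, comments and the opening / closing lines of the block.
        if not line or line.startswith(("#", "define", "}")):
            continue
        parts = line.split(None, 1)  # directive name, then the rest of the line
        if len(parts) == 2:
            fields[parts[0]] = parts[1].strip()
    return fields


def split_check_command(check_command):
    """Split 'name!a!b' into the command name and its $ARGn$ macros."""
    name, *args = check_command.split("!")
    return name, {f"$ARG{i}$": arg for i, arg in enumerate(args, start=1)}


def decode_notification_options(options):
    """Turn 'w,c,r' into the list of states that trigger a notification."""
    return [NOTIFICATION_STATES[opt.strip()] for opt in options.split(",")]


fields = parse_definition(SAMPLE_DEFINITION)
command_name, command_args = split_check_command(fields["check_command"])
print(fields["service_description"], command_name, command_args)
print(decode_notification_options(fields["notification_options"]))
```

The command definition then refers to $ARG1$, $ARG2$ etc in its command_line, which is how the values after each ! reach the script.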
Some of the Nagios config is generated from Chef automatically – things like lists of all roles and hosts
So for instance you might see hostgroup_name of SecThing_role – this tells you there is a role called SecThing, and the automation has appended _role to the name
check_aggregate will specify another service, the idea being that you aggregate across several hosts and only alert once per group of hosts, rather than alerting for every single host
- This does mean that you might have notification_period set to never on the services being aggregated, but that doesn’t mean they’re not alerting
- For instance you might have something which calls check_aggregate, passing an argument called “Disk Space on /root” which refers to another service called “Disk Space on /root”
notification_period: can be set to never – see check_aggregate above
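The aggregation idea can be sketched roughly as below. This is not how our check_aggregate is actually implemented (that script is bespoke and not reproduced here). The function name, the thresholds and the host names are all assumptions, and Python is used only for illustration. It just shows the principle: look at the state of one underlying service on every host and produce a single state for the whole group.

```python
# Standard Nagios state codes.
OK, WARNING, CRITICAL = 0, 1, 2


def aggregate_state(host_states, warn_count=1, crit_count=2):
    """Collapse per-host states for one service into a single state.

    host_states maps host name -> state code of the underlying service.
    warn_count / crit_count are hypothetical thresholds: how many hosts
    must be non-OK before the aggregate goes WARNING / CRITICAL.
    """
    failing = sorted(host for host, state in host_states.items() if state != OK)
    if len(failing) >= crit_count:
        state = CRITICAL
    elif len(failing) >= warn_count:
        state = WARNING
    else:
        state = OK
    return state, failing


# Invented example: "Disk Space on /root" as seen on three hosts.
states = {"thingelk01": OK, "thingelk02": CRITICAL, "thingelk03": WARNING}
print(aggregate_state(states))
```

Only the aggregate service notifies, which is why the underlying per-host services can have notification_period set to never and still matter.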
Pynag / Find all alerts
Pynag manages the configuration of Nagios
- Very useful for navigating the spaghetti of Nagios config - for instance, if you want to find all the services attached to a particular host
Run pynag list --examples to see some sample pynag queries
It’s possible to use pynag to query Nagios and get a list of all alerts
Spiros put a query together for me - see Acme / scripts / FindAllAlerts or ask Spiros
A sample pynag query: /usr/bin/pynag livestatus --get services --columns "state host_name description" | grep thingelk | grep 'Kibana Query Interface'
- This gets Nagios services (a service is an alert definition) with the stated 3 columns and then greps for thingelk hosts and the ‘Kibana Query Interface’ service
- Here’s another one, which keeps only services in state 2 (critical): pynag livestatus --get services --columns "state host_name description check_command_expanded" | grep ^2 | grep 'Disk Space' | grep thingelk
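If you save the pynag output to a file or capture it in a script, the grep chain above can be mirrored in a few lines of Python. This is only a sketch. The sample lines are invented, and the exact column separator pynag prints is an assumption, which is why the filter sticks to the same plain "starts with" and "contains" matching that grep does. State codes are the usual Nagios ones: 0 OK, 1 warning, 2 critical, 3 unknown.

```python
def grep_services(lines, state=None, contains=()):
    """Keep lines that start with `state` (like grep ^2) and contain every substring."""
    kept = []
    for line in lines:
        if state is not None and not line.startswith(str(state)):
            continue
        if all(needle in line for needle in contains):
            kept.append(line)
    return kept


# Invented sample output; real output comes from 'pynag livestatus --get services ...'.
SAMPLE_OUTPUT = [
    "0 thingelk01 Kibana Query Interface",
    "2 thingelk02 Disk Space on /data",
    "2 otherhost01 Disk Space on /root",
    "1 thingelk03 Disk Space on /root",
]

for line in grep_services(SAMPLE_OUTPUT, state=2, contains=("Disk Space", "thingelk")):
    print(line)
```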
Disabling notifications in Nagios
- If you need to make an alert stop shouting at you, you can disable its notifications via the Nagios web UI - just find the alert and click the disable button
Hosts
Each alert will specify a host or a host group
Hosts can be configured with an IP address as an individual host
But sometimes that IP address might be, eg the base host gcp-virtual-host, which is a virtual host (nothing to do with VMs)
A host might have register = 0, which means it is abstract and can’t be used directly
If an alert uses a host derived from one with register=0, this means you will see any alerts associated with that host grouped together in the Nagios UI
Host groups
Host groups represent groups of hosts
This just means that you can group together host definitions into a host group
The alert will be run on every host in the group
One host might be in many groups
$HOSTADDRESS$ is a Nagios macro that a command might refer to
This means the command will access every host in the group in turn, and each one will be an actual host with an IP address
We are writing the services, so it’s up to us how the service relates to the host, eg by referring to $HOSTADDRESS$
If a service gets run on a host group, that means the command is executed several times, once for every host in the group.
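A small sketch of what "once for every host in the group" means in practice. Nagios does this expansion itself, so the Python below is purely illustrative. The host names, IP addresses and check_example command are made up; $HOSTADDRESS$ and $HOSTNAME$ are standard Nagios macros.

```python
# Hypothetical host group and host addresses, for illustration only.
HOSTGROUPS = {"SecThing_role": ["sec01", "sec02"]}
HOST_ADDRESSES = {"sec01": "10.0.0.11", "sec02": "10.0.0.12"}


def expand_command(command_line, hostgroup):
    """Return one command line per host in the group, with host macros filled in."""
    expanded = []
    for host in HOSTGROUPS[hostgroup]:
        line = command_line.replace("$HOSTADDRESS$", HOST_ADDRESSES[host])
        line = line.replace("$HOSTNAME$", host)
        expanded.append(line)
    return expanded


# check_example is a made-up command; the libexec path is the one used on our Nagios hosts.
for line in expand_command("/usr/nagios/libexec/check_example -H $HOSTADDRESS$", "SecThing_role"):
    print(line)
```

Each resulting line is a separate check with its own state, which is why one service attached to a host group shows up once per host in the UI.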
NRPE (Nagios Remote Plugin Execution) is a plugin that allows you to run scripts remotely
The NRPE script runs locally on the Nagios box, and that’s the thing that connects to a particular host and runs a script on that host
You can usually tell an NRPE service / alert because it will use the nrpe-service template – but a service could use that template without being NRPE – the thing which really tells you is the command itself, which is something like check_nrpe
In general, NSCA checks are passive checks that are executed not on Nagios but on the monitored hosts
NRPE checks are triggered by Nagios and run by the NRPE daemon on the monitored hosts
NSCA checks are scheduled on the remote hosts and report their outcome to the Nagios hosts via the NSCA daemon
The NSCA daemon runs on the Nagios hosts, with an open port which it uses to receive data
We have had problems receiving data, where a netcat command (nc) to that port failed
So with NSCA you can have a check command run from a crontab that reports to the Nagios host
In short, Nagios can either trigger an active check with NRPE, or just sit and wait for the passive results coming from NSCA checks on the remote hosts, which are triggered independently from Nagios (via cron job)
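For the passive side, the piece a cron job has to produce is one line per result in the format the send_nsca client reads on standard input: host name, service description, return code and plugin output, separated by tabs. A minimal sketch, assuming Python for illustration; the host, service and output text are invented. The service description has to match the service_description of a service defined in Nagios.

```python
VALID_CODES = (0, 1, 2, 3)  # OK, WARNING, CRITICAL, UNKNOWN


def passive_result_line(host, service_description, return_code, output):
    """Build one tab-delimited passive check result for send_nsca."""
    if return_code not in VALID_CODES:
        raise ValueError(f"return code must be one of {VALID_CODES}")
    # Tabs and newlines would break the delimited format, so flatten them.
    clean_output = " ".join(output.split())
    return "\t".join([host, service_description, str(return_code), clean_output]) + "\n"


# Invented example result.
line = passive_result_line("thingelk01", "Disk Space on /root", 2, "DISK CRITICAL - 3% free")
print(repr(line))
# A cron job would pipe this line into send_nsca, pointing it at a Nagios host
# (for example with its -H and -c options); the exact invocation we use is not shown here.
```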
Commands folder
Shared subfolder – commands.cfg contains most of the commands
These are bespoke – written by us
You’ll see command_line is the actual command being run – often refers to a Ruby script in the libexec folder (use find files to find it)
Each command will tell you whether it can result in ok, warn, critical etc – then the service will define whether to alert for each possible status
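Whatever language a check script is written in, the contract with Nagios is the same: print one line of status text and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). Our own scripts are Ruby, so the Python below is only a sketch of that contract. The argument order and the "higher is worse" thresholds are assumptions, not taken from any of our commands.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Map a measurement to a state, assuming higher values are worse."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def run_check(value, warn, crit):
    """Return (state, status line) for one measurement."""
    state = evaluate(value, warn, crit)
    return state, f"{LABELS[state]} - value={value} (warn={warn}, crit={crit})"


def main(argv):
    """Parse 'script VALUE WARN CRIT', print the status line, return the exit code."""
    try:
        value, warn, crit = (float(arg) for arg in argv[1:4])
    except ValueError:
        # Missing or non-numeric arguments: report UNKNOWN rather than crash.
        print("UNKNOWN - usage: check_example VALUE WARN CRIT")
        return UNKNOWN
    state, message = run_check(value, warn, crit)
    print(message)
    return state


# A real plugin would finish with: sys.exit(main(sys.argv))
print(main(["check_example", "85", "80", "90"]))
```

The service's notification_options then decides which of these states actually result in a notification.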
Libexec folder
Ruby scripts
Some commands
check_http - comes with Nagios - details here:
To run it, you have to be on one of the Nagios hosts (nagios01.c.acme-nagios-prod.internal and nagios02.c.acme-nagios-prod.internal) and give the full path to the command: /usr/nagios/libexec/check_http -I -H -u / -e "302" -S -N
…or if you run it from within /usr/nagios/libexec/ you have to prefix it with ./ like this: ./check_http
Nagios UI
Host name from service – you can search on this on the left
Be aware this might not be what you think
Eg alerts on ELK.ab5 might actually be checking both ab2 and ab5
If you click the name of an alert, it takes you through to an individual screen for that alert with service commands on the right
- For instance, to check the alert straight away, click reschedule to run the check again and see if it’s still in an alert status
SkipDeploy
If a particular host is causing problems, you can use nagios.skip_deploy as a node attribute to make deploys skip the host entirely
Same process as for nonagios (see below)
Pdagent
There’s a piece of software called pdagent that interfaces between Nagios and PagerDuty – this is probably the thing that creates the acknowledgement comments at the bottom of the Nagios alert window
Testing / deploying new alerts:
The TryNagios tool allows you to check syntax
Just run try-nagios on the command line in your VM
Start out with a low urgency alert to see if it’s working
Before you push, you can check the command to see it does what you expect when you change values
Basically though, you are testing in prod