Misc

  • Used for alerting

  • Set up a series of alerts based on all of our systems

  • Typically notify when something goes wrong, and also when something resolves

  • Integrations with Slack and PagerDuty

  • Configure settings in PagerDuty so alerts result in texts and phone calls

  • High urgency and low urgency alerts

  • Alerts contain links to runbooks

  • Try to strike a balance between good coverage and excessive noise

  • Alerts when thresholds are crossed

Onboarding session on Alerting / Nagios

  • NRPE: Nagios asks the NRPE daemon, which is installed on the host, to run a check and then send the results back to Nagios. The advantage is that you only need one port open

    • So effectively Nagios starts the conversation by ASKING for the data
  • NSCA: the check runs on the host on a schedule (often via cron) and sends its data back to Nagios

    • The difference is that with NSCA the check runs automatically, without being asked

    • Active vs passive checks (which is how they’re defined in Nagios):

      • active = NRPE: the check is initiated from Nagios

      • passive = NSCA: no action on Nagios’ side

  • Postfix and PDAgent are both running on Nagios hosts

    • Postfix handles SMTP, sending the emails that alerts generate

      • We can’t send email over SMTP from within GCP infrastructure, so there we use MailJet
    • PDAgent sending pages to PagerDuty

  • Catchpoint

    • Pings acmeweb; if the response is too slow or nonexistent, then after a few minutes it triggers an LSE (Large Scale Event)

Services / alerts

  • Services folder (nagios/etc/services) used to configure alerts

  • Services are basically alerts

    • !!! Services are named in their service_description field

    • !!! They’re really names not descriptions, even though they contain spaces!

    • check_interval: how often it’s checked; the default unit is minutes

    • Not all fields are mandatory

    • See Nagios object definitions: https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/objectdefinitions.html#service

    • notification_options: the circumstances under which we send an alert. “w,c,r” = warning, critical, recovery (i.e. resolved), which means we will be alerted for all three states

    • use generic_service pulls in a template from the template folder; a lot of the required fields are specified there.

    • Retries can be specified, i.e. whether it rechecks the status; the alert can resolve itself if the data changes

    • check_command: the first part is the command name, then the arguments follow, separated by !

    • So basically we’re running a script on the Nagios box

    • But sometimes we want to execute scripts on the relevant host – see NRPE below

    • Some of the Nagios config is generated from Chef automatically – things like lists of all roles and hosts

    • So for instance you might see hostgroup_name of SecThing_role – this tells you there is a role called SecThing, and the automation has appended _role to the name

    • check_aggregate will specify another service, the idea being that you aggregate across several hosts and only alert once per group of hosts, rather than alerting for every single host

      • This does mean that you might have notification_period set to never on the services being aggregated, but that doesn’t mean they’re not alerting
      • For instance you might have something which calls check_aggregate, passing an argument called “Disk Space on /root” which refers to another service called “Disk Space on /root”
    • notification_period

      • Can be set to never – see check_aggregate
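
To make the check_command and notification_options formats above concrete, here is a minimal Python sketch. The function names are made up for illustration, and the letter table is the standard Nagios set for services as far as I know; only the ! separator and the w/c/r example come from these notes.

```python
# Hypothetical helpers, not part of our Nagios repo.
# Standard Nagios letters for a service's notification_options.
NOTIFICATION_STATES = {
    "w": "warning",
    "u": "unknown",
    "c": "critical",
    "r": "recovery",
    "f": "flapping",
    "s": "scheduled downtime",
    "n": "none",
}


def split_check_command(check_command):
    """Split 'name!arg1!arg2' into (name, [arg1, arg2])."""
    name, *args = check_command.split("!")
    return name, args


def decode_notification_options(options):
    """Turn 'w,c,r' into the list of states we would be notified about."""
    return [NOTIFICATION_STATES[letter.strip()] for letter in options.split(",")]


# The check_aggregate example from the notes: the argument is the name
# (service_description) of another service.
name, args = split_check_command("check_aggregate!Disk Space on /root")
print(name, args)
print(decode_notification_options("w,c,r"))
```

Note that the argument keeps its spaces, which fits the point above that service_description values are really names.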

Pynag / Find all alerts

  • manages the configuration of Nagios

    • Very useful for navigating the spaghetti of Nagios config, for instance if you want to find all the services attached to a particular host
  • Type pynag list --examples to see some sample pynag queries.

  • It’s possible to use pynag to query Nagios and get a list of all alerts

  • Spiros put a query together for me: see Acme / scripts / FindAllAlerts or ask Spiros

  • A sample pynag query: /usr/bin/pynag livestatus --get services --columns "state host_name description" | grep thingelk | grep 'Kibana Query Interface'

    • This gets Nagios services (a service is an alert definition) with the stated 3 columns and then greps for thingelk hosts and the ‘Kibana Query Interface’ service
  • Here’s another one: pynag livestatus --get services --columns "state host_name description check_command_expanded" | grep ^2 | grep 'Disk Space' | grep thingelk

    • grep ^2 keeps only lines that start with 2, i.e. services whose state is 2 (critical)
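
If the grep chain gets unwieldy, the same filtering can be done in Python. This is only a sketch: it assumes the pynag output is one service per line with the state in the first column (which is what grep ^2 implies), and the sample lines are invented, not real output.

```python
# Invented sample lines in the assumed "state host_name description" layout.
SAMPLE_LINES = [
    "0 thingelk08 Disk Space on /root",
    "2 thingelk09 Disk Space on /root",
    "2 otherhost01 Disk Space on /root",
    "2 thingelk09 Kibana Query Interface",
]


def filter_services(lines, state=None, must_contain=()):
    """Keep lines whose first column equals `state` and that contain every substring."""
    kept = []
    for line in lines:
        columns = line.split()
        if not columns:
            continue  # skip blank lines
        if state is not None and columns[0] != str(state):
            continue
        if all(text in line for text in must_contain):
            kept.append(line)
    return kept


# Equivalent of: grep ^2 | grep 'Disk Space' | grep thingelk
for line in filter_services(SAMPLE_LINES, state=2, must_contain=("Disk Space", "thingelk")):
    print(line)
```

One difference from grep ^2: this compares the whole first column, so a line starting with 20 would not match.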

Disabling notifications in Nagios

  • If you need to make an alert stop shouting at you, you can disable it via the Nagios web UI: just find it and click the disable button

Hosts

  • Each alert will specify a host or a host group

  • Hosts:

    • Hosts can be configured with an IP address as an individual host

    • But sometimes that IP address might be 127.0.0.1, e.g. the base host gcp-virtual-host, which is a virtual host (nothing to do with VMs)

    • A host might have register = 0, which means it is abstract and can’t be used directly

    • If an alert uses a host derived from one with register=0, this means you will see any alerts associated with that host grouped together in the Nagios UI

  • Host groups

    • Host groups represent groups of hosts

    • This just means that you can group together host definitions into a host group

    • The alert will be run on every host in the group

    • One host might be in many groups

    • $HOSTADDRESS$ is a Nagios variable that might be referred to by a command

      • This means the command will access every host in the group, and each one will be an actual host with an IP address

      • We are writing the services, so it’s up to us how the service relates to the host, e.g. by referring to $HOSTADDRESS$

      • If a service gets run on a host group, that means the command is executed several times, once for every host in the group.
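
A rough Python sketch of what "executed once for every host in the group" means in practice. The host names, addresses and command line are invented for illustration, and Nagios does this expansion itself; the sketch only mimics the $HOSTADDRESS$ substitution.

```python
def expand_over_hostgroup(command_line, hosts):
    """Return one concrete command per host, substituting $HOSTADDRESS$.

    `hosts` maps host_name -> IP address (hypothetical data).
    """
    return {
        host_name: command_line.replace("$HOSTADDRESS$", address)
        for host_name, address in hosts.items()
    }


# Hypothetical host group with two members.
hostgroup = {"thingelk08": "10.0.0.8", "thingelk09": "10.0.0.9"}

commands = expand_over_hostgroup("check_http -I $HOSTADDRESS$ -u /", hostgroup)
for host_name, command in commands.items():
    print(host_name, "->", command)
```

A command line that never mentions $HOSTADDRESS$ comes back unchanged for every host, which is the "up to us how the service relates to the host" point.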

NRPE

  • A plugin that allows you to run scripts remotely

  • Nagios Remote Plugin Execution

  • The NRPE script runs locally (on the Nagios box), and that’s the thing that connects to a particular host and runs a script on that host

  • You can usually tell an NRPE service / alert because it uses the nrpe-service template, but it could use that template without being NRPE. The thing which really tells you is the command itself, which is something like check_nrpe
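
To picture the local half of an NRPE check, here is a minimal sketch of the command the Nagios box would run. -H (host) and -c (remote command name) are the usual check_nrpe options; the plugin path, the address and the remote command name check_disk_root are assumptions for illustration.

```python
import shlex


def build_check_nrpe_argv(host_address, remote_command,
                          plugin_path="/usr/nagios/libexec/check_nrpe"):
    """Build the argv for the local check_nrpe call.

    check_nrpe then asks the NRPE daemon on `host_address` to run the
    command that the host has configured under the name `remote_command`.
    The plugin_path default is an assumption.
    """
    return [plugin_path, "-H", host_address, "-c", remote_command]


argv = build_check_nrpe_argv("10.0.0.9", "check_disk_root")
# Print the command as it would be typed in a shell; nothing is executed here.
print(" ".join(shlex.quote(part) for part in argv))
```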

NSCA

  • In general, NSCA checks are passive checks that are executed not on Nagios but on the monitored hosts

  • The NRPE checks are triggered by Nagios via the NRPE daemon on the hosts

  • The NSCA checks are scheduled on the remote hosts and report the outcome to the Nagios hosts via the NSCA daemon

    • The NSCA daemon runs on the Nagios hosts, with an open port which it uses to receive data

    • We have had problems receiving data, where a netcat command (nc) to that port failed

  • So with NSCA you can have a check command run from a crontab that informs the Nagios host

  • So Nagios can trigger an active check with NRPE, or can just sit and wait for the results coming from NSCA on the remote hosts, which are triggered independently of Nagios (via cron job)
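
For a feel of what a passive result looks like on the wire, here is a small sketch of the line a cron job would hand to the NSCA client. It assumes the standard send_nsca input format (host, service, return code and output, tab-separated, one result per line); whether our cron jobs use send_nsca directly is an assumption, and the host and service names are just examples.

```python
VALID_RETURN_CODES = (0, 1, 2, 3)  # OK, WARNING, CRITICAL, UNKNOWN


def format_nsca_result(host_name, service_description, return_code, plugin_output):
    """Format one passive service check result in the tab-separated send_nsca layout."""
    if return_code not in VALID_RETURN_CODES:
        raise ValueError(f"return_code must be one of {VALID_RETURN_CODES}")
    return f"{host_name}\t{service_description}\t{return_code}\t{plugin_output}\n"


# The service_description must match the name of the service defined in Nagios.
line = format_nsca_result("thingelk09", "Disk Space on /root", 2, "CRITICAL - 97% used")
print(repr(line))
```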

Commands

  • Commands folder

    • Shared subfolder – commands.cfg contains most of the commands

    • These are bespoke – written by us

    • You’ll see command_line is the actual command being run – often refers to a Ruby script in the libexec folder (use find files to find it)

    • Each command will tell you whether it can result in ok, warn, critical etc – then the service will define whether to alert for each possible status

  • Libexec folder

    • Ruby scripts

    • Commands

  • Some commands

    • check_http

      • Comes with Nagios; details here: https://linux.101hacks.com/unix/check-http/

      • To run it, you have to run it from one of the Nagios hosts (nagios01.c.acme-nagios-prod.internal and nagios02.c.acme-nagios-prod.internal) and add the path to the command: /usr/nagios/libexec/check_http -I thingelk09.ab5.acme.com -H thingelk.acmecorp.com -u / -e "302" -S -N

      • …or if you run it from within /usr/nagios/libexec/ you have to add . like this: ./check_http
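
On the point that each command can result in ok, warn, critical etc: Nagios plugins report their state through the exit code (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN) plus a line of output. Our real commands are mostly Ruby scripts; the sketch below is a hypothetical threshold check in Python, just to show the convention.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Map a measured value to a Nagios state using two thresholds."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def main(argv):
    """Expect three numbers: value, warning threshold, critical threshold."""
    try:
        value, warn, crit = (float(arg) for arg in argv)
    except ValueError:
        # Wrong number of arguments or a non-numeric argument.
        print("UNKNOWN - usage: <value> <warn> <crit>")
        return UNKNOWN
    state = evaluate(value, warn, crit)
    print(f"{STATE_NAMES[state]} - value={value:g} (warn={warn:g}, crit={crit:g})")
    return state


# A real plugin would finish with: sys.exit(main(sys.argv[1:]))
exit_code = main(["95", "80", "90"])
```

The service definition then decides, via notification_options, which of these states actually produce a notification.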

Nagios UI

  • Host name from service – you can search on this on the left

    • Be aware this might not be what you think

    • E.g. alerts on ELK.ab5 might actually be checking both ab2 and ab5

  • If you click the name of an alert, it takes you through to an individual screen for that alert with service commands on the right

    • For instance, to check the alert straight away, click reschedule to run the check again and see if it’s still in an alert status

SkipDeploy

  • If a particular host is causing problems, you can use nagios.skip_deploy as a node attribute to make deploys skip the host entirely

  • Same process as for nonagios (see below)

Pdagent

  • There’s a piece of software called pdagent that interfaces between Nagios and PagerDuty – this is probably the thing that creates the acknowledgement comments at the bottom of the Nagios alert window

Testing / deploying new alerts:

  • The TryNagios tool allows you to check syntax

  • Just run try-nagios on the command line in your VM

  • Testing:

    • Start out with a low urgency alert to see if it’s working

    • Before you push, you can check the command to see it does what you expect when you change values

    • Basically though, you are testing in prod