Super useful real-life sensu checks/alerts for your application -- With Bonus
19 Dec 2014
One of the best things you can do for your application is set up alerts with thresholds for various events/metrics.
Gogobot uses sensu as its primary alert/checks system.
Here’s a diagram of how sensu works (From the sensu website)
I won’t go into too much detail here; there are many moving parts and you can read about them in the sensu documentation. The essence of it is that a server orchestrates checks on clients (servers, nodes), which then send the status back to the server, which handles notifications and more…
NOTE: I encourage you to browse the documentation to get a sense of what a check is, what a handler is, and some basic sensu lingo. From here on I assume very basic knowledge of how sensu is wired up.
The reason for this post
I decided to set up sensu at Gogobot following a talk I heard by @zohararad.
After his talk, we also had a short Skype call about how he’s using it, what he’s checking, how and why.
Often, when you set up a new system, the unclear part is how people are actually using it. This post shows how we use it and sets out a pretty good baseline for how you could use it too.
Finding which failures you want to check and be alerted on
Over time, we added some very useful checks to the system.
The way we added checks was pretty straightforward: we looked at all the production failures we had in the 3-6 months prior to sensu and investigated the causes.
We found a few key failures
- Server ran out of disk space
- Cron jobs did not run on specific time
- Servers not getting the latest deployments
Every single person I talked to over the months that have passed since we added sensu told me that at least one of these happened to them at least once and made life miserable.
Obviously, these aren’t all of our checks, but they definitely make up the base layer of basic checks on each server.
First, I want to mention here that sensu has a very vibrant community around plugins, and there’s a great repository to get you started at sensu/sensu-community-plugins.
I am also slowly open sourcing our custom checks at gogobot/custom-sensu-plugins; feel free to use these as well.
Here are the checks explained
Check and alert if server ran out of disk space
{
  "checks": {
    "check_disk_usage": {
      "command": "check-disk.rb -c 95 -w 90",
      "handlers": [
        "slack"
      ],
      "subscribers": [
        "all"
      ],
      "interval": 30,
      "notification": "Disk Check failed",
      "occurrences": 5
    }
  }
}
This uses the check-disk.rb check from the community plugins; the critical threshold is 95% and the warning threshold is 90%.
The handler for this is slack, which sends us an alert in the Slack chatroom (handlers are also from the community plugins repo).
This check runs every 30 seconds and will only alert if it fails 5 times in a row.
NOTE: You will likely want a slightly longer interval; every 2-5 minutes is perfectly fine.
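To make the thresholds and the "5 times in a row" rule concrete, here is a minimal Ruby sketch. It is not the code of check-disk.rb or of sensu itself; the method names are made up for illustration, and treating a value equal to the threshold as a breach is an assumption. The only thing taken as given is the exit status convention sensu checks use (0 OK, 1 WARNING, 2 CRITICAL).

```ruby
# Sketch only, not the real check-disk.rb.
# Sensu checks report their result through the exit status:
# 0 = OK, 1 = WARNING, 2 = CRITICAL.
OK = 0
WARNING = 1
CRITICAL = 2

# Classify a disk usage percentage against warning/critical thresholds.
# Assumption: a value equal to a threshold counts as breaching it.
def disk_status(usage_pct, warn: 90, crit: 95)
  return CRITICAL if usage_pct >= crit
  return WARNING if usage_pct >= warn
  OK
end

# Hypothetical helper: alert only once the last `occurrences` results
# were all non-OK, i.e. the failure happened that many times in a row.
def should_alert?(statuses, occurrences: 5)
  return false if statuses.size < occurrences
  statuses.last(occurrences).none? { |status| status == OK }
end

# Seconds between the first failing result and the one that alerts.
def seconds_to_alert(interval:, occurrences:)
  interval * (occurrences - 1)
end

puts disk_status(92)                                 # 1 (WARNING)
puts should_alert?([CRITICAL] * 5)                   # true
puts seconds_to_alert(interval: 30, occurrences: 5)  # 120
```

With a 30 second interval and 5 occurrences, the alert arrives about two minutes after the first failing result. At a 5 minute interval the same setting would wait 20 minutes, so it may be worth lowering occurrences when you raise the interval.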
Cron jobs did not run on specific time
Obviously, we have a check that the cron process is running on the machine, but you can find 50 examples for this on the web using the check-procs.rb open source check.
We have dozens of tasks running on a schedule. Those tasks run Ruby code, and that code can fail like any other code, which means that even though cron is running, our infrastructure may still not be in good health.
This is how we solved it:
Our cron tasks run a shell script; that shell script cd’s into the project directory and runs a rake task (for example).
Each of these cron tasks has a file we identify the task with, and we echo the date into that file after the rake task is done. If the file is too old, this means the task did not run or there’s a problem with the code it’s running.
Here’s how this looks
The shell script
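#!/bin/bash
# Abort on the first failing command, so the marker is only written on success
set -e
cd PROJECT_DIR
bundle exec rake sometask:sometask
# Record the completion time; the check watches this file's mtime
date > scoring-monitor-cron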
The check
{
  "checks": {
    "check_website_monitor": {
      "command": "check-mtime.rb -f PROJECT_ROOT/scoring-monitor-cron -c 1500",
      "handlers": [
        "slack"
      ],
      "subscribers": [
        "scoring-system"
      ],
      "interval": 900,
      "notification": "Scoring check did not run",
      "occurrences": 2
    }
  }
}
As you can clearly see here, we check the mtime of that file; if the file did not change within the given threshold, we want an alert.
One more thing to note here is that the subscribers list is different: it’s no longer all servers but a subset of servers running that cron.
We have this check for every cron task that is running on the servers, making sure it all works and services are getting invoked when needed.
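The idea behind the mtime check can be sketched in a few lines of Ruby. This is an illustration, not the code of check-mtime.rb: the method name is hypothetical, and treating a missing marker file as a failure is an assumption on my part.

```ruby
# Sketch only: decide whether a cron marker file is stale.
# `max_age` is in seconds, like the -c 1500 threshold in the check above.
def marker_stale?(path, max_age:, now: Time.now)
  # Assumption: a marker that was never written counts as a failure too.
  return true unless File.exist?(path)

  # Subtracting two Time objects gives the age in seconds.
  (now - File.mtime(path)) > max_age
end

# Example (the path is a placeholder, as in the check definition):
#   marker_stale?("PROJECT_ROOT/scoring-monitor-cron", max_age: 1500)
```

Because the shell script runs with set -e, a failing rake task never refreshes the marker, so a stale file covers both "cron never ran" and "the task crashed".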
Servers not getting the latest deployments
This is likely the most annoying bug you will encounter.
Due to load on the server, we found that sometimes when you deploy to it, unicorn will somehow fail to switch the processes over to the new version.
I struggled a bit with finding how to check for this, but in the end the solution (as always) was pretty simple.
We have an API endpoint that responds with the git revision that’s deployed right now on the site.
This API is pretty useful, and we can also ask our chat robot (based on hubot) to tell us which git revision is running on any environment.
So… How do we check this:
On every server, www.gogobot.com is directed to internal IPs, so when you call www.gogobot.com/some-url from one of the servers you never leave the internal network.
When you hit the Amazon Load Balancer URL, it redirects you to the source (www.gogobot.com), so that’s also not useful.
Enter check-url-content-match.rb from our sensu plugins repo (mentioned above).
{
  "checks": {
    "check_website_monitor": {
      "command": "check-url-content-match.rb -b www.gogobot.com -h AMAZON_LOAD_BALANCER -s 0 -p /api/get_git_sha",
      "handlers": [
        "slack"
      ],
      "subscribers": [
        "fe",
        "be"
      ],
      "interval": 60
    }
  }
}
What this does…
This checks a couple of URLs, one with the hostname and one with the load balancer, passing the hostname as a header (to bypass the internal network).
If the content of those 2 URLs does not match, this means the deployed version on this machine is incorrect or old.
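A rough Ruby sketch of that comparison, using only Net::HTTP from the standard library. This is not the plugin's actual code: the method names, the plain HTTP port, and the timeouts are all assumptions made for illustration.

```ruby
require 'net/http'

# Sketch only: GET `path` from `address`, optionally overriding the Host
# header so a load balancer routes the request as if it were for the site.
def fetch_body(address, path, host_header: nil)
  request = Net::HTTP::Get.new(path)
  request['Host'] = host_header if host_header
  # Assumption: plain HTTP on port 80 with short timeouts.
  Net::HTTP.start(address, 80, open_timeout: 5, read_timeout: 5) do |http|
    http.request(request).body
  end
end

# Two responses describe the same revision when their trimmed bodies are
# equal and non-empty (an empty answer is treated as a mismatch).
def same_revision?(body_a, body_b)
  a = body_a.to_s.strip
  b = body_b.to_s.strip
  !a.empty? && a == b
end

# Example wiring (AMAZON_LOAD_BALANCER is a placeholder, as in the check):
#   local  = fetch_body("www.gogobot.com", "/api/get_git_sha")
#   public = fetch_body("AMAZON_LOAD_BALANCER", "/api/get_git_sha",
#                       host_header: "www.gogobot.com")
#   exit(same_revision?(local, public) ? 0 : 2)
```

Keeping the comparison separate from the HTTP call makes the matching rule easy to test without touching the network.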
Bonus
I’ll give an example here, if you are an engineer working in a team, you will relate to this for sure.
You get an alert via email: disk-check failed on backend-aws-west-2.
Here’s my thought process seeing this:
WHAT?
- Where is that server located?
- Why would the disk get filled up?
- Did an internal system that’s supposed to rotate logs fail?
- What can I delete in order to fix this issue?
- Do I have permission to ssh into this server or do I need to escalate this up?
One of our alerts is from a website-monitor system we have built, it goes to the site and checks if pages are broken and also checks if CSS is not broken or returning 404.
Imagine a backend engineer receives this alert and no one else is awake… what should he do?
Adding more info on alerts
The great thing about the way sensu parses the JSON is that if you add more data to a check, it will be available on the alert itself.
For checks that are not obvious to fix, or where you will likely need more data, we added a wiki page link like this:
{
  "checks": {
    "check_deployment": {
      "wiki": "http://wiki.gogobot.com/failures/deployment-failure",
      "command": "check-url-content-match.rb -b www.gogobot.com -h AMAZON_LOAD_BALANCER -s 0 -p /api/get_git_sha",
      "handlers": [
        "slack"
      ],
      "subscribers": [
        "fe",
        "be"
      ],
      "interval": 60
    }
  }
}
This is the important part here: "wiki": "http://wiki.gogobot.com/failures/deployment-failure"
When you see an alert, you simply click on this link and you can read more about how to fix it, why it most likely broke, and more.
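As a rough illustration of how a handler could surface that extra field, here is a small Ruby sketch. It assumes the event JSON carries the check definition (including custom attributes such as wiki) under the "check" key and the client under "client"; the method name, the message format, and the sample output text are made up and are not what the community slack handler produces.

```ruby
require 'json'

# Sketch only: build an alert line from a sensu event, appending the
# custom "wiki" attribute when the check definition has one.
def alert_message(event)
  check = event.fetch('check', {})
  client = event.fetch('client', {})
  text = "#{check['name']} on #{client['name']}: #{check['output'].to_s.strip}"
  wiki = check['wiki']
  wiki ? "#{text} (wiki: #{wiki})" : text
end

# Hypothetical event, trimmed to the fields used above.
event = JSON.parse(<<~'EVENT')
  {
    "client": { "name": "backend-aws-west-2" },
    "check": {
      "name": "check_deployment",
      "output": "deployed revision does not match",
      "wiki": "http://wiki.gogobot.com/failures/deployment-failure"
    }
  }
EVENT

puts alert_message(event)
```

Checks without a wiki attribute simply produce the shorter message, so the field can be added one check at a time.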
Here’s what this looks like in the dashboard
Summing up
I really only got into a small handful of the checks we have on our system and how we handle the alerts.
Check out the community plugins, there are checks for everything you can think of, from Graphite, Logstash, URL, CPU… everything.