08 Jun 2015
Improving the user experience is a relentless battle: you constantly have to keep pushing in order to give your users the best experience possible.
The user story
As a user I want to see the most popular things around me when I am in a specific location.
When you travel, and you are in a certain location, you want to see things around you.
We started this out a while back with just sorting by distance and filtering out less popular places.
We realized this is not enough. When you travel you are not really thinking in terms of absolute distances; you think in terms of rings or clusters.
You basically say to yourself, I am willing to drive 3-5 minutes, what is the best restaurant I can find in that range?
If you can drive for another minute and find a better one, would you do it?
If you could just go around the block, walk 5 more minutes and walk into the best ice cream place you’ve ever been to, would you walk that extra distance?
You certainly will!
Here’s an illustration of it. (Numbers represent the relevance percentage for you).
Distance Rings Illustration
When most users travel, they are willing to invest that extra time in order to find the most popular places, but it has to be in an acceptable range.
Things to consider
- Everything should be done in SOLR
- It should incorporate seamlessly with our current ranking for places.
Implementation
I wanted this to be a built-in function that SOLR can use natively.
This came down to one of two options:
- Get everything from SOLR and do the sorting in memory (which would violate one of the rules we set going into this).
- Extend SOLR with this function and configure the entire search cluster to use it.
Research
I encountered this post User Based Personalization Engine with Solr that extends SOLR in order to sort results by a personal scoring system (using an API).
After going through and reading the post (and the posts it links to), I had a very good plan in mind.
The SOLR distance sort is defined using the internal geodist
function; you pass that function the field the location is indexed on plus the lat and lng, like so: geodist(location_ll,48.8567,2.3508) asc.
At this point I pretty much decided that I am going to use the same convention, but sort by distanceScore
instead, like so: distanceScore(location_ll,48.8567,2.3508) desc.
The Implementation
Instead of sorting by the distance, I am going to assign a score for each distance range, so all the items in that range will have the same score, resulting in the secondary sort (our internal scoring system) being the tie breaker.
Here’s a basic illustration of it.
Distance Rings Illustration
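To make the idea concrete, here is a minimal Ruby sketch of that scoring logic. The ring boundaries and scores here are made up for illustration; the real values live inside the SOLR plugin described below.

```ruby
# Hypothetical ring boundaries (in km) and the score assigned to each ring.
DISTANCE_RINGS = [
  { max_distance: 0.5, score: 1000 },
  { max_distance: 2.0, score: 500  },
  { max_distance: 5.0, score: 250  },
]

# Every place inside the same ring gets the same distance_score,
# so the secondary sort (our internal popularity score) breaks the tie.
def distance_score(distance_in_km)
  ring = DISTANCE_RINGS.find { |r| distance_in_km <= r[:max_distance] }
  ring ? ring[:score] : 0
end

distance_score(0.2)  # => 1000
distance_score(1.3)  # => 500
distance_score(9.0)  # => 0
```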
So, let’s look at some example results.
{
"docs": [
{
"distance": 0.20637469859399163,
"distance_score": 1000,
"name": "Brenda's French Soul Food"
},
{
"distance": 0.08686129174919746,
"distance_score": 1000,
"name": "Chambers Eat + Drink"
},
{
"distance": 0.1812205524350946,
"distance_score": 1000,
"name": "Lers Ros Thai"
},
{
"distance": 0.06320259257621294,
"distance_score": 1000,
"name": "Saigon Sandwiches"
},
{
"distance": 0.11542846457258274,
"distance_score": 1000,
"name": "Turtle Tower Restaurant"
},
{
"distance": 0.2105972668549029,
"distance_score": 1000,
"name": "Olive"
},
{
"distance": 0.21655948230840996,
"distance_score": 1000,
"name": "Philz Coffee - Golden Gate"
},
{
"distance": 0.13191153597807037,
"distance_score": 1000,
"name": "Pagolac"
},
{
"distance": 0.2152692626334937,
"distance_score": 1000,
"name": "Sai Jai Thai Restaurant"
},
{
"distance": 0.21263741323062255,
"distance_score": 1000,
"name": "Zen Yai Thai"
}
]
}
As you can see from this result set, the items in the same distance cluster get the same distance_score
and are sorted by our internal scoring system.
This gives the user a great sort, sorting by “Popular places around me”, not necessarily just by distance.
Code
After digging through SOLR source code quite a bit, I have found the 2 classes that do the distance calculation and return the result to the user.
I grabbed those 2 classes into a new Java project; then, instead of just returning the distance, I checked which ring/cluster the distance falls into and returned that score to the result set.
After doing that, you need to compile your JAR and make a few configuration changes in SOLR.
Adding the lib folder as another source for classes
<lib dir="./lib" />
In solrconfig.xml, this line is usually commented out; uncommenting it means that {core-name}/lib
is now a directory in which SOLR will search for custom JARs.
Adding a function that you can call
<valueSourceParser name="distanceScore"
class="com.gogobot.DistanceParser" />
This will add a new valueSourceParser
that we can call using distanceScore
, which will be handled by com.gogobot.DistanceParser.
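Once the JAR and the valueSourceParser are in place, calling the function from a client is just another sort parameter. Here is a rough sketch using the rsolr gem; the core URL, the filter and the popularity_score field are placeholders, not necessarily how our app issues the query:

```ruby
require 'rsolr'

solr = RSolr.connect(url: 'http://localhost:8983/solr/places') # placeholder core URL

response = solr.get('select', params: {
  q:    '*:*',
  fq:   'category:restaurant',  # hypothetical filter
  # Primary sort: the ring score; secondary sort: a hypothetical popularity field
  # standing in for our internal ranking.
  sort: 'distanceScore(location_ll,48.8567,2.3508) desc, popularity_score desc',
  rows: 10
})

response['response']['docs'].each { |doc| puts doc['name'] }
```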
Summing up
-
When you get a feature like this as an engineer, you need to leave your comfort zone. Even if your first thought is to do it in memory, you need to look at the bigger picture, think of the entire feature and implement it in the most appropriate language.
-
I had real fun digging through the SOLR source code; it is very different from digging through the Rails source code or any other big Ruby project out there.
-
Finding documentation around adding a valueSourceParser
wasn’t trivial; it seems like it should be easier.
-
I don’t know whether this would be easier with ElasticSearch, but it seems that the documentation is better and the community is more vibrant.
Open source
The code is open source here: gogobot/solr-distance-cluster.
26 May 2015
I am obsessed with automation, and with that comes the notion of asking simple
questions and getting simple answers you can create actions from.
One of the questions I constantly ask myself is “What hasn’t been deployed yet?”.
We deploy multiple times a day and try not to have more than 2-3 commits merged
without being deployed; this way, it’s easy to track on production whether a
commit is causing performance issues or something like that.
Another reason for that is to “give a heads up” to engineers with big features
going into production to check and make sure everything is working as expected.
How do we do it?
We use our own version of Hubot in Slack, we use it for all kinds of things (I blogged about it in the past as well).
So, we just ask gbot show me ENV diff.
Slack command that shows the diff link on Github
It’s pretty simple really: all you need is an HTTP endpoint on your server that returns the git sha, and then you just construct a Github link from it.
I originally thought about doing it using the Github API, but the text returned to Slack was sometimes too long; this way it’s just a click away, and Github does a good job with the compare view IMHO.
Here’s the coffee-script (Redacted version with only a single ENV)
module.exports = (robot) ->
  robot.respond /show me prod diff/i, (msg) ->
    original_env = "prod" # redacted: the full version pulls this from the matched ENV
    msg.http("http://#{YOUR_DOMAIN}/git_sha?rand=#{Math.random()}")
      .get() (err, res, body) ->
        commit_link = "YOUR_REPO_LOCATION/compare/#{body}...master"
        message = "The diff from [#{original_env}] is: #{commit_link}"
        msg.send(message)
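On the application side, the /git_sha endpoint can be trivial. Here is a rough sketch of what it might look like in a Rails app; the controller name and the REVISION-file convention are assumptions for illustration, not our exact code:

```ruby
# config/routes.rb
#   get "/git_sha" => "status#git_sha"

class StatusController < ApplicationController
  # Capistrano-style deploys write the deployed sha into a REVISION file at the app root;
  # falling back to `git rev-parse HEAD` covers other setups.
  def git_sha
    revision_file = Rails.root.join("REVISION")
    sha = File.exist?(revision_file) ? File.read(revision_file).strip : `git rev-parse HEAD`.strip
    render text: sha
  end
end
```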
What are we using it for?
1. Do we need to clear cache for parts of the website?
If we launched a redesign for example, added A/B testing, or any other thing
that might be affected by parts of the website being cached, we need to make
sure we clear the cache after the deployment
2. Do we need to migrate the DB?
We don’t use Rails migrations on production; the database is just too big and
the risk of it bringing down the site is too great, so we have a different
strategy for migrations. If there’s a migration in the code that’s not in
production yet, this is the last chance to make sure we don’t deploy before it
happens.
3. Do we need to push bulk actions?
We often have features that count on other bulk actions being done. For
example, we recently added a feature that will bump places with photos above
others in your search results; this needs a bulk action to completely re-index our search
cluster, which means we have to run it before we deploy this to
production.
4. Do we need configuration changes?
We manage all of our servers with chef, so if you are counting on some
configuration file with a secret or some 3rd party API key being there, we need
to make sure that configuration is converged through all of the cluster.
Commit message dependent on configuration change
Let’s take this piece of code for example: this code counts on a distanceScore
function in SOLR. This is something I wrote in order to get distance-ring scores in SOLR (post coming soon); if it is not deployed to SOLR before this code is in production, it will break.
5. Number of commits
Like I mentioned in the beginning of this post, if we have more than 2-3
commits merged but not yet deployed, we consider this to be a higher risk deployment. So
this check is really there to catch whether we have more than the usual amount of pending changes.
If that happens we usually just go ahead and deploy before merging anything
else into master.
Showing the number of commits since the last deployed version
Last resort
From this post, you might think we just push to master and check after the
fact. That is not true; we code review anything that goes into master
and catch most issues there, but just in case something slips through, we want to
make sure we catch it before we release it.
Counting on git flow
We kinda adopted git flow at Gogobot a few years back, we each work on
feature branches and master is always deployable.
This is why we can adopt this approach: we each work in a black box (that
black box can be deployed to dev/staging servers) while production is being
deployed many times in the process.
Make your own
Even if you are not working on a highly dynamic web application, you can still
make your own like: “Which client has this version”, “Which server has which
release” and more.
If you find yourself needing a couple of clicks that will break your current
flow, automate it.
Since we do everything in the chat room, this is a perfect workflow for us
If you have any comments or questions, leave it under the post here, I’d love
to hear/read it.
07 May 2015
Recently, I had to count specific phrases across all reviews on Gogobot.
This was only a small part of a bigger data science project that actually tags places using the review text users wrote, which is really what you want to know about a place. Knowing that people like you say a place had great breakfast usually means it’s good for you as well, so we bump that place up in your lists.
There’s a lot more you can do with such analysis, your imagination is the limit.
Get Pigging
The internet is basically swamped with “word count” Pig applications that take a text file (or some “big” text files) and count the words in them, but I had to do something different: I needed to count specific phrases we cared about.
I chose Apache Pig running on Amazon Elastic MapReduce; this way I can throw as many machines at it as I want and let it process.
If we simplify the object it looks something like this:
class Review
def place_id
1234
end
def description
"This is my review"
end
end
Obviously, this data usually is stored in the database, so the first thing we need to do is export it and clean out any irregularities that might break pig processing (line breaks for example).
mysql -B -u USERNAME_HERE -pPASSWORD_HERE gogobot -h HOST_HERE -e "YOUR SELECT STATEMENT HERE;" | sed "s/'/\'/;s/\t/\",\"/g;s/^/\"/;s/$/\"/;s/\n//g" > recommendations.csv
tr -c ' a-zA-Z0-9,\n' ' ' < recommendations.csv > clean_recs.csv
split -l 2000 clean_recs.csv
Let’s break this script up a bit and understand the different parts of it.
First the mysql -B
part is exporting the table into a CSV.
The second statement, tr
, cleans out anything that is not a readable character, making sure what we get back is a clean CSV with the parts we want.
Third, in order to make sure we get parallelism in the file load, we split the main file into smaller chunks. I used 2000 here as an example; usually I break it into around 10,000 lines.
After we finish, the file looks like this:
4000000132125, A great place to hang out and watch New Orleans go by Good coffee and cakes
4000000132125, Beignets, duh
4000000132125, the right place for a classic beignet
Basically, it’s recommendation_id, recommendation_text
, perfect!
Getting to it
In version 0.10.0
of pig, they added a way for you to write User Defined Functions (UDF) in almost any language you desire.
I chose jRuby for this, this way, most engineers at Gogobot will find it understandable and readable.
here’s the UDF:
require 'pigudf'
require 'java' # Magic line for JRuby - Java interworking
class JRubyUdf < PigUdf
outputSchema "word:chararray"
def allowed_words
["Breakfast", "Steak Dinner", "Vegan Dish"]
end
def tokenize(words)
clean = allowed_words.map { |x| x if words.scan(/\b#{x}\b/i).length > 0 }.compact.join(";")
end
end
This is obviously a slimmed down version for the sake of example (the real one read from a file and contained a lot more words).
It keeps the words from the allowed_words
array, throws everything else away and only returns the words you care about.
If you just count words, you quickly realize that the most popular word in recommendations is The
or It
, and we obviously could not care less about those words, we are here to learn what our users are doing and why other users should care.
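To make the behavior concrete, here is equivalent logic as plain Ruby, outside of Pig (the sample review text is made up):

```ruby
ALLOWED_WORDS = ["Breakfast", "Steak Dinner", "Vegan Dish"]

# Keep only the phrases we care about, joined with ";" just like the UDF output.
def tokenize(words)
  ALLOWED_WORDS.select { |phrase| words =~ /\b#{phrase}\b/i }.join(";")
end

tokenize("Best steak dinner in town, and the breakfast is great too")
# => "Breakfast;Steak Dinner"
tokenize("The service was slow")
# => ""
```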
Now, let’s get to the Pig script:
register 'gogobot_udf.rb' using org.apache.pig.scripting.jruby.JrubyScriptEngine as Gogobot;
CLEAN_REVIEWS = LOAD 'clean.csv' using PigStorage(',') AS (place_id:chararray, review:chararray);
ONLY_ALLOWED_WORDS = FOREACH CLEAN_REVIEWS generate place_id, Gogobot.tokenize(review) as words;
FLAT_ALLOWED_WORDS = FOREACH ONLY_ALLOWED_WORDS generate place_id, FLATTEN(TOKENIZE(words, ';')) as word;
GROUP_BY_WORD = GROUP FLAT_ALLOWED_WORDS BY (place_id, word);
WITH_WORD_COUNT = FOREACH GROUP_BY_WORD GENERATE group, COUNT(FLAT_ALLOWED_WORDS) as w_count;
The most important thing to understand here regarding UDFs is really this line:
register 'gogobot_udf.rb' using org.apache.pig.scripting.jruby.JrubyScriptEngine as Gogobot;
This registers the UDF in pig, so you can use the methods inside it just like you use the pig internal methods.
For example, instead of using the internal TOKENIZE
method, we are using Gogobot.tokenize(review)
, which in turn will invoke the tokenize
method in the UDF and will only return the words we care about separated by ;
.
Summing up
Using JRuby with Pig gives you a lot of power and can really kick your analytics workflows up a notch; we are using it for a lot of other things (more posts on that in the future).
26 Dec 2014
TL;DR We created a Vagrant+Chef cookbooks combination for provisioning a new laptop for Ruby/Rails development. Fork it here gogobot/devbox.
Our development environment at Gogobot grew a lot over the last 4 years, from a simple rails application reading/writing from a MySQL DB into a full-blown application that has many dependencies/micro-services and messaging between different kinds of apps.
Over the last 4 years we had multiple ways of installing the application. When I joined Gogobot it was a Google doc of about 8 pages filled with sudo gem install
commands and multiple configurations I had to do manually; it took me about 2 days to get everything installed and running.
After a while it became better and we released gogobot/laptop, which was a bit better, but it had many, many issues: if one of the processes failed, you had to start all over, and it was very unfriendly. When I set up my new laptop 3 years ago using this script, it still took a full day to get everything installed and running.
Even today, think about your development environment, you have so many things you did manually
- dotfiles
- tmux config
- vim config
- SSH keys
- Adding SSH known hosts
- Cloning the repo
- Running bundle
- Configuring mysql client
- Configuring postgres client
- etc..
All of these steps live somewhere, usually in some abandoned corner of your brain, you might choose to brew update git
or run a newer version of Redis while another member of your team still runs an older version.
We decided it’s not good enough for us anymore and we set out to do it differently.
Goals
Going into this, we had a few goals in mind
- Use existing tools already used at Gogobot
- Simple to install/use/understand for any team member (Ruby Engineers / iOS / Android / Project,Content,Product people)
- Minimal dependencies on host machines (git, Vagrant, Virtualbox)
- Easy to re-configure and re-provision
- All engineers use the same packages
- Absolutely no manual configuration of environments
The finished solution
If you have read posts on my blog before, you know we use chef quite extensively at Gogobot; we decided to use chef for this project as well, virtualizing with Vagrant and VirtualBox.
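At its core, the setup is a Vagrantfile that boots a VM and hands provisioning off to Chef. Here is a minimal sketch of that shape; the base box and recipe names are placeholders, not the actual devbox configuration:

```ruby
# Vagrantfile (minimal sketch, not the actual devbox configuration)
Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/trusty64"                 # placeholder base box
  config.vm.network "forwarded_port", guest: 3000, host: 3000

  config.vm.provision "chef_solo" do |chef|
    chef.cookbooks_path = "cookbooks"
    chef.add_recipe "devbox::ruby"                  # hypothetical recipe names
    chef.add_recipe "devbox::mysql"
    chef.add_recipe "devbox::redis"
  end
end
```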
Redacted (but fully working) version of the project is now at gogobot/devbox.
Enjoy!
Please feel free to open issues/comment if you are using it or having issues with it.
Discuss on HackerNews here: https://news.ycombinator.com/item?id=8800108
Inspiration/Credits/Thank you’s
19 Dec 2014
One of the best things you can do for your application is set up alerts with thresholds for various events/metrics.
Gogobot uses sensu as its primary alert/checks system.
Here’s a diagram of how sensu works (From the sensu website)
I won’t go into too much detail here; there are a lot of moving parts and you can read about them in the sensu documentation. The essence of it is that a server orchestrates checks on clients (servers, nodes), which then report the status back to the server, which handles notifications and more…
NOTE: I encourage you to browse the documentation and get a sense of what a check
is, what a handler
is and some basic sensu lingo; from here on I assume very basic knowledge of how sensu is wired up.
The reason for this post
I decided to set up sensu at Gogobot following a talk I heard by @zohararad.
After his talk, we also had a short Skype call about how he’s using it, what is he checking, how and why.
Often, when you set up a new system, the unclear part is how people are actually using it; this post shows how we use it and sets a pretty good baseline for how you could use it too.
Finding which failures you want to check and be alerted on
Over time, we added some very useful checks to the system.
The way we added checks was pretty straightforward: we looked at all the production failures we had over the 3-6 months prior to sensu and investigated the causes.
We found a few key failures
- Server ran out of disk space
- Cron jobs did not run on specific time
- Servers not getting the latest deployments
Every single person I talked to over the months that have passed since we added sensu told me that at least one of these happened to them at least once and made life miserable.
Obviously, these aren’t all of our checks but those definitely compose the base layer of each server basic checks.
First, I want to mention here that sensu has a very vibrant community around plugins, and there’s a great repository to get you started: sensu/sensu-community-plugins.
I am also slowly open sourcing our custom checks at gogobot/custom-sensu-plugins; feel free to use these as well.
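If you have never written one, a sensu check is just a small Ruby script built on the sensu-plugin gem that exits with an ok/warning/critical status. Here is a rough sketch of the shape; the measured value and thresholds are made up for illustration, this is not one of our actual checks:

```ruby
#!/usr/bin/env ruby
require 'sensu-plugin/check/cli'

class CheckSomething < Sensu::Plugin::Check::CLI
  option :warn, short: '-w WARN', proc: proc(&:to_i), default: 90
  option :crit, short: '-c CRIT', proc: proc(&:to_i), default: 95

  def run
    value = measure_the_thing # whatever metric this check cares about

    if value >= config[:crit]
      critical "value is #{value}%"
    elsif value >= config[:warn]
      warning "value is #{value}%"
    else
      ok "value is #{value}%"
    end
  end

  private

  # Hypothetical measurement; a real check would read disk usage, a file mtime, etc.
  def measure_the_thing
    rand(100)
  end
end
```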
Here are the checks explained
Check and alert if server ran out of disk space
{
"checks": {
"check_disk_usage": {
"command": "check-disk.rb -c 95 -w 90",
"handlers": [
"slack"
],
"subscribers": [
"all"
],
"interval": 30,
"notification": "Disk Check failed",
"occurrences": 5
}
}
}
This uses the check-disk.rb
check from the community plugins, the critical threshold is 95% and the warning threshold is 90%.
The handler for this is slack
, which will send us an alert in the Slack chatroom (handlers are also from the community plugins repo).
This check runs every 30 seconds and will only alert if it fails 5 times in a row.
NOTE: You will likely want a slightly slower interval; every 2-5 minutes is perfectly fine.
Cron jobs did not run on specific time
Obviously, we have a check that the cron
process is running on the machine, but you can find 50 examples for this on the web using the check-procs.rb
open source check.
We have dozens of tasks running on a schedule; those tasks run Ruby
code, and obviously that code can fail like any other code. This means that even though cron
is running, our infrastructure may still not be in good health.
So how did we solve this?
Our cron tasks run a shell script; that shell script cd’s into the project directory and runs a rake task (for example).
Each of these cron tasks has a file we identify the task with, and we echo the date into that file after the rake task is done. If the file is too old, it means the task did not run and there’s a problem with the code it’s running.
Here’s how this looks
The shell script
#!/bin/bash
set -e
cd PROJECT_DIR
bundle exec rake sometask:sometask
echo `date` > scoring-monitor-cron
The check
{
"checks": {
"check_website_monitor": {
"command": "check-mtime.rb -f PROJECT_ROOT/scoring-monitor-cron -c 1500",
"handlers": [
"slack"
],
"subscribers": [
"scoring-system"
],
"interval": 900,
"notification": “Scoring check did not run”,
"occurrences": 2
}
}
}
As you can clearly see here, we check the mtime
of that file, if the file did not change in the threshold given, we want an alert.
One more thing to note here is that the subscribers
list is different: it’s no longer all
servers but a subset of servers running that cron.
We have this check for every cron task that is running on the servers, making sure it all works and services are getting invoked when needed.
Servers not getting the latest deployments
This is likely the most annoying bug you will encounter.
Due to load on the server, we found that sometimes when you deploy to it, unicorn will somehow fail to switch the processes over to the new version.
I struggled a bit with figuring out how to check for this; in the end the solution (as always) was pretty simple.
We have an API endpoint that responds with the git revision that’s deployed right now on the site.
This API is pretty useful and we can also ask our chat robot (based on hubot) to tell us which git revision is running on any environment.
So… How do we check this:
On every server, www.gogobot.com
is directed to internal IPs, so when you call www.gogobot.com/some-url
from one of the servers you never leave the internal network.
When you hit the Amazon Load Balancer URL, it’s redirecting you to the source (www.gogobot.com
), so that’s also not useful.
Enter check-url-content-match.rb
from our sensu plugins repo (mentioned above):
{
"checks": {
"check_website_monitor": {
"command": "check-url-content-match.rb -b www.gogobot.com -h AMAZON_LOAD_BALANCER -s 0 -p /api/get_git_sha”,
"handlers": [
"slack"
],
"subscribers": [
"fe",
"be"
],
"interval": 60,
}
}
}
What this does…
This checks a couple of URLs: one with the hostname and one with the load balancer, passing the hostname as a header (to bypass the internal network).
If the content of those 2 URLs does not match, it means the deployed version on this machine is wrong or outdated.
Bonus
I’ll give an example here, if you are an engineer working in a team, you will relate to this for sure.
You get an alert via email: disk-check
failed on backend-aws-west-2
.
Here’s my thought process seeing this:
WHAT?
- Where is that server located?
- Why would the disk get filled up?
- Did an internal system fail that was supposed to rotate logs?
- What can I delete in order to fix this issue?
- Do I have permission to ssh into this server or do I need to escalate this up?
One of our alerts is from a website-monitor system we have built; it goes to the site and checks whether pages are broken and whether CSS is broken or returning 404.
Imagine a backend engineer receives this alert and no one else is awake… what should he do?
Adding more info on alerts
The great thing about sensu’s JSON check definitions is that if you add more data to them, it will be available on the alert itself.
For checks that are not obvious to fix, or where you will likely need more data, we added a wiki page link like this:
{
"checks": {
"check_deployment": {
"wiki": "http://wiki.gogobot.com/failures/deployment-failure",
"command": "check-url-content-match.rb -b www.gogobot.com -h AMAZON_LOAD_BALANCER -s 0 -p /api/get_git_sha”,
"handlers": [
"slack"
],
"subscribers": [
"fe",
"be"
],
"interval": 60,
}
}
}
This is the important part here: "wiki": "http://wiki.gogobot.com/failures/deployment-failure".
When you see an alert, you simply click on this link and you can read more about how to fix the issue, why it most likely broke, and more.
Here’s what this looks like in the dashboard
Summing up
I really only covered a handful of the checks we have on our system and how we handle the alerts.
Check out the community plugins; there are checks for everything you can think of: Graphite, Logstash, URLs, CPU… everything.
15 Dec 2014
Following my recent post Measure, Monitor, Observe and Supervise I got a ton of emails, Facebook messages, twitter DM’s and more, asking me to talk about HOW we do it, the chef cookbooks, the pipeline, how everything is wired up etc…etc.
This post has been a long time coming, but still, better late than never I guess, so in this post I will be open sourcing the entire pipeline. How it’s done, the wiring, the cookbooks, everything.
While this post’s title says “rails applications”, you will likely find it useful for any sort of web application running on multiple servers, in the cloud or on premise.
First, I’d like to describe how this looks from a bird’s-eye view:
- User hits application servers and the servers generate logs. Typically we look at a couple of logs. Nginx access logs and rails server logs
- Rsyslog is watching those 2 files and ships them to logstash over UDP when the file changes
- Logstash Checks the input and directs it to the right place (filter in logstash)
- Logstash outputs everything into elasticsearch
- Kibana is used to look at the graphs and dashboard
This is one example of a dashboard for our rails servers:
This has a lot of moving parts so lets dive into them one by one.
Logstasher gem
We use the logstasher gem in order to
generate JSON logs from the rails application.
We use the gem as is with a few custom fields added to it, nothing out of the
ordinary.
Logstasher gem will simply save a JSON version of the log file on the application server, it WILL NOT send the logs directly to logstash.
So for example you will have a /var/log/your-app-name/logstash_production.log
which will be a JSON version of /var/log/your-app-name/production.log
that you are used to.
I encourage you to read more into the Logstasher documentation, it’s pretty good and straightforward.
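For reference, wiring the gem up is basically enabling it in the Rails environment config (config.logstasher.enabled = true) plus, optionally, an initializer for custom fields. Here is a rough sketch of the custom-fields part; the field names and the current_user helper are illustrative, not our actual fields:

```ruby
# config/initializers/logstasher.rb (sketch; field names are illustrative)
LogStasher.add_custom_fields do |fields|
  # The block is evaluated in the controller context, so request helpers work here.
  fields[:ip]      = request.remote_ip
  fields[:user_id] = current_user.try(:id) if respond_to?(:current_user)
end
```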
To give you an example of how the logs are separated, look at this image:
Rsyslog shipper
The rsyslog shipper is a configuration file under
/etc/rsyslog.d/PRIORITY-CONFIG_FILE_NAME
on your App servers.
For example: /etc/rsyslog.d/22-nginx-access
.
We use chef to manage the template, here’s the base template:
NOTE: Do not forget to open up the ports in the firewall only to the servers
you need (or to the security group on Amazon)
$ModLoad imfile # Load the imfile input module
$InputFileName /var/log/nginx/access.log
$InputFileTag nginx-access:
$InputFileStateFile nginx-access-statefile
$InputRunFileMonitor
:syslogtag , isequal , "nginx-access:" @your-logstash-server:5545
& ~
Explaining the file a bit more:
- This is a working example of sending nginx logs to logstash over UDP
- /var/log/nginx/access.log is where nginx stores access logs by default; if yours doesn’t store them there, change it
nginx-access
is the tag given to that file input by rsyslog, you can identify the input by the tag and handle it
your-logstash-server
needs to be domain/ip/ec2-internal-address
- If you use
@
when sending the logs, it will be sent over UDP, using @@
will send the logs over TCP.
& ~
Tells Rsyslog to stop handling this file down the chain.
Note:
1. This works with rsyslog version 5.x, if you are using 7.x or any other version it will likely be different.
2. You only need $ModLoad imfile
once per rsyslog config; if you use 2 configuration files, only the first one needs to include this line.
One gotcha that I wasn’t aware of at all until sharing this shipper as open
source is that if you use TCP from rsyslog and for some reason logstash is down
or something happens to the channel (firewall etc…), it might grind all
your servers to a halt.
This is due to the fact that rsyslog will save a queue of tcp tries and
eventually choke under the load.
Following that tip, we moved all of our shippers to UDP; if you have that
option you should do the same IMHO.
Chef cookbook
The main part of this is really our logger cookbook, this installs the main server and generates the inputs.
As part of this blog post, I am open sourcing it as well, you can check it out on Github.
The main part in the cookbook is the part that filters the input based on where it’s coming from.
input {
udp {
port => "5544"
type => "syslog"
}
}
output {
elasticsearch {
host => "127.0.0.1"
}
}
filter {
if [type] == 'syslog' {
grok {
match => {
message => "%{COMBINEDAPACHELOG}"
program => "nginx-access"
}
}
}
}
You can see here that the grok pattern is COMBINEDAPACHELOG
, which is being
extracted from the line streamed into logstash.
This is a predefined pattern that ships with logstash.
For Rails, it’s a bit more tricky as you can see here:
input {
udp {
port => "5545"
type => "rails"
codec => json {
charset => "UTF-8"
}
}
}
filter {
if [type] == 'rails' {
grok {
pattern => "<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:timestamp} %{SYSLOGHOST:logsource} %{SYSLOGPROG}: %{GREEDYDATA:clean_message}"
}
mutate {
remove => [ "message" ]
}
json {
source => "clean_message"
}
}
}
As you can see, this is a bit more tricky. Since rsyslog ships the logs
into logstash, it adds some custom fields to each line, making it invalid JSON,
so the first part cleans up the message and then parses it using the JSON
codec.
Note: Logstash has an input called syslog
, this might be better here than using TCP/UDP and cleaning up.
I haven’t tested it on our setup yet, but once I do, I will update the cookbook and the blog here.
Thank you’s
I want to thank @kesor6 for helping me a ton with
this, helping fix bugs and pair programming into the wee hours of the night.
Thanks also to the Operations Israel Facebook group: amazing
people helping out with problems and giving tips when needed.
28 Aug 2014
For a while now I have wanted to share the way I work on Ruby/Rails projects: how I go about handling the multiple processes you need, the editor, running tests and other stuff.
Since the way I work is not too common, I think it’s worth sharing and letting people know about it.
Everything in the terminal
100% of the work I do is done inside the terminal, I don’t switch back and forth between programs or between windows.
The only way I really exit the terminal is to Google something or to see how something looks on the web (although that is rare for me).
Since I usually don’t do front-end work, I stick to the terminal for backend work and tests that verify what I do.
Focus on the keyboard and automation
Unless I exit the terminal, I don’t use the mouse at all; I really just use the mouse for browsing online. I don’t treat this as a religion, it’s just really convenient for me; I don’t measure every millisecond to say whether it’s faster or not in the long run, it just comes naturally (with time).
For example, copying something from the error in the spec output to Google is also done using the keyboard.
Common language
Just to have a common language throughout this post, I want to put a few variables here that will make things clearer.
C
=> CTRL
{tmux_leader}
=> C-a
{vim_leader}
=> ,
Not Rails specific
The workflow is definitely not Rails or even Ruby specific; I work the same way with Chef projects (devops), and even this blog post is being written using Vim in the terminal.
- Tmux (+Tmuxinator)
- Vim
How it looks
Here’s a screenshot of what it looks like
As you can see right away the screen is split into multiple parts. I use Tmux to do this (I will go deeper into this in a bit).
Explaining the sections of the screen
The screen is divided into 2 main parts on the “main” tab.
The top part is where code is being edited, I use vim to do all of my work. Whether it’s Ruby, Javascript, CSS or any other language.
The bottom part is also divided into two parts, the left side is rails console (if it’s a rails project) or irb if it’s a Ruby project.
The right side is a pipe reader, basically just a tail of some log (I will explain further later on)
Tabs
Almost every project I have requires multiple tabs; each tab runs a different process, and usually these processes just need to run in the background for the project to be usable.
Think of a rails project (not your basic one); you will need:
- Console
- Server
- Foreman
- Spork
- Guard
- SSH into a server of some sort
- Vagrant machine running a service
- add yours here…
Everything needs to run when you boot up the project.
Once you have all the tabs open, you can switch between them with {tmux_leader}+tab_number
, so in order to see the rails server log all I need to do is {tmux_leader}+2.
Adding tabs
Once you are working on a project, you sometimes need to add a tab say for example SSH’ing to a server, tailing a log or anything else really.
That is also super easy: you just do {tmux_leader}+c
. You can also rename the tab so you know what’s on it with a glance at the status bar: just do {tmux_leader}+,
, input the name and you are done.
Panes
Switching between panes is also a breeze, you just move about as you’d move in vim, using the same navigation keys. {tmux_leader}+j
will move you down.
I usually resize panes for readability (when I am using the smaller screens like laptop or 24” screen at my sitting desk). Resizing is using the same keys as navigation but using the capital letter. {tmux_leader}+J
will resize down (maximize the upper screen).
Adding/splitting more panes
My home office standing desk is home to my 30” Dell screen (go buy one now), so I have a ton of screen real estate.
Sometimes I split into 6-7 vim buffers and also have 3-4 tmux panes (Heavy debugging sessions for example).
Splitting with tmux is really intuitive: {tmux_leader}+-
splits horizontally and {tmux_leader}+|
splits vertically.
The status bar
Tmux status bar (or footer) is usually a place where you have all the info you need. I use tmux-powerline in order to configure this.
Anything that is a bash command can be a part of the status line, you can view your email count, the time, some API command from your server you curl to, really anything.
For me, I have experimented with lots of stuff and trimmed it down to the minimum: I just use the time (since I am using full screen more), the tabs and the session name, really nothing more.
There’s a great blog post about configuring your status line (without powerline) here.
Automating projects start
It can be a real pain to start a rails project, when you start your computer or when you switch between projects.
People usually miss a part or don’t run specs because spork is not running and other lazy-driven excuses.
I basically automate every project using Tmuxinator
Here’s an example of how I configure the Gogobot project
# ~/.tmuxinator/gogobot.yml
# you can make as many tabs as you wish...
name: gogobot
root: ~/Code/gogobot_rails3/www_rails
windows:
- main:
layout: main-horizontal
panes:
- vim .
- ./read_pipe.sh
- server: rm -f log/development.log && rm -f log/test.log && rails server thin
- foreman: bundle exec foreman start
- spork: bundle exec spork
- guard: bundle exec guard -G GuardFileOnlyJasmine
So now, in order to start the project I just run tmuxinator start gogobot
and I am done, it will start up everything and I can start working.
Configurations
Both my vim config and the tmux config are public, you can fork/add/comment or just use them if you’d like
- vim config here
- Tmux config here
Developer workflows
Before starting this post, I thought to myself, will this be another tmux-vim-shortcut post?
My answer to myself was: “If you don’t share common workflows, it will be exactly that”. I remember, before switching to tmux+vim, seeing those posts and having lots of questions pop up, like: what about copy+paste, what about the clipboard, scrolling, selecting, what about running tests, etc…
I have identified some workflows that I had a problem with and I will try to clarify these here as best I can. Hopefully this will carry you over the edge of testing this method out.
Running tests
On the right side you can see ./read_pipe.sh
; this is an awesome trick I picked up from Gary Bernhardt (@garybernhardt).
Usually, when you run specs inside vim or any other IDE, the specs break your coding cycle, you have to wait for the specs to finish in order to see the screen where your code is.
I found that this is very distracting, I want to keep editing code while specs are running, even if it’s just “browsing” the code, going up/down in the screen, I don’t even want to wait a single second for the specs to finish.
Here’s what it looks like to run specs, for example (I use {vim_leader}+;
to run a single spec and {vim_leader}+'
to run a full spec file).
When the specs are running, I can continue with the code, browse to other files and do basically whatever I want.
One more awesome thing about it is that the pipe file can run in any other terminal, so basically I can put it on another tab, another window or anywhere I’d want to.
I usually use it this way: I focus on vim full screen (using tmux focus mode {tmux_leader}+z
), and the specs are running in the background, then, when I want to look at the output, I can just snap out of tmux focus mode and look at the specs output.
Checking some error in Google
A really common workflow with developers is needing to check something on Google from a test output, error or even code. That’s likely something you do dozens of times a day without even thinking about it.
Here’s what I do in order to achieve this.
First, I move to “copy mode” in tmux. I do this by hitting {tmux_leader}+[
.
This puts tmux into copy mode and I can navigate up/down or anywhere on the screen, the great thing about it is that it’s actually using the same navigation commands like vim.
After I select the text I want, I hit CMD+ENTER
in order to pull up Alfred, paste the text and hit ENTER
.
Here’s what it looks like.
Copy + Paste interface
Most of you probably take this one for granted, but having tmux+vim+osx clipboards talking together is really awesome, I can’t really tell you how much better my life became after making this happen.
Also, as you can see I am using Alfred. One of Alfred’s best features is the searchable clipboard history, this means I can search through my vim copy history in a snap.
Working with git
I have found that most commit messages people are making are one liners. I researched my commit messages and I found that there was a point in time where I started to write better longer commit messages. I think this workflow really pushed me towards that.
Using vim, I have 2 shortcuts for git I use all the time.
{vim_leader}+gs
will show a git status screen where I can see all the files that have not been staged, and I can stage/unstage files right from vim. (This is done using vim fugitive.)
This is what it looks like:
You can read more about fugitive in this github repository.
The best thing about fugitive is that you stay inside vim, I have found that staying inside the editor allows you to write more comprehensive commit messages, copy some code or some documentation into the commit message and be more fluent.
Pair programming (Remote or on site)
If you are following me on any social network, you probably already know that I have been working from home for about 5 years now.
This means that usually when I pair program with someone, I am doing it remotely, usually this certain someone is 5,000 miles away.
Pair programming and screen sharing is such a pain without tmux and vim.
When you pair using tmux and vim everything works really well, first, you simply ssh into a machine, just like you do to one of your servers
There’s no rendering involved, so you don’t really need a monster internet connection.
Also, your pair has everything, just like you do, they have the console, the test output, all right there with a click of a key.
You can both SSH together into another machine, you can explain something, view a log together, it’s amazing.
Conclusion
I am really enjoying my development workflow, it works perfectly for me and I strongly believe it can work for anyone.
Using the right configuration of vim+tmux can really smooth out the transition to this workflow and I strongly recommend it.
Like I already mentioned above here, I don’t treat working with the keyboard alone as a religion but it is really faster (at least feels faster). Not switching between screens in order to see some output or to input some kind of command is really convenient.
You can look into (and setup) my configuration, it’s a very nice jump into the world of working completely in the terminal.
If you have any questions/comments please feel free to leave those down here and I will do my best to help out.
13 Jan 2014
The story
Around 3 months ago (probably more, but I can’t recall the exact date), Gogobot set out to do a complete revamp to the underlying infrastructure that runs the site.
We decided to go for it for various reasons, like performance, maintainability, but the main reason was to do it right, with Chef, documentation etc…
Right off the bat, I knew I’d need help doing it, so I called my buddy @kesor6 who runs a Devops shop in Israel to help out.
Evgeny was (and still is) great; he took care of all the heavy lifting while I stayed focused on delivering features, so he wrote the majority of what we needed, up to the point where domain knowledge was needed, which is where I came in.
Even after all that heavy lifting, we did a ton of pair programming, which was one of the best experiences I had pairing with anyone, ever.
One of the main things Evgeny recommended was to start monitoring everything. I immediately said, “dude, we have monitoring, what the hell are you talking about?”. We had NewRelic, the NewRelic server monitor, Pingdom, Pagerduty, and Amazon monitoring on all servers including ELB, DB and everything.
I didn’t actually realize what I was missing out on.
The infrastructure
We installed a few new services
- Logstash with Kibana
- Graphite (w/ statsd, collectd, collectd plugins)
Logstash
Logstash (backed by ElasticSearch index) is a logging server, you send all your server logs (nginx access logs, rails production logs etc…) to it.
Logstash will index it (you can control the index properties) and convert everything to a searchable format.
Now, you don’t actually realize how much you need something like that, but since then I have had a couple of bugs that would have taken me 2-3 more hours to solve without it.
Sample Queries
- Show me all the API response codes of people uploading postcards.
- Show me how many requests came from GoogleBot compared to organic over the last 24 hours (same for bing)
- Show me all CSS requests that are broken (same for JS)
Kibana
Kibana is the client side application that takes everything and makes sense of it, so you can visualize everything in a beautiful way.
With Kibana you can save searches, make them your homepage and more.
All the knowledge you didn’t have about your app is now at your fingertips; since getting it I have used it for many things and insights we did not have before.
Example Case
One of the nastiest bugs I encountered lately was that our mobile app could not post postcards. It happened to specific users, and when it happened you could not fix it; you had to reinstall the app.
Luckily, this issue wasn’t widespread, but even then, it was hell to debug.
Here’s what it looked like:
What we saw was that instead of doing a POST
, the phones that had the bug did a GET
request, which was then retried over and over again.
What we also saw is that the buggy phones did not send the right headers, so they could not authenticate.
Just seeing everything as it was happening was mind blowing; since we have multiple API servers, I would never have seen this on my own, it was too difficult.
What we soon figured out with Kibana is that the phones that had the bug did a POST
, got a 301 response (permanent redirect), and from then on did a GET
without even trying to do another POST.
This pointed us to a bug in the Nginx configuration, which was redirecting API requests (DON’T ever do that, trust me).
Again, Logstash has pretty amazing defaults, so the index and the data being sent from Nginx is enough to debug most problems you can encounter.
We use Kibana as a first research tool when we get a bug report. Looking at similar requests, checking how wide spread the bug is, and more.
With Kibana, you can look at a specific client as well (based on IP, user_id and more)
I am guessing you can imagine the level of research and insights you can draw from it.
Graphite
Graphite is perhaps the most important piece of the monitoring puzzle for us; I can’t begin to explain how much we use it these days, for things we never knew about before.
Before I go any further into this… let me show you what we had before
As you can see, there’s a HUGE spike in request queueing, and this is something that always frustrated me about NewRelic: WHY?
Why do I have such a request queue bottleneck, what happened? Did the DB spike? Did Redis Spike, are 50% of the LB servers dead?
What the hell is going on?
With NewRelic, we were blind, really, it was really frustrating at times, especially when you suddenly see a DB spike, but you have no idea what caused it, because you lack the reference.
Example Case
One of the most annoying bugs we had for a while was DB spikes: about once a week the DB would spike to around 80-100ms per request, and after a minute settle back down to 10-15ms.
We were trying to figure it out in NewRelic, but the slow query log only showed fast queries that were queued up, nothing really helpful. This is where graphite really shines.
We sat down and started looking at the stats, cross referencing things to one another, soon we realized that one of the admin controllers was doing too many long queries, which slowed down the DB time.
But, we had no proof, so we graphed it.
What you see in this graph, is that every time this controller was requested, it would spike the database, sometimes a light spike and sometimes a bigger spike.
Also, as you can clearly see from the graph, the issue was fixed and then tested over and over again, without the DB spiking again.
Using Graphite for everything
Now we use Graphite for every measurement we need: we send disk, CPU and memory data (using collectd), Nginx connection data (using the collectd Nginx plugin), everything that rails supports through ActiveSupport, and also custom data about the scoring system and more.
The level of insights you begin to develop is sometimes mind blowing, it’s really hard to comprehend how blind I was to all of these things in the past.
Further reading
http://codeascraft.com/2011/02/15/measure-anything-measure-everything/
http://codeascraft.com/ (Etsy engineering blog) has a ton of insights, just seeing how they use monitoring and insights is amazing, I have learned a great deal from reading about it, I recommend you do the same.
Statsd
(From the github README)
A network daemon that runs on the Node.js platform and listens for statistics, like counters and timers, sent over UDP and sends aggregates to one or more pluggable backend services (e.g., Graphite).
Since we don’t want a performance hit when sending stats to Graphite, we want to send UDP packets, which are fast and fire-and-forget.
Supporting Libraries
- statsd
- statsd-instrument
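To give a feel for what custom instrumentation looks like with these libraries, here is a rough sketch using the statsd-instrument gem; the metric names and the measured method are made up for illustration:

```ruby
require 'statsd-instrument'
# (How the statsd host/port is configured depends on the gem version;
#  commonly via environment variables such as STATSD_ADDR.)

# Count an event every time it happens.
StatsD.increment('scoring.place_rescored')

# Measure how long a block takes and ship the timing to statsd/Graphite.
StatsD.measure('scoring.full_recalculation') do
  recalculate_scores_for_place(place) # hypothetical method
end
```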
Collectd
Collectd is a daemon that collects system performance metrics periodically and sends them over to Graphite.
For example: See the Mongo graphs together with the Sidekiq metrics, so you can see if you have errors in Sidekiq workers, what Mongo looks like during those times. (You can see some pretty amazing things)
You can collect a ridiculous amount of data, and then you can look at it with Graphite, again, with cross reference to other metrics.
For example, we had problems with Mongo reporting about replica lag. One of the theories was that the disk was queuing reads/writes because of insufficient iops.
2 hours into having graphite, we realized this theory was wrong, and we needed to look elsewhere.
Plugins
CollectD has a very big list of plugins, it can watch Nginx, Mysql, Mongo, and others, you can read more about it here: https://collectd.org/wiki/index.php/Table_of_Plugins
One of the plugins we use is Nginx, so you can see some really useful stats about Nginx.
Summing up
I really touched just the tip of the iceberg of what we monitor now, we actually have a huge number of metrics.
We collect custom business logic events too, like scoring system events, which are critical to the business. Search events and more.
Once you start implementing that into your workflow, you start to see the added value of it, day in day out.
When I start implementing a new feature, I immediately bake stats into it, this way I know how it functions in production.
Eventually, you should remember that Graphite is an amazing platform you can build on; that amazing dashboard you always wanted can be achieved with Graphene, or with Dashing. There’s a more comprehensive list here: http://dashboarddude.com/blog/2013/01/23/dashboards-for-graphite/.
You can expose those insights to other members of the team, from product to the CEO, who may care about totally different things than you personally do.
The approach I take with measuring and collecting is: First collect, store, then realize what you want to do with this data. Once you know what you want, you already have some stored data, and you can begin work.
There are libraries that support alerts based on Graphite graphs, so you don’t have to actually “look” at the dashboard.
Get to it!
I believe that absolutely every company, at any stage can benefit from this, from the 2 people bootstrapped startup to the funded multiple engineer startup.
Every time you see a post from a leading tech company, it’s backed by graphs, data and insights, you can have it too.
Get to it, it’s not difficult
Feel free to comment/feedback.
03 Jan 2014
This week I had an issue that my Graphite instance was falling apart.
Every query I tried, every dashboard I loaded was stuck, I couldn’t get anything done.
Now, I use graphite every day; every technical decision I make is based on a graph, and for every problem and alert I get for the servers, I look at the graphs first. So naturally, this was not a good place to be.
From the server’s collectd stats, I saw that the EBS drive had filled up; usage spiked to 100% and it was slowing everything down.
Now, I won’t go into Graphite, Carbon or any of these here, but I just want to go through how I solved it step by step, since every single post I read about it was partial, incomplete and inaccurate.
First, let’s set out the goals for replacing the drive on your EC2 instance
- Minimal downtime
- Minimal data loss
- Fast (no copy data)
ok, so lets start…
First, you need to snapshot the drive
You don’t have to stop the instance, you don’t need to stop any service, the server can keep running while this is happening.
For a full (100%) 500G drive, it took Amazon around 2 hours, which was agonizingly slow, but the server kept running collecting stats, so I didn’t really mind it so much.
After you have the snapshot, you just create a new drive from it
You can of course configure everything just like a normal drive, you can configure iops, you can configure the size and region, just like you would a brand new one.
The filesystem is already there, your data is intact and the sun keeps shining :)
Keep in mind, the drive HAS to be in the same availability zone as your instance, or you will not be able to attach it.
It takes around 30s-1m for the drive to be available, then you just need to attach it to your machine
Then you need to select where you want to attach it
At this point, every other post I read failed to explain it clearly, so I will try really hard to be clearer.
Now you have two drives: /dev/xvdl
for example, and the new one at /dev/xvdp
. /dev/xvdl
is mounted at /mnt
and the new one is not mounted yet; it’s just attached to the server.
Now, you have two options
Option #1
sudo vim /etc/fstab
You will see this line:
/dev/xvdl /mnt xfs noatime,nobootwait 0 2
As you can see, /dev/xvdl
is mounted to /mnt
like I said earlier, you can just replace it with /dev/xvdp
and restart the machine.
Your new line should look like this:
/dev/xvdp /mnt xfs noatime,nobootwait 0 2
Then you have to reboot the machine
Option #2
Stop all services that write to this disk
This is a super important step: you HAVE to stop all services that write to this mounted drive, or it will just not work; Linux won’t let you unmount it if there are processes reading or writing.
I just stopped Graphite and relevant services and then ran
sudo umount /mnt
This will unmount /mnt
so you can continue
After the old drive is unmounted, you will need to do sudo mount /dev/xvdp /mnt
and then edit the /etc/fstab
file, just like in option #1.
Then start the services again
Next step
Now, if you followed the steps, you are probably saying to yourself that it didn’t work, because the drive still shows up as 500G at 100%.
This is where you just need to run sudo xfs_growfs -d /mnt
, which will grow the filesystem to the new drive’s capacity.
Summing up
When I did it, I had about 10 minutes of downtime to my stats machine, which didn’t take anything else down since everything is writing over UDP, for this sort of maintenance it seemed acceptable.
I didn’t lose any data except those 10 minutes where the stats server was down.
Feel free to comment or give feedback on anything.
17 Dec 2013
Update (03-10-2014)
After some great debugging from @mislav, turns out that this is not actually a Ruby issue or even a Koala/Faraday issue.
This is actually the right_aws
gem and/or the right_http_connection
gem that are monkey-patching the class and causing the issue.
The github issue on Koala has all the discussion
Thanks @mislav
The story
One of the most interesting projects I did recently was to completely revamp the underlying infrastructure that runs Gogobot.
From an Apache+Passenger pair, I converted to Nginx+Unicorn while upgrading the Ruby version to 2.0 (from 1.9.3).
I did everything using Chef (many thanks to my colleague @kesor6 for the nights of pair programming), it was definitely one of the most interesting projects I had done, Chef is awesome but that’s for a different post.
The problem
When I upgraded the project to Ruby 2.0, things started to break, really weird things like the Facebook API (using the Koala gem), AWS api (using right-aws) and others.
Everything that broke was related to some sort of an external API, and it was really weird since everything was working super smoothly before the upgrade.
Now, the bug was really obscure and hard to find, and it required quite a bit of digging, but here are the results.
The code that was breaking was pretty standard, nothing out of the ordinary
graph = Koala::Facebook::API.new(token)
graph.get_object 'me'
When this code ran on Ruby 2.0, it threw an exception:
MultiJson::DecodeError: 399: unexpected token at ''
Now, this API did not change and I did not upgrade the gem; the Ruby 1.9 version of the site was running the same code with no problem.
At first, I tried upgrading the Koala gem and the JSON gem, replacing the JSON backend and more; nothing worked and I was getting into the dependency limbo we all know and hate.
Then, I started digging through the code, the line that was breaking was inside Koala:
body = MultiJson.decode("[#{result.body.to_s}]")[0]
The problem was that result.body
was actually Gzipped, and trying to parse it Gzipped just exploded all over the place.
Now, I did not ask for it to be gzipped; nowhere in the source am I passing a header saying that I accept gzip, so it felt odd and I started digging through the Ruby stdlib.
I found the culprit here: https://github.com/ruby/ruby/blob/v2_0_0_353/lib/net/http/generic_request.rb#L39
initheader["accept-encoding"] =
"gzip;q=1.0,deflate;q=0.6,identity;q=
You can read the code on your own, but what you can easily see is that if you don’t pass an Accept-Encoding header, by default Ruby will pass one that tells the external service your app is OK with a gzipped response (which is obviously wrong).
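In other words, the default only kicks in when you don’t supply an Accept-Encoding header yourself. A quick way to see this in isolation, or to sidestep it for a single request, is to pass the header explicitly; here’s a minimal sketch (the URL is illustrative):

```ruby
require 'net/http'
require 'uri'

uri = URI('https://graph.facebook.com/me') # illustrative URL

# With no headers, Ruby 2.0 adds its own Accept-Encoding (gzip, deflate, ...).
# Passing the header explicitly disables that default for this request.
request = Net::HTTP::Get.new(uri.request_uri, 'Accept-Encoding' => 'identity')

response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
  http.request(request)
end

puts response.body # plain, un-gzipped body
```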
The fix
Digging through some Stackoverflow answers (which were 99% wrong), I found one that hinted at the solution: the response class has an option to decode the content, and if you set this variable, everything will be OK.
(The source code is almost identical to the proposed solution)
module DecodeHttpResponseOverride
def initialize(h,c,m)
super(h,c,m)
@decode_content = true
end
def body
res = super
if self['content-length']
self['content-length']= res.bytesize
end
res
end
end
module Net
class HTTPResponse
prepend DecodeHttpResponseOverride
end
end
This solved the problem, for all classes, AWS, Koala and all other places that had bugs.
Summing up
I posted an issue to Koala here: https://github.com/arsduo/koala/issues/346. To be clear, this is not a Koala bug, but I figured the guys maintaining it should know about it.
Before you upgrade your project, make sure you have a test suite that covers these situations; luckily we did, so the damage wasn’t big.
Hopefully, this post and the Koala bug report will save you some time when you are upgrading.