25 Sep 2015
Lately, my personal trend when working with single page applications has been to create two separate projects: one called client and the other called server.
I discussed the ups and downs in this StackOverflow answer. @elado also answered it and emphasized how he works in this answer.
If you’ve read the StackOverflow question I linked to, you can see that working like this comes with a set of challenges and misunderstandings. I think some of them are worth mentioning here.
Assets
When working with Rails, you have the asset pre-compile stage, called the asset pipeline. Now, this is a source of hate and controversy in the Rails community, and I definitely do not want to open that Pandora’s box, BUT it raises a legitimate question: if I have a client application and a server API, who is in charge of assets?
When you get into the complexity of managing assets in a production environment, you understand why it’s important. In order to deploy assets you will need to compile them, sign them for a release and put them where servers can easily access them. That goes for both CSS and JS.
In development, you would want to work with as many files as you can,
modularizing them, but in production (at least for HTTP/1.1) you would ideally
want the browser to link to as few as possible.
Configuration
Where’s my API?
Locally, it’s located at http://localhost:3000, staging is at stg.your-app.com, etc… Normally, you would want to inject those settings during deployment, right?
I discussed this with @elado as well, and this is the kind of setup that will make most developers cringe.
Now, if you modularize your app correctly, you should not have a lot of places
where you inject the Config
factory, but if not, you are in for a world of
hurt.
I remember the joys of working with a monolithic Rails application: everything was either rails g or some rake task that you could easily find with rake -T.
Now that you have a client and an API, you are working with more tools, each of which brings its own challenges to the table that you need to deal with.
Scratching the surface
This is just scratching the surface of the challenges; when you get into running these things in production you have many more.
Solving the problems
All of these challenges and problems are solvable. I would like to discuss how I personally solved some of them in the next part.
API location
THE best solution I have for the API location, one that scales in production as well, is to not use an API location at all.
Instead, have your app call /api and handle it on the server. Let’s see how that works in development first and then on the server.
When I run grunt serve and rails server I essentially have two applications: one on localhost:9000 and the other on localhost:3000.
I created a dead simple Go proxy that opens a server on http://localhost:5050. When I ask for a URL that starts with /api it will proxy the request to localhost:3000; everything else goes to localhost:9000.
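The original proxy isn’t included here, but a minimal sketch of what such a proxy can look like in Go, using the standard library’s httputil.ReverseProxy (ports as described above, error handling kept to a minimum), would be:

package main

import (
    "log"
    "net/http"
    "net/http/httputil"
    "net/url"
    "strings"
)

func main() {
    // Backend API (Rails) and client app (grunt serve) targets.
    api, _ := url.Parse("http://localhost:3000")
    client, _ := url.Parse("http://localhost:9000")

    apiProxy := httputil.NewSingleHostReverseProxy(api)
    clientProxy := httputil.NewSingleHostReverseProxy(client)

    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        // Anything under /api goes to Rails, everything else goes to the client app.
        if strings.HasPrefix(r.URL.Path, "/api") {
            apiProxy.ServeHTTP(w, r)
            return
        }
        clientProxy.ServeHTTP(w, r)
    })

    log.Fatal(http.ListenAndServe(":5050", nil))
}

Point your browser at http://localhost:5050 and both the client and the API appear to come from the same origin.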
Here’s a basic diagram to show how this works:
This solves a few problems.
The first one is the security configuration called CORS. I discussed handling CORS with Angular before.
Second, this allows you to deploy the app anywhere, and the server it’s being deployed to controls whether the application is running as production, staging or something else.
Nginx is awesome for this kind of configuration; here’s a snippet from ours:
server {
  client_max_body_size 30M;
  listen 80;

  ... REDACTED

  server_tokens off;
  root "<%= @path %>/current/public";

  location /api {
    try_files $uri @gogobot-backend;
  }

  location @gogobot-backend {
    ...REDACTED
  }
}
Note: This template is from Chef; it gets evaluated by Ruby before going to the server, hence the <%= @path %> code…
Now, this means that the client side code and the server side code need to
“live” on the same server, but this is definitely not a restriction you need to
have. For us it makes sense, for you it might make more sense to have a single
server for the client and multiple servers for the backend and have Nginx proxy
to the load balancer. You are in control here.
Assets
I use bower and grunt in order to manage assets. I do not manage any of the
assets in Rails.
Here’s what it looks like in the code
<!-- build:js(.) scripts/vendor.js -->
<!-- bower:js -->
<script src="bower_components/jquery/dist/jquery.js"></script>
<script src="bower_components/angular/angular.js"></script>
<script src="bower_components/angular-animate/angular-animate.js"></script>
<script src="bower_components/angular-aria/angular-aria.js"></script>
<script src="bower_components/angular-cookies/angular-cookies.js"></script>
<script src="bower_components/lodash/lodash.js"></script>
<script src="bower_components/angular-google-maps/dist/angular-google-maps.js"></script>
<script src="bower_components/angular-messages/angular-messages.js"></script>
<script src="bower_components/angular-resource/angular-resource.js"></script>
<script src="bower_components/angular-route/angular-route.js"></script>
<script src="bower_components/angular-sanitize/angular-sanitize.js"></script>
<script src="bower_components/angular-touch/angular-touch.js"></script>
<script src="bower_components/bootstrap-sass-official/assets/javascripts/bootstrap.js"></script>
<!-- endbower -->
<!-- endbuild -->
<!-- build:js({.tmp,app}) scripts/scripts.js -->
<script src="scripts/app.js"></script>
<script src="scripts/controllers/application.js"></script>
<script src="scripts/components/field/gg_field_controller.js"></script>
<script src="scripts/components/field/gg_field.js"></script>
<script src="scripts/controllers/main.js"></script>
<script src="scripts/controllers/login.js"></script>
<!-- endbuild -->
The default Yeoman Angular generator does the trick, use it in order to start
your applications and never look back.
CDN friendliness
At Gogobot we have 3 levels of caching. I will not go into too much detail here, but the first level of cache is the CDN: we simply cache the page as it is.
When you have a client/server application, this provides a few challenges as
well. Here’s what we do.
Obviously, we have a random key appended to each file name when we build; this is handled by grunt (the code above; reference the Yeoman generator for more details).
The perceptive among you will see a challenge here as well: the cached HTML might reference a JS file from an older build, and that file must still be on the server in order for the HTML to work.
The solution
The solution to that is also a bit complex, so let’s take it one piece at a time. First, let’s make sure we have the files on the servers when we need them.
The solution for this is actually pretty straightforward.
We keep all the JS and CSS files on the servers; we don’t delete them with every deployment. Instead, in a shared folder we have /v/{version}/file.{version}.js (and the same for CSS). This way, when a request for a file comes in, we proxy it into the right folder (using Nginx as well).
This is actually how our main Rails app works right now as well: since we cache the pages as a whole, we want to serve the CSS for a page even if the page itself is stale (pages are cached for 24-48 hours). You can go to this page and have a look at the source to find the CSS files.
Here’s example code from our deployment script that cleans up this folder:
assets_path = "#{latest_release}/public/v"
dirs = capture("ls -tr #{assets_path}", hosts: server.host).split("\n")
trim = dirs.size - 40

if trim > 0
  logger.info "Removing #{trim} assets for release older than 40 versions..."
  dirs[0..trim].each { |d| run "rm -rf #{assets_path}/#{d}", hosts: server.host }
else
  logger.info "Not removing old releases, we have only #{dirs.size} out of 40 stored."
end
The second problem is harder to solve, and that’s: What happens if an old JS
tries to call an API that is no longer there, or responds in a manner that you
can’t predict anymore?
For this problem there are many solutions as well: you can go as crazy as versioning and migrations for the API, or you can version your API smartly for breaking changes.
We chose the latter: we simply version our API in a way where we don’t remove fields or change a field type unless we “release” a new version. Since our APIs serve the mobile application as well, those need to remain stable for users and cannot break.
Choose how far back you want to support and go along with it.
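To make that policy concrete, here’s a hypothetical sketch in Go (not our actual code): /api/v1 keeps serving the response shape it shipped with, while breaking changes go to /api/v2.

package main

import (
    "encoding/json"
    "log"
    "net/http"
)

// v1 shape: once published, fields are never removed or retyped.
type placeV1 struct {
    Name  string `json:"name"`
    Score int    `json:"score"`
}

// v2 shape: breaking changes (renamed or retyped fields) live here instead.
type placeV2 struct {
    Name   string  `json:"name"`
    Rating float64 `json:"rating"`
}

func main() {
    http.HandleFunc("/api/v1/place", func(w http.ResponseWriter, r *http.Request) {
        json.NewEncoder(w).Encode(placeV1{Name: "Gogobot HQ", Score: 87})
    })
    http.HandleFunc("/api/v2/place", func(w http.ResponseWriter, r *http.Request) {
        json.NewEncoder(w).Encode(placeV2{Name: "Gogobot HQ", Rating: 4.35})
    })
    log.Fatal(http.ListenAndServe(":3000", nil))
}

Old clients (including cached JS and the mobile apps) keep hitting v1 and never notice the change; new clients opt into v2 when they are ready.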
Development against a server
From my experience in the past, this is definitely more noticeable for
developers that don’t use TDD, but, it can be a pain for ones that do as well.
When you work with a client application and the server is developed separately,
there’s often a gap between how people work, the client side developer might
need some response from the server that isn’t there yet. It could be a field,
it could be an endpoint that doesn’t even exist yet.
Facebook has been doing amazing work with GraphQL, where the client describes the response it needs from the server. They wrote a good post about it here: GraphQL Introduction.
However, for most applications you will have a predefined API endpoint that
sends back a predefined JSON response.
If you are starting from scratch right now, definitely look into customizing this and making it easier for the client to define the server response; but if you aren’t, the solution for you is a proxy-stub server.
There are so many open source solutions for this I can’t even count; Janus is one of them. You can configure endpoints and API responses and it will send you back the stubbed response.
This is awesome if you want to develop visually, style a web page and whatnot, all without your progress being blocked by other teams.
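If you’d rather not pull in another dependency, a stub server is also trivial to roll yourself. Here’s a minimal sketch in Go (the endpoints and payloads are made up for illustration): each path maps to a canned JSON response.

package main

import (
    "log"
    "net/http"
)

// Canned responses keyed by path; edit these while the real API is being built.
var stubs = map[string]string{
    "/api/users/me": `{"id": 1, "name": "Avi", "admin": true}`,
    "/api/places":   `[{"id": 10, "name": "Gogobot HQ"}]`,
}

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        body, ok := stubs[r.URL.Path]
        if !ok {
            http.NotFound(w, r)
            return
        }
        w.Header().Set("Content-Type", "application/json")
        w.Write([]byte(body))
    })

    // Listen where the real API would normally run so the client needs no changes.
    log.Println("stub API listening on :3000")
    log.Fatal(http.ListenAndServe(":3000", nil))
}

Swap the canned payloads for whatever the client expects, and point the client back at the real API once it exists.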
Closing thoughts
While writing this post, I probably got the same feeling you get (if you got to this point): is this all worth it? The answer is yes; this is what the internet looks like these days, a client talking to an API.
When you have an API you can serve all clients with the same logic; it’s actually a great way to make sure you have fewer bugs and deliver more stable code.
Feel free to discuss in the comments, if you have any questions/comments, I
would love to read them.
23 Sep 2015
I do a lot of Remote Pair Programming.
One of the things that people ask me about the most is what happens if my peer
needs to see the server/database or any other locally available resource that
is not easily accessible through the tmux session.
That is actually a great question.
When you pair using screen share, that’s obvious: you see everything I see. But when we pair using a terminal you basically have tunnel vision into my computer. You can’t see the (web) server I am working on, or some JSON result that describes the bug best.
When working with Rails, for example, you have your default server at localhost:3000, and when working with Angular and grunt server it’s at localhost:9000.
Without being able to see the web server, pairing will be very difficult.
Before I knew about SSH tunneling setup, I actually had a Skype session set up
with screen share just to see the browser, but now, I have a better solution.
Setting up SSH tunneling
Let’s say you want to access my localhost:3000 by going to localhost:3000 on your computer, just like you would if you worked locally.
ssh -L 3000:localhost:3000 user@your-peer-ip
This command will forward all traffic on your local port 3000 to port 3000 on the computer you are SSHing into.
If you already have a server running and port 3000 is busy, you can use any
other port.
I use this trick every single day, hope you find it as useful as I do.
Happy pairing!
18 Sep 2015
Pair programming is one of my favorite things to do. In many ways I feel way more productive when I pair with someone else, and the cliché that two heads are better than one seems to be right in this case (at least for me).
Remote pair programming is hard, especially if you use the wrong tools. Since I’ve been doing this for a long time, I developed a workflow and a set of tools around it that I think are worth sharing.
Note: I will not go into the discussion of “Why pair programming”, I think
it’s irrelevant for this post and there are many posts on the internet
discussing that point to exhaustion. I think this blog post discusses the ups/downs of pair programming well, read it if you want to learn/educate yourself more.
Tmux
I work with tmux a lot. One of the best things about tmux in the terminal is that once you share your session using SSH, the person pairing with you has access to all the tabs you have open for the project.
For example, if you pair on a rails project, you can see the console, server
logs, Redis and really just about anything else you have in that Tmux session.
I blogged about how I start work using tmux in the past in these posts:
- [My development workflow (vim+tmux+terminal+alfred) Awesomeness](http://avi.io/blog/2014/08/28/my-development-workflow-vim-tmux-terminal-awesomeness/)
- How I start working using Tmux
Essentially, I have a start.sh
file in every project root that opens all the
tabs/splits I need and gives the session a name.
Configuration
You can check out my tmux configuration in my dotfiles repository here
Note: This is only tested on Mac. I know there are some problems with using it on Linux, especially around attach-to-user-namespace and the use of pbcopy.
Here’s an example of what a tmux session looks like
Edit: Thanks to Reddit user tuxracer04 for suggesting tmate.io. I have tried it before and it’s a very good option for sharing your session; it requires less setup but gives you less fine-grained control. Worth a shot to test things out.
Vim
I use vim in terminal exclusively as my editor. Over time I tried every editor
out there, from Sublime to Atom and more.
Since I am using the terminal as my main work tool, it’s easy to just use vim inside tmux and share it; this simplifies the workflow of pairing quite a bit, since you don’t have to screen share your code and deal with delays.
With screen sharing, it’s hard to pair since you need to allow your peer to drive as well, anything from copy/pasting code to typing something they want.
Obviously, I do not want to turn this post into an editor war, but I really encourage you to use a terminal based editor, whether it’s emacs, vim or any other tool; it will make your life so much simpler.
Configuration
I don’t have any pair programming specific vim configuration, it’s just using my
dotvim configuration https://github.com/kensodev/dotvim.
Running tests
Whether I am coding in Go, Ruby or any other language for that matter, I always do TDD, no exceptions (yes, even for JavaScript).
Running tests is a huge part of the workflow when pairing and looking at
results of tests is definitely something we do a lot while pairing.
The usual “running tests” workflow is very disruptive: you need to look at another screen in order to view test results, or stop looking at code in order to do that.
By running everything in the terminal and using tmux, I can basically send the test results to another split.
Here’s an example of what this looks like
As you can see in the red rectangle, there’s a file called read_pipe.sh. Here are the contents of this file:
#!/bin/bash

DEFAULT_PIPE_NAME=".plumber"
PIPE_NAME="${1:-$DEFAULT_PIPE_NAME}"

if [ ! -p $PIPE_NAME ]; then
  echo "Created pipe ${PIPE_NAME}..."
  mkfifo $PIPE_NAME
fi

echo "Waiting for commands!"

while true; do
  COMMAND=$(cat $PIPE_NAME)
  echo Running ${COMMAND}...
  sh -c "$COMMAND"
done
This comes from vim-plumber, which allows you to send test results to a pipe file.
All I need to do is hit <leader>t or <leader>T in order to run tests; the tests will run in the background in another pane and I can continue working.
When pairing this is super powerful since you can just run tests and then look
at them, copy results, look at the code while looking at the results side by
side and more.
This saved me so much time over the period I’ve been programming I cannot
emphasize enough how important this workflow is.
What does TDD have to do with pairing
I know it’s weird seeing TDD in a remote pair programming workflow; however, I feel it’s absolutely crucial when you pair program, since TDDing forces you to slow down and explain your thought process.
When you TDD, it’s a perfect opportunity to discuss the design with your pair and understand how the system works.
SSH
In order to share tmux and your terminal you will need to allow SSH into your
machine. This is a bit tricky and requires just a bit more configuration beyond
your machine as well.
Router configuration
It is more than likely that you have a router in your home office, so behind your public IP there’s a list of internal IPs.
You need to configure your router so that whenever someone SSHes into your home address they are routed to a specific computer.
Every router is obviously different, but it looks like this
I have it configured so that my laptop has a fixed IP reserved on the router; the address never changes, which makes the configuration sticky.
Authorized keys
In order for someone to SSH into your machine you have to authorize their
public key.
BUT, remember that each tmux session has a name; you can actually limit people to a tmux session when they SSH, and if that tmux session doesn’t exist, they won’t be able to SSH into your machine:
command="tmux attach-session -t SESSION_NAME",no-port-forwarding,no-x11-forwarding,no-agent-forwarding KEY_TYPE KEY
SESSION_NAME is the tmux session name you set; KEY_TYPE and KEY are the key type and the key given to you by your peer.
Binding each key to a session name is a very good way to make sure that peers can only access sessions opened for them; if you don’t have the session open, the SSH will fail.
What I do is append _pair to my usual session names. Let’s say I have a session named gogobot; when I want to pair I will rename the session to gogobot_pair in order to work with one of my peers. When we’re done pairing, I will change the session name back to gogobot.
SSH Tunneling
One of the best things about SSHing into your machine is that now your peer can access your server, so you don’t even really need to share your screen to see the browser.
This blog post explains SSH tunneling really well, go read it.
Voice / Screen
When everything is in the terminal, this is by far the least important part.
Personally, I use Skype for voice and screen sharing, but I have also used Google Hangouts and other solutions.
When my peer has trouble setting up the tunneling, we can also look at the
browser together.
Closing thoughts
Before I moved to a 100% terminal based workflow, remote pair programming was
close to impossible.
There are many solutions that try to solve this problem; Screen Hero was one of the best ones before they shut down the product.
https://atom.io/packages/motepair looks
nice if you are using Atom as well.
This post will not be complete without mentioning Joe’s Remote Pair Programming website, this was an inspiration to me for a long time and includes many tips and tricks for making it work.
I hope you enjoyed this post. Feel free to leave a comment with questions/feedback, I’d love to hear what you have to say.
16 Sep 2015
With one of my recent tasks, I had to read Gzipped JSON files from a directory.
This proved to be pretty easy (as a lot of things with Go are), but I thought it would be useful to blog about it since many people likely need this:
package main

import (
    "bufio"
    "compress/gzip"
    "log"
    "os"
)

func main() {
    filename := "your-file-name.gz"

    file, err := os.Open(filename)
    if err != nil {
        log.Fatal(err)
    }

    gz, err := gzip.NewReader(file)
    if err != nil {
        log.Fatal(err)
    }

    defer file.Close()
    defer gz.Close()

    scanner := bufio.NewScanner(gz)
}
From there, you can do whatever you want with scanner. I needed to read the lines one by one in order to parse each of them as JSON.
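As written, the snippet above won’t actually compile until scanner is used (Go treats unused variables as errors), so here’s one way the complete program might look. The JSON handling is just a sketch, unmarshalling each line into a generic map; a real program would use a struct matching its own JSON.

package main

import (
    "bufio"
    "compress/gzip"
    "encoding/json"
    "log"
    "os"
)

func main() {
    file, err := os.Open("your-file-name.gz")
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()

    gz, err := gzip.NewReader(file)
    if err != nil {
        log.Fatal(err)
    }
    defer gz.Close()

    // Scan the decompressed stream line by line; each line is a JSON document.
    scanner := bufio.NewScanner(gz)
    for scanner.Scan() {
        var record map[string]interface{}
        if err := json.Unmarshal(scanner.Bytes(), &record); err != nil {
            log.Printf("skipping bad line: %v", err)
            continue
        }
        log.Println(record)
    }

    if err := scanner.Err(); err != nil {
        log.Fatal(err)
    }
}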
Keep Hacking!
16 Sep 2015
If you are a regular reader of this blog or even follow my rants on Twitter (@KensoDev), you know I am an automation freak; if I do something more than a couple of times a day, I will definitely automate it ASAP.
Since I’ve started working with Go, one of the things I’ve been missing the
most is an automated way to start testing a file.
I am talking about the plumbing: the initial package name, the Suite, etc…
I created a couple of vim snippets for this that save me a lot of time, and I thought they’re worth sharing:
snippet gotestclass
	package ${1}

	import (
		"github.com/stretchr/testify/assert"
		"github.com/stretchr/testify/mock"
		. "gopkg.in/check.v1"
		"testing"
	)

	func Test${2}(t *testing.T) { TestingT(t) }

	type $2Suite struct{}

	var _ = Suite(&$2Suite{})

	func (s *$2Suite) Test${3}(c *C) {
	}

snippet gotest
	func (s *${1}Suite) Test${2}(c *C) {
	}
If you are interested in this sort of automation and shortcuts, you can also check out my dotfiles on GitHub at kensodev/dotfiles, or my vim settings at kensodev/dotvim.
24 Aug 2015
Intro
One of my latest projects at Gogobot is an Angular
application (mostly for internal editorial use).
Right off the bat, I decided that this would be a completely separate app, based on Angular on the client side and Rails 4 on the server side.
The first challenge you go through when you have an application built this way
is “How am I going to manage authentication?”
One of the best resources I have encountered on the subject is Techniques for authentication in AngularJS applications; however, it only explains the client side part of things and leaves out the server side setup.
Most of the solutions I tried did not work or were not current for Rails 4.
CORS
The one thing that is challenging here is setting up CORS for Angular and Rails to work together, but first, let’s explain what the problem is.
Our Rails app is started on http://localhost:3000 and the Angular grunt server is running on http://localhost:9000.
This is what’s called a Cross Origin request, meaning, it’s not coming from the
same origin.
Before the browser makes a POST request, it will send an OPTIONS request to the same URL. That response is then checked to see whether this origin is allowed on the server.
If for some reason the OPTIONS request fails, the browser will not attempt the POST request.
Here’s what it looks like on the browser
And here’s the server’s response with the important part highlighted
Setting up your rails application
The Rails application is a pretty standard Rails application. Once the application is created with rails new your-app-name I just added gem 'devise' to the Gemfile and ran bundle install.
This is all pretty basic and you just need to follow the README from the devise
repository on Github.
If you Google “Rails 4 CORS” or anything remotely resembling that search term,
you will likely reach articles/blogs that recommend you include rack-cors
and
some configuration.
This did not work for me at all so I ended up rolling my own with some tips
from @elado
After the app is set up I created a file lib/cors.rb
class Cors
  def initialize(app)
    @app = app
  end

  def call(env)
    status, headers, body = @app.call(env)

    headers['Access-Control-Allow-Origin'] = '*'
    headers['Access-Control-Allow-Methods'] = "GET, POST, PUT, DELETE, OPTIONS"
    headers['Access-Control-Allow-Headers'] = "Origin, X-Requested-With, Content-Type, Accept, Authorization"

    [status, headers, body]
  end
end
In ApplicationController
I have this code
before_filter :set_headers

def set_headers
  headers['Access-Control-Allow-Origin'] = '*'
  headers['Access-Control-Allow-Methods'] = "GET, POST, PUT, DELETE, OPTIONS"
  headers['Access-Control-Allow-Headers'] = "Origin, X-Requested-With, Content-Type, Accept, Authorization"
end
NOTE: in the Access-Control-Allow-Origin, instead of the * it is very important that you put your domain (where the client code will run from). I do not recommend having * in your production code.
My SessionsController
looks like this
class SessionsController < Devise::SessionsController
  respond_to :json, :html

  def destroy
    current_user.authentication_token = nil
    super
  end

  def options
    set_headers
    render :text => '', :content_type => 'text/plain'
  end

  protected

  def verified_request?
    request.content_type == "application/json" || super
  end
end
Notice that I configured devise to respond to json
requests as well
In routes.rb
I have devise configured like this
devise_for :users, controllers: { sessions: "sessions" }

devise_scope :user do
  match "/users/sign_in" => "sessions#options", via: :options
end
Now that you have all the code in place, you just need to make sure your app uses the new middleware.
In config.ru
just add these lines before the run
require 'cors'
use Cors
Now you are ready to authenticate Angular with Devise.
If you follow the medium post I linked to, you should be up and running in
minutes.
Enjoy!
21 Jul 2015
Introduction
One of my favorite parts about Postgres is that you can have array columns, either text arrays or integer arrays.
This is very useful for querying data without joining.
Models
Model Design
This is a pretty common model and table design: you have a model representing a schedule and a model representing the actual events.
For example: “Hacking with Avi” has multiple schedules in the following couple of days, at different venues, with different capacities for attendees.
Querying
The default approach for querying this will be to join the Event
with the EventSchedule
and query the scheduled_on
column.
However, my preferred approach would be to cache the scheduled_on column on the Event table.
I am adding a column called schedules to the Event table; that column is of type integer[] with a default of [].
Let’s take this Ruby code for example:
event.schedules = event.schedules.collect { |schedule| schedule.scheduled_on.to_i }
This will give us something like this:
[1438059600, 1438619400, 1437973200, 1438014600, 1438578000, 1438664400]
Notice that I am converting the date into an integer.
If you read about array functions in Postgres you see that it’s not really trivial to query for greater-than on the array elements.
The intarray module is a bit more useful with the functions it provides, but it still doesn’t provide what I really need.
The solution
The solution turns out to be pretty simple.
Let’s say you have a date in integer form, 1437497413; you can do this:
select name from events where 1437497413 < any(schedules);
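For completeness, here’s how you might run that same query from application code. This is a minimal sketch in Go using database/sql and the github.com/lib/pq driver; the connection string and table are assumptions made to match the example above.

package main

import (
    "database/sql"
    "fmt"
    "log"

    _ "github.com/lib/pq" // Postgres driver
)

func main() {
    db, err := sql.Open("postgres", "dbname=events_dev sslmode=disable")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Same trick as the SQL above: match rows where at least one array element
    // is greater than the cutoff timestamp.
    cutoff := int64(1437497413)
    rows, err := db.Query("SELECT name FROM events WHERE $1 < any(schedules)", cutoff)
    if err != nil {
        log.Fatal(err)
    }
    defer rows.Close()

    for rows.Next() {
        var name string
        if err := rows.Scan(&name); err != nil {
            log.Fatal(err)
        }
        fmt.Println(name)
    }
}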
Bonus
One of the other things that is very common when you are working with integer arrays is sorting by one of the elements (either the min or the max).
For example, I want to sort by the dates.
Here’s what you can do
select (sort(schedules))[1] min_date from events where 1437453975 < any(schedules) order by min_date DESC;
Conclusion
As you can see, it’s pretty easy to manipulate and query array elements in Postgres. I encourage you to embrace the power of it and use it in your application; it will scale better and will make the data modeling easier for you.
Questions? Feedback?
Questions? Feedback? Feel free to discuss in the comments
20 Jul 2015
Introduction
These days, with micro-services and multiple dependencies, “starting” work can be tricky.
For example: just running Gogobot means running rails c, rails s, foreman, spork, ./read_pipe.sh and sometimes even more.
Just getting to work in the morning or after a restart, remembering everything you need to run can be difficult. And that’s just the main project, not counting the Chef repo, all the microservices and more; well, you get the drift.
Since I am working on many projects at any given time, I always have a file called start.sh in the root of each project directory.
This way, when I want to start working on a project I just cd ~/Code/gogo && start.sh.
This start.sh
script is simple bash that automates Tmux.
Here’s an example:
tmux new-session -d -s gogobot
tmux split-window -t gogobot:1 -v
tmux split-window -t gogobot:1.2 -h
tmux rename-window main
tmux send-keys -t gogobot:1.1 "vim ." "Enter"
tmux send-keys -t gogobot:1.2 "bundle exec rails c" "Enter"
tmux send-keys -t gogobot:1.3 "./read_pipe.sh" "Enter"
tmux new-window -t gogobot:2
tmux select-window -t gogobot:2
tmux rename-window server
tmux send-keys -t gogobot:2 "rm -f log/development.log && rm -f log/test.log && bundle exec rails server thin" "Enter"
tmux new-window -t gogobot:3
tmux select-window -t gogobot:3
tmux rename-window foreman
tmux send-keys -t gogobot:3 "bundle exec foreman start" "Enter"
tmux new-window -t gogobot:4
tmux select-window -t gogobot:4
tmux rename-window spork
tmux send-keys -t gogobot:4 "bundle exec spork" "Enter"
tmux new-window -t gogobot:5
tmux select-window -t gogobot:5
tmux rename-window proxy
tmux send-keys -t gogobot:5 "cd ~/Code/go/src/github.com/kensodev/go-solr-proxy/cmd/proxy && sh run_local_example.sh" "Enter"
tmux select-window -t gogobot:1
tmux attach -t gogobot
Explaining the parts
tmux new-session -d -s gogobot
will tell tmux to start a new session, detach from it and will name it gogobot
for reference in the future.
tmux split-window -t gogobot:1 -v
will tell tmux to split window #1 in the gogobot
session and -v
means vertically.
tmux split-window -t gogobot:1.2 -h
tells tmux to split pane 2
in window 1
and -h
as you probably already figured means horizontally.
There’s also send-keys, which is pretty self explanatory, rename-window and more.
I encourage you to read tmux reference, there are a lot of very useful tricks in there.
Final thoughts
As you can see, it’s pretty easy to automate your workflow around the terminal. I wrote about it some more in My development workflow (vim+tmux+terminal+alfred) Awesomeness.
It’s worth investing in automating your workflow; invest the time in sharpening your skills and making your tools work better for you.
01 Jul 2015
Intro
Getting started with Pig/Hadoop on EMR or any other platform can be a pretty daunting task. I found that having a workflow really helps; having everything laid out gets you going much more smoothly.
So today, I am releasing pig-herder, a production ready workflow for Pig/Hadoop on Amazon EMR.
What Pig-Herder includes
- Sane and friendly directory structure (Both on local and on S3).
- Pig script template.
- EMR launcher (with the AWS CLI).
- Java UDF template with a working schema and full unit test support.
- Compilation instructions to make sure it works on EMR.
The workflow
Directory Structure
Organization is super important in almost everything we do as engineers, but
organizing the directory structure to work with pig is crucial (at least for me).
Every Pig/Hadoop workflow has its own directory with the same directory structure, which looks like this:
├── README.md
├── bootstrap.sh
├── data
│ └── small-log.log
├── jars
│ ├── mysql-connector-java-5.1.35-bin.jar
│ ├── pig-herder.jar
│ ├── piggybank-0.12.0.jar
├── launcher
│ ├── submit-steps.sh
│ └── submit-to-amazon.sh
├── lib
│ └── mysql-connector-java-5.1.35-bin.jar
├── pig-herder-udf
│ ├── internal_libs
│ │ └── date-extractor.jar
│ ├── libs
│ │ ├── hadoop-core-1.2.1.jar
│ │ ├── hamcrest-core-1.3.jar
│ │ ├── junit-4.12.jar
│ │ └── pig-0.12.0.jar
│ ├── main
│ │ ├── main.iml
│ │ └── src
│ │ └── io
│ │ └── avi
│ │ ├── LineParser.java
│ │ ├── LogLineParser.java
│ │ └── QueryStringParser.java
│ ├── out
│ │ ├── artifacts
│ │ │ └── pig_herder
│ │ │ └── pig-herder.jar
│ │ └── production
│ │ ├── main
│ │ │ └── io
│ │ │ └── avi
│ │ │ ├── LineParser.class
│ │ │ ├── LogLineParser.class
│ │ │ └── QueryStringParser.class
│ │ └── tests
│ │ └── io
│ │ └── avi
│ │ └── tests
│ │ ├── LogLineParserTest.class
│ │ └── QueryStringParserTest.class
│ ├── pig-herder-udf.iml
│ └── tests
│ ├── src
│ │ └── io
│ │ └── avi
│ │ └── tests
│ │ ├── LogLineParserTest.java
│ │ └── QueryStringParserTest.java
│ └── tests.iml
├── prepare.sh
├── production_data
├── script.pig
├── start_pig.sh
└── upload.sh
data
All the data you need to test the pig script locally. This usually includes a few log files, flat files and nothing more. This is basically a trimmed down version of what’s in production.
Let’s say I want to run on Nginx logs in production; this will include 50K lines from a single log file. It’s a good enough sample that I can be sure it’ll work well on production data as well.
production_data
Same as data but a bigger sample, this will usually be 100K items exported from production, not from local/staging.
This gives a better idea whether we need to sanitize after export for example.
jars
All the jars that the pig script depends on. The prepare.sh script will copy the compiled jar to this folder as well.
launcher
Launcher dir usually includes a couple of files, one that will boot the EMR cluster and another to submit the steps.
submit-to-amazon.sh
date_string=`date -v-1d +%F`
echo "Starting process on: $date_string"
cluster_id=`aws emr create-cluster --name "$CLUSTER_NAME-$date_string" \
--log-uri s3://$BUCKET_NAME/logs/ \
--ami-version 3.8.0 \
--applications Name=Hue Name=Hive Name=Pig \
--use-default-roles --ec2-attributes KeyName=$KEY_NAME \
--instance-type m3.xlarge --instance-count 3 \
--bootstrap-action Path=s3://$BUCKET_NAME/bootstrap.sh | awk '$1=$1' ORS='' | grep ClusterId | awk '{ print $2 }' | sed s/\"//g | sed s/}//g`
echo "Cluster Created: $cluster_id"
sh submit-steps.sh $cluster_id $date_string CONTINUE
submit-steps.sh
cluster_id=$1
date_string=$2
after_action=$3
aws emr add-steps --cluster-id $cluster_id --steps "Type=PIG,Name=\"Pig Program\",ActionOnFailure=$after_action,Args=[-f,s3://$BUCKET_NAME/script.pig,-p,lib_folder=/home/hadoop/pig/lib/,-p,input_location=s3://$BUCKET_NAME,-p,output_location=s3://$BUCKET_NAME,-p,jar_location=s3://$BUCKET_NAME,-p,output_file_name=output-$date_string]"
These will start a cluster and submit the steps to Amazon. Keep in mind that you can also start a cluster that will be terminated when idle, and also terminate if a step fails.
You can just pass --auto-terminate in submit-to-amazon.sh and TERMINATE_CLUSTER to submit-recalc-poi.sh, for example.
README
Very important. You have to make the README as useful and as self explanatory as possible.
If another engineer needs to ask a question or doesn’t have all the flow figured out, you failed.
bootstrap.sh
bootstrap.sh
is a shell script to bootstrap the cluster.
Usually, this only includes downloading the mysql
jar to the cluster so pig can insert data into mysql when it’s done processing.
jar_filename=mysql-connector-java-5.1.35-bin.jar
cd $PIG_CLASSPATH
wget https://github.com/gogobot/hadoop-jars/raw/master/$jar_filename
chmod +x $jar_filename
We have a GitHub repo with the public jars we want to download. It’s really a convenient way to distribute public jars that you need.
Upload.sh
Upload everything you need to Amazon.
This step depends on having s3cmd installed and configured
echo "Uploading Bootstrap actions"...
s3cmd put bootstrap.sh s3://$BUCKET_NAME/ >> /dev/null
echo "Uploading Pig Script"...
s3cmd put script.pig s3://$BUCKET_NAME/ >> /dev/null
echo "Uploading Jars..."
s3cmd put jars/pig-herder.jar s3://$BUCKET_NAME/ >> /dev/null
echo "Finished!"
Some projects include many jars, but usually I try to keep it simple: just my UDF and often piggybank.
start_pig.sh
Starting pig locally with the same exact params that will be submitted to production in the submit-steps.sh
file.
This means, you can work with the same script on production and local, making testing much easier.
UDF Directory structure.
The directory structure for the Java project is also pretty simple. It includes 2 modules: main and tests. But after a very long time experimenting, I found that the dependency and artifact settings are crucial to making the project work in production.
Here’s how I start the project in IntelliJ. I believe the screenshots will be enough to convey most of it.
I start the project with just the hello world template (how fitting)
Starting the project
I then go to add the modules main and tests.
Adding modules
I have 2 library folders: lib, which does not get exported in the jar file, and internal_libs, which does.
tests and main both depend on both of them. tests depends on main, obviously, for the classes it tests.
Artifacts only output the main module and all the extracted classes from internal_libs. Do not export or output the libs folder, or it will simply not work on EMR.
Artifacts
Get going
Once you familiarize yourself with the structure, it’s really easy to get going.
pig-herder
includes a full working sample for analyzing some logs for impression counts. Nothing crazy but really useful to get going.
Enjoy (Open Source)
pig-herder is open source on github here: https://github.com/kensodev/pig-herder
Feel free to provide feedback/ask questions.
13 Jun 2015
Using 3rd party APIs is a part of almost any company these days; I think Facebook and Twitter integrations are a part of nearly every other startup out there.
Oftentimes, those 3rd party APIs fail completely, perform miserably or hit some other situation you can’t control.
Just looking at this Facebook Graph API health history for the past 90 days proves that even the most reliable giants can fail at times
Facebook Graph API health history
Dealing with 3rd party API failures is tricky, especially when you have to rely on them for signup/signin or any other crucial services.
In this post, I would like to discuss a few scenarios we had to deal with while scaling the product.
Dealing with latency
Latency in 3rd party APIs is a reality you have to deal with; as I mentioned above, you simply cannot control it. There’s no way you can.
But you can make sure your users don’t suffer from it; you can bypass its limitations.
Most of our usage of the Facebook platform is posting your Postcards for your friends to see on Facebook.
This is far from being a real-time action, so we can afford to take it out of the user thread and deal with it there.
Here’s a basic flow of what happens when you post a postcard
Creating a Postcard on Gogobot
What we do is pretty basic really, no rocket science or secrets: we deal with the postcard in parts, allowing the user the best experience possible; the user gets feedback immediately.
Postcard Feedback
The feedback we give to the user already includes the score he/she got for sharing the postcard, even though the sharing process can take another 1-3 seconds after that feedback; the user does not have to wait.
Dealing with Failures
Looking at the diagram you can see that if the job fails, we simply add it back to the queue and retry it. Every job in the queue has a retry count.
After it has exhausted all its tries, it will go to the “failed” queue and will be processed further from there if we decide to do so.
We use Sidekiq for a lot of the background workers.
With Sidekiq, it’s just a simple worker.
module Workers
  module Social
    class FbOpenGraphWorker
      @queue = :external_posts

      include Sidekiq::Worker

      sidekiq_options :retry => 3
      sidekiq_options :queue => @queue
      sidekiq_options :unique => true
      sidekiq_options :failures => :exhausted

      def perform(model, id)
        item = model.constantize.find(id)
        Facebook::OpenGraph.post_action(item)
      end
    end
  end
end
It’s the same for almost any API we use in the background, whether it’s Twitter, Facebook, weather and more.
Dealing with 3rd party security
Booking Providers
A few months ago, we started rolling out a way to book hotels through the website; we’ve been scaling this ever since, in multiple aspects.
The part that was the most challenging to deal with is the 3rd party security demands.
One of the demands was to whitelist all server IPs that call the API.
Here’s a basic diagram of how it used to work
Gogobot Booking Provider requests
This was basically just a Net::HTTP
call from each of the servers that needed that API call.
From the get go, this created issues for us, since we scale our servers almost on a weekly basis. We replace machines, we add more machines, etc.
Looking at this, you immediately realize that working this way, someone else controls your scaling: someone else controls how fast you can add new servers and how fast you can respond to your growth. That was simply unacceptable; this was a single point of failure we just had to fix ASAP.
Now, I must mention here that we did not have any disasters happen because of this, since we had a pool of IPs we switched around, but it definitely wasn’t something we could work with for the long run.
The solution and implementations
The solution is basically having a proxy, something you can control and scale at your own pace.
API proxy
All servers will call that proxy and it will handle the requests to the providers and return the response back to the requesting server.
This solution is great, but it introduces a few challenges and questions that need to be asked.
How many calls do we have from each server to providers?
You already know I believe in knowing that.
When we started off, we had no idea how many calls we had, how many failed, how many succeeded…
Single point of failure
Now that all servers will call the proxy, this introduces a harsh point of failure, if this proxy fails, all calls to booking will fail, resulting in actual money loss.
Most of the time, when you work on solutions, it’s hard to connect the money and the code; it’s easy to forget that errors can and will cause the company to lose money (no pressure here though :)).
Monitoring
We had to monitor this really well, log this really well and make sure it works 100% of the time, no exceptions.
Implementation
I will try to keep this on point and not go too deep into the solution, but still, I think it’s worth looking a bit deeper into what we tried, what failed and what succeeded.
Solution #1
I started off trying to look into what other people are working with.
I found Templar
This had everything going for it.
- No dependencies
- Written in Go (see first point), multi-threading built in
- Tested
- From a proven member of the community
Finding something that works at scale is often hard, but this looked to be really reliable.
I wrote a chef cookbook (open sourced here), created a server and had 2 of our front-end servers calling it.
We launched this on Apr 7; on Apr 8, it blew up.
What started happening is that we saw requests that just never returned; they were simply stuck.
I opened the issue that day, Evan was very helpful, but at this point, this was a production issue and I did not have the time to deal with it further.
As I also mentioned earlier, I didn’t have enough data to deal with the bug well enough. I didn’t know how many calls we had and what the peak was. (That’s why you can only trust data.)
We abandoned that solution (for now)
Solution #2
If there’s something that serves us well every day, it’s Nginx.
Nginx is a huge part of our technology stack; it serves all the web/API requests, and internal micro-services use it for load balancing as well.
We trust it, we love it and I thought it would be an amazing fit.
This is basically a forward proxy. Nginx does not support forward proxying for HTTPS calls, so we had to work around this part creatively.
We ended up with something like this.
Servers call providers.gogobot.com (internal DNS only) with the provider name (say, Expedia).
So the call goes out to providers.gogobot.com/expedia; this internally maps to the Expedia endpoint and does the proxying.
As mentioned earlier, we use Chef for everything, so this ended up being a pretty simple configuration like this:
<% @proxy_locations.each do |proxy_location| %>
  location <%= proxy_location['location'] %> {
    proxy_pass <%= proxy_location['proxy_pass'] %>;
  }
<% end %>
This is a part of the nginx.conf template for the server. The @proxy_locations variable is a pretty simple Array of Hashes that looks like this:
[
  {
    location: '/expedia',
    proxy_pass: 'EXPEDIA_API'
  }
]
This way, we didn’t have to change anything in our application code except the endpoint for the APIs, which is of course pretty straightforward.
Now we can take the access log from Nginx and send it over to our Logstash servers, and we have monitoring/logging all wired up basically for free.
Logstash Proxy
This solution has been live in production for the past couple of months now and has proven to be super stable. We have not had to deal with any extraordinary failures.
Since then we added alerting to Slack and SMS as well, so we know immediately if this service fails.
Summing up
I tried to keep this post short and to the point, there are many factors at play when you want to get this done.
We have a lot going on around these decisions, mainly around monitoring and maintenance of these solutions.
With this post, I wanted to touch on some of the decisions and on the process of scaling a product in production, not on the nitty-gritty technical details. Other posts may come in the future.
Learn, observe and roll your own
As you can see here, there are multiple ways to deal with using APIs from your application, what works for us will not necessarily work for you.
Implementing monitoring and logging early on will prove fruitful: when the first solution breaks (and it often does), you will have enough data to make an educated decision.
If you have real life examples, let me know in the comments.
Questions are also great; if you have any question regarding the implementation, let me know in the comments as well.