Scaling Gogobot - Using 3rd party APIs
13 Jun 2015

Using 3rd party APIs is a part of almost every company these days; Facebook and Twitter integrations alone are part of nearly every startup out there.
Oftentimes, those 3rd party APIs fail completely, perform miserably, or break in some other way you can't control.
Just looking at this Facebook Graph API status history for the past 90 days proves that even the most reliable giants can fail at times.
Dealing with 3rd party API failures is tricky, especially when you have to rely on them for signup/signin or any other crucial services.
In this post, I would like to discuss a few scenarios we had to deal with while scaling the product.
Dealing with latency
Latency in 3rd party APIs is a reality you have to deal with. As I mentioned above, you simply cannot control it; there's no way you can.
But you can make sure your users don't suffer from it; you can work around its limitations.
Most of our usage of the Facebook platform is posting your Postcards for your friends to see on Facebook.
This is far from being a real-time action, so we can afford to take it out of the user's request thread and deal with it in the background.
Here’s a basic flow of what happens when you post a postcard
What we do is pretty basic really, no rocket science or secrets: we deal with the postcard in parts, giving the user the best possible experience, and the user gets feedback immediately.
The feedback we give the user already includes the score he/she got for sharing the postcard; even though the actual share can take another 1-3 seconds after that feedback, the user does not have to wait for it.
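To make that concrete, here is a minimal sketch of the idea (not our actual code; the controller, score helper and parameter names are hypothetical): the request that creates the postcard responds right away with the score, and only enqueues the Facebook share for the background worker shown further down.

class PostcardsController < ApplicationController
  def create
    postcard = current_user.postcards.create!(postcard_params)

    # The user gets their feedback (including the score) immediately...
    score = award_share_score(current_user, postcard)

    # ...while the actual Facebook call runs later in a Sidekiq worker.
    Workers::Social::FbOpenGraphWorker.perform_async(postcard.class.name, postcard.id)

    render json: { postcard_id: postcard.id, score: score }
  end
end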
Dealing with Failures
Looking at the diagram, you can see that if the job fails, we simply add it back to the queue and retry it. Every job in the queue has a retry count.
After it has exhausted all of its retries, it goes to the "failed" queue and can be processed further from there if we decide to do so.
We use Sidekiq for a lot of the background workers.
With Sidekiq, it’s just a simple worker.
module Workers
  module Social
    class FbOpenGraphWorker
      @queue = :external_posts

      include Sidekiq::Worker

      sidekiq_options :retry => 3
      sidekiq_options :queue => @queue
      sidekiq_options :unique => true          # via the sidekiq-unique-jobs gem
      sidekiq_options :failures => :exhausted  # via the sidekiq-failures gem

      # model is the class name as a string (e.g. "Postcard"), id is the record id;
      # we look the record up and post the Open Graph action to Facebook.
      def perform(model, id)
        item = model.constantize.find(id)
        Facebook::OpenGraph.post_action(item)
      end
    end
  end
end
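On top of the retry behavior, Sidekiq lets you hook into the moment a job runs out of retries. A minimal sketch of what could go inside such a worker (the log message is just illustrative):

# Inside the worker class: runs once the job has exhausted all of its retries,
# so we can log it, alert on it, or push it somewhere for manual processing.
sidekiq_retries_exhausted do |msg|
  Sidekiq.logger.error(
    "FbOpenGraphWorker gave up on #{msg['args'].inspect}: #{msg['error_message']}"
  )
end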
The same goes for almost any API we use in the background, whether it's Twitter, Facebook, weather data and more.
Dealing with 3rd party security
A few months ago, we started rolling out a way to book hotels through the website, and we've been scaling it ever since, in multiple aspects.
The part that was the most challenging to deal with was the 3rd party security demands.
One of the demands was to whitelist all server IPs that call the API.
Here’s a basic diagram of how it used to work
This was basically just a Net::HTTP call from each of the servers that needed that API call.
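Stripped down, every server was doing something along these lines (the provider host, path and parameters here are placeholders, not a real provider API):

require 'net/http'
require 'uri'

# Hypothetical direct call: each front-end server hits the provider itself,
# which means every single server IP has to be whitelisted by the provider.
uri = URI('https://api.example-provider.com/hotels/availability')
uri.query = URI.encode_www_form(hotel_id: 42, checkin: '2015-07-01')

response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
  http.get(uri.request_uri)
end

puts response.code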
From the get-go, this created issues for us, since we scale our servers almost on a weekly basis: we replace machines, we add more machines, etc.
Looking at this, you immediately realize that working this way, someone else controls your scaling: someone else controls how fast you can add new servers and how fast you can respond to your growth. That was simply unacceptable; this was a single point of failure we just had to fix A.S.A.P.
Now, I must mention that we did not have any disasters happen because of this, since we supplied a pool of IPs that we switched around, but it definitely wasn't something we could work with in the long run.
The solution and implementations
The solution is basically having a proxy, something you can control and scale at your own pace.
All servers will call that proxy and it will handle the requests to the providers and return the response back to the requesting server.
This solution is great, but it introduces a few challenges and questions that need to be asked.
How many calls do we have from each server to providers?
By now you know I believe in knowing these numbers.
When we started off, we had no idea how many calls we had, how many failed, how many succeeded…
Single point of failure
Now that all servers call the proxy, this introduces a harsh single point of failure: if the proxy fails, all booking calls fail, resulting in actual money loss.
Most of the time, when you work on solutions, it's hard to connect the money to the code, and it's easy to forget that errors can and will cause the company to lose money. No pressure here though :).
Monitoring
We had to monitor this really well, log this really well and make sure it works 100% of the time, no exceptions.
Implementation
I will try to keep this on point and not go too deep into the solution, but I still think it's worth looking a bit deeper into what we tried, what failed and what succeeded.
Solution #1
I started off trying to look into what other people are working with.
I found Templar
This had everything going for it.
- No dependencies
- Written in Go (see first point), with multi-threading built in
- Tested
- From a proven member of the community
Finding something that works at scale is often hard, but this looked to be really reliable.
I wrote a chef cookbook (open sourced here), created a server and had 2 of our front-end servers calling it.
We launched this on Apr 7; on Apr 8, it blew up.
What happened was that we started seeing requests that just never returned; they were simply stuck.
I opened the issue that day, and Evan was very helpful, but at that point this was a production issue and I did not have time to deal with it further.
As I also mentioned earlier, I didn't have enough data to deal with the bug properly. I didn't know how many calls we had or what the peak was. (That's why you can only trust data.)
We abandoned that solution (for now).
Solution #2
If there's one thing that serves us well every day, it's Nginx.
Nginx is a huge part of our technology stack; it serves all the web/API requests, and internal micro-services also use it for load balancing.
We trust it, we love it and I thought it would be an amazing fit.
What we needed is basically a forward proxy, but Nginx does not support forward proxying for HTTPS calls, so we had to work around this part creatively.
We ended up with something like this.
Servers call providers.gogobot.com (internal DNS only) with the provider name (say, Expedia).
So the call goes out to providers.gogobot.com/expedia, which internally maps to the Expedia endpoint and proxies the request.
As mentioned earlier, we use chef for everything, so this ended up being a pretty simple configuration like this:
<% @proxy_locations.each do |proxy_location| %>
  location <%= proxy_location['location'] %> {
    proxy_pass <%= proxy_location['proxy_pass'] %>;
  }
<% end %>
This is part of the nginx.conf template for the server. The @proxy_locations variable is a pretty simple Array of Hashes that looks like this:
[
  {
    location: '/expedia',
    proxy_pass: 'EXPEDIA_API'
  }
]
This way, we didn't have to change anything in our application code except the endpoint for the APIs, which is of course pretty straightforward.
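In our case the application-side change really was just the hostname; a rough sketch of what a call looks like through the proxy (the path and parameters are made up for the example):

require 'net/http'
require 'uri'

# Instead of calling the provider directly, the server calls the internal
# proxy host, and Nginx forwards the request to the real provider endpoint.
uri = URI('http://providers.gogobot.com/expedia/hotels/availability')
uri.query = URI.encode_www_form(hotel_id: 42, checkin: '2015-07-01')

response = Net::HTTP.get_response(uri)
puts response.code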
Now we can take the access log from Nginx and ship it over to our Logstash servers, and we have monitoring/logging all wired up basically for free.
This solution has been live in production for the past couple of months now, and it has proven to be super stable. We have not had to deal with any extraordinary failures.
Since then we added alerting to Slack and SMS as well, so we know immediately if this service fails.
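The Slack part of that alerting can be as small as a POST to an incoming webhook; a minimal sketch (the webhook URL is a placeholder, and the health check that triggers it is up to you):

require 'net/http'
require 'uri'
require 'json'

# Post an alert message to a Slack incoming webhook.
def notify_slack(message)
  uri = URI('https://hooks.slack.com/services/XXX/YYY/ZZZ')
  Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
    request = Net::HTTP::Post.new(uri.request_uri, 'Content-Type' => 'application/json')
    request.body = { text: message }.to_json
    http.request(request)
  end
end

notify_slack('Provider proxy health check failed on providers.gogobot.com')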
Summing up
I tried to keep this post short and to the point; there are many factors at play when you want to get this done.
We have a lot going on around these decisions, mainly around monitoring and maintenance of these solutions.
With this post, I wanted to touch on some of the decisions and on the process of scaling a product in production, not on the nitty-gritty technical details. Other posts may come in the future.
Learn, observe and roll your own
As you can see, there are multiple ways to deal with using APIs from your application; what works for us will not necessarily work for you.
Implementing monitoring and logging early on will prove fruitful, because when your first solution breaks (and it often does), you will have enough data to make an educated decision.
If you have real-life examples, let me know in the comments.
Questions are also great; if you have any questions regarding the implementation, let me know in the comments as well.