Writing an AWS Lambda function with Golang

About 10 days ago there was a discussion on HN about Golang.

I made a comment about how I love Go and what I do with it, and people followed up with some questions.

One of the things that sparked the most curiosity was that I said I write AWS Lambda functions in Go with a thin Node.js wrapper around them.

I quickly open-sourced SNS Lambda notifier Golang as a response to the comment, but I wanted to follow up on some of the workflows and how it’s actually done.

What I absolutely love about AWS Lambda is that it can be triggered by other AWS resources like SNS notifications, S3 events and more. If you think of the workflows you can achieve with this, you can easily see that powerful applications can be built on top of it.

Let’s move on though and focus on the code, since that’s what we’re here for :)

For our example, let’s create a Lambda function that subscribes to an SNS notification.

To be exact, let’s say we want to get a Slack notification every time a deployment fails on ElasticBeanstalk so that the engineers can check it out.

SNS Message

This is what an SNS message looks like:

{
    "Records": [
        {
            "EventSource": "aws:sns",
            "EventVersion": "1.0",
            "EventSubscriptionArn": "",
            "Sns": {
                "Type": "Notification",
                "MessageId": "",
                "TopicArn": "",
                "Subject": "AWS Elastic Beanstalk Notification - New application version was deployed to running EC2 instances",
                "Message": "Timestamp: Wed Mar 30 18:28:06 UTC 2016\nMessage: Failed to deploy application..\n\nEnvironment: staging-api\nApplication: app-api\n\nEnvironment URL:",
                "Timestamp": "2016-03-30T18:28:54.268Z",
                "SignatureVersion": "1",
                "Signature": "",
                "SigningCertUrl": "",
                "UnsubscribeUrl": "",
                "MessageAttributes": {}
            }
        }
    ]
}

Parsing the message

I wrote a quick SNS message parser here: https://github.com/KensoDev/sns-parser

This parser implements a method called IncludesMessage to check whether the SNS payload includes some string in the message. Pretty simple and straightforward.
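
Under the hood, a parser like this only needs a couple of structs that mirror the payload above plus a substring check. Here is a minimal sketch of the idea (the real sns-parser is more complete; the field names here are simply assumed from the JSON above):

package snsparser

import (
	"encoding/json"
	"strings"
)

// SNS mirrors the inner "Sns" object of the payload shown above.
type SNS struct {
	Type      string
	MessageId string
	TopicArn  string
	Subject   string
	Message   string
	Timestamp string
}

type Record struct {
	EventSource string
	Sns         SNS
}

type Payload struct {
	Records []Record
}

type SNSParser struct {
	payload Payload
}

func NewSNSParser(raw []byte) *SNSParser {
	p := &SNSParser{}
	// Ignoring the unmarshal error keeps the sketch short; real code should check it.
	json.Unmarshal(raw, &p.payload)
	return p
}

// IncludesMessage reports whether any record's message contains s,
// returning the SNS struct of the first matching record.
func (p *SNSParser) IncludesMessage(s string) (bool, SNS) {
	for _, r := range p.payload.Records {
		if strings.Contains(r.Sns.Message, s) {
			return true, r.Sns
		}
	}
	return false, SNS{}
}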

Now we can focus on our function code.

Node.js Wrapper

main.js

var child_process = require('child_process');

exports.handler = function(event, context) {
  // Spawn the Go binary and hand it the raw event as a single JSON argument.
  var proc = child_process.spawn('./notifier', [ JSON.stringify(event) ], { stdio: 'inherit' });

  proc.on('close', function(code) {
    if(code !== 0) {
      return context.done(new Error("Process exited with non-zero status code"));
    }

    context.done(null);
  });
};

You can see here that the Node code only spawns a child process called notifier, passes the JSON event in as a command-line argument and calls context.done once the process exits.

This is all the Node.js code you need; from here on everything is done in Go. Just remember that the Go binary has to be cross-compiled for Linux (GOOS=linux GOARCH=amd64 go build -o notifier) and zipped together with main.js before you upload the function.

Go code

notifier.go

package main

import (
	"bytes"
	"fmt"
	"github.com/kensodev/sns-parser"
	"io/ioutil"
	"net/http"
	"net/url"
	"os"
	"strconv"
)

func main() {
	m := os.Args[1]
	parser := snsparser.NewSNSParser([]byte(m))
	failed, message := parser.IncludesMessage("Failed to deploy application")

	if failed {
		sendMessage(message)
	} else {
		fmt.Printf("Everything is OK, nothing to report in this message")
	}
}

func sendMessage(message snsparser.SNS) {
	data := getData(message)
	// Replace SLACK_HOOK with your Slack incoming-webhook URL.
	req, _ := http.NewRequest("POST", "SLACK_HOOK", bytes.NewBufferString(data.Encode()))
	req.Header.Add("Content-Type", "application/x-www-form-urlencoded")
	req.Header.Add("Content-Length", strconv.Itoa(len(data.Encode())))

	client := &http.Client{}
	resp, _ := client.Do(req)
	defer resp.Body.Close()

	body, _ := ioutil.ReadAll(resp.Body)
	fmt.Println("Message is 'Failed to deploy application', sent to slack: ", string(body))
}

func getData(message snsparser.SNS) url.Values {
	data := url.Values{}
	jsonPayload := `
			{
				"channel": "#devs",
				"username": "webhookbot",
				"text": "ALERT: <!here> ElasticBeanstalk failed to deploy application %v",
				"icon_emoji": ":red_circle:"
			}
		`

	jsonMessage := fmt.Sprintf(jsonPayload, message.TopicArn)
	data.Set("payload", jsonMessage)
	return data
}

So simple it hurts :)

We receive the message via OS args (m := os.Args[1]), call the parser (parser := snsparser.NewSNSParser([]byte(m))) and check whether the message includes what we are looking for (failed, message := parser.IncludesMessage("Failed to deploy application")). From there we continue with whatever we want to do.
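
If you want to exercise this logic locally before deploying, a quick Go test against the parser is enough. Here is a sketch you could drop into a file ending in _test.go next to notifier.go (the sample payload below is trimmed down and made up for illustration):

package main

import (
	"testing"

	"github.com/kensodev/sns-parser"
)

// A trimmed-down SNS payload, based on the example message earlier in the post.
const samplePayload = `{
  "Records": [
    {
      "EventSource": "aws:sns",
      "Sns": {
        "TopicArn": "arn:aws:sns:us-east-1:123456789012:deployments",
        "Message": "Timestamp: Wed Mar 30 18:28:06 UTC 2016\nMessage: Failed to deploy application..\n\nEnvironment: staging-api"
      }
    }
  ]
}`

func TestIncludesFailedDeploy(t *testing.T) {
	parser := snsparser.NewSNSParser([]byte(samplePayload))
	failed, _ := parser.IncludesMessage("Failed to deploy application")
	if !failed {
		t.Fatal("expected the sample payload to be detected as a failed deployment")
	}
}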

Extend for your use case

I chose the simplest use case possible of receiving a JSON object and processing it. Even though this workflow is pretty simple I think you can extend it to pretty much anything you want.

One of the things I am going to migrate away from the application and into these functions is callbacks from third-party services. All callbacks to Gogobot will go through Lambda functions eventually. Queue processing is another perfect example of how you can use these.

Hope you enjoyed this post, let me know if you have any questions.

My approach to Devops

Many of you already know that I do 100% of Gogobot’s Devops.

Being in charge of a consumer-facing, multi-platform product is definitely challenging and it has its ups and downs, but I wanted to focus more on my approach to Devops and how I approach my daily tasks.

Engineers

The first thing I worry about is engineer happiness.

I realize “happiness” is hard to quantify, but my ultimate goal is that for engineers it will “just work”: they shouldn’t need to think about which server their code is deployed to or what kind of load balancer is responsible for the traffic.

All they do is say gbot deploy production in the chat room and the rest is done.

The reason for this is that they have enough to worry about, like making the feature work, testing it and making sure the users like it. The infrastructure should be 100% transparent. No hassle.

Product

The most common error I see with Devops is focusing on the infrastructure and not on the product. I think this is fatal for a lot of companies.

Don’t tell me what infrastructure you need, tell me what product you need to deliver and I will make it happen for you.

For example:

“I need a docker container that runs Java” should become “I have a Java-based micro-service that accepts user reviews and sets the language. What’s the best way to run this in production/dev/staging?”

Beyond that though, knowing what your product is and what’s working (or not) can lead to better infrastructure decisions. For example, if a feature spec was to update a record within 30 seconds after a user registers, the infrastructure can be X; if this feature is no longer working for users, you can remove that piece and replace it with a better one for the product. (See the MongoDB example below.)

Monitoring and logging

All of my decisions rely on hard data; I don’t let guesses (even if they are educated ones) take control of my decisions.

What is “not working”? What is “too slow”? Without data to measure these, it is close to impossible to make the right decision.

So, with any piece of infrastructure deployed to production there’s a monitoring and logging strategy, even if it’s a one-off service.

I wrote about this in the past in Measure, Monitor, Observe and supervise. And if you are interested in setting up a logging cluster, you can read on: Running ELK stack on docker - full solution.

Expenses

After salaries, infrastructure is often the biggest expense for a company. In this day and age of cloud, you can create a $50K monthly bill with the click of a button.

Focusing on the most efficient way to achieve something is important to me, and I revisit this often.

For example, one of our most expensive pieces of infrastructure was a MongoDB cluster. It had been running perfectly in production for a while, and the feature it was supporting was running smoothly as a result.

However, looking at new developments in that field, we were able to remove the dependency on MongoDB and completely replace it with a combination of Lambda and S3. This move saved us $250,000 a year on infrastructure costs.

Focusing on efficiency and squeezing infrastructure to the limit is very important to me (not at the expense of performance, of course).

Constantly evaluating

Devops tools and solutions are moving at a very fast pace. However, running a stable and current production deployment means you can’t “jump the gun” on everything “cool” that catches your eye. Constantly evaluating what’s good and stable is very important.

For example, only recently (1-2 months ago) did we start running Docker in production, and it’s still not running any of the critical services (it likely will soon).

With every tool you introduce there’s a learning cliff, so evaluating smartly helps you overcome it and makes sure you have the answers for everything.

Off the top of my head

These are just some thoughts off the top of my head that popped up after an office conversation. What’s your approach? What do you think? Let me know in the comments below.

Running ELK stack on docker - full solution

If you’ve read my Measure, Monitor, Observe and supervise post, you know I am quite the freak about monitoring and logging everything in the system.

For logging purposes, the ELK stack is by far the best solution out there, and I have tried a lot of them, from SaaS to self-hosted ones.

However, from a Devops standpoint, ELK can be quite difficult to install, whether as a distributed solution or on a single machine.

I open sourced the way Gogobot is doing the logging with Rails over a year ago, in the blog post Parsing and centralizing logs from Ruby/Rails and nginx - Our Recipe.

This solution and a version of this chef recipe are running at Gogobot to this very day, but I wanted to make it better.

Making it better doesn’t only mean running the latest versions of the stack; it also means having an easier way to run the stack locally and check for problems, and making it more portable.

The moving parts

Before diving deeper into the solution, let’s first sketch out the moving parts of an ELK stack, including one part that is often overlooked.

ELK Stack

Going a bit into the roles here:

  1. Nginx - Provides a proxy into Kibana and an authentication layer on top
  2. Logstash - Parses incoming logs and inserts the data into Elasticsearch
  3. Kibana - Provides visualization and exploration tools on top of Elasticsearch
  4. Elasticsearch - Storage, indexing and search for the entire stack

Running in dev/production

Prerequisites

If you want to follow this blog post, running the commands and actually ending up with a working solution, you will need the following:

  1. virtualbox
  2. docker-machine
  3. docker-compose

Docker is a great way to run this stack in dev/production. In dev we use docker-compose and in production we use chef to orchestrate and provision the containers on top of EC2.

Docker images

Nginx

First, let’s create the Docker image for nginx.

$ mkdir -p ~/Code/kibana-nginx
$ cd ~/Code/kibana-nginx

We will need an htpasswd file; it will contain the username and password combination that users will be required to use in order to view Kibana.

You can use this generator online or any other solution you see fit.

Create a file called kibana.htpasswd in the same directory and paste the content in.

For example:

kibana:$apr1$Z/5.LALa$P0hfDGzGNt8VtiumKMyo/0

Now, our nginx will need a configuration to use, so let’s create that now.

Create a file called nginx.conf in the same directory:

events {
    worker_connections  1024;
}

http {
    include       /etc/nginx/mime.types;
    default_type  application/octet-stream;

    access_log    /var/log/nginx/access.log;

    include       /etc/nginx/conf.d/*.conf;
    include       /etc/nginx/sites-enabled/*;
}

And now, let’s create a file called kibana.conf that will be our “website” on nginx.

server {
  listen 80 default_server;
  server_name logs.avitzurel.com;
  location / {
    auth_basic "Restricted";
    auth_basic_user_file /etc/nginx/conf.d/kibana.htpasswd;
    proxy_pass http://kibana:5601;
  }
}

Now, we will need the Dockerfile, which looks like this:

FROM nginx
COPY kibana.htpasswd /etc/nginx/conf.d/kibana.htpasswd
COPY nginx.conf /etc/nginx/nginx.conf
COPY kibana.conf /etc/nginx/sites-enabled/kibana.conf

As you can see, we are using all the files we’ve created earlier.

You will need to build the image (and eventually push it when going beyond dev). For the purpose of this post let’s assume it’s called kensodev/kibana-nginx; you can obviously rename it to whatever you want.

$ docker build -t kensodev/kibana-nginx .
$ docker push kensodev/kibana-nginx

Logstash

Before I dive into the logstash configuration, I want to emphasize how we ship logs to logstash.

All logs are shipped to logstash through syslog; we use native syslog without any modification. All machines write simple log files, and syslog monitors them and sends them to logstash via TCP/UDP. There is no application-specific shipper or any other solution.

Diving in

I like creating my own image for logstash as well; it gives me more control over which volumes I want to use, copying patterns over and more. So let’s do this now.

$ mkdir -p docker-logstash
$ cd docker-logstash

Here’s the Dockerfile. This is a simplified version of what I am running in production, but it will do for this blog post. (Feel free to comment/ask questions below if something is unclear.)

FROM logstash:latest
COPY logstash/config/nginx-syslog.conf /opt/logstash/server/etc/conf.d/nginx-syslog
EXPOSE 5000
CMD ["logstash"]

A few parts to notice here.

I am exposing port 5000 for this example; in real life I expose more ports as I need them.

I have a single configuration file called nginx-syslog.conf here; again, in real life I have about five per logstash instance. I try to keep my log types simple, which makes life much easier.

nginx-syslog.conf

input {
  tcp {
    port => "5000"
    type => "syslog"
  }
  udp {
    port => "5000"
    type => "syslog"
  }
}

output {
  elasticsearch {
    hosts => "elasticsearch:9200"
  }
}

filter {
  if [type] == 'syslog' {
    date {
      match => [ "timestamp" , "dd/MMM/YYYY:HH:mm:ss Z" ]
      remove_field => [ "timestamp" ]
    }

    useragent {
      source => "agent"
    }

    mutate {
      convert => ["response", "integer"]
      convert => ["bytes", "integer"]
      convert => ["responsetime", "float"]
    }

    geoip {
      source => "clientip"
      target => "geoip"
      add_tag => [ "nginx-geoip" ]
    }

    grok {
      match => [ "message" , "%{COMBINEDAPACHELOG}+%{GREEDYDATA:extra_fields}"]
      overwrite => [ "message" ]
    }
  }
}

Now we will build the docker image:

$ docker build -t kensodev/logstash .
$ docker push kensodev/logstash

Moving on to composing the solution

Now that we have our custom docker images, let’s move on to composing the solution using docker-compose.

Keep in mind that so far we are working locally; you don’t have to docker push for any of this to work on your local machine, since compose will default to the local image if you have it.

Create a docker-compose.yml file and paste in this content

nginx:
  image: kensodev/kibana-nginx
  links:
    - kibana
  ports:
    - "80:80"
elasticsearch:
  image: elasticsearch:latest
  command: elasticsearch -Des.network.host=0.0.0.0
  ports:
    - "9200:9200"
    - "9300:9300"
logstash:
  command: "logstash -f /opt/logstash/server/etc/conf.d/"
  image: kensodev/logstash:latest
  volumes:
    - ./logstash/config:/etc/logstash/conf.d
  ports:
    - "5000:5000"
  links:
    - elasticsearch
kibana:
  build: kibana/
  volumes:
    - ./kibana/config/:/opt/kibana/config/
  ports:
    - "5601:5601"
  links:
    - elasticsearch

This will create all the containers for us and link them, and we’ll have a running solution.

In order to check whether your solution is running, you can go to your docker-machine IP.

My machine name is elk, I do this:

› docker-machine ip elk
192.168.99.101

If you type that address in a browser, you should see this:

Kibana Blank

As you can see, the button is greyed out, saying “Unable to fetch mapping”.

If you send anything to logstash using:

$ echo Boom! | nc 192.168.99.101 5000

You will see this:

Kibana working

You can now hit “Create” and you have a working ELK solution.
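
If you prefer to ship a test line from code rather than nc, here is a minimal Go sketch that writes to the logstash UDP input (the IP and port are the ones from the example above; adjust them for your machine):

package main

import (
	"fmt"
	"net"
)

func main() {
	// The logstash UDP input exposed by the compose setup above.
	conn, err := net.Dial("udp", "192.168.99.101:5000")
	if err != nil {
		fmt.Println("could not reach logstash:", err)
		return
	}
	defer conn.Close()

	// Send a single test event.
	fmt.Fprintln(conn, "Boom! from Go")
	fmt.Println("test line sent")
}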

Conclusion

Setting up an ELK stack can be a daunting task; Docker and compose make it easier to run and manage in dev/production.

In the next post, I will go into running this in production.

Thanks for reading and be sure to let me know what you think in the comment section.

Hack multiple conditions in Nginx configuration

We @ Gogobot use Nginx for basically every user-facing web service, whether it’s our main web app or a microservice.

Even for Docker, we use nginx and not the default Docker proxy that you get
with -p 80:80.

I had a request to add support for trailing-slash redirects for all URLs, so /paris/ will redirect to /paris, and of course /paris/?x=x will also redirect correctly with all of the parameters.

If you know nginx configuration you know this is a bit tricky, but you can get
around it using an equally tricky hack.

Let’s dive in

What I want is to redirect trailing slashes only for GET requests, and also have different redirects when there’s a query string and when there isn’t.

Here’s the code:

location ~ ^\/(?!blog)(.*)\/$ {
  set $_url_status "";

  if ($request_method = GET ) {
    set $_url_status "${_url_status}get";
  }

  if ($query_string) {
    set $_url_status "${_url_status}_with_query_string";
  }

  if ( $_url_status = "get" ) {
    return 302 $scheme://stg.gogobot.com/$1;
  }

  if ( $_url_status = "get_with_query_string" ) {
    return 302 $scheme://stg.gogobot.com/$1?$query_string;
  }
}

As you can see here, I am basically building up a condition in multiple phases and then asking whether it’s get or get_with_query_string, redirecting accordingly.
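
If you want to sanity-check the behavior without clicking around, a tiny Go snippet that issues requests without following redirects does the trick (the hostname is the staging one from the config above; swap in your own):

package main

import (
	"fmt"
	"net/http"
)

func main() {
	// Don't follow redirects so we can inspect the 302 and its Location header.
	client := &http.Client{
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			return http.ErrUseLastResponse
		},
	}

	for _, u := range []string{
		"http://stg.gogobot.com/paris/",
		"http://stg.gogobot.com/paris/?x=x",
	} {
		resp, err := client.Get(u)
		if err != nil {
			fmt.Println(u, "error:", err)
			continue
		}
		resp.Body.Close()
		fmt.Println(u, "->", resp.StatusCode, resp.Header.Get("Location"))
	}
}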

Happy Hacking.

Version 1.0.2 of circle-env released

I just released version 1.0.2 of the circle-env command line tool.

What is circle-env

circle-env is a small but useful command line tool that imports your .env file to CircleCI, but it also works on CircleCI itself, replacing templates with real files.

It’s especially useful when you work with Docker images and you want to build files, replacing <SOME_ENV_VAR> with the value of $SOME_ENV_VAR.
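
To make the substitution concrete, here is a rough sketch of the idea in Go (this is not the actual circle-env code, just an illustration of replacing <SOME_ENV_VAR> placeholders with values from the environment):

package main

import (
	"fmt"
	"os"
	"regexp"
)

// renderTemplate replaces every <SOME_ENV_VAR> placeholder with $SOME_ENV_VAR.
func renderTemplate(template string) string {
	placeholder := regexp.MustCompile(`<([A-Z0-9_]+)>`)
	return placeholder.ReplaceAllStringFunc(template, func(match string) string {
		name := match[1 : len(match)-1] // strip the surrounding < >
		return os.Getenv(name)
	})
}

func main() {
	os.Setenv("DOCKER_REGISTRY", "registry.example.com")
	fmt.Println(renderTemplate("FROM <DOCKER_REGISTRY>/base:latest"))
	// Output: FROM registry.example.com/base:latest
}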

Check it out.

Release notes

Project Home

Workflow fragmentation

Workflow fragmentation is a huge problem for modern engineers, in my opinion. People like to think that being in an office and communicating solves this problem, but this couldn’t be further from the truth.

The problem is, we are using too many tools to communicate and to schedule work tasks.

Those tools don’t “talk” to each other and this creates a problem for companies and teams.

We like to think of ourselves as “agile” when using these tools, but in the process we lose context and we lose the discussions in which many smart people contributed to the decisions on features/bugs.

Let me dive in a bit…

The workflow

Let’s take a very common workflow for an engineer. You are starting a new task.

Your steps.

  1. You pick up the task from your task management software.
  2. You open a new branch on your source control.

Diving in

1

You need to open a browser and choose a task that suits you (or one assigned to you).

2

You need to create the branch on your local machine, git checkout -b feature/some-feature-name.

Disconnect started…

There’s already a disconnect between your branch and the task system. Unless you “started” the task on the web (or mobile), no one really knows what you are working on or that you started working on this task.

OK… Continuing…

You go about your day and then you:

  1. Commit some code with a descriptive (or not) commit message
  2. Push the code
  3. Deliver your task in the management software after you’ve verified it

While this is a simplified version of pretty much any engineer’s daily routine, it comes with a lot more difficulties than described here, so let’s dive into them a bit.

Again, there’s a huge disconnect, especially when you consider that the product manager needs to “see” something.

The product manager has no way of knowing what the status of the task is unless the engineer actively tells him (through the system). Even though, as engineers, everything we do is a “notification”: “I deployed this to staging”, “I am running tests on this”.

Between git, GitHub, Pivotal Tracker and the CI, there’s not a single line of communication, a feed of what’s going on with this task.

Real life example

In order to further clarify what I am talking about, I want to take you through a single bug I worked on. This bug exhibited all of the problems described above.

Part of my work at Gogobot is working on search. Search is a huge part of Gogobot so it involves pretty much every aspect of the product whether it’s on the web or on the mobile application.

Recently a bug was discovered in a new (shhh) product we are working on.

This bug was communicated to me via Slack private message.

First bug announced

Now, this is a pretty serious search quality bug; helping users find the best fit for them is part of our DNA, so this is something I should be working on right away.

I start working on the problem and report back.

Report back on the bug

The fix took about 10 minutes and I opened a pull request.

Bug fixed

Now, you’ll notice that this has absolutely no data about the bug. It is simply titled “fix”. When your memory is fresh that might be fine, but later you have no idea what the initial cause of this “fix” was.

Like magic, I got a new question on Slack about 20 minutes after the fix was already deployed.

Second bug report

This is the bug report on Pivotal Tracker, and it describes what was already described to me in chat.

Pivotal Bug report

Understanding the problem

If you were able to follow along, you likely already understand the problem. This is not exclusive to us; any company will experience this over time.

You have many communication channels: some verbal, some written and some in code and commit messages.

For example, if you run git blame and you see a line that your colleague wrote a few months back, you don’t really have the “why” and the discussions that led to this fix.

Beyond the simple syntax and obvious bug fixes, this becomes a problem.

Attempting to fix

This post has been in draft mode on my computer for a long while. I have to say I don’t have a complete solution, but I think one is possible.

The solution involves multiple stages:

  1. A story in your bug tracker / sprint planner
  2. A Slack channel opens and is linked to from the story. Any discussion on this story happens in the Slack channel (and is recorded)
  3. Picking up a task should be pickup-task $task_id, which links your git commits to the story
  4. Opening a PR, running tests or doing anything else on that branch pings Slack (again, recorded)
  5. Once approved, the PR is marked “ready to merge”, the task is delivered, and the channel is closed and archived for later reference
  6. The commit is squashed and a link to the story (with the discussion) is added to it

This would have solved the problem described here (or at least it would be a good start).

The story would have a timeline of the discussion between product and the engineer who picked up the story, available later for reference when you are looking at the code.

What’s the situation for you?

Would love to hear what the situation is for you and how you would suggest solving it.

Nginx rule to redirect wordpress blogs - dropping the date

Recently, we changed the Gogobot blog to drop the dates from the posts.

Since we have links coming in to the “old” structure, we of course wanted to redirect from the old structure to the new structure.

The “old” structure is the common WordPress structure, /blog/2016/01/31/some-post-title, and we changed it to be just /blog/some-post-title.

The requirements are pretty simple here: drop the date and redirect to the title, keeping all UTM params and anything else in the URL.

Here’s how you do it:

rewrite "/blog/[\d]{4}/[\d]{2}/[\d]{2}/(.*)" "/blog/$1" permanent;

Obviously, for us, everything is done through chef and I never edit anything on the server itself, but either way you need to verify that the configuration passes syntax checks using the built-in nginx -t command.

After applying the configuration and restarting nginx, all “old” structures will redirect to the new one.
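
If you want to sanity-check the capture group before shipping the configuration, the same pattern can be exercised from a quick Go program (Go’s regexp handles this particular pattern the same way):

package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Same pattern as the nginx rewrite above.
	datePrefix := regexp.MustCompile(`/blog/[\d]{4}/[\d]{2}/[\d]{2}/(.*)`)

	old := "/blog/2016/01/31/some-post-title"
	fmt.Println(datePrefix.ReplaceAllString(old, "/blog/$1"))
	// Output: /blog/some-post-title
}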

Automate deleting old application versions from ElasticBeanstalk

Intro

AWS ElasticBeanstalk is a very popular way of deploying applications to “the
cloud” without really messing too much with configuration and deployment
scripts.

It’s a very powerful platform if you know how to use it.

The problem

If you work in a team and you do the regular git workflow, you will encounter
this error:

A client error (TooManyApplicationVersionsException) occurred when calling the CreateApplicationVersion operation:
You cannot have more than 500 Application Versions. Either remove some Application Versions or request a limit increase.

ElasticBeanstalk is limited to 500 application versions across all your applications. You will need to delete old application versions.

Automating the solution

I’m an automation freak so I automated the process of deleting application
versions from AWS.

All you need to do is export some variables:

$ export APP=your-application
$ export PROFILE=your-aws-profile

Then, you execute

$ ./execute-delete.sh

Solution details

The solution is composed of a few files

parse_versions.rb

This Ruby file accepts the JSON output from describe-application-versions and parses it into simple output, making sure each version belongs to the right application and is old enough (14 days, per the script) before outputting it.

#!/usr/bin/env ruby

require 'time'
require 'json'

ALLOWED_NAMES = [ENV['APP']]

t = DateTime.now - 14

json = ARGF.read
hash = JSON.parse(json)

versions = hash["ApplicationVersions"]

versions.each do |ver|
  application_name = ver["ApplicationName"]
  created_at = DateTime.parse(ver["DateCreated"])

  if ALLOWED_NAMES.include?(application_name)
    if t > created_at
      puts "#{ver["VersionLabel"]}"
    end
  end
end

list-versions.sh

aws elasticbeanstalk describe-application-versions --profile $PROFILE

delete-versions.sh

echo "Starting to delete versions of $APP"

while read ver; do
    echo "Deleting version $ver"
    aws elasticbeanstalk delete-application-version --version-label $ver --profile $PROFILE --application-name $APP
    echo "Version $ver deleted!"
done

Enjoy!

Source Code

Source code for the scripts can be found here: https://gist.github.com/KensoDev/646de085dc8fd4c4b39d.

Lets write your infrastructure as code - step by step (Part 1)

This is a step by step guide to setting up your infrastructure on Amazon using
code (Terraform) for a reproducible, version controlled stack.

Why?

While discussing infrastructure as code, the most common question is: why? Amazon gives you a lot of options in the UI; you can basically do everything from the UI without really touching any code or learning any new tool.

For me, the best thing about it is that I have the state of what my infrastructure looks like at any given moment. I can version control it, and I can share the state file with my peers.

Have opinions on why (or why not)? Please share them in the comments.

Let’s start now.

Prerequisites

  1. Amazon AWS Account
  2. Access key and secret key

Tools required

  1. terraform (0.6.12 was used here)
  2. Text Editor of your choice.

Fair Warning

If you progress with this article and execute the apply commands you should
know you are creating resources on your (or your company’s) Amazon account.
Make sure you have the permission to do so.

Getting Started

In order to get started, let’s create a file called ~/.aws/credentials. Make sure this file is never checked into source control, or anyone with access to it will be able to access your Amazon account.

In this file, add your profile like so:

[avi]
aws_access_key_id = ACCESS_KEY
aws_secret_access_key = SECRET_KEY

avi can of course be replaced with a profile name to your liking; it’s just a string and you can give it any name you want.

After you have the profile set up in the credentials file, we can continue.

Terraforming

Without going too much into terraform (like I said earlier, I highly encourage you to read up on it), it gives you the ability to “describe” your infrastructure as code and “apply” the changes.

Think of it as git for your infrastructure: you can change resources and you can “diff” between the code and the version that is currently running.

Now, let’s jump into the code.

Your first terraform file

Creating a directory

$ mkdir ~/Code/terraform-test
$ cd ~/Code/terraform-test

Create a file called main.tf.

Provider

Let’s begin by declaring the provider we will be working with:

provider "aws" {
  region  = "${var.aws_region}"
  profile = "avi"
}

  • profile here needs to be the same profile name that you declared earlier in
    ~/.aws/credentials
  • region is your AWS region. As you can see, it comes from a variable that we
    will be covering soon

In this post I assume you are using us-east-1; if you don’t, the code sample will not work as-is and you will need to edit the variables with the correct AMI ID for your region.

VPC and more…

Obviously, I can’t cover all environments here, but usually you will have production, staging and test. Each of those will have its own VPC, route tables, internet gateways and more. For the purpose of this post, I will only cover production.

Let’s begin by describing our VPC, route tables and internet gateways. Those are the foundations for our cluster, and everything else depends on them.

resource "aws_vpc" "production" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_internet_gateway" "production" {
  vpc_id = "${aws_vpc.production.id}"
}

resource "aws_route" "internet_access" {
  route_table_id         = "${aws_vpc.production.main_route_table_id}"
  destination_cidr_block = "0.0.0.0/0"
  gateway_id             = "${aws_internet_gateway.production.id}"
}

resource "aws_subnet" "production-1a" {
  availability_zone       = "us-east-1a"
  vpc_id                  = "${aws_vpc.production.id}"
  cidr_block              = "10.0.1.0/24"
  map_public_ip_on_launch = true
}

resource "aws_subnet" "production-1d" {
  availability_zone       = "us-east-1d"
  vpc_id                  = "${aws_vpc.production.id}"
  cidr_block              = "10.0.2.0/24"
  map_public_ip_on_launch = true
}

resource "aws_subnet" "production-1c" {
  availability_zone       = "us-east-1c"
  vpc_id                  = "${aws_vpc.production.id}"
  cidr_block              = "10.0.3.0/24"
  map_public_ip_on_launch = true
}

Now, that’s a lot to take in, so let’s dive into what we described here:

  1. A virtual private cloud called “production”
  2. A route for internet access that allows all outbound traffic
  3. Subnets in three availability zones, each with its own IP range, all inside the production VPC

All of these terms can be intimidating at first, and I know that the simplicity of the “git push” to Heroku is in the back of your mind this entire time, but in a real production environment you will need fine-grained control over most of these things. You want to make sure that, at the network level, services cannot communicate with what they shouldn’t be communicating with (just as a single example).

Security groups

Now that we have the subnets and VPC, we want to describe some security groups.

Let’s think about what we need.

  1. Load balancer accessible from the outside
  2. Instances accessible from the load balancer
  3. Services accessible from instances
  4. Instances accessible from “allowed connections”

I like to create elb, internal and external groups to allow those
rules.

resource "aws_security_group" "internal" {
  name        = "internal"
  description = "Internal Connections"
  vpc_id      = "${aws_vpc.production.id}"

  tags {
    Name = "Internal Security Group"
  }

  ingress {
    from_port = 0
    to_port   = 65535
    protocol  = "tcp"
    self      = true
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group" "elb" {
  name        = "elb"
  description = "Load Balancer Security Group"
  vpc_id      = "${aws_vpc.production.id}"

  tags {
    Name = "Load balancer security group"
  }

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group" "external" {
  name        = "external"
  description = "Connection From the world"
  vpc_id      = "${aws_vpc.production.id}"

  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["YOUR_IP_HERE/32"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

Load balancer

Now that our security groups are all described, we can continue with our web-accessible infrastructure (load balancer and instance).

resource "aws_elb" "web" {
  name            = "web-production"

  subnets         = ["${aws_subnet.production-1a.id}", "${aws_subnet.production-1c.id}", "${aws_subnet.production-1d.id}"]
  security_groups = ["${aws_security_group.elb.id}"]
  instances       = ["${aws_instance.prod-web.id}"]

  tags {
    Name = "prod-web-elb"
  }

  health_check {
    healthy_threshold   = 2
    unhealthy_threshold = 2
    timeout             = 20
    target              = "HTTP:80/"
    interval            = 30
  }

  listener {
    instance_port     = 80
    instance_protocol = "http"
    lb_port           = 80
    lb_protocol       = "http"
  }
}

Here we described a load balancer that is web accessible.

To simplify things, I didn’t add an SSL listener with a certificate, but you can obviously do that (let me know in the comments if that’s what you need).

Instances

You can see the instances are identified by aws_instance.prod-web.id, so let’s describe those instances now.

First, let’s create the key pair for the instance.

$ keyname=the-startup-stack
$ keymail="devops@the-startup-stack.com"
$ ssh-keygen -t rsa -b 4096 -f $keyname -C $keymail

Now that you have your key ready, let’s start describing it with code.

resource "aws_key_pair" "auth" {
  key_name = "${var.key_name}"
  public_key = "${file(var.public_key_path)}"
}

I really like the hostnames on my instances to describe what they are and also give me some context on the instance, so I usually have a minimal cloud-config script.

Terraform has a great option for this using templates, so let’s start with those.

Create a file called web_userdata.tpl

#cloud-config

bootcmd:
 - hostname web.${domain_name}.`curl http://169.254.169.254/latest/meta-data/instance-id`
 - echo 127.0.1.1 web.${domain_name}.`curl http://169.254.169.254/latest/meta-data/instance-id` >> /etc/hosts
 - echo web.${domain_name}.`curl http://169.254.169.254/latest/meta-data/instance-id` > /etc/hostname

preserve_hostname: true

Then in terraform, we can use that file as a template

resource "template_file" "web_userdata" {
  template = "${file("web_userdata.tpl")}"

  vars {
    domain_name = "yourdomain"
  }
}

Now, let’s create the instance:

resource "aws_instance" "prod-web" {
  count     = 1
  user_data = "${template_file.web_userdata.rendered}"

  connection {
    user = "ubuntu"
  }

  tags {
    Name = "prod-web-${count.index + 1}"
  }

  instance_type          = "m3.xlarge"

  key_name               = "${aws_key_pair.auth.id}"
  ami                    = "${lookup(var.aws_amis, var.aws_region)}"

  vpc_security_group_ids = ["${aws_security_group.external.id}", "${aws_security_group.internal.id}"]
  subnet_id              = "${aws_subnet.production-1a.id}"
}

Before we finish things up here, we need to supply all the variables we declared.

Let’s create a file called variables.tf:

variable "aws_region" {
    description = "AWS region to launch servers."
    default = "us-east-1"
}

variable "aws_amis" {
    default = {
        "us-east-1" = "ami-7ba59311"
    }
}

variable "key_name" {
}

variable "public_key_path" {
}

Now we can execute terraform plan in order to check what Terraform will create on our Amazon account.

If you set up everything correctly, you should see something similar to this:

var.key_name
  Enter a value: the-startup-stack

var.public_key_path
  Enter a value: the-startup-stack.pub

Refreshing Terraform state prior to plan...


The Terraform execution plan has been generated and is shown below.
Resources are shown in alphabetical order for quick scanning. Green resources
will be created (or destroyed and then created if an existing resource
exists), yellow resources are being changed in-place, and red resources
will be destroyed.

Note: You didn't specify an "-out" parameter to save this plan, so when
"apply" is called, Terraform can't guarantee this is what will execute.

+ aws_elb.web
    availability_zones.#:                   "" => "<computed>"
    connection_draining:                    "" => "0"
    connection_draining_timeout:            "" => "300"
    dns_name:                               "" => "<computed>"
    health_check.#:                         "" => "1"
    health_check.0.healthy_threshold:       "" => "2"
    health_check.0.interval:                "" => "30"
    health_check.0.target:                  "" => "HTTP:80/"
    health_check.0.timeout:                 "" => "20"
    health_check.0.unhealthy_threshold:     "" => "2"
    idle_timeout:                           "" => "60"
    instances.#:                            "" => "<computed>"
    internal:                               "" => "<computed>"
    listener.#:                             "" => "1"
    listener.3057123346.instance_port:      "" => "80"
    listener.3057123346.instance_protocol:  "" => "http"
    listener.3057123346.lb_port:            "" => "80"
    listener.3057123346.lb_protocol:        "" => "http"
    listener.3057123346.ssl_certificate_id: "" => ""
    name:                                   "" => "web-production"
    security_groups.#:                      "" => "<computed>"
    source_security_group:                  "" => "<computed>"
    source_security_group_id:               "" => "<computed>"
    subnets.#:                              "" => "<computed>"
    tags.#:                                 "" => "1"
    tags.Name:                              "" => "prod-web-elb"
    zone_id:                                "" => "<computed>"

+ aws_instance.prod-web
    ami:                      "" => "ami-7ba59311"
    availability_zone:        "" => "<computed>"
    ebs_block_device.#:       "" => "<computed>"
    ephemeral_block_device.#: "" => "<computed>"
    instance_state:           "" => "<computed>"
    instance_type:            "" => "m3.xlarge"
    key_name:                 "" => "${aws_key_pair.auth.id}"
    placement_group:          "" => "<computed>"
    private_dns:              "" => "<computed>"
    private_ip:               "" => "<computed>"
    public_dns:               "" => "<computed>"
    public_ip:                "" => "<computed>"
    root_block_device.#:      "" => "<computed>"
    security_groups.#:        "" => "<computed>"
    source_dest_check:        "" => "1"
    subnet_id:                "" => "${aws_subnet.production-1a.id}"
    tags.#:                   "" => "1"
    tags.Name:                "" => "prod-web-1"
    tenancy:                  "" => "<computed>"
    user_data:                "" => "948c5ae186c03822f50780fa376b228673b02f26"
    vpc_security_group_ids.#: "" => "<computed>"

+ aws_internet_gateway.production
    vpc_id: "" => "${aws_vpc.production.id}"

+ aws_key_pair.auth
    fingerprint: "" => "<computed>"
    key_name:    "" => "the-startup-stack"
    public_key:  "" => "YOUR PUBLIC KEY"

+ aws_route.internet_access
    destination_cidr_block:     "" => "0.0.0.0/0"
    destination_prefix_list_id: "" => "<computed>"
    gateway_id:                 "" => "${aws_internet_gateway.production.id}"
    instance_owner_id:          "" => "<computed>"
    origin:                     "" => "<computed>"
    route_table_id:             "" => "${aws_vpc.production.main_route_table_id}"
    state:                      "" => "<computed>"

+ aws_security_group.elb
    description:                          "" => "Load Balancer Security Group"
    egress.#:                             "" => "1"
    egress.2214680975.cidr_blocks.#:      "" => "1"
    egress.2214680975.cidr_blocks.0:      "" => "0.0.0.0/0"
    egress.2214680975.from_port:          "" => "80"
    egress.2214680975.protocol:           "" => "tcp"
    egress.2214680975.security_groups.#:  "" => "0"
    egress.2214680975.self:               "" => "0"
    egress.2214680975.to_port:            "" => "80"
    ingress.#:                            "" => "1"
    ingress.2214680975.cidr_blocks.#:     "" => "1"
    ingress.2214680975.cidr_blocks.0:     "" => "0.0.0.0/0"
    ingress.2214680975.from_port:         "" => "80"
    ingress.2214680975.protocol:          "" => "tcp"
    ingress.2214680975.security_groups.#: "" => "0"
    ingress.2214680975.self:              "" => "0"
    ingress.2214680975.to_port:           "" => "80"
    name:                                 "" => "elb"
    owner_id:                             "" => "<computed>"
    tags.#:                               "" => "1"
    tags.Name:                            "" => "Load balancer security group"
    vpc_id:                               "" => "${aws_vpc.production.id}"

+ aws_security_group.external
    description:                          "" => "Connection From the world"
    egress.#:                             "" => "1"
    egress.482069346.cidr_blocks.#:       "" => "1"
    egress.482069346.cidr_blocks.0:       "" => "0.0.0.0/0"
    egress.482069346.from_port:           "" => "0"
    egress.482069346.protocol:            "" => "-1"
    egress.482069346.security_groups.#:   "" => "0"
    egress.482069346.self:                "" => "0"
    egress.482069346.to_port:             "" => "0"
    ingress.#:                            "" => "1"
    ingress.3452538839.cidr_blocks.#:     "" => "1"
    ingress.3452538839.cidr_blocks.0:     "" => "YOUR_IP_HERE/32"
    ingress.3452538839.from_port:         "" => "22"
    ingress.3452538839.protocol:          "" => "tcp"
    ingress.3452538839.security_groups.#: "" => "0"
    ingress.3452538839.self:              "" => "0"
    ingress.3452538839.to_port:           "" => "22"
    name:                                 "" => "external"
    owner_id:                             "" => "<computed>"
    vpc_id:                               "" => "${aws_vpc.production.id}"

+ aws_security_group.internal
    description:                          "" => "Internal Connections"
    egress.#:                             "" => "1"
    egress.482069346.cidr_blocks.#:       "" => "1"
    egress.482069346.cidr_blocks.0:       "" => "0.0.0.0/0"
    egress.482069346.from_port:           "" => "0"
    egress.482069346.protocol:            "" => "-1"
    egress.482069346.security_groups.#:   "" => "0"
    egress.482069346.self:                "" => "0"
    egress.482069346.to_port:             "" => "0"
    ingress.#:                            "" => "1"
    ingress.3544538468.cidr_blocks.#:     "" => "0"
    ingress.3544538468.from_port:         "" => "0"
    ingress.3544538468.protocol:          "" => "tcp"
    ingress.3544538468.security_groups.#: "" => "0"
    ingress.3544538468.self:              "" => "1"
    ingress.3544538468.to_port:           "" => "65535"
    name:                                 "" => "internal"
    owner_id:                             "" => "<computed>"
    tags.#:                               "" => "1"
    tags.Name:                            "" => "Internal Security Group"
    vpc_id:                               "" => "${aws_vpc.production.id}"

+ aws_subnet.production-1a
    availability_zone:       "" => "us-east-1a"
    cidr_block:              "" => "10.0.1.0/24"
    map_public_ip_on_launch: "" => "1"
    vpc_id:                  "" => "${aws_vpc.production.id}"

+ aws_subnet.production-1c
    availability_zone:       "" => "us-east-1c"
    cidr_block:              "" => "10.0.3.0/24"
    map_public_ip_on_launch: "" => "1"
    vpc_id:                  "" => "${aws_vpc.production.id}"

+ aws_subnet.production-1d
    availability_zone:       "" => "us-east-1d"
    cidr_block:              "" => "10.0.2.0/24"
    map_public_ip_on_launch: "" => "1"
    vpc_id:                  "" => "${aws_vpc.production.id}"

+ aws_vpc.production
    cidr_block:                "" => "10.0.0.0/16"
    default_network_acl_id:    "" => "<computed>"
    default_security_group_id: "" => "<computed>"
    dhcp_options_id:           "" => "<computed>"
    enable_classiclink:        "" => "<computed>"
    enable_dns_hostnames:      "" => "<computed>"
    enable_dns_support:        "" => "<computed>"
    main_route_table_id:       "" => "<computed>"

+ template_file.web_userdata
    rendered:         "" => "<computed>"
    template:         "" => "#cloud-config\n\nbootcmd:\n - hostname web.${domain_name}.`curl http://169.254.169.254/latest/meta-data/instance-id`\n - echo 127.0.1.1 web.${domain_name}.`curl http://169.254.169.254/latest/meta-data/instance-id` >> /etc/hosts\n - echo web.${domain_name}.`curl http://169.254.169.254/latest/meta-data/instance-id` > /etc/hostname\n\npreserve_hostname: true\n"
    vars.#:           "" => "1"
    vars.domain_name: "" => "yourdomain"


Plan: 13 to add, 0 to change, 0 to destroy.

If you execute terraform apply now, terraform will create all the resources for you.

It will create a load balancer and attach the instance to it; all you need to do now is deploy your code (coming in part 3).

Source Code

You can find the source code on GitHub.

Summing up

You can probably see now that you can pretty easily describe your
infrastructure with code.

In part two of this post, we will create the database, elasticache (redis),
open up security groups for more options and more.

Would love your feedback and comments, as always.

The Startup Stack Progress Report March 2 2016

CLI

The more I work on this project, and the more people work on it with my guidance, the more I see the need for a CLI that will consolidate all the commands.

For example, with the check repository template you need to clone it and copy directories from place to place. “Wrapping” all the commands in a testable and hackable CLI makes sense to me.

I am still cautious about following every idea I have, since it’s a big project.

Supermarket

I started uploading some of the cookbooks to the Chef Supermarket; I still need to update some READMEs and documentation to match the code status.

I am really making sure the CHANGELOGs are up to date; I think in a project this size it’s important to understand what will break if you upgrade the cookbooks or any resource.

Get involved