Today’s post was originally published on Jade Meskill’s personal blog and is being shared here.
At Octoblu, we deploy very frequently and we’re tired of our users seeing the occasional blip when a new version is put into production.
Though we’re using Amazon OpsWorks to manage our infrastructure more easily, our updates can take a while: dependencies have to be installed before the service restarts – not a great experience.
We knew that moving to an immutable infrastructure approach would let us deploy our apps, which range from extremely simple web services to complex near-real-time messaging systems, more quickly and easily.
Containerization is the future of app deployment, but managing and scaling a bunch of Docker instances, along with all of their port mappings, is not a simple proposition.
Kubernetes simplified that part of our deployment strategy. However, we still had a problem: while Kubernetes is spinning up new versions of our Docker instances, we can enter a state where old and new versions are serving traffic at the same time. And if we shut down the old version before bringing up the new, we get a brief (sometimes not so brief) period of downtime.
I first read about Blue/Green deploys in Martin Fowler’s excellent article BlueGreenDeployment – a simple but powerful concept. We started to build out a way to do this in Kubernetes. After some complicated attempts, we came up with a simple idea: use Amazon ELBs as the router. Kubernetes handles the complexity of routing a request to the appropriate minion by listening on a given port on all minions, which makes ELB load balancing a piece of cake: have the ELB listen on ports 80 and 443, then route the requests to the Kubernetes port on all minions.
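To make that wiring concrete, here is a minimal sketch of attaching an ELB listener to the Kubernetes port. The ELB name and node port are illustrative assumptions, not values from the post:

```shell
#!/bin/bash
# Sketch: the ELB listens on 80 (and 443 for TLS) and forwards every
# request to the port Kubernetes exposes on all minions; Kubernetes
# then routes the request to an appropriate pod.
ELB_NAME="trigger-service"   # hypothetical ELB name
KUBE_PORT=30080              # hypothetical port exposed on every minion

HTTP_LISTENER="Protocol=HTTP,LoadBalancerPort=80,InstanceProtocol=HTTP,InstancePort=${KUBE_PORT}"
echo "$HTTP_LISTENER"

# The real call would look something like:
# aws elb create-load-balancer-listeners \
#   --load-balancer-name "$ELB_NAME" \
#   --listeners "$HTTP_LISTENER"
```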
Blue or Green?
The next problem was figuring out whether blue or green is currently active. Another simple idea: store a blue port and a green port as tags on the ELB, then look at the ELB’s current listener configuration to see which one is live. No need to store the value somewhere that may drift out of sync with reality.
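The check can be sketched as a small function. In production the three inputs would come from the AWS CLI (the blue/green ports from the ELB’s tags, the live port from its listener configuration); the lookup is factored out here so the decision logic itself is testable, and all values are illustrative:

```shell
#!/bin/bash
# Given the port the ELB currently forwards to, plus the blue and
# green ports stored as ELB tags, report which color is live.
get_active_color() {
  local live_port=$1 blue_port=$2 green_port=$3
  if [ "$live_port" = "$blue_port" ]; then
    echo "blue"
  elif [ "$live_port" = "$green_port" ]; then
    echo "green"
  else
    echo "unknown"
  fi
}

# e.g. the ELB forwards to 30080; tags say blue=30080, green=30081
get_active_color 30080 30080 30081
```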
Putting it all together.
The following is part of a script that runs on every Trigger Service deploy. You can check out the code on GitHub if you want to see how it all works together.
I’ve added some annotations to help explain what is happening.
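The snippet itself lives in the GitHub repository; as a stand-in, here is a hypothetical condensed sketch of the flow it describes. Every name, file, and port below is an illustrative assumption, not the actual Trigger Service script:

```shell
#!/bin/bash
# Hypothetical blue/green deploy flow against Kubernetes + an ELB.
BLUE_PORT=30080
GREEN_PORT=30081

# Given the port the ELB currently forwards to, pick the idle one.
target_port() {
  local live_port=$1
  if [ "$live_port" = "$BLUE_PORT" ]; then
    echo "$GREEN_PORT"
  else
    echo "$BLUE_PORT"
  fi
}

# 1. Ask the ELB which port is live (in production: aws elb
#    describe-load-balancers); hard-coded here for illustration.
LIVE_PORT=30080
NEW_PORT=$(target_port "$LIVE_PORT")

# 2. Bring up the new version on the idle color, e.g.:
#      kubectl create -f "trigger-service-${NEW_PORT}.yaml"
# 3. Poll until the new pods report ready.
# 4. Flip the ELB listener to the idle port, e.g.:
#      aws elb delete-load-balancer-listeners \
#        --load-balancer-name trigger-service --load-balancer-ports 80
#      aws elb create-load-balancer-listeners \
#        --load-balancer-name trigger-service \
#        --listeners "Protocol=HTTP,LoadBalancerPort=80,InstanceProtocol=HTTP,InstancePort=${NEW_PORT}"
# The old color keeps running, so a rollback is just flipping the
# listener back to the previous port.
echo "$NEW_PORT"
```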
Sometimes Peter makes a mistake and we have to roll back to a prior version quickly. If the old version is still running on the off cluster, rollback is as simple as re-mapping the ELB to forward to the old ports. But sometimes Peter tries to fix his mistake with a new deploy, and now we have a real mess.
Because this happened more than once, we created oops. Oops allows us to instantly roll back to the off cluster, simply by executing oops-rollback, or to quickly re-deploy a previous version.
We add an .oopsrc to all our apps that looks something like this:
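The original example isn’t reproduced here, so this is only a hypothetical shape for the file – the field names are assumptions, not taken from the oops tool itself:

```json
{
  "elb-name": "trigger-service",
  "blue-port": 30080,
  "green-port": 30081,
  "deploy": "./deploy.sh"
}
```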
oops list will show us all available deployments.
We are always looking for ways to get better results. If you have suggestions, let us know.