In an earlier blog post, we gave a high-level description of our migration from AWS to FB data centers. What follows is an in-depth analysis of how we migrated thousands of running AWS EC2 instances into Amazon’s Virtual Private Cloud (VPC) in the span of 3 weeks with no downtime. It was extremely meticulous work, and it required the development of custom virtual networking software to make it happen. It was, as far as we are aware, the fastest and one of the largest EC2 to VPC migrations to date.
Investigating Direct Connect
Direct Connect is a product offered by AWS that allows a customer to establish peering links between Amazon’s data centers and a third party. Using it, we figured that we could link to Facebook’s infrastructure over multiple redundant 10Gbps links. It was during this research that we found the main blocker:
We have no control over IP addressing in EC2.
While this hadn’t been an issue before, it was impassable if we were to establish links with Facebook, as their internal IP space intersected with that of EC2. After much deliberation, we began to understand that we had one option: migrate to VPC first.
VPC launched in mid-2009 as a companion product to the existing EC2 offering, though it quickly came to be regarded as EC2 2.0, as it remedied many of EC2’s widely acknowledged shortcomings. At face value, the migration didn’t seem conceptually difficult, as VPC was just another software abstraction on top of the same hardware, yet it was much more complex, with a few main issues:
You cannot migrate a running instance.
AWS offers no migration plan.
EC2 and VPC do not share security groups.
This last point lingered in our heads as we tried to come up with a solution. What would it take to make EC2 and VPC talk to each other as if they shared security groups? It seemed insurmountable: we had thousands of running instances in EC2 and we could not take any downtime. We were looking for a solution that would allow us to migrate at our own pace, moving partial and full tiers as needed, with secure communication between both sides.
So, we created Neti, a dynamic iptables-based firewall manipulation daemon, written in Python, and backed by Zookeeper.
Design and Implementation of Neti
Neti is the name of the Sumerian gatekeeper to the Underworld. The name seemed fitting, as we needed an all-knowing gatekeeper that would control access to and from both EC2 and VPC. We had several requirements during the design process:
Security: Neti must keep unauthorized traffic off of our instances on both sides.
Abstraction: Neti must allow the instances on both sides to communicate seamlessly, without knowledge at the application layer about where each instance was located.
Automation: We have too many instances to curate access lists, so Neti must be fully aware of any instance changes in either network. It also must be deployable and upgradeable using configuration management software.
Performance: All of this must occur without any significant increases in latency.
Due to the lack of integration options between EC2 and VPC, the only route for communication is over the instances’ public interfaces. Using EC2 security groups, each group would need to be aware of the public IP of every instance with which it communicated. If we had tens of instances (or even hundreds), this might have been manageable. Yet, with thousands of instances on each side, trying to control the security group access lists would be unwieldy. Also, security groups in EC2 tend to negatively affect network performance as the number of rules increases.
So, we looked towards iptables to provide the security we needed. Iptables is the standard for Linux packet filtering, and can scale much better than EC2 security groups. Each instance has its own iptables firewall, and Neti manipulates the iptables rules as needed. Once the instances are locked down with iptables, the EC2 and VPC security groups get opened up to allow all traffic from any public AWS IP range.
Additionally, iptables helps to provide a mechanism for achieving our desired abstraction layer. Neti assigns each instance an “overlay” IP address which is used at the application layer for communication. This IP is configured as a DNAT record, pointing at the instance’s IP. This way, the application sees the same IP regardless of the instance’s location in EC2 or VPC.
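As an illustration, a Neti-style DNAT rule might be generated like this. The rule shape follows standard iptables syntax, but the function name and addresses are hypothetical, not Neti’s actual code:

```python
# Hypothetical sketch: build the iptables command that rewrites traffic
# sent to an instance's overlay IP so it is delivered to its real IP.
def dnat_rule(overlay_ip, real_ip):
    """Return an iptables DNAT command mapping overlay_ip -> real_ip."""
    return (
        "iptables -t nat -A OUTPUT -d {overlay} "
        "-j DNAT --to-destination {real}"
    ).format(overlay=overlay_ip, real=real_ip)

rule = dnat_rule("192.168.0.5", "10.0.1.17")
```

The application only ever connects to `192.168.0.5`; whether the packets actually land on a private VPC address or a public EC2 address is decided entirely by the rule Neti installs.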
All of this is coordinated by Zookeeper, which keeps all of the registration information about every running instance.
There are three components to the system: the Neti daemon, a Zookeeper cluster, and a set of Zookeeper proxies.
The Neti software must be run on each instance.
You must run a Zookeeper cluster within VPC, configured with its own security group.
As we’ll need instances in both EC2 and VPC to communicate with Zookeeper, there must be EC2 instances set up in an identical arrangement to the VPC cluster, placed within their own security group, and set up to proxy all requests to the VPC cluster. This security group, as well as that of the VPC cluster, must allow all Zookeeper traffic between them on their public instances. For ease of instance replacement, it’s a good practice to attach Elastic IPs to these. These are the only instances that will not run Neti, as Neti relies on these clusters to operate.
Neti Instance Lifecycle
Let’s say we have three instances: Hudson, Sierra, and Walden. Hudson and Sierra live in EC2, and Walden has already been migrated to VPC.
As the Neti daemon starts on Hudson, it begins the registration:
Neti contacts Zookeeper1-proxy and, using its instance ID, asks whether it has ever been registered. If found, it gets the same overlay IP as before. If not, it randomly chooses an available overlay IP and locks it to this instance ID.
Neti sends up the IP information and network location to Zookeeper to complete registration.
Neti downloads the current list of running instances from Zookeeper, including all of their public, private, and overlay IPs, as well as the network they live in.
The list is parsed, and iptables filter and DNAT rules are generated for each of the entries.
Neti sets a watch on the Zookeeper instance list.
Concurrently, as soon as step 2 finishes, all the rest of the registered instances get their Zookeeper watches triggered with the new set of instance data, and their iptables configs get updated automatically.
Once this dance is complete, all of the instances have full access to each other, and are successfully blocking any unauthorized traffic. If another instance spins up, this process starts again; if any instance dies, Zookeeper notifies all of the Neti daemons of the change and rules are updated within seconds across the entire fleet.
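The overlay-IP allocation from step 1 can be sketched in a few lines of Python. Here a plain dict stands in for Zookeeper, and the 192.168.0.0/24 pool is an invented example, not Neti’s real range:

```python
import random

# Toy registry sketch: allocate a stable overlay IP per instance ID.
# The dict stands in for Zookeeper; the pool below is illustrative only.
POOL = ["192.168.0.%d" % i for i in range(1, 255)]

def register(registry, instance_id):
    if instance_id in registry:
        # Previously registered: hand back the same overlay IP.
        return registry[instance_id]
    free = sorted(set(POOL) - set(registry.values()))
    ip = random.choice(free)       # randomly choose an available IP
    registry[instance_id] = ip     # "lock" it to this instance ID
    return ip
```

The important property is idempotence: an instance that restarts Neti, or is asked twice, always gets the same overlay IP back.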
Overlay IPs and you
The overlay IP makes all instances agnostic to the location of the target instance. For example, let’s say that both Walden and Sierra are frontend instances and Hudson is a database server. Due to latency and security, you do not want Walden to communicate with Hudson over public IPs. Yet, you need Sierra to access Hudson over public IPs, as Hudson is still in EC2. With overlay IPs you do not have to build two different DB configs for each side, continuously updating and shipping the changes as you migrate servers to VPC. You simply use the overlay IP, and Neti handles choosing the optimal path.
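The path-choosing logic described above reduces to something like the following sketch; the field names are illustrative, not Neti’s actual schema:

```python
def target_ip(instance, my_network):
    """Pick the real address behind an overlay IP: the private IP when
    both instances share a network, the public IP across the EC2/VPC gap.
    Field names here are assumptions for illustration."""
    if instance["network"] == my_network:
        return instance["private_ip"]
    return instance["public_ip"]

walden = {"network": "vpc", "private_ip": "10.0.1.5", "public_ip": "54.1.2.3"}
```

So a VPC caller reaching Walden gets `10.0.1.5`, while an EC2 caller reaching the same overlay IP gets `54.1.2.3`.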
Performance issues, or how I learned to stop restoring ip lists and love ipset
In v1 of the design, we were purely using iptables to manage all of the NAT and filter rules, leveraging the built-in iptables-save and iptables-restore tools. However, as we tested larger numbers of rules in iptables we started to hit performance issues. The problem occurs because every single request must do an O(n) lookup on the iptables filter to determine if it may proceed. Testing 8000 rules on one of our Memcached instances caused the network throughput to drop by an order of magnitude.
Clearly, we couldn’t continue if this was the case, so we looked elsewhere. After finding an article on iptables performance, we switched to using ipset for the implementation. It had a few major benefits:
It stored the list of IPs in a hash table in memory, offering O(1) membership checks.
The set of IPs could be updated on the fly with simple command line tools.
It provided more peace of mind about the system, because we did not have to reload the entire iptables filter each and every time a host changed.
With ipset in place, we tested well over 8000 rules without any degradation of network throughput.
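The difference in approach is easy to see in a sketch: one ipset of allowed source IPs plus a single iptables rule referencing it, instead of one rule per host. The set name below is made up:

```python
# Sketch of the ipset approach: build one hash set of allowed source IPs
# and a single iptables rule that matches against it. Adding or removing
# a host is then one "ipset add/del", not a full iptables reload.
def ipset_commands(allowed_ips):
    cmds = ["ipset create neti-allowed hash:ip"]
    cmds += ["ipset add neti-allowed %s" % ip for ip in allowed_ips]
    cmds.append(
        "iptables -A INPUT -m set --match-set neti-allowed src -j ACCEPT")
    return cmds
```

However long the membership list grows, the iptables side stays at one rule, and the kernel checks set membership in constant time.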
For a migration like this, we knew that preparation was key. There were going to be parts of the system on either side of the “gap” at all times, with different timelines, strategies, and requirements for each tier. A quick stop/start forklift of the tiers would not work without downtime, so the entire migration had to be a finesse game, with the end-to-end process mapped out, analyzed, monitored, and executed.
First, we had to take stock of everything. Everything. We spent a good deal of time cataloging every system in the fleet (remember spreadsheets?), along with their individual migration strategies and the potential problems that they may have. This effort had a three-pronged benefit:
We established confidence that each system could be migrated after planning out each step.
We constructed a high level view of our infrastructure and understood the weak points in terms of failover scenarios.
We found many systems that were either unnecessary or suitable for consolidation.
This migration (and the one to follow) allowed us to distill our infrastructure into a core set of critical systems, greatly easing migration and management going forward.
Tag all the things!
With the sheer number of instances we had to migrate, we relied on the instance-tagging feature of EC2. Most of our instance reporting and lookup tools were already built upon tags, tracking instance name, role, and some Chef attributes, so adding some new metadata was simple. We used tags to track the installed Neti version, the overlay IP, and various Neti state information.
Tags also became useful in monitoring the process. We could run ad hoc reports to see how many instances had been migrated, how many were running old versions of Neti, and whether any weren’t running Neti at all. We built some of these scripts into Sensu checks as well, so that we could be alerted to any issues. Having an arsenal of scripts to constantly watch the progress and status was essential to a smooth migration.
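An audit over that tag data might look like the following sketch, operating on instance records as you’d get from a describe-instances style call; the `neti-version` tag key is an assumption:

```python
# Illustrative audit over EC2 tag data: find instances with no Neti tag
# at all, and instances tagged with an outdated Neti version.
def audit(instances, current_version):
    missing = [i["id"] for i in instances
               if "neti-version" not in i["tags"]]
    stale = [i["id"] for i in instances
             if i["tags"].get("neti-version") not in (None, current_version)]
    return missing, stale
```

Wired into a Sensu check, a non-empty `missing` or `stale` list becomes an alert rather than a surprise mid-migration.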
With Neti distributed throughout our infrastructure, we proceeded to flip on all access to public traffic into the AWS security groups. At this point, Neti controlled all access into these systems, and we could begin moving server tiers. Thanks to Neti, the entire infrastructure appeared to be operating on one large, flat network, simplifying the migration strategy.
Most tiers were migrated by bringing up identical tiers in VPC and cutting traffic over. For example:
Django: Our frontend tiers were stateless, so this was a simple matter of bringing up new hosts in VPC, and shutting down the old ones.
PostgreSQL: Using the streaming replication built into version 9.2, we were able to bring up master/slave replica sets in VPC, and cut to them with application-level controls. With this method, two of us cut over our entire DB fleet to VPC in less than 2 hours once the tiers were online and replicating.
Cassandra: The VPC hosts were brought online as members of another datacenter with respect to Cassandra configs, so replication and migration was simple.
Redis: New master/slave replica sets were synced to the current production slave (to avoid BGSAVE on the master) and the configs were cut over.
All in all, we had fewer issues than we had expected, especially considering this complex migration strategy. One problem was with conntrack. I can hear the groans already. Conntrack was a necessary evil in this scenario so that iptables did not need to be parsed on every connection. Here’s what happened: The instances did not have conntrack enabled, so when Neti was started, it built and loaded the iptables rules, which in turn enabled conntrack. On an instance with a lot of traffic, for example a Memcached instance, it takes a matter of seconds to overwhelm the default max of conntrack entries (65536). Then, new connections are denied, making the instance rather useless.
Mitigating this issue was rather simple. We had Chef run modprobe for the conntrack module and raise nf_conntrack_max to a much higher limit before installing and starting Neti.
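Sketched as the commands Chef would run before Neti starts, the fix amounts to two lines; 262144 here is an illustrative limit, not necessarily the value we used:

```python
# The pre-Neti conntrack fix, expressed as the shell commands a Chef
# recipe would execute: load the module, then raise the entry ceiling
# above the 65536 default before iptables rules ever get installed.
CONNTRACK_FIX = [
    "modprobe nf_conntrack",
    "sysctl -w net.netfilter.nf_conntrack_max=262144",
]
```

Ordering is the whole point: if the limit is raised only after Neti loads its rules, a busy instance can fill the table before the fix lands.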
It took just less than 3 weeks to migrate everything to VPC. In the end, we built a large set of skills and guidelines that would help our next migration go just as smoothly. The main takeaways were:
Document everything. A well-documented infrastructure ensures that you will not forget any dependencies during the migration, and well thought out migration plans for each tier minimizes the roadblocks that you will hit. During the migration, you’ll be running in a heterogeneous environment, and you don’t want to get stuck there while you figure out the next steps.
Tooling can make or break a project. Investing time in Neti and expanding our tools to track the Neti rollout and audit the migration status were key to the migration being successful. Your tooling deserves love.
Don’t fear the low-level. Getting down and dirty with iptables enabled this migration. This meant spending a lot of development time much closer to the kernel than normal, but this feat would not have been possible otherwise.
We have open sourced Neti and its companion Neti-cookbook for Chef. We hope that many people can benefit from our work on the problem and that it will ease their migration into VPC.
By Nick Shortway, Instagram infrastructure engineer