
Making Consul run faster across data centers

Pavel Suchman
AppsFlyer Engineering
7 min read · Aug 3, 2023


As always, at the beginning of a quarter, our team lead named the tasks that needed to be done and asked who wanted to do what. I said I wanted to do “Consul to Consul Sync” (which might appear later in the text as C2CS), because it sounded interesting and kind of low-level. I have never written something like that and wanted to try.

I was happy to finally do a project of my own!

The initial training period at AppsFlyer is quite long. After finishing it, new team members have to do a number of onboarding tasks that teach them about the various systems our team manages.

These tasks are important but not very interesting, so I was excited to do a whole new project. And I would do it in Golang!

Getting real with Golang

I came to AppsFlyer with 12 years of experience, mainly in Python, but also in JavaScript and Ruby. I learned Go as a part of onboarding.

Go is the default platform language at AppsFlyer. Consul itself is written in Go and has a great Go SDK, so using Go to write Consul-to-Consul Sync was a no-brainer.

As any programmer knows, there is a big difference between learning the syntax of a new language and really knowing how to use it. I did a training project in Go and added some tests to our existing projects, but now, I had a chance to use it in a new project and get a real feel for it.

Why do we need to sync between two Consul data centers?

AppsFlyer is currently in the middle of a large migration — we are moving from an old deployment system to a new one based on K8s. One of the main objectives of the migration is to make it safe for the end users — AppsFlyer developers.

If something goes wrong, they can always go back to the old system. We call this approach Side By Side. I was against it at first because of the extra complexity, but it actually saved our skins later, when we had to roll back the migration because another component failed. I work in the team responsible for everything deployment- and service-runtime-related, so the bulk of the migration effort fell on us.

Both the old production system and the new K8s-based one use Consul for service discovery, and each has its own Consul datacenter infrastructure.

To quote Consul's creators, HashiCorp: "Consul is a service networking solution to automate network configurations, discover services, and enable secure connectivity across any cloud or runtime."

We needed to push the changes from the new system to the old one, so the services running in the old system would learn about changes that happened to the services running on the new system.

HashiCorp has a solution for exactly that problem, consul-k8s. Unfortunately, we couldn't use it because of our scale.

AppsFlyer’s scale

Consul-k8s is a batch-based system; by default, it runs every 30 seconds.

We couldn't afford to wait that long.

AppsFlyer receives hundreds of billions of requests per day from more than 90% of the smartphones in the world. Some of the microservices serving these requests run on hundreds of AWS instances, mostly spots, so spot interruptions are common. When a service's configuration changes, we can't afford to wait 30 seconds before propagating the change; that would mean sending live traffic to instances that are already gone.

For those of you not familiar with AWS spots, they are a way of profiting from AWS' spare capacity. AWS lets you use its spare resources at a discount of up to 90% off the regular price, but in exchange, it may take the capacity back and shut your instance down at any moment; this is called spot termination.

We needed a system that was close to real-time. With Consul-to-Consul Sync as it runs today, we get an alert if synchronizing a single service takes more than a second.

A very experienced member of our team, who had worked on that problem before, studied the Hashicorp code and came up with a solution: do less work 🙂

consul-k8s was updating the whole service with all of its instances every time there was a change, so if only one of a service's 300 instances changed, it would still send the list of all 300 to the other side, resulting in a lot of unnecessary network traffic and processing work.

In the design he proposed, we would keep the state of the remote side and send over the network only what had actually changed.
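To make the idea concrete, here is a minimal sketch of the diffing step in Go. The types and function names are hypothetical, not the actual C2CS code; the point is that the pusher compares the latest local snapshot of a service against a cached copy of what the remote side already knows, and only the difference needs to cross the network.

```go
// A minimal sketch of the diff idea (hypothetical types, not the C2CS code):
// the pusher keeps a cache of the instances it has already registered on the
// remote side and only registers/deregisters the difference.
package main

import "fmt"

// Instance is a simplified view of one service instance.
type Instance struct {
	ID      string
	Address string
	Port    int
}

// diff returns the instances to (re)register on the remote side and the IDs to
// deregister, given the latest local snapshot and the cached remote state.
func diff(local []Instance, remoteCache map[string]Instance) (toAdd []Instance, toRemove []string) {
	seen := make(map[string]bool, len(local))
	for _, inst := range local {
		seen[inst.ID] = true
		if cached, ok := remoteCache[inst.ID]; !ok || cached != inst {
			toAdd = append(toAdd, inst) // new or changed instance
		}
	}
	for id := range remoteCache {
		if !seen[id] {
			toRemove = append(toRemove, id) // gone from the local side
		}
	}
	return toAdd, toRemove
}

func main() {
	cache := map[string]Instance{
		"svc-1": {ID: "svc-1", Address: "10.0.0.1", Port: 8080},
		"svc-2": {ID: "svc-2", Address: "10.0.0.2", Port: 8080},
	}
	// svc-2 was terminated (say, a spot interruption) and svc-3 was added.
	local := []Instance{
		{ID: "svc-1", Address: "10.0.0.1", Port: 8080},
		{ID: "svc-3", Address: "10.0.0.3", Port: 8080},
	}
	add, remove := diff(local, cache)
	fmt.Println("register:", add, "deregister:", remove)
}
```

With something like this in place, a spot termination that removes one instance out of 300 results in a single deregistration call instead of re-sending the full list.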

So I was left with the "easy" 🙂 part: implementing his design.

Architecture — it’s always about tradeoffs

Conceptually, the architecture is quite simple. The K8s services that need to be synced over to the old system carry a special annotation. A single watcher listens for K8s service events; upon receiving an event for a service with this annotation, the system creates a synchronizer: a triple of a service watcher, a communication channel, and a service pusher.

The watcher starts watching Consul events for that service and sends them through a channel to the pusher. The pusher listens for events coming from the channel and pushes them to the remote Consul server. It also manages a cache of the remote Consul state to minimize work; this is the part that "knows" to send only the minimal needed updates to the remote Consul.
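For illustration, here is a hedged sketch of what one such synchronizer could look like with the official Consul Go SDK (github.com/hashicorp/consul/api). It is not the production code: the remote address is made up, error handling is simplified, and the diffing against the cached remote state is left out. It only shows the watcher goroutine feeding a channel from Consul blocking queries and the pusher goroutine writing to the other datacenter.

```go
// A simplified per-service synchronizer: watcher -> channel -> pusher.
package main

import (
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

// watchService uses Consul blocking queries to push snapshots of a service
// into a channel as soon as something changes, instead of polling on a timer.
func watchService(local *api.Client, name string, out chan<- []*api.ServiceEntry) {
	var index uint64
	for {
		entries, meta, err := local.Health().Service(name, "", false, &api.QueryOptions{
			WaitIndex: index,           // block until this index changes
			WaitTime:  5 * time.Minute, // long-poll timeout
		})
		if err != nil {
			log.Printf("watch %s: %v", name, err)
			time.Sleep(time.Second)
			continue
		}
		if meta.LastIndex != index {
			index = meta.LastIndex
			out <- entries
		}
	}
}

// pushService listens on the channel and writes updates to the remote Consul.
// The diffing against a cached remote state is elided in this sketch.
func pushService(remote *api.Client, name string, in <-chan []*api.ServiceEntry) {
	for entries := range in {
		for _, e := range entries {
			_, err := remote.Catalog().Register(&api.CatalogRegistration{
				Node:           e.Node.Node,
				Address:        e.Node.Address,
				SkipNodeUpdate: true,
				Service:        e.Service,
			}, nil)
			if err != nil {
				log.Printf("push %s: %v", name, err)
			}
		}
	}
}

func main() {
	// Errors ignored for brevity; "remote-consul:8500" is a made-up address.
	local, _ := api.NewClient(api.DefaultConfig())
	remote, _ := api.NewClient(&api.Config{Address: "remote-consul:8500"})

	ch := make(chan []*api.ServiceEntry, 1)
	go watchService(local, "my-service", ch)
	go pushService(remote, "my-service", ch)
	select {} // block forever in this sketch
}
```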

One of the most important (and somewhat controversial) parts of this design is the idea of using a dedicated synchronizer per service. We have more than 1,000 services, so there is a lot of concurrency going on.

We decided to do it that way instead of going with a single synchronizer for everything, to simplify development and testing. As with all tradeoffs, this one had a cost. By default, the Consul service has a limit of 250 concurrent connections, so we had to change this configuration in production to enable 1,000+ connections.

Good engineering and non-trivial unit tests

Good engineering is a mantra in our group. We mention it every two weeks during iteration kickoffs and use it to settle arguments. Part of it is writing good unit tests with a lot of coverage. It turns out that writing good unit tests for this system is pretty complicated: I had to mock the non-trivial behavior of the actual Consul, like keeping and updating the state of registered services.
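To give a flavor of the approach (the names here are hypothetical, not our actual test code): the pusher can depend on a small interface instead of the concrete Consul client, and the unit tests can swap in an in-memory fake that keeps the state of registered services the way the real Consul would, so a test can assert on the exact remote state a sequence of events should produce.

```go
// A sketch of an in-memory Consul fake for unit tests (hypothetical names).
package consulfake

import (
	"sync"

	"github.com/hashicorp/consul/api"
)

// Registry is the slice of the Consul catalog API that the pusher needs.
type Registry interface {
	Register(reg *api.CatalogRegistration) error
	Deregister(dereg *api.CatalogDeregistration) error
}

// FakeRegistry remembers registrations in memory, mimicking the way the real
// Consul keeps and updates the state of registered services.
type FakeRegistry struct {
	mu       sync.Mutex
	Services map[string]*api.CatalogRegistration // keyed by service ID
}

func NewFakeRegistry() *FakeRegistry {
	return &FakeRegistry{Services: make(map[string]*api.CatalogRegistration)}
}

func (f *FakeRegistry) Register(reg *api.CatalogRegistration) error {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.Services[reg.Service.ID] = reg
	return nil
}

func (f *FakeRegistry) Deregister(dereg *api.CatalogDeregistration) error {
	f.mu.Lock()
	defer f.mu.Unlock()
	delete(f.Services, dereg.ServiceID)
	return nil
}
```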

Development and deployment

It took me about 3 months to write the code and tests and to bring it to a state where it was running on my machine. Next came the deployment to K8s, and I got stuck 🙁

The code simply refused to run, and there was nothing in the logs. I asked the folks on my team; usually, they know everything, but this time nobody did. After lots of head scratching, re-reading my code, and Googling, I understood that our K8s deployment system didn’t support cluster roles.

Regular K8s Roles only grant access within a single namespace, but my system needed to watch for changes across all namespaces, so it needed a ClusterRole to run correctly.
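To illustrate (the annotation name below is made up for the example): listing and watching Services across all namespaces with client-go means using metav1.NamespaceAll, and that call only works if the pod's service account is bound to a ClusterRole; a namespaced Role cannot grant it.

```go
// A sketch of a cluster-wide Service watch that needs a ClusterRole.
package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Watching metav1.NamespaceAll ("") needs cluster-wide list/watch rights,
	// i.e. a ClusterRole bound to this pod's service account.
	w, err := clientset.CoreV1().Services(metav1.NamespaceAll).Watch(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for ev := range w.ResultChan() {
		svc, ok := ev.Object.(*corev1.Service)
		if !ok {
			continue
		}
		// Hypothetical annotation marking services that should be synced.
		if svc.Annotations["consul-sync/enabled"] == "true" {
			log.Printf("%s: %s/%s should be synced", ev.Type, svc.Namespace, svc.Name)
		}
	}
}
```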

After adding support for cluster roles, I could finally start load testing!

Load testing

Everything until that point was based on the assumption that the new system would be fast enough, but the only way to test that was to run it against production load 😨

In the original design, C2CS synchronizes whatever runs on K8s. But we couldn't put all of our microservices on K8s just yet, so I added a new mode in which the user specifies a list of services and C2CS synchronizes only that list. I ran the system in this mode, and thankfully it handled the load and the sync was fast enough.

Current status

Consul-to-Consul Sync is currently running in production 🥂

We have restarted the migration, and 30% of AppsFlyer services are already on K8s! So far, Consul-to-Consul Sync has been running fine, but we will only know how it is really going later, when the majority of AppsFlyer services run on K8s.

Hopefully, the system will stand the test of time and scale and prove itself. There will surely be bugs and calls from PagerDuty in the middle of the night, but that's the nature of software. On the other hand, it's just software, so the solution will be an edit away.

Future developments and lessons learned

The raison d'être of Consul-to-Consul Sync is to enable our migration to K8s. I hope that it will stand this test and Just Work (™).

After the migration is over, we will retire it, but maybe C2CS will become useful for other projects in AppsFlyer — after all, it’s a very generic system. We also want to release it as an open-source project so that others might use and hopefully improve it.

Writing and deploying this system wasn’t easy, but it was also a great experience, and I learned a lot from it.

Interestingly, the hard parts were not what I thought they would be. Deployment problems turned out to be the hardest, followed by writing good unit tests. I expected the code itself to be non-trivial, especially getting service changes and pushing them to the other side, but that proved to be relatively easy, thanks to the good architecture and the great Consul Go SDK.

I wonder how my future projects at AppsFlyer will turn out; for now, I am cautiously optimistic 🙂
