That’s a lot of sites!
Yes, it is! More than 10 years ago we migrated 200 sites to a new server, and in 2024 we set up Cloudflare protection for well over 1000 sites.
Aegir is a hosting system built in Drupal, for Drupal. It lets you create and manage Drupal sites and all their databases, filesystems and virtual hosts. With Aegir, it’s easy to manage hundreds or thousands of sites via a simple UI. Each site has a node to represent it, and this project stored a whole bunch of additional Cloudflare metadata against the Site Nodes.
Keeping a PaaS product online at all times comes with a high level of responsibility. After code quality assurance and testing, DDoS attacks of all sizes and types are a high-risk threat. The cost of protecting our availability, unsurprisingly, was non-trivial and became a point requiring fresh research and investment. Reducing the general load and the potential attack load on our servers would serve to support our quality of service.
In the spring of 2024 we set up a proof of concept using Cloudflare, which would allow us to make a significant ongoing cost saving whilst also playing with some really cool APIs.
The plan
In order to put all our sites behind Cloudflare, we needed to:
* Get Aegir talking to Cloudflare via their API, and build the automated processes to support setup
* Create a clear interface for starting and tracking setup per site
* Create a clear dashboard for tracking progress overall
* Go! Change the nameserver records for every domain, to point to Cloudflare
Here are some of the most interesting parts of our story (which had negligible downtime, btw!)
Interacting with Cloudflare’s API
It was greatly pleasing to find that Cloudflare lets you do almost everything over an API (given the right authentication!). Given the scale of their company, it shouldn’t be a huge surprise I suppose!
With approximately 1000 sites to deal with, there was a strong drive to automate as much of this process as possible. So there was much celebration when it became clear that we would be able to do almost every step automatically, including:
* create everything we needed for the site domain, with an appropriate plan under our chosen account
* upload DNS records, query them and update them
* perform real-time ownership verification and SSL validation
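To give a flavour of the first two steps, here is a minimal Python sketch (not our actual Aegir code) that builds the requests for creating a zone and adding a DNS record through Cloudflare's v4 REST API. The helper names, the token and the account ID are all placeholders made up for illustration, and plan selection, verification and error handling are left out.

```python
import json
import urllib.request

API_BASE = "https://api.cloudflare.com/client/v4"


def build_request(method, path, token, payload=None):
    """Build (but do not send) an authenticated Cloudflare API request."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    return urllib.request.Request(
        f"{API_BASE}{path}",
        data=data,
        method=method,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def create_zone_request(domain, account_id, token):
    """Request that creates a full (nameserver-delegated) zone for a domain."""
    payload = {"name": domain, "account": {"id": account_id}, "type": "full"}
    return build_request("POST", "/zones", token, payload)


def create_dns_record_request(zone_id, record, token):
    """Request that adds one DNS record (a dict with type/name/content)."""
    return build_request("POST", f"/zones/{zone_id}/dns_records", token, record)


def send(request):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping the request building separate from `send` makes each step easy to inspect or log before anything touches a live account, which matters when the same code is about to run against hundreds of domains.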
Simple admin setup - Obtain DNS records for hundreds of domains, store them against Site Nodes, press GO
A large portion of the domains (about 500) are under our control, so we were able to bulk export the DNS records, process and save them against their Aegir Site Nodes. Some simple processing removed the SOA and NS records so that we would be able to send the records straight to Cloudflare when setup started.
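The SOA and NS clean-up can be sketched in a few lines of Python. This assumes each exported record has already been parsed into a dict with a `type` key, which is an illustrative shape rather than the real export format.

```python
# Record types that Cloudflare manages itself once it hosts the zone.
EXCLUDED_TYPES = {"SOA", "NS"}


def strip_authority_records(records):
    """Return the records with SOA and NS entries removed."""
    return [r for r in records if r["type"].upper() not in EXCLUDED_TYPES]


exported = [
    {"type": "SOA", "name": "example.org", "content": "ns1.old-host.example."},
    {"type": "NS", "name": "example.org", "content": "ns1.old-host.example."},
    {"type": "A", "name": "example.org", "content": "192.0.2.10"},
    {"type": "MX", "name": "example.org", "content": "mail.example.org"},
]
cleaned = strip_authority_records(exported)
print([r["type"] for r in cleaned])
```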
These ‘easy’ sites, for which we had the DNS records, would be processed in bulk with a lot of Go! button clicks, followed by the relevant nameserver changes with the domain provider.
(The domain provider did offer to do bulk updates for us, but there seemed to be a 24h delay before action was taken - so it was quicker to do these changes ourselves.)
Creating a self-service setup mechanism via Drupal config + settings.php
Domains that we only had nameserver control for would have to be updated by the customer. But how would they know which nameservers to set up, and how would they trigger the setup process? Stats to the rescue!
As part of the Drupal 10 rebuild of the platform we added a Statistics module, which collects a selection of data points from each site and passes them to the corresponding node on Aegir for storage. They’re then aggregated and sent back to the sites so that customers can compare their performance to the cohort averages.
We created a form interface for the user to trigger setup when ready, and then smuggled the outputs over to Aegir in amongst the performance stats 🤭
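The piggybacking idea looks roughly like this. It is a hypothetical Python sketch rather than the real Drupal module: the `cloudflare_setup` key, the stat names and both function names are invented for illustration.

```python
SETUP_KEY = "cloudflare_setup"  # hypothetical key, not the real field name


def build_stats_payload(stats, setup_requested):
    """Site side: add the setup request to the regular stats payload."""
    payload = dict(stats)
    if setup_requested:
        payload[SETUP_KEY] = {"requested": True}
    return payload


def split_stats_payload(payload):
    """Aegir side: separate the performance stats from the setup request."""
    stats = {k: v for k, v in payload.items() if k != SETUP_KEY}
    setup = payload.get(SETUP_KEY, {})
    return stats, bool(setup.get("requested"))


payload = build_stats_payload({"page_views": 1200, "active_users": 35}, True)
stats, wants_setup = split_stats_payload(payload)
```

The appeal of this route is that it reuses a channel the sites already had with Aegir, so no new endpoint or credentials were needed.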
This self-service route did still require a lot of chasing, but generally performed well, as it allowed users to make the nameserver change at a time of their choosing rather than requiring scheduled calls and appointments on top of an already high administrative load.
Surfacing useful data, stats and buttons for the team
When developing a technical tool that ideally needs a single fire-and-forget button to kick things off, you not only need that one button but also a lot of clear visual cues to help others understand what’s going on.
We tied the setup steps to interface outputs, with clear dependency messaging and reporting.
Reporting eventually included a message in our Slack channel ⭐️
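As a rough illustration of the dependency messaging, here is a small Python sketch that derives a per-step status for one site. The step names and their order are assumptions based on the process described in this post, not the exact steps stored on the Site Nodes.

```python
# Each step lists the steps that must be finished before it can start.
STEPS = {
    "create_zone": [],
    "upload_dns_records": ["create_zone"],
    "change_nameservers": ["upload_dns_records"],
    "verify_ownership_and_ssl": ["change_nameservers"],
}


def step_status(step, completed):
    """Return 'done', 'ready', or a message naming the blocking steps."""
    if step in completed:
        return "done"
    missing = [dep for dep in STEPS[step] if dep not in completed]
    if missing:
        return "blocked: waiting on " + ", ".join(missing)
    return "ready"


def site_report(completed):
    """Status of every setup step for one site."""
    return {step: step_status(step, completed) for step in STEPS}


report = site_report({"create_zone"})
```

A report like this can feed a per-site interface, an overall dashboard and a chat notification from the same data.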
The results
After just a few weeks, we had 990 sites set up on Cloudflare - 90% of the most important setups. It turned out to be very difficult to get hundreds of different people, groups and stakeholders to make DNS nameserver changes quickly (even when you tell them it’s urgent!), so the process would continue a little longer.
Already we can measure the success - thanks to Cloudflare’s caching, we’ve seen decent reductions in bandwidth use and the number of requests hitting the server.