Cloud

W Series: Re-thinking WordPress

My Role
Cloud Architect
Timeline
2021

Rethinking WordPress in the cloud

Anyone who has tried to run a stateful application that relies heavily on local file storage will understand the agony this was, but it was well worth it.

The challenge

  • Extreme difference in visits between on season and off season
  • Robust, self-recovering system (remember, this brand is global)
  • Toughened edge networking and defences
  • Geolocation routing, plus caching that needs to be local

The solution

Kube & Google's secret sauce

One thing was clear from the outset: we needed scaling, and we needed it to be incredibly fast. This eliminated old-school managed instance groups, and K8s became the obvious choice. But how best to go about this? Kubernetes is not a simple system to configure and maintain, especially when engineering and ops teams are used to more traditional hardware. I needed to create something extremely scalable and almost entirely hands-off to maintain, which could recover and heal itself in the event of crashes and other issues.

Enter GKE Autopilot, an outstanding service from Google Cloud Platform. GKE Autopilot is a fully managed mode of GKE: you simply deploy your workload with its spec set out, and Autopilot provisions the nodes and runs it.

This was perfect. I defined everything the workload needs in a single workload file, so the entire site can be redeployed in a couple of minutes with a single command. GKE Autopilot can scale right down to 2 tiny replicas during off season, and all the way up to thousands of CPU cores and terabytes of RAM when needed.

You can read more about how I accomplished this in my blog post: xxxxx
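To give a feel for how that kind of scaling decision works, here is a minimal sketch of the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler. It assumes an autoscaler sits in front of the workload, which is not confirmed above. The CPU figures, the 500-replica ceiling and the function name are made up for illustration; only the floor of 2 replicas comes from the setup described here. The real autoscaler also applies a tolerance band and stabilisation windows, which are left out.

```python
import math


def desired_replicas(current_replicas, current_cpu, target_cpu,
                     min_replicas=2, max_replicas=500):
    """Simplified HPA rule: ceil(current * currentMetric / targetMetric), then clamp.

    min_replicas=2 mirrors the off-season floor; max_replicas=500 is a
    hypothetical ceiling chosen only for this example.
    """
    raw = math.ceil(current_replicas * current_cpu / target_cpu)
    return max(min_replicas, min(max_replicas, raw))


# Off season: 2 replicas idling at 10% CPU against a 60% target stay at the floor.
print(desired_replicas(2, 10, 60))   # 2

# Race weekend: the same 2 replicas running at 240% of their CPU request.
print(desired_replicas(2, 240, 60))  # 8
```

On Autopilot the node capacity behind those extra replicas is provisioned for you, which is what makes a rule this simple enough to lean on.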

Networking

So we've got our hosting done. How do we handle traffic?

Firstly, GKE handles all the internal load balancing for our application but, as stated in the challenge, this is a global brand. This being a racing brand, I had a need for SPEED, so in comes Cloudflare. Cloudflare provides a very fast edge network with a huge number of points of presence (PoPs) all over the world. Caching assets is nice and all, but normal caching still requires a round trip to the server for the page itself (the manifest of what to load) before the assets can be served from the CDN.

Cloudflare global load balancing also lets us route traffic automatically, at the edge, to our clusters located around the world.
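The idea behind that kind of routing can be sketched in a few lines: send each visitor to the preferred cluster for their region, and fall back to the next healthy one if it is down. This is only an illustration of the concept, not Cloudflare's implementation. The region names, cluster names and preference orders below are all hypothetical.

```python
# Hypothetical preference order per visitor region (nearest cluster first).
REGION_POOLS = {
    "europe": ["eu-cluster", "us-cluster", "asia-cluster"],
    "americas": ["us-cluster", "eu-cluster", "asia-cluster"],
    "asia": ["asia-cluster", "eu-cluster", "us-cluster"],
}

# Hypothetical order used for regions that have no explicit mapping.
DEFAULT_ORDER = ["eu-cluster", "us-cluster", "asia-cluster"]


def pick_cluster(region, healthy):
    """Return the first healthy cluster in the region's preference order, or None."""
    for cluster in REGION_POOLS.get(region, DEFAULT_ORDER):
        if cluster in healthy:
            return cluster
    return None


# All clusters healthy: a visitor in Asia stays in Asia.
print(pick_cluster("asia", {"eu-cluster", "us-cluster", "asia-cluster"}))  # asia-cluster

# The Asian cluster is down: the same visitor fails over to Europe.
print(pick_cluster("asia", {"eu-cluster", "us-cluster"}))  # eu-cluster
```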

Normal asset caching also doesn't offload that much to the edge if you still need to hit the server every time and suffer international latency on the initial request. I wanted to go even faster! Introducing Cloudflare APO (Automatic Platform Optimization). APO is essentially spooky computer magic. It lets you cache the entire response. Yes, that's right: the entire server response can be cached. This means you can essentially serve the entire site from Cloudflare's points of presence and forget about your cluster, which saves a lot of money and makes for very happy global users. Cloudflare just gets a cache clear request from WordPress when content is updated.

Warning: you need to build any dynamic sections of your site in JavaScript, or as separate assets that you can exclude with page rules from APO's aggressive caching.
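As a rough sketch of what such an exclusion list boils down to, the snippet below decides whether a request path should skip the full-page cache. The patterns are assumptions for illustration only (the `/live-timing/*` path in particular is hypothetical), and in practice the rules live in Cloudflare's configuration rather than in application code.

```python
from fnmatch import fnmatchcase

# Hypothetical paths that must always reach the origin instead of the edge cache.
BYPASS_PATTERNS = [
    "/wp-admin/*",     # editor and admin screens
    "/wp-login.php",   # login form
    "/wp-json/*",      # REST API calls made by JavaScript on the page
    "/live-timing/*",  # made-up example of a dynamic section
]


def should_bypass_cache(path):
    """Return True when the path matches any pattern excluded from caching."""
    return any(fnmatchcase(path, pattern) for pattern in BYPASS_PATTERNS)


print(should_bypass_cache("/wp-admin/post.php"))  # True
print(should_bypass_cache("/news/race-report/"))  # False
```

Everything that returns False here is a candidate for being served straight from the edge.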

Storage

It's time to address the elephant in the room. You're all thinking it: how do we handle local files, especially uploaded assets (particularly when there are many gigabytes of them)?

Like everyone, I looked at storage buckets with low-latency settings. The problem is that you need to do some work to trick WordPress into looking in other places, and it introduces a lot of latency. Despite running in the same datacentre with everything set up for the lowest latency possible, latency continued to be an issue.

So why don't we use persistent disks? Well, as you might not be aware, a persistent disk cannot be written to from many nodes, only read from many. This means only one node could write to the disk at a time, which causes all sorts of mounting issues and makes for an unreliable editing platform for the website administrators.

In comes Google Filestore, an NFS-based service for mounting extremely low-latency, high-performance drives over the VPC. Google does not lie when it says high performance: the latency with this method is very similar to directly mounted disks but, more importantly, it allows us to read and write to the drive from across the cluster, no matter how many nodes we have.
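In Kubernetes terms, the difference comes down to the access mode on the volume claim: a persistent disk gives you ReadWriteOnce or ReadOnlyMany, while an NFS-backed volume can offer ReadWriteMany. Below is a minimal sketch that builds such a claim as JSON, which kubectl accepts alongside YAML. The claim name, the 1Ti size and the `standard-rwx` storage class are assumptions for illustration, not the values used on this project.

```python
import json


def shared_uploads_claim(name="wordpress-uploads", size="1Ti",
                         storage_class="standard-rwx"):
    """Build a PersistentVolumeClaim that many pods can mount read-write.

    All three defaults are hypothetical; check which storage classes your
    cluster actually exposes for Filestore before relying on the name.
    """
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # ReadWriteMany is the mode a single persistent disk cannot provide.
            "accessModes": ["ReadWriteMany"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


print(json.dumps(shared_uploads_claim(), indent=2))
```

Every WordPress replica can then mount the same claim at its uploads directory, so an editor's upload is visible to all pods at once.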

The Result

The result of this work is a globally available, scalable website that is capable of taking on huge traffic loads, allowing W Series to handle its heavy traffic spikes on season and keep costs low off season.
