Aug 20, 2020

Case Study: FLG Business Technology

In late 2019 I was approached by FLG Business Technology to submit a proposal to refresh their existing infrastructure.

FLG is a CRM product built around workflows that automate and control processes. The company is UK-based and was founded over 11 years ago.

Existing Setup

In brief, they had a number of PHP servers running behind a highly available pair of load balancers, which connected to a MySQL database cluster for data and a fileserver cluster for static resources.

The data size for both the MySQL cluster and the fileserver cluster was in the terabyte+ range, and the number of files was in the tens of millions.

Additionally, they had backup servers and monitoring servers. Backups were taken at the filesystem level, which often made both backing up and restoring time-consuming.

Requirements

FLG's infrastructure at the time was split over several dedicated servers. It worked very well and was extremely reliable and responsive, but they were concerned that aspects of the setup were fragile.

The main issue was that too much of their infrastructure, and any capacity changes, required manual configuration.

They wanted to modernise their infrastructure to increase the overall reliability, and also centralise their infrastructure under one provider. They were looking to reduce risk, replace tacit knowledge with documentation, and save admin time.

Their requirements included:

● Simplify their infrastructure by moving everything under one provider;

● The application must continue to be at least as responsive as it was before the migration;

● A high availability solution that can deliver an uptime of ~99.99%;

● Clear documentation on everything and how different services interact;

● Strict, centralised config management with controls;

● Automated scaling of capacity up or down.
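To put the ~99.99% uptime requirement into concrete terms, the downtime it allows can be worked out with simple arithmetic. The sketch below is illustrative only and was not part of the project; the function name and the choice of periods are assumptions made for this example.

```python
# Illustrative only: convert an availability target into a downtime budget.
# The function name and the periods chosen are assumptions for this example.

MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability: float, days: float) -> float:
    """Return the minutes of downtime permitted over `days` days
    for an availability target given as a fraction between 0 and 1."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * days * MINUTES_PER_DAY


if __name__ == "__main__":
    target = 0.9999  # the ~99.99% figure from the requirements list
    for label, days in (("year", 365), ("30-day month", 30), ("week", 7)):
        budget = downtime_budget_minutes(target, days)
        print(f"{label}: {budget:.2f} minutes of downtime allowed")
```

At 99.99% this works out to roughly 52.6 minutes over a 365-day year, or a little over 4 minutes in a 30-day month, which leaves very little room for manual intervention during an incident.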

Implementation

I began by proposing three other requirements for a successful migration:

● Remove the need for bespoke management as much as possible;

● Avoid vendor lock-in as much as possible;

● Ensure that user-focused performance is not negatively impacted as a result of the changes.

I also expanded their configuration management requirement to cover the infrastructure itself.

Silvermouse proposed migrating their entire infrastructure to Amazon Web Services with the entire architecture defined as Infrastructure as Code (IaC) written in Terraform.

We first moved all of their existing development into a CI/CD pipeline which packaged the application into a container (initially deployed onto their existing servers), and expanded monitoring of the current response times. We also split out any other functionality from the application servers to ensure that the servers were interchangeable and disposable (cattle, not pets). In addition, we made code changes to ensure that all static resources could be served from Amazon S3.

Next we defined two clusters - production and staging - in Terraform inside separate VPCs to ensure that changes could be rolled out without affecting the production workload. All of the Terraform configuration was stored in Git with the shared state stored inside DynamoDB.

We migrated all of their files to Amazon S3, their databases to Amazon RDS (using replication to ensure a smooth migration), and their application servers to ECS/Fargate. Production instances were replaced gradually to minimise the impact on their customers.
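The idea behind replacing production instances slowly can be sketched as a batch-by-batch rollout that halts as soon as a replacement fails a health check. This is a simplified illustration, not the tooling used in the migration and not how ECS works internally; `rolling_replace`, `launch` and `healthy` are hypothetical names introduced for this example.

```python
# Illustrative sketch of a gradual (rolling) replacement.
# All names here are hypothetical; a real rollout would call the
# provider's APIs in place of the `launch` and `healthy` callbacks.
from typing import Callable, List


def rolling_replace(
    old: List[str],
    launch: Callable[[str], str],
    healthy: Callable[[str], bool],
    batch_size: int = 1,
) -> List[str]:
    """Replace instances a batch at a time.

    For each batch, launch replacements and check their health. Only if
    every replacement is healthy are the old instances swapped out.
    Raises RuntimeError on the first unhealthy batch, so later batches
    are never touched.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    current = list(old)  # work on a copy; the caller's list is untouched
    for start in range(0, len(old), batch_size):
        batch = old[start:start + batch_size]
        replacements = [launch(name) for name in batch]

        failed = [r for r in replacements if not healthy(r)]
        if failed:
            raise RuntimeError(f"rollout halted, unhealthy replacements: {failed}")

        # Swap each old instance for its healthy replacement, in place.
        for old_name, new_name in zip(batch, replacements):
            current[current.index(old_name)] = new_name
    return current


if __name__ == "__main__":
    fleet = ["app-1", "app-2", "app-3"]
    result = rolling_replace(
        fleet,
        launch=lambda name: f"{name}-new",
        healthy=lambda name: True,
        batch_size=1,
    )
    print(result)
```

A small batch size keeps most of the fleet serving traffic at any moment, at the cost of a longer rollout.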

All of the persistent data was replicated and backed up to another region for disaster recovery, and all of the application and infrastructure configuration data was stored in Git, so backups of the application instances were not required.

Outcome

Once the migration was complete, FLG had all of their infrastructure under one provider - AWS - with the entire solution clearly defined in Git.

All of the infrastructure and ephemeral data could easily be redeployed from Git, and all persistent data was available in another region. This design almost entirely eliminated their operational expense.

Their confidence in the infrastructure's HA aspects and ability to scale up or down increased. Additionally, the average end-user response time for the application improved by around 45%.