A Case Study on De-duplication

In early 2019, a customer approached us about an issue with the file storage aspect of their CMS application.

Their Infrastructure

Their application provided a CMS for several of their clients; part of it handled the generation of invoices, contracts, and emails.

Each application server expected the files to be stored in a locally-accessible directory structure, and required the files to be near-instantly available at all times.

The files were stored in one of two XFS partitions, mounted over NFS from a pair of dedicated file servers. The servers used DRBD and a floating IP so that the share could be brought back online quickly if the primary server failed.

The Problem

The files themselves never changed - they were only ever created or deleted - and the number of files added or removed each day was fairly low. However, due to the time needed to traverse the directory structure, backups were taking too long: daily backups were often running for more than 24 hours, which could not continue (and would only get worse).

For this case study, we focused on one of the file shares, which totalled around 15 million files, another 15 million directories, and 3TB of data.

Proposed Solution

The majority of the files generated were attachments from emails, which included images in signatures and so were often repeated. We suspected that removing duplicates would be a good first step.

We wanted the de-duplication to be as transparent to the application as possible, and ruled out rewriting the application at this first stage.

We decided to create a pool/ directory per client that would contain the only real copies of the files. Each pooled file would have a checksum of its contents appended to its original filename, and each original file would be replaced with a symbolic link pointing to filename_checksum. This naming convention would ensure that two copies of the same filename would only exist for a client if their contents actually differed.

The Perl script we wrote to achieve the above is linked at the end of this post. It takes a list of relative filepaths over STDIN, and will de-duplicate into the current directory by default, allowing you to run find . -type f | dedupe.pl .
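
As an illustration of the core step, here is a simplified sketch (not the production script): it reads relative file paths from STDIN, checksums each file, moves it into pool/ as filename_checksum, and replaces the original with a symbolic link. The choice of SHA-256 is an assumption - the post does not specify the checksum algorithm - and the real script also generates relative links and handles edge cases this sketch skips. It assumes it is run from within a single client's directory tree, so there is one pool/ per client.

    #!/usr/bin/perl
    # Minimal sketch of the de-duplication step described above.
    # Not the production script: no relative links, minimal error handling.
    use strict;
    use warnings;
    use Cwd qw(abs_path);
    use Digest::SHA;
    use File::Basename qw(basename);
    use File::Copy qw(move);
    use File::Path qw(make_path);

    my $pool = 'pool';
    make_path($pool);

    while (my $path = <STDIN>) {
        chomp $path;
        next unless -f $path && !-l $path;   # skip anything already converted

        # Checksum the file contents (SHA-256 is an assumption; the post does
        # not say which algorithm the real script uses)
        my $checksum = Digest::SHA->new(256)->addfile($path, 'b')->hexdigest;

        # pool/<original filename>_<checksum> is the single "real" copy
        my $pooled = "$pool/" . basename($path) . "_$checksum";
        if (-e $pooled) {
            unlink $path or do { warn "unlink $path: $!"; next };
        } else {
            move($path, $pooled) or do { warn "move $path: $!"; next };
        }

        # Replace the original with a symlink to the pooled copy (absolute
        # here for simplicity; the real script generates relative links)
        symlink abs_path($pooled), $path or warn "symlink $path: $!";
    }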

The Outcome

The de-duplication was far more effective than we had initially hoped. We had expected to remove around 70% of the files, but for some of their clients the reduction was as high as 99.7%.

In total, we reduced the number of files from 16,685,000 to 2,791,560, a reduction of 83.27%. The size on disk was reduced from 2.99TB to 1.14TB, a 61.9% reduction.

The backups - which traverse both file shares - went from taking 24-28 hours to 17-20 hours, solving the initial problem. The reduced amount of data also makes future improvements much simpler to tackle.

Limitations

  • Increased inodes:
    Files, directories, and symbolic links in Linux each require an inode. Replacing a file with a symbolic link does not increase inode usage, but the de-duplication pool does. If potential inode exhaustion in the partition had been an issue, we would have needed to implement this differently.
  • Increased complexity of symbolic links:
    Unlike hard links, symbolic links are not transparent to the application, and their targets are resolved on whichever machine reads them, so if the directory structure on the file server and the application servers differs, relative links need to be used. The differing directory structure was a factor in our case and was worked around with the generate_relative_link subroutine (see the sketch after this list).
  • Relies on support of symbolic links:
    Not everything supports symbolic links, and any code running checks against the files will need to be updated to handle symlinks. Initially, we investigated the possibility of using an Amazon Storage Gateway to present an S3 bucket as an NFS share (instead of having the file servers at all). One of the reasons we abandoned this idea was that Storage Gateways do not support symbolic links and will return a 524 Unknown Error.
  • Maintenance:
    The script has no hook into the application itself, nor is there a separate partition for the pool. It will need to be re-run periodically to stay in sync with the filesystem.
  • De-duplication method:
    De-duplication is keyed on the filename as well as the checksum, so two identical files with different names (even just a "(1)" at the end) will still be stored twice. In our case, 534,010 duplicates remained, totalling around 180GB.
  • Threading:
    No multi-threading support was implemented. The script can be run in parallel against different lists of files, but this needs to be handled manually.
  • Future:
    The problem has not been solved permanently. It has been mitigated for now, but once the number of de-duplicated files grows large enough, the problem will reappear.
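
On the relative-link point above: the actual generate_relative_link subroutine is in the full script linked below, but the general idea can be sketched with File::Spec's abs2rel - build the symlink target relative to the directory the link lives in, so it resolves correctly regardless of where the share is mounted. The relative_link_target name and the example paths here are illustrative, not the real implementation.

    use strict;
    use warnings;
    use File::Basename qw(dirname);
    use File::Spec;

    # Illustrative only (not the real generate_relative_link): build a symlink
    # target relative to the directory the link will live in, so the link
    # resolves on both the file server and the application servers even though
    # they mount the share at different absolute paths.
    sub relative_link_target {
        my ($pooled_file, $link_path) = @_;
        return File::Spec->abs2rel($pooled_file, dirname($link_path));
    }

    # e.g. relative_link_target('pool/invoice.pdf_ab12cd', 'clients/acme/invoice.pdf')
    # returns '../../pool/invoice.pdf_ab12cd'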

Future Improvements

We’re looking at rewriting the application code so that writes are pushed directly to Amazon S3, with de-duplication performed at the same time. This would also allow reads to come from S3, removing the need for the two file servers and the backup server.

Implementation

Please see https://gitlab.com/silvermouse/deduplication