TL;DR: Choosing an HPC Filesystem for the Cloud? Choose Lustre, the On-prem Leader

Cloud providers looking to enable the growing number of use cases for cloud-based HPC recognize the importance of standardizing on the right HPC filesystem. While a number of vendors claim to provide a filesystem ideal for HPC in the cloud, a deeper look at market realities and technology reveals where they fall short. Let’s take a closer look.

Lustre is the #1 Filesystem in HPC Today

Lustre has been the dominant filesystem in HPC for more than two decades. A review of the organizations supporting Lustre (http://opensfs.org/participants) shows that 3 of the top 5 supercomputers in the world use it as their filesystem.[1] In fact, Lustre holds more than 60% of the market share in supercomputing.[2] A simple Google search yields numerous HPC teams taking advantage of Lustre’s strong performance. By contrast, you will struggle to identify an organization that bases its HPC cloud applications on NFS while demonstrating HPC levels of performance.

Lustre can deliver over 1 TB/s of aggregate bandwidth across thousands of clients. For example, the next-generation Spider Lustre filesystem at the Oak Ridge Leadership Computing Facility (OLCF) at Oak Ridge National Laboratory (ORNL) is designed to provide 32 PB of capacity to OLCF’s open science users at an aggregate transfer rate of 1 TB/s. (Read about OLCF’s use of Lustre.)
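Lustre reaches this kind of aggregate bandwidth by striping files across many object storage targets (OSTs), so thousands of clients read and write through many servers in parallel. As a minimal sketch of how a client takes advantage of this (the MGS address `10.0.0.1@tcp:/lfs`, mount point, and file path below are placeholders, not a real deployment):

```shell
# Mount a Lustre filesystem on a client node (address is a placeholder).
sudo mount -t lustre 10.0.0.1@tcp:/lfs /mnt/lustre

# Stripe a new file across all available OSTs (-c -1) with a 4 MiB stripe
# size, spreading its I/O over many storage servers in parallel.
lfs setstripe -c -1 -S 4M /mnt/lustre/big_dataset.h5

# Inspect the resulting layout: stripe count, stripe size, and which OSTs
# hold the file's objects.
lfs getstripe /mnt/lustre/big_dataset.h5
```

The `lfs setstripe` step is where a wide-striped layout is chosen; for many small files a narrower stripe count is usually preferable, which is why striping is configurable per file or per directory.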

HPC is Not Just about Latency

Low latency is certainly a baseline requirement for any HPC filesystem, but it is not the only one. Performance at scale and streamlined data movement are equally critical features of a valid HPC filesystem. Unfortunately, the technologies offered by many filesystem vendors fail to deliver these features; to compensate, those vendors attempt to place the HPC focus entirely on lower latency.

For Performance at Scale, NFS Falls Short

Cloud service providers (CSPs) that are serious about offering true HPC in the cloud can immediately dismiss any solution based on NFS, because it simply will not handle the combined scale and throughput the HPC community demands. Vendors that have cobbled together filesystems based on NFS did so because they were not originally focused on the very specific performance requirements of true HPC; they pivoted into this space. Typically, these are flash-optimized filesystems that top out at roughly 100 GB/s of throughput. As a result, the HPC community generally limits NFS to home directory storage, not the more challenging needs of HPC compute storage. For example, on Titan, the Oak Ridge supercomputer, NFS is used merely for hosting project and user home directories.

Streamlined Data Movement Cannot be a Bolt-On Feature

Another key to delivering data quickly to cloud-based HPC applications is reducing the number of steps in the data movement process. The hallmark of a filesystem not originally designed for the cloud is its need for additional steps to stage data into the cloud via replication.

Consider a filesystem that was originally designed as a flash-optimized product. In this case, data must first be ingested from cheaper tiers of storage, such as NFS or object storage. Additional steps of writing the data and sharing it with users (HPC applications) then become necessary, slowing the entire process. In comparison, the Kmesh filesystem requires minimal steps in the data movement process, thanks to the cloud-native design and engineering of Kmesh’s technology. Kmesh Lustre-as-a-Service incorporates hybrid cloud, cross-cloud, and cross-region data synchronization, eliminating the need to daisy-chain scripts together. We did not pivot into the HPC-cloud market; we were designed for it from the start.
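The daisy-chained staging pattern criticized above typically looks something like the following hypothetical script (the bucket name, paths, and user/group are illustrative only): data is pulled from object storage to a staging tier, then copied again into the flash filesystem before any compute job can touch it.

```shell
# Hypothetical staging pipeline for a filesystem with no native cloud sync.
# Step 1: pull the input data set down from object storage to a staging area.
aws s3 sync s3://example-input-bucket/run42/ /staging/run42/

# Step 2: copy it again into the flash-optimized filesystem that the
# compute nodes actually mount.
cp -r /staging/run42/ /mnt/flashfs/run42/

# Step 3: fix up ownership so the HPC job's user can read the data.
chown -R hpcuser:hpcgroup /mnt/flashfs/run42/

# Only now can the job be submitted. Results must then be copied back out
# the same way, in reverse, doubling the data movement again.
```

Each extra hop adds wall-clock time, a failure point, and a second (or third) copy of the data to track; eliminating those hops is what "streamlined data movement" means in practice.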

AWS Chooses Lustre

Many CSPs may despise AWS and its market-moving tactics, but none will deny the thoroughness AWS displays when evaluating and delivering core technologies. AWS spent more than two years conducting in-depth market research and focused technical trials of various HPC filesystem technologies before deciding where to place its HPC bets. As part of that effort, AWS looked into EFS, NFS-based filesystems, Weka, Elastifile, and others. But in the end, AWS chose Lustre, launching Amazon FSx for Lustre at its most recent re:Invent show. AWS ultimately concluded that Lustre is what the HPC market wants and that Lustre is the best technology to serve as the foundation of its cloud HPC future.
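For reference, provisioning an FSx for Lustre filesystem is a single AWS CLI call; a rough sketch follows (the subnet, security group, and filesystem IDs, the region, and the 3600 GiB capacity are placeholder values, and the exact DNS and mount names come from the create-file-system response):

```shell
# Create an FSx for Lustre filesystem (IDs and capacity are placeholders).
aws fsx create-file-system \
    --file-system-type LUSTRE \
    --storage-capacity 3600 \
    --subnet-ids subnet-0123456789abcdef0 \
    --security-group-ids sg-0123456789abcdef0

# Once the filesystem reaches the AVAILABLE state, mount it from a compute
# instance using the standard Lustre client.
sudo mount -t lustre \
    fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com@tcp:/fsx /mnt/fsx
```

Note that the mount step uses the same Lustre client as any on-prem deployment, which is part of the appeal: existing HPC tooling and job scripts carry over unchanged.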

To be sure, a few CSPs will attempt to outsmart AWS and try something different to see if they can gain an advantage in cloud HPC. But, as history shows, that strategy is fraught with peril. The choice is analogous to the 4G-versus-WiMAX debate among wireless carriers: the dominant carriers standardized on LTE, while a few renegades decided to “outsmart” them and bet on WiMAX instead. Things did not end well for the WiMAX camp. In the case of cloud HPC, not only is Lustre the superior filesystem technology, it also has the market momentum and market-leader support.

TL;DR: Choosing an HPC Filesystem for the Cloud? Choose Lustre, the On-prem Leader

If you are in charge of HPC offerings at a CSP, can you really afford to give AWS an HPC filesystem advantage? If you are in charge of choosing an HPC filesystem for the public cloud, would you want to take the risk of picking an NFS-based filesystem? The choice seems clear, and Kmesh is here to help you get started with a Lustre-based cloud filesystem in the fastest and easiest way possible. Check out our Lustre-as-a-Service page for more.

[1] https://en.wikipedia.org/wiki/TOP500#Top_500_ranking and various resources

[2] https://wiki.whamcloud.com/display/PUB/Why+Use+Lustre