Distributed File Locking

One of the big challenges in a distributed enterprise is giving users consistent, fast access to shared file data across sites. You can either centralize data and throw bandwidth, expensive MPLS connections, and WAN optimization at the problem, or try replicating data between sites. Both approaches have drawbacks, and neither solves the problem:

  • If you centralize the data: Performance for users in your headquarters is fast, but users in other offices end up waiting 20 minutes or more for files to open over the network. This often leads users to fall back on emailing or copying files.
  • If you replicate or copy the data: Performance is good at all offices since the data is local, but you end up with multiple copies of files that have to be merged later. Data integrity issues are inevitable with this approach.

It turns out it’s not a bandwidth issue. It’s a latency issue.

The L-word in networking and storage: Latency

Each file operation must cross the network, whether it’s to reference metadata, get file lock data, retrieve XREFs, or close a file. Even apparently simple tasks like opening a file may take 1,000 to 10,000 sequential file operations to complete. This works well on the LAN, where the user is close to the file and round-trip latency is typically 0.5ms or less. That’s how traditional storage is designed to work.

Try to stretch the same file system over the WAN or into the cloud, and basic tasks like opening or saving a file can take 30 minutes or more. A traditional file system keeps the file lock and metadata with the authoritative copy of the file, so each operation has to reference it over the WAN, where round-trip latency of 80 to 100 ms or more is common. That CAD file that took a few seconds to open on the LAN (10,000 file operations x 0.5ms = 5 seconds) now takes nearly 17 minutes (10,000 file operations x 100ms = 16.67 minutes).
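
That arithmetic is simple enough to model directly. Here is a back-of-the-envelope sketch; the operation counts and latencies are the illustrative figures from above, not measurements:

    def open_time_seconds(file_ops: int, rtt_ms: float) -> float:
        """Total open time when every operation pays one sequential round trip."""
        return file_ops * rtt_ms / 1000.0

    print(open_time_seconds(10_000, 0.5))  # LAN: 5.0 seconds
    print(open_time_seconds(10_000, 100))  # WAN: 1000.0 seconds (~16.7 minutes)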

Unfortunately, network latency is bound by the laws of physics. Even at the speed of light, 186,000 miles per second, a signal needs roughly 13ms to cover the approximately 2,450 miles from New York to Los Angeles in a straight line, and light in fiber travels at only about two-thirds of that speed. Since data paths on the internet aren’t straight or unimpeded either, 60 to 80ms of latency is common. Put another way, it doesn’t matter whether you have a 2-lane highway or a 10,000-lane highway if you still have to make 10,000 or 15,000 sequential round trips.
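
That physical floor is easy to verify. A two-line check using the figures above (it ignores routing, queuing, and the fiber slowdown):

    distance_miles = 2_450          # approximate New York to Los Angeles, straight line
    speed_miles_per_sec = 186_000   # speed of light in a vacuum
    print(distance_miles / speed_miles_per_sec * 1000)  # ~13.2 ms, one way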

One of our customers, C&S Companies, performed a test on their network with a 1.5MB CAD file to demonstrate the impact of latency (below). The file was hosted in Syracuse, NY, and opened by a user in San Diego — which took 22 minutes. As shown below, 15,000 sequential file operations had to make the roundtrip across the WAN with 86 ms of latency.


Cross-Site CAD file open without Panzura


C&S tested the same file operation with Panzura, and the file opened in just 8 seconds in San Diego, and just as quickly in every other remote office (below). All of the file operations happen locally, which is how Panzura overcomes the impact of latency.

Cross-Site CAD file open with Panzura


So how can we solve for latency?

How do you get a distributed file system to deliver LAN-like performance, even when you have dozens or hundreds of offices connected to it?

Panzura has three key ingredients that make it fast, even if your offices are thousands of miles apart: distributed file locking, global metadata and global deduplication. Distributed file locking is the most important of the three. Let’s start there.

What is Distributed File Locking?

On any Windows file share, if someone has a file open for editing and you try to open it, you’ll get a message that you can’t open it in edit mode. You’re offered the alternative of opening a read-only copy. You can save that copy as a separate file, but what you absolutely can’t do is overwrite or change the original file that someone else has open. Now imagine that behavior across an entire Distributed Cloud File System and you get Distributed File Locking.

But file locking by itself isn’t sufficient for many types of projects with large, multi-part files. If multiple users are accessing the same file, you need granular sub-file locking. Without it, users end up routinely waiting for a colleague to finish editing the file, or have to make their own copies of files and manually merge edits later.

Distributed Byte Level File Locking, or byte-range locking, takes a much more granular approach. Instead of locking an entire file, it can lock only the elements of a file that are actually in use. It’s the equivalent of Google Drive for technical applications like Autodesk Revit and AutoCAD, Tekla Structures and Bentley MicroStation. Even large Microsoft Excel files can benefit from byte-level file locking. It lets users work within the same project files while preventing consistency problems and inadvertent data loss.
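
Panzura’s mechanism is proprietary, but the underlying concept exists even on a single machine. As a minimal illustration (not Panzura’s implementation), POSIX advisory locks let a process lock just a byte range of a file, so another process can simultaneously lock a different range of the same file:

    import fcntl
    import os

    fd = os.open("model.dat", os.O_RDWR | os.O_CREAT)

    # Exclusively lock bytes 0-4095; another process can still lock bytes 4096+.
    fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, 4096, 0, os.SEEK_SET)

    # ... edit only the locked region ...

    fcntl.lockf(fd, fcntl.LOCK_UN, 4096, 0, os.SEEK_SET)  # release the range
    os.close(fd)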

How Panzura’s Distributed File Locking Works

Panzura’s Distributed File Locking operates at the sub-file level. It’s built on five operating principles:

  • Data integrity above all else. The locking mechanism and other file system features are built to prioritize data integrity.
  • Immediately consistent lock data. To support data integrity, the locking metadata is updated immediately across all Panzura Flash Cache appliances in the system.
  • Every file has an Origin. This is the Freedom Flash Cache appliance that originally created the file. This node can assign temporary ownership to another Freedom appliance.
  • One Data Owner at a time. A sub-file or file can only “belong” to one Freedom Flash Cache appliance at a time: the Data Owner (DO). The DO manages the authoritative copy for that data instance and normally acts as the Authoritative Write Node for that data. Any writes have to be committed through the DO.
  • Data Asymmetry Resolution (DAR). Any differences between remote sets of files need to be resolved quickly and efficiently; changed blocks are transported back to the DO, as the sketch below illustrates.
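
To make those rules concrete, here is a toy model of the ownership flow. Every name in it (Node, FileRecord, grant_ownership) is a hypothetical illustration of the principles above, not Panzura’s actual code:

    class Node:
        def __init__(self, name: str):
            self.name = name

    class FileRecord:
        def __init__(self, origin: Node):
            self.origin = origin       # the appliance that created the file
            self.data_owner = origin   # exactly one Data Owner (DO) at a time

        def grant_ownership(self, node: Node) -> None:
            """The Origin assigns temporary ownership to another appliance.
            In a real system this lock metadata would be replicated to every
            node immediately, before any write proceeds."""
            self.data_owner = node

        def write(self, node: Node, changed_blocks: list) -> None:
            """All writes are committed through the current Data Owner."""
            if node is not self.data_owner:
                # Data Asymmetry Resolution: ship changed blocks back to the DO
                self._send_to(self.data_owner, changed_blocks)
            else:
                self._commit(changed_blocks)

        def _send_to(self, owner: Node, blocks: list) -> None: ...
        def _commit(self, blocks: list) -> None: ...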

Panzura File Locking


How Distributed File Locking Ensures Data Integrity and File Consistency

Every storage solution and file system has various mechanisms meant to ensure data integrity. But there are different approaches and important trade-offs in any system design. In a distributed file system, architecture decisions have massive implications for data integrity – and its opposite, data corruption.

How a global file system handles data across nodes has a big impact on data integrity and consistency. There are two critical parameters to consider:

  • Where does the authoritative file lock data live?
  • How does a node behave if it’s partially or completely disconnected?

Panzura’s Distributed Cloud File System ensures data integrity by:

  • Having the originating Freedom Flash Cache appliances arbitrate lock data for their files.
  • Running checksums, dedupe and compression on all data before it’s sent, so no corruption ever reaches the cloud (see the sketch after this list).
  • Providing two high availability options for Freedom Flash Cache appliances: HA within a site by adding a local standby controller, or running a redundant controller in the cloud to provide HA for one or more sites.
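
A minimal sketch of that pre-send pipeline, with assumed chunking, SHA-256 hashing and zlib compression standing in for whatever Panzura actually uses:

    import hashlib
    import zlib

    seen_hashes = set()  # content hashes already known to be in the cloud

    def prepare_chunk(chunk: bytes):
        """Checksum, dedupe and compress one chunk before it is uploaded."""
        digest = hashlib.sha256(chunk).hexdigest()  # integrity check travels with the data
        if digest in seen_hashes:
            return digest, None                     # duplicate: send only the reference
        seen_hashes.add(digest)
        return digest, zlib.compress(chunk)         # new data: compress, then upload

    digest, payload = prepare_chunk(b"example CAD block")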

You can learn more in our whitepaper on Distributed File Locking.

NEXT: Read more about Freedom View.