If you’re not familiar with the concept of a data lake, you’re not alone. A data lake is a large repository that holds unstructured data in its raw form, taking in everything before it’s cleansed, structured, or organized.
It’s not until you begin to unravel the technical challenges of storing and retrieving data at scale that you start to understand why there are so many solutions out there, and why many of them sound confusingly similar.
In this post, we’re going to review some of the data lake implementations of the past, compare them to more modern solutions, and consider the various approaches taken to data management.
A Brief History of File Systems
Since the early 2000s, we’ve seen the rise of network-attached storage systems from vendors like NetApp and, eventually, Isilon, designed to overcome the limitations of storage on a single server and its operating system.
These “filers” were purpose-built to handle hundreds of users reading and writing files at the same time, achieving scale far beyond what a single server could ever handle. Over time, these solutions became so successful that NetApp and EMC combined now make up almost 50% of the enterprise file storage market.
The File System Challenge
Are NetApp and EMC still the right choice? Those file systems were built close to 20 years ago, when the “big data” problem wasn’t nearly as big as it is now.
For the answer, let’s explore what happens when a user requests a file from the “filer”. It’s the filer’s task to process that request and find the single file buried amongst a billion or so others.
This is the classic “needle in a haystack” problem. The file system has to search its entire directory structure to retrieve the data your user asked for, which lives on some disk somewhere in the storage array.
It’s the file system’s job to keep track of all of that data and maintain performance, while often protecting the data using snapshots. Add the litany of other tasks it performs thousands of times per second, and performance can get a bit wobbly at times.
To overcome these technical challenges, the legacy storage vendors have thrown more hardware at the problem. This creates silos of storage, along with an extraordinary amount of data replication, as files that are identical – or very nearly so – are stored in numerous different places.
While that’s a problem for you, as you struggle with data silos, or undertake recommended system upgrades to cope with your current data volume, it works well for legacy storage providers.
These hardware appliances are typically supported for 3-7 years. As they near the end of their supported life, you’re faced with purchasing newer versions of the same hardware and migrating your data from old devices to new. That consumes CAPEX, and requires a significant amount of forward planning to avoid running out of support, or out of storage space.
The explosion of data everyone is experiencing means organizations are hitting the “financial tipping point” much faster, and that’s prompting a move away from the regular refresh cycle and towards software-defined storage and an OPEX model.
File vs. Object (Blob) Storage
Given that many older file systems no longer make sense to implement due to scale limitations and cost, cloud storage – or object storage – seems like a logical solution.
However, while object storage can overcome the common limitations of file systems by coping with the sheer volume of data, it comes with its own set of challenges. The first is that object storage speaks to applications and users over HTTP-based protocols such as S3 or Swift.
These protocols differ from file system protocols (SMB and NFS) because they are designed for web traffic. That means that while you can migrate your data into object storage, your users and applications can no longer work with it the way they did before. That might be fine for older data that you’re simply looking to archive, but it’s unworkable for user data: files that people actively access and edit on a regular basis.
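To make that difference concrete, here’s a minimal sketch in Python contrasting the two access models. The mount path, bucket, and key names are hypothetical, and the object-storage half assumes an S3-compatible store accessed with the boto3 library.

```python
# File-protocol access: the application simply opens a path on an SMB or NFS
# mount that the operating system presents as if it were local disk.
with open("/mnt/projects/report.docx", "rb") as f:  # hypothetical mount path
    data = f.read()

# Object-storage access: the application talks HTTP to an S3-compatible API,
# addressing data by bucket and key instead of by file path.
import boto3  # assumes credentials and endpoint are already configured

s3 = boto3.client("s3")
response = s3.get_object(Bucket="archive-bucket", Key="projects/report.docx")  # hypothetical names
data = response["Body"].read()
```

Every application written against the first model has to be changed to look like the second, which is exactly the rewrite cost described next.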
The problem with adopting object storage is that it forces organizations to rewrite their applications to communicate over the new protocol. That can be time-consuming and cost-prohibitive for most companies. In fact, one financial institution contemplating a move to cloud storage had 2,800 legacy applications to consider, and was facing a bill of millions of dollars to rewrite them.
The Best of Local Storage Meets the Best of Cloud Storage
Here’s where the next-generation filer comes into play: one that is software-defined and designed from the ground up to work with object storage.
Take a look at how Panzura has architected its global file system: it overcomes the scalability limitations of traditional file systems by seamlessly converting all files into objects that live in a public or private cloud (object store).
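The general pattern looks something like the sketch below. This is a generic illustration of file-to-object conversion, not Panzura’s actual implementation; the chunk size, bucket name, and manifest layout are all assumptions, and it again uses boto3 against an S3-compatible store.

```python
# Generic illustration of the file-to-object pattern (not Panzura's actual
# design): split a file into chunks, key each chunk by its content hash, and
# store the chunks plus a small manifest in an object store.
import hashlib
import json

import boto3  # assumes an S3-compatible object store

CHUNK_SIZE = 4 * 1024 * 1024  # illustrative 4 MiB chunk size
s3 = boto3.client("s3")

def store_file_as_objects(path, bucket="global-file-system"):  # hypothetical bucket
    chunk_keys = []
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            key = "chunks/" + hashlib.sha256(chunk).hexdigest()
            s3.put_object(Bucket=bucket, Key=key, Body=chunk)  # identical chunks map to one object
            chunk_keys.append(key)
    # The manifest is all another filer needs to reassemble the file.
    manifest = {"path": path, "chunks": chunk_keys}
    s3.put_object(Bucket=bucket, Key="manifests" + path, Body=json.dumps(manifest).encode())
```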
This architecture also makes all data available for consumption anywhere another Panzura filer accesses the same object store, which means Panzura customers can lower the total cost of ownership that comes with buying more hardware appliances, and remove the need to refresh hardware every 5-7 years.
This novel design breaks down traditional silos of storage and lets customers use cloud or object storage as a next-generation data lake, without compromising performance. It gives you the ability to generate data at edge locations or data centers while leveraging that data in the cloud for other use cases like analytics, machine learning, or artificial intelligence.
Turning it up a Notch
No modern data lake or file system solution would be complete without protection against a modern scourge – ransomware.
Legacy file systems are designed to allow files to be edited in place, so when a malicious actor penetrates them and encrypts or corrupts your data, the damage is done to the files themselves.
Panzura takes a novel approach to protecting your data from ransomware by creating an immutable file system. Data can’t be deleted or overwritten; changes can only be added or appended alongside the original version. Any user can restore a file to its last known good state in minutes, avoiding the arduous restore process from a backup system.
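Here’s a minimal sketch of the append-only idea, as a generic illustration rather than Panzura’s implementation: every write lands as a new version, nothing is ever overwritten, and “restoring” just means reading an earlier version back.

```python
# Minimal sketch of an append-only, versioned store: writes never overwrite
# earlier data, so an encrypted "latest" version can simply be ignored and a
# known-good earlier version read back instead.
class ImmutableStore:
    def __init__(self):
        self._versions = {}  # path -> every version ever written, in order

    def write(self, path, data):
        # Appends a new version; existing versions are never modified or deleted.
        self._versions.setdefault(path, []).append(data)

    def read(self, path, version=-1):
        # Default is the latest version; pass an earlier index to "restore".
        return self._versions[path][version]

store = ImmutableStore()
store.write("report.docx", b"original contents")
store.write("report.docx", b"<encrypted by ransomware>")  # the attack is just another version
assert store.read("report.docx", version=0) == b"original contents"  # last known good state
```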
Shift the balance of power in the fight against ransomware.
Storage and Data Management for the Times
A fresh take on the data lake means taking advantage of the age of data. Not only coping with the volume of unstructured data, but working with it in ways that drive organizations ahead, now requires the next generation of filer, capable of the next generation of data management.