A Better Way to Manage Machine Generated Data


Managing rapidly growing volumes of data is hardly a new problem. It has been a challenge for IT for as long as there has been IT. And yet we have kept dealing with the problem in the same way – by making storage bigger. First with bigger drives, then with denser arrays, and finally with scale-out clusters. That may have been the right solution in the past, but we can no longer afford to deal with the problem in that way. We don’t need something bigger. What we need is something better.

The way we manage storage growth has to change, and that change is being driven by two factors. First is the rate at which data is growing. Data growth has finally reached the point where managing it using traditional on-premises storage is no longer practical. The datasets are too big and they are growing too fast. The second factor is how different types of data need to be managed. It is that second factor that I will explore in this blog.

Data Growth is Driving Workloads to the Cloud

As data volumes have grown unmanageable for on-premises storage, user workloads have migrated to the cloud. We can see this everywhere in the adoption of cloud-based resources such as Google Docs, Office 365, Gmail, and similar applications.

These are traditional user-generated workloads, such as documents and graphics, that use the SMB protocol to transfer data. Workloads like these have historically generated the massive volumes of data that IT has to manage, and they have been the first to move to the cloud.

The Face of Data Growth is Changing

There are also increasing volumes of machine-generated data to manage. These workloads use the NFS protocol instead of SMB and include content such as log files, IoT data, Splunk data, and more. Common estimates put the growth of machine-generated data at roughly 50 times the rate of traditional business-generated data.

[Figure: The Exponential Growth of Data]

Yet for all the rapid growth of machine-generated data, this particular data type has not yet made the jump to the cloud in the same way that SMB data has. Why is that?

The answer is simple. The focus on data movement to the cloud has been on user-generated data, which typically uses the SMB protocol. The applications that use NFS for machine-generated data, such as Hadoop or Splunk, can quickly consume terabytes or even petabytes of storage, and they need to ingest that data as rapidly as possible to perform real-time analytics on those large datasets. Getting the local performance they require has typically meant caching on flash storage for speed, backed by some form of local NAS for capacity.
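
To make that tiering pattern concrete, here is a minimal sketch of a read-through flash cache in front of a capacity tier. Every name in it is a hypothetical illustration of the general pattern, not Panzura's implementation:

```python
# Tiered read path: hot data is served from a small, fast flash cache;
# misses fall through to a large, slower capacity tier (local NAS or
# cloud object storage) and are promoted into the cache on the way back.

class TieredStore:
    def __init__(self, capacity_tier: dict):
        self.flash_cache: dict = {}          # fast, small (e.g. NVMe flash)
        self.capacity_tier = capacity_tier   # slow, large (NAS or cloud)

    def read(self, key: str) -> bytes:
        if key in self.flash_cache:          # cache hit: flash latency
            return self.flash_cache[key]
        data = self.capacity_tier[key]       # cache miss: capacity-tier latency
        self.flash_cache[key] = data         # promote so the next read is fast
        return data

store = TieredStore(capacity_tier={"block-0": b"sensor log data"})
store.read("block-0")  # miss: fetched from the capacity tier, then cached
store.read("block-0")  # hit: served from flash
```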

Reading the massive volumes of data that these applications require from the cloud is not an option; the latency inherent in cloud reads is simply too high. So when this data does go to the cloud, it is more for long-term archive than for active use.

The challenge now is that enterprises are inundated with NFS data that they need to act on: enormous datasets that must be stored, accessed, and analyzed to extract actionable information. Continuing to store these massive machine-generated datasets using the traditional, on-premises storage model is simply not practical. The data is growing too rapidly, making the costs of storing, managing, and backing it up too high.

Panzura Freedom Has Made NFS Performance a Priority

Panzura Freedom NAS is the first hybrid cloud NAS solution that has been specifically engineered to deliver exceptional performance for both SMB and NFS workloads in the enterprise.

As the leader in NFS performance, Freedom NAS was the first, and to date the only, hybrid cloud NAS solution to design in an NVMe Separate Intent Log (SLOG) device. A SLOG is similar in concept to a write cache for NFS data (and it certainly performs that function), but it does more than that: synchronous writes are acknowledged as soon as they are committed to the low-latency log device, and the log can be replayed after a crash, so the SLOG enhances data integrity as well as speed. By taking advantage of the latest technology, such as NVMe, Freedom NAS can deliver the performance enterprises need for their growing volume of machine-generated data.
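
Conceptually, the intent-log mechanism works something like the following sketch, a simplified, hypothetical illustration of the general SLOG idea rather than Panzura's code:

```python
# Intent-log write path: a synchronous write is acknowledged once it is
# durably recorded in a fast, dedicated log device, and applied to the
# main pool afterwards. If the system crashes before the flush, the log
# is replayed on restart, so no acknowledged write is lost.

class IntentLog:
    def __init__(self):
        self.log = []    # stands in for the fast NVMe log device
        self.pool = {}   # stands in for the slower main storage pool

    def sync_write(self, key: str, data: bytes) -> None:
        self.log.append((key, data))  # durable at log (NVMe) latency
        # ...acknowledge the client here, without waiting for the pool

    def flush(self) -> None:
        # Apply logged writes to the main pool in batches, off the
        # latency-critical path.
        for key, data in self.log:
            self.pool[key] = data
        self.log.clear()

    def replay_after_crash(self) -> None:
        # On restart, re-apply anything still in the log, preserving
        # every write that was acknowledged before the crash.
        self.flush()
```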

The benefits to NFS performance in Freedom NAS are not limited to hardware; NFS performance has been maximized in virtual instances as well. The latest version of CloudFS, the file system that underpins Panzura Freedom NAS, has been highly optimized for NFS workloads. Delivering unmatched performance for both NFS and SMB workloads, in both physical and virtual environments, was a primary goal of the current CloudFS release.

Only Freedom NAS combines this cutting-edge hardware acceleration with advanced software to deliver exceptional performance for both NFS and SMB workloads.

The result is that Freedom NAS delivers maximum performance across the network. Each Freedom NAS filer can fully saturate 20Gbps of network bandwidth. To be clear, unlike other solutions, that is not an aggregate figure across multiple Freedom instances, nor a one-time burst you might see on your network once: an individual Freedom NAS instance can fully saturate a 20Gbps connection and sustain that level of performance.

Summary

Applications that consume and process the vast amounts of machine-generated data being created need the performance of local storage and they need to access that data using the NFS protocol. Panzura uses intelligent caching, next-generation hardware, and advanced software to deliver LAN-speed performance while leveraging the scalability and durability benefits of the cloud. The data these applications need is both available locally for fast access and securely stored in the cloud as a single source of truth.

It is now possible for large distributed enterprises to store vast amounts of IoT data, machine logs, 3D medical images, 4K video, and other machine-generated data in the cloud, while still achieving the extreme local performance that applications such as Splunk and Hadoop demand.
