
Cloud Data Services Sprawl … it’s Complicated

Legacy data management didn't offer the scalability one finds in Big Data or NoSQL, but life was simple. You'd buy storage from your vendor of choice, add a database on top, and use it for all your workloads.

In the new world, however, there are data services for every application workload. Targeted services may sound great, but multiple workloads mean complex data pipelines, multiple data copies across different repositories and complex data movement and ETL (Extract, Transform, Load) processes.

With single-purpose data silos, the cost of storage and computation grows quickly. Companies like Amazon or Google have jumped in, selling targeted services – raking in lots of money, at higher margins and often with tricky pricing schemes.

It’s all very complicated. But it’s time for enterprises to demand unified data services that expose a variety of APIs while handling both high data volume and high velocity. There’s no need for so many duplicate copies, such complex data pipelines or so many ETL jobs.

AWS as a Case Study

Amazon Web Services (AWS) offers 10 or more data services. Each service is optimized for a specific access pattern and data “temperature” (see Figure 1 below). Each service has different (proprietary) APIs, and different pricing schemes based on capacity, number and type of requests, throughput, and more.

Figure 1 – Source: AWS

In most applications, data may be accessed through several patterns. For example, it may be written as a stream but read as a file by Hadoop or as a table by Spark. Or perhaps individual items are updated while the list of modifications is viewed as a stream. The common practice is to store the data in multiple repositories, or move it from one to another, as illustrated in Figure 2.
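To make “several access patterns” concrete, here is a minimal boto3 sketch of three consumers reading the same logical events out of three separate copies, one per API. The bucket, table and stream names are hypothetical, and the calls assume the corresponding AWS resources and credentials exist:

```python
import boto3

s3 = boto3.client("s3")
dynamodb = boto3.client("dynamodb")
kinesis = boto3.client("kinesis")

# Batch/file-style consumer (roughly what a Hadoop or Spark job sees)
obj = s3.get_object(Bucket="events-archive", Key="events/2017/03/part-0000.json")
batch = obj["Body"].read()

# Key/value consumer: look up a single item by key
item = dynamodb.get_item(TableName="events",
                         Key={"event_id": {"S": "evt-1234"}})

# Streaming consumer: tail recent changes from a shard
shard_it = kinesis.get_shard_iterator(StreamName="events-stream",
                                      ShardId="shardId-000000000000",
                                      ShardIteratorType="TRIM_HORIZON")["ShardIterator"]
records = kinesis.get_records(ShardIterator=shard_it, Limit=100)["Records"]
```

Three consumers, three APIs, and therefore three copies of the same data to provision, pay for and keep in sync.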

Figure 2 – Source: AWS Blog

Figure 2 shows six services (DynamoDB, DynamoDB Streams, S3, Lambda, Redshift and Kinesis) being used to move and store the same data. With each service acting as a one-trick pony, the application consumes far more capacity and processing than it would with a holistic service that supports multiple workload types.
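For a sense of the glue code such a pipeline implies, here is a rough sketch (not AWS’s published code) of a Lambda function fanning each DynamoDB Streams change out to Kinesis and S3; the resource names and the attribute layout are assumptions for illustration:

```python
import json
import boto3

kinesis = boto3.client("kinesis")
s3 = boto3.client("s3")

def handler(event, context):
    """Triggered by DynamoDB Streams; copies each change to Kinesis and S3
    (Redshift would later COPY the S3 objects)."""
    for record in event["Records"]:
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue
        image = record["dynamodb"]["NewImage"]          # DynamoDB-typed JSON
        key = record["dynamodb"]["Keys"]["event_id"]["S"]
        payload = json.dumps(image)

        # Copy #1: push the change onto a Kinesis stream for real-time consumers
        kinesis.put_record(StreamName="events-stream",
                           Data=payload, PartitionKey=key)
        # Copy #2: land the same change in S3 for batch loading into Redshift
        s3.put_object(Bucket="events-archive",
                      Key="changes/{}.json".format(key), Body=payload)
```

Every hop in the diagram needs a piece of code like this, each with its own failure modes, retries and billing line.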

The pipeline approach used by AWS and others has major drawbacks beyond complexity. For example, it is very difficult to track data security and lineage when data wanders between stages, because context and identities get lost in translation. Long pipelines also delay results considerably, since data must traverse multiple stages before it is analyzed.

The charts below may help guide the decision about which service is right for each job:

Chart: Data Store

Chart: Data Storage

Wrong choices are costly

For applications that need to store medium-sized objects, the choices might include both S3 and DynamoDB. The intuitive decision is to pick S3 because it’s “simpler and cheaper.” Simple? Not really. Let’s run through the math with a couple of use cases:

Use Cases

Running the numbers through the AWS Price Calculator shows that Case 1 is clearly less costly with DynamoDB, while for Case 2 S3 is cheaper.

What that shows is that even at a low request rate (less than 1,000 requests per second), the S3 I/O and bandwidth costs far outweigh the commonly quoted S3 capacity cost of 3 cents per GB.
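A quick back-of-the-envelope sketch shows why. The unit prices below are assumed, approximate list prices (AWS pricing varies by region and changes over time), so treat the output as an illustration of the cost structure rather than a quote:

```python
# Rough S3 monthly cost decomposition (a sketch, not an official calculator).
S3_STORAGE_PER_GB_MONTH = 0.03      # the "3 cents per GB" usually quoted
S3_GET_PER_1K_REQUESTS  = 0.0004    # assumed GET request price
S3_TRANSFER_OUT_PER_GB  = 0.09      # assumed internet transfer-out price

SECONDS_PER_MONTH = 30 * 24 * 3600

def s3_monthly_cost(capacity_gb, gets_per_sec, avg_object_kb):
    """Split a rough monthly S3 bill into capacity, request and bandwidth parts."""
    capacity = capacity_gb * S3_STORAGE_PER_GB_MONTH
    requests = gets_per_sec * SECONDS_PER_MONTH / 1000 * S3_GET_PER_1K_REQUESTS
    transfer_gb = gets_per_sec * SECONDS_PER_MONTH * avg_object_kb / (1024 * 1024)
    transfer = transfer_gb * S3_TRANSFER_OUT_PER_GB
    return capacity, requests, transfer

# Example: 1 TB of 100 KB objects read at 500 requests/sec
cap, req, xfer = s3_monthly_cost(capacity_gb=1024, gets_per_sec=500, avg_object_kb=100)
print("capacity ${:,.0f}/mo, requests ${:,.0f}/mo, transfer ${:,.0f}/mo".format(cap, req, xfer))
# Even well under 1,000 requests/sec, the request and transfer lines
# dwarf the roughly $30/month capacity line.
```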

Capacity Costs

Note that a company using DynamoDB at 20K requests/sec over 10TB of data (with zero transfer-out), a workload that any NoSQL solution can fit into a single mainstream server node, would pay AWS $172,000 per year, or more than half a million dollars over three years, the average life of a server. Imagine how many on-prem servers the company could purchase for that amount.
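For context on where a bill like that comes from, here is a rough decomposition of DynamoDB provisioned-capacity pricing. The unit prices, the 50/50 read/write mix and the 2 KB item size are assumptions, which is why this only lands in the same ballpark as the calculator figure above rather than matching it exactly:

```python
# Rough DynamoDB provisioned-capacity decomposition (assumed unit prices;
# actual pricing depends on region, item size and read/write mix).
import math

WCU_PER_HOUR        = 0.00065   # assumed price per write capacity unit-hour
RCU_PER_HOUR        = 0.00013   # assumed price per read capacity unit-hour
STORAGE_PER_GB_MONTH = 0.25     # assumed storage price per GB-month

HOURS_PER_MONTH = 730

def dynamodb_monthly_cost(writes_per_sec, reads_per_sec, capacity_gb, item_kb):
    wcus = writes_per_sec * math.ceil(item_kb)       # 1 WCU per 1 KB written
    rcus = reads_per_sec * math.ceil(item_kb / 4)    # 1 RCU per 4 KB read
    throughput = (wcus * WCU_PER_HOUR + rcus * RCU_PER_HOUR) * HOURS_PER_MONTH
    storage = capacity_gb * STORAGE_PER_GB_MONTH
    return throughput + storage

# 20K requests/sec (assumed half reads, half writes, 2 KB items) over 10 TB:
monthly = dynamodb_monthly_cost(writes_per_sec=10_000, reads_per_sec=10_000,
                                capacity_gb=10 * 1024, item_kb=2)
print("~${:,.0f}/month, ~${:,.0f}/year".format(monthly, monthly * 12))
# Roughly $13,000/month, about $156K/year under these assumptions:
# the same six-figure ballpark as the calculator result cited above.
```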

Summary

It’s time for simplification; it’s time to use a smarter, virtualized data services platform that can address the different forms of data (streams, files, objects and records) and map them all to a common data model that reads and writes data consistently, regardless of the API used.
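As a conceptual sketch (not iguaz.io’s actual API), a unified service looks something like this to the application: one copy of the data, several API “personalities”:

```python
import json
import time
from collections import OrderedDict

class UnifiedDataService:
    """Toy in-memory stand-in: every write lands in one place, and the same
    records can be consumed as table rows, objects or an ordered stream."""

    def __init__(self):
        self._records = OrderedDict()   # key -> record (single copy of the data)
        self._log = []                  # ordered change log for stream readers

    # record/table-style access
    def put_record(self, key, record):
        self._records[key] = record
        self._log.append((time.time(), key, record))

    def get_record(self, key):
        return self._records.get(key)

    # object/file-style access
    def get_object(self, key):
        return json.dumps(self._records[key]).encode()

    # stream-style access
    def read_stream(self, since=0.0):
        return [(ts, key, rec) for ts, key, rec in self._log if ts > since]

svc = UnifiedDataService()
svc.put_record("sensor-17", {"temp": 21.5})
print(svc.get_record("sensor-17"))      # read as a table row
print(svc.get_object("sensor-17"))      # read as an object/file
print(svc.read_stream())                # read as a stream of changes
```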

With the latest advances in, and commoditization of, high-performance storage such as fast flash and non-volatile memory, there’s no need for separate products for “hot” and “cold” data. Tiering logic should be implemented at the data service level rather than forcing application developers to code to different APIs.
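A toy sketch of what service-side tiering means, with the tier names and the one-hour demotion threshold chosen arbitrarily for illustration:

```python
import time

HOT_TTL_SECONDS = 3600          # demote anything untouched for an hour (arbitrary)

class TieredStore:
    """Toy store: one get/put API for callers; the service shuffles data
    between a 'hot' (flash-like) dict and a 'cold' (capacity) dict."""

    def __init__(self):
        self.hot, self.cold, self.last_access = {}, {}, {}

    def put(self, key, value):
        self.hot[key] = value
        self.last_access[key] = time.time()

    def get(self, key):
        if key in self.cold:                      # promote on access
            self.hot[key] = self.cold.pop(key)
        if key in self.hot:
            self.last_access[key] = time.time()
            return self.hot[key]
        return None

    def demote_idle(self):
        """Background housekeeping: push idle keys down to the cold tier."""
        now = time.time()
        for key, ts in list(self.last_access.items()):
            if key in self.hot and now - ts > HOT_TTL_SECONDS:
                self.cold[key] = self.hot.pop(key)
```

The application keeps calling get() and put(); where the bytes live is the service’s problem, not the developer’s.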

By unifying and virtualizing data services over a common platform we save costs, reduce complexity, improve security, shorten project deployment time, and shorten time to insights (from the second the data arrives until it is mined for analytics).

iguaz.io is the first to deliver a high-performance data platform with a unified data model and is positioned to disrupt the market; others will surely follow.


Authored by:
Yaron Haviv

Yaron Haviv, the CTO and founder of iguaz.io, has deep technological experience in the fields of cloud, storage, networking, and big data.

Prior to iguaz.io, Yaron was the VP of Datacenter and Storage Solutions at Mellanox, where he led technology innovation, software development and solution integrations for the data center market. Yaron was the key driver of open source initiatives and new solutions with leading storage vendors, enterprise, cloud and Web 2.0 customers. Before this, Yaron was the CTO and VP of R&D at Voltaire, a high performance networking, computing, and IO company.
