In the last decade, some enterprises struggling with growing data volumes and shrinking big data talent pools saw the public cloud as a way to manage both challenges. Creating a data lake in the public cloud, pouring all the unstructured data into a single massive collection and then using analytics tools to “fish out” the data a business unit needs, initially seemed like a good idea because it was the path of least resistance. As frequently happens, though, that solution carried the seeds of its own problems. Storing big data in the public cloud is expensive for users: sending data to the cloud costs money, and pulling it back out costs even more. If they try to avoid this by expanding their on-premise Hadoop compute resources, buying more Hadoop data nodes, they incur higher costs by over-provisioning compute.
Companies have discovered the notion of “data gravity.” As the quantity of data grows, it gains inertia: it becomes harder and more expensive to pull out of the cloud, and it changes as it goes through different iterations and transformations. As a result, organizations are trying to avoid moving data after it has been stored. They want data that stays “hot” from an analytics perspective yet “cold” from a storage-cost perspective. Unfortunately, traditional Hadoop deployments don’t give them that flexibility.
In addition, having many smaller data “swamps” only compounds the problem. Users end up with “Hadoop sprawl,” buying and managing many different Hadoop clusters, each specialized to handle a different kind of analytics – again incurring high costs, with the added complications of rigid, hardwired clusters and frequent duplication of the data.
Thus, we’re seeing increased demand for open software-defined storage (SDS) that helps decouple compute from storage and reduce Hadoop sprawl. Using SDS on-premise, such as the open source object store Red Hat Ceph Storage, keeps the data stationary and brings the analytics to the data. Being able to use the analytics tools and frameworks of their choice is extremely important to data scientists, who want to use the latest toys available to them while keeping their skills razor sharp. Public cloud just can’t keep pace with all the innovation happening in analytics frameworks.
Data can be ingested directly into SDS solutions from many different data sources, or from a single virtualized data source. The analytics tools, whether Hadoop or non-Hadoop, are onsite. Customers decide which data streams carry higher or lower business value; what one department considers high value can flow to its SDS cluster, while a different department that places greater value on other data can flow that data into its own cluster.
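As a rough illustration (not a prescribed pipeline), the sketch below shows one way a department might land its own data stream in its own bucket on an S3-compatible object store such as the Ceph RADOS Gateway. The endpoint URL, bucket name, and credentials are hypothetical placeholders, not values from any particular deployment.

```python
# Minimal sketch: ingesting a record batch into an S3-compatible object store
# (such as the Ceph RADOS Gateway) using boto3. The endpoint, bucket name,
# and credentials are placeholders for illustration only.
import json
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.internal:8080",  # on-premise gateway endpoint (hypothetical)
    aws_access_key_id="ANALYTICS_DEPT_KEY",            # placeholder credentials
    aws_secret_access_key="ANALYTICS_DEPT_SECRET",
)

# Each department lands its high-value stream in its own bucket.
records = [{"sensor": "line-3", "reading": 42.7}]
s3.put_object(
    Bucket="dept-a-high-value",
    Key="ingest/2024/06/batch-0001.json",
    Body=json.dumps(records).encode("utf-8"),
)
```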
The advantages of this approach show up in several ways. The total cost of ownership is lower – no hardwired Hadoop clusters, no data in transit to analytics tools because the tools are in situ, and no overcapacity on compute resources. Companies avoid being locked in and dependent on a single vendor. Furthermore, this approach is flexible and scalable, with the ability to easily support web-scale big data analytics projects.
Being able to leverage the industry-standard S3A interface lets data scientists connect virtually any analytics tool to an object store natively, allowing for better performance and near-linear scaling of the object storage layer.
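As a hedged, concrete example, the snippet below sketches how a Spark job might be pointed at an on-premise S3-compatible endpoint through the S3A connector so the analytics run next to the data. The endpoint address, bucket, and credentials are placeholders, and the hadoop-aws package on the cluster would need to match its Hadoop version.

```python
# Minimal sketch: reading objects from an on-premise S3-compatible store
# through the Hadoop S3A connector in PySpark. Endpoint, bucket, and
# credentials are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("s3a-on-prem-example")
    # Point S3A at the local object store instead of AWS.
    .config("spark.hadoop.fs.s3a.endpoint", "http://rgw.example.internal:8080")
    .config("spark.hadoop.fs.s3a.access.key", "ANALYTICS_DEPT_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "ANALYTICS_DEPT_SECRET")
    # Most on-premise gateways are addressed by path rather than virtual host.
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# The analytics run where the data lives; only the results move.
df = spark.read.json("s3a://dept-a-high-value/ingest/2024/06/")
df.groupBy("sensor").avg("reading").show()
```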
Many companies want self-service analytics on-premise, with the same kind of ease-of-use interface provided by public clouds. The combination of open source SDS with the S3A interface addresses this desire while eliminating the need to blindly go fishing for data.