Sunday, September 25, 2016

Building a common data platform for the enterprise on Apache Hadoop

To become a data-driven enterprise, ( #Bigdata ) organisations must process all types of data, whether it be structured transactions or unstructured file server data such as social, #IoT or machine data. Competitive advantage is at stake, and companies failing to evolve into data-driven organisations risk serious business disruption from competitors and startups.

Fortunately, we live in a time of unprecedented innovation in enterprise software and enterprise data has finally become manageable on a large scale. Thanks to the #Apache #Hadoop open source framework delivering enterprise archives, data lakes and advanced analytics applications, enterprise data management solutions are now able to turn the tide on data growth challenges.

Enter the Common Data Platform (CDP): a uniform data collection system for structured and unstructured data featuring low-cost data storage and advanced analytics. In this article, I’m going to define the components of a CDP, and where it stands alongside the traditional enterprise data warehouse.

1. #Apache Hadoop

Apache Hadoop is the backbone of the CDP. Hadoop is an open-source data management system that distributes and processes large amounts of data in parallel (across multiple servers and distributed nodes). It’s engineered with scalability and efficiency in mind, and designed to run on low-cost commodity hardware. Using the Hadoop Distributed File System ( #HDFS;), #Hive and #MapReduce or #Spark programming model, Apache Hadoop is able to service most any enterprise workload.

Hadoop supports any data whether structured or unstructured in many different formats making it ideal as a uniform data collection system across the enterprise. By denormalising data into an Enterprise Business Record (EBR), all enterprise data may be text searched and processed through queries and reports. Unstructured data from file servers, email systems, machine logs and social sources is easily ingested and retrieved as well.

2. Data lake

A Hadoop data lake functions as a central repository for data. Data is either transformed as required prior to ingestion or stored “as is,” eliminating the need for heavy extract, transform and load (ETL) processes. Data needed to drive the enterprise may be queried, text searched or staged for further processing by downstream NOSQL analytics or applications and systems. 

Data lakes also significantly reduce the high cost of interface management and data conversion between production systems. Data conversion and interface management may be centralised with a data lake deployed as a data hub to decouple customisations and point to point interfaces from production systems.

http://www.itproportal.com/features/building-a-common-data-platform-for-the-enterprise-on-apache-hadoop/

No comments:

Post a Comment