
Thursday, July 6, 2017

Define and Process Data Pipelines in Hadoop With Apache Falcon

Apache Falcon is a framework that simplifies data pipeline processing and management on Hadoop clusters. It makes onboarding new workflows and pipelines much simpler, with support for late data handling and retry policies. It lets you define relationships between data and processing elements, integrate with a metastore/catalog such as Apache Hive/HCatalog, and capture lineage information for feeds and processes.

In this tutorial we walk through:
- Defining the feeds and processes
- Defining and executing a data pipeline to ingest, process, and persist data continuously

Prerequisites:
- Download the Hortonworks Sandbox.
- Complete the Learning the Ropes of the Hortonworks Sandbox tutorial; you will need it to log into Ambari as an administrator.
- Complete the Creating Falcon Cluster tutorial to start the Falcon service, prepare the HDFS directories for the Falcon cluster, and create the Falcon cluster entities.

Once you have downloaded the Hortonworks Sandbox and started the VM, navigate to the Ambari interface on port 8080 of your Sandbox VM's host IP address. Log in with the username admin and the password you set for the Ambari admin user in the Learning the Ropes of the Hortonworks Sandbox tutorial.
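Falcon entities (clusters, feeds, and processes) are defined as XML and submitted through the falcon CLI. Below is a minimal sketch of what a feed entity might look like, assuming a cluster entity named primaryCluster was created in the Creating Falcon Cluster tutorial; the feed name, HDFS path, and validity window here are purely illustrative, not the exact definitions used later in the tutorial.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!-- Hypothetical hourly raw-data feed; names, paths, and dates are placeholders -->
<feed name="rawInputFeed" description="Raw input data landed on HDFS" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <timezone>UTC</timezone>
  <!-- Accept data arriving up to one hour late (late data handling) -->
  <late-arrival cut-off="hours(1)"/>
  <clusters>
    <cluster name="primaryCluster" type="source">
      <validity start="2017-07-06T00:00Z" end="2017-12-31T00:00Z"/>
      <!-- Purge feed instances older than 90 days -->
      <retention limit="days(90)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <!-- Falcon substitutes the date variables for each hourly feed instance -->
    <location type="data" path="/user/ambari-qa/falcon/demo/input/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="ambari-qa" group="users" permission="0755"/>
  <schema location="/none" provider="/none"/>
</feed>

A definition like this would typically be submitted with falcon entity -type feed -submit -file rawInputFeed.xml and then scheduled with falcon entity -type feed -schedule -name rawInputFeed. A process entity (also XML) then ties input and output feeds to a workflow and a run frequency, which is what turns individual feeds into an end-to-end pipeline.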

https://community.hortonworks.com/articles/110399/define-and-process-data-pipelines-in-hadoop-with-a-1.html
