
Thursday, September 21, 2017

Syncsort quality manager aims to purify Hadoop data lakes

Syncsort Inc. is extending the data quality features of the Trillium Software Inc. subsidiary it acquired last November to native Hadoop environments with Trillium Quality for Big Data. The offering combines Trillium’s data quality features with Syncsort’s Intelligent Execution data integration platform so that information technology organizations can normalize and integrate data at the same time.

The Trillium platform was previously available in native form only on the Linux, Unix and Windows operating systems. The Hadoop support marks the first time Syncsort has applied Trillium’s data quality features to big data applications.

Data quality is about identifying inconsistencies, errors and duplication. Examples include a ZIP code entered in a date field, or duplicate customer records that appear to be distinct because of misspellings. Normalizing data is a tricky process: different countries use different address and date formats, and two people with the same name in the same ZIP code may or may not be the same person.

Users are rushing to extract data from production systems and load it into analytics engines, but they are discovering that quality problems limit their effectiveness. “Everybody is trying to govern the data once it’s in the data lake so it doesn’t turn into a data swamp,” said Tendü Yoğurtçu, Syncsort’s chief technology officer. “The volume and variety of data makes it complex.”

Trillium has hundreds of matching algorithms to identify such problems and can be configured to automatically apply corrective algorithms, Yoğurtçu said. The offering includes address- and name-matching data for 150 countries, as well as postal directories and geocoding.

Intelligent Execution examines the topology of a data flow and optimizes resources for the job without requiring changes to the application. It supports both new and existing Trillium data quality projects across Hadoop, MapReduce and Apache Spark, on-premises or in the cloud.

“Once you understand the data, you can create the rules to cleanse that data,” Yoğurtçu said. “For example, if you have duplicates you can specify a process to flag them or get rid of them.”

Trillium Quality for Big Data is available on all Hadoop distributions, including Cloudera Inc.’s CDH, Hortonworks Inc.’s HDP and MapR Technologies Inc.’s Converged Data Platform. It deploys and installs via Cloudera Manager and Apache Ambari. Pricing is on a per-node basis or by cloud subscription.
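To make the kinds of checks described above concrete, here is a minimal sketch in plain Python. It is not Trillium’s API or Syncsort’s implementation; the field names (name, zip, signup_date), the similarity threshold and the flag-or-drop rule are all hypothetical. It simply illustrates the two problems the article calls out: a ZIP code landed in a date field, and duplicate customer records that look distinct only because of a misspelling.

```python
# Illustrative sketch only -- not Trillium's API. It mimics two data quality
# checks from the article: catching a ZIP code entered in a date field, and
# flagging (or dropping) near-duplicate customer records whose names differ
# only by a misspelling.
from datetime import datetime
from difflib import SequenceMatcher


def looks_like_date(value: str) -> bool:
    """Return True if the value parses as a date in a few common formats."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d/%m/%Y"):
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            pass
    return False


def name_similarity(a: str, b: str) -> float:
    """Crude stand-in for a matching algorithm: normalized edit similarity."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def dedupe(records, threshold=0.9, action="flag"):
    """Flag or drop records whose (name, ZIP) pair closely matches an earlier one."""
    seen, output = [], []
    for rec in records:
        is_dup = any(
            rec["zip"] == prev["zip"]
            and name_similarity(rec["name"], prev["name"]) >= threshold
            for prev in seen
        )
        if is_dup and action == "drop":
            continue  # the "get rid of them" rule
        output.append({**rec, "duplicate": is_dup})  # the "flag them" rule
        seen.append(rec)
    return output


records = [
    {"name": "Jane Smith", "zip": "10001", "signup_date": "2017-09-20"},
    {"name": "Jane Smyth", "zip": "10001", "signup_date": "2017-09-21"},  # misspelled duplicate
    {"name": "John Doe",   "zip": "94105", "signup_date": "94105"},       # ZIP in date field
]

for rec in dedupe(records):
    if not looks_like_date(rec["signup_date"]):
        rec["date_error"] = True
    print(rec)
```

A production tool would of course use country-aware address parsing, postal directories and many matching algorithms rather than a single edit-distance score, but the flag-versus-drop choice shown here mirrors the kind of rule Yoğurtçu describes.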

https://siliconangle.com/blog/2017/09/20/syncsort-quality-manager-aims-purify-hadoop-data-lakes/
