Switching from Cloudera Distribution Hadoop to Data Platform [A Success Story]

The discontinuation of CDH support in 2022 disrupted the status quo for many businesses. This post gives an overview of a successful delivery by my team for a multi-billion-dollar financial services company: an upgrade to the modern Cloudera Data Platform (CDP) 7.1.

CDH, short for Cloudera Distribution Hadoop, stood out as the foremost, most extensively tested, and most widely adopted Apache Hadoop distribution, bundled with many related projects. With CDH, business users received the core components of Hadoop, namely scalable storage and distributed computing. CDH also provided a web-based user interface and essential enterprise capabilities. It had the reputation of being a peerless Apache-licensed open-source solution for integrated batch processing, interactive SQL, and interactive search, combined with role-based access controls.

Losing vendor support for all of this functionality affected this legendary NYSE-listed enterprise too. Here is what happened next.

The Challenges While ‘Fixing’ Customer’s Cloudera CDH

Everything boiled down to one challenge: upgrading the customer from Cloudera’s legacy CDH 6 series to Cloudera Data Platform (CDP) 7.1, which combines a consistent set of tools and APIs to manage data and analytics workloads across multiple cloud providers.

The second layer of the challenge was that both the CDH and CDP environments depend on Kerberos for security and integrate with Azure Data Lake Storage (ADLS).

The third layer was that the environment’s pipelines are principally built using Oozie, Spark, Impala, shell scripts, HDFS, ADLS, and more. On top of this, hundreds of business users rely on Hue for ad hoc analysis, Tableau for enterprise reporting, and JupyterHub for data science model development.

Steps for Switching from Cloudera Distribution Hadoop to Data Platform

Data and multicloud experts at INFOLOB jumped straight into action on the CDH to CDP upgrade, addressing all of the constraints described above along the way. Below are some of the key steps that made the engagement a success.

  1. Assessing the current state and identifying the changes the applications needed in order to be compatible with the CDP platform
  2. Working with Platform Engineers and Architects to ensure a smoother upgrade from CDH to CDP
  3. Proposing high-level solutions to re-engineer Pig scripts as Spark jobs, Flume flows as Kafka pipelines, Hive on MapReduce as Hive on Tez, etc.
  4. Providing prototypes to convert Hive managed tables with an explicit location to Hive external tables, and addressing whether the data at that location is deleted along with the table
  5. Designing and offering templates for Pig to Spark as well as Flume to Kafka transition
  6. Ensuring an appropriate validation strategy of the environment to identify issues related to the upgrade
  7. Innovating and automating to streamline development, validation, and testing so that as few issues as possible surfaced
  8. Performing root cause analysis of upgrade-related issues such as Hue slowdowns, the transition to Hive on Tez, ADLS integration, etc.
  9. Conducting a post-upgrade assessment to ensure resources were properly configured for services such as YARN, Spark, Tez, Impala, Kafka, Hue, etc.
  10. Linking all the key stakeholders such as business users, system administrators, Cloudera Administration, developers as well as Cloudera Support to ensure timely resolution of complex issues
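
To make step 4 a little more concrete, here is a minimal sketch of how the managed-to-external conversion could be scripted. It is an illustration, not the tooling used in the engagement. The function name `to_external_ddl` and the table names are hypothetical. `EXTERNAL` and `external.table.purge` are Hive table properties; the latter controls whether dropping an external table also deletes its data. Whether Hive accepts this `ALTER TABLE` depends on the Hive version and the table type (transactional tables, for instance, may not be convertible this way), so treat the generated statements as a starting point to review before running.

```python
import re

# Accept only plain `table` or `db.table` identifiers so that the generated
# DDL cannot carry stray SQL. This is a deliberately strict, assumed rule.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?$")


def to_external_ddl(table: str, purge_on_drop: bool = True) -> str:
    """Build a Hive statement that flags an existing table as external.

    purge_on_drop=True sets external.table.purge to true, so DROP TABLE
    still removes the underlying data, as it would for a managed table.
    purge_on_drop=False leaves the data in place when the table is dropped.
    """
    if not _IDENTIFIER.match(table):
        raise ValueError(f"unexpected table name: {table!r}")
    purge = "true" if purge_on_drop else "false"
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES "
        f"('EXTERNAL'='TRUE', 'external.table.purge'='{purge}');"
    )


if __name__ == "__main__":
    # Hypothetical table names, for illustration only.
    for name in ["sales.orders", "sales.customers"]:
        print(to_external_ddl(name))
```

Generating the statements instead of executing them directly keeps the script reviewable: the output can be inspected, versioned, and then run through whichever Hive client the environment already uses.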

Results

The customer set out for only a modest improvement in multicloud compatibility and ended up with unprecedented data power.

For a detailed copy of the case study with all the measurable outcomes, write to us at: