How To Choose The Right Hadoop Distribution For Your Business

1. Introduction:-

Hadoop has more or less become synonymous with ‘Big Data’ today. Hadoop is an open source project, and a number of vendors have developed their own distributions, adding new functionality or improving the code base. But the number of distributions also makes it harder to decide which one to choose for your needs. Also, why have vendor distributions at all when there is a ‘standard’ Apache Hadoop distribution? Who are the major vendors, and how do they compare? Read on to know more.

2. Why Hadoop Distribution:-

Hadoop is Apache software which is freely available for download and use. So why do we need distributions at all?

  • Distributions package Hadoop into easy-to-install bundles that system administrators can manage effectively.

  • Distributions bundle versions of components that work well together. This provides a working Hadoop installation right out of the box.

  • Distribution makers strive to ensure good quality components.

  • They often lead the way by including performance patches on top of the ‘vanilla’ versions, and they follow predictable product release roadmaps.

  • Distributions keep up with upstream developments and bug fixes, and they come with support, which can be very valuable for a production-critical cluster.

3. Leading Hadoop Distribution Vendors in the Market:-

  • In the current market there are three leading Hadoop distributions: Cloudera, Hortonworks, and MapR.

  • Choosing among them is neither trivial nor especially difficult. A basic analysis of how each vendor approaches its platform and your data makes it easier to pick the distribution that suits you.

  • If you are eager to test things out, all of the vendors offer free versions, but each comes with some level of restriction, either on functionality or on the number of nodes that can be added to a cluster.

  • If you need to get up and running really quickly, each vendor offers VM images with Linux and Hadoop already installed.

4. Which one to choose:-

  • It depends entirely on your business requirements, because Hadoop itself is free software licensed under the Apache License, whichever vendor you choose.

  • All these vendors will automatically provide patches and updates to the core Hadoop distribution, something that everyone benefits from.

  • So it is best to turn your attention to each vendor’s strengths and weaknesses, based on the product offered and the add-ons available for your use.

5. CLOUDERA vs HORTONWORKS vs MapR:-

There are many similarities between these three distributions.

All three distributions, Cloudera, Hortonworks, and MapR, are focused on Hadoop, and their entire revenue comes from offering enterprise-ready Hadoop distributions.

If a fully engineered, packaged product matters most, the MapR distribution is the way to go.

If a purely open source stack is your priority, then the Hortonworks Hadoop distribution is for you.

If your business requirements fit somewhere in between, then opting for Cloudera Distribution for Hadoop might be a good decision.

All three vendors also provide support to help their users with the problems they face, along with demonstrations if required. All three Hadoop distributions have stood the test of time, ensuring the stability and security needed to meet business needs.

Cloudera:-

Cloudera is the best-known player and market leader in the Hadoop space, and it was the first to release a commercial Hadoop distribution. It tops the list when it comes to building innovative tools.

Its management console, Cloudera Manager, is easy to use and offers a rich user interface that displays all the information in an organized, clean way.

The proprietary Cloudera management suite automates the installation process and provides various other enhanced services to users, such as displaying the number of live nodes in real time and reducing deployment time.

Cloudera also offers consulting services to bridge the gap between what the community provides and what organizations need to integrate Hadoop technology into their data management strategy.

Hortonworks:-

Hortonworks, founded by Yahoo engineers, provides a ‘service only’ distribution model for Hadoop.

Hortonworks is different from the other Hadoop distributions in that it is an open enterprise data platform, available free for use. The Hortonworks Data Platform (HDP) can easily be downloaded and integrated into various applications.

Hortonworks was the first vendor to provide a production-ready Hadoop distribution based on Hadoop 2.0. Though CDH included Hadoop 2.0 features in its earlier versions, not all of its components were considered production-ready.

MapR:-

MapR is also a platform-focused provider like Hortonworks and Cloudera.

MapR integrates its own database system, MapR-DB, which it claims is between four and seven times faster than the stock Hadoop database, HBase, running on competing distributions.
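
For context, the stock Hadoop database mentioned here, HBase, is used through a Java client API, and MapR positions MapR-DB as compatible with that same API. The sketch below shows the standard HBase put/get pattern; the table name demo_table and column family cf are hypothetical, and connection details are assumed to come from an hbase-site.xml on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath for cluster connection details.
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("demo_table"))) {

            // Write one cell: row "row1", column family "cf", qualifier "q".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("hello"));
            table.put(put);

            // Read the same cell back.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("q"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```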

Due to its power and speed, MapR is often seen as a good choice for the biggest of Big Data projects.

Unlike Cloudera and Hortonworks, the MapR Hadoop distribution takes a more distributed approach to storing metadata on the processing nodes, because it relies on a different file system, the MapR File System (MapR-FS), and does not have a NameNode architecture.

Features:-

| Feature | Hortonworks | Cloudera | MapR |
|---|---|---|---|
| Dependability | | | |
| High availability | Single-point-of-failure recovery | Single-point-of-failure recovery | Self-healing across multiple failures |
| MapReduce HA | Restarts jobs | Restarts jobs | Continues without restart |
| Upgrading | Planned downtime | Rolling upgrades | Rolling upgrades |
| Replication | Data | Data | Data and metadata |
| Snapshots | Consistent only for closed files | Consistent only for closed files | Consistent for all files and tables |
| Disaster recovery | Parallel cluster | File copy scheduling | Mirroring |
| Manageability | | | |
| Management tools | Ambari | Cloudera Manager | MapR Control System |
| Heat maps, alarms, alerts | Yes | Yes | Yes |
| Integration with REST API | Yes | Yes | Yes |
| Data and job placement control | No | No | Yes |
| Performance & Scalability | | | |
| Metadata architecture | Centralized | Centralized | Distributed |
| Data ingest | Batch-mode write | Batch-mode write | Batch and streaming write |
| HBase performance | Latency spikes | Latency spikes | Consistent low latency |
| NoSQL applications | Mainly batch applications | Mainly batch applications | Batch and real-time applications |
| Data access | | | |
| File system access | HDFS, read-only NFS | HDFS, read-only NFS | HDFS, read/write NFS |
| File I/O | Append-only | Append-only | Read/write |
| Security ACLs | Yes | Yes | Yes |
| Wire-level authentication | Kerberos | Kerberos | Kerberos, native |

Comparison between MapR & Hadoop File Systems

MapR-FS vs HDFS

When data is written to MapR-FS, it is sharded into chunks. The default chunk size is 256 Megabytes. Chunks are striped across storage pools in a series of blocks, into logical entities called containers. Striping the data across multiple disks allows data to be written faster, because the file will be split across the three physical disks in a storage pool, but remain in one logical container.
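
As a rough illustration of the arithmetic above (this is not MapR code, just the chunk math under the default 256 MB setting):

```java
// Hypothetical sketch: how many 256 MB chunks a file of a given size occupies in MapR-FS.
public class ChunkCountSketch {
    public static void main(String[] args) {
        final long CHUNK_SIZE = 256L * 1024 * 1024; // default MapR-FS chunk size: 256 MB
        long fileSize = 1L * 1024 * 1024 * 1024;    // example: a 1 GB file

        // Round up: a partially filled final chunk still counts as a chunk.
        long chunks = (fileSize + CHUNK_SIZE - 1) / CHUNK_SIZE;
        System.out.println(fileSize + " bytes -> " + chunks + " chunks"); // prints 4 chunks for 1 GB
    }
}
```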

When data is written to HDFS, it is distributed across nodes. HDFS splits data into blocks of 128 megabytes by default, and distributes these blocks across different locations throughout your cluster. Files are automatically distributed as they are written.
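
A minimal sketch of this behaviour through the standard Hadoop Java client, assuming a reachable cluster; the path is hypothetical, and in practice the block size usually comes from hdfs-site.xml rather than client code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // dfs.blocksize controls the block size of files created by this client (128 MB default).
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/tmp/example.txt"); // hypothetical path

        // As the file is written, HDFS cuts it into dfs.blocksize blocks and
        // the NameNode places those blocks on DataNodes across the cluster.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("hello hdfs");
        }
        System.out.println("Block size used: " + fs.getFileStatus(file).getBlockSize());
    }
}
```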

MapR-FS distributes and replicates the name-space information throughout the cluster, in the same way that data is replicated. Each volume has a name container, which contains the metadata for the files in that volume. The CLDB service typically runs on multiple nodes in the cluster. CLDB is used to locate the name container for the volume, and the client connects to the name container to access the file metadata.

In HDFS, meta-data is managed by the NameNode. Before any operations can be performed on data stored in HDFS, an application must contact the NameNode. The single NameNode maintains metadata information for all the physical data blocks that comprise the files. This can create performance bottlenecks.
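
The sketch below shows that metadata round-trip with the standard FileSystem API: the getFileStatus and getFileBlockLocations calls are answered by the NameNode, which tells the client which DataNodes hold each block (the path is hypothetical).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NameNodeLookupSketch {
    public static void main(String[] args) throws Exception {
        // FileSystem.get() resolves fs.defaultFS and talks to the NameNode for the metadata calls below.
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/tmp/example.txt"); // hypothetical path

        FileStatus status = fs.getFileStatus(file);                // size, replication, block size
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen()); // which DataNodes hold each block

        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                + " length=" + block.getLength()
                + " hosts=" + String.join(",", block.getHosts()));
        }
    }
}
```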

MapR-FS uses replication for high availability and fault tolerance. Replication protects against hardware failures. File chunks, table regions, and metadata are automatically replicated, and there is generally at least one replica on a different rack. In HDFS, data stored on any node is replicated multiple times across the cluster. These replicas prevent data loss: if one node fails, other nodes can continue processing the data.
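
On the HDFS side, the replication factor is exposed through the standard client API; a minimal sketch, assuming a reachable cluster and a hypothetical existing file (the cluster-wide default normally lives in hdfs-site.xml):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // dfs.replication controls how many copies HDFS keeps of each block (default 3).
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/tmp/example.txt"); // hypothetical existing file

        // The factor can also be changed per file after it has been written.
        fs.setReplication(file, (short) 2);
        System.out.println("replication=" + fs.getFileStatus(file).getReplication());
    }
}
```
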
MapR-FS avoids the single-point-of-failure and performance-bottleneck problems by fully distributing the metadata for files and directories.

In HDFS, the NameNode can become a single point of failure and a performance bottleneck.

MapR-FS allows updates to files in increments as small as 8 KB. The smaller I/O size reduces overhead on the cluster, makes snapshots possible, and is one of the reasons that MapR-FS supports random reads and writes, even during ingestion.

Data in HDFS is immutable. If the source data changes, the new data must be appended to the existing data or else reloaded into the cluster.
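
A small sketch of what that looks like with the HDFS Java API, assuming a cluster where append is enabled and a hypothetical existing file: bytes already written cannot be modified in place, only appended to.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendOnlySketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/tmp/events.log"); // hypothetical existing file

        // There is no API to overwrite bytes in the middle of an HDFS file;
        // the only in-place change allowed is appending to the end.
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeBytes("new record\n");
        }
    }
}
```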

MapR-FS is written in C. Being written in C means it avoids JVM garbage collection pauses, which translates into faster performance.

HDFS is written in Java.

6. Summary:-

Choosing a Hadoop distribution depends largely on the obstacles an organization faces in implementing Hadoop in its enterprise. The right choice of distribution will help organizations connect Hadoop to different data analysis platforms with flexibility, reliability, and visibility. Each Hadoop distribution has its own pros and cons.

When choosing a Hadoop distribution for business needs, it is imperative to weigh the additional value offered by each distribution against its risk and cost, so that the chosen distribution proves beneficial for your enterprise.

The world of Hadoop is getting bigger and bigger, and the list of options can be overwhelming if you don’t know what you’re looking for. Hopefully, these considerations and their specific criteria will point you in the right direction as you search for the best Hadoop distribution for your needs.