Feature-wise Comparison Between Hadoop 2.x and Hadoop 3.x


We can compare Hadoop 2.x and Hadoop 3.x feature by feature to see which version offers the better combination.

License

  • Version 2.x – Licensed under the Apache License 2.0.
  • Version 3.x – Licensed under the Apache License 2.0.

Minimum supported version of Java

  • Version 2.x – The minimum supported version of Java is Java 7.
  • Version 3.x – The minimum supported version of Java is Java 8.

Fault Tolerance

HDFS is highly fault tolerant. Traditionally, it handles faults by creating replicas of each block.

  • Version 2.x – Fault tolerance is handled by replication. By default, HDFS stores three copies of each block.
  • Version 3.x – Fault tolerance can also be handled by erasure coding. Erasure coding can be used in place of replication and provides a comparable level of fault tolerance with less storage overhead; replication remains available.
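The idea behind erasure coding can be illustrated with a toy example. The sketch below uses a simple XOR parity over two data blocks, which is an assumption chosen for brevity: production HDFS erasure coding policies are typically based on Reed-Solomon codes, and the helper name `xor_blocks` is hypothetical, not part of any Hadoop API.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized blocks (hypothetical helper)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Two equally sized "data blocks" and one parity block computed from them.
data_blocks = [b"hadoop", b"blocks"]
parity = xor_blocks(data_blocks)

# Simulate losing the first data block: XOR-ing the surviving data block
# with the parity block reconstructs the lost one.
recovered = xor_blocks([data_blocks[1], parity])
print(recovered == data_blocks[0])
```

With XOR parity, any single lost block in the group can be rebuilt from the remaining ones, at the cost of one extra block rather than a full copy of the data.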

Data Balancing

HDFS provides a balancer utility. This utility analyzes block placement and balances data across the DataNodes.

  • Version 2.x – Data balancing uses the HDFS balancer, which distributes data across the DataNodes of a cluster but not across the disks within a single DataNode. HDFS might not always place data uniformly across those disks, for the following reasons:
    • A lot of writes and deletes
    • Disk replacement
  • Version 3.x – Data balancing can also use the intra-DataNode balancer, which is invoked via the HDFS disk balancer CLI. It distributes data uniformly across all disks of a DataNode.
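To make the notion of uneven disk usage within a DataNode concrete, here is a minimal sketch that measures how far each disk is from the node-wide average utilization. It is only an illustration under assumed numbers: the function name `disk_imbalance` and the sample disk sizes are hypothetical, and the real disk balancer uses its own planning logic.

```python
def disk_imbalance(used_gb, capacity_gb):
    """Return each disk's deviation from the node-wide utilization ratio.

    Positive values mean the disk is fuller than average, negative values
    mean it is emptier. Hypothetical helper, not a Hadoop API.
    """
    ideal = sum(used_gb) / sum(capacity_gb)
    return [round(u / c - ideal, 3) for u, c in zip(used_gb, capacity_gb)]


# Assumed example: three 1000 GB disks, one of them recently replaced
# (nearly empty) and one heavily written.
used = [900, 100, 500]
capacity = [1000, 1000, 1000]

deviations = disk_imbalance(used, capacity)
for disk, dev in enumerate(deviations):
    print(f"disk {disk}: {dev:+.3f} relative to the node average")
```

A uniform placement would give a deviation close to zero on every disk; large positive and negative values indicate that data should be moved between disks.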

Storage Overhead

HDFS stores redundant data for the purpose of fault tolerance, which costs extra storage space.

  • Version 2.x – With the default three-way replication, HDFS has 200% overhead in storage space.
  • Version 3.x – With erasure coding, storage overhead is only 50%.

Storage Overhead Example

  • Version 2.x – If there are 6 blocks, 18 blocks of space are occupied because of the replication scheme (6 × 3).
  • Version 3.x – If there are 6 blocks, 9 blocks of space are occupied: 6 data blocks and 3 parity blocks.
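The arithmetic in this example can be checked with a short sketch. The function names are hypothetical, and the erasure coding layout of 6 data units plus 3 parity units is taken from the example above; other layouts would give different numbers.

```python
def replication_footprint(data_blocks, replicas=3):
    """Blocks stored when every data block is kept `replicas` times."""
    return data_blocks * replicas


def erasure_coding_footprint(data_blocks, data_units=6, parity_units=3):
    """Blocks stored when each group of `data_units` blocks gets `parity_units` parity blocks."""
    groups = -(-data_blocks // data_units)  # ceiling division
    return data_blocks + groups * parity_units


def overhead_percent(stored_blocks, data_blocks):
    """Extra storage relative to the raw data, as a percentage."""
    return 100 * (stored_blocks - data_blocks) / data_blocks


blocks = 6
replicated = replication_footprint(blocks)
erasure_coded = erasure_coding_footprint(blocks)

print(replicated, overhead_percent(replicated, blocks))
print(erasure_coded, overhead_percent(erasure_coded, blocks))
```

For 6 data blocks this gives 18 stored blocks (200% overhead) under replication and 9 stored blocks (50% overhead) under the assumed 6+3 erasure coding layout.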

YARN Timeline Service

The storage and retrieval of an application's current and historic information in a generic fashion is addressed in YARN through the Timeline Server.

  • Version 2.x – Uses the older timeline service, which has scalability issues.
  • Version 3.x – Introduces timeline service v2, which improves the scalability and reliability of the timeline service.

Default Ports Range

Before Hadoop 3.x, the default ports of several Hadoop services were in the Linux ephemeral port range (32768-61000).

  • Version 2.x – Some default ports fall within the Linux ephemeral port range, so a service can fail to bind at startup if another application already holds its port.
  • Version 3.x – These ports have been moved out of the ephemeral range.
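The port change can be illustrated with a small check against the ephemeral range quoted above. The port numbers in the table below are not taken from this article; they are the commonly documented HDFS defaults as best recalled, so treat them as assumptions and verify them against the `hdfs-default.xml` of your own release.

```python
# Linux ephemeral port range quoted in the text (inclusive on both ends).
EPHEMERAL_RANGE = range(32768, 61001)

# Assumed default ports: (Hadoop 2.x, Hadoop 3.x). Verify for your release.
DEFAULT_PORTS = {
    "NameNode web UI": (50070, 9870),
    "Secondary NameNode web UI": (50090, 9868),
    "DataNode data transfer": (50010, 9866),
    "DataNode IPC": (50020, 9867),
    "DataNode web UI": (50075, 9864),
}


def in_ephemeral_range(port):
    """True if the port may be handed out by the OS as an ephemeral port."""
    return port in EPHEMERAL_RANGE


for service, (v2_port, v3_port) in DEFAULT_PORTS.items():
    print(
        f"{service}: 2.x port {v2_port} ephemeral={in_ephemeral_range(v2_port)}, "
        f"3.x port {v3_port} ephemeral={in_ephemeral_range(v3_port)}"
    )
```

A port inside the ephemeral range can already be in use by an outgoing connection when the service starts, which is why binding may fail.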

Compatible File System

  • Version 2.x – HDFS, the FTP file system (which stores all its data on remotely accessible FTP servers), the Amazon S3 file system, and the Windows Azure Storage Blobs (WASB) file system.
  • Version 3.x – All of the file systems supported in 2.x, plus the Microsoft Azure Data Lake file system.

MR API Compatibility

The MapReduce Application Master REST APIs allow the user to get the status of a running MapReduce application master.

  • Version 2.x – The MR API is compatible with Hadoop 1.x, so Hadoop 1.x programs can run on Hadoop 2.x.
  • Version 3.x – The MR API is likewise compatible with Hadoop 1.x, so Hadoop 1.x programs can run on Hadoop 3.x.

Support for Microsoft Windows

  • Version 2.x – It can be deployed on Windows.
  • Version 3.x – It also supports Microsoft Windows.

Slots/Container

A container signifies resources allocated to an ApplicationMaster. The ResourceManager is responsible for issuing resources/containers to an ApplicationMaster.

  • Version 2.x – Hadoop 1 works on the concept of slots, but Hadoop 2.x works on the concept of containers. Within a container, we can run generic tasks.
  • Version 3.x – It also works on the concept of containers.

Single Point of Failure

  • Version 2.x – It has features to overcome the single point of failure: whenever the NameNode fails, it recovers automatically.
  • Version 3.x – It has the same features to overcome the single point of failure: whenever the NameNode fails, it recovers automatically.

HDFS Federation

HDFS Federation improves the existing HDFS architecture through a clear separation of namespace and storage, enabling a generic block storage layer.

  • Version 2.x – Hadoop 1.0 has only a single NameNode to manage the entire namespace, but Hadoop 2.0 supports multiple NameNodes for multiple namespaces.
  • Version 3.x – Hadoop 3.x also supports multiple NameNodes for multiple namespaces.

Scalability

  • Version 2.x – In Hadoop 2.x, we can scale up to 10,000 nodes per cluster.
  • Version 3.x – Hadoop 3.x provides better scalability than Hadoop 2.x. We can scale to more than 10,000 nodes per cluster.

Faster Access to Data

  • Version 2.x – Thanks to DataNode caching, data can be accessed quickly.
  • Version 3.x – As in Hadoop 2.x, DataNode caching allows data to be accessed quickly.

Platform

  • Version 2.x – It can serve as a platform for a wide variety of data analytics, making it possible to run event processing, streaming, and real-time operations.
  • Version 3.x – As in Hadoop 2.x, it can serve as a platform for a wide variety of data analytics, making it possible to run event processing, streaming, and real-time operations.