Hadoop has changed considerably across its three major versions. Hadoop 3 combines the efforts of hundreds of contributors over the six years since Hadoop 2 launched. In this tutorial, we compare Hadoop 2.x with Hadoop 3.x. Let’s first see the comparison in tabular format:
|Features||Hadoop 2.x||Hadoop 3.x|
|Minimum Java version required||Java 7||Java 8|
|Fault tolerance||Via replication (which costs extra storage space)||Via erasure coding, in addition to replication|
|Storage scheme||3x replication factor for data reliability, 200% overhead||Erasure coding for data reliability, 50% overhead|
|YARN Timeline Service||Scalability issues||Highly scalable and reliable|
|Standby NameNode||Supports only one standby NameNode||Supports multiple standby NameNodes|
|Heap management||HADOOP_HEAPSIZE must be configured manually||Provides auto-tuning for heap|
|Data balancing||Uses the HDFS balancer, which balances data across DataNodes||Adds an intra-DataNode balancer, which is invoked via the HDFS disk balancer CLI|
Hadoop 2 vs Hadoop 3 – Feature-wise Comparison
Let’s discuss the Hadoop 2 vs Hadoop 3 comparison in detail:
- Minimum Supported Java Version
Hadoop 2.x: The minimum Java version required to run Hadoop 2.x is Java 7.
Hadoop 3.x: To run jar files in Hadoop 3.x, the minimum Java version required is Java 8. Most of the libraries in Hadoop 3.x support Java 8.
- Fault Tolerance
Hadoop 2.x: Replication provides fault tolerance in Hadoop 2.x. We can configure the replication factor as required; its default value is three. If any file block is lost, Hadoop recovers it from the remaining replicas of that block.
Hadoop 3.x: This version of Hadoop adds Erasure Coding as a technique for fault tolerance, as an alternative to replication. Under erasure coding the blocks are not replicated. Instead, HDFS calculates parity blocks for the file blocks. Whenever file blocks are lost or corrupted, the Hadoop framework recreates them from the remaining blocks along with the parity blocks.
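The recovery idea can be illustrated with a minimal sketch. It uses a single XOR parity block, which is a deliberate simplification: HDFS erasure coding policies such as RS-6-3-1024k are based on Reed-Solomon codes and can survive the loss of several blocks, whereas the XOR scheme here survives only one. The block contents and function name are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data blocks of equal size.
data_blocks = [b"hado", b"op-3", b"demo"]

# The parity block is the XOR of all data blocks.
parity = xor_blocks(data_blocks)

# Simulate losing block 1, then rebuild it from the survivors plus the parity.
survivors = [blk for i, blk in enumerate(data_blocks) if i != 1]
recovered = xor_blocks(survivors + [parity])

print(recovered == data_blocks[1])  # True
```

Only one extra block is stored for three data blocks here, yet any single lost block can be rebuilt without keeping a full copy of it.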
- Storage Overhead
Hadoop 2.x: The storage overhead in Hadoop 2.x is 200% with the default replication factor of 3. Suppose a file “A” is divided into 6 blocks in HDFS. With a replication factor of 3, the system stores 18 blocks for file “A”, of which 12 are extra copies. From this, we can compute:
Storage overhead = No. of extra blocks / No. of original blocks * 100 = 12/6*100 = 200%
Hadoop 3.x: Because Hadoop 3.x supports Erasure Coding for fault tolerance, it reduces the storage overhead of the data. Again take the example of a file with 6 blocks. Erasure Coding creates 3 parity blocks for them.
Storage overhead = 3/6*100 = 50%
From the above example, we can see that storage overhead is drastically reduced.
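The arithmetic above can be checked with a short sketch. The function names are hypothetical, and the 6 data blocks with 3 parity blocks are taken from the example in the text.

```python
def storage_overhead_percent(original_blocks, extra_blocks):
    """Storage overhead = extra blocks / original blocks * 100."""
    return extra_blocks / original_blocks * 100


def replication_overhead(original_blocks, replication_factor=3):
    # Every block beyond the first copy counts as an extra block.
    extra = original_blocks * (replication_factor - 1)
    return storage_overhead_percent(original_blocks, extra)


def erasure_coding_overhead(original_blocks, parity_blocks):
    # Only the parity blocks are extra.
    return storage_overhead_percent(original_blocks, parity_blocks)


print(replication_overhead(6))        # 200.0
print(erasure_coding_overhead(6, 3))  # 50.0
```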
- YARN Timeline Service
Hadoop 2.x: Hadoop 2.x ships with Timeline Service v.1.x. This version of the Timeline Service does not scale beyond small clusters, because it runs a single instance of the writer and of the storage.
Hadoop 3.x: Hadoop 3.x uses Timeline Service v.2. This version improves scalability and reliability through scalable back-end storage and a distributed writer architecture, and it enhances usability by introducing flows and aggregation.
- Support for opportunistic containers
Hadoop 2.x: Hadoop 2.x works on the principle of guaranteed containers. A guaranteed container starts running immediately, because the resources it needs are guaranteed to be available. This approach has two drawbacks:
- a) Feedback delays – Once a container finishes execution, the node notifies the ResourceManager (RM) about the released resources. When the RM schedules a new container on that node, the ApplicationMaster (AM) is notified, and only then does the AM start the new container. These notifications to the RM and the AM introduce a delay during which the resources sit idle.
- b) Allocated vs. utilized resources – The resources which the RM allocates to a container can be under-utilized. For example, the RM may allocate 4 GB of memory to a container that uses only 2 GB. This lowers the effective resource utilization.
Hadoop 3.x: To address the above drawbacks, Hadoop 3.x implements opportunistic containers. An opportunistic container can be sent to a node even when resources are unavailable; in that case it waits in a queue on the node until resources free up. Opportunistic containers have lower priority than guaranteed containers, so the scheduler preempts opportunistic containers to make room for guaranteed ones.
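The queueing and preemption behaviour described above can be sketched with a toy model of a single node. This is not YARN's actual scheduler code: the `Node` class, its method names, and the choice to preempt the most recently started opportunistic container first are all assumptions made for illustration.

```python
from collections import deque


class Node:
    """Toy model of container scheduling on one node (hypothetical)."""

    def __init__(self, capacity_gb):
        self.capacity_gb = capacity_gb
        self.running = []     # (name, kind, memory_gb) of running containers
        self.queue = deque()  # opportunistic containers waiting for resources
        self.preempted = []   # names of preempted opportunistic containers

    def free_gb(self):
        return self.capacity_gb - sum(mem for _, _, mem in self.running)

    def start_opportunistic(self, name, mem):
        # Run immediately if there is room, otherwise wait in the queue.
        if self.free_gb() >= mem:
            self.running.append((name, "OPPORTUNISTIC", mem))
        else:
            self.queue.append((name, mem))

    def start_guaranteed(self, name, mem):
        # Preempt opportunistic containers until the guaranteed one fits.
        while self.free_gb() < mem:
            victims = [c for c in self.running if c[1] == "OPPORTUNISTIC"]
            if not victims:
                raise RuntimeError("not enough resources even after preemption")
            victim = victims[-1]
            self.running.remove(victim)
            self.preempted.append(victim[0])
        self.running.append((name, "GUARANTEED", mem))


node = Node(capacity_gb=8)
node.start_guaranteed("g1", 4)
node.start_opportunistic("o1", 4)  # fits, so it runs right away
node.start_opportunistic("o2", 2)  # no room left, so it waits in the queue
node.start_guaranteed("g2", 4)     # preempts o1 to make room

print([name for name, _, _ in node.running])  # ['g1', 'g2']
print(list(node.queue))                       # [('o2', 2)]
print(node.preempted)                         # ['o1']
```

In this model the opportunistic containers fill otherwise idle capacity, while a guaranteed container is never kept waiting behind them.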
- Support for Multiple Standby Node
Hadoop 2.x: This version of Hadoop supports a single active NameNode and a single standby NameNode. This architecture can tolerate the failure of one NameNode.
Hadoop 3.x: Hadoop 3.x allows us to configure multiple standby NameNodes. A system with three NameNodes configured can tolerate the failure of two NameNodes.
- Default Port for Multiple services
Hadoop 2.x: The default ports of several services fall within the Linux ephemeral port range (32768-61000). This has a drawback: other programs on Linux are assigned ports from this range as well, so they can conflict with Hadoop services, which then fail to bind at startup.
Hadoop 3.x: To mitigate this drawback, the new version moves the default ports out of the Linux ephemeral port range. This affects the NameNode, the Secondary NameNode, and the DataNode.
NameNode ports: 50470 -> 9871, 50070 -> 9870, 8020 -> 9820 (later 3.x releases restored 8020 as the default NameNode RPC port)
Secondary NameNode ports: 50091 -> 9869, 50090 -> 9868
DataNode ports: 50020 -> 9867, 50010 -> 9866, 50475 -> 9865, 50075 -> 9864
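As a quick check of the mapping above, the following sketch tests each old and new default port against the ephemeral range quoted in the text. The dictionary and function names are hypothetical; the port numbers are the ones listed above.

```python
# Ephemeral range as quoted in the text, inclusive of 61000.
EPHEMERAL_RANGE = range(32768, 61001)

# Old default port -> new default port, as listed above.
PORT_CHANGES = {
    "NameNode": {50470: 9871, 50070: 9870, 8020: 9820},
    "Secondary NameNode": {50091: 9869, 50090: 9868},
    "DataNode": {50020: 9867, 50010: 9866, 50475: 9865, 50075: 9864},
}


def in_ephemeral_range(port):
    return port in EPHEMERAL_RANGE


old_in_range = sorted(
    old
    for mapping in PORT_CHANGES.values()
    for old in mapping
    if in_ephemeral_range(old)
)
new_in_range = [
    new
    for mapping in PORT_CHANGES.values()
    for new in mapping.values()
    if in_ephemeral_range(new)
]

print(old_in_range)  # [50010, 50020, 50070, 50075, 50090, 50091, 50470, 50475]
print(new_in_range)  # []
```

Every old port except 8020 sits inside the ephemeral range, and none of the new ports do.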
- New DataNode Balancer
Hadoop 2.x: A single DataNode manages many disks. These disks fill up evenly during normal write operations, but adding or replacing disks can lead to significant skew within a DataNode. The HDFS balancer in Hadoop 2.x cannot handle this situation, because it corrects skew between DataNodes (inter-DataNode), not between the disks of one DataNode (intra-DataNode).
Hadoop 3.x: The new intra-DataNode balancing functionality handles the above situation. The hdfs diskbalancer CLI invokes the intra-DataNode balancer. To use this facility, set the dfs.disk.balancer.enabled configuration to true on all DataNodes.
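To make intra-DataNode skew concrete, here is a minimal sketch that compares each disk's utilization with the node-wide average. It is a simplified illustration under assumed numbers, not the disk balancer's implementation; the function name and the disk figures are hypothetical.

```python
def disk_skew(disks):
    """Return node-wide utilization minus each disk's utilization.

    `disks` maps a disk name to (used_gb, capacity_gb). A positive value
    means the disk is emptier than the node average, a negative value
    means it is fuller.
    """
    total_used = sum(used for used, _ in disks.values())
    total_capacity = sum(cap for _, cap in disks.values())
    ideal = total_used / total_capacity
    return {name: ideal - used / cap for name, (used, cap) in disks.items()}


# Two nearly full disks plus one newly added, empty disk.
disks = {
    "disk1": (900, 1000),
    "disk2": (900, 1000),
    "disk3": (0, 1000),
}

for name, skew in disk_skew(disks).items():
    print(f"{name}: {skew:+.2f}")
```

Here disk1 and disk2 come out fuller than the node average and disk3 emptier, which is the situation an intra-DataNode balancer corrects by moving blocks between the disks of the same node.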
So, this was all about Hadoop 2 vs Hadoop 3. We hope you found the above comparison useful.