YARN is a sub-project of Hadoop introduced in Hadoop 2.0. It is the next-generation framework for resource management. Because MapReduce focuses only on batch processing, YARN was conceived to provide a more general processing platform for data stored in HDFS. This document summarises the growing needs of big data and the limitations of MapReduce v1. It also explains why YARN is used as a business solution for resource management.
The YARN framework manages resources for distributed applications. In 2006, Hadoop was developed as a solution for handling big data in the easiest possible way. The Hadoop framework uses the MapReduce algorithm for batch processing of large amounts of data stored in a distributed file system. Hadoop's data storage framework could keep pace with growing data sizes, but resource management became a bottleneck, and the resource management framework for Hadoop needed a redesign to meet the growing needs of big data.
MapReduce is a software framework for processing large data sets in parallel, in a distributed environment, in a reliable and fault-tolerant manner. It is a batch-oriented model in which a large amount of data is stored in the Hadoop Distributed File System (HDFS) and computation on that data is performed in Map and Reduce phases. The basic principle of the MapReduce framework is to move the processing algorithm to the data rather than move the data over the network for computation: MapReduce tasks are scheduled to run on the same physical nodes on which the data resides. This significantly reduces network traffic and increases the efficiency of the system. The MapReduce system has two components, the JobTracker and the TaskTracker. The JobTracker is a master daemon responsible for scheduling a job's tasks on the slaves, monitoring them and re-executing failed tasks. The TaskTracker is a per-node process responsible for executing the MapReduce tasks. Some limitations of the MapReduce v1 framework are as follows:
a) Cluster Inefficiency: Sequential execution of application processes, with less robust preemption criteria, prevents the resource manager from properly utilizing the available cluster resources. Also, the inability to allocate resources to processes in terms of memory or CPU forbids multi-tenancy and thus degrades the cluster's overall throughput.
b) Non-Scalability: The MapReduce framework depends entirely on the master daemon, i.e. the JobTracker, which manages the cluster resources, the execution of jobs and fault tolerance. It has been observed that Hadoop cluster performance degrades drastically when the cluster size grows above 4,000 nodes or the count of concurrent tasks crosses 40,000. This centralised handling of the jobs' control flow resulted in endless scalability concerns for the scheduler.
c) Only Batch Processing: The resources across the cluster are tightly coupled with MapReduce programming. The framework does not support integration of other data processing frameworks and forces everything to look like a MapReduce job, while emerging customer requirements demand support for real-time and near-real-time processing.
d) Partitioning of Resources: The framework forced the cluster resources to be partitioned into map and reduce slots. Such static partitioning left the cluster under-utilized; for example, reduce slots could sit idle while map tasks were still queued, and vice versa.
e) Unavailability and Unreliability: The JobTracker was a single point of failure for the cluster: if the JobTracker failed, all queued and running jobs were killed. In January 2008, Arun C Murthy logged a bug in JIRA against the MapReduce architecture, which eventually resulted in a generic resource scheduler and a per-job, user-defined component that manages the application execution. (Link: https://issues.apache.org/jira/browse/MAPREDUCE-279)
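The Map and Reduce phases described above can be sketched as a toy word count. This is a plain-Python simulation of the programming model only, not the Hadoop API; the function names are purely illustrative:

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in an input line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group intermediate values by key, as the framework does
    between the Map and Reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the counts emitted for one word."""
    return (key, sum(values))

lines = ["big data needs yarn", "yarn manages big clusters"]
intermediate = [pair for line in lines for pair in map_phase(line)]
result = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(result)  # e.g. 'big' and 'yarn' each appear twice
```

In real MapReduce the map and reduce functions run as distributed tasks scheduled next to the data blocks; the shuffle step is performed by the framework itself.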
YARN addresses these limitations and offers the following advantages:
a) Higher Cluster Utilization: With the flexible and generic resource model in YARN, the scheduler handles only an overall resource profile for each type of application. This structure makes the communication and storage of resource requests efficient for the scheduler, resulting in higher cluster utilization.
b) Scalability: Scalability is the ability of a software product to perform well under an expanding workload. In YARN, the responsibilities of resource management and job scheduling/monitoring are divided into separate daemons, allowing the cluster to scale without degrading its performance.
c) Multiple Data Processing Algorithms: The MapReduce framework is bound to batch data processing only. YARN was developed out of the need to perform a wide variety of data processing over the data stored in HDFS. As a framework for generic resource management, YARN allows users to execute multiple data processing algorithms over the same data.
d) Multi-tenancy: In a distributed environment, the resources of an application are shared among one or more logically isolated instances. YARN allows cluster administrators to configure hierarchical queues and predictable sharing of cluster resources across an organisation.
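Hierarchical queues of this kind are typically configured through YARN's CapacityScheduler. A minimal capacity-scheduler.xml sketch is shown below; the queue names "engineering" and "marketing" are illustrative, while the property keys follow the CapacityScheduler's yarn.scheduler.capacity.* naming scheme:

```xml
<configuration>
  <!-- Two top-level queues under the root queue; names are illustrative -->
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>engineering,marketing</value>
  </property>
  <!-- Guaranteed share of cluster resources for each queue (percent) -->
  <property>
    <name>yarn.scheduler.capacity.root.engineering.capacity</name>
    <value>60</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.marketing.capacity</name>
    <value>40</value>
  </property>
</configuration>
```

Each tenant submits jobs to its own queue and is guaranteed its configured share, while idle capacity can still be borrowed by the other queue.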
e) Flexible Resource Model: In MapReduce v1, resources are defined as the number of map and reduce slots available for the execution of a job. With the need for multiple data processing algorithms, not every resource request can be expressed as map/reduce slots. In YARN, a resource request is defined in terms of memory, CPU, locality, etc., resulting in a generic definition of a resource request by an application.
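The shape of such a generic request can be sketched schematically. The field names below mirror the concepts in the text (memory, CPU cores, locality, priority) but this is a plain-Python illustration, not the actual YARN API:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ResourceRequest:
    """Schematic of a YARN-style generic resource request.

    Unlike an MRv1 map/reduce slot, a request names the resources it
    actually needs. The fields mirror YARN concepts (memory, virtual
    cores, locality, priority); this is an illustration only."""
    memory_mb: int
    vcores: int
    priority: int = 1
    locality: List[str] = field(default_factory=list)  # preferred hosts/racks

# The same scheduler can now serve very different workloads:
batch_mapper = ResourceRequest(memory_mb=1024, vcores=1,
                               locality=["rack1/node7"])  # data-local task
analytics_worker = ResourceRequest(memory_mb=4096, vcores=4)  # memory-heavy task
print(batch_mapper)
print(analytics_worker)
```

Because every application describes its needs in the same vocabulary, the scheduler no longer has to carve the cluster into fixed map and reduce slots.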
f) Reliability and Availability: Fault tolerance is a core design principle for any multi-tenant platform like YARN. This responsibility is delegated to the ResourceManager and the ApplicationMaster. The application-specific framework, i.e. the ApplicationMaster, handles the failure of a container, while the ResourceManager handles the failure of a NodeManager or an ApplicationMaster.
g) Locality Awareness: A container executing over YARN may require external resources such as jars, files or scripts on the local file system, and these are made available to containers before they are started. An ApplicationMaster defines the list of resources that are required to run its containers. For efficient disk utilization and access security, the NodeManagers ensure the availability of the specified resources and their deletion after use.
Giraph: Giraph is a framework for offline batch processing of semi-structured graph data stored using Hadoop. With Hadoop 1.x, Giraph had no control over the scheduling policies, the heap memory of the mappers or locality awareness for the running job. Also, defining a Giraph job in terms of mapper/reducer slots was a bottleneck. YARN's flexible resource allocation model, locality awareness principle and ApplicationMaster framework ease Giraph's job management and the allocation of resources to its tasks. (Link: https://issues.apache.org/jira/browse/GIRAPH-13)
Apache Spark: Spark enables iterative data processing and machine learning algorithms to perform analysis over data available through HDFS, HBase or other storage systems. Spark uses YARN's resource management capabilities and framework to submit a DAG of jobs. The Spark user can focus on data analytics use cases rather than on how Spark is integrated with Hadoop or how its jobs are executed. (Link: http://spark.apache.org/) A page on the Hadoop wiki lists a number of projects/applications that are migrating to, or already using, YARN as their resource management tool. (Link: http://wiki.apache.org/hadoop/PoweredByYarn)