Hadoop schedules MapReduce jobs across the nodes of a cluster using the JobTracker.
JobTracker
The JobTracker takes a MapReduce job, breaks it into small map and reduce tasks, and schedules those tasks onto the machines in your cluster. It also tracks each task through to completion: it can detect node failures and redirect the affected work to other nodes.
Key Responsibilities of JobTracker:
- Resource Management: Monitors available resources and TaskTrackers in the cluster, ensuring efficient distribution of work
- Job Scheduling: Accepts job submissions, breaks them into individual tasks, and schedules them on appropriate TaskTracker nodes
- Task Distribution: Assigns map and reduce tasks to TaskTrackers based on data locality and available resources
- Fault Tolerance: Monitors task progress, detects failures, and reschedules failed tasks on different nodes
How does this work in practice? When a MapReduce job is submitted (see the driver sketch after this list):
- JobTracker receives the job and initializes it
- It splits the input data into manageable chunks for map tasks
- Assigns map tasks to TaskTrackers, preferably where the input data resides
- Monitors the completion of map tasks before scheduling reduce tasks
- Coordinates the movement of intermediate data from mappers to reducers
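To make the submission step concrete, here is a minimal word-count driver sketch using the org.apache.hadoop.mapreduce API (Hadoop 2.x-style). The class and path names are illustrative; the WordCountMapper and WordCountReducer it references are sketched in the Map and Reduce phase sections below.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Submit the job to the framework (JobTracker in Hadoop 1.x, YARN in Hadoop 2+)
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);    // map phase (sketched below)
        job.setReducerClass(WordCountReducer.class);  // reduce phase (sketched below)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input is split into chunks that become individual map tasks
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Block until the framework reports that all tasks have completed
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```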
Note: In modern Hadoop versions (Hadoop 2 and later, which use YARN), the JobTracker's functionality has been split between the ResourceManager and a per-application ApplicationMaster for better scalability.
Logic Behind Map-Reduce Division
The division of tasks into Map and Reduce phases follows a "divide and conquer" strategy that makes complex data processing more manageable:
Map Phase:
- Data Splitting: Breaks large datasets into smaller, manageable chunks that can be processed independently
- Parallel Processing: Each chunk can be processed simultaneously on different nodes, improving efficiency
- Local Processing: Works on data where it resides, reducing network overhead
- Transformation: Converts raw data into key-value pairs for easier processing
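A minimal sketch of a map task, assuming the classic word-count example: each input line is split into tokens, and each token is emitted as a (word, 1) key-value pair.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Transform one line of raw input text into (word, 1) key-value pairs
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}
```

Because each map task only reads its own input chunk, many such tasks can run in parallel on the nodes that already hold the data.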
Reduce Phase:
- Aggregation: Combines results from multiple map tasks into a final output
- Consolidation: Processes all values associated with the same key together
- Result Generation: Produces the final output in a format specified by the user
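A matching reduce-task sketch for the same word-count example: all counts for a given word arrive at one reducer call and are summed into a single output record.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Aggregate every value that shares this key
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        // Emit the final (word, total) pair
        context.write(key, result);
    }
}
```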
This division provides several benefits:
- Scalability: Can easily scale by adding more nodes for parallel processing
- Fault Tolerance: If one task fails, only that portion needs to be reprocessed
- Load Balancing: Work can be distributed evenly across the cluster
- Data Locality: Reduces network congestion by processing data where it resides when possible