Hadoop schedules MapReduce jobs across the nodes of a cluster using the JobTracker.
JobTracker
The JobTracker takes a MapReduce job, breaks it into small map and reduce tasks, and schedules those tasks onto the machines in your cluster. It also tracks each task through to completion: it can detect node failures and redirect the affected work to other nodes.
Key Responsibilities of JobTracker:
- Resource Management: Monitors available resources and TaskTrackers in the cluster, ensuring efficient distribution of work
- Job Scheduling: Accepts job submissions, breaks them into individual tasks, and schedules them on appropriate TaskTracker nodes
- Task Distribution: Assigns map and reduce tasks to TaskTrackers based on data locality and available resources
- Fault Tolerance: Monitors task progress, detects failures, and reschedules failed tasks on different nodes
How does this work in practice? When a MapReduce job is submitted (see the driver sketch after this list):
- JobTracker receives the job and initializes it
- It splits the input data into manageable chunks for map tasks
- Assigns map tasks to TaskTrackers, preferably where the input data resides
- Monitors the completion of map tasks before scheduling reduce tasks
- Coordinates the movement of intermediate data from mappers to reducers
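To make the submission step concrete, here is a minimal word-count driver sketch using the org.apache.hadoop.mapreduce API (Hadoop 2.x-style). The class and path names are illustrative; the WordCountMapper and WordCountReducer it references are sketched in the Map and Reduce phase sections below.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Submit the job to the framework (JobTracker in Hadoop 1.x, YARN in Hadoop 2+)
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);    // map phase (sketched below)
        job.setReducerClass(WordCountReducer.class);  // reduce phase (sketched below)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input is split into chunks that become individual map tasks
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Block until the framework reports that all tasks have completed
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```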
Note: In modern Hadoop versions (Hadoop 2 and later, which use YARN), the JobTracker's functionality has been split between the ResourceManager and a per-application ApplicationMaster for better scalability.
Logic Behind Map-Reduce Division
The division of tasks into Map and Reduce phases follows a "divide and conquer" strategy that makes complex data processing more manageable:
Map Phase:
- Data Splitting: Breaks large datasets into smaller, manageable chunks that can be processed independently
- Parallel Processing: Each chunk can be processed simultaneously on different nodes, improving efficiency
- Local Processing: Works on data where it resides, reducing network overhead
- Transformation: Converts raw data into key-value pairs for easier processing
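A minimal sketch of a map task, assuming the classic word-count example: each input line is split into tokens, and each token is emitted as a (word, 1) key-value pair.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Transform one line of raw input text into (word, 1) key-value pairs
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}
```

Because each map task only reads its own input chunk, many such tasks can run in parallel on the nodes that already hold the data.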
Reduce Phase:
- Aggregation: Combines results from multiple map tasks into a final output
- Consolidation: Processes all values associated with the same key together
- Result Generation: Produces the final output in a format specified by the user
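A matching reduce-task sketch for the same word-count example: all counts for a given word arrive at one reducer call and are summed into a single output record.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Aggregate every value that shares this key
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        // Emit the final (word, total) pair
        context.write(key, result);
    }
}
```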
This division provides several benefits:
- Scalability: Can easily scale by adding more nodes for parallel processing
- Fault Tolerance: If one task fails, only that portion needs to be reprocessed
- Load Balancing: Work can be distributed evenly across the cluster
- Data Locality: Reduces network congestion by processing data where it resides when possible