
Hadoop – Compute

 

 

Hadoop schedules MapReduce jobs across the nodes of a cluster using the JobTracker.

JobTracker

The JobTracker takes a MapReduce job, breaks it into smaller map and reduce tasks, and schedules those tasks across the machines in the cluster. It also tracks each task through to completion, detects node failures, and redirects the affected work to other nodes.

Key Responsibilities of JobTracker:

  • Resource Management: Monitors available resources and TaskTrackers in the cluster, ensuring efficient distribution of work
  • Job Scheduling: Accepts job submissions, breaks them into individual tasks, and schedules them on appropriate TaskTracker nodes
  • Task Distribution: Assigns map and reduce tasks to TaskTrackers based on data locality and available resources
  • Fault Tolerance: Monitors task progress, detects failures, and re-schedules failed tasks on different nodes, as outlined below

When a MapReduce job is submitted (a short HiveQL sketch after this list shows one way such a job gets triggered):

  1. JobTracker receives the job and initializes it
  2. It splits the input data into manageable chunks for map tasks
  3. Assigns map tasks to TaskTrackers, preferably where the input data resides
  4. Monitors the completion of map tasks before scheduling reduce tasks
  5. Coordinates the movement of intermediate data from mappers to reducers
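
One convenient way to see this flow in practice is through Hive, which compiles queries into MapReduce jobs and, on classic (MRv1) clusters, submits them to the JobTracker. This is only a minimal sketch: the access_logs table and its page column are assumed for illustration.

-- EXPLAIN prints the compiled plan, including the map and reduce
-- stages that will be scheduled as individual tasks
EXPLAIN
SELECT page, COUNT(*) AS hits
FROM access_logs
GROUP BY page;

-- Running the statement without EXPLAIN actually submits the job,
-- which then goes through steps 1-5 above
SELECT page, COUNT(*) AS hits
FROM access_logs
GROUP BY page;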

Note: In modern Hadoop versions (YARN), JobTracker's functionality has been split between the Resource Manager and Application Master for better scalability.

 

Logic Behind Map-Reduce Division

The division of tasks into Map and Reduce phases follows a "divide and conquer" strategy that makes complex data processing more manageable; a short HiveQL sketch after the lists below shows how a single query maps onto the two phases:

Map Phase:

  • Data Splitting: Breaks large datasets into smaller, manageable chunks that can be processed independently
  • Parallel Processing: Each chunk can be processed simultaneously on different nodes, improving efficiency
  • Local Processing: Works on data where it resides, reducing network overhead
  • Transformation: Converts raw data into key-value pairs for easier processing

Reduce Phase:

  • Aggregation: Combines results from multiple map tasks into a final output
  • Consolidation: Processes all values associated with the same key together
  • Result Generation: Produces the final output in a format specified by the user

This division provides several benefits:

  • Scalability: Can easily scale by adding more nodes for parallel processing
  • Fault Tolerance: If one task fails, only that portion needs to be reprocessed
  • Load Balancing: Work can be distributed evenly across the cluster
  • Data Locality: Reduces network congestion by processing data where it resides when possible
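
To make the division concrete, here is a small HiveQL sketch; the words table and its word column are assumptions for illustration. When Hive runs this query as a MapReduce job, scanning the rows and emitting (word, 1) pairs happens in the map phase, while summing the counts for each distinct word happens in the reduce phase.

-- Map phase: scan the table and emit a key-value pair per row
-- Reduce phase: group the pairs by key and sum the values
SELECT word, COUNT(*) AS occurrences
FROM words
GROUP BY word;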

 

Hive Data Types

Hive Data Types Overview

Category  | Data Type | Description
----------|-----------|---------------------------------
Primitive | TINYINT   | 1-byte signed integer
Primitive | SMALLINT  | 2-byte signed integer
Primitive | INT       | 4-byte signed integer
Primitive | BIGINT    | 8-byte signed integer
Primitive | FLOAT     | Single-precision floating point
Primitive | DOUBLE    | Double-precision floating point
Primitive | BOOLEAN   | True/false values
Primitive | STRING    | Character sequence
Primitive | TIMESTAMP | Date and time values
Primitive | BINARY    | Byte sequence
Complex   | ARRAY     | Ordered collection of fields
Complex   | MAP       | Key-value pairs
Complex   | STRUCT    | Container of named fields
Complex   | UNIONTYPE | Different types in the same field

Additional Notes:

  • Primitive types are basic data types that cannot be broken down further
  • Complex types are composed of multiple primitive types or other complex types
  • All numeric types are signed
  • STRING type has no length limit
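
As a quick illustration of the primitive types in use, here is a small HiveQL sketch; the sensor_readings table and its columns are invented for this example:

CREATE TABLE sensor_readings (
    sensor_id   INT,          -- 4-byte signed integer
    temperature DOUBLE,       -- double-precision reading
    is_valid    BOOLEAN,      -- true/false flag
    recorded_at TIMESTAMP,    -- date and time of the reading
    raw_payload BINARY,       -- raw byte sequence
    notes       STRING        -- free-form text, no declared length
);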

 

Complex Data Types in Hive

ARRAY

Arrays in Hive are ordered collections whose elements all share the same data type. They are similar to arrays in programming languages and support indexing and iteration.

graph LR
    A[Array: scores] --> B[Index 0: 85]
    A --> C[Index 1: 92]
    A --> D[Index 2: 78]
    style A fill:#f9f,stroke:#333
    style B fill:#bbf,stroke:#333
    style C fill:#bbf,stroke:#333
    style D fill:#bbf,stroke:#333
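
In HiveQL, an ARRAY column is declared and indexed as in the sketch below; the students table is invented for illustration and mirrors the scores example in the diagram:

CREATE TABLE students (
    name   STRING,
    scores ARRAY<INT>        -- e.g. [85, 92, 78]
);

-- Elements are accessed by zero-based index
SELECT name, scores[0] AS first_score
FROM students;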

 

MAP

Maps are collections of key-value pairs where each key must be unique. They're useful for storing related data pairs like configurations or attributes.

graph LR
    M[Map: attributes] --> D1[department]
    M --> L1[location]
    D1 --> V1[IT]
    L1 --> V2[NYC]
    style M fill:#f9f,stroke:#333
    style D1 fill:#bbf,stroke:#333
    style L1 fill:#bbf,stroke:#333
    style V1 fill:#dfd,stroke:#333
    style V2 fill:#dfd,stroke:#333
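
A corresponding HiveQL sketch for MAP columns; the employee_attributes table is invented for illustration and mirrors the department/location pairs in the diagram:

CREATE TABLE employee_attributes (
    name       STRING,
    attributes MAP<STRING, STRING>   -- e.g. {'department':'IT', 'location':'NYC'}
);

-- Values are looked up by key
SELECT name, attributes['department'] AS department
FROM employee_attributes;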

 

STRUCT

Structs are containers that can hold named fields of different data types. They're similar to objects in programming languages and useful for organizing related fields.

Here's a practical example of how STRUCT stores data in a nested structure:

CREATE TABLE employees (
    id INT,
    info STRUCT<
        personal:STRUCT<
            name:STRING,
            age:INT,
            email:STRING
        >,
        work:STRUCT<
            department:STRING,
            position:STRING,
            salary:DOUBLE
        >
    >
);

-- Example data
{
    "id": 101,
    "info": {
        "personal": {
            "name": "John Smith",
            "age": 30,
            "email": "john.smith@company.com"
        },
        "work": {
            "department": "Engineering",
            "position": "Senior Developer",
            "salary": 95000.00
        }
    }
}

 

Here's a visual representation using Mermaid diagram syntax:

graph TD
    E[employees] --> ID[id: INT]
    E --> INFO[info: STRUCT]
    INFO --> P[personal: STRUCT]
    INFO --> W[work: STRUCT]
    
    P --> P1[name: STRING]
    P --> P2[age: INT]
    P --> P3[email: STRING]
    
    W --> W1[department: STRING]
    W --> W2[position: STRING]
    W --> W3[salary: DOUBLE]
    
    P1 --> PV1["John Smith"]
    P2 --> PV2[30]
    P3 --> PV3["john.smith@company.com"]
    
    W1 --> WV1["Engineering"]
    W2 --> WV2["Senior Developer"]
    W3 --> WV3[95000.00]
    
    style E fill:#f9f,stroke:#333
    style INFO fill:#bbf,stroke:#333
    style P fill:#dfd,stroke:#333
    style W fill:#dfd,stroke:#333

 

In this example, the STRUCT type creates a hierarchical structure where related data is organized in logical groups. The outer STRUCT 'info' contains two inner STRUCTs: 'personal' and 'work', each containing specific attributes about an employee. This organization makes it easier to manage and query related data fields together.
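
Nested STRUCT fields are addressed with dot notation, so a query against the employees table defined above can pull out individual attributes, as in this short sketch:

-- Dot notation reaches into the nested STRUCTs
SELECT id,
       info.personal.name   AS employee_name,
       info.work.department AS department,
       info.work.salary     AS salary
FROM employees
WHERE info.work.salary > 90000;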

MapReduce Task Scheduling – Simplified Diagram

 

Here’s a simplified explanation of MapReduce task scheduling architecture:

graph TD
    A["Client Job Submission"] --> B["JobTracker"]
    B --> D1["TaskTracker 1"]
    B --> D2["TaskTracker 2"]
    B --> D3["TaskTracker n"]
    D1 --> E1["Map Tasks"]
    D1 --> F1["Reduce Tasks"]
    D2 --> E2["Map Tasks"]
    D2 --> F2["Reduce Tasks"]
    D3 --> E3["Map Tasks"]
    D3 --> F3["Reduce Tasks"]

    %% Highlight the central scheduler and the worker nodes
    style B fill:#90EE90
    style D1 fill:#87CEEB
    style D2 fill:#87CEEB
    style D3 fill:#87CEEB

 

Key components:

  • JobTracker: Central component that manages job scheduling and monitors progress
  • Resource Manager: In YARN-based Hadoop versions, takes over cluster-wide resource allocation from the JobTracker
  • TaskTrackers: Run on worker nodes, execute map and reduce tasks, and report status to the JobTracker
  • Map Tasks: Process input data in parallel, creating key-value pairs
  • Reduce Tasks: Aggregate and process mapped data to produce final output

This architecture enables distributed processing of large datasets across multiple nodes in a cluster, providing fault tolerance and scalability.