Querying Data in Hive

Hadoop schedules MapReduce jobs across the nodes of a cluster using the JobTracker. The JobTracker takes a MapReduce job, breaks it into small map and reduce tasks, schedules those tasks across the machines in your cluster, and tracks each one to completion. It can also detect node failures and redirect the affected work to other nodes.
When a MapReduce job is submitted:

1. The client submits the job to the JobTracker.
2. The JobTracker splits the job into map and reduce tasks.
3. Tasks are assigned to TaskTrackers on the cluster nodes, preferring nodes that already hold the input data.
4. The JobTracker monitors task progress and reschedules failed tasks on other nodes.
Note: In modern Hadoop versions (YARN), the JobTracker's functionality has been split between the ResourceManager and per-application ApplicationMasters for better scalability.
The division of tasks into Map and Reduce phases follows a "divide and conquer" strategy that makes complex data processing more manageable:
This division provides several benefits:

- Parallelism: map tasks run independently, so large inputs are processed concurrently across the cluster.
- Fault tolerance: a failed task can be re-run on another node without restarting the whole job.
- Scalability: adding nodes adds map and reduce capacity.
- Simplicity: developers write only the map and reduce functions, while the framework handles distribution, shuffling, and scheduling.
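On classic Hadoop, Hive compiles each HiveQL query into one or more MapReduce jobs, so this map/reduce division can be observed directly with EXPLAIN. A minimal sketch, assuming a hypothetical `page_views` table (the table and columns are invented for this illustration):

```sql
-- Hypothetical table, used only to give EXPLAIN something to plan
CREATE TABLE IF NOT EXISTS page_views (
  user_id INT,
  url     STRING
);

-- EXPLAIN prints the execution plan Hive generates; on classic
-- Hadoop, an aggregation like this compiles into a map stage
-- (scan and project) feeding a reduce stage (group and count).
EXPLAIN
SELECT url, COUNT(*) AS hits
FROM page_views
GROUP BY url;
```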
| Category  | Data Type | Description                          |
|-----------|-----------|--------------------------------------|
| Primitive | TINYINT   | 1-byte signed integer                |
| Primitive | SMALLINT  | 2-byte signed integer                |
| Primitive | INT       | 4-byte signed integer                |
| Primitive | BIGINT    | 8-byte signed integer                |
| Primitive | FLOAT     | Single-precision floating point      |
| Primitive | DOUBLE    | Double-precision floating point      |
| Primitive | BOOLEAN   | True/false values                    |
| Primitive | STRING    | Character sequence                   |
| Primitive | TIMESTAMP | Date and time values                 |
| Primitive | BINARY    | Byte sequence                        |
| Complex   | ARRAY     | Ordered collection of fields         |
| Complex   | MAP       | Key-value pairs                      |
| Complex   | STRUCT    | Container of named fields            |
| Complex   | UNIONTYPE | Different types in the same field    |
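To see these types in context, here is an illustrative table definition that combines several primitive and complex types (the table and column names are invented for this example):

```sql
CREATE TABLE user_profiles (
  user_id     BIGINT,                 -- 8-byte signed integer
  age         TINYINT,                -- 1-byte signed integer
  balance     DOUBLE,                 -- double-precision floating point
  is_active   BOOLEAN,                -- true/false
  signup_time TIMESTAMP,              -- date and time
  scores      ARRAY<INT>,             -- ordered collection
  attributes  MAP<STRING, STRING>,    -- key-value pairs
  address     STRUCT<street:STRING, city:STRING, zip:STRING>  -- named fields
);
```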
Arrays in Hive are ordered collections of elements of the same data type. They are similar to arrays in programming languages, supporting indexing and iteration.
```mermaid
graph LR
    A[Array: scores] --> B[Index 0: 85]
    A --> C[Index 1: 92]
    A --> D[Index 2: 78]
    style A fill:#f9f,stroke:#333
    style B fill:#bbf,stroke:#333
    style C fill:#bbf,stroke:#333
    style D fill:#bbf,stroke:#333
```
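A minimal sketch of defining and querying an array column that matches the scores example above (the table name and delimiters are illustrative):

```sql
CREATE TABLE student_scores (
  name   STRING,
  scores ARRAY<INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|';

-- Hive arrays are zero-indexed: for the row above, scores[0] is 85
SELECT name,
       scores[0]    AS first_score,  -- element access by index
       size(scores) AS num_scores    -- built-in size() function
FROM student_scores;
```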
Maps are collections of key-value pairs where each key must be unique. They're useful for storing related data pairs like configurations or attributes.
```mermaid
graph LR
    M[Map: attributes] --> D1[department]
    M --> L1[location]
    D1 --> V1[IT]
    L1 --> V2[NYC]
    style M fill:#f9f,stroke:#333
    style D1 fill:#bbf,stroke:#333
    style L1 fill:#bbf,stroke:#333
    style V1 fill:#dfd,stroke:#333
    style V2 fill:#dfd,stroke:#333
```
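A similar sketch for a map column, matching the attributes example above (again, the table name and delimiters are illustrative):

```sql
CREATE TABLE employee_attrs (
  name       STRING,
  attributes MAP<STRING, STRING>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':';

-- Values are looked up by key; missing keys return NULL
SELECT name,
       attributes['department'] AS department,  -- 'IT' in the diagram above
       attributes['location']   AS location     -- 'NYC' in the diagram above
FROM employee_attrs;
```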
Structs are containers that can hold named fields of different data types. They're similar to objects in programming languages and useful for organizing related fields.
Here's a practical example of how STRUCT stores data in a nested structure:
```sql
CREATE TABLE employees (
  id INT,
  info STRUCT<
    personal:STRUCT<
      name:STRING,
      age:INT,
      email:STRING
    >,
    work:STRUCT<
      department:STRING,
      position:STRING,
      salary:DOUBLE
    >
  >
);
```

Example data for one row, shown as JSON:

```json
{
  "id": 101,
  "info": {
    "personal": {
      "name": "John Smith",
      "age": 30,
      "email": "john.smith@company.com"
    },
    "work": {
      "department": "Engineering",
      "position": "Senior Developer",
      "salary": 95000.00
    }
  }
}
```
Here's a visual representation using Mermaid diagram syntax:
```mermaid
graph TD
    E[employees] --> ID[id: INT]
    E --> INFO[info: STRUCT]
    INFO --> P[personal: STRUCT]
    INFO --> W[work: STRUCT]
    P --> P1[name: STRING]
    P --> P2[age: INT]
    P --> P3[email: STRING]
    W --> W1[department: STRING]
    W --> W2[position: STRING]
    W --> W3[salary: DOUBLE]
    P1 --> PV1["John Smith"]
    P2 --> PV2[30]
    P3 --> PV3["john.smith@company.com"]
    W1 --> WV1["Engineering"]
    W2 --> WV2["Senior Developer"]
    W3 --> WV3[95000.00]
    style E fill:#f9f,stroke:#333
    style INFO fill:#bbf,stroke:#333
    style P fill:#dfd,stroke:#333
    style W fill:#dfd,stroke:#333
```
In this example, the STRUCT type creates a hierarchical structure where related data is organized in logical groups. The outer STRUCT 'info' contains two inner STRUCTs: 'personal' and 'work', each containing specific attributes about an employee. This organization makes it easier to manage and query related data fields together.
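Nested STRUCT fields are addressed with dot notation. A sketch of querying the employees table defined above:

```sql
-- Dot notation drills into nested STRUCT fields
SELECT id,
       info.personal.name   AS employee_name,
       info.work.department AS department,
       info.work.salary     AS salary
FROM employees
WHERE info.work.salary > 90000;
```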
Here's a simplified view of the classic MapReduce task-scheduling architecture, in which the JobTracker schedules tasks directly onto the TaskTrackers:

```mermaid
graph TD
    A["Client Job Submission"] --> B["JobTracker"]
    B --> D1["TaskTracker 1"]
    B --> D2["TaskTracker 2"]
    B --> D3["TaskTracker n"]
    D1 --> E1["Map Tasks"]
    D1 --> F1["Reduce Tasks"]
    D2 --> E2["Map Tasks"]
    D2 --> F2["Reduce Tasks"]
    D3 --> E3["Map Tasks"]
    D3 --> F3["Reduce Tasks"]
    style B fill:#90EE90
    style D1 fill:#87CEEB
    style D2 fill:#87CEEB
    style D3 fill:#87CEEB
```
Key components:

- JobTracker: accepts jobs from clients, splits them into map and reduce tasks, schedules the tasks onto TaskTrackers, and monitors their progress.
- TaskTrackers: run on the worker nodes and execute assigned tasks, reporting status back to the JobTracker through periodic heartbeats.
- Task slots: each TaskTracker offers a fixed number of map and reduce slots that bound how many tasks it runs concurrently.
This architecture enables distributed processing of large datasets across multiple nodes in a cluster, providing fault tolerance and scalability.