Hadoop Data Types Overview
| Category | Data Type | Description |
|----------|-----------|-------------|
| Primitive | TINYINT | 1-byte signed integer (-128 to 127) |
| Primitive | SMALLINT | 2-byte signed integer (-32,768 to 32,767) |
| Primitive | INT | 4-byte signed integer |
| Primitive | BIGINT | 8-byte signed integer |
| Primitive | FLOAT | 4-byte single-precision floating point |
| Primitive | DOUBLE | 8-byte double-precision floating point |
| Primitive | BOOLEAN | TRUE/FALSE value |
| Primitive | STRING | Character sequence |
| Primitive | TIMESTAMP | Date and time value |
| Primitive | BINARY | Byte sequence |
| Complex | ARRAY | Ordered collection of elements of the same type |
| Complex | MAP | Key-value pairs with unique keys |
| Complex | STRUCT | Container of named fields, each with its own type |
| Complex | UNIONTYPE | Value that holds exactly one of several declared types |
Additional Notes:
- Primitive types are basic data types that cannot be broken down further
- Complex types are composed of multiple primitive types or other complex types
- All numeric types are signed
- STRING is declared without a maximum length
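As a rough sketch of how the primitive types above look in a table definition, the following HiveQL declares one column per type. It assumes the types are used through Hive; the table name sensor_readings and every column name are hypothetical and chosen only for illustration.

```sql
-- Hypothetical table: one column for each primitive type listed above.
CREATE TABLE sensor_readings (
  status_code  TINYINT,    -- 1 byte:  -128 to 127
  battery_mv   SMALLINT,   -- 2 bytes: -32,768 to 32,767
  sensor_id    INT,        -- 4 bytes: -2,147,483,648 to 2,147,483,647
  event_seq    BIGINT,     -- 8 bytes
  temperature  FLOAT,      -- single precision
  pressure     DOUBLE,     -- double precision
  is_active    BOOLEAN,    -- TRUE or FALSE
  label        STRING,     -- no length given in the declaration
  recorded_at  TIMESTAMP,  -- date and time
  raw_payload  BINARY      -- uninterpreted bytes
);
```

A common design choice is to pick the narrowest integer type that still covers the expected range of values.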
Complex Data Types in Hadoop
ARRAY
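In Hive, the SQL layer where these Hadoop data types are declared, an array is an ordered collection whose elements must all have the same data type. Arrays behave much like arrays in programming languages: elements are accessed by a zero-based index and can be iterated over.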
graph LR
A[Array: scores] --> B[Index 0: 85]
A --> C[Index 1: 92]
A --> D[Index 2: 78]
style A fill:#f9f,stroke:#333
style B fill:#bbf,stroke:#333
style C fill:#bbf,stroke:#333
style D fill:#bbf,stroke:#333
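The indexing and iteration described above can be sketched in HiveQL as follows. This is a minimal example, assuming a Hive version that accepts common table expressions and a SELECT without a FROM clause; the aliases t, s, and score are illustrative names, and the values mirror the diagram.

```sql
-- Build a one-row inline dataset so the query needs no existing table.
WITH t AS (
  SELECT array(85, 92, 78) AS scores
)
SELECT
  scores[0]    AS first_score,   -- indexes are zero-based
  scores[2]    AS last_score,
  scores[5]    AS out_of_range,  -- NULL: the index is past the end
  size(scores) AS n_scores       -- number of elements
FROM t;

-- Iteration: explode() turns each array element into its own row.
WITH t AS (
  SELECT array(85, 92, 78) AS scores
)
SELECT s.score
FROM t
LATERAL VIEW explode(scores) s AS score;
```

In a table definition, the same column would be declared as scores ARRAY<INT>.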
MAP
Maps are collections of key-value pairs in which each key is unique. Keys must be of a primitive type, while values can be of any type. Maps are useful for storing related pairs such as configuration settings or attributes.
graph LR
M[Map: attributes] --> D1[department]
M --> L1[location]
D1 --> V1[IT]
L1 --> V2[NYC]
style M fill:#f9f,stroke:#333
style D1 fill:#bbf,stroke:#333
style L1 fill:#bbf,stroke:#333
style V1 fill:#dfd,stroke:#333
style V2 fill:#dfd,stroke:#333
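A short HiveQL sketch of looking up map entries, using the same keys and values as the diagram. It assumes the same Hive features as any inline query (a common table expression and a FROM-less SELECT); the alias t and the key 'floor' are hypothetical.

```sql
WITH t AS (
  SELECT map('department', 'IT', 'location', 'NYC') AS attributes
)
SELECT
  attributes['department'] AS department,   -- value stored under the key
  attributes['floor']      AS missing_key,  -- NULL: the key is not present
  size(attributes)         AS n_entries,    -- number of key-value pairs
  map_keys(attributes)     AS all_keys,     -- array of keys, order not guaranteed
  map_values(attributes)   AS all_values    -- array of values
FROM t;
```

In a table definition, this column would be declared as attributes MAP<STRING, STRING>.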
STRUCT
Structs are containers that can hold named fields of different data types. They're similar to objects in programming languages and useful for organizing related fields.
Here's a practical example of how STRUCT stores data in a nested structure:
CREATE TABLE employees (
  id INT,
  info STRUCT<
    personal:STRUCT<
      name:STRING,
      age:INT,
      email:STRING
    >,
    work:STRUCT<
      department:STRING,
      position:STRING,
      salary:DOUBLE
    >
  >
);
-- Example data (one row, shown as JSON for readability)
{
  "id": 101,
  "info": {
    "personal": {
      "name": "John Smith",
      "age": 30,
      "email": "john.smith@company.com"
    },
    "work": {
      "department": "Engineering",
      "position": "Senior Developer",
      "salary": 95000.00
    }
  }
}
Here's a visual representation using Mermaid diagram syntax:
graph TD
E[employees] --> ID[id: INT]
E --> INFO[info: STRUCT]
INFO --> P[personal: STRUCT]
INFO --> W[work: STRUCT]
P --> P1[name: STRING]
P --> P2[age: INT]
P --> P3[email: STRING]
W --> W1[department: STRING]
W --> W2[position: STRING]
W --> W3[salary: DOUBLE]
P1 --> PV1["John Smith"]
P2 --> PV2[30]
P3 --> PV3["john.smith@company.com"]
W1 --> WV1["Engineering"]
W2 --> WV2["Senior Developer"]
W3 --> WV3[95000.00]
style E fill:#f9f,stroke:#333
style INFO fill:#bbf,stroke:#333
style P fill:#dfd,stroke:#333
style W fill:#dfd,stroke:#333
In this example, the STRUCT type creates a hierarchy in which related fields are organized into logical groups. The outer STRUCT 'info' contains two inner STRUCTs, 'personal' and 'work', each holding specific attributes of an employee. Grouping fields this way lets a query address any of them by its path, for example info.personal.name, and keeps related fields together in the schema.
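To make the nested access concrete, here is a hedged HiveQL sketch that loads the example row into the employees table defined earlier and then reads individual fields. It assumes a Hive version that allows INSERT ... SELECT without a FROM clause; on versions that do not, the same SELECT can be run against any one-row table.

```sql
-- Load the example row. named_struct() takes alternating field names and values.
-- The CAST keeps the salary a DOUBLE so the struct type matches the column type.
INSERT INTO TABLE employees
SELECT
  101,
  named_struct(
    'personal', named_struct('name', 'John Smith',
                             'age', 30,
                             'email', 'john.smith@company.com'),
    'work',     named_struct('department', 'Engineering',
                             'position', 'Senior Developer',
                             'salary', CAST(95000.00 AS DOUBLE))
  );

-- Read nested fields with dot notation, one level per dot.
SELECT
  id,
  info.personal.name   AS name,
  info.work.department AS department,
  info.work.salary     AS salary
FROM employees
WHERE info.work.department = 'Engineering';
```

Selecting info.personal on its own returns the whole inner STRUCT, which can be handy when all personal fields are needed together.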