Hadoop Data Types Overview

| Category  | Data Type | Description                     |
|-----------|-----------|---------------------------------|
| Primitive | TINYINT   | 1-byte signed integer           |
| Primitive | SMALLINT  | 2-byte signed integer           |
| Primitive | INT       | 4-byte signed integer           |
| Primitive | BIGINT    | 8-byte signed integer           |
| Primitive | FLOAT     | Single precision floating point |
| Primitive | DOUBLE    | Double precision floating point |
| Primitive | BOOLEAN   | True/False values               |
| Primitive | STRING    | Character sequence              |
| Primitive | TIMESTAMP | Date and time values            |
| Primitive | BINARY    | Byte sequence                   |
| Complex   | ARRAY     | Ordered collection of fields    |
| Complex   | MAP       | Key-value pairs                 |
| Complex   | STRUCT    | Container of named fields       |
| Complex   | UNIONTYPE | Different types in same field   |

Additional Notes:

  • Primitive types are basic data types that cannot be broken down further
  • Complex types are composed of multiple primitive types or other complex types
  • All numeric types are signed
  • STRING type has no declared length limit
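
As a rough illustration of the types listed above, the following HiveQL sketch declares one column per primitive type, plus a few complex columns that nest other types. The table name type_demo and all column names are hypothetical and chosen only for this example.

```sql
-- Hypothetical table: every name here is illustrative, not from a real schema.
CREATE TABLE type_demo (
    -- Primitive types
    tiny_col   TINYINT,
    small_col  SMALLINT,
    int_col    INT,
    big_col    BIGINT,
    float_col  FLOAT,
    double_col DOUBLE,
    bool_col   BOOLEAN,
    str_col    STRING,
    ts_col     TIMESTAMP,
    bin_col    BINARY,
    -- Complex types built from primitives or from other complex types
    tags       ARRAY<STRING>,
    history    MAP<STRING, ARRAY<INT>>,
    contact    STRUCT<email:STRING, phones:ARRAY<STRING>>
);

-- List each column with its declared type.
DESCRIBE type_demo;
```

The history and contact columns show the nesting described in the notes: a MAP whose values are arrays, and a STRUCT that contains an ARRAY field.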

Complex Data Types in Hadoop

ARRAY

Arrays in Hadoop (more precisely in Hive, the SQL layer that defines these types) are ordered collections whose elements all share the same data type. As in most programming languages, elements are accessed by a zero-based index and can be iterated over.

graph LR
    A[Array: scores] --> B[Index 0: 85]
    A --> C[Index 1: 92]
    A --> D[Index 2: 78]
    style A fill:#f9f,stroke:#333
    style B fill:#bbf,stroke:#333
    style C fill:#bbf,stroke:#333
    style D fill:#bbf,stroke:#333
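
The scores array in the diagram could be stored and queried roughly as follows. This is a minimal HiveQL sketch: the table student_scores and its columns are hypothetical, and the FROM-less INSERT ... SELECT assumes a reasonably recent Hive version.

```sql
-- Hypothetical table: one row per student, with an ARRAY<INT> of scores.
CREATE TABLE student_scores (
    student_id INT,
    scores     ARRAY<INT>
);

-- Hive has no literals for complex types in INSERT ... VALUES,
-- so the array is built with the array() constructor in a SELECT.
INSERT INTO student_scores
SELECT 1, array(85, 92, 78);

-- Indexing is zero-based; size() returns the number of elements.
SELECT student_id,
       scores[0]    AS first_score,
       size(scores) AS num_scores
FROM student_scores;

-- Iteration: explode() emits one output row per array element.
SELECT student_id, score
FROM student_scores
LATERAL VIEW explode(scores) s AS score;
```

With the row inserted above, the first query returns 85 as first_score and 3 as num_scores, and the second returns three rows, one per score.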

MAP

Maps are collections of key-value pairs where each key must be unique. They're useful for storing related data pairs like configurations or attributes.

graph LR
    M[Map: attributes] --> D1[department]
    M --> L1[location]
    D1 --> V1[IT]
    L1 --> V2[NYC]
    style M fill:#f9f,stroke:#333
    style D1 fill:#bbf,stroke:#333
    style L1 fill:#bbf,stroke:#333
    style V1 fill:#dfd,stroke:#333
    style V2 fill:#dfd,stroke:#333
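
The attributes map from the diagram might be declared and read like this. It is a sketch under assumptions: the table employee_attributes and the aliases are hypothetical names introduced for illustration.

```sql
-- Hypothetical table: each employee carries a MAP of free-form attributes.
CREATE TABLE employee_attributes (
    id         INT,
    attributes MAP<STRING, STRING>
);

-- map() takes alternating keys and values: key1, value1, key2, value2, ...
INSERT INTO employee_attributes
SELECT 101, map('department', 'IT', 'location', 'NYC');

-- Values are looked up by key; a key that is not present yields NULL.
SELECT id,
       attributes['department'] AS department,
       attributes['building']   AS building,        -- NULL: key not in the map
       map_keys(attributes)     AS attribute_names,
       size(attributes)         AS num_attributes
FROM employee_attributes;
```

Because lookups on missing keys return NULL rather than raising an error, a MAP suits sparse or optional attributes, while a STRUCT is the better fit when every row has the same named fields.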

STRUCT

Structs are containers that can hold named fields of different data types. They're similar to objects in programming languages and useful for organizing related fields.

Here's a practical example of how STRUCT stores data in a nested structure:

CREATE TABLE employees (
    id INT,
    info STRUCT<
        personal:STRUCT<
            name:STRING,
            age:INT,
            email:STRING
        >,
        work:STRUCT<
            department:STRING,
            position:STRING,
            salary:DOUBLE
        >
    >
);

-- Example row, shown as JSON for readability (this part is not HiveQL)
{
    "id": 101,
    "info": {
        "personal": {
            "name": "John Smith",
            "age": 30,
            "email": "john.smith@company.com"
        },
        "work": {
            "department": "Engineering",
            "position": "Senior Developer",
            "salary": 95000.00
        }
    }
}

Here's a visual representation using Mermaid diagram syntax:

graph TD
    E[employees] --> ID[id: INT]
    E --> INFO[info: STRUCT]
    INFO --> P[personal: STRUCT]
    INFO --> W[work: STRUCT]
    
    P --> P1[name: STRING]
    P --> P2[age: INT]
    P --> P3[email: STRING]
    
    W --> W1[department: STRING]
    W --> W2[position: STRING]
    W --> W3[salary: DOUBLE]
    
    P1 --> PV1["John Smith"]
    P2 --> PV2[30]
    P3 --> PV3["john.smith@company.com"]
    
    W1 --> WV1["Engineering"]
    W2 --> WV2["Senior Developer"]
    W3 --> WV3[95000.00]
    
    style E fill:#f9f,stroke:#333
    style INFO fill:#bbf,stroke:#333
    style P fill:#dfd,stroke:#333
    style W fill:#dfd,stroke:#333

In this example, the STRUCT type creates a hierarchical structure where related data is organized in logical groups. The outer STRUCT 'info' contains two inner STRUCTs, 'personal' and 'work', each holding specific attributes of an employee. Nested fields are addressed with dot notation (for example, info.personal.name), so related fields can be managed and queried together.
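
To make the querying side concrete, here is a hedged HiveQL sketch that loads the example row into the employees table defined earlier and reads nested fields back. The explicit CAST to DOUBLE is an assumption made so that the constructed struct matches the declared salary type exactly.

```sql
-- Build the nested value with named_struct(), which takes alternating
-- field names and values. Complex types cannot be written as literals
-- in INSERT ... VALUES, hence the INSERT ... SELECT form.
INSERT INTO employees
SELECT 101,
       named_struct(
           'personal', named_struct(
               'name',  'John Smith',
               'age',   30,
               'email', 'john.smith@company.com'),
           'work', named_struct(
               'department', 'Engineering',
               'position',   'Senior Developer',
               'salary',     CAST(95000.00 AS DOUBLE))
       );

-- Dot notation walks down the hierarchy one level per dot.
SELECT id,
       info.personal.name   AS name,
       info.work.department AS department,
       info.work.salary     AS salary
FROM employees
WHERE info.work.salary > 90000;
```

Nested fields can appear anywhere an ordinary column can, including WHERE, GROUP BY, and ORDER BY clauses.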