

Hive is an alternative way to program MapReduce. With Hive, you write SQL-like queries (HiveQL), which are then converted into MapReduce programs.
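
For a concrete sense of this, here is a minimal HiveQL sketch (the `page_views` table and its columns are made up for illustration): the table definition maps onto files in HDFS, and Hive compiles the SELECT into one or more MapReduce jobs.

```sql
-- Hypothetical table over tab-delimited files in HDFS (illustrative names)
CREATE TABLE IF NOT EXISTS page_views (
  user_id   STRING,
  page_url  STRING,
  view_time TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- A declarative query; Hive translates it into MapReduce job(s) under the hood
SELECT page_url, COUNT(*) AS views
FROM page_views
GROUP BY page_url
ORDER BY views DESC
LIMIT 10;
```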

 

 

 

Here's a comparison between Pig and Hive:

| Feature | Pig | Hive |
|---|---|---|
| Language | Pig Latin (procedural) | HiveQL (declarative, SQL-like) |
| Learning Curve | Moderate | Easy for SQL users |
| Data Model | Schema-optional | Schema-mandatory |
| Use Case | ETL and data pipeline processing | Data warehousing and SQL queries |
| Performance | Better for complex data transformations | Better for structured data analysis |
| Data Types | Rich set of data types and nested structures | Primitive and complex types, similar to SQL |
| Developed By | Yahoo | Facebook |
| Extensibility | UDFs in Java, Python, etc. | UDFs primarily in Java |

This comparison highlights the key differences in how the two tools approach data processing in the Hadoop ecosystem.
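
To make the procedural-vs-declarative row concrete: a Pig Latin script typically spells out each step (LOAD, JOIN, GROUP, FOREACH ... GENERATE, STORE), whereas HiveQL states only the result you want and lets Hive plan the steps. Here is a sketch using hypothetical `users` and `orders` tables:

```sql
-- Declarative HiveQL: one statement describes the desired result.
-- (users and orders are assumed tables, used only for illustration.)
SELECT u.country, COUNT(o.order_id) AS order_count
FROM users u
JOIN orders o ON u.user_id = o.user_id
GROUP BY u.country;

-- The equivalent Pig Latin script would LOAD both datasets, JOIN them,
-- GROUP the join result, FOREACH ... GENERATE the counts, then STORE the output.
```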

 

Hive Architecture

The Hive architecture consists of several key components that work together to translate SQL-like queries into MapReduce jobs:

graph TD
    subgraph Query_Processing
        A["User Interface"] --> B["Driver"]
        B --> C["Query Processing Engine"]
        C -->|"Parsed & Optimized Query"| D["Executor"]
    end

    subgraph Query_Engine["Query Processing Engine"]
        P["Parser"]
        Q["Semantic Analyzer"]
        R["Query Optimizer"]
        P --> Q --> R
    end

    subgraph Hadoop_Environment
        M["NameNode"] 
        N["DataNodes"]
        O["MapReduce"]
        M --> N
        O --> M
        O --> N
    end

    D --> Hadoop_Environment
    D --> F["MetaStore"]
    F --> H["Database"]

    %% Mode of Interaction
    I["CLI"] --> A
    J["Web UI"] --> A
    K["JDBC/ODBC"] --> A
    L["API"] --> A

 

Key components:

  • User Interface: Multiple modes of interaction including Command Line Interface (CLI), Web UI, JDBC/ODBC drivers, and API
  • Query Processing Engine: Combines parser, semantic analyzer, and query optimizer to process and optimize HiveQL queries
  • Executor: Executes the tasks using Hadoop MapReduce
  • MetaStore: Stores metadata about tables, columns, partitions, and schema (see the metadata commands after this list)
  • Hadoop Environment: Consists of NameNode for metadata management, DataNodes for storage, and MapReduce for distributed processing
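
As an illustration of what the MetaStore holds, the following commands (using the hypothetical `page_views` table from earlier) are answered from metadata rather than by scanning data files in HDFS:

```sql
SHOW TABLES;                    -- tables registered in the current database
DESCRIBE FORMATTED page_views;  -- columns, storage format, and HDFS location
SHOW PARTITIONS page_views;     -- partition list, if the table is partitioned
```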

This architecture allows Hive to efficiently process SQL-like queries by converting them into MapReduce jobs while maintaining metadata and providing multiple interaction options for users.
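
A simple way to observe this flow is HiveQL's EXPLAIN, which returns the plan produced by the parser, semantic analyzer, and optimizer without executing it; on the classic MapReduce engine, the output lists the stages the Executor would submit (again using the hypothetical `page_views` table):

```sql
EXPLAIN
SELECT page_url, COUNT(*) AS views
FROM page_views
GROUP BY page_url;

-- On a MapReduce execution engine, the plan typically shows a "Map Reduce"
-- stage with a map operator tree (TableScan, Group By) and a reduce operator
-- tree (Group By, File Output), which the Executor then runs on Hadoop.
```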