Apache Hive
Apache Hive is an open-source data warehouse system built on top of Apache Hadoop. It allows you to query and analyze large datasets using an SQL-like language called HiveQL (HQL). Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive.
Initially developed by Facebook, Apache Hive is now used and developed by other companies such as Netflix and the Financial Industry Regulatory Authority (FINRA). Amazon maintains a software fork of Apache Hive that is included in Amazon Elastic MapReduce on Amazon Web Services.
Why use Hive?
• To run SQL queries on large datasets
• To perform data warehousing and analytics
• To simplify Hadoop usage (no need to write Java MapReduce jobs)
• To process structured and semi-structured data
• To support ETL (Extract, Transform, Load) workflows
When should you use Hive?
Hive is a good fit when:
• You need batch processing of large datasets
• Your queries are analytical (OLAP)
• You are working with data lakes (HDFS, S3, etc.)
• You want SQL-like access to big data
• Real-time performance is not required
Not ideal when:
• You need real-time or low-latency queries
• You require frequent row-level updates
• You need transaction-heavy (OLTP) systems
• You want fast interactive querying (unless using newer engines)
Key features of Hive
• SQL-like query language (HiveQL)
• Schema-on-read (structure applied at query time)
• Works on top of Hadoop storage (HDFS)
• Batch processing oriented
• Extensible (UDFs, custom scripts)
• Integration with engines like Apache Tez and Apache Spark
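Schema-on-read means a table definition is simply projected onto files already in storage; nothing is rewritten when the table is created. A minimal sketch in HiveQL (the path, table, and column names are hypothetical):

```sql
-- Project a schema onto CSV files already sitting in HDFS.
-- The files are left untouched; the schema is applied at query time.
CREATE EXTERNAL TABLE page_views (
  user_id    STRING,
  url        STRING,
  view_time  TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/logs/page_views';   -- hypothetical HDFS path
```

Because the table is EXTERNAL, dropping it removes only the metadata in the Metastore; the underlying files remain in place.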
Key components of Apache Hive
• HiveQL (HQL): SQL-like language for querying data
• Driver: Manages query lifecycle
• Compiler: Converts queries into execution plans
• Metastore: Stores metadata (tables, schemas, partitions)
• Execution Engine: Runs queries using MapReduce, Tez, or Spark
• SerDe (Serializer/Deserializer): Converts data between storage format and Hive tables
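The driver-and-compiler pipeline can be observed directly: HiveQL's EXPLAIN statement prints the execution plan the compiler produces before the execution engine runs it. A sketch (the table name is illustrative):

```sql
-- Ask the compiler for the execution plan instead of running the query.
EXPLAIN
SELECT url, COUNT(*) AS views
FROM page_views          -- hypothetical table
GROUP BY url;
```

The output shows the stages (map/reduce or Tez vertices) the query will be translated into, which is useful for understanding why a given query is slow.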
Advantages
• Familiar SQL interface (easy for analysts)
• Works well with massive datasets
• Scalable and distributed
• Integrates with the Hadoop ecosystem
• Supports data warehousing use cases
Disadvantages
• High latency (not real-time)
• Limited support for row-level operations
• Performance depends on underlying execution engine
• Not ideal for interactive analytics
• Requires Hadoop ecosystem setup
Alternatives
Depending on your needs, here are common alternatives:
• Presto (and its fork, Trino): Fast, interactive SQL queries on big data
• Apache Spark SQL: Faster processing with in-memory computation
• Google BigQuery: Fully managed, serverless analytics
• Amazon Redshift: Managed data warehouse for analytics
• Snowflake: Modern cloud-native data warehouse
Other Features of Hive
Apache Hive supports analysis of large datasets stored in Hadoop's HDFS and compatible file systems such as the Amazon S3 filesystem. It provides an SQL-like language called HiveQL with schema-on-read and transparently converts queries into MapReduce, Apache Tez, or Spark jobs. All three execution engines can run in Hadoop YARN. To accelerate queries, it provides indexes, including bitmap indexes. Other features of Hive include:
• Indexing for query acceleration; index types include compaction and bitmap indexes as of Hive 0.10, with more index types planned.
• Different storage types such as plain text, RCFile, HBase, ORC, and others.
• Metadata storage in an RDBMS, significantly reducing the time to perform semantic checks during query execution.
• Operating on compressed data stored in the Hadoop ecosystem, using algorithms including DEFLATE, BWT, Snappy, etc.
• Built-in user-defined functions (UDFs) for manipulating dates, strings, and other data types. Hive supports extending the UDF set to handle use cases not covered by the built-in functions.
• SQL-like queries (HiveQL), which are implicitly converted into MapReduce, Tez, or Spark jobs.
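A few of the built-in date and string UDFs mentioned above can be combined in a single query; a sketch (table and column names are illustrative):

```sql
SELECT
  upper(url)                                  AS url_upper,  -- built-in string UDF
  to_date(view_time)                          AS view_day,   -- built-in date UDF
  datediff(current_date, to_date(view_time))  AS age_days    -- days since the view
FROM page_views;   -- hypothetical table
```

When the built-in set is not enough, a custom UDF can be written in Java, packaged as a JAR, and registered with CREATE FUNCTION.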
By default, Hive stores metadata in an embedded Apache Derby database; other client/server databases such as MySQL can optionally be used.
Four file formats are supported in Hive: TEXTFILE, SEQUENCEFILE, ORC, and RCFILE. Apache Parquet can be read via a plugin in versions later than 0.10 and natively starting at 0.13. Additional Hive plugins support querying of the Bitcoin blockchain.
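The STORED AS clause selects among these file formats at table-creation time. A sketch converting an existing table to columnar ORC storage (the table names are illustrative, and the compression property is an assumption about the desired codec):

```sql
-- Rewrite an existing table into columnar ORC storage
-- with Snappy compression, via CREATE TABLE ... AS SELECT.
CREATE TABLE page_views_orc
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY')
AS SELECT * FROM page_views;   -- hypothetical source table
```

Columnar formats like ORC typically compress better and scan faster for analytical queries than plain TEXTFILE, since only the referenced columns are read.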