Hive is an Apache Software Foundation project that originated at Facebook. It is a data warehousing system built on top of Hadoop for analysing big data with an SQL-like query language. This blog gives an overview of the Hive architecture and its design goals.
RDBMS and NoSQL databases could not keep up with exponential data growth, which created the need for an enterprise data warehouse system. Data from databases such as MySQL and Oracle was dumped into the Hadoop DFS, and MapReduce programs were written to analyse it. HDFS provided reliable and scalable storage, but having to write a new MapReduce program for every analysis, and having nowhere to keep the table schemas, were real obstacles to data analysis on Hadoop. Hive was developed to address these data warehousing challenges. It is commonly used for log processing and text mining, that is, extracting useful information from machine-generated logs and documents. Other applications of Hive include web analytics and document indexing.
The main design goals of Hive are:
- Easy data summarisation
- Analysis of petabytes of data stored across multiple machines in the Hadoop DFS
- Support for ad-hoc queries through the Hive Query Language (HiveQL), an SQL-like language
- Pluggable custom mappers and reducers
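As a brief illustration of these goals, the sketch below shows an ad-hoc HiveQL query alongside a custom mapper plugged in through the `TRANSFORM` clause. The table, columns, and the `parse_ip.py` script are hypothetical placeholders, and the queries assume a running Hive deployment:

```sql
-- Hypothetical log table; Hive keeps the schema in its metastore.
CREATE TABLE web_logs (ip STRING, url STRING, ts STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Ad-hoc summarisation in an SQL-like language, no hand-written MapReduce.
SELECT url, COUNT(*) AS hits
FROM web_logs
GROUP BY url;

-- Pluggable custom mapper: stream rows through a user-supplied script.
ADD FILE parse_ip.py;
SELECT TRANSFORM (ip) USING 'python parse_ip.py' AS (country STRING)
FROM web_logs;
```

Hive compiles the first query down to MapReduce jobs itself, while `TRANSFORM` lets users inject their own row-processing logic where SQL alone is not enough.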
The diagram below shows the main components of Hive and how they fit together.
Web UI / CLI / Server : Interfaces through which users submit queries and other operations to the system.
Driver : Creates a session handle for the query and sends the query to the compiler.
Compiler : Parses the query, performs semantic analysis using metadata from the metastore, and generates an execution plan.
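The compiler's output can be inspected with `EXPLAIN`, which prints the execution plan (in classic Hive, a DAG of map/reduce stages) without running the query. The table name here is hypothetical:

```sql
-- Show the plan for an aggregation over a hypothetical table.
-- The output lists the stages (e.g. a map-side scan and a reduce-side
-- aggregation) that the execution engine will run.
EXPLAIN
SELECT url, COUNT(*) FROM web_logs GROUP BY url;
```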
Metastore : Stores all the metadata about the tables created: their partitions and buckets, the schemas, the columns and their types, and so on. Provides the data abstraction and data discovery layer for Hive.
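A sketch of the kind of metadata the metastore tracks, using a hypothetical partitioned and bucketed table:

```sql
-- Partition and bucket definitions live in the metastore, not in the data files.
CREATE TABLE page_views (user_id BIGINT, url STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS;

-- Query the metastore through Hive's data-discovery commands.
SHOW PARTITIONS page_views;
DESCRIBE FORMATTED page_views;
```

`DESCRIBE FORMATTED` prints the schema, storage location, and table properties that the compiler consults during semantic analysis.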
Execution Engine : Executes the plan created by the compiler. The execution target can be a Hadoop cluster or the local machine.
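Which target runs the plan can be controlled per session. For example, classic Hive can execute small jobs in a local JVM instead of submitting them to the cluster; this is a sketch, and the exact settings depend on the Hive and Hadoop versions in use:

```sql
-- Let Hive automatically run small queries locally instead of on the cluster.
SET hive.exec.mode.local.auto=true;

-- Or force local execution for the session (MapReduce-era setting).
SET mapreduce.framework.name=local;
```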