Incremental Updates

Hadoop and Hive are quickly evolving to outgrow previous limitations for integration and data access. A common workaround for Hive's lack of row-level Updates has been to periodically drop and fully reload a table from its source. While this approach may work for smaller data sets, it may be prohibitive at scale. As true Inserts and Updates are not yet available in Hive, we need to consider a process of preventing duplicate records as Updates are appended to the cumulative record set.
In this blog, we will look at a four-step strategy for appending Updates and Inserts from delimited and RDBMS sources to existing Hive table definitions. While there are several options within the Hadoop platform for achieving this goal, our focus will be on a process that uses standard SQL within the Hive toolset.
Hive Table Definition Options

External Tables are Hive tables where the table definition exists independently of the data, so that, if the table is dropped, the underlying HDFS folders and files remain in their original state. Local (managed) Tables are Hive tables that are directly tied to the source data: the data is physically tied to the table definition and will be deleted if the table is dropped.

The following process outlines a workflow that leverages all of the above in four steps:

1. Ingest: Load the complete source record set into the Base table once, then land each batch of new change records alongside it.
2. Reconcile: Combine the Base table and the change records into a single view that resolves each record to its most up-to-date version.
3. Compact: Materialize that reconciled view into the Reporting table.
4. Purge: Replace the Base table with the Reporting table contents and delete any previously processed Change records before the next Data Ingestion cycle.
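To make the drop semantics concrete, here is a minimal HiveQL sketch; the table names, columns, and path are illustrative placeholders, not definitions from this article:

```sql
-- Local (managed) table: Hive owns the data, so DROP TABLE removes
-- both the definition and the underlying HDFS files.
CREATE TABLE managed_example (
  id INT,
  field1 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- External table: Hive owns only the definition, so DROP TABLE leaves
-- the files under LOCATION in their original state.
CREATE EXTERNAL TABLE external_example (
  id INT,
  field1 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/external_example';
```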
The tables and views that will be a part of the Incremental Update Workflow are:

- base_table: A Hive Local table that initially holds all records from the source system. After the initial processing cycle, it will maintain a copy of the most up-to-date synchronized record set from the source.
- incremental_table: A Hive External table that holds the incoming Insert and Update records. At the end of each processing cycle, it is cleared of content, as explained in Step 4 (Purge).
- reconcile_view: A Hive View that combines the base_table and incremental_table content and reduces it to the most up-to-date record set.
- reporting_table: A Hive Local table that holds the reconciled records for reporting and is used to refresh the base_table during the Purge step.

Source data can arrive either as delimited files or directly from an RDBMS. Regardless of the ingest option, the processing workflow in this article requires:

1. A one-time, initial load to move all data from the source table into Hive (a hedged sketch follows this list).
2. On-going, scheduled or event-driven loads that move only the new Insert and Update records into Hive.
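For the one-time load from an RDBMS source, one common option is Apache Sqoop (not shown in this excerpt); a minimal sketch, in which the connection string, credentials, table name, and target directory are all illustrative placeholders:

```bash
# One-time initial load: pull the full source table into the HDFS
# folder that will back the Hive base table (all names illustrative).
sqoop import \
  --connect jdbc:mysql://source-host/source_db \
  --username myuser --password mypassword \
  --table SOURCE_TBL \
  --target-dir /user/hive/base_table \
  -m 1
```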
File Processing

For this blog, we assume that a file or set of files within a folder will have a delimited format and will have been generated from a relational system (i.e., the records represent extracts of source database tables).

Files will need to be moved into HDFS using standard ingest options:

- WebHDFS: Primarily used when integrating with applications; a Web URL provides an upload end-point into a designated HDFS folder.
- NFS: Appears as a standard network drive and allows end-users to use standard Copy-Paste operations to move files from standard file systems into HDFS.

Once the initial set of records is moved into HDFS, subsequent scheduled events can move files containing only the new Inserts and Updates; a scripted variant of the same move is sketched below.
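Both options above are interactive or application-driven; for scheduled events, the same move can be scripted with the standard HDFS command line. A minimal sketch, with illustrative file and folder names:

```bash
# Land a delimited change-file extract in the HDFS folder that backs
# the external incremental table (names are illustrative).
hdfs dfs -mkdir -p /user/hive/incremental_table
hdfs dfs -put /landing/changes_20150101.csv /user/hive/incremental_table/
```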
Reconcile

In order to support an on-going reconciliation between the current records in Hive and the new change records, two tables should be defined: base_table and incremental_table. The base_table will house the initial, complete record load from the source system. After the first processing run, it will house the on-going, most up-to-date set of records from the source system.
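The DDL itself is not preserved in this excerpt. A minimal sketch consistent with the descriptions above, assuming a comma-delimited schema keyed on id with a modified_date change-tracking column (all column names are illustrative):

```sql
-- base_table: a managed (Local) Hive table. Hive owns its data, which
-- is refreshed from the reporting table during Step 4 (Purge).
CREATE TABLE base_table (
  id INT,
  field1 STRING,
  modified_date DATE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
```

The companion incremental_table would be declared EXTERNAL over the folder where the change files land, so that clearing that folder between cycles leaves the definition intact.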