Importing Data into Hive using Sqoop

Posted on July 15, 2016December 16, 2018 by ProTechSkills

Sqoop’s import tool’s main function is to upload your data into files in HDFS. If you have a Hive metastore associated with your HDFS cluster, Sqoop can also import the data into Hive by generating and executing a CREATE TABLE statement to define the data’s layout in Hive.

Related Posts:

How import works in Hive

First of all, data from RDBMS to HDFS is imported at default location (/user

/$USER_NAME) or at location specified by –target-dir option.

Next, Sqoop will generate a Hive script containing a CREATE TABLE operation defining your columns using Hive’s types.

In last, a LOAD DATA INPATH statement is used to move the data files into Hive’s warehouse directory.

Importing a table from RDBMS to Hive

Use the following command to import an entire table:

sqoop import -m 1  --connect jdbc:mysql://localhost:3306/protechskills --username root --password root --table student --hive-table student --create-hive-table --hive-import

Above command will run map-reduce job and import the entire data from MySQL of student table to Hive table.

Argument	Description
–hive-import	Used to specify hive import. Import tables into Hive
–create-hive-table	Create Hive Table automatically. If the target hive table already exists then job will fail. By default this property is false.
–hive-table <table-name>	Sets the table name to use when importing to Hive.
–hive-home <dir>	Override value of environment variable `$HIVE_HOME`
–hive-overwrite	Overwrite existing data in the Hive table.
–hive-drop-import-delims	Drops \n, \r, and \01 from string fields when importing to Hive.

Here

-m 1 defines number of mappers; in this example it is 1. Sqoop performs the data transfer in parallel manner that depends on the number of mapper. By default there are 4 mappers but we can set the number of mappers as per the requirement. However it is recommended not to use high number of mappers as it will affect the transfer and it might make the transfer slow.

–connect argument is for specifying the RDBMS connection URL.

–create-hive-table : If set, then the job will fail if the target hive table exits. By default this property is false.

NOTE: The table name used in Hive is, by default, the same as that of the source table. You can control the output table name with the –hive-table option.

NOTE: If you have multiple Hive installations, or hive is not in your $PATH, use the –hive-home option in command to identify the Hive installation directory.

If the Hive table already exists, you can specify the –hive-overwrite option to indicate that existing table in hive must be replaced.

Hive will have problems using Sqoop-imported data if your database’s rows contain string fields that have Hive’s default row delimiters (\n and \r characters) or column delimiters (\01 characters) present in them. You can use the --hive-drop-import-delims option to drop those characters on import to give Hive-compatible text data.

NOTE: While importing the data into Hive, default field delimiter is ^A and the record delimiter is \n.

To import the partial data like few columns, few rows or result of query, you can use flags as described in Importing Data using Sqoop.

Related Posts

PROTECHSKILLS

Importing Data into Hive using Sqoop

Importing a table from RDBMS to Hive

Leave a Reply Cancel reply

Get in touch with us today! Let's Talk About Your Needs.