Importing Data using Sqoop

Posted on July 8, 2016November 20, 2016 by ProTechSkills

Sqoop is an Apache Hadoop top-level project and designed to move data between Hadoop and RDBMS. Sqoop is a collection of related tools.

To use Sqoop, you specify the tool you want to use and the arguments that control the tool.

sqoop tool-name [tool-arguments]

In this post, we will cover how to import data from MySQL to Hadoop and Hive.

Before importing data let’s create some sample data in MySQL.

Creating sample data in MySQL

Run below command and enter password to open MySQL:

mysql -u root -p

Now create a database “protechskills” with table “student” using below commands:

Create database protechskills;
use protechskills;
create table student (id int, name nvarchar(500), age int);
insert into student values(1,'john',25);
insert into student values(2,'kate',30);
insert into student values(3,'mike',40);

Now verify the data in mysql as shown in below image:

Mysql-Student-Data

Here table student has 3 schema columns id, name, and age. Total number of rows in this table is 3. Now we will import the same into HDFS by using Sqoop.

Importing a table from RDBMS to HDFS

The import tool imports an individual table from an RDBMS to HDFS. Each row from a table is represented as a separate record in HDFS.

Use the following command to import an entire table:

Syntax:

sqoop import -m 1 --connect jdbc:mysql://localhost:3306/$DATABASE_NAME --username $USER_NAME --password $PASSWORD --table $TABLE_NAME --target-dir $TARGET_DIR

In our case command will be as shown below:

sqoop import -m 1 --connect jdbc:mysql://localhost:3306/protechskills --username root --password root --table student --target-dir /sqoopdata

Above command will run map-reduce job and import the entire data from MySQL of student table to HDFS.

Here

-m 1 defines number of mappers; in this example it is 1. Sqoop performs the data transfer in parallel manner that depends on the number of mapper. By default there are 4 mappers but we can set the number of mappers as per the requirement. However it is recommended not to use high number of mappers as it will affect the transfer and it might make the transfer slow.

–connect argument is for specifying the database connection URL.

–target-dir specifies location of the output/target directory in HDFS. If this argument is not specified then It will create a directory with the same name as the imported table inside your home directory (e.g./user/hadoop) on HDFS and import all the data there.

Now you verify the imported data in Hadoop using web interface (http://NAMENODE_IP:50070/explorer.html#/) or using commands.

Run below command to list imported data:

hadoop fs -ls /sqoopdata

Now run below command to read content of imported data:

hadoop fs -cat /sqoopdata/part-m-00000

Importing selected columns of a table from RDBMS to HDFS

By default, the import query will select all the columns of the input table for import, but we can select the subset of columns by specifying the comma-separated list of columns in the –columns argument.

Below query will only import two columns (name,age) of student table.

sqoop import -m 1 --connect jdbc:mysql://localhost:3306/protechskills --username root --password root --table student --target-dir /sqoopdatadir --columns "name,age"

Importing selected rows of a table from RDBMS to HDFS

By default, all the rows of the input table will be imported to HDFS, but we can control which rows need to be imported by using a –where argument in the import statement.

Below query will import only those rows into HDFS where the value of the age column is greater than 27.

sqoop import -m 1 --connect jdbc:mysql://localhost:3306/protechskills --username root --password root --table student --target-dir /sqoopdatadir --where "age > 27"

Note: You can also use the –where and –column arguments at the same time.

Importing data using Free-form Query

Sqoop can also import the result set of an arbitrary SQL query. Instead of using the –table, –columns and –where arguments, you can specify a SQL statement with the –query argument.

When importing a free-form query, you must specify a destination directory with –target-dir.

sqoop import -m 1 --connect jdbc:mysql://localhost:3306/protechskills --username root --password root --query 'SELECT * FROM student where $CONDITIONS' --target-dir /sqoopdata

If you want to import the results of a query in parallel, then each map task will need to execute a copy of the query, with results partitioned by bounding conditions inferred by Sqoop. Your query must include the token $CONDITIONS in where clause, which each Sqoop process will replace with a unique condition expression. You must also select a splitting column with –split-by.

sqoop import --connect jdbc:mysql://localhost:3306/protechskills --username root --password root --query 'select * from student where $CONDITIONS' --target-dir /sqoopdata --split-by id

Above command will run 4 parallel processes to import the data.

Importing all the Tables of a Database to HDFS

We can import all the tables present in a database by using import-all-tables tool of sqoop. Data from each table is stored in a separate directory in HDFS.

To use import-all-tables tool, the following conditions must be met:

Each table must have a single-column primary key or -m 1 option must be used to run single process.
You must intend to import all columns of each table.

Run below command to import all tables of protechskills database.

sqoop import-all-tables --connect jdbc:mysql://localhost:3306/protechskills --username root --password root -m 1

–target-dir option don’t work in this tool. All tables will be imported into user home directory(e.g./user/hadoop) in HDFS.

However if you want to import almost all the tables from the database except a few tables, then you can use –exclude-tables option in command as shown below:

sqoop import-all-tables --connect jdbc:mysql://localhost:3306/protechskills --username root --password root -m 1 --exclude-tables ABC,SALES

Above command will import all the tables from protechskills database except tables ABC and SALES.

Related Posts

PROTECHSKILLS