To process and analyze data in Hadoop, it requires loading data into Hadoop file system that is present on Application server and databases. Sqoop is a tool designed to transfer data between Hadoop and relational databases or mainframes. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle or a mainframe into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.
Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported.
Apache Sqoop that is short form of “SQL to Hadoop” is developed to serve the purpose of transferring the data between relational databases to Hadoop both ways.
Sqoop uses MapReduce (Hadoop’s execution engine) to perform the transfer between RDBMS and HDFS; however there is only mapper phase and no reducer phase in this process. Sqoop performs the data transfer in parallel manner that depends on the number of mapper. By default there are 4 mappers but we can set the number of mappers as per the requirement. However it is recommended not to use high number of mappers as it will affect the transfer and it might make the transfer slow.
Sqoop is a command-line tool; it supports Linux operating system. There are various ways to install Sqoop; one way is to use binary tarball that is provided with every release. The other way is to use operating system-specific packages such as rpm package for Red Hat, CentOS, SUSE and deb package for Ubuntu and Debian.
Sqoop is not a cluster service, it need not be installed on every node of the cluster. As a Hadoop application, Sqoop requires that the Hadoop libraries and configurations be available on the machine. If you want to import your data into HBase and Hive, Sqoop will need those libraries. For common functionality, these dependencies are not mandatory.
Follow these steps to install Sqoop on any Linux machine using tarball method.
- If you have archive binary of Sqoop (e.g. sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz) then copy it into your Linux VM at any location (e.g. /opt) using WinSCP software.
You can also download the latest version(binary tarball format) of Sqoop from here.
NOTE: If you are getting permission error then run below command on Linux terminal or putty terminal to change the permissions: sudo chmod -R 777 /opt
- Go to the location where you have copied the sqoop software package and un-compress it.
cd /opt tar -xzvf sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz
- Rename the name of sqoop directory:
mv sqoop-1.4.5.bin__hadoop-2.0.4-alpha sqoop-1.4.5
- Set Environment variables in .bash_profile:
Add below lines in end of file:
export SQOOP_HOME=/opt/sqoop-1.4.5 export PATH=$PATH:$SQOOP_HOME/bin
Run below command to update environment variables in current session:
5. Sqoop needs JDBC connector driver for making connection with RDBMS. We need to first download the specific JDBC driver from the database vendor’s website (they are available there free of cost). After downloading the driver we need to copy the driver’s JAR file to Sqoop’s lib directory.
If you have mysql-connector jar then copy it into lib folder of SQOOP_HOME.
Or run below commands to download it and copy it into Sqoop:
wget http://ftp.ntu.edu.tw/MySQL/Downloads/Connector-J/mysql-connector-java-5.1.30.tar.gz tar -xzvf mysql-connector-java-5.1.30.tar.gz cp mysql-connector-java-5.1.30/mysql-connector-java-5.1.30-bin.jar $SQOOP_HOME/lib
6. Run below command to verify the installation
It will list all commands of command. Output of this command will look like as:
Now you can run any command available in sqoop for importing or exporting the data between RDBMS and Hadoop ecosystems.
About the Author –