When you don’t want to import the whole table, instead just the newly added or altered rows of the table then you can use incremental import feature of Sqoop. This saves considerable resources. It periodically syncs the table to the HDFS. There are various ways to do that.
Sqoop supports 2 types of incremental imports:
Incremental import mode can be defined using –incremental argument in the command.
- Importing only the new data (Append Mode)
When in a table only new rows are inserted and the previous data is not changed/altered then you can use Sqoop’s append mode. In this mode only the newly added rows of the table of the RDBMS is appended to the HDFS.
This mode requires two parameters —check-column and —last-value which provide the information about the column which will be checked for appended value and the last value which is updated in HDFS.
sqoop import -m 1 --connect jdbc:mysql://localhost:3306/protechskills --username root --password root --table student --target-dir /sqoopdata --incremental append --check-column id --last-value 3
This will import only those rows of the table whose ID value is greater than 3.
You can use the same student table which was created in previous post and add some new rows in student table to check the incremental append import.
- Importing the updated data of the table (Last-modified Mode)
If in addition to newly added rows, your previous data is also updated then the append mode fails. For this purpose you can use lastmodified mode of Sqoop.
This mode also requires two parameters —check-column and —last-value, where –check-column argument specify the column name having a date value (datetime/timestamp) which will provide the information about the time when each row was last modified.
And –last-value argument specify the last value of field which was updated in HDFS, to which data already has been imported.
This mode will import only the data after the last modified time value, hence this column should be updated to the current time after every update or addition of new row in source RDBMS.
sqoop import -m 1 --connect jdbc:mysql://localhost:3306/sqoopdata --username root --password root --table abc --incremental lastmodified --check-column last-modified-time --last-value "07-24-2016 22:59:01"
This will check the column “last-modified-time” and will import the data whose value will be greater than “07-24-2016 22:59:01”.
NOTE: The above two modes of import requires one to remember the last-value, so that the data can be appended or modified with reference to that last value.
About the Author –
- Introduction to Sqoop and Installation
- Importing Data using Sqoop
- Importing Data into Hive using Sqoop