
Sqoop: An Introduction


At the heart of Hadoop lie multiple facets of data management. One of Hadoop's major capabilities is the ability to transfer data to and from its distributed file system, HDFS. One of the frameworks capable of this data transfer is Sqoop.

Sqoop is a command-line tool used for importing and exporting data between Hadoop and relational databases. Importing is the process of bringing data into Hadoop; exporting is the process of taking data from Hadoop and putting it back into the relational database. Sqoop manages both of these processes through its import and export tools.
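As a minimal sketch, a basic import might look like the command below. The JDBC URL, username, database, table, and target directory are all placeholder values, not something taken from this post:

  # Import the "customers" table from a MySQL database into HDFS;
  # -P prompts for the database password on the console
  sqoop import \
    --connect jdbc:mysql://dbserver/salesdb \
    --username sqoop_user -P \
    --table customers \
    --target-dir /data/customers

The matching export step is sketched later in this post.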


Like Hadoop, Sqoop is written in Java. Java provides an API called Java Database Connectivity (JDBC), which allows applications to access data stored in an RDBMS and inspect the nature of that data. If a database platform provides a JDBC driver, Sqoop can work with it directly. If not, it is possible to use Sqoop connectors to gain access to such external systems.
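The practical difference shows up in the connect arguments. As an illustrative sketch (the JDBC URLs and the driver class name here are hypothetical), Sqoop can infer the driver from a URL it recognizes, or be handed a driver class explicitly for a generic JDBC source:

  # Driver inferred from a recognized JDBC URL (e.g. MySQL)
  sqoop import \
    --connect jdbc:mysql://dbserver/salesdb \
    --username sqoop_user -P \
    --table customers

  # Generic JDBC source: name the driver class explicitly
  sqoop import \
    --driver com.example.jdbc.ExampleDriver \
    --connect jdbc:example://dbserver/salesdb \
    --username sqoop_user -P \
    --table customers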

Sqoop relies on the database to describe the schema of the data to be imported, and uses MapReduce to import and export the data, which provides parallel operation.
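Because the import runs as a MapReduce job, its parallelism is controlled by the number of map tasks. A brief sketch, where the column name customer_id is an assumed primary key rather than a value from this post:

  # Run the import as 4 parallel map tasks, splitting the table's
  # rows across mappers by ranges of the customer_id column
  sqoop import \
    --connect jdbc:mysql://dbserver/salesdb \
    --username sqoop_user -P \
    --table customers \
    --split-by customer_id \
    --num-mappers 4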

How to Use Sqoop:


We can bring data from the relational world into the Hadoop Distributed File System using the Sqoop import tool and then analyze the data with MapReduce. The resultant data set can then be put back into the RDBMS using Sqoop's export process.
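A sketch of that final export step, assuming the analysis wrote its results to /data/customer_summary and that a matching customer_summary table already exists in the database (both names are placeholders):

  # Push the MapReduce results back into the relational database
  sqoop export \
    --connect jdbc:mysql://dbserver/salesdb \
    --username sqoop_user -P \
    --table customer_summary \
    --export-dir /data/customer_summary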

Import Process:


The input to the import process is a database table. Sqoop reads this table row by row into HDFS, and the output of the import process is a set of files on the Hadoop Distributed File System. The output consists of multiple files because the import is performed in parallel.
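Because each map task writes its own output file, the target directory holds one file per mapper. For instance, if the earlier import ran with four mappers, listing the target directory would show something along these lines (by default the files are delimited text):

  hadoop fs -ls /data/customers
  # _SUCCESS
  # part-m-00000
  # part-m-00001
  # part-m-00002
  # part-m-00003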

A by-product of the import process is a generated Java class, which is also provided for use in subsequent MapReduce operations on the data. This class encapsulates one row of the imported table, and its availability makes the development of MapReduce applications more convenient.

Before the import can start, Sqoop uses JDBC to examine the table, retrieving a list of all the columns and their SQL data types. These SQL types (VARCHAR, INTEGER, and so on) are then mapped to the Java data types (String, Integer, etc.) that will hold the field values in MapReduce applications. Sqoop's code generator uses this information to create a table-specific class to hold records extracted from the table.
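The same class generation can also be run on its own, without importing any data, via the codegen tool. A minimal sketch, reusing the placeholder connection details from the earlier examples:

  # Generate (and compile) the table-specific record class;
  # by default the class is named after the table, e.g. customers.java
  sqoop codegen \
    --connect jdbc:mysql://dbserver/salesdb \
    --username sqoop_user -P \
    --table customers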

Sqoop Tools, Commands, and Arguments:

A full list of Sqoop tools, commands, and arguments can be found at:

http://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1721978
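The same information is also available from the command line itself:

  # List all tools that this Sqoop installation supports
  sqoop help

  # Show the usage and arguments of one specific tool
  sqoop help import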

This is Part 1 of a series of blog posts on Sqoop that will appear on our website in the coming weeks.

Part 2 will have more information on the following:

  • Sqoop Import
  • Import Process
  • Dealing with Incremental Loads
  • Importing to Hive

Supriya Sahay,
Sr. Engineer
Big Data Platform, DataMetica
