
RC/ORC File Format

Data is big…and it’s growing. That is probably not a big surprise, but with Big Data becoming a hot topic in business and IT, it has reached a point where it is unavoidable. As data has grown, improved technology has finally made it feasible to manage this enormous amount of information and store it appropriately. Specifically, relational data today is stored in databases like Oracle, DB2, and MySQL. Although common throughout many businesses, these databases struggle to let users access the data efficiently and in real time. These issues arise from limited storage capacity and a lack of parallel processing, which results in cost and time inefficiency.

With open source technologies taking the first step, we have seen advances in data management. For storage specifically, Hadoop provides a few basic file formats that can help save storage space and provide a better way to access relational data. Today, we will discuss, in detail, how RCFile and ORCFile are used to store and access relational data in Hadoop.

RCFile

The RCFile (Record Columnar File) is a data storage structure that determines how to minimize the space required for relational data in HDFS (Hadoop Distributed File System). It does this by changing the format of the data using the MapReduce framework. The RCFile combines multiple functions such as data storage formatting, data compression, and data access optimization.
It is able to meet all four requirements of data storage:

(1) Fast data storing,
(2) Improved query processing,
(3) Optimized storage space utilization, and
(4) Adaptability to dynamic data access patterns.

The RCFile format can partition the data both horizontally and vertically. This allows it to fetch only the specific fields that are required for analysis, eliminating the time otherwise needed to scan the whole table in a database. The overall data size reduction can be as large as 14% of the original format.

The simplest way to create the RCFile format is using Hive in Hadoop as follows:
(If you do not have an existing table in this format, begin by creating one, as in the example below.)

CREATE TABLE table_rc (
column1 STRING,
column2 STRING,
column3 INT,
column4 INT
)
STORED AS RCFILE;
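
To confirm the table was created with the expected storage format, a DESCRIBE FORMATTED statement lists the table’s storage details; as a quick sanity check, the InputFormat line of its output should mention RCFile:

-- Check the InputFormat/OutputFormat lines to verify the RCFile storage format
DESCRIBE FORMATTED table_rc;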

Enabling Compression for RCFile

SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

Create a Hive table over the existing text data:

CREATE TABLE table_txt (
column1 STRING,
column2 STRING,
column3 INT,
column4 INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION '<HDFS FILE PATH>';
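
If the raw data file is not already sitting at an HDFS location, a LOAD DATA statement is a common alternative for populating the text table; the path below is purely a hypothetical example:

-- Move an existing HDFS file into the table's storage location (path is illustrative)
LOAD DATA INPATH '/tmp/sample_data.txt' INTO TABLE table_txt;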

Now that we have created the Hive table, we can write its data into the "table_rc" table:

INSERT OVERWRITE TABLE table_rc SELECT * FROM table_txt;

Now you can run a query against an individual column. A MapReduce job will run; watch the HDFS_BYTES_READ counter to see the difference in the bytes read from HDFS. There is a large difference in the amount of data read, because the RCFile table reads only the requested column while the text format reads the complete data to execute the query.
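As a simple illustration using the tables defined above, the pair of queries below touches a single column; comparing the HDFS_BYTES_READ counter of the two resulting MapReduce jobs shows how much less the RCFile table reads:

-- Reads every byte of the delimited text data
SELECT MAX(column3) FROM table_txt;

-- Reads only the blocks holding column3 in the RCFile layout
SELECT MAX(column3) FROM table_rc;
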

ORC File

The ORC File (Optimized Row Columnar) format provides a more efficient way to store relational data than the RC File, reducing the size of the original data by up to 75%. The ORC file format performs better than other Hive file formats when Hive is reading, writing, and processing data. Compared specifically to the RC File, ORC takes less time to access data and less space to store it. However, the ORC file increases CPU overhead by increasing the time it takes to decompress the relational data. Also, the ORC File format was introduced in Hive 0.11 and cannot be used with earlier versions.

How to create an ORC File using Hive in Hadoop:
If you don’t have an existing table in this format, begin by creating one:

CREATE TABLE table_orc (
column1 STRING,
column2 STRING,
column3 INT,
column4 INT
) STORED AS ORC;
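
ORC also exposes table-level tuning properties. As a rough sketch (the table name below is made up, and the property values shown are the documented defaults rather than recommendations), the internal compression codec and stripe size can be chosen at creation time:

-- Hypothetical example: pick ORC's internal codec and stripe size via TBLPROPERTIES
CREATE TABLE table_orc_tuned (
column1 STRING,
column2 STRING,
column3 INT,
column4 INT
)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="ZLIB", "orc.stripe.size"="268435456");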

Create a Hive table over the existing text data:

CREATE TABLE table_temp (
name STRING,
address STRING,
age INT,
salary INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION '<HDFS FILE PATH>';

Write the data into the "table_orc" table:

INSERT OVERWRITE TABLE table_orc SELECT * FROM table_temp;
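
Hive also ships an ORC file dump utility that prints the stripe layout and column statistics of an ORC file. As a hedged example (the warehouse path below is hypothetical and will differ in your cluster), it can be run against one of the files produced by the insert:

hive --orcfiledump /user/hive/warehouse/table_orc/000000_0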

The advancement of data compression is a vital aspect of the future of Big Data. As the volume of and need for data increase, data management will have a large impact on the future of data storage. Data compression tools allow these file formats to be much smaller in size, leading to more efficient management and access of the data. Currently, with the release of Hive 0.12, both the ORC and RC file formats have improved to match the current data storage needs of businesses. As time progresses, the continued innovation of data storage will need to keep pace with the exponential growth of data itself.

Bhagwan Soni (Big Data Engineer)
Benesh Chudasama (Business Development)

