Hadoop has become the backbone of many enterprise big data systems because it offers a cost-effective way to store and analyze data. But before data can be stored, it must first be moved from its source into Hadoop. Apache Flume is a framework that helps solve this problem.
Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store.
The use of Apache Flume is not restricted to only log data aggregation. Since data sources are customizable, Flume can also be used to transport massive quantities of event data. This includes, but is not limited to, network traffic data, social-media-generated data, and email messages.
The following diagram depicts the different components of Flume and the role each component performs.
Because of this simple but powerful architecture, Flume can move anything that can be read as a byte array from any source system to any sink system. If we can open an input stream to the data a system generates, that system can become a Flume source; if we can open an output stream to a system, that system can become a Flume sink. Flume provides many source and sink implementations out of the box, and users can also implement their own custom sources or sinks when that is more convenient.
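To make the source-channel-sink architecture concrete, a single Flume agent can be wired together entirely in a properties file. The sketch below is illustrative (the agent name `a1`, component names, and port number are arbitrary choices, not part of the original article): it connects a netcat source to a logger sink through an in-memory channel.

```properties
# Name the components of this agent (a1 is an arbitrary agent name)
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listen for newline-terminated events on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffer events in memory between the source and the sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: log each event (useful for testing a flow)
a1.sinks.k1.type = logger

# Bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

An agent defined this way is typically started with the `flume-ng agent` command, passing the configuration file and the agent name (`a1` here).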
Flume also provides the ability to create multi-hop flows, in which multiple Flume agents form a chain to transfer data from the source system to the destination system. This ability allows us to create complex flows where data from multiple Flume agents is consolidated by intermediate-hop agents and then stored in the destination system.
*Note that in a multi-hop flow, the intermediate hops must be connected by an Avro or Thrift sink/source pair.
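A minimal sketch of how two hops connect (the agent names `edge` and `collector`, the hostname `collector-host`, and the port are hypothetical): the first agent's Avro sink sends events over RPC to the second agent's Avro source, so the channel and sink definitions of each agent stay independent.

```properties
# --- Agent 1 ("edge"): forwards its events to the next hop ---
edge.sinks = avroSink
edge.sinks.avroSink.type = avro
edge.sinks.avroSink.hostname = collector-host
edge.sinks.avroSink.port = 4545
edge.sinks.avroSink.channel = c1

# --- Agent 2 ("collector"): receives events from upstream agents ---
collector.sources = avroSrc
collector.sources.avroSrc.type = avro
collector.sources.avroSrc.bind = 0.0.0.0
collector.sources.avroSrc.port = 4545
collector.sources.avroSrc.channels = c1
```

Because any number of edge agents can point their Avro sinks at the same Avro source, this pattern is what enables the consolidation topology described above.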
Abhijeet Shingate (Big Data Architect)