Big Data and the Rise of the Enterprise Data Hub

Dr. Phil Shelley is President of DataMetica Solutions Incorporated and former CIO, and most recently CTO, of Sears Holdings. Dr. Shelley has many years’ experience in CIO/CTO and business leadership roles and now works as part of the DataMetica team to bring Hadoop and new data architectures to other large enterprises.

After decades in which the Enterprise Data Warehouse (EDW) has been the dominant solution for analytics needs, a major change is now well underway.

Starting in the early 2000s, Google pioneered parallel processing on commodity hardware, along with distributed storage concepts. Yahoo engineers and others reverse-engineered the engineering principles behind Google’s proprietary solution, and Hadoop was born. Social media giant Facebook adopted the technology and created SQL interfaces, opening up something that was complex and hard to use to all business users, who are predominantly comfortable with the SQL language. Hadoop and its supporting tools evolved and became mature, secure and accessible. SQL performance improved, and SQL on Hadoop became mainstream, with drivers to connect any BI visualization tool. In-memory tools evolved, including the well-known Apache Spark technology. By 2013, at the end of a 10-year evolution, Hadoop was mature and most large enterprises had adopted it.

At first, use-cases were special applications, where the flexibility of Hadoop, its performance advantages or its storage cost reduction made a compelling argument. In the past two years, however, Hadoop has increasingly been adopted as an EDW, Oracle or mainframe replacement. The key drivers are performance (we routinely see up to a 100-fold improvement), cost (Hadoop can often be less than 25% of the cost of legacy solutions) and new use-cases, where complex data sets or near real-time analytics were show-stoppers for the legacy EDW.

Now we are seeing massive growth in Hadoop implementations, primarily in traditional companies that had for decades used only traditional EDW and analytics databases. Historically, EDW and other data stores were loaded periodically with partial data from transactional systems and contained modest data volumes covering limited history. Data had to be archived frequently to save space, maintain acceptable performance and manage costs: EDW appliances typically cost tens of millions of dollars to replace, expand and maintain. Now, with a well-designed Enterprise Data Hub (EDH) built on Hadoop and ancillary tools, these limitations are eliminated.

The EDH running on Hadoop is a data management and analytics approach that is rapidly gathering pace and displacing legacy data warehouses and even mainframes. An important concept of an EDH is to load data in near real-time and to retain it in full fidelity and full detail for as long as retention policies permit. This “no ETL” concept is the primary feature that makes an EDH so flexible, cost-effective and powerful, especially for analytics. Data is ingested in its native form, without truncation, filtering or change. Near real-time ingestion in native format opens up the opportunity to accelerate data integration, to accept data in any format (even voice, video and flat files) and to capture and store it for use, possibly years later. Inside the EDH data model, the ingested data is taken through successive layers, where it is refined, checked for data quality and shaped into secure, sometimes encrypted, data views and marts that make it analytics-ready. The original full-fidelity ingestion files are persisted, allowing reuse and refactoring as the business evolves over the years. We can then process, query and run any analytics or business process against this data at any point in time. We expose the data via SQL and BI tools, then optionally extract data as needed for consumption.
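The layering described above can be illustrated with a small sketch. This is not a Hadoop implementation: it uses plain Python lists in place of distributed storage, and the names `raw_layer`, `ingest`, `refine` and `sales_by_store`, as well as the sample point-of-sale records, are hypothetical. It only shows the idea that payloads are stored unchanged, while quality checks and analytics-ready views are derived downstream without altering what was ingested.

```python
import json
from datetime import datetime, timezone

# Hypothetical landing zone: payloads are kept byte-for-byte as received.
raw_layer = []


def ingest(payload: bytes, source: str) -> None:
    """Persist the payload unchanged, adding only arrival metadata."""
    raw_layer.append({
        "source": source,
        "received_at": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    })


def refine(records):
    """Parse raw payloads and apply a simple data quality check.

    Returns (clean_rows, rejected_records). The raw layer is not modified,
    so rejected records can be reprocessed later under different rules.
    """
    clean, rejected = [], []
    for rec in records:
        try:
            row = json.loads(rec["payload"])
        except ValueError:  # covers malformed JSON and undecodable bytes
            rejected.append(rec)
            continue
        valid = (
            isinstance(row, dict)
            and "store_id" in row
            and isinstance(row.get("amount"), (int, float))
        )
        (clean if valid else rejected).append(row if valid else rec)
    return clean, rejected


def sales_by_store(clean_rows):
    """Build a small analytics-ready view (a 'mart') from refined rows."""
    totals = {}
    for row in clean_rows:
        totals[row["store_id"]] = totals.get(row["store_id"], 0) + row["amount"]
    return totals


ingest(b'{"store_id": "S1", "amount": 10.5}', source="pos")
ingest(b'{"store_id": "S1", "amount": 4.5}', source="pos")
ingest(b'{"store_id": "S2", "amount": "n/a"}', source="pos")  # fails the check
ingest(b"not json at all", source="sensor")  # still stored in raw form

clean, rejected = refine(raw_layer)
mart = sales_by_store(clean)
```

In this sketch all four payloads remain in `raw_layer` even though only two pass the quality check, which mirrors the point that the original ingestion files are kept so that views can be rebuilt as requirements change.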

The Hadoop EDH concept is a profound change in enterprise data capture, retention, modeling and consumption. Analytics projects frequently see a 50% reduction in time to value, along with associated cost savings. Businesses can bring data together, create products and services, and react in near real time. IT executives can save millions of dollars as they gradually retire legacy EDW and databases or defer upgrades and expansion.

In my experience as a CIO, as CTO at Sears and, since then, in helping other large firms implement these solutions, I have seen a pattern emerge in which Hadoop rapidly grows in size and in importance to company operations.

It is fascinating to be part of this transformation in data management and to see enterprise data solutions built on big data technologies grow and evolve. A well-governed EDH foundation, however, is extremely important. Today we can benefit from the big data EDH approach, with almost unlimited low-cost compute, storage and analytics versatility and with data loaded in near real-time, and increasingly with the data protection and security advantages of legacy systems.