Value of the cloud for Big Data

Dr. Phil Shelley, President, DataMetica Solutions Incorporated, and former CIO and, more recently, CTO of Sears Holdings. Dr. Shelley has many years of experience in CIO/CTO and business leadership roles and now works as part of the DataMetica team to bring Hadoop and new data architectures to other large enterprises.

After decades in which the Enterprise Data Warehouse (EDW) has been the dominant solution for analytics needs, a major change is now well underway, and the combination of Big Data and cloud options is accelerating adoption.

Starting in the early 2000s, Google pioneered parallel processing on commodity hardware, along with distributed storage concepts. Engineers at Yahoo and elsewhere reverse-engineered the principles behind Google's proprietary solution, and Hadoop was born. Social media giant Facebook adopted the technology and created SQL interfaces, opening up something that was complex and hard to use to business users, most of whom are comfortable with SQL. Hadoop and its supporting tools evolved and became mature, secure and accessible. SQL on Hadoop became faster and went mainstream, with drivers to connect any BI visualization tool. In-memory tools evolved, including the well-known Apache Spark. By 2013, at the end of a 10-year evolution, Hadoop was mature and most large enterprises had demonstrated adoption. At first, use cases were special applications where the flexibility of Hadoop, its performance advantages or its storage cost reduction made a compelling argument. In the past two years, however, Hadoop has progressively seen adoption as a replacement for the EDW, relational databases and mainframes. The key drivers are performance, where we routinely see up to 100-fold improvement; cost, where Hadoop can often be less than 25% of the cost of legacy solutions; and new use cases, where complex data sets or near-real-time analytics were show-stoppers for the legacy EDW.

We are now seeing massive growth in Hadoop implementations, primarily in traditional companies that had for decades used only traditional EDW and analytics databases. Historically, the EDW and other data stores were loaded periodically with partial data from transactional systems and contained modest data volumes covering limited history. Data had to be archived frequently to save space, maintain acceptable performance and manage costs: EDW appliances typically cost tens of millions of dollars to replace, expand and maintain. Now, with a well-designed Enterprise Data Hub (EDH) built on Hadoop and ancillary tools, these limitations are eliminated.

The role of Cloud
An enterprise data hub (EDH) running on Hadoop is a data management and analytics approach that is rapidly gathering pace and displacing legacy data warehouses (EDWs) and even mainframes. Adoption is often slowed by company procedures for approving funding, projects and architectural fit with legacy systems. There are major opportunities to accelerate Hadoop EDH adoption by leveraging cloud options, and earlier concerns over cloud security are fading. The table below contrasts some of the policy and procedure restrictions that slow adoption with how the cloud approach can help:


| Factor | Restriction | How Cloud Helps |
| --- | --- | --- |
| Architecture – Servers | Hadoop uses generic, low-cost commodity servers | Buying Hadoop as a service eliminates architecture exception discussions and approvals. Cloud can offer these servers in a low-cost, “pay for only what you use” model |
| Architecture – Storage | Hadoop uses internal disk storage, and the company standard is SAN | Hadoop in a public or private cloud has storage bundled into the solution, so there are no architecture options to get approved |
| Architecture – Backup | Hadoop uses a remote cluster for backup; the standard is legacy on-site and off-site backup | Cloud vendors have various ways to provide backup that are Hadoop-optimized |
| Architecture – Networking | Hadoop uses a complex LAN and WAN construct to deliver performance at low cost. The high availability of Hadoop relies on the network design | Networking is a fundamental aspect of cloud hosting and is taken care of when purchasing Hadoop as a service |
| Funding | Capital is not available within the current planning period | Gaining capital approval for a large investment in servers, networking, racks and power systems can take months. Cloud translates that into a small incremental monthly expense |
| Skills – Infrastructure | Internal staff are unfamiliar with Hadoop systems, networking, Linux and security | Cloud vendors provide all infrastructure in a fully managed package, with no new skills or learning curve required of the internal team, who can learn over time in parallel with getting started |
| Skills – Data Integration | Internal staff know traditional ETL well; Hadoop often eliminates the need for ETL but requires different skills | Ingesting, transforming and exporting data into, within and out of Hadoop are new skills. Cloud vendors and their partners have these skills, ready to go |
| Skills – Development | Internal teams may know Java and other languages, but not Pig and the associated configurations needed for a reliable, high-performance data hub | Moving existing applications, data and code to Hadoop requires skills not readily available in most companies. Cloud vendors and their partners have these capabilities, with years of experience |
| Skills – Operations and Support | The care and attention needed to maintain a healthy Hadoop system are somewhat different, especially for handling errors, software configuration and job scheduling | Running Hadoop with a 24/7 critical workload and ensuring high performance and availability are key aspects that cloud vendors build into their model, with SLAs |
| Data – Modeling | Hadoop uses a very different approach to data modeling, and internal staff may not fully understand how to do this on Hadoop | Moving traditional database data and structures to Hadoop requires special skills to ensure that the system runs optimally. Cloud vendors and their partners do this every day |
| Security – Firewalling | Making Hadoop secure is different from securing legacy systems. Network and physical access designs are critically important | Ironically, a cloud-hosted Hadoop environment can be more secure than on-premises. Cloud vendors' experience, best practices, standards, audits and full-time focus make this an area of opportunity |
| Development – Systems | A single production cluster can be used for development, testing and QA, but there is often a need for short-term separate environments during these phases, which are expensive to build, tear down, maintain and move data to | Cloud vendors offer Hadoop as a service in a flexible model. Starting, stopping, resizing and moving data between clusters is a fundamental offering of cloud solutions, making them very flexible for development work |
| Policy – Hosting | A single third-party firm hosts all systems and has standards that do not fit the Hadoop model | Hadoop as a service is an opportunity to break out of restrictive single-vendor hosting of traditional systems |


Cloud – Speed, Flexibility and Cost Reduction
The Hadoop EDH concept is a profound change to enterprise data capture, retention, modeling and consumption. Analytics projects hosted on cloud Hadoop as a service frequently see a 50% reduction in time to value, along with associated cost savings. Businesses can bring together data, create products and services, and react in near real time. IT executives can save millions of dollars as they gradually retire legacy EDWs and databases or defer upgrades and expansion.
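
As a rough illustration of the ratios quoted in this article, the Python sketch below applies the "less than 25% of the cost of legacy solutions" and "50% reduction in time to value" figures to hypothetical inputs. The function names, the $10 million legacy cost and the 12-month baseline are assumptions made purely for illustration. They are not taken from any real deployment, and actual results will vary.

```python
def estimated_hadoop_cost(legacy_annual_cost, cost_ratio=0.25):
    """Estimate annual Hadoop cost as a fraction of the legacy cost.

    cost_ratio=0.25 reflects the article's "often less than 25%" claim,
    so the result is best read as an upper bound, not a forecast.
    """
    return legacy_annual_cost * cost_ratio


def estimated_time_to_value(baseline_months, reduction=0.50):
    """Estimate time to value after the article's quoted 50% reduction."""
    return baseline_months * (1 - reduction)


# Hypothetical inputs, chosen only to make the arithmetic concrete.
legacy_annual_cost = 10_000_000  # assumed legacy EDW cost per year, in dollars
baseline_months = 12             # assumed time to value on the legacy platform

hadoop_cost = estimated_hadoop_cost(legacy_annual_cost)
annual_savings = legacy_annual_cost - hadoop_cost
months = estimated_time_to_value(baseline_months)

print(f"Estimated Hadoop cost per year: ${hadoop_cost:,.0f}")
print(f"Estimated annual savings:       ${annual_savings:,.0f}")
print(f"Estimated time to value:        {months:.0f} months")
```

Both ratios are exposed as parameters so that a reader can substitute figures from their own environment rather than rely on the headline numbers.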