Big Data Analytics - Scale as needed

Big Data initiatives are making business and IT leaders rethink how Data Warehouses, or "Big Data Warehouses," should evolve.  The Business Intelligence (BI) or Analytics landscape needs to adapt to newer technologies like Hadoop and Spark for large data sets that live on multiple physical machines.  At the center of the "Big Data Warehouse" are Data Lakes and more affordable Massively Parallel Processing (MPP) databases, which can run on one machine or appliance.  Alexicon is part of the AWS Partner Network (APN) and delivers Big Data solutions (Hadoop, Spark and MPPs) that scale in the cloud.

Spark was started by Matei Zaharia at UC Berkeley's AMPLab in 2009.  The following is a side-by-side comparison of Hadoop and Spark performance (numbers gathered from Matei's presentation at Strata+Hadoop World 2015):

The comparison shows Spark significantly outperforming Hadoop because of its advanced engine and efficient in-memory utilization.  Spark is also written in Scala and provides streamlined programming capabilities.

Big Data queries can be visualized with tools like R and geographic tools for Data Science work, as can the aggregates or summarizations included in Enterprise Data Warehouses (EDWs).  Below are sample charts run in R directly from Hadoop:

Major and mid-size companies are positioned to benefit from these tools.  MPP databases are more affordable for EDWs and even Hadoop and Spark can be used inexpensively through Amazon as a cloud service to process big and massive data sets.  Companies that do not want to put their data in the cloud are bringing Hadoop, Spark and an MPP in-house.

Below is a diagram that shows the connections to traditional BI systems:

MPP Database Environments

MPP databases are typically used in major companies for large EDWs.  These are truly large integrated database management systems for Business Intelligence and Analytics reporting.  Major corporations need these systems to ingest and analyze larger data sets faster and to become more competitive through the intelligence gained.  Hadoop and Spark (the truly Big Data systems) can be used to feed MPPs for company-wide reporting.

MPP databases can have millions or billions of records that users run reports directly against.  We have seen over 40 data sources integrated in these databases.  Sizing the initial MPP environment and adding capacity once the system is in use can be difficult without professional third-party assistance.  Alexicon provides consulting services for big data environments.  Our specialties are architecture, governance, enterprise data models, Extract, Transform and Load (ETL), computations, Performance Management (goals/targets) and Process-based structures.

We provide Performance-based Testing to determine the most cost-effective mix of hardware and software licenses for current and future needs.  Workload Management is supported by Integrated Reporting and Work Plan Coordination across the entire system.

Complete System Approach

We take a complete system approach by providing integrated reporting and work management methods to ensure the entire system process is well coordinated.  By using a Process-based Approach, metrics and charts are used to understand how queries are processed through the complete system.  System Throughput and performance inhibitors are the focus to improve overall performance.  Below is a diagram that shows the database and major elements that influence performance:

With the MPP databases at the center of all the action, we focus on loading, administration and user queries to ensure that peak periods are resourced correctly or that work is moved to better-suited times.

MPP Uses

Traditional databases cannot perform with reasonable response times at such record levels.  Retailers have typically dealt with large data volumes and have relied on MPP solutions to help analyze these massive record sets.  One reason retail has such large data sets is that the Universal Product Code (UPC) is tracked at the individual Stock Keeping Unit (SKU) level, which creates an explosion of Point-of-sale (POS) records.  These records and many other internal and external data sources feed MPPs at transaction and/or summary levels so users can query this data with BI and other SQL-based tools.  There is still an art to building the right data model(s) for optimal frontend dashboard and reporting performance, and that does not change from traditional databases.  MPPs provide the advantage of holding more data and running much quicker, which is why the adoption rate is increasing.
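To illustrate the difference between transaction-level and summary-level data described above, here is a minimal Python sketch that rolls POS transactions up to one row per SKU per day.  The record layout, store IDs, SKU codes and amounts are all hypothetical and chosen only for illustration; in an actual MPP this would usually be a SQL GROUP BY rather than application code.

```python
from collections import defaultdict

# Hypothetical POS transaction rows: (store_id, sku, sale_date, quantity, amount)
pos_records = [
    ("S001", "SKU-100", "2015-03-01", 2, 5.98),
    ("S001", "SKU-100", "2015-03-01", 1, 2.99),
    ("S002", "SKU-100", "2015-03-01", 3, 8.97),
    ("S001", "SKU-200", "2015-03-02", 1, 12.50),
]


def summarize_by_sku_day(records):
    """Roll transaction-level rows up to one summary row per (sku, sale_date)."""
    totals = defaultdict(lambda: {"quantity": 0, "amount": 0.0})
    for _store_id, sku, sale_date, quantity, amount in records:
        key = (sku, sale_date)
        totals[key]["quantity"] += quantity
        totals[key]["amount"] += amount
    # Round currency totals to cents to hide floating-point noise.
    return {
        key: {"quantity": row["quantity"], "amount": round(row["amount"], 2)}
        for key, row in totals.items()
    }


summary = summarize_by_sku_day(pos_records)
for (sku, sale_date), row in sorted(summary.items()):
    print(sku, sale_date, row["quantity"], row["amount"])
```

Four transaction rows collapse into two summary rows here.  At retail scale the same rollup is what keeps dashboards responsive while the transaction detail remains available for drill-down.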

In 2010 Gartner reported that "nearly 70% of data warehouses experience performance constraint issues of various types. These typically affect data warehouses with varying levels of mixed workload, especially those with high query counts..."

Our experience is that large organizations with high query counts need planned and coordinated workload management to provide optimal performance.  For global companies, workload management is critical considering different time zones, required loads and user queries.
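As a rough sketch of the time zone problem, the Python below maps regional business hours onto UTC and lists the hours when no region is active, which are natural candidates for heavy loads.  The office list, the fixed UTC offsets (daylight saving is ignored) and the 08:00 to 18:00 business window are assumptions made for illustration only.

```python
# Hypothetical regional offices with fixed UTC offsets (daylight saving ignored).
REGION_UTC_OFFSETS = {"New York": -5, "London": 0, "Singapore": 8}

# Assumed business window: 08:00 through 17:59 local time.
BUSINESS_HOURS = range(8, 18)


def active_regions_by_utc_hour(offsets, business_hours=BUSINESS_HOURS):
    """Return {utc_hour: [regions whose local time falls in business hours]}."""
    active = {}
    for utc_hour in range(24):
        active[utc_hour] = [
            region
            for region, offset in offsets.items()
            if (utc_hour + offset) % 24 in business_hours
        ]
    return active


def quiet_utc_hours(offsets, business_hours=BUSINESS_HOURS):
    """Return the UTC hours in which no region is inside business hours."""
    active = active_regions_by_utc_hour(offsets, business_hours)
    return [hour for hour, regions in active.items() if not regions]


active = active_regions_by_utc_hour(REGION_UTC_OFFSETS)
quiet = quiet_utc_hours(REGION_UTC_OFFSETS)
print("Quiet UTC hours:", quiet)
```

With only three offices spread around the globe, very few UTC hours are free of user queries, which is why loads and queries have to be planned together rather than scheduled independently.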

Integrated Processes and Reporting

Out-of-the-box reports for databases, ETL and BI tools are like typical data mart/warehouse builds for business users.  We need integration from multiple source applications for a complete system picture.

Workload Management Areas

The areas above are focused on MPP databases and are equally applicable to Hadoop and Spark.  Without "Workload Analytics" and "Best Practices," bottlenecks often occur with MPPs and cause major performance issues, distractions to the business and consulting or development costs to resolve.  We believe it is better to have the performance visibility that "Workload Analytics" provides, so workloads can be watched by hour, day, week, month and year.  MPPs have also become costly assets to manage, and they deserve the attention needed to optimize hourly performance for a lean structure that avoids unnecessary upgrades to hardware or memory and unnecessary processor and memory licensing.
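A minimal Python sketch of hourly workload visibility follows.  It assumes a hypothetical query log of start timestamps and runtimes in seconds; a real MPP exposes this information through its own system tables or logs, whose names and layouts vary by vendor and are not shown here.

```python
from collections import Counter
from datetime import datetime

# Hypothetical query log entries: (start timestamp, runtime in seconds).
query_log = [
    ("2015-03-02 08:05:00", 12.0),
    ("2015-03-02 08:40:00", 95.0),
    ("2015-03-02 09:10:00", 30.0),
    ("2015-03-02 14:20:00", 8.0),
    ("2015-03-03 08:15:00", 40.0),
]


def workload_by_hour(log):
    """Return (query counts, total runtime seconds), each keyed by hour of day."""
    counts = Counter()
    runtime = Counter()
    for started, seconds in log:
        hour = datetime.strptime(started, "%Y-%m-%d %H:%M:%S").hour
        counts[hour] += 1
        runtime[hour] += seconds
    return counts, runtime


counts, runtime = workload_by_hour(query_log)

# The hour with the most accumulated runtime is the first place to look
# for contention between loads and user queries.
peak_hour = max(runtime, key=runtime.get)
print("Peak hour:", peak_hour, "queries:", counts[peak_hour], "seconds:", runtime[peak_hour])
```

The same grouping can be repeated by day, week, month or year by changing the key taken from the timestamp.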

Contact Us to learn more about Big Data Analytics
