Big Data Analytics - Scale as needed
Big Data initiatives are making business and IT leaders rethink how Data Warehouses, or "Big Data Warehouses," evolve. The Business Intelligence (BI) and Analytics landscape needs to adapt to newer technologies like Hadoop and Spark for large data sets that live on multiple physical machines. At the center of the "Big Data Warehouse" are more affordable Massively Parallel Processing (MPP) databases, which run on one machine or appliance.
Today’s common model runs a large MPP database for the Enterprise Data Warehouse (EDW), big batch jobs or queries on Hadoop, and now interactive queries on Spark.
Spark was started by Matei Zaharia at the UC Berkeley AMPLab in 2009. Following is a side-by-side comparison of Hadoop's and Spark's performance (numbers gathered from Matei's presentation at Strata+Hadoop World 2015):
It is clear that Spark significantly outperforms Hadoop because of its advanced execution engine and efficient in-memory processing. Spark is also written in Scala and provides streamlined programming capabilities.
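That in-memory advantage can be illustrated with a minimal, pure-Python sketch (this is a conceptual analogy, not a real Spark API): an iterative job that re-reads its input on every pass, MapReduce-style, versus one that loads the data once and keeps it in memory, the way Spark caches a working set.

```python
# Conceptual sketch: iterative work over a data set, with and without
# keeping the data in memory between passes. Names are illustrative only.

load_count = 0

def load_dataset():
    """Stand-in for an expensive disk/HDFS read."""
    global load_count
    load_count += 1
    return list(range(100_000))

def iterate_without_cache(passes):
    # MapReduce-style: every pass goes back to storage.
    total = 0
    for _ in range(passes):
        data = load_dataset()
        total = sum(data)
    return total

def iterate_with_cache(passes):
    # Spark-style: load once, reuse the in-memory data set.
    data = load_dataset()
    total = 0
    for _ in range(passes):
        total = sum(data)
    return total

iterate_without_cache(10)
reads_uncached = load_count     # 10 expensive reads

load_count = 0
iterate_with_cache(10)
reads_cached = load_count       # 1 expensive read

print(reads_uncached, reads_cached)  # → 10 1
```

For iterative workloads (machine learning, interactive queries), this one-read-versus-many difference is a large part of why Spark's numbers look so much better.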
Big Data query results can be visualized with tools like R and geographic mapping tools for Data Science work, as well as summarized into EDWs. Below are sample charts run in R:
Major and mid-size companies are positioned to benefit from these tools. MPP databases are more affordable for EDWs, and even Hadoop and Spark can be run inexpensively as Amazon cloud services to process massive data sets. Companies that do not want to put their data in the cloud are bringing Hadoop and Spark in-house.
Below is a diagram that shows the connections to traditional Business Intelligence (BI) systems:
Managing Massively Parallel Processing (MPP) Database Environments
MPP databases are typically used in major companies for large Enterprise Data Warehouses (EDWs). These are large, integrated database management systems for Business Intelligence and Analytics reporting. Major corporations need these systems to ingest and analyze bigger datasets more quickly, increasing competitiveness through gained intelligence. Hadoop and Spark (the truly Big Data systems) are also used to feed MPPs for company-wide reporting.
MPP databases can have millions or billions of records that users run reports directly against. We have seen over 40 data sources integrated in these databases. Sizing the initial MPP environment and adding capacity once the system is in use can be a difficult task without professional third-party assistance. Alexicon provides consulting services for big data environments. Our specialties are architecture, governance, enterprise data models, Extract, Transform and Load (ETL), computations, Performance Management (goals/targets) and Process-based structures.
We provide Performance-based Testing to determine the most cost effective mix of hardware and software licenses for current and future needs. Workload Management is supported by Integrated Reporting and Work Plan Coordination across the entire system.
Complete System Approach
We take a complete system approach by providing integrated reporting and work management methods to ensure the entire system process is well coordinated. By using a Process-based Approach, metrics and charts are used to understand how queries are processed through the complete system. System Throughput and performance inhibitors are the focus to improve overall performance. Below is a diagram that shows the database and major elements that influence performance:
With the MPP databases at the center of all the action, we focus on loading, administration and user queries to ensure peak periods are resourced correctly or work is moved to better suited times.
MPP databases are used to provide reasonable query responses on Extreme Data that easily exceeds a billion records. Traditional databases cannot deliver reasonable response times at such record levels. Retailers have typically dealt with large data volumes and have relied on MPP solutions to help analyze these massive record sets. An example of why retail has such large data sets is the Universal Product Code (UPC) assigned to individual Stock Keeping Units (SKUs), which creates an explosion of Point-of-Sale (POS) records. These records and many other internal and external data sources feed MPPs at transaction and/or summary levels so users can query the data with BI and other SQL-based tools. There is still the art of building the right data model(s) to provide optimal frontend dashboard and reporting performance, which does not change from traditional databases. MPPs provide the advantage of holding more data and running much faster, which is why their adoption rate is increasing.
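A back-of-the-envelope calculation shows how quickly POS line items pile up; the store and transaction counts below are hypothetical illustrations, not figures from any particular retailer:

```python
# Hypothetical retail POS volume arithmetic (illustrative numbers only).
stores = 2_000                # retail locations
daily_line_items = 100_000    # POS line items (UPC scans) per store per day

records_per_day = stores * daily_line_items      # 200,000,000
records_per_year = records_per_day * 365         # 73,000,000,000

print(f"{records_per_year:,} POS records per year")
# → 73,000,000,000 POS records per year
```

Even at modest per-store volumes, a single year of transaction-level POS history lands well past the billion-record threshold where traditional databases struggle.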
In 2010 Gartner reported that "nearly 70% of data warehouses experience performance constraint issues of various types. These typically affect data warehouses with varying levels of mixed workload, especially those with high query counts..."
Our experience is that large organizations with high query counts need planned and coordinated workload management to provide optimal performance. For global companies, workload management is critical considering different time zones, required loads and user queries.
Integrated Processes and Reporting
Out-of-the-box reports from databases, ETL and BI tools are siloed, much like typical data mart/warehouse builds for business users. Integrating data from multiple source applications is needed for a complete system picture.
Workload Management Areas
- ETL Jobs
- Interactive Refreshes
- Scheduled Report Refreshes
- On-demand and Ad Hoc Reports
- Query Arrival Rate
- Query Backlog
- Query Wait Time
- Query CPU Time
- Memory Consumption
- Query I/O
- Query Throughput
- Group and User Metrics
- DBA Scheduled Backups
The areas above are focused on MPP databases and are equally applicable to Hadoop and Spark. Without "Workload Analytics" and "Best Practices," bottlenecks often occur with MPPs, causing major performance issues, distractions to the business, and consulting or development costs to resolve. We believe it is better to have the performance visibility that "Workload Analytics" provides, monitored by hour, day, week, month and year. MPPs have also become a costly asset to manage and deserve the attention and opportunity to optimize hourly performance for a lean structure that avoids unnecessary upgrades to hardware or processor and memory licensing.
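A minimal sketch of what such Workload Analytics might compute from a query log follows; the log entries, field layout and figures are hypothetical, for illustration only:

```python
# Hypothetical query-log analysis: wait time, CPU time and throughput.
from datetime import datetime, timedelta
from statistics import mean

t0 = datetime(2024, 1, 1, 9, 0, 0)
# Each entry: (submit time, start time, finish time, CPU seconds)
query_log = [
    (t0,                         t0 + timedelta(seconds=2),  t0 + timedelta(seconds=12),  8.0),
    (t0 + timedelta(seconds=5),  t0 + timedelta(seconds=9),  t0 + timedelta(seconds=30), 15.0),
    (t0 + timedelta(seconds=10), t0 + timedelta(seconds=10), t0 + timedelta(seconds=40), 20.0),
]

# Query wait time: how long each query sat in the backlog before starting.
wait_times = [(start - submit).total_seconds() for submit, start, _, _ in query_log]

# Query CPU time consumed across the workload.
total_cpu = sum(cpu for *_, cpu in query_log)

# Throughput: queries completed per second over the observed window.
window = (max(f for _, _, f, _ in query_log)
          - min(s for s, *_ in query_log)).total_seconds()
throughput = len(query_log) / window

print(mean(wait_times), total_cpu, throughput)  # → 2.0 43.0 0.075
```

Rolled up by hour, day, week, month and year, and broken down by group and user, metrics like these make backlogs and peak-period contention visible before they become business-facing incidents.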
Big Data Links:
► Trifacta - Big Data Wrangling (Modern ETL for Hadoop)
► MapD - GPU-powered Database and High-resolution Visualizations (Big Data power with much less hardware)
► Modern Data Science - Defining and leveraging relationships in massive multidimensional spaces
Contact Us to learn more about Big Data Analytics