Storing and exploring mountains of data

Gavin Payne builds on his article from last month about the common components of a Business Intelligence (BI) platform and focuses on a service to host one of them – the enterprise data warehouse (EDW).

To complement its petabyte-scale unstructured big data platforms, Microsoft has Azure SQL Data Warehouse – its petabyte-scale structured data platform.  Delivered as a fully managed cloud service, it uses a Massively Parallel Processing variant of the SQL Server database engine to execute large analytics queries that can access external Hadoop data, as well as the column and row format data it stores.

The enterprise data warehouse

This article builds on the article I wrote last month about the common components of a Business Intelligence (BI) platform and focuses on a service to host one of them – the enterprise data warehouse (EDW).

Enterprise data warehouses store records about everything a business has ever done.  It’s common to find a row in a table for every product it’s ever made, every order it’s ever sold and perhaps every interaction it’s ever had with its customers.  All this information can make them very large, which is why they’re often dedicated databases hosted on dedicated servers, rather than an extension of their transactional source systems.  And if managing the size of a data warehouse isn’t hard enough, there are also hefty query workloads to accommodate.  Data warehouse queries often aggregate large sets of values, so they need considerable CPU and memory resources and fast storage to deliver acceptable performance.  Providing business decision makers with the historic data they need can therefore be difficult, expensive and time consuming.

Fortunately, the data platform industry now has a lot of knowledge about providing high performing data warehouses.  There are database design patterns, such as the star schema, and database engine technology, such as Columnstore indexes, which have been created to do just that.  There are now also complete data platform services available, such as Azure SQL Data Warehouse.
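To make the star schema concrete, below is a minimal sketch of one in T-SQL.  The table and column names (DimProduct, FactSales and so on) are illustrative rather than taken from any real system.

    -- A dimension table: descriptive attributes, one row per product.
    CREATE TABLE dbo.DimProduct
    (
        ProductKey  INT           NOT NULL,
        ProductName NVARCHAR(100) NOT NULL,
        Category    NVARCHAR(50)  NOT NULL
    );

    -- A fact table: one narrow row per sale, each pointing at its dimensions.
    CREATE TABLE dbo.FactSales
    (
        DateKey     INT   NOT NULL,  -- references a row in a date dimension
        ProductKey  INT   NOT NULL,  -- references a row in DimProduct
        Quantity    INT   NOT NULL,
        SalesAmount MONEY NOT NULL
    );

Analytics queries then filter and group on the small dimension tables while aggregating the large fact table – exactly the access pattern Columnstore indexes were created to speed up.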

What is Azure SQL Data Warehouse?

Azure SQL Data Warehouse is Microsoft’s fully managed, terabyte-to-petabyte scale data platform, provided as an Azure cloud service.  Based on SQL Server, it provides a structured, relational database storage engine that can execute T-SQL analytics queries.  It’s optimised for the storage of star-schema data models, but its query engine can also access unstructured Hadoop data, allowing a single T-SQL query to access structured and unstructured data.  If big data platforms are what data scientists use, then Azure SQL Data Warehouse is what enterprise decision makers use.
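As a sketch of what such a single query might look like, assume an external table named ext.WebClicks has already been defined over files in Hadoop (its creation is sketched at the end of this article); it can then be joined to an ordinary warehouse table with plain T-SQL:

    -- Join structured warehouse data to unstructured Hadoop data through a
    -- (hypothetical) external table, as if it were just another table.
    SELECT   p.ProductName, COUNT(*) AS Clicks
    FROM     ext.WebClicks AS c
    JOIN     dbo.DimProduct AS p ON p.ProductKey = c.ProductKey
    GROUP BY p.ProductName;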

Massively Parallel Processing

What makes Azure SQL Data Warehouse different to regular database engines is its Massively Parallel Processing (MPP) architecture.  Instead of storing all its data and running all its queries on a single server, which risks never being large enough, the service distributes its data and queries across multiple servers, known as nodes.  This lets it scale to handle workloads far bigger than any single server could ever cope with, a common problem with ageing on-premises data warehouses.

The hub of each deployment of the service is a control node.  It’s what users and applications connect to, what receives their queries and then sends back their results.  It’s also the node that maintains a map of what data is physically stored by each of the many data nodes and manages the movement of data between nodes.  This means when it performs its final role – breaking down one large user query into several smaller internal queries – it knows which data node should execute which smaller part of the query.

While there will only ever be one control node, there are many data nodes.  These physically store an individual segment of the data warehouse’s database and run queries against that segment.  Only when all the segments are combined is the complete database visible, which is why the control node decides both which data each node stores and which queries each node runs.

The main benefit of having multiple data nodes, and therefore an MPP architecture, is that each node can execute a query in parallel with all the others.  For example, if Data Node 1 contained data for January to June and Data Node 2 for July to December, then an aggregate query for all the year’s data would be executed as two smaller queries, one on each data node.  While this simple example demonstrates the concept of an MPP architecture, it may not demonstrate the benefits.  Imagine then, if each month’s data had millions of fact table rows and the query involved a dozen dimension table filters.  Each query on its own could consume the resources of a large server, so scaling it out is the natural optimisation.
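Putting that example into T-SQL, and reusing the illustrative FactSales table from earlier, the user submits the first query below and the control node conceptually rewrites it as the two smaller per-node queries that follow:

    -- What the user submits to the control node.
    SELECT SUM(SalesAmount)
    FROM   dbo.FactSales
    WHERE  DateKey BETWEEN 20160101 AND 20161231;

    -- Conceptually, what each data node is asked to run.
    -- Data Node 1 (January to June):
    SELECT SUM(SalesAmount)
    FROM   dbo.FactSales
    WHERE  DateKey BETWEEN 20160101 AND 20160630;

    -- Data Node 2 (July to December):
    SELECT SUM(SalesAmount)
    FROM   dbo.FactSales
    WHERE  DateKey BETWEEN 20160701 AND 20161231;

    -- The control node then combines the two partial sums into the
    -- single result the user sees.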

Cloud scalability

Data warehouse platforms that use an MPP architecture are nothing new.  Microsoft has had an on-premises appliance that provides one, known as the Analytics Platform System, for several years.  But what makes the Azure SQL Data Warehouse service different are the technical and economic benefits of cloud computing that Microsoft Azure offers.

Data warehouse workloads can be variable, which matters when they need large amounts of expensive compute resource.  Running an analytics query or data load over dozens of CPU cores can be the norm in some situations, so only paying for that amount of resource when it’s needed can be a game changer for organisations.  The Azure SQL Data Warehouse service allows its administrators to scale the resources available to it both up and down, as well as pause it entirely – something that’s rare amongst Azure data services.  Data storage is paid for all the time though, as even when the service is paused its data needs to be kept.  What’s also different about this service is that its storage costs are decoupled from its compute costs.  You could have a large 500TB data warehouse that handles almost no queries and so needs very little compute resource, or the complete opposite – and you pay accordingly, whereas most other cloud data services couple compute costs with storage costs.
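At the time of writing, compute scaling is exposed through T-SQL as a service objective measured in Data Warehouse Units (DWUs).  The sketch below assumes a warehouse database called MyWarehouse; pausing, by contrast, is done through the Azure portal or PowerShell rather than T-SQL.

    -- Scale the warehouse up before a heavy overnight data load...
    ALTER DATABASE MyWarehouse MODIFY (SERVICE_OBJECTIVE = 'DW1000');

    -- ...and back down again once the load has finished, to stop paying
    -- for compute that is no longer needed.
    ALTER DATABASE MyWarehouse MODIFY (SERVICE_OBJECTIVE = 'DW100');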

Just the surface

This introduction is just that, yet the service has several other features that each deserve an article of their own:

  • Distributed tables – The science behind how large tables are physically broken up and distributed across many data nodes (see the sketch after this list).
  • Columnstore indexes – The service uses these by default to store the data in its tables, making the physical storage of data optimised for large analytics queries.
  • PolyBase – This component of SQL Server allows the database engine to query unstructured data stored in Hadoop or Azure blob storage using regular T-SQL queries – as if it were just another SQL Server table.
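As a rough sketch of how the first and third of those features surface in T-SQL – reusing the illustrative names from earlier, and with the Hadoop data source and file format objects assumed to already exist:

    -- The illustrative fact table, this time declared with the service's
    -- physical storage options.
    CREATE TABLE dbo.FactSales
    (
        DateKey     INT   NOT NULL,
        ProductKey  INT   NOT NULL,
        Quantity    INT   NOT NULL,
        SalesAmount MONEY NOT NULL
    )
    WITH
    (
        DISTRIBUTION = HASH(ProductKey),  -- rows spread across data nodes by product
        CLUSTERED COLUMNSTORE INDEX       -- the default, analytics-optimised format
    );

    -- A PolyBase external table over delimited files in Hadoop; MyHadoopCluster
    -- and MyDelimitedFormat are hypothetical, pre-created objects.
    CREATE EXTERNAL TABLE ext.WebClicks
    (
        ProductKey INT,
        ClickTime  DATETIME2
    )
    WITH
    (
        LOCATION    = '/logs/webclicks/',
        DATA_SOURCE = MyHadoopCluster,
        FILE_FORMAT = MyDelimitedFormat
    );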

More information

The Azure SQL Data Warehouse service can host the largest of data warehouses and brings unique capabilities to the table to allow it to do that.  Some of these are features that SQL Server has provided for a long time, such as T-SQL, whereas some are new to its MPP architecture, such as distributed tables.  As well as knowing about its features, there is also design guidance to consider – such as the service needing at least 1TB of data before it offers any performance benefits.

To find out more, Microsoft has the following links that I recommend reading:

Gavin Payne is the Head of Digital Transformation for Coeo, a Microsoft Gold partner that provides consulting and managed services for Microsoft data management and analytics technologies.  He is a Microsoft Certified Architect and Microsoft Certified Master, and a regular speaker at community and industry events.