Monday, March 11, 2019

Azure SQL Data Warehouse and PolyBase – an Introduction





Azure SQL Data Warehouse - Introduction


Azure SQL Data Warehouse is a fully managed, scalable, cloud-based, distributed relational database management system, offered as Platform-as-a-Service (PaaS), for data warehousing workloads. It is built on a Massively Parallel Processing (MPP) architecture that executes complex queries quickly against massive volumes of relational and non-relational data. It is a key component of end-to-end big data solutions in the cloud.

According to Microsoft, SQL Data Warehouse can support petabytes of data and be scaled up and down in a matter of minutes, or even seconds, with compute and storage resources separated to better control operations and costs. Azure SQL Data Warehouse works with several Azure services, such as Azure Machine Learning and Azure Data Factory, and integrates with various existing SQL Server tools and Microsoft products.

In brief, Azure SQL Data Warehouse is a key component of an end-to-end, cloud-based, enterprise solution that meets an organization’s big data needs. It scales up to 60 Compute nodes as part of its MPP architecture. When you store a table in Azure SQL Data Warehouse, you are storing it across those nodes. Later in this article we will see how a table’s data is distributed across these nodes (e.g., hash distribution or round-robin distribution).

Azure SQL Data Warehouse - Architecture


As we have seen, SQL Data Warehouse is built on top of a Massively Parallel Processing (MPP) engine that leverages a scale-out architecture to distribute the computational processing of data across multiple nodes. In other words, the internal architecture follows a master-slave pattern: there is always one Control node (master) and one or more Compute nodes (slaves), depending on your needs.

In fact, the unit of scale is an abstraction of computing power that is known as a data warehouse unit (DWU). SQL Data Warehouse separates compute from storage, which enables you to scale compute independently of the data in your system.
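
For instance, you can change the DWU level with a plain T-SQL statement executed against the master database; a minimal sketch, where the warehouse name and target service objective are hypothetical placeholders:

-- Scale compute up or down by changing the service objective (DWU level).
-- Run against the master database of the logical server.
ALTER DATABASE myDemoDW
MODIFY (SERVICE_OBJECTIVE = 'DW400c');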

[Image: SQL Data Warehouse architecture]


The SQL Data Warehouse architecture is governed by the MPP engine and built from a few precise components and mechanisms:
  • Control node
  • Compute node
  • Data Movement Service
  • Storage


Control node

SQL Data Warehouse uses a node-based architecture, where applications connect and issue T-SQL commands to a Control node, which is the single point of entry for connecting to the SQL Data Warehouse. The Control node can be thought of as the brain of the data warehouse: it is the front end that interacts with all applications and connections. It contains an instance of SQL Database for storing metadata, and a Massively Parallel Processing (MPP) engine for distributed query optimization, distributed processing, and coordination. Once a T-SQL query is submitted, the MPP engine on the Control node creates a distributed execution plan by breaking the query down into parallel processes and coordinating that processing across multiple Compute nodes running in parallel.
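
You can inspect the distributed plan the MPP engine produces by prefixing a query with EXPLAIN, which returns the plan as XML instead of executing the query; the table name below is a hypothetical example:

-- EXPLAIN returns the distributed query plan as XML without running the query.
EXPLAIN
SELECT customerID, COUNT(*) AS rowsPerCustomer
FROM dbo.demoTable
GROUP BY customerID;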

Along with this, the Control node also runs an instance of the Data Movement Service (DMS), a system-level internal service that coordinates all data movement across the other nodes as necessary to run queries in parallel and return accurate results.

Compute nodes

If the Control node is the master, then the Compute nodes are the slaves that provide the computational power. When you submit a T-SQL query to SQL Data Warehouse, the Control node transforms it into queries that run against each distribution in parallel, and distributions map to Compute nodes for processing.

The number of Compute nodes ranges from 1 to 60 and is determined by the service level of the data warehouse. Depending on your needs, you might have one or more Compute nodes; each contains an instance of SQL Database and performs the actual query processing. Each Compute node works on its local data (in certain cases there might be some data movement across Compute nodes) and, once it finishes processing the query, returns its result to the Control node.

Data Movement Service

As mentioned earlier, the Data Movement Service (DMS) is a system-level internal service that handles communication and data transportation across the nodes in SQL Data Warehouse. The Control node and each Compute node run an instance of DMS locally to coordinate data movement between the Control node and the Compute nodes, and between the Compute nodes themselves. When data movement is required, DMS ensures the right data gets to the right location.

Azure storage

SQL Data Warehouse uses Azure Storage to keep user data safe and secure, and Azure Storage is used in two scenarios. In the first scenario, when we load data into a SQL Data Warehouse table, the data gets distributed across the Compute nodes and persisted in Azure Storage. In the second scenario, you can load data into Azure Storage directly without incurring Azure SQL Data Warehouse compute costs, and later use PolyBase to query that data from SQL Data Warehouse.

Since compute and storage are decoupled in SQL Data Warehouse, you get real elasticity and can scale storage separately from compute. Using decoupled storage and compute, SQL Data Warehouse can -
  • Independently size compute power irrespective of your storage needs.
  • Grow or shrink compute power without moving data.
  • Pause compute capacity while leaving data intact, so you only pay for storage.
  • Resume compute capacity during operational hours.


Distributions


Under the Massively Parallel Processing (MPP) approach, a distribution is the basic unit of storage and processing for all the parallel queries executed by the MPP engine. Going deeper, when SQL Data Warehouse runs a query, the work is divided into 60 smaller queries that run in parallel, and each of these smaller queries runs on one of the data distributions.

In brief, you can say the data is divided across 60 underlying databases, called distributions. Each Compute node manages one or more of the 60 distributions. A data warehouse with maximum compute resources has one distribution per Compute node; a data warehouse with minimum compute resources has all the distributions on one Compute node.
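
You can inspect this layout yourself through system views; a small sketch using two DMVs available in SQL Data Warehouse:

-- Count the Control and Compute nodes backing the warehouse.
SELECT type, COUNT(*) AS node_count
FROM sys.dm_pdw_nodes
GROUP BY type;

-- List the 60 distributions and the Compute node each one maps to.
SELECT distribution_id, pdw_node_id
FROM sys.pdw_distributions
ORDER BY distribution_id;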

A distributed table appears as a single table, but the rows are actually stored across the 60 distributions. You choose the sharding pattern used to distribute the rows when you define the table. When you execute a CREATE TABLE AS SELECT statement, you can choose the distribution type and the table type (a CTAS sketch follows the lists below) -

Distribution type (SQL Data Warehouse supports these sharding patterns) - 
  • ROUND_ROBIN (table is spread evenly among the distributions)
  • HASH (table is distributed using a hash key)
  • REPLICATE (the whole table is copied to all nodes, which means it will take more disk space)
Table type – 
  • CLUSTERED COLUMNSTORE INDEX (usually fastest for reads)
  • HEAP (usually the fastest for writes)
  • CLUSTERED INDEX
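
As referenced above, a minimal CREATE TABLE AS SELECT (CTAS) sketch combining a distribution type with a table type; both table names are hypothetical:

-- CTAS creates and loads a new table in a single statement,
-- choosing the distribution type and the table type up front.
CREATE TABLE dbo.factSales
WITH
(
DISTRIBUTION = HASH(customerID),
CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT *
FROM dbo.factSales_staging;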


Note: 
  1. Sharding is a type of database partitioning that separates very large databases into smaller, faster, more easily managed parts called data shards. Technically, sharding is a synonym for horizontal partitioning. In practice, the term is often used to refer to any database partitioning that is meant to make a very large database more manageable.
  2. The MPP architecture does not allow PRIMARY KEY or FOREIGN KEY constraints on your tables.


Hash-Distributed Tables


In an MPP system like SQL Data Warehouse, data from each table gets divided across 60 underlying databases called distributions. Dividing the data and the processing capability across multiple nodes is how SQL Data Warehouse achieves its scalability, and deciding how to distribute the data is one of the most important decisions affecting performance in SQL Data Warehouse.

With the hash-distributed option, data is distributed into the 60 distribution buckets based on the hash value computed from a single column. A hash is a function into which we feed a value and which produces an output called a hash value. In the table definition, one of the columns is designated as the distribution column; in the listing below, customerID has been assigned that role.

CREATE TABLE demoTable
(
customerID INT NOT NULL,
lastName VARCHAR(100),
firstName VARCHAR(100)
)
WITH
(
CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = HASH(customerID)
);

In other words, when a row is inserted into the table defined in the code snippet above, the value of the distribution column is passed into a hash function, and the hash value returned is the same every time: in SQL Data Warehouse the hash function is deterministic. More importantly, the column you choose to hash on should give you an even distribution of data across the 60 distributions. That way the distributions assigned to Compute nodes do parallel work in equal chunks that complete at roughly the same time.

[Image: Hash function]

In brief, hash-distributed tables have the following key features –
  • Data is divided across nodes based on a hashing algorithm.
  • The same value will always hash to the same distribution.
  • Each table row belongs to exactly one distribution.
  • A deterministic hash algorithm assigns each row to one distribution.


Consider using a hash-distributed table when:
  • The table size on disk is more than 2 GB.
  • The table has frequent insert, update, and delete operations.
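
If the chosen column is skewed, some distributions receive far more rows than others. A quick way to check, assuming the demoTable defined above, is DBCC PDW_SHOWSPACEUSED, which reports rows and space per distribution:

-- Reports row counts and space used for each of the 60 distributions,
-- making data skew easy to spot.
DBCC PDW_SHOWSPACEUSED("dbo.demoTable");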


Round-Robin Distributed Tables


The round-robin distribution method is the default option and is fairly simple to set up, but because we don't know where the data lands across the distributions, we may end up moving data at query time, and data movement is expensive and can hurt performance.

In fact, a round-robin distributed table distributes data evenly across the distributions without any further optimization. A distribution is first chosen at random, and then buffers of rows are assigned to distributions sequentially. Unlike hash-distributed tables, rows with equal values are not guaranteed to land in the same distribution. In the code snippet below, note that the CREATE TABLE statement doesn't specify a column, because SQL Data Warehouse places each row randomly.

CREATE TABLE demoTable
(
customerID INT NOT NULL,
lastName VARCHAR(100),
firstName VARCHAR(100)
)
WITH
(
CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = ROUND_ROBIN
);

Hence, the system sometimes needs to execute a data movement operation to organize the data better before it can resolve a query. For example, joining a round-robin table usually requires reshuffling the rows, which is a performance hit.

[Image: Round-robin distribution]


In brief, round-robin distributed tables have the following key features –
  • Data is distributed evenly across nodes.
  • Easy place to start; you don't need to know anything about the data.
  • Simplicity comes at a cost.
  • Queries may require more data movement at query time.


Consider using the round-robin distribution for your table in the following scenarios:

  • When getting started as a simple starting point since it is the default.
  • In case there is no obvious joining key.
  • If there is no good candidate column for hash distributing the table.
  • If the table does not share a common join key with other tables.
  • If the join is less significant than other joins in the query.
  • When the table is a temporary staging table.
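
A common pattern is to land data in a round-robin staging table and later redistribute it with CTAS once a good hash key is identified; a sketch with hypothetical names (RENAME OBJECT is supported in SQL Data Warehouse):

-- Rebuild the round-robin table as a hash-distributed one via CTAS.
CREATE TABLE dbo.demoTable_hash
WITH
(
DISTRIBUTION = HASH(customerID),
CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM dbo.demoTable;

-- Swap the tables by renaming.
RENAME OBJECT dbo.demoTable TO demoTable_old;
RENAME OBJECT dbo.demoTable_hash TO demoTable;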


Replicated Tables


In an MPP system, the most optimized queries can be passed through to execute on individual distributions without any interaction with the other distributions. But sometimes the Data Movement Service (DMS) is required to coordinate data movement across the distributions to satisfy the join criteria in a query. Replicated tables are a tool we can use to avoid that data movement for small tables.

A replicated table delivers the fastest query performance for small tables by caching a full copy of the table on each Compute node. Replicating a table removes the need to transfer data among the Compute nodes before a join or an aggregation. Because the table exists in multiple copies, we want to make sure that we only replicate tables that are fairly small.

Typically, a table of less than 2 GB compressed is the best candidate. As the snapshot of SQL Data Warehouse below shows, a replicated table is cached on the first distribution of each Compute node.

[Image: Replicated table]

In SQL Data Warehouse, the replicated table is therefore copied in full to every Compute node, rather than being split across the 60 distributions.

In brief, replicated tables have the following key features –
  • A full copy of the table is accessible on each Compute node.
  • Reduces data movement between Compute nodes before a join or aggregation.
  • Simplifies query plans.
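
A minimal sketch of a replicated dimension table; the names are hypothetical:

-- A small dimension table replicated to every Compute node,
-- avoiding data movement when it is joined to large fact tables.
CREATE TABLE dbo.dimRegion
(
regionID INT NOT NULL,
regionName VARCHAR(50)
)
WITH
(
DISTRIBUTION = REPLICATE,
CLUSTERED INDEX (regionID)
);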


PolyBase - Introduction


PolyBase is a feature available in SQL Server 2016 and later versions that enables T-SQL queries to read data from external data sources, including relational as well as non-relational stores. More broadly, it allows professionals to connect SQL Server and SQL Data Warehouse to Hadoop clusters. PolyBase enables you to run queries on external data in Hadoop or Azure Blob Storage, and queries are optimized to push computation down to Hadoop.

There are numerous ways to tie external data sources to data stored in SQL Server for multi-source analytics, such as HDFS on Hadoop, Azure Blob Storage, and Azure Data Lake Store. To keep things simple, PolyBase has the built-in capability to query external data sitting in Hadoop using the same standard T-SQL used to query a database table.
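
To give a feel for the moving parts, here is a minimal sketch of setting up an external table over Azure Blob Storage in Azure SQL Data Warehouse; every name, key, and path below is a hypothetical placeholder:

-- One-time setup: a master key protects the credential that follows.
CREATE MASTER KEY;

-- Credential holding the storage account key (placeholder secret).
CREATE DATABASE SCOPED CREDENTIAL BlobStorageCredential
WITH IDENTITY = 'user', SECRET = '<storage-account-key>';

-- External data source pointing at a blob container.
CREATE EXTERNAL DATA SOURCE AzureBlobStore
WITH
(
TYPE = HADOOP,
LOCATION = 'wasbs://mycontainer@mystorageaccount.blob.core.windows.net',
CREDENTIAL = BlobStorageCredential
);

-- File format describing the delimited files (FIRST_ROW = 2 skips a header row).
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH
(
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2)
);

-- External table: only metadata lives in the warehouse; the data stays in blob storage.
CREATE EXTERNAL TABLE dbo.extCustomer
(
customerID INT NOT NULL,
lastName VARCHAR(100),
firstName VARCHAR(100)
)
WITH
(
LOCATION = '/customers/',
DATA_SOURCE = AzureBlobStore,
FILE_FORMAT = CsvFormat
);

Once created, dbo.extCustomer can be queried with ordinary SELECT statements as if it were a local table.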

PolyBase is currently available in -
  • SQL Server (starting with 2016)
  • Azure SQL Data Warehouse
  • Parallel Data Warehouse (Analytics Platform System)


[Image: PolyBase]

What Can PolyBase Do?


Due to a lack of Hadoop and related Java skills, organizations sometimes face a barrier to analyzing data sources that live in Hadoop. PolyBase allows organizations to perform Hadoop analysis without Hadoop knowledge and without additional Hadoop software add-ons.

PolyBase handles all the following scenarios:
[1] Query Hadoop data
Users often store data in Hadoop because it is distributed and scalable; PolyBase enables them to query that data by using T-SQL.

[2] Query Azure Blob Storage data
Azure Blob Storage holds massive amounts of structured and unstructured data used by different Azure services; PolyBase enables you to query that data by using T-SQL.

[3] Import data from Hadoop, Azure Blob Storage or Azure Data Lake Store
By leveraging Microsoft SQL’s columnstore technology and analysis capabilities, PolyBase can import data from Hadoop, Azure Blob Storage or Azure Data Lake Store without a separate import or ETL tool.
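
For instance, building on the hypothetical external table sketched earlier, the import is a single CTAS over it:

-- Import external data into a distributed, columnstore-backed warehouse table.
CREATE TABLE dbo.customer
WITH
(
DISTRIBUTION = HASH(customerID),
CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM dbo.extCustomer;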

[4] Export data to Hadoop, Azure Blob Storage or Azure Data Lake Store
By exporting to these destinations, data can be archived for cost-effective storage while remaining easy to access online.
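
The export path uses the same objects in reverse through CREATE EXTERNAL TABLE AS SELECT (CETAS); the names are again the hypothetical ones from the earlier sketch:

-- Export warehouse data to blob storage as files behind a new external table.
CREATE EXTERNAL TABLE dbo.extCustomerArchive
WITH
(
LOCATION = '/customers-archive/',
DATA_SOURCE = AzureBlobStore,
FILE_FORMAT = CsvFormat
)
AS
SELECT * FROM dbo.customer;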

[5] Integrate with Microsoft BI tools
PolyBase can be integrated with Microsoft’s BI and analysis stack, or with other compatible third-party analytics tools.

[Image: PolyBase process]


In the diagram above, we can see that the medium changes but the task and goals stay the same. In other words, whether on premises (scaling out with APS) or in Azure (SQL Data Warehouse), PolyBase provides the same functionality, with a few exceptions.

Put simply, PolyBase acts as a bridge between SQL Server and external systems designed for the storage of massive data sets.

In further articles we will go through some hands-on activities on top of Azure SQL Data Warehouse and PolyBase, so stay in touch.
