Tuesday, March 26, 2019

Loading data into Azure SQL Data Warehouse using PolyBase



PolyBase

  

Azure SQL Data Warehouse and PolyBase 


In previous articles, we walked through PolyBase and how to enable the PolyBase feature in SQL Server 2016. PolyBase lets you use T-SQL queries to read data from external data sources, both relational and non-relational.

Along with PolyBase, we introduced Azure SQL Data Warehouse, a cloud-based enterprise solution for data warehousing workloads built on a Massively Parallel Processing (MPP) design. The MPP engine uses a scale-out architecture that distributes the processing of massive datasets across multiple compute nodes for fast execution.

In this article, we will get hands-on and load data from Azure Storage blobs into Azure SQL Data Warehouse using the PolyBase feature. Since provisioning Azure Storage blobs was covered in an earlier article, we will focus on the following tasks – 
  • Create an Azure SQL Data Warehouse
  • Connect the Azure SQL Data Warehouse using SQL Server Management Studio (SSMS)
  • Upload a CSV file to the previously created Azure Storage blob container
  • Load data from the CSV file (blob container) into Azure SQL Data Warehouse using PolyBase


Meanwhile, if required, you can visit a few previous articles for a brief overview of the related artifacts.


Pre-requisites


Before moving ahead, a few prerequisites are needed to complete this Azure SQL Data Warehouse and PolyBase hands-on activity on the Azure cloud.
  1. An Azure subscription; if you don't have an account, sign up for a free Azure account - https://azure.microsoft.com/en-gb/free/
  2. The required lab files for this task, downloadable from the GitHub repository - https://github.com/rajendraazure/SQLDW-PolyBase
  3. An existing Azure Storage blob container where you can upload files.
  4. A VM or local workstation with SQL Server 2016 installed and PolyBase enabled.


Create Azure Storage Blob Containers


In this walk-through, I am using an existing Azure Storage account; you can either provision a new one or go with one created earlier. 
Log on to the Azure portal and move to the storage account's blobs, where you will create a container named – 
  • sensors-data

Log in to the Azure portal - https://portal.azure.com/

STEP – 1: Launch Azure Storage Blob 


On the Hub menu, click All services and select the required storage account to load its blade in the Azure portal. Here you can see all its details and options under sections such as Essentials and Services.

Storage account


Now create a container under the storage account, as planned earlier; click Blobs, which will load the container blade.

Storage Blobs

STEP – 2: Create a Container


Since you do not have any containers yet, the list is empty. Click + Container in the menu to launch the New container blade, which has the following properties – 
  • Name – The name of the container (case-sensitive).
  • Public access level – Go with the default selection, Private (no anonymous access).


Add Container


Click the OK button to create the container; a notification will appear about the validation and deployment. Soon you will see the new entry in the container list.

Container List


Prepare sample Data Files


In this Azure SQL Data Warehouse and PolyBase hands-on activity, you will use PolyBase to load data from your Azure Blob Storage account into your Azure SQL Data Warehouse, so a simple data file is required in the blob container.

You can either create these files yourself or download them from the GitHub repository noted in the prerequisites section. Go to that GitHub location and download the sample file - https://github.com/rajendraazure/SQLDW-PolyBase

GitHub


Here you can see three different files; we need the third one, i.e. the weather data – 
  1. 2016Orders.csv – Some random orders for the year 2016
  2. 2017Orders.csv – Some random orders for the year 2017
  3. BeachWeatherStationsAutomatedSensors.csv – A small CSV file containing sensor data from various Lake Michigan beaches in Chicago, Illinois. This dataset was obtained from the city’s open data portal (https://data.cityofchicago.org).


Upload sample Data Files to Azure Storage Blob Containers


Once the weather file is downloaded, it needs to be uploaded to the storage container you created in the earlier steps. Navigate to the Azure storage account and select the container named sensors-data.

Sensors-data


You can see there are no files yet, so click the Upload button in the top menu. It launches the Upload blob blade, which requires a few property details; make sure to fill them in before uploading a file to the blob container – 
  • Files – File name will be displayed once you browse the data file.
  • Overwrite if files already exist – Select the box.


Under the Advanced option, provide a few more details – 
  • Authentication type – Go with the default selection, SAS.
  • Blob type – Keep the default selection, Block blob.
  • Block size – Go with the default selection, 4 MB.
  • Upload to folder – Provide a folder name.


Upload Blob


Once you have filled in all the details, click the Upload button to load the data file from your local machine to the Azure blob container. After the data file is uploaded to the container, refresh the blob container blade and you will see the new data file under the folder you specified.

Weather data file

Provisioning an Azure SQL Data Warehouse 


Azure SQL Data Warehouse is a fully managed (Platform-as-a-Service, PaaS) and scalable cloud-based distributed relational database management system for data warehousing workloads. In this activity, PolyBase enables querying the storage data with T-SQL statements and loading it into an Azure SQL Data Warehouse. 

If required, visit one of the previous articles for a brief overview of SQL Data Warehouse and its architecture, or have a look at the official Microsoft documentation.

STEP – 1: Launch an Azure SQL Data Warehouse 


In the Microsoft Azure portal, click + Create a resource on the Hub and then click Databases in the Azure Marketplace. This loads all available database services under the Featured section; select SQL Data Warehouse, or search for it in the search box.

SQL Data Warehouse
  
The SQL Data Warehouse blade loads as soon as you click SQL Data Warehouse; submit the required details for the following properties – 
  • Database name – The name of the database.
  • Subscription – Your Azure subscription, whether free, pay-as-you-go, or another specific subscription. 
  • Resource group – The name of the resource group used to organize related resources; you can either create a new one or choose an existing one.
  • Select source – Select Sample, then choose AdventureWorksDW in the Select sample box (though you can go with Blank).

New SQL Data Warehouse

STEP – 2: Configure an Azure SQL Server 


Next is the Server property; here you configure the server settings to create a new server with the following values – 
  • Server name – The name of the SQL Data Warehouse DB server (must be unique). 
  • Server admin login – An administrator login name of your choice. 
  • Password – A strong password for that administrator.
  • Confirm password – Confirm the same password. 
  • Location – Ideally, select the same location as your storage account.
  • Allow Azure services to access server – Keep the default, Selected.


SQL DB Server

If all the information is correct, proceed by clicking the Select button; it will create the DB server with the defined properties. Then continue providing the remaining property details for the SQL Data Warehouse – 
  • Server – Now populated with the newly created server name.
  • Performance level – Select an appropriate performance level for the task.
  • Collation – Usually displays the default selection, SQL_Latin1_General_CP1_CI_AS. 


Click Performance level to launch another blade where you can configure performance and decide the pricing tier. If you need a different configuration, adjust it and click the Apply button on the Configure performance blade.

Configure performance

Since we are dealing with sample data of modest size, it is better to go with the default selection. The remaining part is Collation; it usually maps to the default selection, SQL_Latin1_General_CP1_CI_AS, though here it is determined by the selected sample source.
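Note that the performance level can also be changed later with T-SQL, which is handy for scaling compute up before a big load and back down afterwards. A minimal sketch, run against the logical server's master database and using a hypothetical warehouse name SampleDW (substitute your own database name and desired DWU tier):

```sql
-- Scale the data warehouse to a different performance level (DWU).
-- 'SampleDW' is a placeholder; DW100 is the smallest tier, which is
-- plenty for the sample data used in this walk-through.
ALTER DATABASE SampleDW MODIFY (SERVICE_OBJECTIVE = 'DW100');
```

Be aware that a scale operation drops open sessions while it completes, so schedule it outside active query windows.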

SQL Data Warehouse

Now that nearly all properties are filled in, proceed by clicking the Create button to create the SQL Data Warehouse instance. After validation, deployment can take a few minutes depending on the configuration you chose.

In the Azure portal, you can watch the notifications; you will be notified once the SQL Data Warehouse is created successfully. Once deployed, you can verify the newly created SQL Data Warehouse as a new entry under the SQL databases blade.

SQL Databases


Congratulations, an Azure SQL Data Warehouse is created! 😊

Connect to SQL Data Warehouse 


In this demo exercise, you will connect to the SQL Data Warehouse to proceed further to complete the PolyBase exercises.

Move to the Azure portal and open the SQL Data Warehouse; you can see the server name under the Overview section. Copy the server name, as it is required to connect from a client/local workstation.

SQL Data Warehouse


Use your taskbar shortcut or any other preferred method to launch SQL Server Management Studio (SSMS) from a client/local workstation; it will launch the Connect to Server dialog.

Here, provide the server details as – 
  • Server type – Database Engine
  • Server name – Provide the SQL Data Warehouse server name, which you copied in earlier steps
  • Authentication – Change the Authentication drop-down list to SQL Server Authentication
  • Login – Same name that you had provided during provisioning of server.
  • Password – Same password that you had provided during provisioning of server.

Connect to Server

Once you have entered all the details, proceed by clicking the Connect button. If everything goes well, it will connect to the SQL Data Warehouse.

However, in my case I encountered an error, which I will talk about in the next section.

Firewall Rule – Fix Exception


If you have not previously configured the firewall for the DB server, you might encounter an exception such as ‘Client with IP address ‘**.**.**.**’ is not allowed to access the server’, like – 

Exception

Move to the Azure portal and select the Firewalls and virtual networks option under the Security section of the provisioned Azure SQL Data Warehouse. Here you need to add a rule with the precise client IP address to allow access to the Azure SQL Data Warehouse service.

Firewalls and Virtual Networks


Sometimes the blade shows a different IP address due to proxy settings, so make sure to create the rule with the same IP address that appears in the exception message.
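The portal is the simplest way to add the rule, but server-level firewall rules can also be managed with T-SQL once you have a working connection (for example, from another machine that is already allowed). A sketch, using a hypothetical rule name and a placeholder IP address; substitute the address reported in the exception message:

```sql
-- Run in the master database of the logical server.
-- 'ClientWorkstation' and the IP range below are placeholders.
EXECUTE sp_set_firewall_rule
    @name = N'ClientWorkstation',
    @start_ip_address = '203.0.113.42',
    @end_ip_address   = '203.0.113.42';

-- Review the rules currently in place
SELECT * FROM sys.firewall_rules;
```

Using the same address for start and end restricts the rule to a single client IP.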

Allow access to Azure services


Now try to connect again; this time the SQL Data Warehouse connects easily from SQL Server Management Studio (SSMS).

DB Object Explorer


Next, in the Object Explorer pane, expand the node for your SQL Data Warehouse, and continue to expand nodes to see the database and sample tables that were loaded during the provisioning of SQL Data Warehouse.

Object Explorer


On the SSMS toolbar, click the New Query button to launch a new query window. In the query window, execute the following T-SQL query – 

SELECT SalesAmount, ProductLine FROM dbo.SalesByCategory;

Select query


Review the results; you have now successfully connected to and queried your SQL Data Warehouse.

Load and Transform Data using T-SQL with PolyBase


In this hands-on activity, you will load data extracted from the CSV file in Azure Storage into the Azure SQL Data Warehouse, executing T-SQL with PolyBase to load and transform the data.

To pursue these tasks, you need to create the DB objects that are essential to map and import the data from the CSV file in Azure Storage. The file in question, BeachWeatherStationsAutomatedSensors.csv, is a small CSV file containing sensor data from various Lake Michigan beaches in Chicago, Illinois. 

SOURCE: Dataset was obtained from the city’s open data portal - https://data.cityofchicago.org.

STEP – 1: Get the Azure Storage Access Keys 


Since you are going to execute T-SQL queries that read from Azure Storage, the storage access details must be mapped into the query.

Move to the Azure portal, click Access keys under Settings on your Azure Storage account blade, then copy and note down the Storage account name and Key1.

Access Keys


Also copy the container name that appears in the Blobs list – 

Container


You should now have the following essential details – 
  1. Storage account name – democloudstorage
  2. Container name – sensors-data
  3. Key1 - ***************************************************************************************************************************=='


STEP – 2: Prepare Requisites Objects 


In this task, you will create a table for the weather data extracted from the weather file in Azure Storage. First, prepare the T-SQL query that creates the requisite objects; see the code snippet below.

CREATE MASTER KEY;

CREATE DATABASE SCOPED CREDENTIAL DemoCloudStorageCredential
WITH IDENTITY = 'user',
SECRET = '*****************************************************************************************************************************==';

CREATE EXTERNAL DATA SOURCE DemoCloudStorage WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://sensors-data@democloudstorage.blob.core.windows.net',
    CREDENTIAL = DemoCloudStorageCredential
);

CREATE EXTERNAL FILE FORMAT TextFile WITH (
    FORMAT_TYPE = DelimitedText,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);
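One caveat worth noting: the file format above reads every row, including any header line, which can cause conversion errors on typed columns. If your copy of the CSV carries a header row, a variant file format can skip it. A sketch using the FIRST_ROW option, which the DelimitedText format supports in Azure SQL Data Warehouse (the format name TextFileWithHeader is just an example):

```sql
-- Alternative file format that skips a header row in the CSV.
CREATE EXTERNAL FILE FORMAT TextFileWithHeader WITH (
    FORMAT_TYPE = DelimitedText,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',
        FIRST_ROW = 2  -- start reading at row 2, skipping the header line
    )
);
```

If you create this variant, reference it in the external table's FILE_FORMAT clause instead of TextFile.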

Here you can see that the access key, storage container, and storage account are mapped into the T-SQL statements, like – 
  • SECRET = Access Key 1
  • LOCATION = wasbs://StorageContainer@StorageAccount.blob.core.windows.net


NOTE: Make sure to enter these details exactly as they appear in the Azure portal, because the account, container, and key fields are case-sensitive.

After syncing up the details, execute the T-SQL query to create the necessary objects before proceeding to create an external table – 

PolyBase T-SQL


You will notice that the DemoCloudStorage and TextFile objects appear under External Data Sources and External File Formats respectively.

Object Explorer


You can execute a couple of queries to validate new objects by querying the following catalog views – 
  • sys.database_scoped_credentials
  • sys.external_data_sources
  • sys.external_file_formats


SELECT * FROM sys.database_scoped_credentials
SELECT * FROM sys.external_data_sources
SELECT * FROM sys.external_file_formats

Select queries

STEP – 3: Create an External Table 


In the previous steps, you synced up the details and created the external data source; now it is time to create the external table for loading the weather data. Prepare the following query with the data source mapping and execute it in a query window connected to the SQL Data Warehouse.

CREATE EXTERNAL TABLE dbo.BeachSensorsExternal (
    StationName VARCHAR(50) NOT NULL,
    MeasurementTimestamp VARCHAR(50) NOT NULL,
    AirTemperature DECIMAL(9,2) NULL,
    WetBulbTemperature DECIMAL(9,2) NULL,
    Humidity DECIMAL(9,2) NULL,
    RainIntensity DECIMAL(9,2) NULL,
    IntervalRain DECIMAL(9,2) NULL,
    TotalRain DECIMAL(9,2) NULL,
    PrecipitationType DECIMAL(9,2) NULL,
    WindDirection DECIMAL(9,2) NULL,
    WindSpeed DECIMAL(9,2) NULL,
    MaximumWindSpeed DECIMAL(9,2) NULL,
    BarometricPressure DECIMAL(9,2) NULL,
    SolarRadiation DECIMAL(9,2) NULL,
    Heading DECIMAL(9,2) NULL,
    BatteryLife DECIMAL(9,2) NULL,
    MeasurementTimestampLabel VARCHAR(50) NOT NULL,
    MeasurementID VARCHAR(100) NOT NULL
)
WITH (
    LOCATION = '/',
    DATA_SOURCE = DemoCloudStorage,
    FILE_FORMAT = TextFile
);

Create External Table

It will create the external table; you can validate it in the Object Explorer.

Object Explorer


You can also validate the data; execute the following SELECT query to retrieve weather data from the external table. It displays the top 10 records of the weather data loaded from the weather file in Azure Storage.

SELECT TOP 10 * FROM dbo.BeachSensorsExternal;

Select query
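As a further sanity check, the external table's wiring can be confirmed from the catalog views. The following is a sketch joining sys.external_tables to the data source and file format created earlier; for this walk-through you should see BeachSensorsExternal mapped to DemoCloudStorage and TextFile:

```sql
-- Show each external table with the data source and file format it maps to.
SELECT t.name  AS external_table,
       ds.name AS data_source,
       ff.name AS file_format
FROM sys.external_tables t
JOIN sys.external_data_sources ds ON t.data_source_id = ds.data_source_id
JOIN sys.external_file_formats ff ON t.file_format_id = ff.file_format_id;
```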


STEP – 4: Create another Table 


In the previous steps, you successfully created an external table over the raw weather data. Now you will create another table in the SQL Data Warehouse that transforms and loads the data; see the following code snippet – 

CREATE TABLE [dbo].[BeachSensor]
WITH (
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
) AS
SELECT
    StationName,
    CAST(MeasurementTimestamp AS DATETIME) AS MeasurementDateTime,
    AirTemperature,
    WetBulbTemperature,
    Humidity,
    RainIntensity,
    IntervalRain,
    TotalRain,
    PrecipitationType,
    WindDirection,
    WindSpeed,
    MaximumWindSpeed,
    BarometricPressure,
    SolarRadiation,
    Heading,
    BatteryLife
FROM dbo.BeachSensorsExternal;

SELECT COUNT(*) FROM dbo.BeachSensor;

Here, the T-SQL statement creates a new table named BeachSensor from a subset of columns in the raw external table, casting the timestamp to DATETIME along the way. After the transformed data is loaded into the BeachSensor table, a SELECT query displays the table's record count.
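Once the data sits in the internal BeachSensor table, the staging objects are no longer strictly needed. An optional cleanup sketch, using the object names from this walk-through (drop order matters: the external table must go before the file format, data source, and credential it depends on):

```sql
-- Optional cleanup of the PolyBase staging objects.
DROP EXTERNAL TABLE dbo.BeachSensorsExternal;
DROP EXTERNAL FILE FORMAT TextFile;
DROP EXTERNAL DATA SOURCE DemoCloudStorage;
DROP DATABASE SCOPED CREDENTIAL DemoCloudStorageCredential;
```

Keep the objects if you plan to reload or refresh the data from the same blob container later.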

Create Table


Congratulations, loading data into the Azure SQL Data Warehouse is done! 😊

In this walk-through, we uploaded a raw data file into an Azure blob container and provisioned an Azure SQL Data Warehouse. Then, using PolyBase T-SQL queries, we loaded the raw data into the SQL Data Warehouse and verified it.

Keep visiting for further posts.

