Azure Data Factory and Data Pipeline – Hands-on Activity
In one of the previous posts on Azure Data Factory hands-on activities, we walked through provisioning an Azure Data Factory and copying data from a file in Azure Blob Storage to a table in an Azure SQL Database using the Copy Wizard. Along the way, we also covered a few related tasks, such as creating a Storage Account and a Blob Container and setting up an Azure SQL Database for the Copy Wizard exercise.
Next in this series, we will extend the hands-on activities on top of Azure Data Factory to –
- Implement a data pipeline.
- Copy data from a file in Azure Blob Storage to a table in an Azure SQL Database using the pipeline.
Prerequisites
Before moving ahead, we need a few prerequisites in place to accomplish these Azure Data Factory hands-on activities on Azure.
The previous article already covered provisioning several of these resources, so from here on it is assumed you have completed those steps. If not, please go back to that article and work through the preliminary hands-on activities first.
- An Azure subscription; if you don't have an account, sign up for a free Azure account – https://azure.microsoft.com/en-gb/free/
- The lab files required for this task, downloaded from the GitHub repository – https://github.com/rajendraazure/azure-data-factory-copy-wizard.
- The previously provisioned Storage Account and Blob Container.
- The previously provisioned Azure SQL Database.
- The previously provisioned Azure Data Factory.
Validate the SQL Database and prepare the Table
In this walk-through, you will reuse all of the previously provisioned resources, although it is better to validate them before proceeding. To avoid any mix-up, you also need to prepare the SQL table again by deleting the records that were inserted by the Azure Data Factory Copy Wizard in the previous post.
Log in to the Azure portal – https://portal.azure.com/
STEP – 1: Launch SQL Database
In the Microsoft Azure portal, click All resources in the hub menu and then click your Azure SQL Database.
This loads the SQL database blade for the selected database; you can either connect to this database remotely or use the Azure web-based query interface.
STEP – 2: Launch Query editor (preview)
Next, click Query editor (preview) in the database blade. This launches a login blade; submit the same administrator credentials that you provided when provisioning the SQL database.
If the credentials are correct, you will be logged in and the query blade will load, showing all the database objects on the left side and a new query window on the right.
STEP – 3: Execute and prepare the SQL Table
In the query editor, write the following T-SQL query to truncate and prepare the dbo.transactions table in your SQL database –
TRUNCATE TABLE dbo.transactions;
Next, click Run to execute the T-SQL statement.
Run the following command to verify that the table is empty again -
SELECT * FROM dbo.transactions;
Validate the existing Linked Services
An Azure Data Factory uses linked services to access, transform, and store data; you already used them in the previous article through the Copy Wizard.
For the pipeline approach, you require linked services for the blob storage account where the source data is stored, as well as for the Azure SQL Database containing the table you want to load.
Move to the Microsoft Azure Portal.
STEP – 1: Launch Azure Data Factory
In the Microsoft Azure portal, click All resources in the hub menu and then click your Azure Data Factory.
This loads the Data Factory blade for the selected Azure Data Factory; here, click the Author and deploy tile.
STEP – 2: Validate existing Linked services
Clicking the Author and deploy tile opens the next blade, with all the configurable items such as linked services, datasets, and pipelines.
In the pane on the left, expand the first option, Linked services, and notice that two linked services named destination-sqldatabase-linked-service and source-blobstorage-linked-service are already defined for the SQL database and the blob store.
These linked services were created by the Copy Data Wizard in the previous hands-on activity.
Next, click the blob linked service source-blobstorage-linked-service; its JSON definition should look something like this –
{
    "name": "source-blobstorage-linked-service",
    "properties": {
        "hubName": "democopydatafactory_hub",
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=dfactorystorageaccount;AccountKey=**********"
        }
    }
}
Similarly, click the SQL linked service destination-sqldatabase-linked-service; its JSON definition should look something like this –
{
    "name": "destination-sqldatabase-linked-service",
    "properties": {
        "hubName": "democopydatafactory_hub",
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": "Data Source=datafactorydbserver.database.windows.net;Initial Catalog=datafactorydb;Integrated Security=False;User ID=dbadmin;Password=**********;Connect Timeout=30;Encrypt=True"
        }
    }
}
Create and configure Datasets
Datasets define the schema for the data flows: the data to be extracted from a source and the data to be loaded into a destination, or sink. In other words, a dataset is a named view of data that refers to the precise data you want the associated activities to use as input or output.
Visit the earlier post on these concepts to learn about all the other relevant bits and pieces such as pipelines, activities, datasets, and linked services.
You must now create datasets to define the text file data source and the database table that will be used in your pipeline.
STEP – 1: Create a source Dataset (Azure Blob)
Click the More context menu, which displays a couple of options, and then click New dataset.
Clicking New dataset opens a further context menu with multiple options; click Azure Blob storage to create a new document for an Azure Blob store dataset.
STEP – 2: Configure source Dataset JSON
Clicking Azure Blob storage launches a new JSON document. In it, replace the default code with the following code, which you can copy and paste from the transactions-txt.json file available in the GitHub repository –
{
    "name": "transactions-txt",
    "properties": {
        "structure": [
            {
                "name": "tdate",
                "type": "Datetime"
            },
            {
                "name": "amount",
                "type": "Decimal"
            }
        ],
        "type": "AzureBlob",
        "linkedServiceName": "source-blobstorage-linked-service",
        "typeProperties": {
            "folderPath": "dfactorycontainer/data/",
            "format": {
                "type": "TextFormat",
                "columnDelimiter": ",",
                "firstRowAsHeader": true
            }
        },
        "external": true,
        "availability": {
            "frequency": "Minute",
            "interval": 15
        }
    }
}
This JSON defines a schema for comma-delimited text containing a tdate column of DateTime values and an amount column of Decimal values. It also specifies that the data resides in the dfactorycontainer/data folder of the blob storage account defined by the source-blobstorage-linked-service linked service, and that new data will be available every fifteen minutes.
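Based on this definition, the source file in the dfactorycontainer/data folder is expected to be a comma-delimited text file with a header row, along the following lines (the values shown here are purely illustrative; refer to the actual lab file in the GitHub repository for the real data) –
tdate,amount
2017-01-01T00:00:00,129.99
2017-01-02T00:00:00,45.50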
Once you validate the JSON parameters and values, deploy the new dataset definition to your Azure Data Factory by clicking the Deploy button that appears in the top menu.
You will be notified shortly once the new dataset has been created successfully. You can also verify this by checking that a new dataset named transactions-txt now exists under the Datasets option in the left pane.
STEP – 3: Create and configure destination Dataset (Azure SQL)
As with the source dataset, you need to create and configure a destination dataset for the Azure SQL Database. Once again, click the More context menu, click New dataset from the displayed options, and then click Azure SQL to create a new JSON document for an Azure SQL Database dataset.
In the new JSON document, replace the default code with the following code, which you can copy and paste from the dbo-transactions.json file available in the GitHub repository –
{
    "name": "dbo-transactions",
    "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": "destination-sqldatabase-linked-service",
        "structure": [
            {
                "name": "tdate",
                "type": "Datetime"
            },
            {
                "name": "amount",
                "type": "Decimal"
            }
        ],
        "typeProperties": {
            "tableName": "dbo.transactions"
        },
        "availability": {
            "frequency": "Minute",
            "interval": 15
        }
    }
}
This JSON defines a schema for a table named dbo.transactions containing a tdate column of DateTime values and an amount column of Decimal values, in the database defined by the destination-sqldatabase-linked-service linked service. The dataset can be updated every fifteen minutes.
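The dbo.transactions table itself should already exist in the database from the previous post. For reference, a table definition along the following lines would match the dataset structure above; the decimal precision and nullability shown here are assumptions, so treat this only as a sketch in case you ever need to recreate the table –
-- Sketch of the assumed dbo.transactions table (it should already exist from the
-- previous post); the decimal precision here is an assumption, not the exact original.
CREATE TABLE dbo.transactions
(
    tdate  datetime,
    amount decimal(10, 2)
);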
Once you validate the JSON parameters and values, deploy the new dataset definition to your Azure Data Factory by clicking the Deploy button that appears in the top menu.
You will be notified shortly once the new dataset has been created successfully. You can also verify this by checking that a new dataset named dbo-transactions now exists under the Datasets option in the left pane.
Create and configure Pipeline
A pipeline encapsulates a data flow, which includes copying the data and transforming it as it moves from one place to another. Now that you have defined the linked services and datasets for your data flow, you can create a pipeline to encapsulate it.
In other words, a pipeline defines a series of activities that use linked services to operate on datasets. In this case, your pipeline will consist of a single Copy activity that copies data from the source-blobstorage-linked-service linked service to the destination-sqldatabase-linked-service linked service.
Again, you can visit the earlier post to learn about all the other relevant bits and pieces such as pipelines, activities, datasets, and linked services.
STEP – 1: Create a Pipeline
Click the More context menu and then click New pipeline from the available options.
STEP – 2: Configure Pipeline JSON
In the new JSON document, replace the default code with the following code, which you can copy and paste from the copytrans.json file available in the GitHub repository –
{
    "name": "copytrans",
    "properties": {
        "activities": [{
            "type": "Copy",
            "typeProperties": {
                "source": {
                    "type": "BlobSource",
                    "recursive": false
                },
                "sink": {
                    "type": "SqlSink",
                    "writeBatchSize": 0,
                    "writeBatchTimeout": "00:00:00"
                },
                "translator": {
                    "type": "TabularTranslator",
                    "columnMappings": "tdate:tdate,amount:amount"
                }
            },
            "inputs": [{
                "name": "transactions-txt"
            }],
            "outputs": [{
                "name": "dbo-transactions"
            }],
            "name": "CopyTransData"
        }],
        "pipelineMode": "OneTime"
    }
}
This JSON defines a pipeline containing a Copy activity that copies the transactions-txt dataset in the blob store to the dbo.transactions table in the SQL database.
Once you validate the JSON parameters and values, deploy the new pipeline definition to your Azure Data Factory by clicking the Deploy button that appears in the top menu.
You will be notified shortly once the new pipeline has been created successfully. You can also verify this by checking that a new pipeline named copytrans now exists under the Pipelines option in the left pane.
Validate the Pipeline Status
Now that all the datasets and the pipeline have been deployed successfully, it is time to validate and monitor the pipeline status using the Azure portal.
Move to your Azure Data Factory blade, and click the Pipelines tile.
Clicking the Pipelines tile launches the Pipelines blade. Here you can see that two pipelines exist: the one created previously by the Copy Data Wizard, and the newly created copytrans pipeline with a Pipeline State of Running.
Next, click the copytrans pipeline and observe the state of the activities it contains. Initially it may contain no activities while the datasets are validated and the pipeline is prepared.
For the moment, close the Pipelines blade and wait for a few minutes, approximately 15 minutes. Then reopen the Pipelines blade and the copytrans blade to view its state. When the pipeline has run successfully, its State in the copytrans blade will show as Ready.
At the same time, its Pipeline State in the Pipelines blade will be set to Completed.
Verify the Data has been copied to the Azure SQL Database
Now that your pipeline has run without any exceptions, it is time to verify that the data has been copied to the SQL database. Return to the Query editor (preview) from the Azure SQL Database blade and submit the administrator credentials.
After logging in, run the following query to verify that data has been inserted into the table –
SELECT * FROM dbo.transactions;
You can see that the table now contains the same transaction data that was copied from the text file in the Azure Blob store.
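Optionally, you can also run a quick summary query as a sanity check; the row count should match the number of data rows in the source text file –
-- Optional sanity check: row count, date range, and total amount of the copied data
SELECT COUNT(*)    AS loaded_rows,
       MIN(tdate)  AS earliest_transaction,
       MAX(tdate)  AS latest_transaction,
       SUM(amount) AS total_amount
FROM dbo.transactions;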
Congratulations, the Data Pipeline exercise is complete! 😊
With this, we have covered one more hands-on activity in the context of Azure Data Factory, using a data pipeline to copy data from Azure Blob Storage to an Azure SQL Database. Along the way, we revisited some earlier concepts such as pipelines, linked services, and datasets, and worked with their JSON definitions.
Keep visiting for further posts.