Azure Data Factory (ADF) – An Introduction
Microsoft Azure Data Factory (ADF) is a fully managed, cloud-based data integration service that lets you create data-driven workflows for orchestrating and automating data movement and data transformation. It is a platform somewhat like SSIS in the cloud, built for complex hybrid Extract-Transform-Load (ETL), Extract-Load-Transform (ELT) and data integration projects.
Azure Data Factory lets companies take raw big data from diverse sources, including relational, non-relational and other storage systems, and integrate it into data-driven workflows, helping them plan ahead, meet their goals and drive business value from the data they possess.
Using Azure Data Factory, you can create and schedule data-driven workflows (called pipelines) that ingest data from disparate data stores and process or transform it with compute services such as Azure HDInsight (Hadoop, Spark), Azure Data Lake Analytics and Azure Machine Learning.
In brief, you can turn raw data into finished, shaped data ready for consumption by business intelligence tools or custom applications, and just as easily lift your SQL Server Integration Services (SSIS) packages into the Azure ecosystem.
Pipelines – Data-Driven Workflows
A pipeline is a logical grouping of activities that together perform a task. For example, a pipeline could contain a set of activities that ingest and clean log data and then kick off a Spark job on an HDInsight cluster to analyze that data.
The advantage is that the pipeline lets you manage the activities as a set instead of individually: you deploy and schedule the pipeline rather than each activity on its own.
Pipelines, the data-driven workflows of Azure Data Factory, typically perform the following four steps –
- Connect and Collect
- Transform & Enrich
- Publish
- Monitor
Connect and Collect
In an enterprise scenario, data exists in various forms (structured, semi-structured and unstructured) and in disparate locations, both on-premises and in the cloud. Connect and collect is the first step: connect to all the required data sources and move the data to a centralized location for subsequent processing.
Transform and Enrich
Once the data is centralized, the next step is to process or transform it using services such as HDInsight Hadoop, Spark and Data Lake Analytics, producing the data the downstream ecosystem needs.
Publish
After transformation, the next step is to publish the data to a dedicated store such as Azure SQL Data Warehouse or Azure SQL Database, from where it can be consumed by any business intelligence tool.
Monitor
Once the pipelines have been scheduled and are delivering business value from the refined data, the final step is to monitor the scheduled activities and pipelines.
Activity – Processing Steps
We have seen that a pipeline encapsulates a data flow that copies and transforms data as it moves from one place to another, and that to make this happen the pipeline executes a number of activities.
An activity in a pipeline defines an action or processing step to perform on the data. For example, you may use a Copy Activity to copy data from an on-premises SQL Server to Azure Blob Storage, and then a Hive Activity that runs a Hive script on an Azure HDInsight cluster to transform the data from blob storage into output data.
Azure Data Factory supports three types of activities –
- Data movement activities
- Data transformation activities
- Control activities
Data movement activities
In Azure Data Factory, the Copy Activity copies data between on-premises and cloud data stores. Once the data is copied, it can be further transformed and analyzed; you can also use the Copy Activity to publish transformation and analysis results for business intelligence (BI) tools and applications to consume.
In brief, the Copy Activity copies data from a source data store to a sink data store. Azure Data Factory supports many categories of data store (a minimal code sketch follows the list), such as –
- Azure
- Database
- NoSQL
- File
- Services and Apps, etc.
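To make the data movement idea concrete, here is a minimal sketch using the azure-mgmt-datafactory Python SDK: a Copy Activity that reads from one blob dataset and writes to another, grouped into a pipeline. The dataset, pipeline, resource group and factory names are placeholders, and exact constructor signatures can vary between SDK versions.

```python
from azure.mgmt.datafactory.models import (
    CopyActivity, DatasetReference, BlobSource, BlobSink, PipelineResource
)

# Copy from a source blob dataset to a sink blob dataset.
copy_activity = CopyActivity(
    name="CopyLogsToStaging",
    inputs=[DatasetReference(reference_name="BlobInputDataset")],
    outputs=[DatasetReference(reference_name="BlobOutputDataset")],
    source=BlobSource(),   # how to read from the source store
    sink=BlobSink(),       # how to write to the sink store
)

# Group the activity into a pipeline; the pipeline is deployed and scheduled as one unit.
pipeline = PipelineResource(activities=[copy_activity], parameters={})
# With an authenticated DataFactoryManagementClient named adf_client:
# adf_client.pipelines.create_or_update("my-rg", "my-adf", "CopyPipeline", pipeline)
```

The commented create_or_update call shows how the whole pipeline would be deployed in one step, which is exactly what lets you manage the activities as a single set.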
Data transformation activities
Data transformation activities transform and process raw data to produce predictions and insights. A transformation activity executes in a specific compute environment such as an Azure HDInsight cluster or Azure Batch.
Azure Data Factory supports multiple transformation activities, which can be added to pipelines either individually or chained with other activities (a sketch of a Hive activity follows the list), such as –
- HDInsight activity (Hive, Pig, MapReduce, Spark etc.)
- Azure Machine Learning activity
- Stored Procedure activity
- Custom activity, etc.
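As a hedged illustration of a transformation activity, the sketch below assumes the HDInsightHiveActivity model from the azure-mgmt-datafactory Python SDK; the linked service names, script path and Hive variables are placeholders, and parameter names may differ between SDK versions.

```python
from azure.mgmt.datafactory.models import HDInsightHiveActivity, LinkedServiceReference

hive_activity = HDInsightHiveActivity(
    name="TransformLogsWithHive",
    # Compute environment: an HDInsight cluster registered as a linked service.
    linked_service_name=LinkedServiceReference(reference_name="HDInsightLinkedService"),
    # The Hive script lives in blob storage, reachable via a storage linked service.
    script_path="scripts/partition-logs.hql",
    script_linked_service=LinkedServiceReference(reference_name="StorageLinkedService"),
    # Key/value pairs exposed to the script as Hive variables.
    defines={"inputPath": "rawlogs/", "outputPath": "curatedlogs/"},
)
# The activity can then be added to a pipeline's activities list, on its own
# or chained after a Copy Activity.
```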
Control activities
Control activities manage the flow of execution within a pipeline and its activities, for example looping over a collection or branching on a condition. Azure Data Factory supports multiple control activities (a ForEach sketch follows the list), such as –
- ForEach Activity
- Web Activity
- If Condition Activity
- Lookup Activity
- Until Activity etc.
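As a rough sketch of a control activity, the example below uses the ForEachActivity model from the azure-mgmt-datafactory Python SDK to loop over a hypothetical pipeline parameter ("fileNames") and run a copy step for each item; the names are illustrative and signatures may vary by SDK version.

```python
from azure.mgmt.datafactory.models import (
    ForEachActivity, Expression, CopyActivity, DatasetReference, BlobSource, BlobSink
)

# The activity that will run once per item in the loop.
inner_copy = CopyActivity(
    name="CopySingleFile",
    inputs=[DatasetReference(reference_name="BlobInputDataset")],
    outputs=[DatasetReference(reference_name="BlobOutputDataset")],
    source=BlobSource(),
    sink=BlobSink(),
)

for_each = ForEachActivity(
    name="CopyEachFile",
    items=Expression(value="@pipeline().parameters.fileNames"),  # collection to iterate
    activities=[inner_copy],                                     # executed for each item
)
```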
Datasets and Linked Services
We have seen that an Azure Data Factory can have one or more pipelines, that a pipeline is a logical grouping of activities that together accomplish a task, and that the activities in a pipeline define the actions to perform on the data. Now it is time to look at datasets and linked services.
Datasets
In this context, a dataset is a named view of data that points to the exact data you want the associated activities to use as input or produce as output. An activity can take zero or more input datasets and produce one or more output datasets.
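Here is a minimal sketch of a dataset definition, assuming the AzureBlobDataset model in the azure-mgmt-datafactory Python SDK; the linked service name, folder path and file name are placeholders.

```python
from azure.mgmt.datafactory.models import (
    AzureBlobDataset, DatasetResource, LinkedServiceReference
)

# A named view over a specific blob folder/file in a storage account that is
# already registered as a linked service (see the next section).
blob_dataset = AzureBlobDataset(
    linked_service_name=LinkedServiceReference(reference_name="StorageLinkedService"),
    folder_path="rawlogs/2019/",
    file_name="events.csv",
)

dataset_resource = DatasetResource(properties=blob_dataset)
# With an authenticated DataFactoryManagementClient named adf_client:
# adf_client.datasets.create_or_update("my-rg", "my-adf", "BlobInputDataset", dataset_resource)
```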
See the following diagram, which illustrates the relationship between pipeline, activity and dataset in the Azure Data Factory ecosystem.
Linked Services
Before you can create a dataset, you must define a linked service that links the data store to the data factory. In Azure Data Factory, linked services are much like connection strings: they define the connection information needed to connect to external resources.
For example, an Azure Storage linked service links a storage account to the data factory, while an Azure Blob dataset represents the blob container and folder within that storage account that contain the input blobs to be processed.
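A minimal sketch of such an Azure Storage linked service, assuming the AzureStorageLinkedService model in the azure-mgmt-datafactory Python SDK; the account name and key are placeholders.

```python
from azure.mgmt.datafactory.models import (
    AzureStorageLinkedService, LinkedServiceResource, SecureString
)

# The "connection string" that tells the data factory how to reach the storage account.
storage_linked_service = AzureStorageLinkedService(
    connection_string=SecureString(
        value="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
    )
)

linked_service_resource = LinkedServiceResource(properties=storage_linked_service)
# With an authenticated DataFactoryManagementClient named adf_client:
# adf_client.linked_services.create_or_update("my-rg", "my-adf", "StorageLinkedService", linked_service_resource)
```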
The following diagram illustrates the relationships among pipeline, activity, dataset and linked service in the Azure Data Factory ecosystem.
Azure Data Factory Benefits
Talking about the benefits of Azure Data Factory, the following four points describe the payback –
- Productive - Move data seamlessly from more than sixty sources without writing code.
- Hybrid - Build data integration pipelines which span on-premises and cloud.
- Trusted - Data movement with Azure Data Factory has been certified against compliance standards such as HIPAA/HITECH and ISO/IEC 27001.
- Scalable - Build server-less, cloud-based data integration with no infrastructure to manage.
Looking at it another way, the following characteristics describe what comes with Azure Data Factory –
Visual drag-and-drop UI
Use the code-free drag-and-drop interface to build, deploy, monitor and manage data integration pipelines, getting them up and running quickly and maximizing productivity.
Multiple Language Support
You can either use the visual interface or write your own code in Python, .NET, or ARM templates to build pipelines, depending on your skill set.
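For the Python route, a minimal sketch of creating the management client and the data factory itself with the azure-identity and azure-mgmt-datafactory packages is shown below; the tenant, service principal, subscription, resource group and region values are placeholders.

```python
from azure.identity import ClientSecretCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

# Authenticate with a service principal (placeholder values).
credential = ClientSecretCredential(
    tenant_id="<tenant-id>",
    client_id="<service-principal-id>",
    client_secret="<service-principal-secret>",
)
adf_client = DataFactoryManagementClient(credential, "<subscription-id>")

# Create (or update) the data factory, then reuse adf_client to deploy the
# linked services, datasets and pipelines sketched in the earlier sections.
adf_client.factories.create_or_update("my-rg", "my-adf", Factory(location="eastus"))
```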
SSIS package execution in Azure
You can lift your existing SSIS packages into the Azure ecosystem and execute and schedule them in a managed execution environment.
Code-free data movement
You can connect to data globally using more than seventy supported connectors, including Azure data services, AWS S3 and Redshift, SAP HANA, Oracle, DB2, MongoDB, etc.
Comprehensive control flow
It provides looping, branching, conditional constructs, on-demand execution and flexible scheduling through extensive control flow constructs.
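As a hedged sketch of the scheduling side, the example below assumes the ScheduleTrigger models in the azure-mgmt-datafactory Python SDK and attaches a daily recurrence to a hypothetical pipeline named "CopyPipeline"; model and method names may vary by SDK version.

```python
from datetime import datetime
from azure.mgmt.datafactory.models import (
    ScheduleTrigger, ScheduleTriggerRecurrence, TriggerResource,
    TriggerPipelineReference, PipelineReference
)

# Run once per day, starting from the given date.
recurrence = ScheduleTriggerRecurrence(
    frequency="Day",
    interval=1,
    start_time=datetime(2019, 1, 1),
)

trigger = ScheduleTrigger(
    recurrence=recurrence,
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(reference_name="CopyPipeline"),
        parameters={},
    )],
)

trigger_resource = TriggerResource(properties=trigger)
# With an authenticated DataFactoryManagementClient named adf_client:
# adf_client.triggers.create_or_update("my-rg", "my-adf", "DailyTrigger", trigger_resource)
# adf_client.triggers.begin_start("my-rg", "my-adf", "DailyTrigger")  # start the trigger
```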
Data integration with Azure Data Factory
In brief, Azure Data Factory is a workflow system for organizing data flows between storage and processing systems; it enables building, scheduling and monitoring of hybrid data pipelines. It executes a number of activities that take datasets as input and deliver output datasets.
Azure Data Factory performs functions similar to those of an ETL tool, though it is designed especially to move large volumes of data between cloud and on-premises environments. In short, the data integration process with Azure Data Factory looks like this –
[1] Access and ingest data with built-in connectors
First, move data from on-premises as well as cloud sources to a centralized data store in the cloud for further analysis, using the Copy Data activity in a data pipeline.
[2] Build scalable data flow with codeless UI, or write your own code
Next, build the data integration and transformation logic, combining big data processing and machine learning, either with the codeless visual interface or by writing custom code in a language you know.
[3] Schedule, run and monitor your pipelines
Finally, invoke the pipelines that contain the required activities, either on demand or on a trigger-based schedule. You can monitor pipeline activities visually, with logging and pipeline run history, and track down the sources of errors.
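Here is a minimal sketch of an on-demand run and a simple status poll, using the azure-identity and azure-mgmt-datafactory Python packages; the credential values, resource group, factory and pipeline names are placeholders.

```python
import time
from azure.identity import ClientSecretCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

credential = ClientSecretCredential(
    tenant_id="<tenant-id>",
    client_id="<service-principal-id>",
    client_secret="<service-principal-secret>",
)
adf_client = DataFactoryManagementClient(credential, "<subscription-id>")

# Kick off an on-demand run of an already-deployed pipeline.
run_response = adf_client.pipelines.create_run(
    "my-rg", "my-adf", "CopyPipeline", parameters={}
)

# Poll the run until it reaches a terminal state, then report the outcome.
while True:
    pipeline_run = adf_client.pipeline_runs.get("my-rg", "my-adf", run_response.run_id)
    if pipeline_run.status not in ("Queued", "InProgress"):
        break
    time.sleep(30)

print("Pipeline run finished with status:", pipeline_run.status)
```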
To summarize, with the Azure Data Factory service you can take your big data workflow and encapsulate it in a pipeline, and that pipeline includes all the activities needed to copy and process your data and land it in the destination where you need it.
You can also schedule those activities so that your pipeline runs on a recurring basis whenever you need to repeat the same batch transformations on a regular cadence. In this post we have gone through the concepts and artifacts of Azure Data Factory; in the next post we will dig into some hands-on activities on top of Azure Data Factory (ADF).
Stay in touch for further posts!!