Thursday, May 25, 2023

Part 1: Applying DevOps and GitOps principles to data engineering / analytics use-cases


In this series, we will take the 12-factor app principles and apply them to a typical data engineering use-case.

The first principle is Codebase. It states that a single codebase (typically one branch) is to be maintained; that codebase references configuration settings, parameters, and secrets externally and is deployed to multiple environments, as shown in the figure below.


Reference: developer.ibm.com

  

The software engineering and DevOps perspective is often overlooked in analytics use-cases. Most data engineers and data scientists give priority to the data and to building pipelines that move it from one layer to another for activities such as cleansing, validation, transformation, and enrichment, while the DevOps aspect is treated as if it applied only to application workloads.

This may stem from the perception that a data pipeline can be replicated from one environment to another with much less hassle than deploying a microservice or a web app between environments, thanks to the many UI-based ETL / ELT tools on the market (Talend, Azure Data Factory, etc.).

It is not only data engineers and scientists: even clients, product owners, and other stakeholders of a data use-case often do not recognize the value DevOps and GitOps can bring in achieving quality, consistency, and deployment automation.

Another contributing factor is the limited availability of DevOps frameworks and tools for data analytics workloads compared to the toolset available for web- or service-based workloads.

Even when a team does take DevOps into consideration, in most cases it may not follow a GitOps deployment model.

Maintaining a separate codebase (branch) per environment (dev, test, prod), each with hard-coded environment-specific parameters, makes merges from dev to test or from test to prod difficult. It also violates the first principle of the 12-factor app.

The trunk-based development model is the go-to option for achieving this single-codebase principle. It designates one branch (usually 'main') as the trunk; the other branches (feature/<feature name or developer name>) are short-lived, created for a change and deleted after merging to the trunk and releasing the feature.

  

Environment specific branching 

 

Trunk-based branching approach

The trunk-based development model is used in Kubernetes and similar cloud-native development / deployment workflows for microservice- and web-app-like workloads. Data use-cases can follow the same model to bring in best practices for continuous integration and deployment.
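As an illustration, a minimal Azure Pipelines YAML sketch of this idea could look like the following. One pipeline definition lives on the trunk and deploys through every environment; the stage names and echo steps are placeholders, not the actual deployment logic:

```yaml
# azure-pipelines.yml -- illustrative sketch: one trunk deploying to all environments
trigger:
  branches:
    include:
      - main            # the trunk; short-lived feature/* branches merge here via PRs

stages:
  - stage: Dev
    jobs:
      - job: DeployDev
        steps:
          - script: echo "Deploying to dev"   # placeholder for the real deployment steps

  - stage: Test
    dependsOn: Dev
    jobs:
      - job: DeployTest
        steps:
          - script: echo "Deploying to test"

  - stage: Prod
    dependsOn: Test
    jobs:
      - job: DeployProd
        steps:
          - script: echo "Deploying to prod"
```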


In Azure DevOps, such trunk-based branching can be followed, and for environment-specific substitutions one can leverage variable groups (one per environment) in the Library section, each mapped to a key vault created per environment in its own resource group / subscription, depending on how the environments are segregated.

  


The Library section in Azure DevOps, with environment-specific variable groups pointing to the key vault of the respective environment

From the DevOps pipelines, these variable groups can be used to substitute environment-specific configurations and secrets in the deployment stages designated for each environment.
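A minimal sketch of this, assuming one variable group per environment (the group names vg-dev / vg-test and the variable sqlServerName are hypothetical): each stage references its own group, and secrets linked from the key vault are resolved at runtime and masked in logs.

```yaml
# Sketch: each stage pulls in the variable group of its own environment
stages:
  - stage: Dev
    variables:
      - group: vg-dev        # hypothetical variable group linked to the dev key vault
    jobs:
      - job: Deploy
        steps:
          - script: echo "Deploying against $(sqlServerName)"   # resolved from the dev key vault

  - stage: Test
    dependsOn: Dev
    variables:
      - group: vg-test       # same variable names, test-environment values
    jobs:
      - job: Deploy
        steps:
          - script: echo "Deploying against $(sqlServerName)"
```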

Database deployment task in DevOps

When it comes to database deployment, one can use the dacpac tasks that are available in Azure DevOps by default. While deploying changes from dev to test or from test to prod (with or without data, based on need), the database username, password, and database name can be referenced from key vault secrets mapped to the variable groups; these variables can then be used in the DevOps pipelines.
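A hedged sketch of such a deployment step; the service connection name, variable names, and dacpac path are assumptions for illustration:

```yaml
# Sketch: deploy a dacpac with credentials resolved from a key-vault-backed variable group
- task: SqlAzureDacpacDeployment@1
  inputs:
    azureSubscription: 'my-azure-service-connection'   # hypothetical service connection
    ServerName: '$(sqlServerName)'                     # from the environment's variable group
    DatabaseName: '$(sqlDatabaseName)'
    SqlUsername: '$(sqlAdminUser)'
    SqlPassword: '$(sqlAdminPassword)'                 # key vault secret, masked in logs
    DacpacFile: '$(Pipeline.Workspace)/drop/MyDatabase.dacpac'
```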

SQL Database deployment task 


The same goes for other data services such as Data Factory, Databricks, Analysis Services, etc.

Data factory deployment task in DevOps

In the case of Data Factory, deployment is a bit more complicated, as one has to substitute many objects:


  • global parameters
  • linked services (connections to other services)
  • datasets
  • pipelines
  • etc.

Adftools is one great tool to use for Data Factory DevOps.
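As a rough sketch, adftools (the azure.datafactory.tools PowerShell module) could be invoked from an Azure PowerShell task along these lines; the repo folder layout, resource names, and stage name are assumptions:

```yaml
# Sketch: publish ADF objects from the repo using the azure.datafactory.tools module
- task: AzurePowerShell@5
  inputs:
    azureSubscription: 'my-azure-service-connection'   # hypothetical service connection
    azurePowerShellVersion: 'LatestVersion'
    ScriptType: 'InlineScript'
    Inline: |
      Install-Module -Name azure.datafactory.tools -Force -Scope CurrentUser
      # Publishes pipelines, datasets, linked services etc., applying per-stage config
      Publish-AdfV2FromJson -RootFolder "$(Build.SourcesDirectory)/adf" `
                            -ResourceGroupName "$(resourceGroupName)" `
                            -DataFactoryName "$(dataFactoryName)" `
                            -Location "$(location)" `
                            -Stage "test"
```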


Databricks deployment task in DevOps

In the case of a Databricks deployment, packaging the code and deploying it as a library on a cluster can be done using Azure DevOps.
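A minimal sketch, assuming the code is packaged as a Python wheel and deployed with the (legacy) Databricks CLI; the cluster id, workspace host, and token would come from the key-vault-backed variable group, and the package name is hypothetical:

```yaml
# Sketch: build a wheel and install it as a cluster library via the Databricks CLI
- script: |
    pip install wheel databricks-cli
    python setup.py bdist_wheel                       # assumes a setup.py in the repo
    databricks fs cp dist/my_package-0.1.0-py3-none-any.whl \
      dbfs:/libraries/my_package-0.1.0-py3-none-any.whl --overwrite
    databricks libraries install \
      --cluster-id "$(databricksClusterId)" \
      --whl dbfs:/libraries/my_package-0.1.0-py3-none-any.whl
  displayName: 'Build and deploy Databricks library'
  env:
    DATABRICKS_HOST: $(databricksHost)
    DATABRICKS_TOKEN: $(databricksToken)              # secret mapped explicitly into env
```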

Databricks DevOps

In upcoming blogs, we will look at each of these deployments in detail. This post is just an introduction to DevOps and GitOps practices for data engineering use-cases.

 



