Thursday, May 25, 2023

Part 1: Applying DevOps, GitOps principles for data engineering / analytics use-cases


In this series, we will take the 12-factor app principles and apply them to a typical data engineering use-case.

The first principle is Codebase. It states that one codebase (typically one branch) is to be maintained, which takes the various configuration settings, parameters, and secrets as references and deploys to multiple environments, as shown in the figure below.


reference : developer.ibm.com

  

The software engineering and DevOps perspective is often overlooked in analytics use-cases. Most data engineers and data scientists give priority to the data itself and to building pipelines that move data from one layer to another for cleansing, validation, transformation, and enrichment, while DevOps is seen as something that applies only to application workloads.

This may stem from the perception that a pipeline can be replicated from one environment to another with far less hassle than deploying a microservice or a web app, thanks to the many UI-based ETL/ELT tools in the market (Talend, Azure Data Factory, etc.).

And it is not only data engineers and scientists: clients, product owners, and other stakeholders of a data use-case often do not recognize the value DevOps and GitOps can bring in terms of quality, consistency, and deployment automation.

Another contributing factor is the limited availability of DevOps frameworks and tools for data analytics workloads compared to the toolset available for web or service-based workloads.

Even when a team does take DevOps into consideration, in most cases it does not follow a GitOps deployment model.

Having multiple codebases (branches), one per environment (dev, test, prod), with hard-coded environment-specific parameters makes it difficult to merge from the Dev branch to Test or from Test to Prod. It violates the first principle of the 12-factor app.

Trunk-based development is the go-to option for achieving this single-codebase principle. This model designates one branch (usually 'main') as the trunk; other branches (feature/<feature name or developer name>) are short-lived, created for a change and destroyed after merging to the trunk and releasing the feature.
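The branching flow above can be sketched with plain git commands (repository, file, and branch names are purely illustrative):

```shell
# Trunk-based flow: 'main' is the trunk; a short-lived feature branch
# is created, merged back, and deleted after the merge.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main     # make 'main' the trunk
git config user.email dev@example.com
git config user.name dev
echo "pipeline v1" > pipeline.txt
git add . && git commit -qm "initial commit on trunk"

# short-lived feature branch
git checkout -qb feature/add-validation
echo "validation step" >> pipeline.txt
git commit -qam "add validation step"

# merge back to trunk, then destroy the feature branch
git checkout -q main
git merge -q --no-ff feature/add-validation -m "merge feature"
git branch -d feature/add-validation
```

The `--no-ff` merge keeps a visible merge commit on the trunk, which makes it easy to see which feature landed when.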

  

Environment specific branching 

 

Trunk-based branching approach

The trunk-based development model is widely used in Kubernetes and similar cloud-native development/deployment setups for microservice and web-app workloads. Data use-cases can follow the same model to bring in best practices for continuous integration and deployment.


In Azure DevOps, such trunk-based branching can be followed, and environment-specific substitutions can be handled with variable groups (one per environment) in the Library section. Each group maps to a key vault created per environment in its own resource group/subscription, depending on how the environments are segregated.

  


Library section in Azure DevOps with environment specific variable groups pointing to key vault of respective environment

From DevOps pipelines, these variable groups can be used to substitute environment-specific configurations and secrets in the deployment stages designated per environment.
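In pipeline YAML this looks roughly like the following sketch (variable-group, stage, and variable names are placeholders; each group is linked in the Library section to the key vault of its environment):

```yaml
# One codebase, one pipeline; environment-specific values come from
# Library variable groups backed by per-environment key vaults.
trigger:
  branches:
    include:
      - main

stages:
  - stage: DeployDev
    variables:
      - group: vg-dev        # linked to kv-dev
    jobs:
      - job: Deploy
        steps:
          - script: echo "Deploying to $(environmentName), server $(sqlServerName)"

  - stage: DeployTest
    dependsOn: DeployDev
    variables:
      - group: vg-test       # linked to kv-test
    jobs:
      - job: Deploy
        steps:
          - script: echo "Deploying to $(environmentName), server $(sqlServerName)"
```

The stages run the same steps; only the variable group bound to each stage changes, which is exactly the single-codebase idea.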

Database deployment task in DevOps

When it comes to database deployment, one could use the dacpac tasks available in Azure DevOps by default. While deploying changes from dev to test or from test to prod (with or without data, based on need), the database username, password, and database names can be referenced from key vault secrets mapped to variable groups. These variables can then be used in the DevOps pipelines.

SQL Database deployment task 
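A hedged sketch of such a dacpac step (task input values are placeholders resolved from the environment's variable group; verify the exact input names against the task reference):

```yaml
# Sketch: dacpac deployment with credentials pulled from key vault
# secrets via the stage's variable group.
- task: SqlAzureDacpacDeployment@1
  inputs:
    azureSubscription: $(serviceConnectionName)
    AuthenticationType: 'server'
    ServerName: $(sqlServerName)
    DatabaseName: $(sqlDatabaseName)
    SqlUsername: $(sqlAdminUser)        # key vault secret
    SqlPassword: $(sqlAdminPassword)    # key vault secret
    deployType: 'DacpacTask'
    DeploymentAction: 'Publish'
    DacpacFile: $(Pipeline.Workspace)/drop/MyDatabase.dacpac
```

Because the secrets come from the variable group, the same step works unchanged in the dev, test, and prod stages.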


The same goes for other data services like Data Factory, Databricks, Analysis Services, etc.

Data factory deployment task in DevOps

In the case of Data Factory, it is a bit more complicated, as one has to substitute many objects:


  • Global parameters
  • Linked services
  • Datasets
  • Pipelines
  • etc.

Adftools is a great tool for Data Factory DevOps.
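For reference, the stock ARM-template route looks roughly like the sketch below (adftools automates much of this); the parameter names for linked services are hypothetical, as they depend on the exported factory template:

```yaml
# Sketch: deploy exported ADF ARM templates with environment-specific
# overrides (factory name, linked service URLs, global parameters).
- task: AzureResourceManagerTemplateDeployment@3
  inputs:
    deploymentScope: 'Resource Group'
    azureResourceManagerConnection: $(serviceConnectionName)
    subscriptionId: $(subscriptionId)
    action: 'Create Or Update Resource Group'
    resourceGroupName: $(adfResourceGroup)
    location: $(azureRegion)
    csmFile: $(Pipeline.Workspace)/adf/ARMTemplateForFactory.json
    csmParametersFile: $(Pipeline.Workspace)/adf/ARMTemplateParametersForFactory.json
    deploymentMode: 'Incremental'
    overrideParameters: >-
      -factoryName $(dataFactoryName)
      -LS_DataLake_properties_typeProperties_url $(datalakeUrl)
```

Every global parameter, linked service, and dataset value that differs per environment becomes one more `overrideParameters` entry, which is why a purpose-built tool like adftools quickly pays off.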


Databricks deployment task in DevOps

In the case of a Databricks deployment, packaging the code and deploying it as a library on the cluster can be done using Azure DevOps.

Databricks DevOps
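As a rough sketch of what such a pipeline step runs (package and cluster names are hypothetical, and the commands assume a workspace already configured for the legacy Databricks CLI):

```shell
# Build the project as a wheel, copy it to DBFS, and install it
# on an existing cluster as a library.
python -m pip install --quiet build databricks-cli
python -m build --wheel                      # produces dist/*.whl
databricks fs cp dist/my_etl-0.1.0-py3-none-any.whl \
  dbfs:/libraries/my_etl-0.1.0-py3-none-any.whl --overwrite
databricks libraries install \
  --cluster-id "$CLUSTER_ID" \
  --whl dbfs:/libraries/my_etl-0.1.0-py3-none-any.whl
```

The cluster ID and authentication token would themselves come from the environment's variable group, keeping the step identical across stages.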

In upcoming blogs, we will look at each of these deployments in detail. This blog is just to introduce DevOps and GitOps practices for data engineering use-cases.

 




Wednesday, May 17, 2023

 

Databricks performance issue and resolution

 

 

Setting the Stage:

 

We are using VNET-injected Databricks on Azure for a data engineering use-case.

The infra team created the Databricks workspaces with VNET configuration in all environments through Terraform code and a DevOps pipeline.

 

Issue:

 

In our DEV environment, the Databricks cluster performs fine.

 

In DEV, it takes about 1 hour to run the end-to-end cleansing, validation, and enrichment process for a file with approximately 1000+ transactions.

 

The same process with a similar setup in the QA environment took about 4x as long to complete.

 

Troubleshooting:

 

Since the Databricks Terraform module code is the same for all environments, we ruled out infra-related issues at first.

 

After spending a lot of time troubleshooting, we raised a ticket with Azure support, which in turn engaged the Databricks support team.

 

First, the support team looked at the number of Python packages being installed on the cluster and suggested adding an init script that installs all libraries on every node during cluster initialization, avoiding the time taken to install them on each executor after cluster start. The following setting was also added to the cluster configuration:

 

databricks.libraries.enableSparkPyPI false 
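A cluster-scoped init script of the kind suggested might look like the sketch below (the package list is purely illustrative; the script is uploaded and then registered in the cluster configuration):

```shell
#!/bin/bash
# Illustrative cluster init script: pre-install pinned libraries on every
# node at cluster start, instead of notebook-time installs per executor.
set -e
/databricks/python/bin/pip install --no-cache-dir \
  "pandas==1.5.3" \
  "openpyxl==3.1.2"
```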

 

No luck with this suggestion.

 

Next, we shared with the Databricks support team detailed logs from running the pipeline in our DEV and QA environments. Nothing significant showed up when we compared the logs across environments.

 

Next, we ran a simple test in the DEV and QA environments: creating a few hundred items on the mounted storage location (a Delta Lake enabled blob storage) from which we read raw data and to which we write delivery data. The QA environment took a few seconds longer than DEV.
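A rough stand-in for that probe is a timing loop like the one below; here it writes to a local temp directory, but on Databricks the target would be the mount path (e.g. under /dbfs/mnt/):

```shell
# Time the creation of 100 small files; on Databricks, point TARGET at
# the mounted storage path instead of a local temp directory.
set -e
TARGET=$(mktemp -d)
start=$(date +%s)
for i in $(seq 1 100); do
  echo "probe $i" > "$TARGET/item_$i.txt"
done
end=$(date +%s)
echo "created 100 files in $((end - start))s"
```

Running the same loop in each environment and comparing the elapsed times surfaces per-operation latency that averages out in a long pipeline run.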

 

The Azure support team investigated the latency using the following commands from a notebook in both environments:


%sh nslookup ourdatalake.dfs.core.windows.net
%sh telnet (ipaddress returned from nslookup)



The IP address returned in DEV was reachable through telnet, but QA returned a different IP address. Telnet in DEV took only 0.1 seconds, whereas telnet to the QA IP took 4 seconds.

 

It turned out that the VNET peering in the QA environment pointed to a custom DNS server, which was not correct.

 

Though the Databricks Terraform code is the same for both environments, the VNET peering and configuration (IP address to DNS) differed in QA, which caused the latency. After re-doing the VNET peering to the correct DNS server, the issue was resolved.

 

Learnings:

 

- A standard troubleshooting rulebook alone will not always help. At times it requires pragmatic, out-of-the-box thinking and questioning the obvious.

 

- Do not rule any specific area completely off the table (in our case we ruled out infra issues at an early stage, citing the Terraform-created environments).

 

It took four parties (Databricks support, Azure support, the project team (us), and the client infra team) working together with transparency and a non-blaming attitude to eventually resolve the issue. (Good) Teamwork works!