Wednesday, May 17, 2023

 

Databricks performance issue and resolution

 

 

Setting the Stage:

 

We are using VNET-injected Databricks on Azure for a data engineering use case.

The infra team created the Databricks workspaces with VNET configuration in all environments through Terraform code and a DevOps pipeline.

 

Issue:

 

In our DEV environment, the Databricks cluster performs fine.

 

In DEV, it takes about 1 hour to run the end-to-end cleansing, validation, and enrichment process for a file with roughly 1000+ transactions.

 

The same process, with a similar setup in the QA environment, took about four times as long to complete the pipeline.

 

Troubleshooting:

 

Since the Databricks Terraform module code is the same for all environments, we initially ruled out any infra-related issues.

 

After spending a lot of time troubleshooting, we raised a ticket with Azure support and, in turn, with the Databricks support team.

 

First, the support team looked at the number of Python packages being installed on the cluster and suggested adding an init script so that all libraries are installed on every executor during cluster initialization, avoiding the time taken to install them on each executor after the cluster start event. The following was added to the cluster configuration:

 

databricks.libraries.enableSparkPyPI false 
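For reference, a cluster-scoped init script to pre-install Python libraries on every node would look roughly like the sketch below. This is only an illustration of the suggestion, not our actual script: the package names are placeholders and the /databricks/python/bin/pip path is an assumption based on typical Databricks runtime images.

#!/bin/bash
# Hypothetical cluster-scoped init script: pre-install the Python libraries
# the pipeline needs on every node at cluster startup, so they are not
# installed on each executor after the cluster start event.
# The package list below is illustrative only.
set -e
/databricks/python/bin/pip install --quiet pandas openpyxl requests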

 

No luck with this suggestion.

 

As the next step, we shared detailed logs with the Databricks support team from pipeline runs in our DEV and QA environments. Nothing significant stood out when we compared the logs from the two environments.

 

Next, we ran a simple test in the DEV and QA environments: creating a few hundred items on the mounted storage location (a Delta Lake enabled blob storage) from which we read raw data and to which we write delivery data. The QA environment took a few seconds longer than DEV.
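The test itself was nothing elaborate; something along these lines can be run from a notebook in each environment (a rough sketch, assuming the storage is mounted and accessible via the /dbfs FUSE path; /dbfs/mnt/datalake is a made-up path, not our real mount):

%sh
# Hypothetical timing test: create 100 small files on the mounted storage
# location via the /dbfs FUSE mount and measure how long it takes.
# Replace /dbfs/mnt/datalake with the actual mount point.
mkdir -p /dbfs/mnt/datalake/perf_test
time for i in $(seq 1 100); do
  echo "test $i" > /dbfs/mnt/datalake/perf_test/file_$i.txt
done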

 

The Azure support team investigated the performance difference by running the following commands from a notebook in both environments:


%sh nslookup ourdatalake.dfs.core.windows.net
%sh telnet <IP address returned by nslookup>



The IP address returned in DEV was reachable via telnet, but QA returned a different IP address. Telnet in DEV took only 0.1 seconds, whereas telnet to the QA IP took 4 seconds.

 

It turned out that the VNET peering in the QA environment pointed to a custom DNS server, which was not correct.
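A simple way to see which DNS servers the cluster nodes are actually querying is to inspect the resolver configuration from a notebook. This is a hedged check, assuming a standard Ubuntu-based Databricks runtime where /etc/resolv.conf reflects the VNET's DNS settings:

%sh
# Show the DNS servers configured on the cluster node; in a VNET-injected
# workspace these come from the VNET / peering DNS configuration.
cat /etc/resolv.conf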

 

Although the Databricks Terraform code is the same for both environments, the VNET peering and DNS configuration (IP address to DNS mapping) differed in the QA environment, which caused the latency. After the VNET peering was corrected to point at the right DNS server, the issue was resolved.

 

Learnings:

 

- A standard troubleshooting rulebook alone will not always help. At times it takes pragmatic, practical thinking outside the box and questioning the obvious.

 

- Do not take any specific area completely off the table (in our case, we ruled out infra issues at an early stage because the environments were created by the same Terraform code).

 

- It took four parties (Databricks support, Azure support, the project team (us), and the client infra team) working together with transparency and a non-blaming attitude to eventually resolve the issue. (Good) teamwork works!

 

 




