Databricks performance issue and resolution
Setting the Stage:
We are using VNET-injected Databricks on Azure for a data engineering use case. The infra team created the Databricks workspaces with VNET configuration in all environments through Terraform code and a DevOps pipeline.
Issue:
In our DEV environment, the Databricks cluster performs fine: end-to-end cleansing, validation, and enrichment for a file with approximately 1,000+ transactions takes about 1 hour.
The same process, with a similar setup in the QA environment, took about 4x longer to complete.
Troubleshooting:
Since the Databricks Terraform module code is the same for all environments, we initially ruled out any infra-related issues.
After spending a lot of time on troubleshooting, we raised a ticket with Azure support and, in turn, with the Databricks support team.
First, the support team looked at the number of Python packages being installed on the cluster and suggested adding an init script so that all libraries are installed on all executors during cluster init, avoiding the time taken to install them on each executor after the cluster start event. The following was added to the cluster configuration:
databricks.libraries.enableSparkPyPI false
No luck with this suggestion.
Next, we shared detailed logs from pipeline runs in our DEV and QA environments with the Databricks support team. Nothing significant stood out when we compared the logs from the two environments.
Next, we ran a simple test in the DEV and QA environments: creating a few hundred items on the mounted storage location (a Delta Lake enabled blob storage) from which we read raw data and to which we write delivery data. The QA environment took a few seconds longer than DEV.
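As a rough illustration of that test (the mount path, item count, and file contents below are placeholders, not the actual values from our pipeline), something like the following was run in a notebook cell in each environment:

# Timed creation of a few hundred small files on the mounted storage path.
# /mnt/datalake/perf_test is a placeholder; dbutils is available by default in Databricks notebooks.
import time

base_path = "/mnt/datalake/perf_test"
num_items = 300

start = time.time()
for i in range(num_items):
    # Each put is a small round trip to the underlying blob storage
    dbutils.fs.put(f"{base_path}/item_{i}.txt", "test", True)
elapsed = time.time() - start

print(f"Created {num_items} items in {elapsed:.1f} seconds")

Comparing the printed timings between DEV and QA gives a quick, storage-level signal that is independent of the pipeline logic.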
The Azure support team then investigated the cause by running a name-resolution and connectivity check from a notebook in both environments.
The IP address returned in DEV was reachable via telnet, but QA resolved to a different IP address. A telnet connection to the DEV IP took only 0.1 seconds, whereas telnet to the QA IP took 4 seconds.
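The exact command used in the support session is not reproduced here; a minimal Python sketch of this kind of check, with a placeholder hostname, could look like this when run in a notebook cell in each environment:

# Illustrative name-resolution and connect-time check; the hostname is a placeholder.
import socket
import time

host = "<storage-account>.blob.core.windows.net"  # replace with the actual endpoint
port = 443

resolved_ip = socket.gethostbyname(host)
print(f"{host} resolves to {resolved_ip}")

start = time.time()
with socket.create_connection((resolved_ip, port), timeout=10):
    pass  # connection established successfully
print(f"TCP connect to {resolved_ip}:{port} took {time.time() - start:.2f} seconds")

A large gap in either the resolved address or the connect time between environments points at DNS or network routing rather than at the Databricks cluster itself.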
It turned out that the VNET peering in the QA environment pointed to an incorrect custom DNS server. Although the Databricks Terraform code is the same for both environments, the VNET peering and DNS configuration (IP address to DNS mapping) differed in QA, which caused the latency. After redoing the VNET peering with the correct DNS server, the issue was resolved.
Learnings:
- A standard troubleshooting rulebook alone will not always help. At times it requires pragmatic, practical thinking outside the box and questioning the obvious.
- Do not take any specific area completely off the table (in our case, we ruled out infra issues at an early stage because the environments were created by the same Terraform code).
- It took four parties (Databricks support, Azure support, the project team (us), and the client infra team) working together with transparency and a non-blaming attitude to eventually resolve the issue. (Good) teamwork works!