Part 2: Applying the 12-factor app principles to data engineering
The second principle is handling dependencies. The Twelve-Factor App (12factor.net) states it as: "Explicitly declare and isolate all dependencies."
This principle states that the source code maintained in source control should contain only code that is relevant and unique to the application.
External dependencies (such as Node.js modules, Python libraries, .NET add-ins, etc.) should not reside in source control; instead, they should be identified, isolated and made available at runtime when the application or service references them.
Let's take this principle and apply it to our data engineering / data integration scenario.
Handling dependencies in Databricks
In your Python-based solution, create a requirements.txt file listing all dependency packages along with their versions:
- numpy==1.24.3
- pandas==2.0.2
- pyspark==3.1.0
- py4j==0.10.9.7
- xlrd==1.2.0
to list a few.
These Python libraries are installed on the Databricks cluster from a DevOps pipeline, which downloads them from an external artefact repository. Alternatively, they could be stored in Azure DevOps -> Artefacts and downloaded from there by the pipeline, if there is a need to restrict internet access in the pipeline or when working with a self-hosted deployment agent.
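To make the idea concrete, here is a rough sketch of that installation step using the Databricks Libraries API (2.0) directly from Python. The workspace URL, access token and cluster ID are placeholders, and this is an illustration of the approach rather than our project's actual pipeline code:

```python
import requests

# Placeholder values - substitute your workspace URL, a personal access
# token (ideally pulled from a pipeline secret) and the target cluster ID.
DATABRICKS_HOST = "https://<your-workspace>.azuredatabricks.net"
TOKEN = "<personal-access-token>"
CLUSTER_ID = "<cluster-id>"


def install_pinned_libraries(requirements_path="requirements.txt"):
    """Ask the cluster to install every pinned package from requirements.txt."""
    with open(requirements_path) as f:
        pins = [line.strip() for line in f
                if line.strip() and not line.startswith("#")]

    payload = {
        "cluster_id": CLUSTER_ID,
        "libraries": [{"pypi": {"package": pkg}} for pkg in pins],
    }
    response = requests.post(
        f"{DATABRICKS_HOST}/api/2.0/libraries/install",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json=payload,
        timeout=30,
    )
    response.raise_for_status()


if __name__ == "__main__":
    install_pinned_libraries()
```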
Practical Tip:
If we don't specify explicit versions for packages, the Azure DevOps pipeline pulls the latest version from the respective package store. In our project this led to unexpected issues with the data pipeline. Since these packages are open source and community driven, they keep evolving with new versions, and there is a good chance that a particular feature will not work the same way in a higher version. Hence it is always recommended to pin the exact versions with which you did your development and testing.
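One lightweight safeguard along these lines (a minimal sketch, not our project's actual check) is a sanity step that compares the versions installed on the cluster against the pins in requirements.txt and reports any drift before the data pipeline runs:

```python
from importlib.metadata import PackageNotFoundError, version


def find_version_drift(requirements_path="requirements.txt"):
    """Return a list of packages whose installed version differs from the pin."""
    drift = []
    with open(requirements_path) as f:
        for line in f:
            line = line.strip()
            # Skip blanks, comments and anything that is not an exact pin.
            if not line or line.startswith("#") or "==" not in line:
                continue
            name, _, pinned = line.partition("==")
            name, pinned = name.strip(), pinned.strip()
            try:
                installed = version(name)
            except PackageNotFoundError:
                drift.append(f"{name}: not installed (expected {pinned})")
                continue
            if installed != pinned:
                drift.append(f"{name}: installed {installed}, pinned {pinned}")
    return drift


if __name__ == "__main__":
    for problem in find_version_drift():
        print(problem)
```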
The following Microsoft article explains the steps involved in creating a DevOps pipeline to package Python libraries and deploy them to a Databricks cluster, using the Databricks-connect CLI tool to securely connect to the cluster from the DevOps agent:
https://learn.microsoft.com/en-us/azure/databricks/dev-tools/ci-cd/ci-cd-azure-devops
In addition to the external libraries listed above, our Python code needs to be packaged, versioned and deployed to the cluster for the data pipeline to work.
Link to blog on writing Python scripts in notebooks vs. writing software-engineering-style Python classes in Databricks: https://bharaniblogs.blogspot.com/2023/06/writing-python-script-in-notebooks-vs.html
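As an illustration of the packaging step, a minimal setup.py along the lines below, plus a `python -m build --wheel` step in the pipeline, is enough to produce a wheel that can be deployed to the cluster. The package name, version and pins here are placeholders rather than our project's actual values:

```python
from setuptools import find_packages, setup

setup(
    name="data_pipeline",          # placeholder package name
    version="0.1.0",               # hard-coded version, bumped manually per build
    packages=find_packages(exclude=["tests", "tests.*"]),
    install_requires=[
        # keep these pins in sync with requirements.txt
        "numpy==1.24.3",
        "pandas==2.0.2",
    ],
    python_requires=">=3.8",
)
```

The resulting wheel can then be installed on the cluster much like the external libraries, for example via a "whl" entry in the Libraries API payload instead of a "pypi" entry.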
Initially, in our project we hard-coded the Python package version every time we did a build and deploy during the development phase. As part of continuous improvement, we introduced dynamic versioning using Versioneer.
Link to blog on how to create and apply a dynamic version to Python packages and deploy them to Databricks from DevOps:
Bharani blogs: Dynamically generate version and apply python package from DevOps
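With Versioneer, the hard-coded version in setup.py goes away: the version is derived from Git metadata at build time. The sketch below shows the standard Versioneer wiring (the package name is a placeholder, and it assumes `versioneer install` has already been run to generate _version.py and the setup.cfg entries):

```python
import versioneer
from setuptools import find_packages, setup

setup(
    name="data_pipeline",                 # placeholder package name
    version=versioneer.get_version(),     # derived from Git tags/commits at build time
    cmdclass=versioneer.get_cmdclass(),   # hooks so built artifacts embed that version
    packages=find_packages(exclude=["tests", "tests.*"]),
)
```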