Thursday, June 8, 2023

 

Part 2: Applying the 12-factor app principles to data engineering

 

The 2nd principle is handling dependencies: The Twelve-Factor App (12factor.net)

 

Explicitly declare and isolate all dependencies

 


Reference: https://developer.ibm.com/developer/default/articles/creating-a-12-factor-application-with-open-liberty/images/images02.png

 This principle states that the source code maintained in source control should contain only code that is relevant and unique to the application.

External dependencies (such as Node.js modules, Python libraries, .NET add-ins, etc.) should not reside within source control; instead, they should be identified, isolated and made available at runtime when the application / service references them.

Let's take this principle and apply it to our data engineering / data integration scenario.

 Handling dependencies in Databricks 

In your Python-based solution, create a requirements.txt file listing all dependency packages along with their versions:

  • numpy==1.24.3
  • pandas==2.0.2
  • py==3.1.0
  • py4j==0.10.9.7
  • xlrd==1.2.0

to name a few.
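As a rough illustration (a sketch, not our project's actual build script), installing these pinned dependencies from requirements.txt can be scripted in a build or bootstrap step like this; the function name is only for the example:

```python
# Minimal sketch: install the pinned dependencies from requirements.txt
# using the same Python interpreter that will run the code.
import subprocess
import sys


def install_requirements(requirements_path: str = "requirements.txt") -> None:
    """Install every package pinned in the given requirements file using pip."""
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", "-r", requirements_path]
    )


if __name__ == "__main__":
    install_requirements()
```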

These Python libraries are installed on the Databricks cluster from the DevOps pipeline, which downloads them from an external artefacts repository. Alternatively, we could store them in Azure DevOps -> Artifacts and download them from there in the pipeline, if there is a need to restrict internet access in the pipeline or when working with a self-hosted deployment agent.
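For illustration only, here is a hedged sketch of scripting that installation step against the Databricks Libraries API (POST /api/2.0/libraries/install). The workspace URL, token and cluster id are placeholders, and a real pipeline may equally use the Databricks CLI or a marketplace task instead:

```python
# Illustrative sketch: install pinned PyPI packages on a Databricks cluster
# through the Libraries API. Host, token and cluster id are placeholders.
import requests

DATABRICKS_HOST = "https://<your-workspace>.azuredatabricks.net"  # placeholder
DATABRICKS_TOKEN = "<personal-access-token>"                      # placeholder
CLUSTER_ID = "<cluster-id>"                                       # placeholder


def install_pypi_libraries(packages):
    """Ask the cluster to install each pinned package, e.g. 'numpy==1.24.3'."""
    payload = {
        "cluster_id": CLUSTER_ID,
        "libraries": [{"pypi": {"package": pkg}} for pkg in packages],
    }
    response = requests.post(
        f"{DATABRICKS_HOST}/api/2.0/libraries/install",
        headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
        json=payload,
        timeout=30,
    )
    response.raise_for_status()


if __name__ == "__main__":
    with open("requirements.txt") as f:
        pinned = [line.strip() for line in f
                  if line.strip() and not line.startswith("#")]
    install_pypi_libraries(pinned)
```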

 Practical Tip:

If we don't specify explicit versions for packages, the Azure DevOps pipeline downloads the latest version from the respective package store. In our project, this led to unexpected issues with the data pipeline. Since these packages are open source and community driven, they keep evolving with new releases, and there is a high chance that a particular feature might not work in a newer version. Hence it is always recommended to pin the exact version with which you did your development and testing.
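One lightweight way to catch this kind of drift early, for example in a smoke test at the start of the data pipeline, is to compare what is actually installed against the pins in requirements.txt. The check below is a sketch (not code from our project) and assumes simple name==version lines:

```python
# Sketch: fail fast if an installed package version differs from its pin
# in requirements.txt. Only handles plain 'name==version' lines.
from importlib.metadata import PackageNotFoundError, version


def check_pins(requirements_path: str = "requirements.txt") -> list:
    problems = []
    with open(requirements_path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "==" not in line:
                continue
            name, expected = line.split("==", 1)
            try:
                installed = version(name)
            except PackageNotFoundError:
                problems.append(f"{name}: not installed (expected {expected})")
                continue
            if installed != expected:
                problems.append(f"{name}: installed {installed}, pinned {expected}")
    return problems


if __name__ == "__main__":
    issues = check_pins()
    if issues:
        raise SystemExit("Dependency drift detected:\n" + "\n".join(issues))
    print("All pinned dependencies match the installed versions.")
```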

The following Microsoft article explains the steps involved in creating a DevOps pipeline that packages Python libraries and deploys them to a Databricks cluster, using the Databricks CLI tool to securely connect to the cluster from the DevOps agent.

https://learn.microsoft.com/en-us/azure/databricks/dev-tools/ci-cd/ci-cd-azure-devops
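To give a rough idea of what the deploy step runs, the sketch below (commands wrapped in Python purely for readability) copies a built wheel to DBFS and installs it on the cluster with the legacy Databricks CLI. The wheel name, DBFS path and cluster id are placeholders, and the exact flags depend on the CLI version installed on your agent:

```python
# Sketch of the deploy step a DevOps pipeline might run after building the wheel.
# Paths, wheel name and cluster id are placeholders.
import subprocess

WHEEL = "dist/my_data_pipeline-1.0.0-py3-none-any.whl"                        # placeholder
DBFS_PATH = "dbfs:/FileStore/wheels/my_data_pipeline-1.0.0-py3-none-any.whl"  # placeholder
CLUSTER_ID = "<cluster-id>"                                                   # placeholder


def run(cmd):
    print("+", " ".join(cmd))
    subprocess.check_call(cmd)


if __name__ == "__main__":
    # Copy the built wheel to DBFS, then ask the cluster to install it.
    run(["databricks", "fs", "cp", WHEEL, DBFS_PATH, "--overwrite"])
    run(["databricks", "libraries", "install",
         "--cluster-id", CLUSTER_ID, "--whl", DBFS_PATH])
```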

In addition to the external libraries listed above, our own Python code needs to be packaged, versioned and deployed to the cluster for the data pipeline to work.

Link to my blog on writing Python scripts in notebooks vs. writing software-engineering-style Python classes in Databricks: https://bharaniblogs.blogspot.com/2023/06/writing-python-script-in-notebooks-vs.html
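To make "packaged and versioned" concrete, here is a minimal, illustrative setup.py; the project name and the hard-coded version are placeholders, and a real project would carry more metadata:

```python
# Minimal, illustrative setup.py for building the pipeline code into a wheel.
# The project name and the hard-coded version are placeholders.
from setuptools import find_packages, setup

setup(
    name="my_data_pipeline",          # placeholder project name
    version="1.0.0",                  # hard-coded for now; see the Versioneer note below
    packages=find_packages(exclude=["tests", "tests.*"]),
    python_requires=">=3.9",
)
```

Building it (for example with python setup.py bdist_wheel or python -m build) produces the .whl artefact that the pipeline deploys to the cluster.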

Initially, in our project we were hard-coding the Python package version every time we did a build and deploy during the development phase. As part of continuous improvement, we introduced dynamic versioning using Versioneer.

Link to my blog on how to create and apply dynamic versions to Python packages and deploy them to Databricks from DevOps:

Bharani blogs: Dynamically generate version and apply python package from DevOps 
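For context, the core of that change is small. After running versioneer install in the repository (which adds versioneer.py and a _version.py module), setup.py derives the version from git tags instead of a hard-coded string. A sketch, reusing the placeholder project name from above:

```python
# Sketch of a Versioneer-based setup.py (placeholder project name).
# Assumes `versioneer install` has already added versioneer.py and
# my_data_pipeline/_version.py to the repository.
import versioneer
from setuptools import find_packages, setup

setup(
    name="my_data_pipeline",                  # placeholder project name
    version=versioneer.get_version(),         # derived from git tags at build time
    cmdclass=versioneer.get_cmdclass(),
    packages=find_packages(exclude=["tests", "tests.*"]),
    python_requires=">=3.9",
)
```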

 

