Thursday, June 8, 2023

Writing Python scripts in notebooks vs. writing Python classes with a software engineering approach in Databricks

 

Software engineering approach to code for data engineering

Typically, data engineers and developers write blocks of code in Databricks or Jupyter notebooks as code snippets or scripts, as shown in the screenshot below.

 

[Screenshot: usage of ad-hoc code in the form of scripting]

This works fine for a single-member team or a one-time activity that moves data from source to target, with or without transformations and enrichment.

If the project is a major program involving many team members, or if files are ingested repeatedly and pipelines run regularly to cleanse, transform, enrich, and harmonize data, it makes more sense to take a software engineering approach to source code management instead of writing ad-hoc, non-reusable scripts.

Create Python classes, ideally one class per Python file (.py), each grouping logically related functions.
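As a rough illustration of the one-class-per-file idea, here is a minimal sketch. The file name customer_cleanser.py, the class CustomerCleanser, and the field names are all hypothetical, and the records are plain dictionaries so the example runs anywhere; in a real Databricks pipeline the same structure would more likely wrap Spark DataFrame operations.

```python
# customer_cleanser.py (hypothetical file name: one class per .py file)
from dataclasses import dataclass
from typing import Dict, Iterable, List

Record = Dict[str, str]


@dataclass
class CustomerCleanser:
    """Groups logically related cleansing steps for customer records."""

    # Hypothetical field names, chosen only for illustration.
    required_fields: tuple = ("customer_id", "email")

    def trim_whitespace(self, record: Record) -> Record:
        """Strip leading and trailing whitespace from every value."""
        return {key: value.strip() for key, value in record.items()}

    def normalize_email(self, record: Record) -> Record:
        """Lower-case the email field if it is present."""
        cleaned = dict(record)
        if "email" in cleaned:
            cleaned["email"] = cleaned["email"].lower()
        return cleaned

    def is_valid(self, record: Record) -> bool:
        """A record is valid when every required field is non-empty."""
        return all(record.get(field) for field in self.required_fields)

    def cleanse(self, records: Iterable[Record]) -> List[Record]:
        """Apply all cleansing steps and keep only valid records."""
        cleaned = (
            self.normalize_email(self.trim_whitespace(record))
            for record in records
        )
        return [record for record in cleaned if self.is_valid(record)]
```

A notebook would then only need something like `from customer_cleanser import CustomerCleanser` followed by `CustomerCleanser().cleanse(rows)`, assuming the file is on the notebook's import path, so every pipeline reuses the same tested logic instead of copying cells.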


This approach brings the following benefits:

 

  • Modularity
  • A consistent and standardized solution framework
  • Use of GitHub repos, Azure DevOps, or any other supported source control
  • Fewer libraries and packages to install on clusters
    • Instead of each developer using their own set of libraries, sharing a common set saves a lot of time in cluster initialization (the fewer libraries there are to install on new executors, the less time it takes to get a cluster up and running)

Using source control in turn brings the following benefits:

 

  • Code versioning
  • Better collaboration between teams
  • GitOps and DevOps processes
  • Unit testing automation through DevOps
  • Code scanning
  • Automated deployment through DevOps pipelines
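To make the unit testing point concrete, here is a small sketch using Python's built-in unittest module. The class AmountParser and its behavior are hypothetical stand-ins for any reusable class in the repository; the point is only that logic packaged in a class can be tested outside a notebook.

```python
import unittest


class AmountParser:
    """Hypothetical reusable class under test."""

    def parse(self, raw: str) -> float:
        """Convert a string such as ' 1,250.50 ' to a float."""
        return float(raw.strip().replace(",", ""))


class AmountParserTest(unittest.TestCase):
    def setUp(self):
        self.parser = AmountParser()

    def test_parses_thousands_separator(self):
        self.assertEqual(self.parser.parse(" 1,250.50 "), 1250.5)

    def test_rejects_non_numeric_input(self):
        # float() raises ValueError for text that is not a number.
        with self.assertRaises(ValueError):
            self.parser.parse("n/a")
```

A DevOps pipeline could run such tests on every commit, for example with `python -m unittest discover`, and block the deployment when one fails.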


 


          

 

[Screenshot: usage of reusable Python classes]


The following page explains the style guide and coding conventions recommended as a standard when writing Python code:


PEP 8 – Style Guide for Python Code | peps.python.org
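As a quick illustration of a few PEP 8 conventions (naming, import placement, docstrings, four-space indentation), here is a short sketch. The names DEFAULT_LANDING_DIR, FilePathBuilder, and build_path are hypothetical and exist only to show the naming styles.

```python
"""Hypothetical module with helpers for building file paths."""

import os  # imports go at the top of the file, one per line

DEFAULT_LANDING_DIR = "landing"  # constants: UPPER_CASE_WITH_UNDERSCORES


class FilePathBuilder:  # class names: CapWords
    """Builds paths for ingested files."""

    def __init__(self, base_dir: str = DEFAULT_LANDING_DIR):
        self.base_dir = base_dir

    def build_path(self, file_name: str) -> str:  # functions: snake_case
        """Return the path of file_name under the base directory."""
        return os.path.join(self.base_dir, file_name)
```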


 

 

 

 

 

 

 

 




