Bharani blogs: Writing Python scripts in notebooks vs. writing Python classes with a software engineering approach in Databricks

A software engineering approach to code for data engineering

Typically, data engineers and developers write blocks of code in Databricks or Jupyter notebooks as code snippets or scripts, as shown in the screenshot below.

Usage of ad-hoc code in the form of scripting
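As an illustration, ad-hoc notebook code of this kind often looks something like the sketch below. It is hypothetical: the inline data, the column names (`id`, `name`, `amount`) and the cleansing rules are invented for the example, and plain Python stands in for Spark so that the snippet runs anywhere.

```python
# Hypothetical notebook cell: everything is inline and nothing is reusable.
import csv
import io

# Stand-in for a source file (assumption: a real notebook would read a path instead).
raw = "id,name,amount\n1, alice ,10.5\n2,BOB,\n3, carol ,7\n"

rows = list(csv.DictReader(io.StringIO(raw)))

# Cleansing and transformation steps written one after another,
# hardcoded to this particular file layout.
cleaned = []
for row in rows:
    if not row["amount"]:
        continue  # drop rows with a missing amount
    cleaned.append(
        {
            "id": int(row["id"]),
            "name": row["name"].strip().title(),
            "amount": float(row["amount"]),
        }
    )

total = sum(r["amount"] for r in cleaned)
print(cleaned)
print(total)
```

The logic works, but it lives in one cell, depends on one file layout, and cannot be imported or tested from anywhere else.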

This works fine for a single-member team or a one-time activity that moves data from source to target, with or without transformations and enrichment.

If the project under consideration is a major program involving many team members or if there is a repeated ingestion of files and running of pipelines to cleanse, transform, enrich and harmonize data, it makes more sense to take a software engineering approach towards source code management instead of just writing ad-hoc non-reusable scripts.

Create Python classes, ideally one class per Python file (.py), each grouping logically related functions.
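A minimal sketch of what such a file might contain is shown here. The module name `sales_cleanser.py`, the class `SalesCleanser`, its methods and the column names are all hypothetical, and plain Python dictionaries are used in place of Spark DataFrames so that the example is self-contained.

```python
# sales_cleanser.py (hypothetical module name: one class per file)
from dataclasses import dataclass
from typing import Any, Dict, List

Row = Dict[str, Any]


@dataclass
class SalesCleanser:
    """Groups the logically related cleansing steps for one kind of record."""

    required_field: str = "amount"  # hypothetical column name

    def drop_incomplete(self, rows: List[Row]) -> List[Row]:
        """Remove rows whose required field is missing or empty."""
        return [r for r in rows if r.get(self.required_field) not in (None, "")]

    def standardize(self, rows: List[Row]) -> List[Row]:
        """Cast types and normalize the name column."""
        return [
            {
                "id": int(r["id"]),
                "name": str(r["name"]).strip().title(),
                self.required_field: float(r[self.required_field]),
            }
            for r in rows
        ]

    def run(self, rows: List[Row]) -> List[Row]:
        """Apply the cleansing steps in order."""
        return self.standardize(self.drop_incomplete(rows))


# Example usage with invented sample data.
cleanser = SalesCleanser()
result = cleanser.run(
    [
        {"id": "1", "name": " alice ", "amount": "10.5"},
        {"id": "2", "name": "BOB", "amount": ""},
    ]
)
print(result)
```

A notebook or pipeline could then import the class (for example `from sales_cleanser import SalesCleanser`, assuming the file is on the Python path) and call `run` instead of repeating the same steps in every notebook.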

This brings the following benefits:

Modularity
A consistent and standardized solution framework
Works with GitHub repos, Azure DevOps, or any other supported source control system
Fewer libraries and packages to install on clusters

Instead of each developer using their own set of libraries, agreeing on a common set and reusing it saves a lot of time during cluster initialization: the fewer libraries there are to install on the cluster's new executors, the less time it takes to get the cluster up and running.

This brings the following benefits: