Software engineering approach to code for Data engineering
Typically, data engineers and developers write blocks of code in Databricks or Jupyter notebooks as code snippets or scripts, as shown in the screenshot below.
Usage of ad-hoc code in the form of scripting
This works fine for a single-member team or for a one-time activity that moves data from source to target, with or without transformations and enrichment.
If the project is a major program involving many team members, or if files are ingested repeatedly and pipelines run regularly to cleanse, transform, enrich and harmonize data, it makes more sense to take a software engineering approach to source code management instead of writing ad-hoc, non-reusable scripts.
Create Python classes, ideally one class per Python file (.py), with each class grouping logically related functions.
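As a minimal sketch of what such a class could look like, the example below groups a few cleansing functions in one class that would live in its own file. The module name `data_cleanser.py`, the class `DataCleanser` and its method names are hypothetical and chosen only for illustration; plain Python lists of dictionaries stand in for whatever DataFrame type the project actually uses.

```python
# Hypothetical file: data_cleanser.py (one class per file)
from typing import Dict, Iterable, List, Optional

# A record is a simple mapping of column name to an optional string value.
Record = Dict[str, Optional[str]]


class DataCleanser:
    """Groups logically related cleansing functions in one reusable class."""

    def __init__(self, required_fields: Iterable[str]):
        self.required_fields = list(required_fields)

    def trim_strings(self, records: List[Record]) -> List[Record]:
        """Strip leading and trailing whitespace from every string value."""
        return [
            {key: value.strip() if isinstance(value, str) else value
             for key, value in record.items()}
            for record in records
        ]

    def drop_incomplete(self, records: List[Record]) -> List[Record]:
        """Keep only records whose required fields are present and non-empty."""
        return [
            record for record in records
            if all(record.get(field) not in (None, "")
                   for field in self.required_fields)
        ]

    def cleanse(self, records: List[Record]) -> List[Record]:
        """Run the cleansing steps in a fixed, shared order."""
        return self.drop_incomplete(self.trim_strings(records))


# Example usage, as it might appear in a notebook cell after importing the class.
cleanser = DataCleanser(required_fields=["id", "name"])
rows = [
    {"id": "1", "name": "  Alice "},
    {"id": "2", "name": "   "},
    {"id": None, "name": "Bob"},
]
print(cleanser.cleanse(rows))  # prints [{'id': '1', 'name': 'Alice'}]
```

Every notebook or pipeline that needs this cleansing logic can then import the same class instead of copying the snippet.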
This brings the following benefits:
- Modularity
- A consistent and standardized solution framework
- A limited number of libraries and packages to install on clusters. Instead of each developer using their own set of libraries, having a common set and reusing it saves a lot of time in cluster initialization (the fewer libraries to install on the new executors of a cluster, the less time it takes to get the cluster up and running).
Use GitHub repos, Azure DevOps or any other supported source control system.
This brings the following benefits:
- Allows code versioning
- Better collaboration between teams
- Following a GitOps or DevOps process
- Unit test automation through DevOps
- Code scanning
- Automated deployment through DevOps pipelines
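To illustrate the unit testing benefit, here is a small sketch that uses Python's built-in `unittest` module. The function `normalize_country_code` and the test class are hypothetical examples, not part of any particular project; the point is only that logic kept in importable functions and classes can be tested automatically, which is hard to do with ad-hoc notebook cells.

```python
import unittest
from typing import Optional


def normalize_country_code(value: Optional[str]) -> Optional[str]:
    """Hypothetical transformation: trim and upper-case a country code."""
    if value is None:
        return None
    return value.strip().upper()


class NormalizeCountryCodeTest(unittest.TestCase):
    def test_trims_and_upper_cases(self):
        self.assertEqual(normalize_country_code(" de "), "DE")

    def test_keeps_missing_values(self):
        self.assertIsNone(normalize_country_code(None))


def run_tests() -> unittest.TestResult:
    """Load and run the test case explicitly, returning the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(
        NormalizeCountryCodeTest
    )
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_tests()
```

In a build pipeline, tests like these would more commonly be discovered and run with `python -m unittest`, so that a failing test fails the build.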
Usage of Python classes and reuse of classes
This page explains the style guide and coding conventions to follow as a standard recommendation when writing Python code: