In this series, we will take the 12-factor app principles and apply them to a typical data engineering use case.
The first principle is Codebase. It states that a single codebase (typically a single branch) is maintained, which references configuration settings, parameters, and secrets externally and is deployed to multiple environments, as shown in the figure below.
The software engineering and DevOps perspective is often overlooked in analytics use cases. Most data engineers and data scientists give priority to the data itself and to building pipelines that move data from one layer to another for cleansing, validation, transformation, and enrichment, while the DevOps aspect is treated as if it only applied to application workloads.
This could stem from the perception that a pipeline can be replicated from one environment to another with far less hassle than deploying a microservice or a web app, thanks to the many UI-based ETL/ELT tools on the market (Talend, Azure Data Factory, etc.).
It is not only data engineers and data scientists; clients, product owners, and other stakeholders of a data use case often do not recognize the value DevOps and GitOps can bring in terms of quality, consistency, and deployment automation.
Another contributing factor is that fewer DevOps frameworks and tools are available for data analytics workloads than for web or service-based workloads.
Even when a team has taken DevOps into consideration, in most cases it might not follow a GitOps deployment model.
Maintaining a separate codebase (branch) per environment (dev, test, prod) with hard-coded, environment-specific parameters makes merging from the Dev branch to Test, or from Test to Prod, difficult. It also violates the first principle of the 12-factor app.
Trunk-based development is the go-to model for achieving this single-codebase principle. It designates one branch (usually 'main') as the trunk; all other branches (e.g. feature/<feature name or developer name>) are short-lived, created for a change and deleted once that change is merged to the trunk and the feature is released.
Environment specific branching
Trunk-based branching approach
Trunk-based development is commonly used in Kubernetes and similar cloud-native development and deployment models when dealing with microservice or web-app-like workloads. Data-related use cases can follow the same model to bring in best practices for continuous integration and deployment.
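As a minimal sketch of what this looks like in practice (assuming an Azure DevOps YAML pipeline named azure-pipelines.yml at the repository root), CI can be bound to the trunk alone, so every merge to main drives all downstream deployments, while feature branches are validated via pull-request branch policies rather than pipelines of their own:

```yaml
# azure-pipelines.yml -- minimal sketch, assuming 'main' is the trunk.
# CI runs only on the trunk; feature branches are validated through
# pull-request branch policies, not dedicated per-environment pipelines.
trigger:
  branches:
    include:
      - main

pool:
  vmImage: 'ubuntu-latest'

steps:
  - script: echo "Build and test the single codebase here"
    displayName: 'Build & test'
```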
In Azure DevOps, such trunk-based branching can be followed, and the environment-specific substitutions can be handled with variable groups (one per environment) in the Library section. Each variable group maps to a key vault created for that environment in its own resource group or subscription, depending on how the environments are segregated.
Library section in Azure DevOps with environment-specific variable groups pointing to the key vault of the respective environment
From DevOps pipelines, these variable groups can then be referenced in the deployment stage designated for each environment to substitute environment-specific configurations and secrets.
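A hedged sketch of that wiring is below; the variable group names (vg-dev, vg-test) and the variable sqlServerName are illustrative assumptions, standing in for key-vault-linked groups created in the Library section:

```yaml
# Multi-stage sketch: one codebase, one pipeline, one variable group per stage.
# Group and variable names (vg-dev, vg-test, sqlServerName) are hypothetical.
stages:
  - stage: DeployDev
    variables:
      - group: vg-dev            # linked to the dev key vault
    jobs:
      - job: deploy
        steps:
          - script: echo "Deploying against $(sqlServerName)"   # dev value

  - stage: DeployTest
    dependsOn: DeployDev
    variables:
      - group: vg-test           # linked to the test key vault
    jobs:
      - job: deploy
        steps:
          - script: echo "Deploying against $(sqlServerName)"   # test value
```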
Database deployment task in DevOps
When it comes to database deployment, one could use the dacpac tasks that are available in Azure DevOps by default. While deploying changes from dev to test or from test to prod (with or without data, based on need), the database username, password, and database name can be pulled from the key vault secrets mapped to the variable groups, and those variables can then be used in the DevOps pipelines.
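A minimal sketch of such a step is below, using the built-in SqlAzureDacpacDeployment task; the service connection name and the variable names (sqlServerName, sqlDatabaseName, sqlUser, sqlPassword) are assumptions that would resolve from the stage's key-vault-linked variable group:

```yaml
# Dacpac deployment sketch; connection and variable names are hypothetical
# and would resolve from the key-vault-linked variable group of the stage.
- task: SqlAzureDacpacDeployment@1
  displayName: 'Deploy dacpac'
  inputs:
    azureSubscription: 'my-service-connection'
    ServerName: '$(sqlServerName)'
    DatabaseName: '$(sqlDatabaseName)'
    SqlUsername: '$(sqlUser)'
    SqlPassword: '$(sqlPassword)'          # secret resolved from key vault
    DacpacFile: '$(Pipeline.Workspace)/drop/MyDatabase.dacpac'
```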
The same goes for other data services like Data Factory, Databricks, Analysis Services, etc.
Data factory deployment task in DevOps
In the case of Data Factory, it is a bit more complicated, as one has to substitute many objects:
- global parameters
- linked services (connections to external services)
- datasets
- pipelines
- etc.
Adftools is one great tool to use for Data Factory DevOps.
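A hedged sketch of driving adftools (the azure.datafactory.tools PowerShell module) from a pipeline is below; the resource group, factory name, location, and the "test" stage are illustrative assumptions:

```yaml
# Data Factory deployment sketch using adftools (azure.datafactory.tools).
# Resource group, factory name, location, and stage are hypothetical.
- task: AzurePowerShell@5
  displayName: 'Publish ADF from JSON'
  inputs:
    azureSubscription: 'my-service-connection'
    azurePowerShellVersion: 'LatestVersion'
    ScriptType: 'InlineScript'
    Inline: |
      Install-Module azure.datafactory.tools -Force -Scope CurrentUser
      Publish-AdfV2FromJson -RootFolder "$(Build.SourcesDirectory)/adf" `
        -ResourceGroupName "$(resourceGroup)" `
        -DataFactoryName "$(dataFactoryName)" `
        -Location "$(location)" `
        -Stage "test"   # selects the stage config file used for substitutions
```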
Databricks deployment task in DevOps
In the case of Databricks deployment, packaging the code and deploying it as a library on the cluster can be done using Azure DevOps.
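A minimal sketch under stated assumptions: the code is packaged as a Python wheel, the (legacy) databricks CLI is used, and clusterId, databricksHost, and databricksToken are hypothetical variables supplied by the stage's variable group:

```yaml
# Databricks library deployment sketch; cluster ID, DBFS path, and variable
# names are hypothetical. Secrets are mapped explicitly into the step's env.
- script: |
    pip install databricks-cli wheel
    python setup.py bdist_wheel
    databricks fs cp dist/*.whl dbfs:/libraries/ --overwrite
    databricks libraries install \
      --cluster-id "$(clusterId)" \
      --whl "dbfs:/libraries/my_package-0.1.0-py3-none-any.whl"
  displayName: 'Package code and install as cluster library'
  env:
    DATABRICKS_HOST: $(databricksHost)     # from the variable group
    DATABRICKS_TOKEN: $(databricksToken)   # key vault secret
```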
In upcoming blogs, we will look at each of these deployments in detail. This post is just meant to reiterate DevOps and GitOps practices for data engineering use cases.