Wednesday, March 15, 2023

  

Custom code or low-code / serverless?

In this series, we will take a high-level architecture of a data engineering use case and discuss the trade-offs that arise when choosing a technology / service option to implement a certain piece of functionality.

 Use-case Requirement:

Connect to a non-personal Outlook email account and look for mails from vendors (distributors of our client's products) that carry attachments containing transactions of sales made to end customers. Emails arrive on a monthly recurring schedule. The requirement is to download each attachment and store it in a data lake for further downstream processing like cleansing, validation, transformation, enrichment etc.

The trade-off here is between writing custom code to connect to the Outlook email account and pick out specific email attachments from a specific sender, or using a serverless PaaS service like Logic Apps, which provides this functionality out of the box following a low-code model.

 Using one of the PaaS low-code services (in this case, Logic Apps)


Pros

  • Establishes connections to Outlook and the storage account quickly using built-in Azure authentication, without having to write code for token-based authentication / authorization
  • The low-code framework helps develop quick prototypes and turn them into production-grade workflows sooner
  • Less maintenance, as less code is involved
  • Transparent cost and billing (Azure cost)
  • Protects against common security threats through Azure's built-in security features

 Cons

  • Though cost here is not a drawback, one needs to understand the pricing model, as each connector and action in the workflow incurs cost. If your requirement involves connecting to multiple services or performing many actions, it is best to do a rough estimate using the pricing calculator (Pricing - Logic Apps | Microsoft Azure)
  • Vendor lock-in
  • Gives developers less flexibility when more customization is needed (like scanning attachments)
  • Limitations that come with such services (Known issues and limitations)

 Using any of the compute services in Azure (to host the custom code)

Pros

  • No vendor lock-in
  • Though you can run the code locally free of cost, it ultimately needs to be hosted in a cloud compute service; the cost depends on the service chosen, and if compute is shared with other workloads the cost can be reduced
  • More control over functionality

 Cons

  • Increased development effort
  • Maintenance effort involved in troubleshooting and debugging the code
  • Manual effort to protect against DDoS-like attacks
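
For a sense of what this option involves, here is a minimal sketch in Python, assuming app-only authentication with MSAL and the Microsoft Graph API (one of several ways to read a mailbox; the tenant/client IDs and the mailbox and vendor addresses below are placeholders, not details from the actual use case):

import base64

import msal
import requests

TENANT_ID = "<tenant-id>"                  # placeholder
CLIENT_ID = "<app-registration-client-id>" # placeholder
CLIENT_SECRET = "<client-secret>"          # in practice, pull from Key Vault
MAILBOX = "sales-reports@contoso.com"      # the non-personal Outlook account
VENDOR = "distributor@vendor-example.com"  # sender whose mails we want
GRAPH = "https://graph.microsoft.com/v1.0"

# 1. Acquire an app-only token via the client-credentials flow; the app
#    registration needs the Mail.Read application permission.
app = msal.ConfidentialClientApplication(
    CLIENT_ID,
    authority=f"https://login.microsoftonline.com/{TENANT_ID}",
    client_credential=CLIENT_SECRET,
)
token = app.acquire_token_for_client(scopes=["https://graph.microsoft.com/.default"])
headers = {"Authorization": f"Bearer {token['access_token']}"}

# 2. Look for mails from the vendor that carry attachments.
resp = requests.get(
    f"{GRAPH}/users/{MAILBOX}/messages",
    headers=headers,
    params={"$filter": f"from/emailAddress/address eq '{VENDOR}' and hasAttachments eq true"},
)
resp.raise_for_status()

# 3. Download each file attachment. A production version would upload these
#    to the data lake (e.g. ADLS Gen2) instead of the local filesystem, and
#    would also handle paging, retries and duplicate runs.
for message in resp.json().get("value", []):
    atts = requests.get(
        f"{GRAPH}/users/{MAILBOX}/messages/{message['id']}/attachments",
        headers=headers,
    )
    atts.raise_for_status()
    for att in atts.json().get("value", []):
        if att.get("@odata.type") == "#microsoft.graph.fileAttachment":
            with open(att["name"], "wb") as f:
                f.write(base64.b64decode(att["contentBytes"]))

Even this small sketch hints at the cons listed above: token handling, paging, retries and attachment scanning are all ours to build and maintain.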

 Ask the following questions (of yourself and of the client stakeholders)

 

  1. Is the application / platform already using any of these compute instances?

 If so, it makes sense to host the custom code there to make use of the existing compute capacity, thereby saving cost without adding another service. For example, if Azure Functions is already running on an App Service Environment, we could provision another instance there to host this custom code and leverage the existing compute capacity.

 

  2. Is there a requirement for the client to use this piece of functionality in multiple places, or to connect to multiple email accounts?

 If the intention is to reuse it in multiple places and in other apps as well, then it makes more sense to go with the custom code approach.

 

  3. Is it a one-off piece of functionality used as a small part of a large application? Does it make sense to invest a lot of time writing code?

 If yes, it makes sense to go with such an out-of-the-box serverless service.

 

  4. Is there a chance your application would get migrated to a different cloud platform?

 If yes, it makes sense to go with the custom code option, as we don't want vendor lock-in with cloud-specific services that would require redesign effort to move to an equivalent service in another cloud.


Summary

 

It all boils down to the client and application requirements; based on those, do a cost-to-benefit analysis to select the option that fits. There is no one-size-fits-all option when it comes to such trade-off decisions.

 One additional note: with either option, the DevOps effort for continuous integration, testing and deployment would be more or less similar.

Friday, March 10, 2023

DEFAULT value in CREATE TABLE when creating a Delta table in Azure Databricks

 

Issue description:

 

This was a thread posted in a forum about an issue creating a Delta Lake table with the DEFAULT value specifier in the CREATE TABLE SQL statement.

 

 INSERT INTO.. with the DEFAULT keyword as a column value was failing in Databricks.

 

The following statement creates a table using the DELTA format:

 

CREATE TABLE students (name VARCHAR(64), address STRING DEFAULT 'unknown', student_id INT) USING DELTA

 

Inserting into this table with the DEFAULT keyword throws this error:

[screenshot of the error message, as shared in the forum post]

Though the Databricks link below indicates that DEFAULT values work with the following version criteria, the forum member reported that it does not work when doing an INSERT INTO.

 

Applies to:

  • Databricks SQL (SQL warehouse version 2022.35 or higher)
  • Databricks Runtime 11.2 and above


Reference: https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html

 

This is the MS link the forum member posted as a reference, questioning why INSERT with the DEFAULT keyword was failing.

 

https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/sql-ref-syntax-dml-insert-into#insert-using-values

Side story:

 

In reply to this post, another forum member created a table, inserted with DEFAULT, and confirmed it working fine.

 

Turned out it was a classic misunderstanding: this member had run the query creating a typical SQL table in SQL Server using SQL Server Management Studio instead of Databricks 😅

Resolution:

 

 

The trick is to include this TBLPROPERTIES setting to make DEFAULT work:

 

TBLPROPERTIES ('delta.feature.allowColumnDefaults' = 'supported');
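
If the table already exists, the same table feature can be enabled after the fact with an ALTER TABLE (standard Delta Lake table-feature syntax; 'students' here matches the example table from this post):

ALTER TABLE students SET TBLPROPERTIES ('delta.feature.allowColumnDefaults' = 'supported');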

 

When I tried running this command, it kept failing for me; I later realized that I was still on an 11.x DBR cluster and my attempt to upgrade to 12.2 had not succeeded.
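
As a side note, to double-check what the attached cluster is actually running, a notebook cell like this helps (spark.version reports the bundled Apache Spark version, e.g. 3.3.x for DBR 12.2; the exact DBR version is shown in the cluster configuration UI):

print(spark.version)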

 

CREATE TABLE students (name VARCHAR(64), address STRING DEFAULT 'unknown', student_id INT) USING DELTA TBLPROPERTIES ('delta.feature.allowColumnDefaults' = 'supported');

 

After upgrading the cluster and doing a detach & re-attach of the notebook, it did work!!

 

In my INSERT query, I passed DEFAULT for the 'address' column and it took the 'unknown' string as specified in the CREATE TABLE query.
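
A statement along these lines does it (reconstructed for illustration, since the exact query isn't shown in the post; 'bharani' and 1233 reuse the sample values from the query further below):

INSERT INTO students (name, address, student_id) VALUES ('bharani', DEFAULT, 1233);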

 

You can also leave out the 'address' column in the INSERT and specify only the other columns:

 

INSERT INTO students (name, student_id) VALUES ('bharani', 1233);

 

to get 'unknown' inserted for the address column.