Wednesday, March 15, 2023

  

Custom code or low-code / serverless?

In this series, we will take a high-level architecture of a data engineering use case and discuss the trade-offs that arise when choosing a technology / service option to implement a certain piece of functionality.

 Use-case Requirement:

Connect to a non-personal Outlook email account and look for mails from vendors (distributors of our client's products) that carry attachments containing transactions of sales made to end customers. Emails arrive on a monthly recurring schedule. The requirement is to download each attachment and store it in a data lake for further downstream processing like cleansing, validation, transformation, enrichment etc.

The trade-off here is between writing custom code to connect to the Outlook email account and pick out specific email attachments from a specific sender, or using a serverless PaaS service like Logic Apps, which provides this functionality out of the box following a low-code model.

 Using one of the PaaS low-code services (in this case, Logic Apps)


Pros

  • Establishes connections to Outlook and the storage account quickly using built-in Azure authentication, without having to write code for token-based authentication / authorization
  • The low-code framework helps develop quick prototypes and turn them into production-grade workflows sooner
  • Less maintenance, as less code is involved
  • Transparent cost and billing (Azure cost)
  • Protects against common security threats through Azure's built-in security features

 Cons

  • Though cost here is not a drawback, one needs to understand the pricing model, as each connector and action in the workflow incurs cost. If your requirement involves connecting to multiple services or performing many actions, it is best to do a rough estimate using the pricing calculator (Pricing - Logic Apps | Microsoft Azure)
  • Vendor lock-in
  • Gives developers less flexibility when more customization is needed (like scanning attachments)
  • Limitations that come with such services (Known issues and limitations)

 Using any of the compute services in Azure (to host the custom code)

Pros

  • No vendor lock-in
  • Though you can run the code locally free of cost, it ultimately needs to be hosted in a cloud compute service; the cost depends on the service chosen, and if compute is shared with other workloads the cost can be reduced
  • More control over functionality

 Cons

  • Increased development effort
  • Maintenance effort involved in troubleshooting and debugging the code
  • Manual effort to protect against DDoS-like attacks
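
For a sense of what this option involves, here is a minimal sketch in Python, assuming app-only authentication with MSAL and the Microsoft Graph API (one of several ways to read a mailbox; the tenant/client IDs and the mailbox and vendor addresses below are placeholders, not details from the actual use case):

import base64

import msal
import requests

TENANT_ID = "<tenant-id>"                  # placeholder
CLIENT_ID = "<app-registration-client-id>" # placeholder
CLIENT_SECRET = "<client-secret>"          # in practice, pull from Key Vault
MAILBOX = "sales-reports@contoso.com"      # the non-personal Outlook account
VENDOR = "distributor@vendor-example.com"  # sender whose mails we want
GRAPH = "https://graph.microsoft.com/v1.0"

# 1. Acquire an app-only token via the client-credentials flow; the app
#    registration needs the Mail.Read application permission.
app = msal.ConfidentialClientApplication(
    CLIENT_ID,
    authority=f"https://login.microsoftonline.com/{TENANT_ID}",
    client_credential=CLIENT_SECRET,
)
token = app.acquire_token_for_client(scopes=["https://graph.microsoft.com/.default"])
headers = {"Authorization": f"Bearer {token['access_token']}"}

# 2. Look for mails from the vendor that carry attachments.
resp = requests.get(
    f"{GRAPH}/users/{MAILBOX}/messages",
    headers=headers,
    params={"$filter": f"from/emailAddress/address eq '{VENDOR}' and hasAttachments eq true"},
)
resp.raise_for_status()

# 3. Download each file attachment. A production version would upload these
#    to the data lake (e.g. ADLS Gen2) instead of the local filesystem, and
#    would also handle paging, retries and duplicate runs.
for message in resp.json().get("value", []):
    atts = requests.get(
        f"{GRAPH}/users/{MAILBOX}/messages/{message['id']}/attachments",
        headers=headers,
    )
    atts.raise_for_status()
    for att in atts.json().get("value", []):
        if att.get("@odata.type") == "#microsoft.graph.fileAttachment":
            with open(att["name"], "wb") as f:
                f.write(base64.b64decode(att["contentBytes"]))

Even this small sketch hints at the cons listed above: token handling, paging, retries and attachment scanning are all ours to build and maintain.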

 Ask the following questions (of yourself and of the client stakeholders)

 

  1. Is the application / platform already using any of these compute instances?

 If so, it makes sense to host the custom code there to make use of the existing compute capacity, thereby saving cost without adding another service. For example, if Azure Functions is already running on an App Service Environment, we could provision another instance there to host this custom code and leverage the existing compute capacity.

 

  2. Is there a requirement for the client to use this piece of functionality in multiple places, or to connect to multiple email accounts?

 If the intention is to reuse it in multiple places and in other apps as well, then it makes more sense to go with the custom code approach.

 

  3. Is it a one-off piece of functionality used as a small part of a large application? Does it make sense to invest a lot of time writing code?

 If yes, it makes sense to go with such an out-of-the-box serverless service.

 

  4. Is there a chance your application would get migrated to a different cloud platform?

 If yes, it makes sense to go with the custom code option, as we don't want vendor lock-in with cloud-specific services that would require redesign effort to move to an equivalent service in another cloud.


Summary

 

It all boils down to the client and application requirements; based on those, do a cost-to-benefit analysis to select the option that fits. There is no one-size-fits-all option when it comes to such trade-off decisions.

 One additional note: with either option, the DevOps effort for continuous integration, testing and deployment would be more or less similar.

Friday, March 10, 2023

DEFAULT value in CREATE TABLE when creating a Delta table in Azure Databricks

 

Issue description:

 

This was a thread posted in a forum about an issue creating a Delta Lake table with the DEFAULT value specifier in the CREATE TABLE SQL statement.

 

 INSERT INTO.. with the DEFAULT keyword as a column value was failing in Databricks.

 

The following statement creates a table using the DELTA format:

 

CREATE TABLE students (name VARCHAR(64), address STRING DEFAULT 'unknown', student_id INT) USING DELTA

 

Inserting into this table with the DEFAULT keyword throws this error:

[screenshot of the error message, as shared in the forum post]

Though the Databricks link below indicates that DEFAULT values work with the following version criteria, the forum member reported that it does not work when doing an INSERT INTO.

 

Applies to:

  • Databricks SQL (SQL warehouse version 2022.35 or higher)
  • Databricks Runtime 11.2 and above


Reference: https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html

 

This is the MS link the forum member posted as a reference, questioning why INSERT with the DEFAULT keyword was failing.

 

https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/sql-ref-syntax-dml-insert-into#insert-using-values

Side story:

 

In reply to this post, another forum member created a table, inserted with DEFAULT, and confirmed it working fine.

 

Turned out it was a classic misunderstanding: this member had run the query creating a typical SQL table in SQL Server using SQL Server Management Studio instead of Databricks 😅

Resolution:

 

 

The trick is to include this TBLPROPERTIES setting to make DEFAULT work:

 

TBLPROPERTIES ('delta.feature.allowColumnDefaults' = 'supported');
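
If the table already exists, the same table feature can be enabled after the fact with an ALTER TABLE (standard Delta Lake table-feature syntax; 'students' here matches the example table from this post):

ALTER TABLE students SET TBLPROPERTIES ('delta.feature.allowColumnDefaults' = 'supported');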

 

When I tried running this command, it kept failing for me; I later realized that I was still on an 11.x DBR cluster and my attempt to upgrade to 12.2 had not succeeded.
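
As a side note, to double-check what the attached cluster is actually running, a notebook cell like this helps (spark.version reports the bundled Apache Spark version, e.g. 3.3.x for DBR 12.2; the exact DBR version is shown in the cluster configuration UI):

print(spark.version)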

 

CREATE TABLE students (name VARCHAR(64), address STRING DEFAULT 'unknown', student_id INT) USING DELTA TBLPROPERTIES ('delta.feature.allowColumnDefaults' = 'supported');

 

After upgrading the cluster and doing a detach & re-attach of the notebook, it did work!!

 

In my INSERT query, I passed DEFAULT for the 'address' column and it took the 'unknown' string as specified in the CREATE TABLE query.
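
A statement along these lines does it (reconstructed for illustration, since the exact query isn't shown in the post; 'bharani' and 1233 reuse the sample values from the query further below):

INSERT INTO students (name, address, student_id) VALUES ('bharani', DEFAULT, 1233);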

 

You can also leave out the 'address' column in the INSERT and specify only the other columns:

 

INSERT INTO students (name, student_id) VALUES ('bharani', 1233);

 

to get 'unknown' inserted for the address column.