SQLBits 2023

Declarative ETL pipelines with Delta Live Tables

Modern software engineering and management for ETL, so data analysts and engineers can spend less time on tooling and focus on getting value from data.
Delta Live Tables (DLT) is the first ETL framework to take a simple declarative approach to building reliable data pipelines while automatically managing your infrastructure at scale. With DLT, engineers can treat their data as code and apply modern software engineering best practices like testing, error handling, monitoring and documentation to deploy reliable pipelines at scale.

*Accelerate ETL development*
- Unlike solutions that require you to hand-stitch fragments of code to build end-to-end pipelines, DLT makes it possible to declaratively express entire data flows in SQL and Python (see the sketch after this list).
- In addition, DLT natively enables modern software engineering best practices: developing in environments separate from production, testing changes before deployment, and deploying and managing environments using parameterization, unit testing and documentation. As a result, you can simplify the development, testing, deployment, operation and monitoring of ETL pipelines with first-class constructs for expressing transformations, CI/CD, SLAs and quality expectations, and seamless handling of batch and streaming in a single API.
- DLT provides a new declarative API that lets users perform streaming CDC and generate SCD type 1 and type 2 tables, with built-in handling of out-of-order records.
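
To make the declarative style concrete, here is a minimal sketch of a DLT pipeline in Python. The landing path, dataset names and columns (customers_bronze, customers_silver, customer_id, event_ts) are hypothetical placeholders, and the helper names follow the DLT Python API (older runtimes name the target-table helper slightly differently). The pipeline ingests raw change events with Auto Loader, then uses apply_changes to maintain an SCD type 2 table, with sequence_by resolving late and out-of-order records.

```python
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw customer change events landed from cloud storage.")
def customers_bronze():
    # Auto Loader incrementally discovers new files as they arrive.
    # `spark` is the ambient SparkSession the DLT runtime provides.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/landing/customers")  # hypothetical landing path
    )

# Target table for the CDC flow; DLT creates and manages it.
dlt.create_streaming_table("customers_silver")

# APPLY CHANGES consumes the change feed and maintains SCD type 2 history.
dlt.apply_changes(
    target="customers_silver",
    source="customers_bronze",
    keys=["customer_id"],
    sequence_by=col("event_ts"),  # ordering column used to resolve out-of-order events
    stored_as_scd_type=2,
)
```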

*Automatically manage infrastructure*
- Sizing clusters for optimal performance under changing, unpredictable data volumes is challenging and often leads to overprovisioning. DLT automatically scales compute to meet performance SLAs: you set the minimum and maximum number of instances, and DLT sizes the cluster up or down according to utilization (see the settings sketch after this list).
- Error handling, recovery and performance optimization are all handled automatically, so with DLT you can focus on data transformation instead of operations.
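
As an illustration, the autoscaling limits live in the pipeline's settings. The sketch below shows the relevant compute section of a pipeline settings file with enhanced autoscaling; the worker counts are placeholders to adjust for your workload.

```json
{
  "clusters": [
    {
      "label": "default",
      "autoscale": {
        "min_workers": 1,
        "max_workers": 8,
        "mode": "ENHANCED"
      }
    }
  ]
}
```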

*Data confidence*
- DLT makes it easy to create trusted data sources with first-class support for data quality management and monitoring through a feature called Expectations (see the sketch after this list).
- Expectations help prevent bad data from flowing into tables and track data quality over time. Granular pipeline observability gives you a high-fidelity lineage diagram of your pipeline, lets you track dependencies, and aggregates data quality metrics across all of your pipelines, so you have the tools to troubleshoot bad data when it appears.
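
A minimal sketch of how expectations attach to a dataset definition in Python (the table and column names are hypothetical): each expectation pairs a name with a SQL constraint, and the decorator you choose decides what happens to violating rows, whether they are recorded, dropped, or cause the update to fail.

```python
import dlt

@dlt.table(comment="Cleansed orders ready for analytics.")
@dlt.expect("valid_amount", "amount >= 0")                     # record violations, keep the rows
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")  # drop violating rows
@dlt.expect_or_fail("known_currency", "currency IS NOT NULL")  # stop the update on violation
def orders_clean():
    return dlt.read("orders_raw")  # hypothetical upstream dataset
```

Counts of passed and violating records surface in the pipeline's event log and quality metrics, which is where the quality tracking described above comes from.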

*Simplified batch and streaming*
- Unlike other products that force you to deal with streaming and batch workloads separately, DLT supports any type of data workload with a single API, so data engineers and analysts alike can build cloud-scale data pipelines faster, without needing advanced data engineering skills.
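
A short sketch of what this looks like in practice, again with hypothetical dataset and column names: the same @dlt.table decorator defines both datasets, and only the read call distinguishes incremental (streaming) from full (batch) processing.

```python
import dlt
from pyspark.sql.functions import col, sum as sum_

# Streaming: new upstream rows are processed incrementally as they arrive.
@dlt.table
def orders_enriched():
    return dlt.read_stream("orders_clean").filter(col("amount") > 0)

# Batch: the result is recomputed from a complete view of its input.
@dlt.table
def revenue_by_day():
    return (
        dlt.read("orders_enriched")
        .groupBy("order_date")
        .agg(sum_("amount").alias("revenue"))
    )
```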