Using cloud data lakes in big data solutions comes with some baggage. They are cheap, scalable and convenient, but there is a cost as well: they are messy and lack the transactional and metadata support that is so important when you work with data at scale. Common issues include:

  • Consistency problems with concurrent reading and writing into the data lake.
  • Coping with increased processing times for big data due to a lack of indexing optimisations.
  • Spending precious time cleansing the solution when poor-quality data disrupts the pipeline.

For these reasons and more, the team behind Apache Spark at Databricks created Delta Lake. As an open-source project, Delta Lake brings transactions, version control and indexing to your data lakes. Running on top of existing data lake data, it provides snapshot isolation, which tackles the issues of concurrent read and write operations, and it enables rollback of transactions through history tracking of data lake commits. Thanks to its built-in optimisation mechanisms, enabling Delta Lake in a data engineering solution can significantly enhance query performance.

In this session we will showcase how Delta Lake works and how easily your modern data engineering pipelines can benefit from its implementation. This session will be of interest to anyone who deals with big data or creates modern data warehouse solutions and would like to learn how to solve common data lake challenges.

Presented by Piotr Mucha at SQLBits 2020