Using cloud data lakes in big data solutions comes with some
baggage. They are cheap, scalable and convenient, but there is a cost as well:
data lakes are messy and offer no transactional or metadata support, which is so
important when you work with data at scale. Common issues include:
- Consistency problems caused by concurrent reads and writes to the data lake.
- Increased processing times for big data due to the lack of indexing optimisations.
- Precious time spent cleaning up the solution when bad-quality data disrupts the pipeline.
For these reasons and more, Databricks, the company founded by the creators of
Apache Spark, built a new capability called Delta Lake. As an open-source
project, Delta Lake brings transactions, versioning and indexing to your
data lakes. Running on top of your existing data lake files, it provides snapshot
isolation that tackles the issues of concurrent read and write operations, and it
enables rollback of transactions through history tracking of data lake
commits. Thanks to its built-in optimisation mechanisms, enabling Delta Lake in
a data engineering solution can significantly enhance query performance.
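To give a concrete flavour of these capabilities, here is a minimal PySpark sketch,
assuming a Spark session with the open-source delta-spark package installed; the
table path, schema and sample data are purely illustrative, not part of the session material.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

# Assumes delta-spark is on the classpath; enable the Delta SQL extensions
# and catalog so Spark can read and write Delta tables.
spark = (
    SparkSession.builder.appName("delta-lake-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Write a DataFrame as a Delta table; every write becomes an ACID commit,
# so concurrent readers always see a consistent snapshot.
events = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "action"])
events.write.format("delta").mode("overwrite").save("/tmp/delta/events")

# The transaction log tracks every commit, enabling auditing and rollback.
delta_table = DeltaTable.forPath(spark, "/tmp/delta/events")
delta_table.history().show()

# Time travel: read the table as it looked at an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events")

# Built-in optimisation: compact small files to speed up later queries.
delta_table.optimize().executeCompaction()
```

The same operations are also exposed through Spark SQL (for example OPTIMIZE and
DESCRIBE HISTORY) if you prefer a SQL-first workflow.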
In this session we will showcase how Delta Lake
works and how easily your modern data engineering pipelines can benefit from
its implementation. This workshop will be of interest to anyone who deals
with big data or builds modern data warehouse solutions and would like to
learn how to solve common data lake challenges.