Migrating the Mammoth
Proposed session for SQLBits 2026
TL;DR
Based on a real enterprise case, this session details how a world-leading tech company modernized a massive, highly fragmented data platform—migrating RC, ORC, Parquet, and Avro datasets to Delta Lake as part of a multi-petabyte transformation. Attendees will learn how the team designed the architecture, automation framework, and migration tooling needed to orchestrate thousands of pipelines, validate data at scale, and handle complex edge cases such as multi-format tables, massive (>50 TB) datasets, and long-running legacy workloads. Discover the patterns, challenges, and hard-earned lessons required to execute a migration of this magnitude successfully.
Session Details
This session is grounded in the real-world experience of a global technology company undertaking one of the largest data-platform migrations in the industry—modernizing its fragmented data estate and moving to Delta Lake at extraordinary scale. Managing more than 60 PB of data spread across four different formats, 750+ ingestion pipelines, and over 5,000 ETL jobs, the organization needed a migration strategy engineered for both massive volume and operational complexity.
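To ground the discussion, a minimal sketch of the core conversion step is shown below; it uses hypothetical paths and table names rather than the team's actual tooling. Parquet data can typically be converted to Delta in place, while RC, ORC, and Avro sources are rewritten into new Delta tables.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()

// Parquet can usually be converted in place: Delta adds a transaction log
// over the existing files. The path and partition schema here are hypothetical.
DeltaTable.convertToDelta(spark, "parquet.`/mnt/lake/sales_parquet`", "event_date DATE")

// RC, ORC and Avro sources are rewritten rather than converted in place.
// The legacy table is assumed to be registered in the Hive metastore.
spark.table("legacy_db.orders_rcfile")
  .write
  .format("delta")
  .mode("overwrite")
  .save("/mnt/lake/delta/orders")
```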
We will explore how the team executed this transformation using a structured “divide and conquer” model, separating the work into targeted migration workstreams and aligning them to OKRs. As tooling sophistication grew, migration velocity increased, supported by an agile delivery model and a dedicated “Migration Machine” designed to validate data, orchestrate dependency-heavy workloads, and provide complete visibility through custom dashboards.
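To give a flavour of the checks such a “Migration Machine” might run, the sketch below compares a legacy table with its migrated Delta counterpart on row count plus a content fingerprint; the table names and paths are hypothetical, and this is a simplification of the validation the session describes.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.getOrCreate()

// Cheap, order-independent fingerprint: row count plus the sum of a 64-bit
// per-row hash over all columns. Casting the hash to decimal avoids long
// overflow when summing across billions of rows.
def fingerprint(df: DataFrame): Row =
  df.select(xxhash64(df.columns.map(col): _*).as("h"))
    .agg(count(lit(1)).as("rows"), sum(col("h").cast("decimal(38,0)")).as("checksum"))
    .first()

// Hypothetical source and target tables.
val legacy = fingerprint(spark.table("legacy_db.orders_rcfile"))
val delta  = fingerprint(spark.read.format("delta").load("/mnt/lake/delta/orders"))

// Row equality compares both the counts and the checksums.
println(if (legacy == delta) "validation passed" else "validation FAILED")
```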
Attendees will gain insight into the real challenges encountered at scale, including migrating 50+ TB tables, dealing with multiple source formats, orchestrating batch backfills, validating large datasets, and designing alternate paths for tables too slow or complex for standard migration flows.
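One common pattern for the largest tables, sketched here with hypothetical names and a single assumed date partition column (the session covers the team's actual alternate paths), is to backfill partition by partition so that each batch is idempotent and a failed run can be resumed:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()

// Hypothetical 50+ TB legacy table, partitioned by a single date column.
val sourceTable = "legacy_db.events_orc"
val targetPath  = "/mnt/lake/delta/events"

// One partition per batch; replaceWhere makes each write idempotent, so the
// backfill can simply be restarted from the last completed partition.
val partitions = spark.sql(s"SHOW PARTITIONS $sourceTable")
  .collect()
  .map(_.getString(0))                         // e.g. "event_date=2024-01-01"

partitions.foreach { p =>
  val Array(key, value) = p.split("=", 2)
  spark.table(sourceTable)
    .where(s"$key = '$value'")
    .write
    .format("delta")
    .partitionBy(key)
    .mode("overwrite")
    .option("replaceWhere", s"$key = '$value'")
    .save(targetPath)
}
```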
We will conclude with the tangible outcomes: up to 1600× improvement on highly selective queries for petabyte-scale tables, significant gains in table-read performance, and major acceleration of BI workloads—results internal teams described as truly “game changing”.
This session offers a deeply practical look at what it takes to migrate a “mammoth” data estate to a modern lakehouse architecture, providing actionable lessons for organizations planning or scaling their own modernization efforts.
3 things you'll get out of this session
1. Understand how to design and execute a large-scale, multi-format data migration—including strategies for converting RC, ORC, Parquet, and Avro datasets into Delta Lake while maintaining data integrity, lineage, and operational continuity.
2. Learn the architectural patterns, automation frameworks, and tooling required to orchestrate thousands of ingestion and transformation pipelines during a multi-petabyte modernization effort, with a focus on scalability, validation, and reliability.
3. Identify common edge cases and technical challenges encountered during massive enterprise migrations—such as handling 50+ TB tables, mixed-format estates, legacy dependencies, and long-running workloads—and understand the practical solutions that enabled successful outcomes.
Speakers
Anna-Maria Wykes's other proposed sessions for 2026
MCP Unleashed: From “Huh?” to “Heck Yeah!” Building Smarter AI Knowledge Bases - 2026
What we Learned Migrating a Financial Giant from Hudi to Delta (and Why Iceberg was in the Mix) - 2026
From “Who Wrote This ETL?” to Databricks, Claude Saves the Day via Microsoft Foundry - 2026
Getting Started with Claude in Microsoft Foundry - 2026
Anna-Maria Wykes's previous sessions
How to Run Code Clubs for Neurodiverse Children
Code Clubs offer an amazing opportunity to introduce our next generation to coding. With simple, brightly colored drag-and-drop tooling to get them started, we are successfully inspiring many to join the tech industry.
In this session I want to talk you through my journey setting up a Code Club for neurodiverse children: what I found worked, and what didn't. I hope that from this session you will be inspired to follow the same path I have, using your amazing tech experience to empower some of the most vulnerable children, enabling them to become inspired not just by coding, but by the tech industry itself.
Introduction to the wonders of Azure DevOps
Azure DevOps is the leading tool for end-to-end build and release solutions. It helps you plan your Agile project, manage Git code, and deploy solutions using Continuous Integration (CI) and Continuous Deployment (CD) pipelines.
In this session we will cover some of the core components of Azure DevOps and show you how to implement a secure deployment pipeline, using unit tests and gating with your CI builds and CD releases.
Automate the deployment of Databricks components using Terraform
An introduction to Terraform, the Databricks provider, and the steps required to build an automated solution that provisions a Databricks workspace and resources in the Azure cloud platform using Terraform.
So you want to be a Data Engineer?
An introduction to becoming a Data Engineer. Anna, Mikey and Ust will introduce the technology stack, tools and development skills needed for data engineering and show you how and where to go to learn them. We'll also show you how the skills you already have can kickstart your journey to becoming a Data Engineer.
Scala for Big Data: the Big Picture
An opportunity to explore Scala, and why it is truly a “Data Engineer's language”. Using Azure Functions, Data Factory, Azure Data Lake Gen2 and Databricks, the basics will be explored, followed by real-world examples.