Speed vs. Scale: DuckDB, Polars, Pandas, and PySpark in Practice

Regular 50-minute session for SQLBits 2026

TL;DR

Cut through the Python data tool hype with practical benchmarks and decision frameworks for DuckDB, Polars, Pandas, and PySpark, so you can pick the right tool when trading speed against scale.

Session Details

The Python data processing landscape has exploded with powerful alternatives to Pandas, each optimized for different use cases. But when should you reach for DuckDB's SQL analytics, Polars' lightning-fast DataFrame operations, traditional Pandas, or PySpark's distributed computing? This talk cuts through the hype with real-world benchmarks and practical decision frameworks.

We'll explore the fundamental trade-offs between speed and scale across these four essential tools. You'll see live performance comparisons on realistic datasets, from million-row analytics queries to multi-gigabyte data transformations. More importantly, you'll learn the decision criteria that separate these tools: when DuckDB's columnar engine shines, where Polars' Rust-powered speed matters most, why Pandas still has its place, and when you need PySpark's distributed architecture.
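As a rough illustration of how such a comparison could be timed, here is a minimal sketch using only the Python standard library. The helper names (`best_of`, `compare`) and the two toy group-by workloads are hypothetical stand-ins, not the session's actual benchmarks. In practice each candidate would wrap the same query written for Pandas, Polars, DuckDB, or PySpark.

```python
import time
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


def best_of(fn: Callable[[], object], repeats: int = 5) -> float:
    """Return the fastest wall-clock time (seconds) over `repeats` runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def compare(
    candidates: Dict[str, Callable[[], object]], repeats: int = 5
) -> List[Tuple[str, float]]:
    """Time every candidate and return (name, seconds) pairs, fastest first."""
    results = [(name, best_of(fn, repeats)) for name, fn in candidates.items()]
    return sorted(results, key=lambda pair: pair[1])


# Hypothetical stand-in data: (category, amount) rows.
ROWS = [(i % 10, float(i)) for i in range(100_000)]


def groupby_sum_single_pass() -> Dict[int, float]:
    """Group-by sum that visits each row exactly once."""
    totals: Dict[int, float] = defaultdict(float)
    for key, amount in ROWS:
        totals[key] += amount
    return dict(totals)


def groupby_sum_rescan() -> Dict[int, float]:
    """Same result, but deliberately rescans all rows once per category."""
    return {k: sum(a for key, a in ROWS if key == k) for k in range(10)}


if __name__ == "__main__":
    candidates = {
        "single pass": groupby_sum_single_pass,
        "rescan per key": groupby_sum_rescan,
    }
    for name, seconds in compare(candidates):
        print(f"{name}: {seconds * 1000:.1f} ms")
```

Taking the best of several runs reduces noise from caching and background load. It is still worth checking that every candidate returns the same result before comparing timings.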

This isn't just another feature comparison; it's a practical, to-the-point guide for Python developers who need to choose the right tool for their specific data challenges. You'll leave with clear mental models for tool selection, realistic performance expectations, and code examples you can immediately apply to your own projects.

The talk will address a critical pain point in today's Python ecosystem: tool choice paralysis in an increasingly crowded field of excellent options.

Target audience: Python developers with 1-3 years of experience who know Pandas basics but want to understand when and why to adopt newer tools. They're comfortable with DataFrames and basic data operations but need guidance on tool selection for different real-world scenarios.

3 things you'll get out of this session

Understand speed vs. scale trade-offs across DuckDB, Polars, Pandas, and PySpark.
Learn practical decision criteria for choosing the right tool for real-world datasets.
Access code examples and performance benchmarks to apply immediately in projects.

Gift Ojeabulu's other proposed sessions for 2026

AI in Action: Developing Smarter, Faster Data Platforms with LLMs and Copilot - 2026

Code Your Own Sports Analytics Dashboard - 2026

Data Validation in Production ML: Preventing Silent Failures with Pandera, GE, DBT and Deepchecks - 2026

Speakers

Gift Ojeabulu

giftojeabulu.medium.com
