DuckDB

Here’s a comprehensive overview of DuckDB, an in-process database that has been rapidly gaining traction for analytics workloads, especially among data engineers, ML practitioners, and Python/R developers.
Author

Benedict Thekkel

🔹 What is DuckDB?

  • In-process OLAP database: It runs inside your application (like SQLite) but is optimized for analytics queries (OLAP), not transactional workloads (OLTP).
  • Zero-config: No server to spin up, no dependencies. You just import duckdb and start querying (see the snippet after this list).
  • Columnar engine: Uses a column-store format with vectorized execution, making it blazing fast for aggregation and scanning large datasets.
  • Modern SQL dialect: Full SQL support (joins, windows, CTEs, JSON, UDFs).
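
A minimal sketch of that zero-config flow (assuming only that the duckdb package is installed, e.g. pip install duckdb):

import duckdb

# No server, no setup: run SQL against an ephemeral in-memory database.
duckdb.sql("SELECT 42 AS answer").show()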

🔹 Key Features

1. Data Access

  • Reads CSV, Parquet, JSON, SQLite, Pandas, Polars, Arrow, Iceberg, MotherDuck (cloud).
  • Can query files directly (e.g., SELECT * FROM 'file.parquet'); see the examples after this list.
  • Can query in-memory dataframes without copying.
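
For example, a small sketch of direct file querying (measurements.parquet and events.csv are hypothetical files; read_csv_auto is DuckDB's schema-inferring CSV reader):

import duckdb

con = duckdb.connect()
# Query a Parquet file in place; nothing is imported or copied first.
con.sql("SELECT COUNT(*) FROM 'measurements.parquet'").show()
# read_csv_auto infers column names and types from the CSV itself.
con.sql("SELECT * FROM read_csv_auto('events.csv') LIMIT 5").show()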

2. Performance

  • Vectorized execution: Processes chunks of columns in CPU cache.
  • Parallelism: Multithreaded query execution; the thread count is tunable (see the snippet below).
  • Efficient compression: Similar to Parquet, built for scan speed.
  • Benchmarks show it often beats Pandas and is competitive with Polars for SQL-style analytics on millions to billions of rows.
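
A small sketch of inspecting parallelism, reusing the hypothetical measurements.parquet file (SET threads and EXPLAIN ANALYZE are standard DuckDB features):

import duckdb

con = duckdb.connect()
# Cap the number of worker threads for this connection.
con.execute("SET threads TO 8")
# EXPLAIN ANALYZE runs the query and reports per-operator timings.
con.sql("EXPLAIN ANALYZE SELECT station, AVG(temp) FROM 'measurements.parquet' GROUP BY station").show()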

3. Integration

  • Python / R / C++ / CLI bindings.
  • Native support for Pandas/Polars ↔︎ DuckDB interchange (see the snippet after this list).
  • Can be used as a lightweight ETL or embedded query layer inside ML pipelines.
  • dbt + DuckDB is popular for lightweight data warehousing.
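
A sketch of the interchange surface: a DuckDB relation can be materialized into each ecosystem's native structure (the respective package must be installed for each call):

import duckdb

rel = duckdb.sql("SELECT 1 AS id, 'a' AS tag")
pdf = rel.df()     # Pandas DataFrame (requires pandas)
tbl = rel.arrow()  # PyArrow Table (requires pyarrow)
pldf = rel.pl()    # Polars DataFrame (requires polars)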

4. SQL Features

  • Joins, aggregations, window functions, subqueries.
  • JSON functions (json_extract, json_group_array).
  • User-defined functions in Python (see the sketch after this list).
  • Time-travel style queries with Iceberg.
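
A minimal sketch of a Python UDF using DuckDB's create_function API (with type hints, the parameter and return types are inferred):

import duckdb

con = duckdb.connect()

def fahrenheit(celsius: float) -> float:
    return celsius * 9 / 5 + 32

# Register the Python function; types come from the annotations.
con.create_function("fahrenheit", fahrenheit)
con.sql("SELECT fahrenheit(20.0) AS f").show()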

5. Persistence

  • Default is in-memory, but you can call duckdb.connect("file.duckdb") to persist data on disk.
  • Tables can live in native DuckDB storage or in external formats (Parquet, CSV, Arrow); see the COPY sketch after this list.
  • No server process: library embedded in your app.
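
A short sketch of persistence plus export to external storage (analytics.duckdb and t.parquet are hypothetical paths; COPY ... TO with FORMAT PARQUET is standard DuckDB SQL):

import duckdb

# Open (or create) an on-disk database file instead of in-memory.
con = duckdb.connect("analytics.duckdb")
con.execute("CREATE TABLE IF NOT EXISTS t AS SELECT 1 AS id")
# Export a table (or any query result) to external Parquet storage.
con.execute("COPY t TO 't.parquet' (FORMAT PARQUET)")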

🔹 Typical Use Cases

Use Case | Why DuckDB Works Well
Ad-hoc analytics | Query Parquet/CSV files directly with SQL.
Replacement for Pandas | Faster joins, group-bys, aggregations.
Data science prep | Integrates with ML workflows (Pandas/Polars/Arrow).
Local OLAP engine | Can handle 100M–1B rows in memory on a laptop.
Embedded analytics | Works inside apps, no external server needed.
Lightweight ETL warehouse | Works with dbt, Iceberg, and MotherDuck for scaling.

🔹 Limitations

  • Not an OLTP database: No concurrent writes or high transaction rates.
  • Memory-bound: Runs in your process memory; not built for multi-terabyte datasets (unless combined with external storage).
  • Ecosystem still maturing: Great for analytics, but lacks advanced role/security features of Postgres.
  • Limited indexing: DuckDB supports ART indexes (useful for primary keys and point lookups), but the engine is optimized for full scans rather than index-heavy workloads (see the snippet below).
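
For completeness, a sketch of the indexing that does exist, via plain CREATE INDEX (the table and index names here are illustrative):

import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name VARCHAR)")
# Secondary ART index: helps point lookups, not analytical scans.
con.execute("CREATE INDEX idx_users_name ON users (name)")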

🔹 Ecosystem

  • MotherDuck: Cloud service extending DuckDB (shared storage, scaling, persistence).
  • dbt-duckdb: Run dbt models against local files and Parquet with DuckDB.
  • Polars + DuckDB: Polars for Pythonic transformations, DuckDB for SQL joins/aggregations (see the hybrid sketch after this list).
  • Lakehouse query layer: DuckDB works well as a SQL engine over Iceberg/Parquet data lakes.
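
A minimal hybrid sketch: DuckDB's replacement scans let SQL reference a local Polars DataFrame by its Python variable name, and .pl() converts results back:

import duckdb
import polars as pl

pets = pl.DataFrame({"kind": ["cat", "dog", "cat"], "age": [3, 5, 1]})
# The SQL references the local Polars DataFrame `pets` directly.
out = duckdb.sql("SELECT kind, AVG(age) AS avg_age FROM pets GROUP BY kind").pl()
print(out)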

🔹 Example Workflows

Query a Parquet file directly

import duckdb
con = duckdb.connect()
df = con.execute("SELECT station, AVG(temp) FROM 'measurements.parquet' GROUP BY station").df()

Interop with Pandas

import pandas as pd, duckdb
df = pd.read_csv("big.csv")
duckdb.query("SELECT col1, COUNT(*) FROM df GROUP BY col1").df()

Save to persistent DB

con = duckdb.connect("analytics.duckdb")
con.execute("CREATE TABLE IF NOT EXISTS sales AS SELECT * FROM 'sales.parquet'")

🔹 Performance Outlook

  • Handles billions of rows on a laptop with enough memory.
  • Rough benchmarks: scanning a 1B-row CSV can take ~30–60s, while the same data in Parquet is much faster (~10–20s); exact timings depend on hardware (see the timing sketch below).
  • With Polars/DuckDB hybrid workflows, you can approach Rust-level performance with the flexibility of Python.
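
A rough timing sketch for reproducing such comparisons on your own data (big.csv, big.parquet, and the value column are hypothetical; absolute numbers depend entirely on hardware):

import time

import duckdb

con = duckdb.connect()
# `value` is assumed to be a numeric column present in both files.
for src in ("big.csv", "big.parquet"):
    t0 = time.perf_counter()
    con.execute(f"SELECT AVG(value) FROM '{src}'").fetchone()
    print(f"{src}: {time.perf_counter() - t0:.1f}s")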

🔹 When to Use vs Alternatives

Tool | Best For
DuckDB | Local analytics, SQL over files, Pandas replacement
SQLite | Embedded OLTP, small apps
Polars | Pythonic fast dataframe ops (no SQL)
Pandas | Legacy ecosystem, flexibility, but slower
Postgres/BigQuery/Snowflake | Multi-user, transactional, massive data warehouse

In short: DuckDB is the SQLite of analytics. It’s ideal when you want SQL-powered analytics without a big warehouse or cluster, yet still want to handle 100M–1B rows efficiently. It’s becoming a cornerstone of the modern data stack, especially when paired with Parquet, Arrow, and dbt.
