Building a Delta Lake Pipeline Without Spark

Delta Lake was built for Spark. The documentation assumes you’re running on Databricks or EMR with a cluster and JVM memory tuning. But we had 250K rows, not 25M. We had a laptop, not a cluster. Installing Spark felt like bringing a semi-truck to pick up groceries.

Turns out you don’t need Spark at all. DuckDB can query Delta tables directly, and delta-rs (the Rust-based Python bindings) can write them. You get all the benefits of Delta Lake - transaction log, time travel, ACID guarantees - without touching the JVM.

The Spark Problem

Spark is amazing at scale. When you’re processing terabytes across hundreds of nodes, the complexity is worth it. But that same complexity is a burden when you’re processing megabytes on a single machine.

Installing Spark means installing Java, downloading multiple gigabytes of JARs, configuring memory settings, dealing with the driver vs executor mental model. Then you get to write PySpark code, which is Python that calls Java that runs your Python UDFs in a subprocess. The overhead is real.

For our healthcare billing data, we’re talking about tables that fit comfortably in RAM. A few thousand patients, daily updates, simple aggregations. Spark would work, but it would be like using a database server when a SQLite file would do.

DuckDB’s Delta Integration

DuckDB added native Delta Lake support through the delta_scan() function. Point it at a Delta table directory, and it reads the transaction log, figures out which parquet files belong to the current version, and queries them directly.

import duckdb

conn = duckdb.connect()
conn.execute("INSTALL delta; LOAD delta;")

df = conn.execute("""
    SELECT *
    FROM delta_scan('lake/silver/billing_data_anonymized')
    WHERE date >= '2024-11-01'
""").df()

That’s it. No cluster configuration, no Java heap sizes, no Spark context. Just SQL against Delta tables, returning a pandas DataFrame.

The performance is excellent for small-to-medium data. DuckDB is an in-process database with columnar storage and SIMD vectorization. For aggregation queries on our 250K row table, DuckDB is 3x faster than loading into pandas and doing groupby operations. It’s 40x faster than starting up Spark.

Delta-rs for Writes

Reading is one thing, writing is another. DuckDB’s Delta integration is read-only as of 2024. To write Delta tables, we need delta-rs, the official Rust-based implementation with Python bindings.

from deltalake import write_deltalake

write_deltalake(
    table_or_uri="lake/bronze/billing_data",
    data=df,
    mode="overwrite",
    schema_mode="merge"
)

This writes parquet files and updates the Delta transaction log atomically. Same ACID guarantees as PySpark, but it’s all happening in Rust with Python bindings. No JVM involved.

Schema evolution works automatically with schema_mode="merge". If the incoming DataFrame has new columns, they get added to the Delta table schema. Existing columns are preserved. This is the feature that eliminates schema change incidents.

The Stack

Our entire pipeline is pure Python with two key dependencies: DuckDB for reading and querying Delta tables, delta-rs for writing and transaction management.

We use Dagster for orchestration, which is also pure Python. The assets define data dependencies, Dagster materializes them in the right order, and we get a complete data pipeline without installing Java.

The Bronze ingestion reads from Firebird (Python driver), creates a pandas DataFrame, anonymizes PII, and writes to Delta Lake via delta-rs. Silver transformations read from Bronze via DuckDB, apply pandas operations, write back via delta-rs. Gold aggregations use DuckDB for fast SQL queries, write results via delta-rs.

It’s all in-process. No cluster to configure, no executors to tune, no driver memory to allocate. The entire pipeline runs on a single machine with predictable resource usage.

Performance at Our Scale

For our 250K row billing table, write performance is excellent. Ingesting from Firebird to Bronze takes 0.3 seconds. Silver transformation takes 0.4 seconds. Gold aggregation takes 0.2 seconds. Total pipeline runtime is under 2 seconds.

DuckDB queries are fast enough for interactive use. Aggregating care minutes by facility and date across the full history: 0.08 seconds. Joining Silver and Gold for validation queries: 0.15 seconds. Time travel queries comparing current vs historical versions: 0.2 seconds.

Memory usage is reasonable. The entire pipeline peaks at about 200MB of RAM. DuckDB’s columnar processing means it doesn’t need to load full DataFrames into memory for aggregations. Delta-rs writes incrementally instead of buffering entire tables.

Compare that to Spark. Even the simplest PySpark job allocates gigabytes for driver and executor memory. Startup time is measured in seconds, not milliseconds. The overhead dominates the actual work for small datasets.

Time Travel Without Spark

One of Delta Lake’s killer features is time travel - the ability to query historical versions of tables. This is built into the transaction log and doesn’t require Spark.

DuckDB supports version-based time travel directly:

# Query version 5
historical = conn.execute("""
    SELECT facility_name, COUNT(*) as count
    FROM delta_scan('lake/silver/billing_data_anonymized', version = 5)
    GROUP BY facility_name
""").df()

Delta-rs supports it too:

from deltalake import DeltaTable

dt = DeltaTable("lake/silver/billing_data_anonymized")
dt_v5 = dt.load_as_version(5)
df_v5 = dt_v5.to_pyarrow_table().to_pandas()

We used this extensively during the data corruption recovery. Query the current version, query version 11 (before corruption), compare them with SQL, selectively recover corrupted rows. All without Spark.

Schema Evolution Example

The feature that sold me on Delta Lake was automatic schema evolution. When Firebird added a new column to one of the source views, I expected the pipeline to break. It didn’t.

The Bronze ingestion with schema_mode="merge" automatically added the new column to the Delta table. The Silver transformation ignored it (pandas doesn’t care about extra columns). The Gold aggregation never saw it.

No code changes, no deployment, no manual intervention. The new column just flowed through Bronze and stopped at Silver because nothing downstream needed it. This is the kind of resilience that prevents 3am pages.

With Spark, you’d get the same behavior, but you’d be managing Spark schemas and catalogs. With delta-rs, it’s just parquet files and a JSON transaction log. Simpler to understand, easier to debug when things do go wrong.

DuckDB’s Query Optimization

DuckDB isn’t just reading parquet files sequentially. It uses Delta Lake’s transaction log to do smart optimizations.

File pruning means DuckDB only reads parquet files that contain data matching your query predicates. If you filter on date >= '2024-11-01', it checks the Delta stats in the transaction log and skips files that only have older dates.

Column pruning means DuckDB only reads the columns you actually query. Parquet is columnar, so reading just facility_name and total_care_minutes is much faster than reading all 20 columns in the table.

Predicate pushdown applies filters as early as possible. Instead of reading all rows and filtering in memory, DuckDB uses parquet file statistics to skip entire row groups that can’t match your query.

For our Gold layer queries that aggregate by facility and date, these optimizations make the difference between 0.08 seconds and 0.5 seconds. That’s the speed gap between “interactive” and “waiting.”

Limitations to Know

DuckDB can’t write Delta tables. You need delta-rs or Spark for writes. This is fine for a medallion architecture where each layer writes once per run, but it wouldn’t work for streaming updates.

DuckDB runs in a single process on one machine. If your data is too large for RAM, you’ll have performance issues. Spark’s distributed processing is designed for that scale. Our 250K rows fit easily, but 25M rows might not.

Delta-rs doesn’t support all Delta Lake features. MERGE operations exist but are less mature than in PySpark. Deletion vectors, liquid clustering, and other advanced features might not be available yet. Check the current delta-rs version for feature support.

Transaction isolation is weaker than Spark. If you’re running multiple concurrent writers to the same Delta table, you might hit conflicts. Our pipeline is sequential, so this hasn’t been an issue.

When You Actually Need Spark

If your data is measured in gigabytes or terabytes, Spark is probably the right choice. DuckDB can handle larger-than-RAM datasets but it’s not designed for that.

If you need distributed processing across multiple machines, Spark is built for that. DuckDB is single-process.

If you’re already on Databricks or EMR and the infrastructure is there, use it. PySpark with Delta Lake is mature and well-supported.

If you need advanced Delta Lake features like MERGE with complex predicates or deletion vectors, PySpark has better support than delta-rs.

But if you’re processing megabytes or small gigabytes on a single machine for analytical workloads, DuckDB + delta-rs gives you Delta Lake without the complexity.

The Dagster Integration

Dagster assets work beautifully with this stack. Each asset is a Python function that reads from Delta tables via DuckDB and writes back via delta-rs.

@asset
def silver_ad_billing_anonymized(
    context,
    bronze_ad_billing,
    delta_lake,
    duckdb_delta
):
    # Read from Bronze via DuckDB
    df = duckdb_delta.query_delta_table("billing_data")

    # Transform in pandas
    df = transform_and_anonymize(df)

    # Write to Silver via delta-rs
    delta_lake.write_silver(df, "billing_data_anonymized")

    return {"rows": len(df)}

Dagster handles the dependency graph, materialization scheduling, and asset versioning. DuckDB handles fast queries. Delta-rs handles transactional writes. It’s a clean separation of concerns.

Key Lessons

You don’t need Spark to use Delta Lake. DuckDB for reads and delta-rs for writes cover most analytical pipeline use cases.

The complexity of Spark is justified at scale but painful at small-to-medium size. Know your data volume and choose accordingly.

DuckDB’s Delta integration is fast and feature-rich. Time travel, predicate pushdown, columnar processing all work without a cluster.

Schema evolution in delta-rs matches PySpark behavior. New columns are automatically added with schema_mode="merge".

In-process databases like DuckDB have much lower overhead than distributed systems. For data that fits in memory, they’re faster and simpler.

Delta Lake’s value isn’t tied to Spark. The transaction log, versioning, and ACID guarantees work with any implementation.