Why I Chose Overwrite Over Merge (And When You Should Too)

I spent three days implementing custom diff-and-merge logic for our Delta Lake pipeline. The result? 24,000 tiny parquet files and query times that made me want to cry. Then I deleted all that clever code and went back to simple overwrites. Best decision I made all week.

Here’s the thing nobody tells you about Delta Lake MERGE operations: they’re amazing when you need them, but most of the time you don’t actually need them. The complexity-to-benefit ratio is way worse than you’d think, especially if you’re working with delta-rs in Python instead of PySpark.

The Merge Trap

I fell into it hard. We had a 25K row billing table that updated daily. In my head, I thought “incremental updates are more efficient than rewriting everything.” That’s true at massive scale, but our scale wasn’t massive. It was medium at best.

So I built this elaborate system that would hash each row, compare the new data against the existing data, identify what changed, and write only the diffs. The code was beautiful. The execution was a disaster.

The problem was file count. Every time I identified a changed row, I’d append it to the Delta table. Sounds reasonable, right? Except Delta Lake creates a new parquet file for each write operation. After processing our historical data, I had 24,612 individual parquet files for a table that should’ve been a single 2MB file.

Query performance went from 0.1 seconds to 8 seconds. DuckDB was choking on the file metadata alone. I’d optimized for write efficiency and destroyed read performance in the process.

The Overwrite Revelation

When I switched to simple overwrites, everything got faster. Write time dropped from 45 seconds to 0.3 seconds. Read time went from 8 seconds back to 0.1 seconds. Storage dropped from hundreds of megabytes (all that duplicate data across tiny files) to 2MB.

The trick is understanding what Delta Lake actually gives you. It’s not primarily an upsert engine, it’s a transaction log that wraps parquet files. Every write creates a new version in the log, regardless of whether you’re merging or overwriting. The version history is free either way.

So when I overwrite a table, I’m not losing historical data. It’s all still there in previous versions, accessible via time travel queries. The old parquet files stick around until I run VACUUM. I get all the benefits of versioning without the complexity of merge logic.

The Scale Question

Here’s the decision matrix I wish I’d had from the start. For tables under 100K rows, just overwrite. The write is fast enough that you won’t notice, and you save yourself from debugging subtle merge bugs at 2am.

Between 100K and 1M rows, you start to think about append-plus-dedupe patterns. Write new data in append mode, then deduplicate at the Silver layer. You’re still not doing complex merges, but you’re avoiding rewriting unchanged data.

Over 1M rows, you need real MERGE support. But here’s the catch: delta-rs Python bindings don’t have mature MERGE support yet. You’d need to switch to PySpark, which brings the entire JVM stack into your pipeline. That might be worth it at scale, but it’s a big architectural shift.

Our billing table never got past 50K rows. We chose the wrong optimization target.

The Pattern That Works

I settled on a layered approach that plays to Delta Lake’s strengths. Bronze layer uses overwrite with schema merge mode. This means new columns from the source system get added automatically, but I’m rewriting the entire table on each ingestion run.

Silver layer also uses overwrite. I read from Bronze, apply transformations (anonymization, normalization, whatever), and write the complete result. No partial updates, no merge logic, no file fragmentation.

Gold layer is pure aggregations, also overwritten on each run. The queries are deterministic based on Silver data, so there’s no point in incremental updates. Recompute everything and be done with it.

This pattern means my Delta Lake code is almost trivially simple:

def write_bronze(self, df, table_name):
    write_deltalake(
        table_or_uri=str(self.lake_path / table_name),
        data=df,
        mode="overwrite",
        schema_mode="merge"
    )

That’s it. No row hashing, no diff algorithms, no merge predicates. It just works.

When Merge Actually Wins

I’m not saying MERGE is useless. There are real scenarios where you need it. Streaming pipelines with continuous small updates can’t afford to rewrite entire tables every few seconds. CDC (Change Data Capture) systems that track insert/update/delete events need proper MERGE support to apply those changes correctly.

Large tables where only a small percentage changes per update cycle benefit from merges too. If you’ve got 10M rows and only 50K change daily, rewriting the whole table is wasteful. But notice I said “large tables.” If your table fits comfortably in memory, overwrite is probably fine.

Real-time dashboards are another case. If you can’t afford the few seconds of inconsistency during an overwrite operation, you need atomic merges. But most analytical pipelines can tolerate eventual consistency.

The Library Maturity Factor

Here’s something that bit me: delta-rs (the Rust-based Python bindings) is evolving fast, but it’s not as mature as the PySpark implementation. MERGE support exists but it’s not as robust or well-documented as in PySpark.

When I tried to use MERGE operations in delta-rs, I hit edge cases that required digging through GitHub issues to understand. The error messages weren’t always helpful. The performance characteristics weren’t what I expected.

Meanwhile, overwrite mode is rock-solid. It’s been there since day one, it’s well-tested, and it does exactly what you’d expect. Sometimes boring is better.

The Performance Numbers

For our 25K row table, the numbers told the story. Diff-and-merge took 45 seconds to write and created 24,612 files. Overwrite took 0.3 seconds and created 1 file. Append-with-dedupe split the difference at 0.5 seconds.

Read performance mattered even more. Querying the merge-fragmented table took 8 seconds. Querying the overwritten table took 0.1 seconds. That’s an 80x difference! Every time we ran a validation query or debugging query, we paid that penalty.

Storage efficiency wasn’t even close. The merge approach used hundreds of megabytes because each tiny parquet file has overhead. The overwrite approach used 2.1MB. Delta Lake’s compression and columnar storage work best with reasonably-sized files, not thousands of tiny ones.

Key Lessons

Don’t optimize for scale you don’t have. I built a complex merge system for a table that never exceeded 50K rows. Simple overwrites would’ve been fast enough from day one.

Delta Lake’s value isn’t the MERGE operation. It’s the transaction log, the time travel, the schema evolution, and the ACID guarantees. You get all of that with overwrites.

File count matters more than you think. Thousands of tiny parquet files destroy performance. One reasonably-sized file per table is fine for small-to-medium data.

Know your library’s maturity curve. delta-rs is getting better fast, but PySpark has more mature MERGE support if you actually need it. Choose your tradeoffs consciously.

Version history is free. Whether you merge or overwrite, Delta Lake keeps the old versions around until VACUUM runs. You’re not losing anything by using the simpler approach.