Schema Evolution in the Wild: When Your Bronze Layer Grows New Columns

The Firebird database team added a new column to one of our source views. In the old CSV-based pipeline, this would’ve been a production incident. The pandas scripts expected a specific column list, and the new column would’ve crashed everything. We’d get paged at 3am, patch the code, redeploy, and re-run manually.

With Delta Lake’s schema merge mode, the pipeline didn’t even notice. The new column flowed into Bronze, got ignored by Silver because we didn’t reference it, and never made it to Gold. No code changes, no deployment, no incident. Just automatic handling of schema evolution.

The Old Way

Our legacy pipeline queried Firebird views into pandas DataFrames, applied transformations, and wrote CSVs. The transformation code explicitly referenced column names:

df = df[['etb', 'no_patient', 'date_prestation', 'duree_intervention']]

If Firebird added a column, the query would return it, but the code wouldn’t know what to do with it. Sometimes this would cause a KeyError when we tried to access a renamed column. Sometimes it would silently include the new column in outputs, breaking downstream systems that expected a specific schema.

The “fix” was updating the column list in the code, testing with sample data, deploying to production, and hoping we didn’t introduce bugs. For schema changes in multiple views, this meant coordinated deployments and careful sequencing.

It was fragile and expensive. Every schema change was a mini-project with testing, deployment, and validation steps. We spent more time handling schema evolution than we did adding new features.

Delta Lake’s Schema Mode

Delta Lake has a schema_mode parameter that controls how writes handle schema differences. The default is “overwrite”, which means the incoming data’s schema replaces the existing table schema. This is fine when you control both sides, but it’s brittle when the source schema is changing independently.

The “merge” mode is magic. It compares the incoming schema with the existing table schema. Columns that exist in both are preserved. New columns in the incoming data are added to the table. Columns that exist in the table but not in the incoming data are kept (with null values for new rows).

write_deltalake(
    table_or_uri="lake/bronze/billing_data",
    data=df,
    mode="overwrite",
    schema_mode="merge"  # This is the key
)

Now when Firebird adds a column, the Bronze write automatically adds it to the Delta table schema. The parquet files get a new column, the transaction log updates the schema definition, and life goes on.

The First Schema Change

We tested this accidentally. The Firebird team added a payment_method column to track how patients paid for services. They mentioned it in a Slack message, but I didn’t update any code because I was curious what would happen.

The nightly pipeline ran. Bronze ingestion read the new column from Firebird, saw it wasn’t in the existing Delta table schema, and added it automatically. The transaction log version incremented from 8 to 9. The new parquet file had 21 columns instead of 20.

Silver transformation read from Bronze and created customer keys based on the columns it knew about. The payment_method column wasn’t referenced in the transformation code, so pandas ignored it. The Silver write dropped it implicitly by not including it in the output DataFrame.

Gold aggregations ran against Silver data that didn’t have payment_method. The KPIs were unchanged because they aggregated care duration and patient counts, not payment information.

Nothing broke. No alerts fired. The new column just existed in Bronze and stopped there.

Schema Evolution Across Layers

This pattern is powerful. Bronze accepts everything from source systems with schema merge mode. New columns, renamed columns, changed data types - they all flow into Bronze automatically.

Silver is selective about what it keeps. The transformation logic explicitly constructs the output DataFrame with the columns it needs. New Bronze columns are available if you want to use them, but they don’t force their way into Silver unless you reference them.

Gold is even more selective. It’s aggregations and KPIs, so it only includes the specific metrics being computed. Schema changes in Silver don’t affect Gold unless they impact the aggregation logic.

This layered approach means schema evolution is localized. Bronze grows to match sources. Silver stays stable unless you choose to adopt new columns. Gold is insulated from most changes entirely.

Handling Column Renames

Renames are trickier than additions. If Firebird changes duree_intervention to intervention_duration, that’s a breaking change. The old column disappears, a new column appears, and schema merge mode treats it as a drop plus an add.

Bronze would end up with both columns after the rename. Old data has values in duree_intervention and nulls in intervention_duration. New data has nulls in duree_intervention and values in intervention_duration.

Our Silver transformation would break because it references duree_intervention explicitly:

df['total_minutes'] = df['duree_intervention']

This would fail on new data where that column is null. The fix is handling both column names:

if 'intervention_duration' in df.columns:
    df['total_minutes'] = df['intervention_duration']
elif 'duree_intervention' in df.columns:
    df['total_minutes'] = df['duree_intervention']
else:
    raise ValueError("Neither column found")

Not as clean as automatic handling of additions, but still better than the old approach where we’d have to coordinate the rename across source system, ETL code, and downstream consumers simultaneously.

Type Changes

Data type changes are the worst kind of schema evolution. If Firebird changes duree_intervention from integer to float or from numeric to string, Delta Lake has to deal with incompatible types.

Schema merge mode won’t save you here. Parquet files have strict types, and you can’t have one parquet file with integers and another with floats for the same column. Delta Lake will error on the type mismatch.

The solution is converting types explicitly in the ingestion code before writing to Delta Lake:

# Ensure consistent types
df['duree_intervention'] = df['duree_intervention'].astype('float64')

This forces the incoming data to match the expected type. If Firebird starts sending strings in a numeric column, the cast will fail explicitly rather than creating a mixed-type table.

For backward compatibility, you can create a new column with the new type and keep the old column for historical data:

if df['duree_intervention'].dtype == 'object':  # Strings
    df['duree_intervention_v2'] = pd.to_numeric(df['duree_intervention'])

Now Bronze has both columns. Silver can migrate to using duree_intervention_v2 while old data still has duree_intervention. Eventually you drop the old column after backfilling historical data.

Schema Discovery

One nice benefit of schema merge mode is that you can discover schema changes by comparing Delta table versions. The transaction log records schema changes with each version.

from deltalake import DeltaTable

dt = DeltaTable("lake/bronze/billing_data")

# Get schema at version 8
dt_v8 = dt.load_as_version(8)
schema_v8 = set(field.name for field in dt_v8.schema().fields)

# Get schema at version 9
dt_v9 = dt.load_as_version(9)
schema_v9 = set(field.name for field in dt_v9.schema().fields)

# Find added columns
new_columns = schema_v9 - schema_v8
print(f"New columns in v9: {new_columns}")

This is useful for auditing and documentation. When someone asks “when did we start tracking payment_method?”, you can query the transaction log and find the exact version and timestamp.

Automated Schema Change Notifications

We built a Dagster sensor that checks for schema changes and posts to Slack when they happen:

@sensor(job=bronze_ingestion_job)
def schema_change_sensor(context):
    dt = DeltaTable("lake/bronze/billing_data")
    history = dt.history()

    if len(history) >= 2:
        latest = history[0]
        previous = history[1]

        # Check for schema changes
        schema_current = dt.schema()
        dt_prev = dt.load_as_version(previous['version'])
        schema_prev = dt_prev.schema()

        if len(schema_current.fields) != len(schema_prev.fields):
            columns_added = set(f.name for f in schema_current.fields) - \
                          set(f.name for f in schema_prev.fields)

            if columns_added:
                post_to_slack(f"New columns in Bronze: {columns_added}")

Now we get notified when source systems change schemas, even though the pipeline handles it automatically. This keeps the team aware of upstream changes without requiring manual coordination.

The Bronze Contract

We established a contract for Bronze layer: accept everything from sources without validation. This is controversial because it means Bronze can have garbage data, but it’s the right tradeoff for schema resilience.

If a source system starts sending string values in a numeric column, Bronze ingests them. The error shows up in Silver when we try to use those values, not at the Bronze boundary. This keeps Bronze ingestion simple and delegates validation to the transformation layer.

The benefit is Bronze never fails due to unexpected data. It might create quality issues downstream, but the ingestion keeps running and the data keeps flowing. We can fix the quality issue in Silver without re-running Bronze.

Storage Implications

Schema evolution does increase storage slightly. Bronze parquet files with 20 columns are smaller than files with 21 columns, even if most rows have nulls in the new column.

But parquet compression handles sparse columns efficiently. If 90% of rows have null in a column, parquet stores that very compactly. The storage overhead is minimal compared to the operational benefit of automatic schema handling.

We ran numbers after accumulating several schema changes: Bronze storage grew by about 3% compared to what it would’ve been with a fixed schema. That’s negligible.

Key Lessons

Schema merge mode in Delta Lake eliminates schema change incidents. New columns in source systems flow through automatically without code changes.

Layer your schema strictness. Bronze accepts everything, Silver is selective, Gold is minimalist. This isolates changes and prevents cascading failures.

Handle column renames explicitly. Schema merge treats renames as drop-plus-add, so you need compatibility code in transformations.

Type changes still require manual handling. Convert types explicitly before writing to Delta Lake to avoid mixed-type disasters.

Monitor schema changes even though they’re automatic. Knowing when upstream systems evolve helps you decide when to adopt new columns in downstream layers.

The transaction log is a schema change audit trail. Use version history to track when columns were added, by whom, and with what data.