Migrating to Medallion Architecture: The Parquet-to-Delta Journey

Our legacy pipeline was pandas scripts that exported CSVs. It worked fine until it didn’t. Schema changes broke everything downstream. Data quality issues were discovered weeks after they happened. We had no way to recover from mistakes except “restore last night’s backup and lose a day of data.”

Migrating to Delta Lake with medallion architecture fixed all of that, but the migration itself was risky. Healthcare dashboards depended on our daily exports. We couldn’t afford downtime or data inconsistencies during the transition. The trick was running both pipelines in parallel until we’d proven the new system matched the old one exactly.

The Old System

Firebird database held billing data from multiple care facilities. Every night, a cron job ran pandas scripts that queried each view, applied some transformations, and wrote CSV files to a shared directory. Downstream systems read those CSVs.

When the schema changed in Firebird (new column added, data type changed, column renamed), the pandas scripts would crash. We’d get an alert in the morning, fix the code, re-run manually, and hope we hadn’t missed the downstream processing windows.

Version history didn’t exist. If we corrupted data during transformation, we wouldn’t know until someone noticed the wrong numbers in a report. Rolling back meant restoring from backup, which meant losing recent data.

Testing was basically non-existent. We’d change transformation logic, run it on production, and see what broke. Not ideal.

The Medallion Plan

Bronze layer would store raw data exactly as it came from Firebird. Everything, including PII, in Delta Lake parquet files. This became our source of truth.

Silver layer would clean and anonymize. Drop PII columns, normalize names, fix data types, create consistent customer keys. Still detailed row-level data, but ready for analytics.

Gold layer would pre-compute aggregations. Instead of making downstream systems calculate totals, we’d do it in the pipeline and expose finished KPIs. Care minutes per facility per day, unique resident counts, intervention frequencies.

Each layer would be a Delta table with full version history and time travel capabilities. Schema evolution would be automatic at the Bronze layer. Data quality checks would run at every layer transition.

The Risky Part

We couldn’t stop the old pipeline during migration. Finance reports ran on those CSVs every morning. Cutting over in one big bang would be reckless. We needed a phased approach where both systems ran in parallel until we trusted the new one.

But we also couldn’t access production Firebird for development. Security policies, network restrictions, the usual enterprise barriers. I needed to build the entire medallion pipeline without being able to test against real data sources.

Mock Data Sources

The solution was creating mock database resources that read from parquet files instead of Firebird. I exported sample data once from production (1000 rows from each view), saved as parquet files in a dummy_data/ directory, and built a mock resource with the same interface as the real database resource.

class MockFBDatabaseParquetResource(ConfigurableResource):
    dummy_data_path: str = "dummy_data"

    def fetch_data(self, view_name: str, limit: int = None):
        file_path = Path(self.dummy_data_path) / f"{view_name}.parquet"
        df = pd.read_parquet(file_path)
        if limit:
            df = df.head(limit)
        return df

This meant I could develop the entire Bronze-Silver-Gold pipeline locally without touching production systems. The mock resource had the same method signatures as the real one, so swapping them later was just a configuration change.

Layer by Layer

I built incrementally instead of trying to implement everything at once. Week one was just Bronze ingestion: read from mock Firebird, write to Delta Lake, verify the parquet files looked right.

@asset
def bronze_ad_billing(firebird_database, delta_lake):
    df = firebird_database.fetch_data("V_DETAILFACTCNS_FICHIER_AD")
    delta_lake.write_bronze(df, "billing_data")
    return {"rows_written": len(df)}

That’s it. No transformations, no Silver layer, just proving that we could write Delta tables and read them back with DuckDB.

Week two added Silver layer with anonymization:

@asset
def silver_ad_billing_anonymized(bronze_ad_billing, delta_lake):
    df = delta_lake.read_table("billing_data")

    # Normalize column names
    df.columns = [c.lower() for c in df.columns]

    # Create anonymized customer key
    df['customer_key'] = df.apply(
        lambda r: xxhash.xxh32(f"{r['etb']}_{r['no_patient']}").hexdigest(),
        axis=1
    )

    # Drop PII
    df = df.drop(columns=['nom_patient'])

    delta_lake.write_silver(df, "billing_data_anonymized")
    return {"rows_written": len(df)}

I tested that the customer keys were stable (same input always produced same hash) and that PII was actually gone from the Silver parquet files.

Week three added Gold aggregations:

@asset
def gold_care_home_kpis_daily(silver_ad_billing_anonymized, delta_lake):
    df = delta_lake.read_table("billing_data_anonymized")

    kpis = df.groupby(['seniorie', 'date']).agg({
        'duree_intervention': 'sum',
        'customer_key': 'nunique',
        'reference_prestation': 'count'
    }).reset_index()

    delta_lake.write_gold(kpis, "care_home_kpis_daily")
    return {"rows_written": len(kpis)}

Now I had a complete pipeline, all running on mock data, with version history and time travel working at every layer.

The Comparison Script

Before switching production to the new pipeline, I needed proof that it produced the same results as the old CSV-based system. So I built a comparison script that ran both pipelines and diffed the outputs.

The old pipeline wrote CSV files with specific column names and formats. The new Gold layer needed to match exactly. I exported both to pandas DataFrames and compared:

def compare_outputs():
    # Old pipeline output
    df_old = pd.read_csv("output/billing.csv")

    # New pipeline output
    df_new = pd.read_parquet("lake/gold/care_home_kpis_daily")

    # Compare shapes
    assert df_old.shape == df_new.shape, "Row/column count mismatch"

    # Compare values
    for col in df_old.columns:
        mismatches = df_old[col] != df_new[col]
        if mismatches.any():
            print(f"Column {col}: {mismatches.sum()} mismatches")

I ran this daily for two weeks. The first few days had mismatches because I’d gotten some transformation logic wrong. By week two, the outputs matched perfectly.

Parallel Production

Once the comparison script was clean, I deployed the medallion pipeline to production alongside the old CSV pipeline. Both ran every night. The old pipeline fed the existing dashboards. The new pipeline materialized Delta tables that nobody was using yet.

This gave us a safety net. If the new pipeline broke, it didn’t matter because nothing depended on it. We could debug in production with real data without risk.

It also built confidence with stakeholders. I could show them that the Delta Lake outputs matched the CSV outputs exactly. When it came time to switch, it wasn’t a leap of faith.

The Schema Evolution Test

The real test came when Firebird added a new column to one of the source views. In the old pipeline, this would’ve crashed the pandas scripts because they expected a specific column list. We’d have to update code, redeploy, and re-run.

With Delta Lake, the Bronze layer handled it automatically. The schema_mode="merge" parameter meant new columns were just added to the Delta table without breaking anything. The Silver layer ignored columns it didn’t care about. The Gold layer didn’t see them at all.

No code changes. No crashes. The new column flowed through Bronze, got ignored by Silver, and never affected downstream systems. That’s when I knew the medallion architecture was working.

Cutting Over

The cutover was anticlimactic, which is exactly what you want. We updated the dashboard configurations to read from Gold Delta tables instead of CSV files. We monitored for a few days. Everything worked. We turned off the CSV export scripts.

Total migration time from decision to full cutover: 12 weeks. Three months sounds long, but we did it without breaking anything, without losing data, and without stressing out the finance team. Slow and steady wins.

What We Gained

Before migration, schema changes were production incidents. After migration, they’re automatic. Bronze accepts new columns, Silver transforms what it needs, Gold aggregates what it’s configured for.

Before migration, recovering from data corruption meant restoring from backup and losing hours of data. After migration, time travel lets us query any historical version and do selective recovery in minutes.

Before migration, data quality issues were discovered by humans noticing wrong numbers. After migration, we have automated checks at every layer that fail fast when something looks wrong.

Before migration, regenerating historical data for a new report took hours because we had to re-query Firebird and re-run all the transforms. After migration, all historical data is in Delta tables with fast DuckDB queries. Thirty seconds instead of four hours.

What We Lost

The old pipeline was simple. It was easy to understand what a single pandas script did. The medallion architecture is more layers, more abstraction, more moving parts. New team members take longer to onboard.

CSV files are universal. Anyone can open them. Delta Lake parquet files require specific tooling. That’s a barrier for ad-hoc analysis by non-technical users, though DuckDB helps bridge that gap.

The old pipeline had minimal dependencies. Pandas and the Firebird driver. The new pipeline needs delta-rs, DuckDB, Dagster, and a bunch of Python packages. More dependencies means more potential for breakage.

Storage costs went up. Delta Lake keeps version history, which means more disk space. We mitigate with VACUUM retention policies, but it’s still more expensive than writing a single CSV file.

Key Lessons

Mock data sources enabled risk-free development. Being able to build the entire pipeline without production database access was crucial.

Incremental implementation reduced complexity. One layer at a time, validate each step, build confidence before moving forward.

Parallel running builds trust. Running both pipelines simultaneously let us prove correctness before switching over.

Comparison scripts catch mistakes early. Automated validation that new outputs match old outputs saved us from shipping bugs.

Schema evolution is worth the architectural complexity. Automatic handling of new columns eliminates a whole category of production incidents.

Slow migration beats big-bang cutover. Twelve weeks felt long but we never had a fire drill or emergency rollback.