Migrating a Dagster Pipeline from Pandas to Polars and Delta Lake

I’m in the middle of a migration that’s bigger than I initially thought. What started as “let’s try Polars instead of Pandas” turned into a complete architecture overhaul. And honestly? It’s the right move.

The current pipeline works. It ingests healthcare data from a Firebird database, processes it with Pandas, and stores it in SQLite and DuckDB. But it has problems - performance problems, data quality problems, and auditability problems. The medallion architecture with Delta Lake solves all of them.

Why We’re Migrating

The existing pipeline has been running in production for months. It processes billing data, patient records, geriatric assessments, and payment information. Everything works, but there are friction points:

Performance: Pandas is slow with larger datasets. The billing data grows every month, and processing times keep creeping up.

Data Quality: We have validation logic scattered throughout the pipeline. Sometimes data quality issues only show up in the gold layer, but by then we’ve lost context about where they originated.

Auditability: We overwrite SQLite databases on each run. If something goes wrong, we can’t trace back to see what the raw data looked like or when transformations were applied.

Incremental Processing: Everything runs as a full refresh. Want to process just yesterday’s data? Can’t do it efficiently.

The medallion architecture with Delta Lake addresses all of these issues. Bronze layer preserves raw data. Silver layer documents transformations. Gold layer contains business logic. And Delta Lake gives us ACID transactions, schema evolution, and time travel.

Polars is just the cherry on top - it’s 5-10x faster than Pandas for our workloads.

The Migration Strategy

I’m not doing a big bang migration. That’s a recipe for disaster!! Instead, I’m building the new pipeline in parallel on a feature branch (medallion-migration) while keeping the old one running.

The strategy has four phases:

Phase 1: Bronze Layer Foundation (in progress)

Implement Delta Lake resource
Create mock Firebird resource using parquet files
Build bronze assets for all data sources
Write tests for bronze layer

Phase 2: Silver Layer Implementation

Data cleansing transformations
Schema validation with Polars
Patient identifier hashing
Data enrichment logic

Phase 3: Gold Layer Implementation

Business aggregations
Reporting tables
Analytics-ready datasets
Integration with existing DuckDB outputs (for compatibility)

Phase 4: Cutover

Parallel run validation
Output compatibility verification
Legacy ETL deprecation

Right now I’m in Phase 1. The bronze billing assets are done and tested. The pattern is established. Now it’s just a matter of replicating it for patients, geriatric assessments, patient plans, and bill payments.

The Architecture Changes

The old architecture was simple - assets pulled from Firebird, transformed with Pandas, and wrote to SQLite/DuckDB:

@asset(group_name="billing")
def billing_data(
    context: AssetExecutionContext,
    firebird: Union[FBDatabaseResource, MockFBDatabaseResource],
) -> pd.DataFrame:
    """Asset to load billing data from the Firebird database."""
    con = firebird.get_connection()
    df = fetch_billing_data_by_source(con)
    con.close()
    return df

@asset(group_name="billing")
def processed_billing_data(
    context: AssetExecutionContext,
    billing_data: pd.DataFrame
) -> pd.DataFrame:
    """Asset to process and validate the billing data."""
    df = process_billing_data(billing_data)
    df = hash_column(df, column_name="PATIENT_NR")
    # ... validation logic
    return df

This worked, but mixing ingestion and transformation made it hard to track data lineage. If validation failed, was it a source data problem or a transformation bug?

The new architecture separates concerns into layers:

# Bronze layer - raw ingestion only
@asset(
    group_name="bronze_billing",
    compute_kind="delta_lake",
)
def bronze_billing_raw(
    context: AssetExecutionContext,
    firebird_resource: MockFirebirdParquetResource,
    delta_lake: DeltaLakeResource,
) -> dict:
    """Ingest raw billing data from Firebird to bronze Delta Lake."""
    df = firebird_resource.read_billing_data()

    # Only add metadata - no business logic
    df = df.with_columns([
        pl.lit("firebird").alias("_source_system"),
        pl.lit(context.run_id).alias("_dagster_run_id"),
    ])

    table_path = delta_lake.write_delta(
        df=df,
        table_name="billing_raw",
        layer="bronze",
        mode="overwrite",
    )

    return {"row_count": len(df), "table_path": table_path}

The bronze layer has one job: preserve raw data with minimal transformation. All metadata columns are prefixed with _ to distinguish them from source data.

The silver layer (coming in Phase 2) will handle all the transformations - hashing patient IDs, validating schemas, enriching with reference data. And the gold layer will contain business aggregations.

The Delta Lake Resource

The Delta Lake resource is where most of the complexity lives. It provides a clean interface for reading and writing Delta tables across the three layers:

class DeltaLakeResource(ConfigurableResource):
    """Resource for managing Delta Lake tables in a medallion architecture."""

    base_path: str

    def write_delta(
        self,
        df: pl.DataFrame,
        table_name: str,
        layer: Literal["bronze", "silver", "gold"],
        mode: Literal["overwrite", "append", "error"] = "overwrite",
        partition_by: Optional[list[str]] = None,
    ) -> str:
        """Write a Polars DataFrame to a Delta Lake table."""
        table_path = self._get_table_path(table_name, layer)
        Path(table_path).parent.mkdir(parents=True, exist_ok=True)

        arrow_table = df.to_arrow()
        write_deltalake(
            table_path,
            arrow_table,
            mode=mode,
            partition_by=partition_by,
        )

        return table_path

    def read_delta(
        self,
        table_name: str,
        layer: Literal["bronze", "silver", "gold"]
    ) -> pl.DataFrame:
        """Read a Delta Lake table into a Polars DataFrame."""
        table_path = self._get_table_path(table_name, layer)
        dt = DeltaTable(table_path)
        return pl.from_arrow(dt.to_pyarrow_table())

The resource handles the conversion between Polars DataFrames and Arrow tables (which Delta Lake requires). It manages the directory structure (base_path/bronze/table_name, base_path/silver/table_name, etc.). And it provides a consistent interface regardless of which layer you’re working in.

The Development Resource Pattern

One of the smartest decisions was creating a MockFirebirdParquetResource for development and testing. Instead of requiring a live Firebird database, it reads from parquet files:

class MockFirebirdParquetResource(ConfigurableResource):
    """Mock Firebird database resource that reads from parquet files."""

    parquet_path: str

    def read_billing_data(self) -> pl.DataFrame:
        """Read mock billing data from parquet file."""
        file_path = Path(self.parquet_path) / "billing_raw.parquet"
        return pl.read_parquet(file_path)

This resource has the exact same interface as the production FirebirdResource (which I’ll build in Phase 2). The assets don’t know or care which one they’re using. In development, I set DAGSTER_ENV=dev and get the mock. In production, it’ll use the real database.

The beauty of this pattern is that it makes testing trivial. Tests create temporary parquet files, instantiate the mock resource, and run the actual asset code. No mocking libraries needed.

What I’ve Learned So Far

Polars is fast, but the real win is expressiveness. Yes, Polars is faster than Pandas. But the bigger benefit is the query optimizer and the lazy evaluation. I write transformations as a chain of operations, and Polars figures out the optimal execution plan.

Delta Lake’s ACID guarantees matter more than I expected. In the old pipeline, if a job failed halfway through writing to SQLite, we’d have partial data. Delta Lake transactions mean either the full write succeeds or none of it does.

The medallion architecture enforces good habits. Separating raw ingestion (bronze) from transformations (silver) from aggregations (gold) makes it obvious where each piece of logic belongs. No more “should this validation go here or there?” - there’s a clear answer.

Testing becomes easier with better abstractions. The resource pattern means I can test assets with real Delta Lake writes (to temp directories) without needing a real database. The tests are fast and reliable.

What’s Next

I need to finish the bronze layer - patients, geriatric assessments, patient plans, and bill payments. That’s Phase 1.

Then comes the silver layer. That’s where I’ll port all the transformation logic from the old pipeline. Patient ID hashing, data validation, schema enforcement, enrichment with reference data. The existing Pandas code is good documentation for what needs to happen - I just need to rewrite it in Polars.

The gold layer is mostly aggregations and reporting. Some of those will be straightforward ports from the existing grouped_billing_data and pmb_geriatric assets. Others might be new analytics that weren’t feasible with the old architecture.

And then cutover. Run both pipelines in parallel, compare outputs, resolve discrepancies. Once I’m confident the medallion pipeline produces identical results, deprecate the old one.

It’s a lot of work. But it’s the right architecture for the long term. Bronze layer for auditability. Silver layer for quality. Gold layer for analytics. Delta Lake for ACID guarantees and time travel. Polars for performance.

The migration is worth it.