Nov 15, 2025
9 min read

From Firebird to Gold: ETL for Healthcare Analytics

Building a complete data pipeline from legacy Firebird databases to clean analytics with Delta Lake medallion architecture

Healthcare billing data lives in a Firebird database from the 1990s. Finance teams need daily KPI reports. Between those two endpoints is our entire pipeline, built on Delta Lake’s medallion architecture to transform raw database rows into clean analytics.

The journey from Firebird to Gold layer taught me that ETL is less about the Extract and Transform steps and more about building resilient boundaries between systems that change independently. The database evolves, the business requirements evolve, and the pipeline has to keep working through all of it.

The Source System

Firebird is an open-source relational database that’s been around forever. It’s solid, fast, and widely used in healthcare and finance where reliability matters more than trendiness. Our care facilities have been running on Firebird for 15+ years.

The database has dozens of tables with French column names and complex views that join across patient records, billing codes, facility information, and intervention details. The schema evolved over time as regulations changed and new types of care were added.

We don’t control this database. The facility management software updates it, adds columns when needed, occasionally renames things. Our pipeline consumes data from materialized views that the software maintains, but we have no say in schema changes or data quality.

This is the reality of enterprise data engineering. You don’t get clean, stable APIs. You get a database that changes on someone else’s schedule, and your job is to extract value from it regardless.

The Bronze Ingestion

Bronze layer is raw data from Firebird, stored as-is in Delta Lake parquet files. The only transformations are adding metadata like ingestion timestamp and source identifier. Everything else passes through unchanged.

@asset(group_name="bronze")
def bronze_ad_billing(
    context: AssetExecutionContext,
    firebird_database,
    delta_lake
):
    """
    Ingest AD billing data from Firebird to Bronze Delta table
    """
    # Query Firebird view
    df = firebird_database.fetch_data("V_DETAILFACTCNS_FICHIER_AD")

    # Add metadata
    df["_ingestion_ts"] = datetime.now()
    df["_source"] = "AD"
    df["_view_name"] = "V_DETAILFACTCNS_FICHIER_AD"

    # Write to Bronze with schema evolution
    delta_lake.write_bronze(
        df=df,
        table_name="billing_data/V_DETAILFACTCNS_FICHIER_AD",
        schema_mode="merge"
    )

    context.log.info(f"Ingested {len(df)} rows from Firebird AD billing")
    return {"rows": len(df)}

The key is schema_mode="merge". When Firebird adds a column, Bronze automatically adds it to the Delta table. No code changes, no deployment, no incidents. This turns schema evolution from a breaking change into a non-event.

We ingest multiple views from different facilities. Each becomes a separate Bronze table. Some views have 50+ columns with names like DUREE_INTERVENTION and NO_PATIENT. We don’t clean or normalize at this stage, we just capture exactly what Firebird gives us.

The Firebird Resource

Connecting to Firebird from Python uses the fdb library, which wraps the Firebird client libraries. We built a Dagster resource that handles connection pooling and query execution:

from dagster import ConfigurableResource
import fdb
import pandas as pd

class FBDatabaseResource(ConfigurableResource):
    """Production Firebird database resource"""

    host: str
    database: str
    user: str
    password: str

    def get_connection(self):
        return fdb.connect(
            host=self.host,
            database=self.database,
            user=self.user,
            password=self.password,
            charset='UTF8'
        )

    def fetch_data(self, view_name: str, limit: int = None):
        """Execute query and return DataFrame"""
        conn = self.get_connection()

        query = f"SELECT * FROM {view_name}"
        if limit:
            query += f" ROWS 1 TO {limit}"

        try:
            df = pd.read_sql(query, conn)
            return df
        finally:
            conn.close()

The resource pattern is powerful because we can swap implementations. In development, we use MockFBDatabaseParquetResource that reads from sample parquet files instead of querying Firebird. The asset code stays identical, just the resource changes.

This let us build the entire pipeline without production database access. Export samples once, develop against parquet files locally, deploy with confidence that the real database connection would work the same way.

The Silver Transformation

Silver layer is where the real work happens. We normalize column names to lowercase, drop PII by creating anonymized customer keys, fix data types, and standardize facility names.

@asset(group_name="silver", deps=["bronze_ad_billing"])
def silver_ad_billing_anonymized(
    context: AssetExecutionContext,
    delta_lake,
    duckdb_delta
):
    """
    Transform Bronze AD billing data into anonymized Silver layer
    """
    # Read from Bronze using DuckDB for fast filtering
    df = duckdb_delta.query_delta_table(
        table_name="billing_data/V_DETAILFACTCNS_FICHIER_AD"
    )

    # Normalize column names (Firebird uses uppercase)
    df.columns = [c.lower() for c in df.columns]

    # Create stable anonymized customer key
    df['customer_key'] = df.apply(
        lambda r: xxhash.xxh32(f"{r['etb']}_{r['no_patient']}").hexdigest(),
        axis=1
    )

    # Drop PII - patient names never reach Silver storage
    df = df.drop(columns=['nom_patient'], errors='ignore')

    # Normalize facility names
    df['facility_name'] = df['etb'].map({
        'CONT': 'CONT',
        'CONS': 'CONS',
        'PETA': 'PETA',
        'SJC': 'SJC'
    })

    # Fix data types
    df['date_prestation'] = pd.to_datetime(df['date_prestation'])
    df['duree_intervention'] = pd.to_numeric(df['duree_intervention'])

    # Write to Silver
    delta_lake.write_silver(
        df=df,
        table_name="billing_data_anonymized"
    )

    context.log.info(f"Transformed {len(df)} rows to Silver")
    return {"rows": len(df), "facilities": df['facility_name'].nunique()}

The customer key is critical. It’s a hash of facility code plus patient ID, which gives us a stable anonymous identifier. We can track individual patients across time without storing their names. The hash is irreversible (can’t go from hash back to patient ID), which satisfies GDPR and healthcare privacy requirements.

Dropping nom_patient means Silver storage never contains PII. Even if someone gets direct access to the Delta Lake files, they see hashed identifiers, not patient names. This is defense in depth - PII exists only in Firebird and briefly in memory during Bronze ingestion.

The Gold Aggregations

Gold layer pre-computes KPIs that downstream systems need. Instead of forcing dashboards to aggregate raw data, we do the aggregation once in the pipeline and expose finished metrics.

@asset(group_name="gold", deps=["silver_ad_billing_anonymized"])
def gold_care_home_kpis_daily(
    context: AssetExecutionContext,
    delta_lake,
    duckdb_delta
):
    """
    Daily KPI aggregations per care facility
    """
    # Aggregate using DuckDB for speed
    kpis = duckdb_delta.conn.execute("""
        SELECT
            facility_name,
            DATE(date_prestation) as date,
            SUM(duree_intervention) as total_care_minutes,
            COUNT(DISTINCT customer_key) as unique_residents,
            COUNT(*) as num_care_acts
        FROM delta_scan('lake/silver/billing_data_anonymized')
        GROUP BY facility_name, DATE(date_prestation)
        ORDER BY facility_name, date
    """).df()

    # Write to Gold
    delta_lake.write_gold(
        df=kpis,
        table_name="care_home_kpis_daily"
    )

    context.log.info(f"Computed KPIs for {len(kpis)} facility-days")
    return {
        "rows": len(kpis),
        "facilities": kpis['facility_name'].nunique(),
        "date_range": f"{kpis['date'].min()} to {kpis['date'].max()}"
    }

This aggregation turns 25K rows in Silver into about 500 rows in Gold (4 facilities × ~125 days of history). Dashboard queries against Gold are instant because the aggregation is already done.

We use DuckDB for the aggregation instead of pandas because DuckDB’s SQL engine is faster for groupby operations on columnar data. The delta_scan() function reads directly from the Silver Delta table, and the entire query runs in under 0.1 seconds.

Data Quality Checks

Each layer has validation checks that run after materialization. Bronze checks that row counts are reasonable and required columns exist. Silver checks that PII was actually dropped and customer keys are valid. Gold checks that aggregations make sense.

@asset(group_name="checks", deps=["gold_care_home_kpis_daily"])
def gold_data_quality_check(
    context: AssetExecutionContext,
    duckdb_delta
):
    """
    Validate Gold layer metrics are reasonable
    """
    df = duckdb_delta.query_delta_table("care_home_kpis_daily")

    # Check 1: No negative care minutes
    negative_minutes = df[df['total_care_minutes'] < 0]
    if len(negative_minutes) > 0:
        raise ValueError(f"Found {len(negative_minutes)} rows with negative care minutes")

    # Check 2: All expected facilities present
    expected_facilities = {'CONT', 'CONS', 'PETA', 'SJC'}
    actual_facilities = set(df['facility_name'].unique())
    missing = expected_facilities - actual_facilities
    if missing:
        context.log.warning(f"Missing facilities: {missing}")

    # Check 3: Unique residents count is reasonable
    max_residents = df['unique_residents'].max()
    if max_residents > 200:
        raise ValueError(f"Suspiciously high resident count: {max_residents}")

    context.log.info("All data quality checks passed")
    return {"checks_passed": 3}

These checks catch bugs in transformation logic before downstream systems see bad data. In the old CSV pipeline, quality issues were discovered by humans noticing wrong numbers in reports. Now they’re caught automatically.

The Complete Pipeline

Dagster materializes all the assets in dependency order. Bronze runs first, then Silver, then Gold, then checks. If any step fails, the downstream steps don’t run. The previous version’s data stays in place, so dashboards continue working with yesterday’s numbers until the issue is fixed.

The entire pipeline runs in under 5 seconds:

  • Bronze ingestion: 0.8 seconds
  • Silver transformation: 1.2 seconds
  • Gold aggregation: 0.3 seconds
  • Data quality checks: 0.2 seconds

We run it nightly on a cron schedule, but we could run it every hour if needed. The incremental nature of Delta Lake means we could even switch to streaming ingestion with minimal changes.

Resilience Patterns

The medallion architecture provides natural failure isolation. If Bronze ingestion fails (Firebird is down, network issues), Silver and Gold keep serving yesterday’s data. If Silver transformation fails (bug in anonymization logic), Bronze keeps ingesting and Gold keeps serving previous Silver data.

Version history means we can always roll back. If a deployment introduces a bug in Gold aggregations, we can point dashboards at the previous Gold version while we fix the code. Time travel queries make this trivial.

Schema evolution means Firebird changes don’t break our pipeline. New columns appear in Bronze automatically. Renamed columns need code changes in Silver, but we have time to make those changes without production outages.

Mock resources mean we can test the entire pipeline locally without database access. Export samples, run the full Bronze-Silver-Gold flow, verify outputs, all on a laptop. This is huge for development velocity.

Performance Characteristics

For our data volumes (25K rows per day), performance is excellent. The entire pipeline fits in memory. DuckDB aggregations are fast. Delta Lake writes are fast. The bottleneck is actually querying Firebird, which takes 0.6 of the 0.8 seconds in Bronze ingestion.

We could optimize Firebird queries by adding indexes or using incremental queries (only fetch new rows since last run), but we haven’t needed to. The current performance is good enough that optimization would be premature.

If data volumes grow 10x or 100x, we’d revisit architecture. DuckDB and delta-rs work great for medium data but eventually you need distributed processing. For now, simplicity wins.

Key Lessons

Start with Bronze that accepts everything. Schema evolution and data quality issues are much easier to handle when you have the raw data captured.

Use resources for external systems. Swap production databases for mock files in development. Test the entire pipeline locally.

Drop PII as early as possible. Create anonymized identifiers at the Bronze-to-Silver boundary and never store names in your data lake.

Pre-compute aggregations in Gold. Don’t make downstream systems re-aggregate the same data repeatedly.

Build data quality checks into the pipeline. Catch issues automatically before humans notice wrong numbers.

Version history is your safety net. Delta Lake’s time travel means you can always recover from mistakes.

References