Healthcare data has patient names. Analytics doesn’t need patient names. The question is how you get from “has PII” to “doesn’t have PII” without breaking regulations or losing the ability to track individuals over time.
I thought the answer was encryption. Keep the PII in Bronze, decrypt it in Silver just long enough to create anonymous identifiers, then drop it. The architecture was solid but complicated. Then someone asked why we were encrypting data we’d never decrypt again, and I realized anonymization was the answer all along.
The Real Requirement
We needed to track individual patients across time without knowing who they were. Care facility A provides services to patient X on multiple dates. We need to aggregate patient X’s total care minutes, count unique patients per facility, and analyze patterns of care over time.
But we don’t need to know that patient X is “Marie Dubois” or any other real name. We just need a stable identifier that’s the same for patient X across all records and different from every other patient.
The naive approach would be using the facility’s internal patient ID. But those IDs aren’t globally unique. Patient “123” at facility A is a different person than patient “123” at facility B. We needed a composite key that combined facility and patient ID.
Anonymization with Hashing
The solution was hashing the composite key. Combine facility code and patient number, run it through a fast hash function, use the resulting hash as the customer key. Same input always produces the same hash. Different inputs almost always produce different hashes (collisions are possible, and covered below).
import xxhash
customer_key = xxhash.xxh32(f"{facility_code}_{patient_id}").hexdigest()
This gives us a stable anonymous identifier. We can track “customer_7f3a2b91” across time without knowing who they are. The hash is irreversible - you can’t work backwards from the hash to recover the patient ID. (Well, you could brute force it, but that’s not practical for our threat model.)
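To make the facility scoping concrete, here is a minimal sketch. It swaps in the standard library’s zlib.crc32 for xxhash.xxh32 purely so it runs without third-party packages; both produce 32-bit non-cryptographic values. The function name make_customer_key and the facility codes are hypothetical.

```python
import zlib


def make_customer_key(facility_code: str, patient_id: str) -> str:
    """Build a facility-scoped identifier from the composite key.

    zlib.crc32 is a stand-in for xxhash.xxh32 (assumption: any stable
    32-bit hash illustrates the idea); the pipeline itself uses xxhash.
    """
    composite = f"{facility_code}_{patient_id}".encode("utf-8")
    return format(zlib.crc32(composite), "08x")


# Patient "123" exists at two facilities: same local ID, different people.
key_a = make_customer_key("FAC_A", "123")
key_b = make_customer_key("FAC_B", "123")

print(key_a != key_b)                              # True: the keys stay distinct
print(key_a == make_customer_key("FAC_A", "123"))  # True: same input, same key
```

Encoding the string explicitly as UTF-8 is a small defensive choice here: it pins down exactly which bytes get hashed, regardless of what the hash library does with str input.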
The key insight was recognizing we don’t need reversibility. We’re doing analytics, not customer service. We’ll never need to “de-anonymize” and look up who a customer actually is. The facilities maintain that mapping in their own systems if they ever need it for patient care.
Encryption’s Unnecessary Complexity
Encryption would’ve given us reversibility. Keep encrypted patient names in Bronze, decrypt when needed, maintain the ability to recover the original PII. But “when needed” never actually happens in our pipeline.
In the encryption design, the Silver layer creates customer keys and drops the PII. The Gold layer aggregates based on those keys. Nobody downstream ever needs the original patient names. So why carry that encrypted data through Bronze and maintain decryption capabilities we’ll never use?
Encryption also means key management. Environment variables in development, Vault or KMS in production. Key rotation procedures. Decrypt operations that could fail or leak. Performance overhead from encryption and decryption. All for data we were going to throw away anyway.
The Security Difference
Anonymization through irreversible hashing is stronger than encryption for our use case. With encryption, anyone who gets the key can decrypt the data. The key becomes the single point of failure. Secure the key and you’ve secured the data, but leak the key and all that PII is exposed.
With irreversible hashing, there’s no key to lose. You can’t decrypt the customer key back to a patient name because the hash function doesn’t work that way. Even if an attacker gets complete access to our Delta Lake storage, they see hashed identifiers, not names.
The attack surface is smaller. No decrypt operations means no opportunity for accidental PII leakage in logs or error messages. No key storage means no key theft. No reversibility means no compliance questions about who can access the decryption key.
GDPR and Healthcare Regulations
This is where it gets interesting. GDPR has strong opinions about PII, and healthcare regulations add more layers. We needed to be sure anonymization was compliant.
Turns out anonymization is explicitly supported. If you irreversibly remove PII such that individuals can’t be re-identified, the data is no longer subject to GDPR restrictions. The customer key derived from hashing qualifies because you can’t work backwards to the patient name.
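But there’s a catch: the anonymization must be genuinely irreversible. Using a deterministic hash like we did was acceptable for our threat model, but not because of the 32-bit output size. A 2^32 space is quick to enumerate, and the set of plausible facility codes and patient numbers is smaller still. The real protection is that the hash input is a facility-internal patient number rather than a name, and the number-to-name mapping exists only in the facilities’ own systems. If we’d used something weak like hash(patient_id) % 1000, the identifier would be trivially guessable and wouldn’t even be unique per patient.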
The other requirement is that the anonymized data can’t be combined with external data to re-identify individuals. Our customer keys are facility-scoped, meaning they’re meaningless outside the context of our analytics. You can’t take customer_7f3a2b91 and look them up in some public database because that key only exists in our system.
Healthcare regulations (HIPAA in the US, similar frameworks elsewhere) similarly allow de-identified data. Remove the PII, ensure you can’t re-identify individuals from the remaining data, and you’re clear to use it for analytics. Our approach qualified.
The Hash Function Choice
I chose xxhash32 because it’s fast, has good distribution properties, and produces 32-bit hashes. That gives us 4.3 billion possible customer keys, which is way more than we’ll ever need for the population we’re tracking.
Why not cryptographic hashes like SHA-256? Because we don’t need cryptographic strength. We’re not securing secrets, we’re creating anonymous identifiers. xxhash is 10x faster than SHA-256 for hashing short strings, and speed matters when you’re processing thousands of rows.
The deterministic property is crucial. The same facility + patient ID must always produce the same hash, even across different pipeline runs or different environments. This lets us track individuals over time. A randomized hash or a fresh random salt per record would break that continuity.
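A small sketch of why determinism matters, again using zlib.crc32 as a stand-in for xxhash so it runs on the standard library alone. The function names and sample values are hypothetical; the point is only the contrast between a hash that depends solely on its inputs and one that mixes in fresh randomness.

```python
import os
import zlib


def stable_key(facility_code: str, patient_id: str) -> str:
    """Deterministic: the result depends only on the two inputs."""
    data = f"{facility_code}_{patient_id}".encode("utf-8")
    return format(zlib.crc32(data), "08x")


def randomly_salted_key(facility_code: str, patient_id: str) -> str:
    """Non-deterministic: a fresh random salt is mixed in on every call."""
    salt = os.urandom(16)
    data = salt + f"{facility_code}_{patient_id}".encode("utf-8")
    return format(zlib.crc32(data), "08x")


# Simulate three pipeline runs hashing the same patient.
stable_runs = {stable_key("FAC_A", "123") for _ in range(3)}
salted_runs = {randomly_salted_key("FAC_A", "123") for _ in range(3)}

print(len(stable_runs))  # 1: every run produced the same key
print(len(salted_runs))  # almost certainly 3: each run produced a new key
```

The same trap applies to Python’s built-in hash() on strings, which is randomized per process by default, so it can’t serve as a stable identifier across pipeline runs.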
Collision Risk
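With 32-bit hashes, collision probability is non-zero. The birthday paradox puts the point where a collision becomes more likely than not at roughly sqrt(2^32) ≈ 65,536 unique inputs (closer to 77,000 if you work it out exactly). We’re tracking maybe 5,000 patients across all facilities, which works out to about a 0.3% chance of any collision at all. That risk is small enough for our purposes.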
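The arithmetic can be checked with the standard birthday approximation. This is a back-of-the-envelope sketch that assumes the hash spreads inputs uniformly; the function names are made up for illustration.

```python
import math


def collision_probability(n_inputs: int, hash_bits: int) -> float:
    """Approximate chance that at least two of n_inputs share a hash.

    Uses the birthday approximation 1 - exp(-n(n-1) / (2 * 2^bits)),
    which assumes the hash distributes inputs uniformly.
    """
    space = 2 ** hash_bits
    return -math.expm1(-n_inputs * (n_inputs - 1) / (2 * space))


def inputs_for_even_odds(hash_bits: int) -> float:
    """Number of inputs at which a collision becomes about 50% likely."""
    return math.sqrt(2 * math.log(2) * 2 ** hash_bits)


print(f"{collision_probability(5_000, 32):.2%}")  # about 0.29%
print(f"{inputs_for_even_odds(32):,.0f}")         # about 77,000
```

The square-root rule of thumb gives the right order of magnitude; the approximation just puts a number on how far below that threshold 5,000 patients sits.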
Even if we did hit a collision (two different facility + patient ID combinations producing the same hash), the worst case is we’d incorrectly merge two patients’ data together in aggregations. This would make totals slightly wrong but wouldn’t expose PII or violate compliance.
We could mitigate this by switching to xxhash64 (64-bit hashes), which pushes the expected collision out to billions of unique inputs. But for our scale, 32-bit is plenty.
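Whichever width is used, collisions can also be detected rather than just estimated. The sketch below is one way to do that, not something the pipeline is known to include. It needs the raw facility and patient numbers, so it would have to run in memory at the ingestion boundary, before anything is dropped. zlib.crc32 stands in for xxhash, and find_collisions is a hypothetical helper.

```python
import zlib
from collections import defaultdict


def find_collisions(pairs, hash_fn=zlib.crc32):
    """Return {key: inputs} for every key produced by more than one input.

    pairs   -- iterable of (facility_code, patient_id) tuples
    hash_fn -- bytes -> int; zlib.crc32 is a stand-in for xxhash here
    """
    inputs_by_key = defaultdict(set)
    for facility_code, patient_id in pairs:
        composite = f"{facility_code}_{patient_id}"
        inputs_by_key[hash_fn(composite.encode("utf-8"))].add(composite)
    return {k: v for k, v in inputs_by_key.items() if len(v) > 1}


# A deliberately tiny hash space (10 buckets) forces a collision among
# 11 distinct inputs, which shows the detector working.
tiny_hash = lambda data: zlib.crc32(data) % 10
eleven_patients = [("FAC_A", str(n)) for n in range(11)]

print(len(find_collisions(eleven_patients, hash_fn=tiny_hash)) >= 1)  # True

# Repeated rows for the same patient are not collisions.
print(find_collisions([("FAC_A", "1"), ("FAC_A", "1")]))  # {}
```

If the check ever reports a non-empty result on real data, that is the signal to move to a wider hash.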
What We Keep vs What We Drop
Bronze layer reads patient names from Firebird but immediately creates customer keys and drops the names before writing to Delta Lake. The PII exists in memory for a fraction of a second, then it’s gone.
We keep the customer key, facility code, date of service, intervention type, duration, and all the non-PII billing details. Everything analytics needs to compute KPIs and track patterns.
We drop patient names, any other direct identifiers, and anything that could be used for re-identification. The line isn’t always clear (is birthdate PII if you only keep month and year?), so we err on the side of dropping questionable fields.
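One way to “err on the side of dropping” is an allow-list: name the fields analytics is approved to see and discard everything else, so a questionable or newly added column is excluded by default. This is a sketch of that idea in plain Python, not a description of how the pipeline is actually written; every column name and sample value except the patient name example is invented for illustration.

```python
# Hypothetical column names: only explicitly approved fields survive.
ALLOWED_COLUMNS = {
    "customer_key",
    "facility_code",
    "service_date",
    "intervention_type",
    "duration_minutes",
}


def keep_allowed(row: dict) -> dict:
    """Return a copy of row restricted to the allow-list.

    Anything not listed is dropped, including columns added to the
    source later that nobody has reviewed yet.
    """
    return {name: value for name, value in row.items() if name in ALLOWED_COLUMNS}


raw_row = {
    "customer_key": "7f3a2b91",
    "facility_code": "FAC_A",
    "service_date": "2024-03-01",
    "intervention_type": "home_visit",
    "duration_minutes": 45,
    "nom_patient": "Marie Dubois",  # direct identifier
    "birth_month": "1948-07",       # questionable, so it is dropped too
}

print(sorted(keep_allowed(raw_row)))
# ['customer_key', 'duration_minutes', 'facility_code', 'intervention_type', 'service_date']
```

The trade-off is that a legitimately useful new column stays invisible until someone adds it to the list, which is the safer failure mode for PII.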
Silver layer works exclusively with customer keys. It never sees patient names because Bronze never stored them. Gold layer aggregates by facility and date, so even the customer keys disappear into counts and sums.
Testing Anonymization
How do you test that anonymization actually works? I built a few validation checks into the pipeline:
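First, verify that PII columns are actually absent from Delta tables. Read the Delta table schema and assert that “nom_patient” doesn’t exist under any capitalization, since the source column is NOM_PATIENT.
from deltalake import DeltaTable
dt = DeltaTable("lake/bronze/billing_data")
# Compare case-insensitively so NOM_PATIENT and nom_patient are both caught
column_names = [field.name.lower() for field in dt.schema().fields]
assert "nom_patient" not in column_names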
Second, verify customer keys are stable. Hash the same inputs multiple times and confirm you get the same key.
key1 = xxhash.xxh32("CONT_12345").hexdigest()
key2 = xxhash.xxh32("CONT_12345").hexdigest()
assert key1 == key2
Third, verify customer keys are unique per patient but consistent across time. Group by customer key and check that all records for a given key have the same facility code (since the key includes facility).
df = delta_lake.read_table("billing_data_anonymized")
multi_facility = df.groupby('customer_key')['facility'].nunique()
assert (multi_facility == 1).all(), "Customer keys span multiple facilities"
When Encryption Is Actually Needed
There are scenarios where encryption makes sense instead of anonymization. If you need to be able to reverse the anonymization later, encryption is the only option. Some compliance requirements mandate retaining PII for a certain period, which means encrypted storage in Bronze.
If downstream systems might legitimately need access to patient names (e.g., a care coordination tool that needs to contact patients), encryption lets you control access. Decrypt for authorized users, keep it encrypted for everyone else.
If you’re building a general-purpose data platform that doesn’t know all future use cases, encryption preserves optionality. You can always decrypt and re-anonymize later. Once you hash and drop PII, it’s gone forever.
But for a purpose-built analytics pipeline where you know you’ll never need the PII again, anonymization is simpler and more secure.
The Code Pattern
The pattern I settled on is doing anonymization at the boundary:
@asset
def bronze_ad_billing(firebird_database, delta_lake):
    # Fetch raw data (includes PII)
    df = firebird_database.fetch_data("V_DETAILFACTCNS_FICHIER_AD")
    # Immediately create anonymous identifier
    df['customer_key'] = df.apply(
        lambda r: xxhash.xxh32(f"{r['ETB']}_{r['NO_PATIENT']}").hexdigest(),
        axis=1
    )
    # Drop PII before it touches disk
    df = df.drop(columns=['NOM_PATIENT'])
    # Write to Bronze (no PII in storage)
    delta_lake.write_bronze(df, "billing_data")
This keeps the PII lifetime as short as possible. It exists in memory during the database query and the hashing operation, then it’s gone. It never reaches the Delta Lake storage layer.
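The boundary logic is easier to trust if it can be exercised without Firebird or Delta Lake. The sketch below is a plain-Python approximation of bronze_ad_billing with a hypothetical test double in place of the real storage resource. The names anonymize_rows and FakeDeltaLake, the DUREE column, and the pii_columns parameter are all assumptions, and zlib.crc32 again stands in for xxhash.

```python
import zlib


def anonymize_rows(rows, pii_columns=("NOM_PATIENT",)):
    """Add customer_key to each row and remove the listed PII columns."""
    cleaned = []
    for row in rows:
        composite = f"{row['ETB']}_{row['NO_PATIENT']}".encode("utf-8")
        out = {k: v for k, v in row.items() if k not in pii_columns}
        out["customer_key"] = format(zlib.crc32(composite), "08x")  # stand-in for xxh32
        cleaned.append(out)
    return cleaned


class FakeDeltaLake:
    """Hypothetical test double: records writes instead of touching disk."""

    def __init__(self):
        self.written = {}

    def write_bronze(self, rows, table_name):
        self.written[table_name] = rows


source_rows = [
    {"ETB": "CONT", "NO_PATIENT": "12345", "NOM_PATIENT": "Marie Dubois", "DUREE": 45},
]
lake = FakeDeltaLake()
lake.write_bronze(anonymize_rows(source_rows), "billing_data")

stored = lake.written["billing_data"]
print(all("NOM_PATIENT" not in row for row in stored))             # True
print(all("Marie Dubois" not in row.values() for row in stored))   # True
```

Making the dropped columns a parameter keeps one decision visible: the original asset removes only NOM_PATIENT, and whether the raw NO_PATIENT should also be listed is a call the pipeline owner has to make.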
Key Lessons
Anonymization with irreversible hashing is often stronger than encryption for analytics pipelines. No keys to manage, no decrypt operations to secure, no reversibility to worry about.
Know your actual requirements. We thought we needed encryption because healthcare data, but we actually needed tracking without identification. Hashing solved that.
The earlier you drop PII, the simpler your pipeline. Don’t carry it through Bronze, Silver, and Gold if you don’t need it past the ingestion boundary.
Compliance frameworks support anonymization explicitly. GDPR and HIPAA both allow de-identified data, provided the de-identification is genuinely irreversible.
Choose hash functions for the properties you actually need, not for cryptographic strength. Fast and deterministic matters more than security theater.
Test your anonymization. Verify PII is actually gone from storage, keys are stable, and identifiers work as expected.