I built a complete encryption system for our Bronze layer. Column-level encryption with Fernet, key management with environment variables, decrypt-transform-drop pattern in Silver. The architecture was solid, the code was clean, and then I realized we didn’t actually need any of it.
The breakthrough came during a code review when someone asked “why are we encrypting patient names if we’re just going to drop them in the next step?” Good question. Really good question.
The Healthcare Data Problem
Our pipeline ingested billing data from multiple care facilities. The raw data included patient names, which is PII that needs careful handling. The Bronze layer stored everything raw from the source Firebird database. The Silver layer anonymized the data by creating hashed customer keys. The Gold layer aggregated everything for analytics.
Patient names appeared in Bronze, got anonymized in Silver, and never made it to Gold. Standard medallion architecture pattern for PII handling.
But Bronze stored the raw names in Parquet files on disk. If someone got access to the storage layer, they’d have plaintext patient names. That’s a compliance problem and a security problem.
The Encryption Solution
My first approach was Parquet Modular Encryption (PME). It’s built into the Parquet file format, supports column-level encryption, and has good integration with Spark. Perfect, right?
Wrong. For Delta tables, mature PME support meant PySpark and the JVM-based Spark stack. We were using delta-rs, the Rust implementation of Delta Lake, through its Python bindings. PME support in delta-rs was listed as “under construction” in the documentation. I could’ve switched our entire stack to PySpark, but that felt like massive overkill for a single encryption requirement.
So I went with application-level encryption instead. Encrypt the sensitive columns before writing to Delta Lake, decrypt them when reading. Use Python’s cryptography library with Fernet for symmetric encryption. Keep the encryption key in an environment variable during development, plan for HashiCorp Vault or AWS KMS in production.
The code was straightforward:
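import pandas as pd
from deltalake import write_deltalake

# Excerpt: methods of the delta_lake helper object; self.cipher is a Fernet instance.
def _encrypt_column(self, series):
    # Encrypt each non-null value; nulls stay null.
    return series.apply(
        lambda x: self.cipher.encrypt(str(x).encode()).decode()
        if pd.notna(x) else None
    )

def write_bronze(self, df, table_name, encrypted_columns=None):
    # Work on a copy so the caller's DataFrame keeps its plaintext values.
    df_encrypted = df.copy()
    for col in encrypted_columns or []:
        if col in df.columns:
            df_encrypted[col] = self._encrypt_column(df[col])
    write_deltalake(str(self.lake_path / table_name), df_encrypted)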
The Bronze asset encrypted the NOM_PATIENT column before writing. The Silver asset decrypted it just long enough to create the anonymized customer key, then dropped it. The PII existed in memory briefly, but never hit disk in plaintext.
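Here is a minimal sketch of the Silver-side counterpart, to make the decrypt-transform-drop pattern concrete. It is not the exact code from the pipeline: the environment variable name BRONZE_FERNET_KEY and the helper names load_cipher, encrypt_column, decrypt_column, and silver_from_bronze are hypothetical, and the customer-key step is left as a placeholder.

```python
import os

import pandas as pd
from cryptography.fernet import Fernet

# Hypothetical variable name; in development the key is exported in the shell.
KEY_ENV_VAR = "BRONZE_FERNET_KEY"


def load_cipher() -> Fernet:
    """Build a Fernet cipher from the key stored in the environment."""
    return Fernet(os.environ[KEY_ENV_VAR].encode())


def encrypt_column(series: pd.Series, cipher: Fernet) -> pd.Series:
    """Encrypt each non-null value, returning Fernet tokens as text."""
    return series.apply(
        lambda x: cipher.encrypt(str(x).encode()).decode() if pd.notna(x) else None
    )


def decrypt_column(series: pd.Series, cipher: Fernet) -> pd.Series:
    """Reverse encrypt_column, leaving nulls untouched."""
    return series.apply(
        lambda x: cipher.decrypt(x.encode()).decode() if pd.notna(x) else None
    )


def silver_from_bronze(bronze_df: pd.DataFrame, cipher: Fernet) -> pd.DataFrame:
    """Decrypt the name in memory, use it, then drop it before returning."""
    df = bronze_df.copy()
    df["NOM_PATIENT"] = decrypt_column(df["NOM_PATIENT"], cipher)
    # ... derive the anonymized customer key here while the name is in memory ...
    return df.drop(columns=["NOM_PATIENT"])
```

The plaintext only exists inside silver_from_bronze, so nothing downstream of that function can write it to disk by accident.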
The Performance Cost
Encryption isn’t free. For our 25K row table with one encrypted column, write time went from 0.3 seconds to 0.5 seconds. Read-and-decrypt added another 0.1 seconds. Not terrible, but not free either.
AWS KMS was worse. API calls for encrypt/decrypt operations add network latency. I measured 1.2 seconds for writes and 0.6 seconds for reads using KMS. Those writes took four times as long as the unencrypted ones (1.2 seconds versus 0.3). At scale, that adds up.
Storage overhead was minimal. Encrypted strings are slightly larger than plaintext, but we’re talking bytes per row. The Parquet files grew from 2.1 MB to 2.4 MB. Negligible.
The real cost was complexity. Now I had encryption keys to manage, decrypt operations to test, and failure modes to handle. What happens if the encryption key rotates? What if decrypt fails on a corrupted value? What if someone needs to access Bronze data directly?
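To give a feel for one of those failure modes, here is a hedged sketch of a decrypt wrapper that tolerates bad values. The helper name safe_decrypt is hypothetical, and returning None is just one possible policy; a real pipeline might log or quarantine the row instead.

```python
from typing import Optional

from cryptography.fernet import Fernet, InvalidToken


def safe_decrypt(cipher: Fernet, token: Optional[str]) -> Optional[str]:
    """Return the plaintext, or None if the value is null or cannot be decrypted."""
    if token is None:
        return None
    try:
        return cipher.decrypt(token.encode()).decode()
    except InvalidToken:
        # Raised for corrupted or tampered tokens, and for tokens
        # that were encrypted under a different key.
        return None
```

Fernet raises the same InvalidToken for a corrupted value and for a wrong key, so this wrapper alone cannot tell a damaged row from a missed key rotation.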
The Simpler Solution
That code review question stuck with me. Why encrypt data we’re about to delete?
The alternative was obvious once I stopped thinking about encryption as the goal. The goal was “don’t store PII in Bronze layer.” Encryption was one approach. Not storing it at all was another.
So I changed the Bronze ingestion to create the anonymized customer key immediately, before writing to Delta Lake:
import xxhash

def bronze_ad_billing(firebird_database, delta_lake):
    df = firebird_database.fetch_data("V_DETAILFACTCNS_FICHIER_AD")

    # Derive the anonymized key from the ETB and NO_PATIENT columns
    df['customer_key'] = df.apply(
        lambda r: xxhash.xxh32(f"{r['ETB']}_{r['NO_PATIENT']}".encode()).hexdigest(),
        axis=1
    )

    # Drop PII before writing
    df = df.drop(columns=['NOM_PATIENT'])

    # Write to Bronze (no encryption needed)
    delta_lake.write_bronze(df, "billing_data")
Now Bronze never sees the patient names. They’re read from Firebird and dropped in memory before anything is written, while the customer key is hashed from the ETB and NO_PATIENT columns. The hash goes into Bronze as the customer key. Silver uses that key directly, no decryption needed.
No encryption library. No key management. No decrypt operations. No performance overhead. The PII never touches the disk layer at all.
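Since the whole approach depends on the name column really being gone before the write, a small guard can make that explicit. This is a sketch under assumptions: the PII_COLUMNS deny-list and the assert_no_pii helper are hypothetical and not part of the original pipeline.

```python
from typing import Iterable

# Hypothetical deny-list; extend it with whatever columns count as PII in your sources.
PII_COLUMNS = frozenset({"NOM_PATIENT"})


def assert_no_pii(columns: Iterable[str], forbidden: frozenset = PII_COLUMNS) -> None:
    """Raise if any forbidden column name is about to be written."""
    leaked = sorted(set(columns) & forbidden)
    if leaked:
        raise ValueError(
            f"PII columns must be dropped before writing to Bronze: {leaked}"
        )
```

Calling assert_no_pii(df.columns) right before the Bronze write would turn a forgotten drop into a failed run instead of a silent leak.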
The Tradeoff
This simpler approach has one significant downside: we can’t recover patient names from Bronze data. If we ever needed to reverse the anonymization or map customer keys back to real names, we couldn’t.
For our use case, that’s acceptable. We’re doing analytics, not customer service. We need to track individual patients across time (hence the customer key), but we don’t need their actual names. The facilities have that data in their source systems if they ever need it.
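Tracking a patient across time works because the key is deterministic: the same ETB and NO_PATIENT values always produce the same key. The sketch below illustrates that property only. It swaps xxhash for hashlib.blake2b from the standard library so it runs without extra packages, and the customer_key function and sample values are hypothetical.

```python
import hashlib


def customer_key(etb: str, no_patient: str) -> str:
    """Deterministic key from the ETB and NO_PATIENT values (stand-in hash)."""
    raw = f"{etb}_{no_patient}".encode("utf-8")
    # digest_size=8 gives a 64-bit digest, rendered as 16 hex characters.
    return hashlib.blake2b(raw, digest_size=8).hexdigest()


# Same inputs on two different ingestion runs give the same key.
first_run = customer_key("A01", "12345")
second_run = customer_key("A01", "12345")

# The same patient number under a different ETB value is a different customer.
other_etb = customer_key("B02", "12345")
```

Digest length is a design choice: xxh32 yields 32 bits, and a longer digest lowers the chance of two patients colliding on the same key.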
If we were building a system where you might need to “de-anonymize” data later, encryption would be the right choice. Keep the encrypted PII in Bronze, decrypt when needed, maintain the reversibility. But that wasn’t our requirement.
When Encryption Actually Makes Sense
After going through this, I have clearer ideas about when to encrypt vs when to drop PII earlier.
Use encryption when you need to retain the PII for potential future use. Regulatory compliance might require you to be able to reconstruct historical data with real identifiers. Encryption lets you keep it safely until needed.
Use encryption when different downstream consumers need different access levels. Some users can decrypt and see real names, others only see anonymized data. The encryption layer provides access control.
Use encryption when you’re ingesting from multiple sources with different PII sensitivity levels. Encrypt everything at Bronze, selectively decrypt in Silver based on data classification rules.
Drop PII earlier when you genuinely don’t need it. If your entire pipeline is analytics and aggregation, why carry the PII through the Bronze layer at all? Create the anonymous identifiers as early as possible.
The Security Perspective
From a pure security standpoint, not storing PII beats encrypting it. You can’t decrypt what was never encrypted. You can’t leak what doesn’t exist. The attack surface is smaller.
Encryption protects data at rest, but you still have to manage keys, handle decrypt operations securely, and audit access. Every decrypt operation is a potential leak point. Every key storage mechanism is a potential vulnerability.
Defense in depth says you should encrypt anyway, just in case. But that assumes the cost of encryption is low. In our case, the complexity cost outweighed the security benefit because we didn’t actually need the data we were protecting.
Key Management Complexity
I built out the encryption system far enough to understand the operational burden. In development, environment variables work fine. You generate a Fernet key, export it, and you’re done.
In production, you need real key management. HashiCorp Vault adds another service to run and monitor. AWS KMS adds API dependencies and failure modes. Both add latency to decrypt operations.
Key rotation becomes a migration project. You can’t just change the key; you have to re-encrypt your entire Bronze layer. That means reading every row, decrypting with the old key, encrypting with the new key, and writing back. For large tables, that can mean hours of downtime.
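The cryptography library does offer some help here through MultiFernet, which can decrypt with several keys and re-encrypt under the newest one. The sketch below uses made-up sample data and freshly generated keys. It does not remove the rewrite, since every row still has to be re-encrypted, but old tokens stay readable while that happens.

```python
from cryptography.fernet import Fernet, MultiFernet

old_key = Fernet.generate_key()
new_key = Fernet.generate_key()

# A value written to Bronze under the old key (made-up sample data).
old_token = Fernet(old_key).encrypt(b"DUPONT Marie")

# The first key in the list encrypts; every listed key is tried for decryption.
rotator = MultiFernet([Fernet(new_key), Fernet(old_key)])

# Old tokens stay readable while the migration is in progress.
assert rotator.decrypt(old_token) == b"DUPONT Marie"

# rotate() re-encrypts a token under the first (new) key.
new_token = rotator.rotate(old_token)

# Once every row has been rewritten, the old key can be retired.
assert Fernet(new_key).decrypt(new_token) == b"DUPONT Marie"
```

How this fits a Delta table rewrite, and how long that rewrite takes, would still depend on table size and on how writes are batched.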
With the drop-PII-early approach, there’s no key management at all. Nothing to rotate, nothing to store securely, nothing to audit. Simple wins.
The Pattern I’d Recommend
Start by questioning whether you need the PII in your data lake at all. Can you create anonymous identifiers at the ingestion boundary? If yes, do that. It’s simpler and more secure than any encryption scheme.
If you genuinely need reversible anonymization, use column-level encryption with Fernet in Python or PME if you’re on PySpark. Keep the encrypted data in Bronze, decrypt briefly in Silver, drop after anonymization.
If you’re on delta-rs and need encryption, stick with application-level encryption using the cryptography library. PME support isn’t mature enough yet. Plan for the performance overhead and key management complexity.
If you’re handling seriously sensitive data where breach would be catastrophic, consider not using a data lake at all. Sometimes the right answer is keeping PII in the source system and only exporting aggregates or anonymized views.
What I Built vs What I Shipped
I built a full encryption system with key management and decrypt-transform patterns. It worked, it was tested, and the code was solid.
I shipped the drop-PII-early approach because it was simpler and accomplished the same security goal. The encryption code lives in a branch somewhere as a reference implementation if we ever need it.
That’s the lesson: build enough to understand the problem space, but don’t ship complexity you don’t need. Encryption is powerful, but it’s also overhead. Make sure the benefit justifies the cost.