De-Identification and Anonymization

HICP Practice 7.3 requires you to implement repeatable de-identification and anonymization processes whenever PHI is used for research, analytics, quality improvement, or other non-clinical secondary purposes. Operationally, this means you must route those datasets through an approved HIPAA de-identification method (Safe Harbor or Expert Determination), control any re-identification risk, and retain evidence that the method was applied correctly [1].

Key takeaways:

  • Build a governed intake-to-release workflow for secondary-use datasets, not a one-off scrub step.
  • Standardize on HIPAA Safe Harbor and/or Expert Determination, and document which is allowed for which use case.
  • Treat third parties and internal analytics teams the same: contract, technical controls, and audit evidence must align.

De-identification and anonymization failures usually happen the same way: a well-meaning analytics, research, or product team exports “de-identified” PHI from an operational system, then shares it internally or with a third party without a consistent method, approvals, or proof. HICP Practice 7.3 is trying to prevent that pattern by forcing a controlled, documented process for secondary use of PHI [1].

For a Compliance Officer, CCO, or GRC lead, the fastest path to compliance is to treat de-identification as a production process with entry criteria, method selection, tooling, quality checks, release gates, and retention of artifacts. You are not being asked to invent a new privacy theory. You are being asked to make sure every secondary-use dataset has (1) a defined purpose, (2) an approved de-identification method (HIPAA Safe Harbor or Expert Determination), (3) controls that prevent backsliding into identifiable data, and (4) evidence you can show to auditors and partners [1].

Regulatory text

Requirement (HICP Practice 7.3): “Implement de-identification and anonymization processes for PHI used in research, analytics, and non-clinical purposes.” [1]

Operator interpretation (what you must do):

  • Identify where PHI is used beyond direct care operations (research, analytics, quality improvement, product development, reporting).
  • Ensure those datasets are de-identified through an accepted approach (commonly framed as HIPAA Safe Harbor or Expert Determination) before they are used for those secondary purposes.
  • Make the process consistent, repeatable, and provable with documented procedures and retained artifacts [1].

Plain-English requirement interpretation

You need a controlled pipeline that takes a request for “PHI for analytics/research,” applies an approved de-identification method, validates the result, and only then releases the dataset to the requestor (internal team or third party). If you cannot demonstrate which method you used and how you confirmed it worked, you should assume you will fail an audit inquiry on this control.

De-identification here is not a promise that re-identification is impossible. It is a governance and process requirement: define the method, apply it consistently, minimize residual risk, and document what you did [1].

Who it applies to

Entity types

  • Healthcare organizations handling PHI (providers, payers, integrated delivery networks).
  • Health IT vendors that process PHI and create secondary-use datasets (analytics platforms, population health tools, clinical data warehouses, digital health apps) [1].

Operational contexts where this shows up

  • Internal analytics and reporting (dashboards, KPI reporting, outcomes analysis).
  • Research and quality improvement initiatives.
  • Data science/ML model development and testing.
  • Sharing datasets with third parties (consultants, academic partners, cloud analytics firms) for non-clinical purposes.
  • Building synthetic test datasets derived from production data (still needs governance, because the source is PHI).

What you actually need to do (step-by-step)

1) Inventory and classify secondary-use data flows

Create a simple register of:

  • Source systems (EHR, claims, CRM, patient portal, call center).
  • Destination environments (data warehouse, BI tool, research enclave, third-party SFTP site, cloud bucket).
  • Purpose (analytics, research, quality, product development).
  • Data elements involved (direct identifiers, quasi-identifiers, free text).
  • Owners (data steward, system owner, requestor).

Practical tip: Start with where exports happen (SQL queries, ETL jobs, data extracts, API pulls). Exports are where “de-identified” shortcuts appear.

2) Define your approved de-identification methods and when each is allowed

Write a short standard that answers:

  • Which method(s) are permitted for secondary uses: HIPAA Safe Harbor and/or Expert Determination [1].
  • Which method is required for which risk level (example decision rule):
    • Safe Harbor for broad internal analytics and routine reporting where you can remove standard identifiers reliably.
    • Expert Determination when you need to retain more analytical utility (dates, geography granularity, rare conditions) and will manage risk through expert review and documented determination.

Avoid vague language like “remove identifiers where possible.” Auditors look for a declared method and proof of application.
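The declared method can also be encoded as a standard transformation path. The sketch below is illustrative only: field names are hypothetical, and it covers just a few of Safe Harbor's 18 identifier categories, not a complete implementation.

```python
from datetime import date

# Hypothetical direct-identifier fields; adapt to your own schema.
DIRECT_IDENTIFIERS = {"name", "email", "phone", "ssn", "mrn", "address"}

def safe_harbor_transform(record: dict) -> dict:
    """Apply a few Safe Harbor-style transformations to one record."""
    out = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIERS:
            continue  # drop direct identifiers entirely
        if field == "zip":
            # Safe Harbor permits only the first 3 ZIP digits (and requires
            # zeroing even those for sparsely populated ZIP3 areas, which a
            # real pipeline must check against census data -- not shown here).
            out["zip3"] = str(value)[:3]
        elif field == "birth_date":
            # Dates of birth reduce to year; ages over 89 must additionally
            # be aggregated into a single "90+" category (not shown here).
            out["birth_year"] = value.year
        else:
            out[field] = value
    return out
```

Running every export through one version-controlled function like this, instead of per-team scripts, is what makes the method provable later.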

3) Build a request-and-approval workflow (intake → review → release)

Minimum workflow gates:

  • Request intake: requestor states purpose, dataset description, recipients, retention period, and whether any third party receives the data.
  • Privacy/security review: confirm secondary use, verify minimum necessary, select method (Safe Harbor vs Expert Determination), confirm allowable environment.
  • Execution: data engineering applies the transformation through approved tooling/scripts.
  • Validation: a second person or automated tests confirm required fields removed/modified; confirm no free-text PHI leakage if notes are included.
  • Release approval: data steward signs off; dataset is published to the approved location with access controls.
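In code, these gates reduce to a release check that refuses to publish until every upstream approval is recorded. A minimal sketch, with illustrative field names:

```python
from dataclasses import dataclass

@dataclass
class DatasetRequest:
    """One secondary-use dataset request moving through the release gates."""
    purpose: str
    method: str                      # "safe_harbor" or "expert_determination"
    privacy_review_ok: bool = False
    validation_ok: bool = False
    steward_approved: bool = False

    def can_release(self) -> bool:
        # Publish only when every upstream gate has passed; a missing gate
        # blocks release rather than defaulting open.
        return self.privacy_review_ok and self.validation_ok and self.steward_approved
```

The design point is fail-closed defaults: a request created without explicit approvals cannot be released by accident.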

If you need a system to run this consistently, this is a common place where Daydream can help by tracking intake, approvals, evidence, and third-party handoffs in one workflow, so your de-identification program is auditable without chasing screenshots across teams.

4) Implement technical controls that prevent “PHI drift” back into the dataset

Controls should align to the way data actually moves:

  • Approved de-identification code paths: centrally managed transformation jobs; version-controlled scripts; restricted ability to run ad hoc extracts from production.
  • Access controls: role-based access to de-identified zones; separation between PHI raw zone and de-identified analytics zone.
  • Egress controls: restrict outbound sharing locations; require approvals for external transfers; log downloads and shares.
  • Data loss checks: pattern scans for emails, SSNs, MRNs, phone numbers; alerting on detections.
  • Environment controls for research: segmentation, no internet egress if feasible, monitored endpoints, and controlled export process.
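The data loss checks above are typically regex-based pattern scans. A minimal sketch (the patterns are illustrative and will need tuning; MRN formats vary by organization, so no MRN pattern is included):

```python
import re

# Illustrative patterns for common identifier formats.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "us_phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def scan_for_identifiers(text: str) -> dict:
    """Return identifier-like strings found in text, keyed by pattern name."""
    hits = {}
    for name, pattern in PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[name] = found
    return hits
```

A scan like this belongs in the validation gate, wired to alerting: any hit on a supposedly de-identified dataset should block release until reviewed.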

5) Address third-party sharing explicitly

If a third party receives de-identified data:

  • Document what you are sending, the de-identification method applied, and the permitted purpose.
  • Ensure the contract language matches the operational reality (allowed uses, restrictions on re-identification, onward disclosure expectations, incident notification expectations).
  • Verify the transfer method and storage location match your security expectations (encrypted transfer, restricted access, logging).

Even if data is “de-identified,” uncontrolled sharing creates reputational and operational risk. Treat this as third-party data governance, not only a privacy checkbox.

6) Train the teams that trigger the risk

Target training to:

  • Analytics/BI teams
  • Data engineering
  • Research coordinators
  • Product and data science
  • Vendor management / procurement (for data sharing)

Training should be job-specific: “Here is how you request a dataset, what you cannot do, and what evidence will be retained.”

Required evidence and artifacts to retain

Keep artifacts that prove method selection, execution, and validation:

Governance and standards

  • De-identification standard (methods allowed, decision criteria).
  • Data sharing standard for secondary uses (internal + third party).
  • RACI for approvals (privacy, security, data steward, requestor).

Operational records

  • Dataset requests and approvals (ticketing/workflow record).
  • Method selection justification (Safe Harbor vs Expert Determination) tied to the request.
  • Transformation job logs and version references (commit hash, job run ID).
  • Validation results (automated test output; peer review sign-off).
  • Data flow map showing where de-identified datasets are stored and who can access them.
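One lightweight way to keep these operational records reproducible is to emit a small evidence manifest with every release. The keys below mirror the artifacts listed above; the function and field names are illustrative:

```python
import hashlib
import json

def build_release_manifest(request_id: str, method: str, commit_hash: str,
                           job_run_id: str, validation_passed: bool,
                           dataset_bytes: bytes) -> str:
    """Serialize a per-release evidence packet to retain with the workflow record."""
    manifest = {
        "request_id": request_id,
        "method": method,                 # safe_harbor | expert_determination
        "transform_commit": commit_hash,  # version of the de-identification code
        "job_run_id": job_run_id,
        "validation_passed": validation_passed,
        # Hash of the released file so the exact delivery can be verified later.
        "dataset_sha256": hashlib.sha256(dataset_bytes).hexdigest(),
    }
    return json.dumps(manifest, indent=2)
```

Retaining the dataset hash is what lets you answer “prove this is the file we sent” months after the release.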

Third-party artifacts (if applicable)

  • Third-party inventory entries indicating de-identified data sharing.
  • Contract addenda or data-sharing terms that reflect restrictions and responsibilities.
  • Evidence of transfer controls (secure transfer configuration, access logs).

Common exam/audit questions and hangups

Auditors and assessors tend to focus on repeatability and proof:

  • “Show me your de-identification standard and approved methods.” [1]
  • “How do you ensure all analytics datasets follow the method, not just some?”
  • “Who approves secondary-use dataset releases?”
  • “How do you detect PHI in free-text fields and unstructured exports?”
  • “Prove that the dataset provided to this third party was de-identified and validated.”
  • “What prevents an analyst from exporting PHI directly from the EHR/reporting database?”

Hangups you should anticipate:

  • Multiple de-identification scripts across teams with no ownership.
  • “De-identified” datasets that still contain quasi-identifiers at enough granularity to create re-identification risk, with no documented Expert Determination.
  • Poor traceability between request, dataset version, and delivery.

Frequent implementation mistakes and how to avoid them

  1. Treating de-identification as a manual spreadsheet step
    Fix: Require execution through an approved, logged pipeline with peer review or automated checks.

  2. No decision rule for Safe Harbor vs Expert Determination
    Fix: Add a simple decision matrix and force method selection at intake [1].

  3. Ignoring unstructured data (notes, attachments, chat transcripts)
    Fix: Either exclude unstructured fields by default or implement scanning/redaction and validate outputs.

  4. Letting third parties “de-identify on their side”
    Fix: Do de-identification before transfer, unless you have a tightly controlled arrangement and documented method.

  5. No evidence retention
    Fix: Make artifact retention part of the workflow definition. If you cannot reproduce what was sent, you cannot defend it.

Enforcement context and risk implications

No public enforcement cases were provided in the source catalog for this requirement, so treat this section as risk context, not enforcement prediction.

Operational risk is straightforward: a secondary-use dataset that is incorrectly de-identified can expose patient privacy, trigger breach response obligations depending on facts, and create contract and reputational fallout. Compliance risk shows up during audits when you cannot prove a consistent method or demonstrate control over data exports [1].

Practical execution plan (30/60/90)

Treat the 30/60/90 framing as sequencing rather than calendar promises. The goal is to reach controlled, repeatable operation quickly, then harden.

First 30 days (Immediate triage and control points)

  • Identify top secondary-use pathways (largest exports, most recipients, most frequent queries).
  • Freeze ad hoc external sharing of “de-identified” datasets unless they go through review.
  • Publish a one-page interim standard: approved methods (Safe Harbor/Expert Determination) and required approval roles [1].
  • Stand up a request intake mechanism (ticket queue) and require it for new requests.

By 60 days (Operationalize and prove it works)

  • Build the workflow gates (intake, review, execute, validate, release) and assign owners.
  • Standardize scripts/tools and put them under change control.
  • Create validation checks for identifiers and common leakage points (including free text exclusions or scans).
  • Start evidence retention: sample packets for each dataset release (request, method, validation, approval).

By 90 days (Scale and reduce “PHI drift”)

  • Expand coverage to all secondary-use pipelines and all business units.
  • Add monitoring: alerting for unauthorized exports and unexpected egress paths.
  • Integrate third-party governance: update intake questions to flag third-party recipients and attach contract requirements.
  • Run an internal audit-style walkthrough: pick a released dataset and reconstruct end-to-end evidence within the workflow system.

Frequently Asked Questions

What’s the difference between de-identification and anonymization for this requirement?

HICP Practice 7.3 focuses on implementing processes that remove identifiability for secondary uses of PHI. In practice, your program should define the approved method (commonly HIPAA Safe Harbor or Expert Determination) and treat “anonymization” as an outcome you must substantiate with evidence [1].

Can our analytics team access raw PHI if the outputs are aggregated?

That creates risk because the processing still touches PHI. Route analytics through controlled environments and restrict raw PHI access to approved roles, then de-identify datasets before broader analytics use [1].

Do we need Expert Determination for every dataset?

No. Use a decision rule: Safe Harbor for standard reporting where you can remove identifiers reliably, and Expert Determination for higher-utility datasets where Safe Harbor would break the analysis [1].

How do we handle free-text clinical notes in research or analytics datasets?

Default to excluding free text unless you have a validated redaction/scanning approach and a review step that checks for PHI leakage. Treat notes as high-risk because identifiers can appear anywhere.

What evidence should we provide to an auditor for one specific dataset release?

Provide the request record, the approved method (Safe Harbor or Expert Determination), the transformation run details, validation results, and the release approval record. If a third party received it, include proof of controlled transfer and the applicable contract terms.

A third party says they only need “de-identified data,” so do we still need strong contracts and oversight?

Yes. You still need governance over what was shared, for what purpose, and what controls prevent misuse or onward sharing. Your process should treat third-party recipients as part of the same auditable release workflow.

Footnotes

  [1] HICP 2023 - 405(d) Health Industry Cybersecurity Practices
