SI-19: De-identification
To meet the SI-19 de-identification requirement, you must remove the specified elements of personally identifiable information (PII) from datasets before sharing, using, or analyzing the data outside the approved context. Operationalize SI-19 by defining the PII elements to remove, implementing repeatable de-identification procedures in your data pipelines, and retaining testable evidence that de-identification is consistently performed. 1
Key takeaways:
- SI-19 requires PII element removal from datasets based on an organization-defined list of identifiers. 2
- “De-identified” must be provable in practice, with pipeline controls, verification checks, and audit-ready artifacts.
- The most common gap is missing evidence (no logged runs, no data samples, no definition of what gets removed), not missing intent. 2
SI-19 sits in NIST SP 800-53’s System and Information Integrity family and is straightforward on paper: remove defined PII elements from datasets. In real programs, it becomes a coordination problem across privacy, security engineering, data engineering, analytics teams, and third parties that receive data extracts. The control’s practical purpose is to prevent “secondary use” datasets (analytics, testing, research, sharing, training, troubleshooting) from silently carrying direct identifiers or quasi-identifiers that re-link records to people.
The control text is parameterized, which matters operationally: you are expected to decide and document which PII elements must be removed (for your mission, data types, and threat model), then implement a consistent method to remove them. SI-19 is assessed on whether your requirement is clear, implemented in the places data actually moves, and supported by evidence an auditor can test.
This page gives requirement-level guidance you can implement quickly: scope, owners, step-by-step execution, evidence to retain, common audit questions, and a practical phased plan. 1
Regulatory text
Control requirement (excerpt): “Remove the following elements of personally identifiable information from datasets: {{ insert: param, si-19_odp.01 }} ; and” 2
Operator interpretation of the text:
- You must maintain an explicit list of PII elements that will be removed for de-identification (the parameter). 2
- You must implement a repeatable mechanism to remove those elements from datasets that are designated to be de-identified (exports, downstream analytics, sandboxes, test data, research datasets, shared files). 1
- You must be able to demonstrate removal through artifacts: procedures, technical configurations, logs, and validation results that show the de-identification happened and is monitored. 2
Plain-English requirement
Define what “PII elements” means for your environment, then remove those elements from datasets whenever the dataset is classified/marked for de-identification. Make it hard to bypass. Prove it with evidence.
Who it applies to (entity and operational context)
SI-19 commonly applies to:
- Federal information systems and programs assessed against NIST SP 800-53. 1
- Contractor systems handling federal data, including service providers and third parties that process or store government-related datasets. 2
Operational contexts where SI-19 shows up in audits:
- Data extracts provided to third parties (analytics firms, support providers, testing vendors, research partners).
- Internal data use outside primary processing: BI/analytics, data science, ML model training, dev/test, customer support troubleshooting.
- Replicated environments: data warehouses, lakes, log aggregation, SIEM/SOAR, APM tools, ticketing attachments.
What you actually need to do (step-by-step)
Step 1: Assign ownership and define control boundaries
- Name a control owner (often Privacy, Security, or Data Governance) and an implementation owner (Data Engineering or Platform).
- Define where SI-19 applies: which systems, pipelines, and distribution channels produce “de-identified datasets.”
- Establish an intake rule: no dataset gets labeled “de-identified” unless it goes through the approved process.
Practical tip: If ownership is unclear, SI-19 fails as an evidence problem, even if teams “usually remove names.” 2
Step 2: Define “elements of PII to remove” (the parameter)
Create your SI-19 de-identification specification that includes:
- Direct identifiers (examples: full name, SSN, driver’s license number, passport number, personal email, phone number).
- Persistent identifiers (examples: internal customer IDs, device identifiers) when they can be used to re-link a person.
- Free-text fields that can contain PII (notes, chat transcripts, tickets).
- Location and time granularity rules (for example, reduce precision if exact values can identify individuals in your context).
Keep it implementable: specify fields, patterns, and sources (tables/columns, log fields, JSON keys). SI-19 is hard to execute if the “PII elements” list reads like a policy statement instead of an engineering spec. 2
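One way to keep the list implementable is to store it as data that pipelines can read. The sketch below is illustrative only: the `PiiElement` structure, the category and action names, the spec version string, and every table or column name are assumptions made for this example, not values defined by SI-19.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PiiElement:
    name: str         # human-readable element name
    category: str     # "direct", "persistent", "free_text", or "granularity"
    locations: tuple  # concrete fields, e.g. "table.column" or a JSON key path
    action: str       # "drop", "redact", or "generalize"


# Hypothetical entries; replace with your own field mappings.
SPEC_VERSION = "example-1"
SI19_SPEC = (
    PiiElement("Social Security number", "direct", ("customers.ssn",), "drop"),
    PiiElement("Personal email", "direct",
               ("customers.email", "events.payload.email"), "drop"),
    PiiElement("Support notes", "free_text", ("tickets.notes",), "redact"),
    PiiElement("Date of birth", "granularity", ("customers.dob",), "generalize"),
)


def unmapped_elements(spec):
    """Return names of elements that have no concrete field mapping."""
    return [e.name for e in spec if not e.locations]


def fields_for_action(spec, action):
    """Return the sorted field locations that require the given action."""
    return sorted(loc for e in spec if e.action == action for loc in e.locations)
```

A spec entry with an empty `locations` tuple is the "policy statement" case: `unmapped_elements` surfaces it so it can be mapped to real fields before anyone relies on it.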
Step 3: Choose the de-identification method per dataset type
Build a decision matrix and record it:
| Dataset type | Recommended approach | Evidence you should expect |
|---|---|---|
| Structured tables | Drop columns; tokenize or hash where needed; generalize dates/locations | Pipeline code/config, before/after schema, job logs |
| Semi-structured (JSON) | Key-based redaction and transformation; schema enforcement | Transformation rules, sample output, test results |
| Unstructured text | Redaction rules + human review for high-risk exports | Redaction logs, sampling QA record |
Focus on “remove” per SI-19’s wording, then add transformations (tokenization/generalization) only where business needs require linkage. 2
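For the structured-table row of the matrix, a removal-first transformation might look like the following sketch. The field names in `DROP_FIELDS` and `GENERALIZE_DATE_FIELDS` are hypothetical, and reducing a date to its year is just one possible generalization rule; your specification decides what is actually appropriate.

```python
from datetime import date

# Hypothetical field names; take the real ones from your SI-19 specification.
DROP_FIELDS = {"full_name", "ssn", "email", "phone"}
GENERALIZE_DATE_FIELDS = {"date_of_birth"}


def deidentify_record(record):
    """Return a copy of the record with listed fields dropped or generalized."""
    out = {}
    for key, value in record.items():
        if key in DROP_FIELDS:
            continue  # removal is the default action
        if key in GENERALIZE_DATE_FIELDS and isinstance(value, date):
            out[key] = value.year  # keep the year only
        else:
            out[key] = value
    return out


raw = {
    "full_name": "Jane Example",
    "ssn": "000-00-0000",
    "date_of_birth": date(1990, 5, 17),
    "plan": "basic",
}
clean = deidentify_record(raw)
```

The function returns a new dictionary instead of mutating its input, so the raw record and the de-identified output stay clearly separate.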
Step 4: Implement technical controls in the pipeline (not in spreadsheets)
Minimum viable implementation patterns auditors can test:
- Central transformation job (ETL/ELT step) that produces a de-identified output dataset.
- Policy-as-code checks that block publishing if prohibited fields exist.
- Automated scanning for PII in output samples, including free text where feasible.
- Access controls to prevent raw datasets being exported through the “de-identified” channel.
If your process relies on analyst discipline (“remember to delete columns”), treat that as a control failure waiting to happen.
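A policy-as-code check of the kind listed above can be very small. This is a minimal sketch, assuming the publishing step can supply the output's column names; `PROHIBITED_FIELDS`, `PublishBlocked`, and the function names are hypothetical and not part of any specific tool.

```python
# Hypothetical prohibited field names, derived from the SI-19 specification.
PROHIBITED_FIELDS = {"ssn", "email", "full_name"}


class PublishBlocked(Exception):
    """Raised when an output dataset still contains prohibited fields."""


def find_prohibited(columns, prohibited=PROHIBITED_FIELDS):
    """Return prohibited column names present, ignoring case and padding."""
    normalized = {c.strip().lower() for c in columns}
    return sorted(normalized & prohibited)


def assert_publishable(columns):
    """Raise PublishBlocked if any prohibited field is present."""
    hits = find_prohibited(columns)
    if hits:
        raise PublishBlocked(f"prohibited fields present: {hits}")
    return True
```

Because the check raises instead of logging a warning, a pipeline that calls it before writing to the de-identified destination fails closed. A name-based check does not catch renamed columns or PII inside free text, so it complements content scanning and does not replace it.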
Step 5: Verification and ongoing monitoring
- Create a validation checklist: schema checks, spot-check sampling, and exception handling.
- Run validation on each dataset release or on a defined recurring trigger (for example, each pipeline run or each export request; choose what matches your operating model).
- Track exceptions: if a field can’t be removed due to mission need, require documented approval and compensating controls.
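The validation and exception steps above can be combined into one per-release check that yields a record you can retain. The sketch below is an assumption-laden illustration: the `ApprovedException` fields, the result keys, and the dataset and field names in the usage are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApprovedException:
    dataset: str
    field: str
    approver: str
    compensating_control: str


def validate_release(dataset, columns, prohibited, exceptions):
    """Check a release against the prohibited list, honoring approved exceptions."""
    allowed = {e.field for e in exceptions if e.dataset == dataset}
    flagged = [c for c in columns if c in prohibited]
    violations = sorted(c for c in flagged if c not in allowed)
    excepted = sorted(c for c in flagged if c in allowed)
    return {
        "dataset": dataset,
        "passed": not violations,
        "violations": violations,
        "excepted": excepted,
    }
```

For example, with a hypothetical register entry approving `customer_id` for a `claims_extract` dataset, a release containing `customer_id` and `ssn` fails on `ssn` only, and the same release under a different dataset name fails on both, because exceptions are tied to a dataset.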
Step 6: Third-party data sharing alignment
If de-identified datasets go to a third party:
- Contractually define “de-identified” as your SI-19 spec, not a vague term.
- Require the third party to avoid re-identification and to apply equivalent controls to downstream sharing.
- Confirm transfer mechanisms and data handling align with your dataset classification.
Step 7: Map SI-19 to assessable artifacts (make the audit easy)
NIST assessments tend to reward clarity and testability. Build a simple control record:
- Control statement (what you do)
- Systems in scope
- Owners
- How often it runs (trigger-based is fine)
- Evidence list and where it lives
Daydream can help by mapping SI-19 to a named owner, a documented procedure, and a recurring evidence set so you can answer assessors quickly and consistently. 2
Required evidence and artifacts to retain
Retain artifacts that prove definition, implementation, and operation:
- SI-19 de-identification specification (the PII elements list; version-controlled).
- Data flow diagrams showing where raw vs de-identified datasets are created and stored.
- Pipeline implementation evidence: code repositories, configuration screenshots/exports, transformation rules.
- Execution logs for de-identification jobs (job IDs, timestamps, success/failure).
- Validation results: automated scan reports, schema check outputs, sampling records.
- Exception register: approved deviations, risk acceptance, compensating controls.
- Third-party sharing records: contract or DPA addendum language, data sharing approvals, transfer tickets.
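Execution logs are easier to retain and test when each run emits one structured record. The following sketch shows one possible shape; the field names, the job ID format, and the idea of hashing the output column list are assumptions for illustration, not a required format.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_run_record(job_id, spec_version, output_columns, succeeded,
                     finished_at=None):
    """Build a JSON-serializable evidence record for one de-identification run."""
    finished_at = finished_at or datetime.now(timezone.utc)
    columns = sorted(output_columns)
    # The digest lets a reviewer confirm the output schema matches the record.
    schema_digest = hashlib.sha256("\n".join(columns).encode("utf-8")).hexdigest()
    return {
        "job_id": job_id,
        "spec_version": spec_version,
        "finished_at": finished_at.isoformat(),
        "status": "success" if succeeded else "failure",
        "output_columns": columns,
        "output_schema_sha256": schema_digest,
    }


# Hypothetical run; the timestamp is fixed so the example is reproducible.
record = build_run_record(
    "job-0001", "example-1", ["plan", "birth_year"], True,
    finished_at=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
line = json.dumps(record, sort_keys=True)  # one line per run, ready to append
```

Recording the specification version alongside the job ID lets an assessor tie a given output back to the exact PII elements list that was in force when the job ran.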
Common exam/audit questions and hangups
Assessors commonly test these areas for SI-19: 1
- “Show me the exact PII elements you remove. Where is the list maintained?”
- “Which datasets are labeled de-identified, and where are they stored?”
- “Demonstrate the technical mechanism that removes the fields.”
- “How do you know an analyst didn’t export raw data to the de-identified location?”
- “Show evidence from a recent run: logs + sample output.”
- “How are exceptions handled and approved?”
Hangups that slow audits:
- The PII list exists, but it is not actionable (no field mappings).
- The process exists, but there is no operational evidence (no logs, no retained scans). 2
Frequent implementation mistakes and how to avoid them
- Treating de-identification as a one-time cleanup. Fix: enforce it as a pipeline step with repeated runs and retained logs.
- Only removing obvious identifiers (name/email) and ignoring IDs or free text. Fix: include persistent IDs and unstructured fields in scope decisions; document what you do for each.
- No boundary definition for “dataset.” Fix: define which outputs count (exports, marts, sandboxes, shared buckets) and label them.
- No exception process. Fix: keep an exception register with approvals and compensating controls tied to datasets.
- Evidence scattered across teams. Fix: keep an SI-19 evidence index (what, where, owner). Daydream can standardize this evidence map across control owners.
Enforcement context and risk implications
No public enforcement cases were provided in the source catalog for this requirement, so treat SI-19 primarily as an assessment and breach-impact control rather than a case-law-driven one. 2
Risk implications you should plan for:
- If “de-identified” datasets still contain PII, you expand breach scope, incident response complexity, and third-party exposure.
- If you cannot prove de-identification, audits often rate the control ineffective due to missing evidence, even if teams believe they follow the practice. 2
Practical execution plan (phased, without fixed dates)
Use these phases to move from “policy intent” to “auditable operation” without guessing calendar duration.
Immediate (stabilize scope and decisions)
- Assign SI-19 owner and implementation owner.
- Inventory where de-identified datasets exist or are claimed to exist.
- Draft the PII elements removal list (the SI-19 parameter) and get stakeholder sign-off.
- Freeze new “de-identified” dataset creation unless it goes through the defined process.
Near-term (build and enforce the mechanism)
- Implement the de-identification step in the pipeline for the highest-risk datasets first (those shared externally or widely internally).
- Add automated checks to block prohibited fields from entering de-identified destinations.
- Stand up a lightweight validation process and an exception workflow.
- Create the evidence index and begin retaining run logs and scan outputs.
Ongoing (operate, test, and improve)
- Expand coverage to remaining datasets and unstructured sources.
- Review and update the PII elements list as new identifiers and data sources appear.
- Sample outputs periodically and document results.
- Include SI-19 checks in change management for schema changes and new data integrations.
Frequently Asked Questions
What does SI-19 require me to remove if the control text is parameterized?
You must define the specific PII elements to remove in your environment and document that list as the control parameter. Then you must show that your pipelines remove those elements from datasets designated as de-identified. 2
Does tokenization or hashing satisfy SI-19?
SI-19’s excerpt is framed as “remove” PII elements, so start with field removal for de-identified outputs. If you tokenize to preserve linkage, document the rationale, where tokens are stored, and how you prevent re-identification through access and separation controls. 2
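If you do tokenize to preserve linkage, a keyed hash is one common pattern. The sketch below uses Python's standard `hmac` module; the key value shown is a placeholder assumption, and in practice the key would be held in a secrets store, separate from the de-identified data. A plain unkeyed hash of a low-entropy identifier such as an SSN can be reversed by enumerating candidate values, which is why a secret key is used here.

```python
import hashlib
import hmac


def tokenize(value, key):
    """Return a keyed, deterministic token (HMAC-SHA256 hex digest) for a value."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Placeholder key for illustration only; never hard-code a real key.
EXAMPLE_KEY = b"replace-with-a-secret-from-your-key-store"

token_a = tokenize("customer-12345", EXAMPLE_KEY)
token_b = tokenize("customer-12345", EXAMPLE_KEY)
```

The same input and key always yield the same token, so records can still be joined, while anyone without the key cannot recompute tokens from guessed identifiers. Anyone holding the key can re-link records, so access to it belongs in your documented separation controls.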
How do I prove de-identification to an assessor?
Provide the PII elements list, the implementation procedure, pipeline configurations or code, and run evidence (logs plus validation results). Auditors typically want to trace one dataset end-to-end from source to de-identified output. 1
What systems should be in scope first?
Start where de-identified datasets leave controlled environments: third-party transfers, shared storage buckets, analytics warehouses, and dev/test data refresh processes. Prioritize datasets with wide access and frequent distribution.
How should we handle free-text fields that may contain PII?
Treat free text as high-risk by default. Apply redaction or exclude the field from de-identified outputs, then add sampling-based QA for exports where automated detection is imperfect.
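Pattern-based redaction for free text can be sketched as follows. The two regular expressions (email addresses and US SSN-formatted numbers) are simplified assumptions: they will miss names, addresses, and unusual formats, which is why sampling-based QA remains necessary.

```python
import re

# Simplified, hypothetical patterns; extend them from your SI-19 specification.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text):
    """Replace pattern matches with labels; return the new text and match counts."""
    counts = {}
    for label, pattern in PATTERNS.items():
        text, n = pattern.subn(f"[{label}]", text)
        counts[label] = n
    return text, counts


redacted, counts = redact("Contact jane@example.com, SSN 000-00-0000.")
```

The per-pattern counts can be written to the redaction log, giving reviewers a quick signal of which records to sample.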
What evidence is most often missing for SI-19?
Teams often have a policy statement but no repeatable evidence trail showing execution (job logs, scan results, and a clear mapping from “PII elements to remove” to actual fields). 2
Footnotes
1. NIST SP 800-53 Rev. 5
2. NIST SP 800-53 Rev. 5 OSCAL JSON