SI-19(4): Removal, Masking, Encryption, Hashing, or Replacement of Direct Identifiers

To meet SI-19(4) (Removal, Masking, Encryption, Hashing, or Replacement of Direct Identifiers), you must ensure datasets used for analytics, testing, sharing, or secondary purposes do not contain direct identifiers in clear form. Operationally, that means identifying direct identifier fields, selecting an approved transformation method per use case, implementing it in repeatable pipelines, and retaining evidence that the transformations are applied and monitored.

Key takeaways:

  • Treat “direct identifiers” as specific fields that identify a person on their own (not “all sensitive data”).
  • Choose the transformation method (removal, masking, encryption, hashing, replacement) based on the dataset’s use and re-identification risk.
  • Auditors will focus on repeatability and evidence: mapping, procedures, and recurring proof that pipelines enforce the rule.

SI-19(4) is a dataset hygiene requirement that becomes urgent the moment data leaves its original operational boundary: analytics sandboxes, test environments, machine learning training sets, third-party transfers, and cross-team sharing are where direct identifiers tend to spread. The control does not require one single technique. It gives you options, but it also creates a decision you must govern: when do you remove identifiers entirely, when do you mask them, when do you encrypt them, when do you hash them, and when do you replace them with tokens or surrogates?

Compliance officers and GRC leads typically struggle with two things here: (1) scoping, because teams disagree on what counts as a “direct identifier,” and (2) auditability, because transformations happen in scripts, ETL jobs, notebooks, and vendor tools that are not documented in a control-friendly way. This page translates SI-19(4) into an implementation pattern you can assign, test, and continuously evidence across data pipelines and third parties, aligned to NIST SP 800-53 Rev. 5. 1

Regulatory text

Requirement (excerpt): “Remove, mask, encrypt, hash, or replace direct identifiers in a dataset.” 2

Operator interpretation: For any dataset in scope, you must ensure direct identifiers are not exposed in their original, readable form unless you have a defined operational need and compensating controls. You can satisfy the requirement by:

  • Removing the identifier field entirely,
  • Masking (redaction, truncation, partial reveal),
  • Encrypting the identifier so it is unreadable without keys,
  • Hashing the identifier (typically for matching/dedup use cases),
  • Replacing the identifier with a surrogate value (tokenization, synthetic IDs).
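The five options above can be sketched in a few lines of Python. This is a minimal illustration, not a production library: the pepper value and in-memory token vault are hypothetical stand-ins, and encryption is deliberately omitted because it should come from a vetted library (such as a Fernet-style API) rather than hand-rolled code.

```python
import hashlib
import hmac
import uuid

PEPPER = b"example-secret-pepper"  # hypothetical; keep the real value in a secrets manager

def remove(record, field):
    """Removal: drop the identifier field entirely."""
    return {k: v for k, v in record.items() if k != field}

def mask_email(email):
    """Masking: partial reveal, e.g. j***@example.com."""
    local, _, domain = email.partition("@")
    return f"{local[0]}***@{domain}"

def hash_identifier(value):
    """Keyed hashing (HMAC) so low-entropy IDs cannot be guessed offline."""
    return hmac.new(PEPPER, value.encode(), hashlib.sha256).hexdigest()

TOKEN_VAULT = {}  # hypothetical stand-in for a real, access-controlled token vault

def tokenize(value):
    """Replacement: a surrogate token; the vault maps tokens back under access control."""
    return TOKEN_VAULT.setdefault(value, f"tok_{uuid.uuid4().hex[:12]}")

# Encryption is intentionally not shown: use a vetted library, never a homegrown cipher.
record = {"email": "jane@example.com", "ssn": "123-45-6789", "score": 42}
print(remove(record, "ssn"))
print(mask_email(record["email"]))  # j***@example.com
print(tokenize(record["email"]))
```

Note that `tokenize` is stable: the same input always maps to the same surrogate, which preserves joins without exposing the identifier.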

What examiners expect: you can point to (1) where identifiers exist, (2) the approved transformation approach per dataset/use case, and (3) evidence the approach is consistently executed.

Plain-English requirement: what “direct identifiers” means in practice

A direct identifier is a data element that uniquely identifies a person without needing extra information. Your organization should define this list explicitly, but common examples include:

  • Full name (many organizations treat it as direct on its own; at minimum it is direct in combination with other fields)
  • Email address
  • Phone number
  • Government-issued ID numbers
  • Account numbers tied to an individual
  • Customer/employee ID numbers (if they map directly back to a person in your systems)
  • Precise physical address

Practical scoping rule: If a field is routinely used to look someone up in a system of record, treat it as a direct identifier unless your data governance standard says otherwise.
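The scoping rule can be backed by a lightweight heuristic scan of column names and sample values. The column names and regex patterns below are illustrative starters; your Direct Identifier Standard remains the authoritative list, and a heuristic flag is a prompt for review, not a verdict.

```python
import re

# Hypothetical starter list; your Direct Identifier Standard is authoritative.
DIRECT_ID_COLUMNS = {"email", "phone", "ssn", "full_name", "account_number"}
DIRECT_ID_PATTERNS = [
    re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),   # email-shaped values
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),      # US SSN-shaped values
]

def flag_columns(columns, sample_rows):
    """Flag columns whose name matches the standard or whose sampled
    values look like identifiers. A heuristic, not a substitute for review."""
    flagged = {c for c in columns if c.lower() in DIRECT_ID_COLUMNS}
    for col in columns:
        for row in sample_rows:
            value = str(row.get(col, ""))
            if any(p.search(value) for p in DIRECT_ID_PATTERNS):
                flagged.add(col)
    return flagged
```

A column named `contact` holding email-shaped values gets flagged even though its name is not on the list, which is exactly the lookup-field case the scoping rule targets.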

Who it applies to (entity and operational context)

Entity scope: Federal information systems and contractor systems handling federal data commonly inherit NIST SP 800-53 requirements through ATO packages, agency policy, or contract clauses. 1

Operational scope (where SI-19(4) usually bites):

  • Non-production environments: dev/test/staging, QA databases, performance testing copies
  • Data platforms: warehouses, lakes, lakehouses, analytics marts
  • ML/AI workflows: training, evaluation, prompt logs, feature stores
  • Third-party transfers: SFTP extracts, API feeds, data clean rooms, managed services
  • Internal sharing: ad hoc extracts, spreadsheets, “temporary” CSVs in tickets and chat tools

If you are a CCO/GRC lead, assume the control is “live” anywhere data is copied or repurposed, not just the primary application database.

Operationalizing SI-19(4): what you actually need to do

Step 1: Appoint an accountable owner and define “direct identifiers”

Owner: Usually Data Governance, Security Engineering, or Privacy Engineering; GRC coordinates and sets evidence expectations.

Deliverable: A short, enforceable Direct Identifier Standard:

  • The list of direct identifier fields/types your org recognizes
  • Approved transformation methods and when to use each
  • Minimum logging and evidence expectations

Keep it short enough that engineering teams will follow it.

Step 2: Build an inventory of datasets and identifier fields (minimum viable)

Start with the highest-risk data flows:

  • Data extracts leaving production
  • Non-prod refresh jobs
  • Data sent to third parties
  • Analytics/BI datasets with broad access

For each dataset, record:

  • Dataset name and platform (DB, S3 bucket, warehouse schema, etc.)
  • Source system of record
  • Consumers (internal teams and third parties)
  • Columns/fields that are direct identifiers
  • Current transformation status (none, partial, enforced)

Evidence tip: Auditors respond well to a table that ties datasets to identifier handling and owners.
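A minimal sketch of that table as structured records, with field names chosen for illustration rather than as a mandated schema:

```python
from dataclasses import dataclass

@dataclass
class DatasetEntry:
    """One row of the minimum-viable inventory described above."""
    name: str
    platform: str                  # e.g. "warehouse schema", "S3 bucket"
    source_system: str
    consumers: list                # internal teams and third parties
    direct_identifier_fields: list
    transformation_status: str     # "none" | "partial" | "enforced"

inventory = [
    DatasetEntry(
        name="analytics.customers_mart",
        platform="warehouse schema",
        source_system="CRM (system of record)",
        consumers=["BI team", "vendor-x"],
        direct_identifier_fields=["email", "phone"],
        transformation_status="partial",
    ),
]

# Audit-facing view: which datasets still expose identifiers?
gaps = [d.name for d in inventory if d.transformation_status != "enforced"]
```

Even a spreadsheet works; the point is that the same fields exist for every dataset so the gap list can be produced on demand.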

Step 3: Choose the transformation method per use case (decision matrix)

Use a simple decision matrix you can defend:

| Use case | Preferred method | Why it fits | Common constraint |
| --- | --- | --- | --- |
| Analytics that doesn’t need identity | Removal or replacement | Prevents unnecessary exposure | Analysts may request identifiers “just in case” |
| Non-prod testing | Replacement (tokenization) or masking | Preserves realism without real IDs | Some tests require stable joins |
| Record linkage / dedup | Hashing (with controls) or replacement | Enables matching | Hashing can be reversible via guessing for low-entropy IDs |
| Data transfer to a third party | Removal/replacement; encryption if identifiers are required | Minimizes third-party exposure | Third party may insist on raw identifiers |
| Regulated operational need (contacting customers) | Encryption (at rest/in transit) with strict access | Maintains function | Key management and access control become critical |

GRC control point: Require a documented justification when teams choose encryption instead of removal/replacement, because encryption retains identifiers and increases key/access control burden.
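One way to make the matrix and the control point enforceable rather than advisory is to encode them; the use-case keys and method labels below are hypothetical and would mirror your own standard.

```python
# Hypothetical encoding of the decision matrix; adjust keys/labels to your standard.
DECISION_MATRIX = {
    "analytics_no_identity": "removal_or_replacement",
    "non_prod_testing": "replacement_or_masking",
    "record_linkage": "hashing_with_controls",
    "third_party_transfer": "removal_or_replacement",
    "regulated_operational": "encryption",
}

def choose_method(use_case, justification=None):
    """Return the approved method for a use case.

    GRC control point: encryption retains the identifiers, so a documented
    justification is required before it can be selected."""
    method = DECISION_MATRIX[use_case]
    if method == "encryption" and not justification:
        raise ValueError("documented justification required for encryption")
    return method
```

Teams can still get encryption, but only by supplying the justification the control point demands, which becomes part of the evidence trail.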

Step 4: Implement transformation in repeatable pipelines (not manual handling)

Teams most often fail through one-off scripts and “temporary” extracts. Push for enforcement mechanisms:

  • ETL/ELT jobs: apply transformations at ingestion into analytics zones
  • Views/access layers: expose only de-identified columns to broad audiences
  • Data egress controls: block exports containing identifier columns unless approved
  • Non-prod refresh automation: de-identify as part of refresh, not after the copy

Minimum operational requirement: The default path for dataset creation and refresh applies SI-19(4) transformations without a human remembering to do it.
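A sketch of what “transformation by default” looks like in an ingestion step, assuming a per-dataset policy that maps columns to actions (the policy, pepper, and column names are all illustrative):

```python
import hashlib
import hmac

PEPPER = b"hypothetical-pepper"  # real value belongs in a key/secrets manager

# Per-dataset policy: column -> action. Unlisted columns pass through unchanged.
POLICY = {"email": "mask", "ssn": "remove", "customer_id": "hash"}

def deidentify_row(row, policy=POLICY):
    out = {}
    for col, val in row.items():
        action = policy.get(col)
        if action == "remove":
            continue  # field dropped entirely
        if action == "mask":
            out[col] = (val[0] + "***") if val else val
        elif action == "hash":
            out[col] = hmac.new(PEPPER, str(val).encode(), hashlib.sha256).hexdigest()
        else:
            out[col] = val
    return out

def ingest(rows):
    """The default ingestion path applies the policy; no human step to forget."""
    return [deidentify_row(r) for r in rows]
```

Because `ingest` is the only path into the analytics zone, the transformation happens on every refresh without anyone remembering to run it.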

Step 5: Add approvals and exceptions (keep them rare, time-bound, and reviewable)

Some workflows legitimately require identifiers. Your exception process should include:

  • Requestor, dataset, fields, purpose
  • Approved method (or reason for temporary noncompliance)
  • Compensating controls (restricted access, monitoring, retention limits)
  • Expiration and re-approval requirement
  • Evidence of execution (who approved, when, and what changed)

Avoid open-ended exceptions. They turn SI-19(4) into a paper policy.
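A sketch of an exception record that is time-bound by construction (field names are illustrative; your ticketing or GRC system is the real system of record):

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class IdentifierExceptionRecord:
    requestor: str
    dataset: str
    fields: list
    purpose: str
    approved_by: str
    approved_on: date
    expires_on: date

    def is_active(self, today=None):
        """An exception past its expiration is simply not active."""
        return (today or date.today()) <= self.expires_on

def grant(requestor, dataset, fields, purpose, approved_by, days=90):
    """Time-bound by construction: there is no way to create an open-ended exception."""
    today = date.today()
    return IdentifierExceptionRecord(
        requestor, dataset, fields, purpose,
        approved_by, today, today + timedelta(days=days),
    )
```

Re-approval then means issuing a new record, which leaves exactly the who/when/what-changed trail listed above.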

Step 6: Monitor and test (continuous evidence)

Testing does not need to be complex:

  • Schema/column checks: identify presence of prohibited identifier columns in governed zones
  • Sampling checks: confirm masked values look masked, tokens look like tokens
  • Pipeline job logs: show de-identification step ran successfully
  • Access reviews: confirm only authorized roles can see encrypted identifiers or token vault mappings
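The first two checks above can be scripted in a few lines; the prohibited-column list is a placeholder for whatever your standard names, and the mask marker assumes the masking convention used by your pipelines.

```python
# Hypothetical prohibited list; source it from your Direct Identifier Standard.
PROHIBITED = {"email", "ssn", "phone", "full_name"}

def check_schema(dataset_name, columns):
    """Schema/column check: flag any governed-zone dataset exposing identifier columns."""
    violations = PROHIBITED & {c.lower() for c in columns}
    return {
        "dataset": dataset_name,
        "violations": sorted(violations),
        "passed": not violations,
    }

def looks_masked(value):
    """Sampling check: masked values should contain the mask marker (assumed '***')."""
    return "***" in value
```

Run the schema check on a schedule and keep its output; each run is a recurring evidence artifact, not just a test.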

If you use Daydream to manage third-party risk and control evidence, map SI-19(4) to a control owner, implementation procedure, and recurring evidence artifacts so the control stays assessment-ready across teams and vendors. 2

Required evidence and artifacts to retain

Aim for artifacts that prove design and operation:

Design evidence (what you intended):

  • Direct Identifier Standard (definition + approved methods)
  • Data classification/handling standard references (if you have them)
  • Dataset inventory with identifier mapping and owners
  • Architecture or data-flow diagrams for key pipelines (high-risk only)

Operational evidence (what actually happens):

  • ETL/ELT job configs showing transformations
  • Code snippets or repository links for de-identification modules
  • Job run logs (successful runs + failures and remediation)
  • Exception tickets and approvals with expiration
  • Third-party data sharing agreements / data specs showing identifiers removed or transformed

Assessment mapping:

  • A control narrative that explicitly maps SI-19(4) to systems, datasets, and evidence locations 2

Common exam/audit questions and hangups

Expect these lines of questioning:

  1. “Show me where direct identifiers exist and where they don’t.” Auditors want an inventory, not an interview.
  2. “How do you prevent identifiers from entering the analytics zone?” They will look for automated controls, not guidance.
  3. “Who approved the method choice for this dataset?” Governance and accountability.
  4. “How do you handle exceptions and how do they expire?” Time-bound exceptions with reviews.
  5. “Prove it’s consistent.” They will ask for job runs, samples, and monitoring outputs.

Hangup: teams say “we encrypt” but cannot show key management, access restrictions, or a clear boundary where decrypted identifiers are exposed. If you pick encryption as your main approach, be prepared to evidence the surrounding controls.

Frequent implementation mistakes (and how to avoid them)

  1. Treating hashing as anonymization by default. Hashing may still allow re-identification via guessing for predictable identifiers. Use salts/pepper and governance, or prefer tokenization for stable joins.
  2. Masking only in the UI while raw exports stay intact. Apply transformations in the dataset or access layer used for exports and downstream sharing.
  3. Relying on manual redaction for extracts. Manual steps fail under time pressure; enforce transformations in pipelines.
  4. Skipping non-prod. Test and staging environments often have weaker access controls; de-identify during refresh.
  5. No evidence trail. A working implementation without logs, tickets, and mapping artifacts still fails audits. Make evidence a deliverable, not an afterthought.
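Mistake 1 deserves a concrete illustration: a plain hash of a low-entropy identifier such as an SSN can be brute-forced offline because the input space is small, while a keyed hash (HMAC with a secret pepper, shown here with a hypothetical value) blocks offline guessing and still produces stable values for joins.

```python
import hashlib
import hmac

def plain_hash(ssn):
    """Unkeyed hash: anyone can enumerate the ~10^9 possible SSNs offline
    and compare hashes, so this is pseudonymization at best, not anonymization."""
    return hashlib.sha256(ssn.encode()).hexdigest()

PEPPER = b"hypothetical-secret-held-in-kms"  # illustrative; store in a KMS/secrets manager

def keyed_hash(ssn):
    """Keyed hash (HMAC): without the pepper, an attacker cannot test guesses,
    yet the output is still deterministic, so joins and dedup keep working."""
    return hmac.new(PEPPER, ssn.encode(), hashlib.sha256).hexdigest()
```

The pepper must be access-controlled like an encryption key; if it leaks, the keyed hash degrades back to the guessable case.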

Enforcement context and risk implications

No public enforcement cases were provided in the source catalog for SI-19(4). Your practical risk is assessment failure (ATO friction, contract noncompliance, or audit findings) plus increased breach impact if datasets used broadly contain direct identifiers. SI-19(4) reduces blast radius by limiting what an attacker or unauthorized user can learn from secondary datasets. 1

Practical 30/60/90-day execution plan

First 30 days (stabilize scope and stop the worst leaks)

  • Assign a control owner and publish the Direct Identifier Standard.
  • Identify the top data flows: production extracts, non-prod refreshes, and third-party transfers.
  • Create a minimal dataset inventory for those flows, with identifier columns flagged.
  • Implement quick wins: remove identifier columns from common exports; add masking in shared views.

By 60 days (make it repeatable)

  • Implement pipeline-based de-identification for priority datasets (ETL steps or governed views).
  • Stand up an exceptions workflow with approval + expiration.
  • Define and begin recurring evidence collection: job run logs, sampling checks, and access checks.
  • For third parties receiving data, update data specs to reflect transformed identifiers and retain supporting documentation.

By 90 days (make it auditable and scalable)

  • Expand inventory to remaining datasets and departments.
  • Add automated detection for new datasets/columns that look like identifiers.
  • Perform an internal audit-style walkthrough: pick sample datasets and demonstrate end-to-end compliance evidence.
  • Centralize control mapping and evidence requests in Daydream so audits do not turn into multi-team email threads. 2

Frequently Asked Questions

Does SI-19(4) require a specific method like encryption?

No. The text allows removal, masking, encryption, hashing, or replacement, and you choose based on the dataset’s purpose and risk. Your obligation is to implement a defensible method consistently and retain evidence. 2

Are “direct identifiers” the same as all personal data?

No. Direct identifiers identify a person on their own; other fields may be indirect identifiers and still risky in combination. Define direct identifiers explicitly in your standard, then treat indirect identifiers through broader privacy/security controls.

If we encrypt direct identifiers, are we done?

Encryption can satisfy the control, but auditors will look for key management, access restrictions, and proof that identifiers are not routinely decrypted in broad-access environments. If many users can access decrypted values, removal or replacement is usually easier to defend.

Can we hash emails or SSNs for matching across datasets?

Hashing supports matching, but it can be vulnerable to guessing for predictable inputs. Prefer replacement/tokenization for stable joins when feasible, and document the rationale and safeguards when hashing is used.

How do we handle third parties that demand raw identifiers?

Treat it as an exception that requires documented purpose, approval, and compensating controls, and limit the fields to the minimum required. Update the data sharing specification and retain evidence of the decision and controls.

What evidence is most likely to make or break an assessment?

A dataset inventory with identifier mapping, pipeline configs showing transformations, and recurring operational proof (job logs, samples, exceptions with expirations). Missing implementation evidence is a common failure mode. 2

Footnotes

  1. NIST SP 800-53 Rev. 5

  2. NIST SP 800-53 Rev. 5 OSCAL JSON

Operationalize this requirement

Map requirement text to controls, owners, evidence, and review workflows inside Daydream.

See Daydream